What operations are these vertices performing? If there's no advantage to
reading multiple sources in a single task, using separate vertices is
preferable. At least for Hive, when it read multiple sources in the same
vertex, it had to do extra work, such as tagging records, so that the
reduce side could differentiate the inputs.
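
To make the tagging point concrete, here is a minimal sketch in plain Java.
It is not Hive's code: TaggingSketch, Tagged, readAllSources and splitByTag
are hypothetical names made up for illustration. It only shows the idea that
a task reading several sources has to mark each record with its source so
the reduce side can separate them again.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of source tagging; these are not Hive or Tez classes.
class TaggingSketch {

    // A record plus a small integer naming the source it was read from.
    static final class Tagged {
        final int tag;
        final String key;
        final String value;

        Tagged(int tag, String key, String value) {
            this.tag = tag;
            this.key = key;
            this.value = value;
        }
    }

    // "Map side": one task reads every source and tags each {key, value} row
    // with the index of the source it came from.
    static List<Tagged> readAllSources(List<List<String[]>> sources) {
        List<Tagged> out = new ArrayList<Tagged>();
        for (int tag = 0; tag < sources.size(); tag++) {
            for (String[] row : sources.get(tag)) {
                out.add(new Tagged(tag, row[0], row[1]));
            }
        }
        return out;
    }

    // "Reduce side": for each key, use the tag to split the values back out
    // into one list per source.
    static Map<String, List<List<String>>> splitByTag(List<Tagged> records, int numSources) {
        Map<String, List<List<String>>> byKey = new TreeMap<String, List<List<String>>>();
        for (Tagged r : records) {
            List<List<String>> perSource = byKey.get(r.key);
            if (perSource == null) {
                perSource = new ArrayList<List<String>>();
                for (int i = 0; i < numSources; i++) {
                    perSource.add(new ArrayList<String>());
                }
                byKey.put(r.key, perSource);
            }
            perSource.get(r.tag).add(r.value);
        }
        return byKey;
    }
}
```

With separate vertices, each input already arrives on its own edge, so this
extra per-record bookkeeping is not needed.
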
MultiMRInput can be used for public consumption. Like Rohini mentioned, it
is used for SMB (sort-merge-bucket) joins in Hive. IIRC, Hive ends up
setting this up to read multiple buckets within the same vertex/task.
Also, it is possible to hook multiple MRInputs into a single vertex. That
will require a custom vertex manager to figure out the parallelism, and how
splits from these sources are to be combined. Hive does this for SMB joins,
where it'll send a single bucket or a group of buckets from different
sources to the same task. (Both sides are ordered and bucketed, so it's
possible to do a merge join in this vertex.)
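
A rough sketch of what that combining amounts to, in plain Java with no Tez
dependency. SmbJoinSketch, groupByBucket and mergeJoin are hypothetical
names, not Tez or Hive APIs, and the sketch assumes keys are unique on each
side; a real custom vertex manager would be working with actual input
splits and setting the vertex parallelism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; none of these names are Tez or Hive APIs.
class SmbJoinSketch {

    // Pair bucket i of the left source with bucket i of the right source so
    // that a single task receives both. Map keys are bucket ids, values are
    // split descriptions (plain strings here). Buckets missing on the right
    // are skipped, which is only correct for an inner join.
    static Map<Integer, List<String>> groupByBucket(Map<Integer, String> leftSplits,
                                                     Map<Integer, String> rightSplits) {
        Map<Integer, List<String>> tasks = new TreeMap<Integer, List<String>>();
        for (Map.Entry<Integer, String> e : leftSplits.entrySet()) {
            String right = rightSplits.get(e.getKey());
            if (right != null) {
                List<String> pair = new ArrayList<String>();
                pair.add(e.getValue());
                pair.add(right);
                tasks.put(e.getKey(), pair);
            }
        }
        return tasks;
    }

    // Inside one task: inner merge join over two key-sorted inputs.
    // Assumes each key appears at most once per side.
    static List<Integer> mergeJoin(List<Integer> left, List<Integer> right) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int c = Integer.compare(left.get(i), right.get(j));
            if (c == 0) {
                out.add(left.get(i));
                i++;
                j++;
            } else if (c < 0) {
                i++; // left key is smaller, advance left
            } else {
                j++; // right key is smaller, advance right
            }
        }
        return out;
    }
}
```

The number of entries returned by groupByBucket plays the role of the
parallelism the vertex manager has to decide on: one task per bucket pair.
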


On Mon, Feb 13, 2017 at 5:37 PM, Piyush Narang <[email protected]> wrote:

> hi folks,
>
> While debugging the DAG generated by a Scalding / Cascading job, I noticed
> that in Tez we end up with two input vertices - one vertex for each input
> path. In case of Hadoop on the other hand we end up with our map phase
> reading from both input datasets. Is this supported in Tez? I noticed that
> Cascading is currently using MRInput
> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInput.java>
> to set up its Tez inputs. I wasn't sure if we could use MultiMRInput
> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MultiMRInput.java>
> to read from multiple input directories in the same vertex in Tez or if it
> has a different purpose. If we can use it, is it safe for public
> consumption? (noticed it is still annotated with @Evolving).
>
> Thanks,
>
> --
> - Piyush
>
