Re: Reading from multiple paths in a single Tez node

Piyush Narang Tue, 14 Feb 2017 15:22:03 -0800

Thanks for getting back to me, Rohini and Siddharth. To provide some context,
we have two input vertices, each reading LZO Thrift data from a different path
on HDFS. We then merge
<http://docs.cascading.org/cascading/2.0/javadoc/cascading/pipe/Merge.html> the
data from the two vertices, and then do a groupBy and some aggregations on one
of the fields. In MR, the reads from the 2 inputs and the merge happen on
the mappers, and the group + aggregations happen on the reducers. In Tez, the
merge runs on one vertex and the group + aggregations run on a separate
vertex (with Cascading choosing scatter-gather edges in both cases). We're
exploring whether it would be possible to combine the merge with the
groupBy in Cascading. I was wondering whether MultiMRInput would be
an option in cases where we read from 2 or more sources and follow that up
with a merge. That might be an option to explore if we're not able to
collapse the merge and groupBy.
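
To make the shape of the job concrete, here is a minimal sketch of the
dataflow in plain Java. It is an illustration only: it does not use the Tez
or Cascading APIs, and the names (MergeGroupSketch, Rec, mergeAndSum) and
the sum aggregation are assumptions made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeGroupSketch {
    // Hypothetical record: a grouping field plus a numeric field to aggregate.
    public static final class Rec {
        final String key;
        final long value;

        public Rec(String key, long value) {
            this.key = key;
            this.value = value;
        }
    }

    // Merge: concatenate the records of both inputs without tagging them,
    // then group by key and sum the values (a stand-in aggregation).
    public static Map<String, Long> mergeAndSum(List<Rec> pathA, List<Rec> pathB) {
        return Stream.concat(pathA.stream(), pathB.stream())
                .collect(Collectors.groupingBy(
                        r -> r.key,
                        TreeMap::new,
                        Collectors.summingLong(r -> r.value)));
    }

    public static void main(String[] args) {
        // Two in-memory lists stand in for the two input paths.
        List<Rec> pathA = List.of(new Rec("a", 1), new Rec("b", 2));
        List<Rec> pathB = List.of(new Rec("a", 3), new Rec("c", 4));
        System.out.println(mergeAndSum(pathA, pathB));
    }
}
```

Since the merge neither reorders nor tags records, the grouping step does
not need to know which path a record came from, which is why collapsing the
merge into the groupBy looks plausible for this job.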


On Tue, Feb 14, 2017 at 9:14 AM, Siddharth Seth <[email protected]> wrote:

> What operations are being performed by these vertices? If there's no
> advantage of reading multiple sources in a single task - using separate
> vertices is preferable. At least for Hive, when it read multiple sources in
> the same vertex, it had to perform some tagging etc for the reduce side to
> differentiate the inputs.
> MultiMRInput can be used for public consumption. Like Rohini mentioned, it
> is used for SMB joins in Hive. IIRC, hive ends up setting this up to read
> multiple buckets within the same vertex/task.
> Also - it is possible to hook multiple MRInputs into a single vertex. That
> will require a custom vertex manager to figure out the parallelism, and how
> splits from these sources are to be combined. Hive does this for SMB joins,
> where it'll send a single bucket / groups of buckets from different sources
> to the same task. (Both sides ordered, and bucketed - so it's possible to
> do a merge join in this vertex).
>
>
> On Mon, Feb 13, 2017 at 5:37 PM, Piyush Narang <[email protected]>
> wrote:
>
>> hi folks,
>>
>> While debugging the DAG generated by a Scalding / Cascading job, I
>> noticed that in Tez we end up with two input vertices - one vertex for each
>> input path. In case of Hadoop on the other hand we end up with our map
>> phase reading from both input datasets. Is this supported in Tez? I noticed
>> that Cascading is currently using MRInput
>> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInput.java>
>>  to
>> set up its Tez inputs. I wasn't sure if we could use MultiMRInput
>> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MultiMRInput.java>
>>  to
>> read from multiple input directories in the same vertex in Tez or if it has
>> a different purpose. If we can use it, is it safe for public consumption?
>> (noticed it is still annotated with @Evolving).
>>
>> Thanks,
>>
>> --
>> - Piyush
>>
>
>


-- 
- Piyush
