Thanks a ton, Rohini. I'll take a look at that.

On Tue, Feb 14, 2017 at 3:34 PM, Rohini Palaniswamy <[email protected]> wrote:
> In Pig, we implement this by doing 3 vertices. Vertex1 (Load with
> Combiner), Vertex2 (Load with Combiner) -> Vertex3 (Group by). Vertex1 and
> Vertex2 are made part of a VertexGroup (logical abstraction and not a real
> vertex), so that their output is seen as one single output by Vertex 3.
> This approach also works well if Vertex1 and Vertex2 were intermediate
> vertices and not root vertices with MRInput.
>
> https://github.com/apache/pig/blob/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Union-2.gld
> (Plan using VertexGroup and 3 vertices)
> https://github.com/apache/pig/blob/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Union-2-OPTOFF.gld
> (This is the unoptimized plan with 4 vertices which is similar to your
> current cascading plan)
>
>
> On Tue, Feb 14, 2017 at 3:20 PM, Piyush Narang <[email protected]>
> wrote:
>
>> Thanks for getting back Rohini and Siddharth. To provide some context, we
>> have two input vertices each reading lzo thrift data from a different path
>> on hdfs. We then merge
>> <http://docs.cascading.org/cascading/2.0/javadoc/cascading/pipe/Merge.html>
>> the
>> data from the two vertices and then groupBy and some aggregations one of
>> the fields. In MR, the reading from the 2 inputs and the merge happens on
>> the mappers and the group + aggregations on the reducers. In case of Tez we
>> have the merge on a different vertex and the group + aggregations on a
>> different vertex (with Cascading choosing scatter gather edges in both
>> cases). Exploring if it would be possible to combine the merge with the
>> groupBy in Cascading. I was wondering if the MultiMRInput would have been
>> an option in cases where we read from 2 or more sources and follow that up
>> with a merge. That might be an option to explore if we're not able to
>> collapse the merge and groupBy.
>>
>> On Tue, Feb 14, 2017 at 9:14 AM, Siddharth Seth <[email protected]> wrote:
>>
>>> What operations are being performed by these vertices? If there's no
>>> advantage of reading multiple sources in a single task - using separate
>>> vertices is preferable. At least for Hive, when it read multiple sources in
>>> the same vertex, it had to perform some tagging etc for the reduce side to
>>> differentiate the inputs.
>>> MultiMRInput can be used for public consumption. Like Rohini mentioned,
>>> it is used for SMB joins in Hive. IIRC, hive ends up setting this up to
>>> read multiple buckets within the same vertex/task.
>>> Also - it is possible to hook multiple MRInputs into a single vertex.
>>> That will require a custom vertex manager to figure out the parallelism,
>>> and how splits from these sources are to be combined. Hive does this for
>>> SMB joins, where it'll send a single bucket / groups of buckets from
>>> different sources to the same task. (Both sides ordered, and bucketed - so
>>> it's possible to do a merge join in this vertex).
>>>
>>>
>>> On Mon, Feb 13, 2017 at 5:37 PM, Piyush Narang <[email protected]>
>>> wrote:
>>>
>>>> hi folks,
>>>>
>>>> While debugging the DAG generated by a Scalding / Cascading job, I
>>>> noticed that in Tez we end up with two input vertices - one vertex for each
>>>> input path. In case of Hadoop on the other hand we end up with our map
>>>> phase reading from both input datasets. Is this supported in Tez? I noticed
>>>> that Cascading is currently using MRInput
>>>> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInput.java>
>>>> to
>>>> set up its Tez inputs. I wasn't sure if we could use MultiMRInput
>>>> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MultiMRInput.java>
>>>> to
>>>> read from multiple input directories in the same vertex in Tez or if it has
>>>> a different purpose. If we can use it, is it safe for public consumption?
>>>> (noticed it is still annotated with @Evolving).
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> - Piyush
>>>>
>>>
>>>
>>
>>
>> --
>> - Piyush
>>
>
>
--
- Piyush
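
For reference, here is a toy sketch of the 3-vertex plan Rohini describes: two loaders that each run a combiner, a purely logical grouping of their outputs, and one group-by stage that sees the grouped outputs as a single input. It is plain Java and does not use any Tez or Pig classes. The class and method names (UnionGroupSketch, loadWithCombiner, asSingleInput, groupBy) are hypothetical and exist only for illustration, and counting records per key is an assumed stand-in for the real aggregation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the 3-vertex plan; plain Java, no Tez classes involved. */
class UnionGroupSketch {

    /** Vertex1 / Vertex2: "load" one source and pre-aggregate it (the combiner step). */
    static Map<String, Long> loadWithCombiner(List<String> records) {
        Map<String, Long> partial = new TreeMap<>();
        for (String key : records) {
            partial.merge(key, 1L, Long::sum); // partial count per key
        }
        return partial;
    }

    /**
     * Stand-in for the vertex group: it does no processing of its own and only
     * presents the member outputs to the consumer as one input.
     */
    static List<Map.Entry<String, Long>> asSingleInput(List<Map<String, Long>> memberOutputs) {
        List<Map.Entry<String, Long>> merged = new ArrayList<>();
        for (Map<String, Long> output : memberOutputs) {
            merged.addAll(output.entrySet());
        }
        return merged;
    }

    /** Vertex3: group by key over the single merged input and finish the aggregation. */
    static Map<String, Long> groupBy(List<Map.Entry<String, Long>> input) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, Long> entry : input) {
            totals.merge(entry.getKey(), entry.getValue(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> v1 = loadWithCombiner(Arrays.asList("a", "b", "a"));
        Map<String, Long> v2 = loadWithCombiner(Arrays.asList("b", "c"));
        // prints {a=2, b=2, c=1}
        System.out.println(groupBy(asSingleInput(Arrays.asList(v1, v2))));
    }
}
```

The point of the sketch is that there is no separate merge stage doing work between the loaders and the group-by, which is the difference between the 3-vertex plan and the 4-vertex one.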
