Reading each input in its own vertex gives better control and allows each input to be tuned differently. That is why MRInput is generally used. MultiMRInput <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MultiMRInput.java> was added for Hive SMB joins and is used in the Hive code: https://www.codatlas.com/search?q=MultiMRInput&projid=github.com%2Fapache%2Fhive&searchType=code . So it should be safe for public consumption.

On Mon, Feb 13, 2017 at 5:37 PM, Piyush Narang <[email protected]> wrote:

> hi folks,
>
> While debugging the DAG generated by a Scalding / Cascading job, I noticed
> that in Tez we end up with two input vertices - one vertex for each input
> path. In case of Hadoop on the other hand we end up with our map phase
> reading from both input datasets. Is this supported in Tez? I noticed that
> Cascading is currently using MRInput
> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MRInput.java>
> to set up its Tez inputs. I wasn't sure if we could use MultiMRInput
> <https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/input/MultiMRInput.java>
> to read from multiple input directories in the same vertex in Tez or if it has
> a different purpose. If we can use it, is it safe for public consumption?
> (noticed it is still annotated with @Evolving).
>
> Thanks,
>
> --
> - Piyush
