[ 
https://issues.apache.org/jira/browse/TEZ-3459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manuel Godbert updated TEZ-3459:
--------------------------------
    Description: 
After applying the patch delivered in TEZ-3330, I enriched the MapredColorCount 
example to reproduce some of the other issues I encountered on the jobs I wish 
to see running with Tez.

I am attaching a jar to the JIRA, including source code, and a script file 
detailing the observed results in comments.

It addresses 4 issues:
- The embedded jars in /lib are ignored by Tez, but YARN uses them without 
additional configuration
- The use of a combiner causes a NullPointerException
- The counters incremented in the Reporter objects stay at 0
- The additional output configured is missing in the final job output folder. 
It seems that we actually have 2 issues at task commit time:
-> map tasks are not committed in a map+reduce job, but in our example we 
generate outputs in the map phase using MultipleOutputs
-> the temporary task folder used for files coming from MultipleOutputs is 
not always the same as the one used for the main output files (more difficult 
to illustrate with a simple example). This causes issues at task commit time.
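
To make the first of these two points concrete, here is a toy model in plain 
Java. It is not Hadoop or Tez code: the class, the method names, the attempt 
ids and the file names are all hypothetical, and the "_temporary" folder 
layout is only assumed for illustration. It sketches why a file written by a 
map attempt would be missing from the final folder if only the reduce attempt 
is committed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy model of task-attempt output promotion (assumption, not Hadoop/Tez code). */
public class TaskCommitSketch {

    /** Each attempt writes into its own private folder under the job output. */
    static Path attemptDir(Path jobOutput, String attemptId) {
        return jobOutput.resolve("_temporary").resolve(attemptId);
    }

    static void writeInAttempt(Path jobOutput, String attemptId,
                               String fileName, String content) throws IOException {
        Path dir = attemptDir(jobOutput, attemptId);
        Files.createDirectories(dir);
        Files.write(dir.resolve(fileName), content.getBytes(StandardCharsets.UTF_8));
    }

    /** Committing an attempt moves its files into the final output folder. */
    static void commitTask(Path jobOutput, String attemptId) throws IOException {
        Path dir = attemptDir(jobOutput, attemptId);
        if (!Files.isDirectory(dir)) {
            return;
        }
        List<Path> files;
        try (Stream<Path> s = Files.list(dir)) {
            files = s.collect(Collectors.toList());
        }
        for (Path f : files) {
            Files.move(f, jobOutput.resolve(f.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }

    /** Job cleanup deletes whatever is still under the temporary folder. */
    static void cleanupJob(Path jobOutput) throws IOException {
        Path tmp = jobOutput.resolve("_temporary");
        if (!Files.exists(tmp)) {
            return;
        }
        List<Path> toDelete;
        try (Stream<Path> walk = Files.walk(tmp)) {
            // Reverse order so that children are deleted before their parents.
            toDelete = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
    }

    /** Simulates a map+reduce job in which only the reduce attempt is committed. */
    static List<String> run() throws IOException {
        Path out = Files.createTempDirectory("job-output");
        writeInAttempt(out, "attempt_m_000000_0", "extra-m-00000", "side output from map");
        writeInAttempt(out, "attempt_r_000000_0", "part-00000", "main output from reduce");
        commitTask(out, "attempt_r_000000_0"); // the map attempt is never committed
        cleanupJob(out);
        try (Stream<Path> s = Files.list(out)) {
            return s.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(run()); // only the reduce output remains
    }
}
```

In this model, adding a commitTask call for the map attempt before cleanupJob 
is enough for the extra file to reach the final folder.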

For information, once the above issues are worked around, we observe a 
performance gain of about 10% using Tez in our use cases with production data 
volumes, which is really great!

I am using HDP2.4

  was:
After applying the patch delivered in TEZ-3330, I enriched the MapredColorCount 
example to reproduce some of the other issues I encountered on the jobs I wish 
to see running with Tez.

I am attaching a jar to the JIRA, including source code, and a script file 
detailing the observed results in comments.

It addresses 4 issues:
- The embedded jars in /lib are ignored by Tez, but YARN uses them without 
additional configuration
- The use of a combiner causes a NullPointerException
- The counters incremented in the Reporter objects stay at 0
- The additional output configured is missing in the final job output folder. 
It seems the problem occurs at task commit time, as the new output file is not 
in the same folder as the main output file.

I am using HDP2.4


> Issues running M/R jobs with Tez
> --------------------------------
>
>                 Key: TEZ-3459
>                 URL: https://issues.apache.org/jira/browse/TEZ-3459
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Manuel Godbert
>         Attachments: colorCount.sh, colorCount.sh, mr-example.jar, 
> mr-example.jar
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)