> We are using OrderedSerialization in a bunch of our jobs. In this job we're 
> not using it on both the Hadoop side and the Tez side. The datasets both jobs 
> are reading are identical. 

That single comparator call accounted for the largest share of the slowdown 
when I ran profiles with Tez.

I profiled that codepath for TEZ-2505; of course, YMMV.

My estimate was that a raw byte OrderedSerialization + TezRawComparator could 
save ~50% of the total CPU of some jobs.
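
To illustrate the idea (a rough sketch only, not the Scalding or Tez API): if 
the serialized form of a key sorts the same way as the key itself, the sorter 
can compare bytes in place and skip deserialization. The class name, the 
encode/decode helpers and the long-typed key below are all made up for the 
example; only the compare(b1, s1, l1, b2, s2, l2) shape is borrowed from 
Hadoop-style raw comparators.

```java
import java.nio.ByteBuffer;

// Hypothetical, self-contained sketch; none of these names exist in Tez or Scalding.
public class RawCompareSketch {

    // Encode a long so that unsigned, big-endian byte order matches numeric order.
    // Flipping the sign bit places negative values below positive ones.
    static byte[] encode(long v) {
        return ByteBuffer.allocate(8).putLong(v ^ Long.MIN_VALUE).array();
    }

    // Inverse of encode(): read 8 bytes at the given offset and restore the sign bit.
    static long decode(byte[] b, int off) {
        return ByteBuffer.wrap(b, off, 8).getLong() ^ Long.MIN_VALUE;
    }

    // Slow path: deserialize both keys, then compare the decoded values.
    static int compareDeserialized(byte[] b1, int s1, byte[] b2, int s2) {
        return Long.compare(decode(b1, s1), decode(b2, s2));
    }

    // Fast path: unsigned lexicographic comparison directly on the serialized bytes.
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        return Integer.compare(l1, l2);
    }

    public static void main(String[] args) {
        long[] samples = {Long.MIN_VALUE, -42L, -1L, 0L, 1L, 7L, Long.MAX_VALUE};
        for (long a : samples) {
            for (long b : samples) {
                byte[] ea = encode(a);
                byte[] eb = encode(b);
                int raw = Integer.signum(compareRaw(ea, 0, 8, eb, 0, 8));
                int obj = Integer.signum(compareDeserialized(ea, 0, eb, 0));
                if (raw != obj) {
                    throw new AssertionError(a + " vs " + b);
                }
            }
        }
        System.out.println("raw and deserialized comparisons agree");
    }
}
```

The sketch only shows that the two orderings agree; how much CPU the raw path 
actually saves depends on the key types and the job.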

> Our suspicion internally was also around pipelining and speculative execution 
> across steps which doesn't happen in Hadoop between jobs

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
+
https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/CriticalPathAnalyzer.java

Those help a lot in locating issues with Tez scheduling and targeting 
optimizations.

Cheers,
Gopal


