> We are using OrderedSerialization in a bunch of our jobs. In this job we're
> not using it on both the Hadoop side and the Tez side. The datasets both jobs
> are reading are identical.

That single comparator call was the biggest fraction of the slowdown when I ran profiles with Tez. I profiled through that codepath for TEZ-2505; of course, YMMV.

My estimate was that a raw byte OrderedSerialization + TezRawComparator could save ~50% of the total CPU of some jobs.

> Our suspicion internally was also around pipelining and speculative execution
> across steps which doesn't happen in Hadoop between jobs

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
+
https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/CriticalPathAnalyzer.java

Those help a lot in locating issues with Tez scheduling and targeting optimizations.

Cheers,
Gopal
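
P.S. To illustrate the raw byte comparison idea mentioned above, here is a minimal sketch, not the actual Tez or Scalding code. The class and method names are hypothetical; only the argument shape (buffer, start, length for each key) mirrors Hadoop's `RawComparator.compare`. It assumes keys are serialized so that unsigned byte order matches the intended sort order, and it uses ASCII string keys for the deserializing path shown for contrast.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of Tez, Hadoop, or Scalding.
public class RawByteCompareSketch {

    // Compares two serialized keys directly in their buffers, byte by byte,
    // treating each byte as unsigned. No objects are allocated.
    static int compareRaw(byte[] b1, int s1, int l1,
                          byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        // All shared bytes are equal: the shorter key sorts first.
        return l1 - l2;
    }

    // The slow path for contrast: deserialize both keys, then compare.
    // Every call allocates two Strings, which adds up inside a sort loop.
    static int compareDeserialized(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        String x = new String(b1, s1, l1, StandardCharsets.US_ASCII);
        String y = new String(b2, s2, l2, StandardCharsets.US_ASCII);
        return x.compareTo(y);
    }

    public static void main(String[] args) {
        byte[] k1 = "apple".getBytes(StandardCharsets.US_ASCII);
        byte[] k2 = "banana".getBytes(StandardCharsets.US_ASCII);
        int raw = compareRaw(k1, 0, k1.length, k2, 0, k2.length);
        int slow = compareDeserialized(k1, 0, k1.length, k2, 0, k2.length);
        // For ASCII keys both paths should agree on the ordering.
        System.out.println(Integer.signum(raw) == Integer.signum(slow));
    }
}
```

Both paths give the same ordering for such keys; the raw version simply skips the per-comparison deserialization, which is where the profile showed the time going.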
