Thanks Gopal, we did run into a lot of overhead in the comparator in one of our other jobs. Turning on OrderedSerialization in Scalding seemed to help there. When we compared Hadoop and Tez in that job, Tez's reducers were taking substantially more time (and the overhead was in the comparison methods). I tried a few runs after forcing Cascading to use the rawBytes comparator (by returning true at https://github.com/cwensel/cascading/blob/wip-3.2/cascading-hadoop/src/main/shared/cascading/tuple/hadoop/util/DeserializerComparator.java#L59), and that helped as well. From what I understand, though, we need to do this from the Scalding side, as a lot of our jobs use complex objects (e.g. thrift structs / scala case classes). If we don't have ordered serialization enabled from Scalding, I'm not sure the raw comparators will make sense (I think some of the work there was to ensure the byte representations of these objects can be compared sanely).
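To illustrate why I'm wary of raw comparators without an order-preserving encoding, here is a minimal sketch. It is not Scalding's or Cascading's actual code; the class and method names are hypothetical and it only handles a single int. It shows that a plain big-endian encoding sorts negative numbers after positive ones under unsigned byte comparison, while an encoding that flips the sign bit agrees with the object ordering:

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; none of these names come from Scalding, Cascading or Tez.
class OrderedBytesSketch {

    // Plain big-endian two's complement encoding (ByteBuffer's default byte order).
    static byte[] plainEncode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Order-preserving encoding: flipping the sign bit makes negative values
    // sort before positive ones when bytes are compared as unsigned.
    static byte[] orderedEncode(int v) {
        return ByteBuffer.allocate(4).putInt(v ^ Integer.MIN_VALUE).array();
    }

    // Lexicographic comparison of unsigned bytes, which is roughly what a
    // raw byte comparator does without deserializing the objects.
    static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        int x = -1;
        int y = 1;
        // Object ordering: x < y.
        System.out.println("object:  " + Integer.signum(Integer.compare(x, y)));
        // Plain encoding: raw comparison disagrees with the object ordering.
        System.out.println("plain:   " + Integer.signum(compareRaw(plainEncode(x), plainEncode(y))));
        // Sign-flipped encoding: raw comparison agrees with the object ordering.
        System.out.println("ordered: " + Integer.signum(compareRaw(orderedEncode(x), orderedEncode(y))));
    }
}
```

For thrift structs or case classes the same property would have to hold field by field, which is my understanding of what ordered serialization is meant to guarantee.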
Thanks for the two links, they look really useful! I was able to test out a few variants of our job with slowstart = 0.999 to see if the pipelining would explain the resource usage. It turns out that it was contributing a good deal to the resource usage. With this value set, Tez uses around 20% (container reuse=false) to 27% (container reuse=true) lower mb_millis than MR. Runtime-wise Tez is still better, taking around half the time of Hadoop.

Thanks,

On Fri, Mar 17, 2017 at 12:37 PM, Gopal Vijayaraghavan <[email protected]> wrote:

>
> > We are using OrderedSerialization in a bunch of our jobs. In this job we're not using it on both the Hadoop side and the Tez side. The datasets both jobs are reading are identical.
>
> That single comparator call was the biggest fraction of slow-down when I ran profiles with Tez.
>
> I profiled through that codepath for TEZ-2505, of course YMMV.
>
> My estimate was that a raw byte OrderedSerialization + TezRawComparator could save ~50% of the total CPU of some jobs.
>
> > Our suspicion internally was also around pipelining and speculative execution across steps which doesn't happen in Hadoop between jobs
>
> https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh
> +
> https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/CriticalPathAnalyzer.java
>
> Those help a lot in locating issues with Tez scheduling and targeting optimizations.
>
> Cheers,
> Gopal
>

--
- Piyush
