[ https://issues.apache.org/jira/browse/TEZ-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ming Ma updated TEZ-3430:
-------------------------
    Issue Type: Improvement  (was: Bug)

> Make split sorting optional
> ---------------------------
>
>                 Key: TEZ-3430
>                 URL: https://issues.apache.org/jira/browse/TEZ-3430
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>             Fix For: 0.9.0
>
>         Attachments: TEZ-3430.patch
>
>
> The fair routing design in TEZ-3209 addresses skewed partitions, where one
> partition can be much larger than the others. To simplify stats tracking,
> however, it assumes that a given partition's data is distributed roughly
> evenly across source tasks, so that it can group consecutive source tasks
> together.
> This assumption does not hold, because {{MRInputHelpers}}'s
> generateNewSplits and generateOldSplits sort the splits by size, so the
> data size at the beginning of the source task range is larger than at the
> end.
> {noformat}
> Arrays.sort(splits, new InputSplitComparator());
> {noformat}
> One way to fix this is to have fair routing track not only the aggregated
> size of each partition, but also the size of each partition for each source
> task. That would significantly increase the memory footprint, however.
> Alternatively, the sorting above can be skipped. Test results for TEZ-3209
> show that jobs can finish 30% faster when the source tasks' output sizes are
> more balanced.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
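
To illustrate the skew described in the issue, here is a minimal, self-contained sketch. It is not Tez code: the class `SplitSortSkewDemo`, its methods, and the split sizes are all hypothetical. It also assumes that each source task's output is roughly proportional to its input split size, and that the sort places the largest splits first, as the description implies. It compares the total size of consecutive groups of source tasks with and without the sort.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Tez.
class SplitSortSkewDemo {

    // Sum the sizes over `groups` consecutive, equally sized ranges of source tasks.
    // Assumes sizes.length is divisible by groups; any remainder is ignored.
    public static long[] consecutiveGroupSums(long[] sizes, int groups) {
        long[] sums = new long[groups];
        int perGroup = sizes.length / groups;
        for (int i = 0; i < perGroup * groups; i++) {
            sums[i / perGroup] += sizes[i];
        }
        return sums;
    }

    // Return a largest-first copy, mimicking the order the issue attributes to the sort.
    public static long[] sortedLargestFirst(long[] sizes) {
        long[] copy = sizes.clone();
        Arrays.sort(copy); // ascending order
        for (int i = 0, j = copy.length - 1; i < j; i++, j--) {
            long tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    public static void main(String[] args) {
        long[] sizes = {40, 90, 10, 80, 30, 70, 20, 60}; // made-up split sizes

        // Original order: the two consecutive groups are fairly close in size.
        System.out.println(Arrays.toString(consecutiveGroupSums(sizes, 2)));
        // prints [220, 180]

        // Sorted largest first: the first group holds three times the data of the second.
        System.out.println(Arrays.toString(
                consecutiveGroupSums(sortedLargestFirst(sizes), 2)));
        // prints [300, 100]
    }
}
```

With these made-up numbers, grouping consecutive source tasks after the sort gives a 300/100 split instead of 220/180, which is the kind of imbalance that breaks the even-distribution assumption.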