Ming Ma created TEZ-3430:
----------------------------
Summary: Make split sorting optional
Key: TEZ-3430
URL: https://issues.apache.org/jira/browse/TEZ-3430
Project: Apache Tez
Issue Type: Bug
Reporter: Ming Ma
The fair routing design in TEZ-3209 addresses skewed partitions, where one
partition can be much larger than the others. To simplify stats tracking,
however, it assumes that a given partition's data is distributed roughly evenly
across source tasks, so that it can group consecutive source tasks together.
However, this assumption does not hold, because {{MRInputHelpers}}'s
{{generateNewSplits}} and {{generateOldSplits}} sort the splits by size. As a
result, the data size at the beginning of the source task range is larger than
at the end.
{noformat}
Arrays.sort(splits, new InputSplitComparator());
{noformat}
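To illustrate why the sort breaks the even-distribution assumption, here is a minimal standalone sketch. It is not Tez code: the class {{SplitSkewDemo}}, its helper methods, and the split sizes are hypothetical, and it assumes each source task's output is roughly proportional to its split size and that splits are ordered largest first.

```java
import java.util.Arrays;

public class SplitSkewDemo {
    // Sums each run of groupSize consecutive entries. This models grouping
    // consecutive source tasks together (hypothetical helper, not a Tez API).
    static long[] groupSums(long[] sizes, int groupSize) {
        int groups = (sizes.length + groupSize - 1) / groupSize;
        long[] sums = new long[groups];
        for (int i = 0; i < sizes.length; i++) {
            sums[i / groupSize] += sizes[i];
        }
        return sums;
    }

    // Returns a copy ordered largest first (assumed sort order).
    static long[] sortedDescending(long[] sizes) {
        long[] copy = sizes.clone();
        Arrays.sort(copy); // ascending order
        for (int i = 0, j = copy.length - 1; i < j; i++, j--) {
            long tmp = copy[i];
            copy[i] = copy[j];
            copy[j] = tmp;
        }
        return copy;
    }

    public static void main(String[] args) {
        // Made-up split sizes in their original, unsorted order.
        long[] sizes = {128, 16, 128, 16, 128, 16, 128, 16};
        // Unsorted: both groups of four consecutive tasks carry the same load.
        System.out.println(Arrays.toString(groupSums(sizes, 4)));
        // Sorted by size: the first group carries far more than the second.
        System.out.println(Arrays.toString(groupSums(sortedDescending(sizes), 4)));
    }
}
```

With these made-up sizes, the unsorted order gives two groups of 288 each, while the sorted order gives 512 and 64.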
One way to fix this is to have fair routing track not only the aggregated size
of each partition, but also the size of each partition of each source task. But
that will significantly increase the memory footprint.
Alternatively, the sorting above can be skipped. Test results for TEZ-3209 show
that jobs can finish 30% faster, given that the source tasks' output sizes are
more balanced.
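One possible shape for making the sort optional is sketched below. This is only an illustration under stated assumptions: the {{Split}} class, the {{arrangeSplits}} method, and the {{sortSplits}} flag are hypothetical names, not existing Tez APIs or configuration keys, and the largest-first order is assumed.

```java
import java.util.Arrays;
import java.util.Comparator;

public class OptionalSplitSort {
    // Hypothetical stand-in for an input split; only the length matters here.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns the splits in the order they would be assigned to source tasks.
    // When sortSplits is false, the original order is preserved.
    static Split[] arrangeSplits(Split[] splits, boolean sortSplits) {
        Split[] result = splits.clone();
        if (sortSplits) {
            // Largest first (assumed order for this sketch).
            Arrays.sort(result,
                Comparator.comparingLong((Split s) -> s.length).reversed());
        }
        return result;
    }

    public static void main(String[] args) {
        Split[] splits = {
            new Split("a", 16), new Split("b", 128), new Split("c", 64)
        };
        for (Split s : arrangeSplits(splits, false)) {
            System.out.println(s.name + " " + s.length);
        }
    }
}
```

Defaulting such a flag to the current sorting behavior would keep existing jobs unchanged, while jobs that use fair routing could turn sorting off.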