okumin created TEZ-4521:
---------------------------
Summary: Partition stats should be always uncompressed size
Key: TEZ-4521
URL: https://issues.apache.org/jira/browse/TEZ-4521
Project: Apache Tez
Issue Type: Sub-task
Affects Versions: 0.10.2
Reporter: okumin
Assignee: okumin
We always put the compressed size in
[ExternalSorter#partitionStats|https://github.com/apache/tez/blob/rel/release-0.10.2/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/ExternalSorter.java#L134]
whereas we put the uncompressed size in
[UnorderedPartitionedKVWriter#sizePerPartition|https://github.com/apache/tez/blob/rel/release-0.10.2/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L130-L131].
The two should have consistent semantics.
As far as I know, the uncompressed size is preferable for the following reasons.
# The stats are used in FairShuffleVertexManager to configure the parallelism.
The normal ShuffleVertexManager, which is broadly used, computes parallelism
based on uncompressed size. If the stats stay compressed, we need to tune
`tez.fair-shuffle-vertex-manager.desired-task-input-size` based on compressed
size, even though `tez.shuffle-vertex-manager.desired-task-input-size` must be
based on uncompressed size.
# Ming pointed out in TEZ-3206 that we should use the uncompressed size. It
looks like we missed creating a follow-up ticket.
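
To illustrate the first point, here is a minimal sketch of how the same desired-task-input-size yields different parallelism depending on whether the stats are compressed or uncompressed. This is not the actual vertex manager logic: the class `PartitionStatsSketch`, the helper `estimateTasks`, the partition sizes, and the 4:1 compression ratio are all hypothetical and chosen only for illustration.

```java
public class PartitionStatsSketch {

    // Hypothetical helper, not a Tez API: the number of downstream tasks
    // needed so that each task reads at most desiredTaskInputSize bytes.
    // Uses ceiling division and always returns at least one task.
    static long estimateTasks(long[] partitionSizes, long desiredTaskInputSize) {
        long total = 0;
        for (long size : partitionSizes) {
            total += size;
        }
        long tasks = (total + desiredTaskInputSize - 1) / desiredTaskInputSize;
        return Math.max(1, tasks);
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;

        // Assumed uncompressed output size per partition.
        long[] uncompressed = {400 * mb, 800 * mb, 1200 * mb, 1600 * mb};

        // Assumed 4:1 compression ratio, for illustration only.
        long[] compressed = new long[uncompressed.length];
        for (int i = 0; i < uncompressed.length; i++) {
            compressed[i] = uncompressed[i] / 4;
        }

        // The same desired size is applied to both kinds of stats.
        long desired = 100 * mb;

        System.out.println("uncompressed stats: " + estimateTasks(uncompressed, desired) + " tasks");
        System.out.println("compressed stats:   " + estimateTasks(compressed, desired) + " tasks");
    }
}
```

Under these assumptions, the same setting gives 40 tasks with uncompressed stats and 10 tasks with compressed stats, so a value tuned for one vertex manager would not carry over to the other.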
--
This message was sent by Atlassian Jira
(v8.20.10#820010)