okumin created TEZ-4521:
---------------------------

             Summary: Partition stats should be always uncompressed size
                 Key: TEZ-4521
                 URL: https://issues.apache.org/jira/browse/TEZ-4521
             Project: Apache Tez
          Issue Type: Sub-task
    Affects Versions: 0.10.2
            Reporter: okumin
            Assignee: okumin


We always put the compressed size in 
[ExternalSorter#partitionStats|https://github.com/apache/tez/blob/rel/release-0.10.2/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/sort/impl/ExternalSorter.java#L134]
 while we put the uncompressed size in 
[UnorderedPartitionedKVWriter#sizePerPartition|https://github.com/apache/tez/blob/rel/release-0.10.2/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/writers/UnorderedPartitionedKVWriter.java#L130-L131].
 The two should have consistent semantics.

 

As far as I know, the uncompressed size is preferable for two reasons:
 # The stats are used in FairShuffleVertexManager to configure the parallelism. 
The normal ShuffleVertexManager, which is broadly used, computes parallelism 
based on the uncompressed size. If the stats stay compressed, we need to tune 
`tez.fair-shuffle-vertex-manager.desired-task-input-size` based on the compressed 
size even though `tez.shuffle-vertex-manager.desired-task-input-size` must be based 
on the uncompressed size.
 # Ming pointed out in TEZ-3206 that we should use the uncompressed size. It looks 
like we missed creating a follow-up ticket.
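
To illustrate the first point, here is a rough sketch of why mixing the two semantics makes the desired-task-input-size settings hard to tune. It is not the actual vertex manager logic: the class `PartitionStatsSketch`, the helpers `estimateTasks` and `sum`, the partition sizes, and the 4:1 compression ratio are all hypothetical and only model "total reported size divided by desired task input size".

```java
public class PartitionStatsSketch {
    // Hypothetical helper, not a Tez API: ceiling division of the total
    // reported input size by the desired per-task input size, at least 1.
    public static int estimateTasks(long totalInputBytes, long desiredTaskInputBytes) {
        if (desiredTaskInputBytes <= 0) {
            throw new IllegalArgumentException("desiredTaskInputBytes must be positive");
        }
        long tasks = (totalInputBytes + desiredTaskInputBytes - 1) / desiredTaskInputBytes;
        return (int) Math.max(1L, tasks);
    }

    // Sums the per-partition sizes reported by the upstream tasks.
    public static long sum(long[] sizes) {
        long total = 0;
        for (long size : sizes) {
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        long mib = 1L << 20;
        // Made-up per-partition sizes, reported as uncompressed bytes.
        long[] uncompressed = {400 * mib, 300 * mib, 100 * mib};
        // The same partitions under an assumed 4:1 compression ratio.
        long[] compressed = new long[uncompressed.length];
        for (int i = 0; i < uncompressed.length; i++) {
            compressed[i] = uncompressed[i] / 4;
        }
        // One desired task input size applied to both kinds of stats.
        long desired = 100 * mib;
        System.out.println("uncompressed stats: "
            + estimateTasks(sum(uncompressed), desired) + " tasks");
        System.out.println("compressed stats: "
            + estimateTasks(sum(compressed), desired) + " tasks");
    }
}
```

With the same desired size, this toy model picks 8 tasks from the uncompressed stats and 2 from the compressed ones, so a value tuned for one vertex manager would not carry over to the other.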



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
