Rajesh Balamohan created TEZ-972:
------------------------------------

             Summary: Shuffle Phase - optimize memory usage of empty partition data in DataMovementEvent
                 Key: TEZ-972
                 URL: https://issues.apache.org/jira/browse/TEZ-972
             Project: Apache Tez
          Issue Type: Improvement
            Reporter: Rajesh Balamohan


Empty partition details are stored in a byte[] in compressed format and sent
via DataMovementEvent in the shuffle phase.  Quick standalone tests reveal that
a BitSet would be more efficient than compressing the byte[].

(PartitionSize = number of partitions; other columns are sizes in bytes)

PartitionSize  BitSetSize  CompressedBitSetSize  NormalByteArrayCompressed
            1           1                     9                          9
          101          13                    22                         42
          201          26                    37                         62
          301          38                    49                         76
..
         1001         126                   137                        197
..
         2001         251                   262                        374
         4001         501                   512                        686
         8001        1001                  1012                       1330
        16001        2001                  1979                       2569
        32001        4001                  3885                       5000
Note: these numbers are based on treating random bit positions as the empty
partitions.
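
For reference, a comparison of this kind could be sketched roughly as follows.
This is not the original test code: the class and method names are
hypothetical, the use of java.util.zip.Deflater for compression is an
assumption, and so is the one-byte-per-partition layout of the "normal" byte
array.  The bit packing is done by hand so the sketch also runs on JDK 1.6.

```java
import java.io.ByteArrayOutputStream;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical standalone comparison; not the test used for the numbers above.
class EmptyPartitionSizes {

    /** Deflates the input (assumed compression scheme) and returns the result. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Bit-set style layout: one bit per partition, ceil(n / 8) bytes. */
    static byte[] pack(boolean[] empty) {
        byte[] packed = new byte[(empty.length + 7) / 8];
        for (int i = 0; i < empty.length; i++) {
            if (empty[i]) {
                packed[i >> 3] |= (byte) (1 << (i & 7));
            }
        }
        return packed;
    }

    /** Assumed "normal" layout: one byte per partition, 1 = empty. */
    static byte[] onePerPartition(boolean[] empty) {
        byte[] bytes = new byte[empty.length];
        for (int i = 0; i < empty.length; i++) {
            bytes[i] = (byte) (empty[i] ? 1 : 0);
        }
        return bytes;
    }

    public static void main(String[] args) {
        Random random = new Random(42); // fixed seed so runs are repeatable
        for (int n : new int[] {1, 101, 1001, 8001, 32001}) {
            boolean[] empty = new boolean[n];
            for (int i = 0; i < n; i++) {
                empty[i] = random.nextBoolean(); // random empty partitions
            }
            byte[] packed = pack(empty);
            System.out.println("PartitionSize=" + n
                + " , BitSetSize=" + packed.length
                + " , CompressedBitSetSize=" + deflate(packed).length
                + " , NormalByteArrayCompressed="
                + deflate(onePerPartition(empty)).length);
        }
    }
}
```

The exact figures depend on the random pattern and on the compressor
settings, so the output of this sketch need not match the table.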

JDK 1.6's BitSet cannot be used directly because it does not support the
valueOf() and toByteArray() methods.  The suggestion is to add a Tez-specific
BitSet (until Tez moves to JDK 1.7) and to make the compression a job
configuration option.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
