Rajesh Balamohan created TEZ-972:
------------------------------------
Summary: Shuffle Phase - optimize memory usage of empty partition
data in DataMovementEvent
Key: TEZ-972
URL: https://issues.apache.org/jira/browse/TEZ-972
Project: Apache Tez
Issue Type: Improvement
Reporter: Rajesh Balamohan
Empty partition details are stored in a compressed byte[] and sent via
DataMovementEvent in the shuffle phase. Quick standalone tests reveal that a
BitSet would be more efficient than compressing the byte[].
PartitionSize  BitSetSize  CompressedBitSetSize  NormalByteArrayCompressed
            1           1                     9                          9
          101          13                    22                         42
          201          26                    37                         62
          301          38                    49                         76
           ..
         1001         126                   137                        197
           ..
         2001         251                   262                        374
         4001         501                   512                        686
         8001        1001                  1012                       1330
        16001        2001                  1979                       2569
        32001        4001                  3885                       5000
These numbers assume that randomly chosen bit positions represent empty
partitions.
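The comparison above could be approximated with a sketch like the following. This is not the original test code: the class and method names are hypothetical, java.util.zip.Deflater is assumed as the compressor, and the "normal" byte[] is assumed to hold one flag byte per partition. Under those assumptions the packed size is ceil(n / 8), which is consistent with the BitSetSize figures reported above; the compressed sizes will depend on the random data and the compressor settings.

```java
import java.io.ByteArrayOutputStream;
import java.util.BitSet;
import java.util.Random;
import java.util.zip.Deflater;

/** Hypothetical stand-alone size comparison; not taken from Tez. */
final class EmptyPartitionSizeSketch {
    private EmptyPartitionSizeSketch() {}

    /** Compresses input with the default Deflater settings (an assumption). */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int written = deflater.deflate(buf);
            out.write(buf, 0, written);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Packs one bit per partition into ceil(n / 8) bytes. */
    static byte[] pack(BitSet empty, int n) {
        byte[] out = new byte[(n + 7) / 8];
        for (int i = empty.nextSetBit(0); i >= 0 && i < n; i = empty.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    /** Assumed "normal" layout: one byte per partition, 1 meaning empty. */
    static byte[] flagsPerPartition(BitSet empty, int n) {
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = (byte) (empty.get(i) ? 1 : 0);
        }
        return out;
    }

    public static void main(String[] args) {
        Random random = new Random(42L); // fixed seed so runs are repeatable
        int[] partitionCounts = {1, 101, 1001, 8001};
        for (int n : partitionCounts) {
            BitSet empty = new BitSet(n);
            for (int i = 0; i < n; i++) {
                if (random.nextBoolean()) {
                    empty.set(i); // mark a random position as an empty partition
                }
            }
            byte[] packed = pack(empty, n);
            System.out.println("PartitionSize=" + n
                + " , BitSetSize=" + packed.length
                + " , CompressedBitSetSize=" + deflate(packed).length
                + " , NormalByteArrayCompressed="
                + deflate(flagsPerPartition(empty, n)).length);
        }
    }
}
```

Random bits are close to incompressible, so compressing the packed form mostly adds container overhead at small sizes, which fits the pattern in the reported figures.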
JDK 1.6's BitSet cannot be used directly, as it does not support the valueOf()
and toByteArray() methods. The suggestion is to add a Tez-specific BitSet
(until Tez moves to JDK 1.7) and to make compression a job configuration
option.
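One possible shape for such a stop-gap is a small helper that converts between BitSet and byte[] using only methods available in JDK 1.6. The sketch below is an illustration, not the Tez implementation: the class name BitSetBytes is hypothetical, and the byte layout (bit i stored in byte i / 8 at bit position i % 8) is chosen to mirror the layout documented for JDK 1.7's BitSet.toByteArray().

```java
import java.util.BitSet;

/** Hypothetical JDK 1.6-compatible helper; not part of Tez or the JDK. */
final class BitSetBytes {
    private BitSetBytes() {}

    /** Packs the set bits into bytes: bit i goes to byte i / 8, position i % 8. */
    static byte[] toByteArray(BitSet bits) {
        // length() is the index of the highest set bit plus one (0 if none).
        byte[] out = new byte[(bits.length() + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    /** Rebuilds a BitSet from bytes produced by toByteArray. */
    static BitSet valueOf(byte[] bytes) {
        BitSet bits = new BitSet(bytes.length * 8);
        for (int i = 0; i < bytes.length * 8; i++) {
            if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
                bits.set(i);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet emptyPartitions = new BitSet();
        emptyPartitions.set(0);
        emptyPartitions.set(9);
        emptyPartitions.set(100);
        byte[] payload = toByteArray(emptyPartitions);
        System.out.println("bytes=" + payload.length
            + " roundTrip=" + valueOf(payload).equals(emptyPartitions));
    }
}
```

Because trailing zero bytes are dropped, the receiver would need to know the partition count separately if it matters; a fixed-size array of ceil(n / 8) bytes is an alternative design.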