shuffle write size

Chen Song Tue, 17 Mar 2015 14:24:07 -0700

I have a map reduce job that reads from three logs and joins them on some
key column. The underlying data is protobuf messages in sequence
files. Between mappers and reducers, the underlying raw byte arrays for
protobuf messages are shuffled . Roughly, for 1G input from HDFS, there is
2G data output from map phase.


I am testing spark jobs (v1.3.0) on the same input. I found that shuffle
write is 3 - 4 times input size. I tried passing protobuf Message object
and ArrayByte but neither gives good shuffle write output.

Is there any good practice on shuffling

* protobuf messages
* raw byte array

Chen

shuffle write size

Reply via email to