I have a MapReduce job that reads from three logs and joins them on a key
column. The underlying data is protobuf messages in sequence files. Between
mappers and reducers, the raw byte arrays of the protobuf messages are
shuffled. Roughly, 1G of input from HDFS produces about 2G of map output.
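
A rough sketch of what each shuffled record looks like, assuming the join key
is one of the message fields (the protobuf class and the key accessor below
are placeholders):

    import org.apache.hadoop.io.BytesWritable
    import my.logs.LogEvent // placeholder for the generated protobuf class

    // Each sequence-file value holds one serialized message; parse it only to
    // pull out the join key and keep the raw bytes as the shuffled value.
    def toKeyedBytes(value: BytesWritable): (String, Array[Byte]) = {
      val bytes = java.util.Arrays.copyOf(value.getBytes, value.getLength) // Writable buffers are reused
      (LogEvent.parseFrom(bytes).getJoinKey, bytes) // getJoinKey is a placeholder accessor
    }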

I am testing Spark jobs (v1.3.0) on the same input and found that the shuffle
write is 3-4 times the input size. I tried passing both protobuf Message
objects and Array[Byte], but neither brings the shuffle write size down.
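
Roughly, the Spark version looks like the sketch below (paths, the protobuf
class, and the key accessor are placeholders; the commented line is the
Message-object variant I also tried):

    import org.apache.hadoop.io.BytesWritable
    import org.apache.spark.{SparkConf, SparkContext}
    import my.logs.LogEvent // placeholder for the generated protobuf class

    val sc = new SparkContext(new SparkConf().setAppName("log-join"))

    // Load one log as a pair RDD keyed by the join column, with the raw
    // serialized message bytes as the value.
    def load(path: String) =
      sc.sequenceFile(path, classOf[BytesWritable], classOf[BytesWritable]).map { case (_, v) =>
        val bytes = java.util.Arrays.copyOf(v.getBytes, v.getLength)
        (LogEvent.parseFrom(bytes).getJoinKey, bytes)
        // (LogEvent.parseFrom(bytes).getJoinKey, LogEvent.parseFrom(bytes)) // Message-object variant
      }

    // Three-way join; this is the shuffle whose write comes out 3-4x the input.
    val joined = load("hdfs:///logs/a").join(load("hdfs:///logs/b")).join(load("hdfs:///logs/c"))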

Is there any good practice for shuffling

* protobuf messages
* raw byte array

Chen
