David Moravek created BEAM-6214:
-----------------------------------

             Summary: Spark: CombineByKey performance
                 Key: BEAM-6214
                 URL: https://issues.apache.org/jira/browse/BEAM-6214
             Project: Beam
          Issue Type: Improvement
          Components: runner-spark
    Affects Versions: 2.8.0
            Reporter: David Moravek
            Assignee: David Moravek


Right now spark runner's implementation of combineByKey causes at least two 
serializations / de-serializations of each element, because combine 
accumulators are byte based.

We can do much better by letting accumulators to work on user defined java 
types and only serialize accumulators when we need to send them over the 
network.

In order to do this, we need following:
 * Acummulator wrapper -> contains transient `T` value + byte payload, that is 
filled in during serialization
 * JavaSerialization: accumulator (beam wrapper) needs to implement 
Serializable and override writeObject and readObject methods and use beam coder
 * KryoSerialization: we need a custom kryo serializer for accumulator wrapper

This should be enough to hook into all possible spark serialization interfaces.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to