David Moravek created BEAM-6214: ----------------------------------- Summary: Spark: CombineByKey performance Key: BEAM-6214 URL: https://issues.apache.org/jira/browse/BEAM-6214 Project: Beam Issue Type: Improvement Components: runner-spark Affects Versions: 2.8.0 Reporter: David Moravek Assignee: David Moravek
Right now spark runner's implementation of combineByKey causes at least two serializations / de-serializations of each element, because combine accumulators are byte based. We can do much better by letting accumulators to work on user defined java types and only serialize accumulators when we need to send them over the network. In order to do this, we need following: * Acummulator wrapper -> contains transient `T` value + byte payload, that is filled in during serialization * JavaSerialization: accumulator (beam wrapper) needs to implement Serializable and override writeObject and readObject methods and use beam coder * KryoSerialization: we need a custom kryo serializer for accumulator wrapper This should be enough to hook into all possible spark serialization interfaces. -- This message was sent by Atlassian JIRA (v7.6.3#76005)