Hi everyone, I was using JavaPairRDD's combineByKey() to compute all of my aggregations before, since I assumed that every aggregation required a key. However, I realized I could do my analysis using JavaRDD's aggregate() instead and not use a key.
I have set spark.serializer to use Kryo. As a result, JavaPairRDD's combineByKey() requires that a "createCombiner" function is provided, and the return value from that function must be serializable using Kryo. When I switched to rdd.aggregate(), I assumed that the zero value would also be serialized with Kryo, since it is an initial data item and not part of a closure or one of the aggregation functions. However, I got a serialization exception because the closure serializer (for which the only valid serializer is the Java serializer) was used instead.

I was wondering the following:

1. What is the rationale for serializing the zero value with the closure serializer? It isn't part of the closure; it is an initial data item.

2. Would it make sense for us to perhaps write a version of rdd.aggregate() that takes, as a parameter, a function that generates the zero value? It would be more intuitive for that function to be serialized using the closure serializer.

I believe aggregateByKey() is also affected.

Thanks,

-Matt Cheah
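P.S. To make point 2 concrete, here is a rough plain-Java sketch of the shape I have in mind. This is not Spark's actual API; aggregateWithZeroFactory is a hypothetical name, and a List of Lists stands in for an RDD's partitions. The idea is just that a zero-value *factory* is a function, so shipping it with the closure serializer is natural, and each partition builds its own zero value locally instead of deserializing one.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Supplier;

public class AggregateSketch {

    // Hypothetical stand-in for the proposed rdd.aggregate() variant: the
    // caller supplies a zero-value factory rather than a pre-built zero value.
    static <T, U> U aggregateWithZeroFactory(
            List<List<T>> partitions,        // stand-in for an RDD's partitions
            Supplier<U> zeroFactory,         // generates a fresh zero value
            BiFunction<U, T, U> seqOp,       // folds elements within a partition
            BiFunction<U, U, U> combOp) {    // merges per-partition results
        U result = zeroFactory.get();        // zero for the final merge
        for (List<T> partition : partitions) {
            U acc = zeroFactory.get();       // fresh zero per partition, built locally
            for (T item : partition) {
                acc = seqOp.apply(acc, item);
            }
            result = combOp.apply(result, acc);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sum 1..6 across two "partitions".
        List<List<Integer>> parts = List.of(List.of(1, 2, 3), List.of(4, 5, 6));
        int sum = aggregateWithZeroFactory(parts, () -> 0, Integer::sum, Integer::sum);
        System.out.println(sum); // prints 21
    }
}
```

Since only the Supplier crosses the wire, the zero value's class would no longer need to be Java-serializable at all.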