Hi all,

I'm migrating RDD pipelines to the Dataset API and I noticed that Combine.PerKey is no longer
available there. So I translated it to:


KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
    keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());

Dataset<Tuple2<K, OutputT>> combinedDataset =
    groupedDataset.agg(
        new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());

I have a question regarding performance: since groupByKey is generally less
performant (it entails a shuffle and a possible OOM if a given key has a lot of
data associated with it), I was wondering whether the new Spark optimizer
translates such a DAG into a combinePerKey behind the scenes. In other words,
will such a DAG be translated into a local (or partial, I don't know which
terminology you use) combine followed by a global combine, to reduce the amount
of data shuffled?
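To make the question concrete, here is a minimal, Spark-independent sketch of what I mean by "local combine then global combine" (class and method names are mine, purely illustrative): each partition pre-aggregates its own records per key, and only the per-key partial results are merged globally, so far less data crosses the shuffle boundary.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only, not Spark internals: "partial then global" combine.
public class PartialCombineSketch {

    // Local (map-side) combine: sum values per key within a single partition,
    // so each key contributes at most one record to the shuffle.
    static Map<String, Integer> localCombine(List<Map.Entry<String, Integer>> partition) {
        Map<String, Integer> acc = new HashMap<>();
        for (Map.Entry<String, Integer> e : partition) {
            acc.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return acc;
    }

    // Global (reduce-side) combine: merge the per-partition partial results.
    static Map<String, Integer> globalCombine(List<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((k, v) -> result.merge(k, v, Integer::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two simulated partitions of (key, value) records.
        List<Map.Entry<String, Integer>> p1 =
            List.of(Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 3));
        List<Map.Entry<String, Integer>> p2 =
            List.of(Map.entry("a", 4), Map.entry("b", 5));

        Map<String, Integer> combined =
            globalCombine(List.of(localCombine(p1), localCombine(p2)));

        System.out.println(combined.get("a") + " " + combined.get("b")); // prints "7 8"
    }
}
```

With a plain groupByKey, all five records above would be shuffled; with the partial combine, only the four per-partition partials are. My question is whether Catalyst gives the Aggregator-based translation this behavior automatically.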

Thanks

Etienne
