Hi all,

I'm migrating RDD pipelines to the Dataset API and noticed that Combine.PerKey is no longer available there. So I translated it to:
KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
    keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder());

Dataset<Tuple2<K, OutputT>> combinedDataset =
    groupedDataset.agg(new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());

I have a question about performance: groupByKey is generally less performant (it entails a full shuffle, and a possible OOM if a given key has a lot of data associated with it), so I was wondering whether the new Spark optimizer translates such a DAG into a combinePerKey behind the scenes. In other words, is this DAG going to be executed as a local (or "partial", I'm not sure which terminology you use) combine followed by a global combine, so the shuffle is avoided or reduced?

Thanks
Etienne
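For readers less familiar with the pattern I'm asking about, here is a minimal plain-Java sketch (no Spark, all names hypothetical) of the partial-then-global combine I have in mind: each partition is first folded into one small per-key accumulator map, so only the accumulators, not the raw records, would need to cross partition boundaries in a shuffle.

```java
import java.util.*;

// Sketch of combine-per-key done in two phases: a per-partition ("local"/
// "partial") combine, then a global merge of the per-partition accumulators.
public class PartialCombineSketch {

    // Partial combine: fold each record of one partition into a per-key
    // accumulator. Here the accumulator is {sum, count}, enough for a mean.
    static Map<String, long[]> partialCombine(List<Map.Entry<String, Long>> partition) {
        Map<String, long[]> accs = new HashMap<>();
        for (Map.Entry<String, Long> e : partition) {
            long[] acc = accs.computeIfAbsent(e.getKey(), k -> new long[2]);
            acc[0] += e.getValue(); // running sum
            acc[1] += 1;            // running count
        }
        return accs;
    }

    // Global combine: merge the small per-partition accumulator maps.
    static Map<String, long[]> globalMerge(List<Map<String, long[]>> partials) {
        Map<String, long[]> merged = new HashMap<>();
        for (Map<String, long[]> partial : partials) {
            partial.forEach((key, acc) -> merged.merge(key, acc,
                    (a, b) -> new long[] {a[0] + b[0], a[1] + b[1]}));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> partition1 =
                List.of(Map.entry("a", 1L), Map.entry("a", 3L), Map.entry("b", 10L));
        List<Map.Entry<String, Long>> partition2 =
                List.of(Map.entry("a", 2L), Map.entry("b", 20L));

        Map<String, long[]> merged = globalMerge(
                List.of(partialCombine(partition1), partialCombine(partition2)));

        // mean per key = sum / count
        merged.forEach((key, acc) ->
                System.out.println(key + " -> " + ((double) acc[0] / acc[1])));
    }
}
```

This is exactly the shape of a typed Aggregator's reduce (partial) and merge (global) methods, which is why I'm asking whether Catalyst exploits that structure to run a partial aggregation before the shuffle.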