Hi all,
I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no longer
available in the Dataset API. So I translated it to:
KeyValueGroupedDataset<K, KV<K, InputT>> groupedDataset =
    keyedDataset.groupByKey(KVHelpers.extractKey(),
        EncoderHelpers.genericEncoder());

Dataset<Tuple2<K, OutputT>> combinedDataset =
    groupedDataset.agg(
        new Aggregator<KV<K, InputT>, AccumT, OutputT>(combineFn).toColumn());
I have a question regarding performance: groupByKey is generally less performant
(it entails a full shuffle, and possible OOM if a given key has a lot of data
associated with it), so I was wondering whether the new Spark optimizer translates
such a DAG into a combine-per-key behind the scenes. In other words, will this DAG
be rewritten as a local (or "partial", I'm not sure what terminology you use)
combine followed by a global combine, to reduce the amount of data shuffled?
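To be clear about what I mean by local-then-global combine, here is a minimal
plain-Java sketch (no Spark, all names hypothetical): each partition is first
reduced to one accumulator per key, and only those partial results are merged
globally, which is what would cross the shuffle boundary.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartialCombineSketch {

    // Local (partial) combine: reduce one partition so that only a single
    // accumulator per key survives, instead of every raw element.
    static Map<String, Integer> localCombine(List<Map.Entry<String, Integer>> partition) {
        Map<String, Integer> acc = new HashMap<>();
        for (Map.Entry<String, Integer> e : partition) {
            acc.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return acc;
    }

    // Global combine: merge the per-partition partials. In a real engine,
    // only these small partial maps would be shuffled across the network.
    static Map<String, Integer> globalCombine(List<Map<String, Integer>> partials) {
        Map<String, Integer> out = new HashMap<>();
        for (Map<String, Integer> p : partials) {
            p.forEach((k, v) -> out.merge(k, v, Integer::sum));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two "partitions" of key/value records.
        List<Map.Entry<String, Integer>> p1 =
            List.of(Map.entry("a", 1), Map.entry("a", 2), Map.entry("b", 3));
        List<Map.Entry<String, Integer>> p2 =
            List.of(Map.entry("a", 4), Map.entry("b", 5));

        Map<String, Integer> result =
            globalCombine(List.of(localCombine(p1), localCombine(p2)));

        System.out.println(result.get("a") + " " + result.get("b")); // prints "7 8"
    }
}
```

A plain groupByKey, by contrast, would ship every raw element for a key to one
reducer before aggregating, which is where the shuffle volume and per-key OOM
risk come from.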
Thanks
Etienne