With the RDD API, you could optimize shuffling by making sure that bytes are shuffled rather than objects, applying the appropriate ser/de step just before and just after the shuffle. For example: before calling parallelize, serialize each record to bytes with a dedicated serializer, parallelize the byte arrays, and deserialize immediately afterwards (this happens on the nodes). The same optimization could be applied in combineByKey, and when collecting the data back to the driver.
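To make the pattern concrete, here is a rough sketch of what I mean (MyRecord and the plain Java-serialization helpers are just illustrative placeholders; in practice a dedicated serializer such as Kryo would do the ser/de):

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.sql.SparkSession

case class MyRecord(id: Int, payload: String)

object SerDe {
  def serialize(r: MyRecord): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(r)
    oos.close()
    bos.toByteArray
  }
  def deserialize(bytes: Array[Byte]): MyRecord = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[MyRecord] finally ois.close()
  }
}

val spark = SparkSession.builder.appName("bytes-shuffle").getOrCreate()
val sc = spark.sparkContext

val records = (1 to 1000).map(i => MyRecord(i, s"payload-$i"))

// Serialize on the driver, parallelize the byte arrays, deserialize on the nodes.
val rdd = sc.parallelize(records.map(SerDe.serialize)).map(SerDe.deserialize)

// The same idea around a shuffle: ship bytes through the shuffle, rebuild objects after it.
val grouped = rdd
  .keyBy(_.id % 10)
  .mapValues(SerDe.serialize)          // objects -> bytes before the shuffle
  .groupByKey()                        // only byte arrays move across the network
  .mapValues(_.map(SerDe.deserialize)) // bytes -> objects after the shuffle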
My question: is this still relevant with the Dataset API? Datasets have a dedicated Encoder, and I guess that a binary encoder is less informative than, say, the Integer/String encoders or the general Kryo encoder for objects, and as a result will "lose" some optimization abilities. Is this correct?

Thanks, Amit