With the RDD API, you could optimize shuffles by making sure that bytes
are shuffled instead of objects, applying the appropriate ser/de mechanism
just before and after the shuffle. For example:

Before parallelize, transform the objects to bytes using a dedicated
serializer, parallelize, and deserialize immediately afterwards (this
happens on the worker nodes). The same optimization could be applied in
combineByKey, and when collecting the data to the driver.
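
For concreteness, a rough sketch of what I mean (MyRecord and the
serialize/deserialize helpers are made-up names for illustration, not
Spark API):

import java.io._
import org.apache.spark.sql.SparkSession

case class MyRecord(id: Int, payload: String)

// Made-up ser/de helpers using plain Java serialization;
// a Kryo-based pair could be plugged in the same way.
def serialize(r: MyRecord): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(r)
  oos.close()
  bos.toByteArray
}

def deserialize(bytes: Array[Byte]): MyRecord =
  new ObjectInputStream(new ByteArrayInputStream(bytes))
    .readObject().asInstanceOf[MyRecord]

val spark = SparkSession.builder().appName("bytes-shuffle").getOrCreate()
val sc = spark.sparkContext

// Serialize on the driver, ship plain byte arrays across the wire,
// deserialize on the worker nodes.
val rdd = sc.parallelize(Seq(MyRecord(1, "a"), MyRecord(2, "b")).map(serialize))
            .map(deserialize)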

My question: is this still relevant with the Dataset API? Datasets have a
dedicated Encoder, and I guess that the binary encoder is less informative
than, say, the Integer/String encoders or the general Kryo encoder for
objects, and as a result will "lose" some optimization abilities.
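
Continuing the sketch above, this is roughly the comparison I have in mind
(Encoders.kryo and Encoders.BINARY are real factory methods on
org.apache.spark.sql.Encoders; MyRecord and serialize are my hypothetical
names from before):

import org.apache.spark.sql.Encoders
import spark.implicits._

// Case-class encoder: fields become columns in the Tungsten binary
// format, so Catalyst can filter/prune without deserializing objects.
val typed = Seq(MyRecord(1, "a"), MyRecord(2, "b")).toDS()
typed.filter($"id" > 1)

// Kryo encoder: each object is stored as one opaque binary column;
// shuffles stay compact, but per-field optimizations are lost.
val viaKryo = spark.createDataset(
  Seq(MyRecord(1, "a"), MyRecord(2, "b")))(Encoders.kryo[MyRecord])

// Pre-serialized bytes with the plain binary encoder: even more opaque,
// since Spark no longer knows what the bytes represent.
val viaBytes = spark.createDataset(
  Seq(MyRecord(1, "a"), MyRecord(2, "b")).map(serialize))(Encoders.BINARY)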

Is this correct?

Thanks,
Amit
