we just converted a job from RDD to Dataset. the job does a single map-reduce
phase using aggregators. we are seeing very bad performance for the Dataset
version, about 10x slower.

in the Dataset version we use kryo encoders for some of the aggregators.
based on some basic profiling of spark in local mode i believe the bad
performance is due to the kryo encoders. about 70% of time is spent in
kryo-related classes.
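
for reference, a minimal sketch of the pattern described above: an aggregator whose intermediate buffer uses a kryo encoder. this is not the actual job code; `Stats` and `MeanAgg` are made-up names for illustration, and the real aggregators are more involved.

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// hypothetical buffer type, stands in for whatever the real aggregators accumulate
case class Stats(count: Long, sum: Double)

// hypothetical aggregator computing a mean over Double inputs
object MeanAgg extends Aggregator[Double, Stats, Double] {
  def zero: Stats = Stats(0L, 0.0)

  // fold one input value into the buffer
  def reduce(b: Stats, a: Double): Stats = Stats(b.count + 1, b.sum + a)

  // combine two partial buffers
  def merge(b1: Stats, b2: Stats): Stats =
    Stats(b1.count + b2.count, b1.sum + b2.sum)

  def finish(r: Stats): Double = if (r.count == 0) 0.0 else r.sum / r.count

  // the buffer is stored as an opaque kryo-serialized binary blob
  def bufferEncoder: Encoder[Stats] = Encoders.kryo[Stats]

  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
```

it would be applied as `ds.select(MeanAgg.toColumn)` on a `Dataset[Double]`. swapping `Encoders.kryo[Stats]` for `Encoders.product[Stats]` in a case like this would be one way to isolate the encoder cost, though whether it changes the timings is an open question and may depend on the spark version.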

since we also use kryo for serialization with the RDD i am surprised how
big the performance difference is.

has anyone seen the same thing? any suggestions for how to improve this?
