i started printing out when kryo serializes my buffer data structure for my
aggregator.

i would expect every buffer object to ideally get serialized only once: at
the end of the map side, before the shuffle (so after all the values for the
given key within the partition have been reduced into it). i realize that
in reality, due to the order in which the elements come in, this cannot
always be achieved. but what i see instead is that the buffer is getting
serialized after every single call that reduces a value into it. could this
be the reason it is so slow?
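
for reference, here is a minimal sketch of the kind of setup i am describing. it is not our actual job: the buffer class SeenBuffer, the aggregator DistinctCount and the toy data are all made up for illustration, and logging from KryoSerializable.write is just one way to see when the buffer gets serialized (i am assuming spark's kryo instance picks up KryoSerializable, which is kryo's default behavior).

```scala
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// hypothetical buffer: the set of values seen so far for a key.
// implementing KryoSerializable lets us log every time kryo writes it.
class SeenBuffer(var values: Set[Int]) extends KryoSerializable {
  def this() = this(Set.empty[Int]) // no-arg constructor for kryo

  override def write(kryo: Kryo, output: Output): Unit = {
    println(s"kryo serializing buffer with ${values.size} values")
    output.writeInt(values.size)
    values.foreach(v => output.writeInt(v))
  }

  override def read(kryo: Kryo, input: Input): Unit = {
    val n = input.readInt()
    values = Iterator.fill(n)(input.readInt()).toSet
  }
}

// hypothetical aggregator: counts distinct values per key,
// using a kryo encoder for the buffer.
object DistinctCount extends Aggregator[(Int, Int), SeenBuffer, Int] {
  def zero: SeenBuffer = new SeenBuffer()
  def reduce(b: SeenBuffer, a: (Int, Int)): SeenBuffer = { b.values += a._2; b }
  def merge(b1: SeenBuffer, b2: SeenBuffer): SeenBuffer = { b1.values ++= b2.values; b1 }
  def finish(b: SeenBuffer): Int = b.values.size
  def bufferEncoder: Encoder[SeenBuffer] = Encoders.kryo[SeenBuffer]
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

object KryoBufferDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("kryo-buffer-demo")
      .getOrCreate()
    import spark.implicits._

    // toy (key, value) pairs
    val ds = Seq((1, 10), (1, 11), (1, 10), (2, 20)).toDS()
    ds.groupByKey(_._1).agg(DistinctCount.toColumn).show()

    spark.stop()
  }
}
```

counting the log lines from write against the number of input rows per key should show whether the buffer is written once per key per partition or once per reduce call.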

On Thu, Jan 19, 2017 at 4:17 PM, Koert Kuipers <ko...@tresata.com> wrote:

> we just converted a job from RDD to Dataset. the job does a single
> map-reduce phase using aggregators. we are seeing very bad performance for
> the Dataset version, about 10x slower.
>
> in the Dataset version we use kryo encoders for some of the aggregators.
> based on some basic profiling of spark in local mode i believe the bad
> performance is due to the kryo encoders. about 70% of time is spent in
> kryo-related classes.
>
> since we also use kryo for serialization with the RDD i am surprised how
> big the performance difference is.
>
> has anyone seen the same thing? any suggestions for how to improve this?
>
>
