Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Hey Mingyu, I think it's broken out separately so we can record the time taken to serialize the result. Once we serializing it once, the second serialization should be really simple since it's just wrapping something that has already been turned into a byte buffer. Do you see a specific issue

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
The concern is really just the runtime overhead and memory footprint of Java-serializing an already-serialized byte array again. We originally noticed this when we were using RDD.toLocalIterator() which serializes the entire 64MB partition. We worked around this issue by kryo-serializing and

Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Mingyu Kim
Hi all, It looks like the result of task is serialized twice, once by serializer (I.e. Java/Kryo depending on configuration) and once again by closure serializer (I.e. Java). To link the actual code, The first one:

Re: Task result is serialized twice by serializer and closure serializer

2015-03-04 Thread Patrick Wendell
Yeah, it will result in a second serialized copy of the array (costing some memory). But the computational overhead should be very small. The absolute worst case here will be when doing a collect() or something similar that just bundles the entire partition. - Patrick On Wed, Mar 4, 2015 at 5:47