Hey Mingyu,

I think it's broken out separately so we can record the time taken to
serialize the result. Once we've serialized it once, the second
serialization should be really simple since it's just wrapping
something that has already been turned into a byte buffer. Do you see
a specific issue with serializing it twice?

I think you need two steps if you want to record the time taken to
serialize the result, since that timing itself needs to be sent back
to the driver when the task completes.
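
To make this concrete, here is a rough, self-contained sketch of the
two-step pattern, the same shape as the two spots you linked. Plain
Java serialization stands in for both the result serializer and the
closure serializer, and ResultEnvelope is just a made-up stand-in for
DirectTaskResult, so treat it as an illustration rather than the
actual Spark code:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Sketch only, not the actual Spark code. Step 1 serializes the task's
// result value and records how long that took; step 2 wraps the
// already-serialized bytes (plus accumulator updates and the timing)
// in a small envelope, analogous to DirectTaskResult, and serializes
// the envelope for the status update sent back to the driver.
object TwoStepSerializationSketch {

  // Stand-in for DirectTaskResult: the value is already a byte array,
  // so serializing this wrapper mostly just copies those bytes.
  case class ResultEnvelope(
      valueBytes: Array[Byte],
      accumUpdates: Map[Long, Long],
      resultSerializationTimeMs: Long) extends Serializable

  private def javaSerialize(obj: AnyRef): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close()
    bos.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val value = Vector.tabulate(200000)(i => i.toString)  // pretend task result

    // Step 1: serialize the result itself and time it.
    val before = System.nanoTime()
    val valueBytes = javaSerialize(value)
    val serializationTimeMs = (System.nanoTime() - before) / 1000000

    // Step 2: wrap the already-serialized bytes and serialize the
    // envelope. This pass is cheap because the heavy value is already a
    // plain byte array, and it is the only place the timing from step 1
    // can be attached so that it reaches the driver with the result.
    val envelope = ResultEnvelope(valueBytes, Map(1L -> 42L), serializationTimeMs)
    val statusUpdateBytes = javaSerialize(envelope)

    println(s"step 1: ${valueBytes.length} bytes in ${serializationTimeMs} ms")
    println(s"step 2: ${statusUpdateBytes.length} bytes total for the envelope")
  }
}

Running it shows the second pass adds very little on top of the
already-serialized result, which matches the point above about the
second serialization mostly wrapping an existing byte buffer.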

- Patrick

On Wed, Mar 4, 2015 at 4:01 PM, Mingyu Kim <m...@palantir.com> wrote:
> Hi all,
>
> It looks like the result of a task is serialized twice, once by the 
> serializer (i.e. Java/Kryo depending on configuration) and once again by 
> the closure serializer (i.e. Java). To link the actual code,
>
> The first one: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L213
> The second one: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L226
>
> This means the "value", i.e. the result of the task run, is serialized 
> twice, which affects things like collect(), takeSample(), and 
> toLocalIterator(). Would it make sense to simply serialize the 
> DirectTaskResult once using the regular "serializer" (as opposed to the 
> closure serializer)? Would that cause problems when the Accumulator values 
> are not Kryo-serializable?
>
> Alternatively, if we can assume that Accumulator values are small, we can 
> closure-serialize those, put the resulting byte array in DirectTaskResult 
> along with the raw task result "value", and then serialize the 
> DirectTaskResult.
>
> What do people think?
>
> Thanks,
> Mingyu
