A chain of map and flatmap does not cause any
serialization-deserialization.



On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann <mark.heim...@kard.info>
wrote:

> Hello everyone,
>
> I am wondering what the effect of serialization is within a stage.
>
> My understanding of Spark as an execution engine is that the data flow
> graph is divided into stages and a new stage always starts after an
> operation/transformation that cannot be pipelined (such as groupBy or join)
> because it can only be completed after the whole data set has "been taken
> care off". At the end of a stage shuffle files are written and at the
> beginning of the next stage they are read from.
>
> Within a stage my understanding is that pipelining is used, therefore I
> wonder whether there is any serialization overhead involved when there is
> no shuffling taking place. I am also assuming that my data set fits into
> memory and must not be spilled to disk.
>
> So if I would chain multiple *map* or *flatMap* operations and they end
> up in the same stage, will there be any serialization overhead for piping
> the result of the first *map* operation as a parameter into the following
> *map* operation?
>
> Any ideas and feedback appreciated, thanks a lot.
>
> Best regards,
> Mark
>

Reply via email to