A chain of map and flatMap does not cause any serialization-deserialization: within a stage, Spark pipelines such narrow transformations by composing them over a single iterator, so each record passes from one function to the next as an in-memory JVM object.
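A minimal sketch of that pipelining idea in plain Scala (no Spark dependency; the names here are illustrative, not Spark internals): chaining map and flatMap on an Iterator composes the functions lazily over one pass, with no intermediate collection and no bytes written between steps — which is conceptually what Spark does with narrow transformations inside a stage.

```scala
object PipeliningSketch {
  def main(args: Array[String]): Unit = {
    // Source records, analogous to one partition's input iterator.
    val records = Iterator(1, 2, 3)

    // Chained map and flatMap on an Iterator are lazy: each element
    // flows through both functions as a plain JVM object, and nothing
    // is serialized or materialized between the two steps.
    val pipelined = records
      .map(_ * 10)                        // 10, 20, 30
      .flatMap(x => Iterator(x, x + 1))   // 10, 11, 20, 21, 30, 31

    println(pipelined.toList)             // List(10, 11, 20, 21, 30, 31)
  }
}
```

A shuffle boundary (e.g. groupBy or join) is where this breaks down: the whole partition's output must be written out, which is when serialization actually occurs.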
On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann <mark.heim...@kard.info> wrote:
> Hello everyone,
>
> I am wondering what the effect of serialization is within a stage.
>
> My understanding of Spark as an execution engine is that the data flow
> graph is divided into stages, and a new stage always starts after an
> operation/transformation that cannot be pipelined (such as groupBy or join)
> because it can only be completed after the whole data set has been taken
> care of. At the end of a stage shuffle files are written, and at the
> beginning of the next stage they are read from.
>
> Within a stage my understanding is that pipelining is used, so I
> wonder whether there is any serialization overhead involved when there is
> no shuffling taking place. I am also assuming that my data set fits into
> memory and need not be spilled to disk.
>
> So if I were to chain multiple *map* or *flatMap* operations and they ended
> up in the same stage, would there be any serialization overhead for piping
> the result of the first *map* operation as a parameter into the following
> *map* operation?
>
> Any ideas and feedback appreciated, thanks a lot.
>
> Best regards,
> Mark