Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Hemant Bhanawat
A chain of map and flatmap does not cause any
serialization-deserialization.



On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info
wrote:

 Hello everyone,

 I am wondering what the effect of serialization is within a stage.

 My understanding of Spark as an execution engine is that the data flow
 graph is divided into stages and a new stage always starts after an
 operation/transformation that cannot be pipelined (such as groupBy or join)
 because it can only be completed after the whole data set has been taken
 care off. At the end of a stage shuffle files are written and at the
 beginning of the next stage they are read from.

 Within a stage my understanding is that pipelining is used, therefore I
 wonder whether there is any serialization overhead involved when there is
 no shuffling taking place. I am also assuming that my data set fits into
 memory and must not be spilled to disk.

 So if I would chain multiple *map* or *flatMap* operations and they end
 up in the same stage, will there be any serialization overhead for piping
 the result of the first *map* operation as a parameter into the following
 *map* operation?

 Any ideas and feedback appreciated, thanks a lot.

 Best regards,
 Mark



Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Mark Heimann
Thanks a lot guys, that's exactly what I hoped for :-).

Cheers,
Mark

2015-08-13 6:35 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com:

 A chain of map and flatmap does not cause any
 serialization-deserialization.



 On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info
 wrote:

 Hello everyone,

 I am wondering what the effect of serialization is within a stage.

 My understanding of Spark as an execution engine is that the data flow
 graph is divided into stages and a new stage always starts after an
 operation/transformation that cannot be pipelined (such as groupBy or join)
 because it can only be completed after the whole data set has been taken
 care off. At the end of a stage shuffle files are written and at the
 beginning of the next stage they are read from.

 Within a stage my understanding is that pipelining is used, therefore I
 wonder whether there is any serialization overhead involved when there is
 no shuffling taking place. I am also assuming that my data set fits into
 memory and must not be spilled to disk.

 So if I would chain multiple *map* or *flatMap* operations and they end
 up in the same stage, will there be any serialization overhead for piping
 the result of the first *map* operation as a parameter into the
 following *map* operation?

 Any ideas and feedback appreciated, thanks a lot.

 Best regards,
 Mark





Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Zoltán Zvara
Serialization only occurs intra-stage, when you are using Python, and as
far as I know, only in the first stage, when reading the data and passing
it to the Python interpreter the first time.

Multiple operations are just chains of simple *map *and *flatMap *operators
at task level on simple Scala *Iterator of type T*, where T is the type of
RDD.

On Thu, Aug 13, 2015 at 4:09 PM Hemant Bhanawat hemant9...@gmail.com
wrote:

 A chain of map and flatmap does not cause any
 serialization-deserialization.



 On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info
 wrote:

 Hello everyone,

 I am wondering what the effect of serialization is within a stage.

 My understanding of Spark as an execution engine is that the data flow
 graph is divided into stages and a new stage always starts after an
 operation/transformation that cannot be pipelined (such as groupBy or join)
 because it can only be completed after the whole data set has been taken
 care off. At the end of a stage shuffle files are written and at the
 beginning of the next stage they are read from.

 Within a stage my understanding is that pipelining is used, therefore I
 wonder whether there is any serialization overhead involved when there is
 no shuffling taking place. I am also assuming that my data set fits into
 memory and must not be spilled to disk.

 So if I would chain multiple *map* or *flatMap* operations and they end
 up in the same stage, will there be any serialization overhead for piping
 the result of the first *map* operation as a parameter into the
 following *map* operation?

 Any ideas and feedback appreciated, thanks a lot.

 Best regards,
 Mark





What is the Effect of Serialization within Stages?

2015-08-12 Thread Mark Heimann
Hello everyone,

I am wondering what the effect of serialization is within a stage.

My understanding of Spark as an execution engine is that the data flow
graph is divided into stages and a new stage always starts after an
operation/transformation that cannot be pipelined (such as groupBy or join)
because it can only be completed after the whole data set has been taken
care off. At the end of a stage shuffle files are written and at the
beginning of the next stage they are read from.

Within a stage my understanding is that pipelining is used, therefore I
wonder whether there is any serialization overhead involved when there is
no shuffling taking place. I am also assuming that my data set fits into
memory and must not be spilled to disk.

So if I would chain multiple *map* or *flatMap* operations and they end up
in the same stage, will there be any serialization overhead for piping the
result of the first *map* operation as a parameter into the following *map*
operation?

Any ideas and feedback appreciated, thanks a lot.

Best regards,
Mark