Re: What is the Effect of Serialization within Stages?
A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info wrote: Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is that the data flow graph is divided into stages and a new stage always starts after an operation/transformation that cannot be pipelined (such as groupBy or join) because it can only be completed after the whole data set has been taken care off. At the end of a stage shuffle files are written and at the beginning of the next stage they are read from. Within a stage my understanding is that pipelining is used, therefore I wonder whether there is any serialization overhead involved when there is no shuffling taking place. I am also assuming that my data set fits into memory and must not be spilled to disk. So if I would chain multiple *map* or *flatMap* operations and they end up in the same stage, will there be any serialization overhead for piping the result of the first *map* operation as a parameter into the following *map* operation? Any ideas and feedback appreciated, thanks a lot. Best regards, Mark
Re: What is the Effect of Serialization within Stages?
Thanks a lot guys, that's exactly what I hoped for :-). Cheers, Mark 2015-08-13 6:35 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com: A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info wrote: Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is that the data flow graph is divided into stages and a new stage always starts after an operation/transformation that cannot be pipelined (such as groupBy or join) because it can only be completed after the whole data set has been taken care off. At the end of a stage shuffle files are written and at the beginning of the next stage they are read from. Within a stage my understanding is that pipelining is used, therefore I wonder whether there is any serialization overhead involved when there is no shuffling taking place. I am also assuming that my data set fits into memory and must not be spilled to disk. So if I would chain multiple *map* or *flatMap* operations and they end up in the same stage, will there be any serialization overhead for piping the result of the first *map* operation as a parameter into the following *map* operation? Any ideas and feedback appreciated, thanks a lot. Best regards, Mark
Re: What is the Effect of Serialization within Stages?
Serialization only occurs intra-stage, when you are using Python, and as far as I know, only in the first stage, when reading the data and passing it to the Python interpreter the first time. Multiple operations are just chains of simple *map *and *flatMap *operators at task level on simple Scala *Iterator of type T*, where T is the type of RDD. On Thu, Aug 13, 2015 at 4:09 PM Hemant Bhanawat hemant9...@gmail.com wrote: A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info wrote: Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is that the data flow graph is divided into stages and a new stage always starts after an operation/transformation that cannot be pipelined (such as groupBy or join) because it can only be completed after the whole data set has been taken care off. At the end of a stage shuffle files are written and at the beginning of the next stage they are read from. Within a stage my understanding is that pipelining is used, therefore I wonder whether there is any serialization overhead involved when there is no shuffling taking place. I am also assuming that my data set fits into memory and must not be spilled to disk. So if I would chain multiple *map* or *flatMap* operations and they end up in the same stage, will there be any serialization overhead for piping the result of the first *map* operation as a parameter into the following *map* operation? Any ideas and feedback appreciated, thanks a lot. Best regards, Mark
What is the Effect of Serialization within Stages?
Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is that the data flow graph is divided into stages and a new stage always starts after an operation/transformation that cannot be pipelined (such as groupBy or join) because it can only be completed after the whole data set has been taken care off. At the end of a stage shuffle files are written and at the beginning of the next stage they are read from. Within a stage my understanding is that pipelining is used, therefore I wonder whether there is any serialization overhead involved when there is no shuffling taking place. I am also assuming that my data set fits into memory and must not be spilled to disk. So if I would chain multiple *map* or *flatMap* operations and they end up in the same stage, will there be any serialization overhead for piping the result of the first *map* operation as a parameter into the following *map* operation? Any ideas and feedback appreciated, thanks a lot. Best regards, Mark