I think you are mixing the notion of a job from the Hadoop MapReduce world with Spark's. In Spark, RDDs are immutable and transformations are lazy, so an RDD actually fills up memory only when you run the first action on it. After that, a cached RDD stays in memory until either the application is stopped or the RDD is unpersisted (or evicted under memory pressure).
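A minimal sketch of that behaviour, assuming a cached RDD (the path and names here are just illustrative):

// Transformations are lazy: nothing runs yet.
val lines   = sc.textFile("hdfs:///data/input.txt")
val lengths = lines.map(_.length).cache()   // only *marks* the RDD for caching

// The first action triggers the computation and fills the executors' memory.
val total = lengths.reduce(_ + _)

// A later action reuses the cached partitions instead of re-reading the file.
val count = lengths.count()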
On Tue, Aug 18, 2015 at 1:16 PM, Dawid Wysakowicz <wysakowicz.da...@gmail.com> wrote:
No, the data is not stored between two jobs, but it is stored for the lifetime of a job, and a job can run multiple actions.
I too thought so but wanted to confirm. Thanks.
One Spark application can have many jobs: e.g., first call rdd.count and then call rdd.collect; each action triggers its own job.
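For example, a quick sketch (the RDD here is made up):

// One application (one SparkContext) running two jobs:
val rdd = sc.parallelize(1 to 1000000).cache()

rdd.count()    // action #1 -> job #1, also populates the cache
rdd.collect()  // action #2 -> job #2, served from the cached partitions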
At 2015-08-18 15:37:14, Hemant Bhanawat <hemant9...@gmail.com> wrote:
It is still in memory for future rdd transformations and actions.
This is interesting. You mean Spark holds the data in memory between two job executions?
When I do an rdd.collect(), does the data move back to the driver, or is it still held in memory across the executors?
No, the data is not stored between two jobs, but it is stored for the lifetime of a job, and a job can run multiple actions.
For sharing an rdd between jobs you can have a look at Spark Job Server (spark-jobserver, https://github.com/ooyala/spark-jobserver) or at some in-memory storage systems.
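Without such a layer, one common way to hand an RDD's data from one application to the next is through external storage. A minimal sketch, with an illustrative path:

// Application 1: write the RDD's contents out of Spark's memory.
rdd.saveAsObjectFile("hdfs:///tmp/shared-rdd")

// Application 2 (a separate SparkContext): read it back as a new RDD.
val shared = sc.objectFile[Int]("hdfs:///tmp/shared-rdd")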
It is still in memory for future rdd transformations and actions.
This is interesting. You mean Spark holds the data in memory between two job executions? How does the second job get a handle on the data in memory? I am interested in knowing more about this. Can you forward me a Spark article or ...
It is still in memory for future rdd transformations and actions. What you get in the driver is a copy of the data.
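A short sketch of that distinction (names are illustrative):

val rdd = sc.parallelize(1 to 100).cache()
rdd.count()  // materializes the cached partitions on the executors

val local: Array[Int] = rdd.collect()  // a copy in the driver's JVM
// rdd itself is unchanged: its cached partitions still live on the
// executors and serve any further transformations or actions.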
Regards
Sab
On Tue, Aug 18, 2015 at 12:02 PM, praveen S <mylogi...@gmail.com> wrote:
When I do an rdd.collect(), does the data move back to the driver, or is it still held in memory across the executors?