Can someone correct me on this:
1. Jobs run and finish independently of each other. There is no
correlation between job 1 and job 2.
2. If job 2 depends on job 1's output, then persistent storage, such
as a Parquet file on HDFS, can be used to save the outcome of job 1,
and then job 2 can start.
If you mean to persist data in an RDD, then you should do just that --
persist the RDD to durable storage so it can be read later by any
other app. Checkpointing is not a way to store RDDs, but a specific
way to recover the same application in some cases. Parquet has been
supported for a long time.
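For illustration, a minimal Scala sketch of that distinction; the
HDFS paths are just placeholders:

import org.apache.spark.{SparkConf, SparkContext}

object PersistVsCheckpoint {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistVsCheckpoint"))
    val rdd = sc.parallelize(1 to 100).map(i => (i, i * i))

    // Durable persistence: write the data itself out, so any later
    // application can read it back from HDFS.
    rdd.saveAsObjectFile("hdfs:///tmp/job1-output")  // placeholder path

    // Checkpointing: truncates lineage for recovery within THIS
    // application; it is not an interchange format for other apps.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // placeholder path
    rdd.checkpoint()
    rdd.count()  // checkpoint is written when an action materializes the RDD

    sc.stop()
  }
}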
I understood the approach.
Does Spark 1.6 support the Parquet format, i.e. saving to and loading
from Parquet files?
Also, if I use checkpoint, my understanding is that the RDD's location
on the filesystem is not removed when the job is over, so I can read
that RDD in the next job.
Is that one of the use cases of checkpointing?
You just save the data in the RDD, in whatever format you want, to
whatever persistent storage you want, and then re-read it from another
job. This could be Parquet format on HDFS, for example. Parquet is
just a common file format. There is no need to keep the job running
just to keep an RDD alive.
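Concretely, in Spark 1.6 that handoff might look roughly like the
sketch below; the path and column names are made up for the example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetHandoff"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Job 1: turn an RDD into a DataFrame and write it to HDFS as Parquet.
    val df = sc.parallelize(Seq(("a", 1), ("b", 2))).toDF("key", "value")
    df.write.parquet("hdfs:///data/job1-output")  // placeholder path

    // Job 2 (a separate spark-submit) would just read the files back:
    val previous = sqlContext.read.parquet("hdfs:///data/job1-output")
    previous.show()

    sc.stop()
  }
}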
Hi Sachin,
Have a look at the Spark Job Server project; it allows you to share
RDDs and DataFrames between Spark jobs running in the same context.
The catch is that you have to implement your Spark job as a Spark Job
Server job.
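As a rough sketch of what that looks like (this assumes the older
spark.jobserver API with NamedRddSupport; the exact trait and method
names depend on the job server version):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object ShareRddJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    val rdd = sc.parallelize(1 to 100)
    // Publish the RDD under a name; a later job submitted to the SAME
    // long-running context can fetch it via namedRdds.get[Int]("shared")
    this.namedRdds.update("shared", rdd)
    rdd.count()
  }
}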
Hi,
I would need some thoughts, inputs, or a starting point to achieve the
following scenario.
I submit a job using spark-submit with a certain set of parameters.
It reads data from a source, does some processing on RDDs, generates
some output, and completes.
Then I submit the same job again with a different set of parameters,
and that run needs to use the output of the first job.