You just save the data in the RDD, in whatever format you want, to whatever
persistent storage you want, and then re-read it from another job. This could
be Parquet on HDFS, for example; Parquet is just a common columnar file
format. There is no need to keep a job running just to keep an RDD alive.
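
For example, something along these lines. This is only a rough sketch assuming
Spark 1.6 with SQLContext; the HDFS paths, the Record case class, and the app
names are placeholders, not anything from your actual job:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  // Shared shape of the records passed between jobs (made up for illustration).
  case class Record(key: String, value: Long)

  // Job 1: compute a result and persist it as Parquet on HDFS.
  object WriteJob {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("write-job"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val result = sc.parallelize(Seq(Record("a", 1L), Record("b", 2L)))
      // toDF() turns the RDD of case classes into a DataFrame, which can be
      // written out as Parquet; nothing has to stay running afterwards.
      result.toDF().write.mode("overwrite").parquet("hdfs:///tmp/job-output")

      sc.stop()
    }
  }

  // Job 2, submitted later: read the previous output back, merge it with the
  // new batch, and write the combined result to a different path.
  object MergeJob {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("merge-job"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      val previous = sqlContext.read.parquet("hdfs:///tmp/job-output")
      val fresh = sc.parallelize(Seq(Record("c", 3L))).toDF()

      previous.unionAll(fresh)
        .write.mode("overwrite").parquet("hdfs:///tmp/job-output-merged")

      sc.stop()
    }
  }

The second run is just another spark-submit; all it needs from the first run
is the HDFS path, not a live SparkContext or a cached RDD.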

On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
> Hi,
> I would need some thoughts, inputs, or a starting point to achieve the
> following scenario.
> I submit a job using spark-submit with a certain set of parameters.
>
> It reads data from a source, does some processing on RDDs, generates some
> output, and completes.
>
> Then I submit the same job again with the next set of parameters.
> It should also read data from a source and do the same processing, and at
> the same time read the result generated by the previous job, merge the two,
> and store the results again.
>
> This process goes on and on.
>
> So I need to store the RDD (or the output of the RDD) from the previous job
> in some storage, to make it available to the next job.
>
> What are my options?
>
> 1. Use checkpoint.
> Can I checkpoint the final RDD and then load the same RDD again by
> specifying the checkpoint path in the next job? Is checkpoint right for this
> kind of situation?
>
> 2. Save the output of the previous job into a JSON file and then create a
> DataFrame from it in the next job.
> Have I got this right? Is this option better than option 1?
>
> 3. I have heard a lot about Parquet files. However, I don't know how Parquet
> integrates with Spark.
> Can I use it here as intermediate storage?
> Is this available in Spark 1.6?
>
> Any other thoughts or ideas?
>
> Thanks
> Sachin
