You just save the data in the RDD, in whatever format you want, to whatever persistent storage you want, and then re-read it from the next job. Parquet on HDFS is a common choice, for example; Parquet is just a columnar file format. There is no need to keep a job running just to keep an RDD alive.
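A minimal sketch of that pattern, assuming Spark 1.6 (the SQLContext API). The paths, app name, JSON source, and the plain unionAll merge are illustrative only; substitute your real source and merge logic:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-job"))
    val sqlContext = new SQLContext(sc)

    // Read this run's fresh input (source and format are up to you).
    val current = sqlContext.read.json("hdfs:///jobs/input/current")

    // Paths for the previous run's results and this run's output; in
    // practice these could be passed as spark-submit arguments.
    val prevPath = "hdfs:///jobs/results/run-001"
    val outPath  = "hdfs:///jobs/results/run-002"

    // If a previous run left results behind, read them back and merge.
    // Here the "merge" is a plain unionAll (same schema assumed).
    val fs = FileSystem.get(sc.hadoopConfiguration)
    val merged =
      if (fs.exists(new Path(prevPath)))
        current.unionAll(sqlContext.read.parquet(prevPath))
      else
        current

    // Persist as Parquet for the next run to pick up; no job has to
    // stay alive in between.
    merged.write.parquet(outPath)

    sc.stop()
  }
}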
On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
> Hi,
> I would need some thoughts, inputs, or a starting point to achieve the
> following scenario.
>
> I submit a job using spark-submit with a certain set of parameters. It
> reads data from a source, does some processing on RDDs, generates some
> output, and completes.
>
> Then I submit the same job again with the next set of parameters. It
> should also read data from a source and do the same processing, and at
> the same time read the result generated by the previous job, merge the
> two, and again store the results.
>
> This process goes on and on. So I need to store the RDD or output of the
> RDD from the previous job into some storage, to make it available to the
> next job.
>
> What are my options?
>
> 1. Use checkpoint.
> Can I checkpoint the final RDD and then load the same RDD again by
> specifying the checkpoint path in the next job? Is checkpoint right for
> this kind of situation?
>
> 2. Save the output of the previous job into some JSON file and then
> create a data frame from it in the next job.
> Have I got this right? Is this option better than option 1?
>
> 3. I have heard a lot about Parquet files, but I don't know how Parquet
> integrates with Spark. Can I use it here as intermediate storage? Is it
> available in Spark 1.6?
>
> Any other thoughts or ideas?
>
> Thanks
> Sachin