If you mean to persist the data in an RDD, then you should do just that -- persist the RDD to durable storage so it can be read later by any other app. Checkpointing is not a mechanism for sharing RDDs between jobs; it is a specific way to truncate lineage and recover the same application in failure scenarios. Parquet has been supported for a long while, yes. It's the most common binary format. You could also literally store the serialized form of your objects.
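For example, here is a minimal sketch of that approach, assuming Spark 1.6's SQLContext API and made-up HDFS paths and schema (adjust those for your data); the previous job's Parquet output, if any, is read back and merged with the current run's results:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeWithPreviousRun {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-with-previous"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Current run: read and process the new source data as usual.
    val current = sc.textFile("hdfs:///data/input/current")
      .map(_.split(","))
      .map(parts => (parts(0), parts(1).toLong))
      .toDF("key", "value")

    // Output saved as Parquet by the previous submission, if it exists.
    val previousPath = "hdfs:///data/output/previous"
    val fs = org.apache.hadoop.fs.FileSystem.get(sc.hadoopConfiguration)
    val merged =
      if (fs.exists(new org.apache.hadoop.fs.Path(previousPath))) {
        current.unionAll(sqlContext.read.parquet(previousPath))
      } else {
        current
      }

    // Persist the merged result so the next submission can read it.
    merged.write.parquet("hdfs:///data/output/next")
    sc.stop()
  }
}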
On Mon, Aug 29, 2016 at 9:27 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
> I understood the approach.
> Does Spark 1.6 support the Parquet format, I mean saving to and loading
> from Parquet files?
>
> Also, if I use checkpoint, what I understand is that the RDD's location on
> the filesystem is not removed when the job is over, so I can read that RDD
> in the next job.
> Is that one of the use cases of checkpoint? Basically, can my current
> problem be solved using checkpoint?
>
> Also, which option would be better: store the output of the RDD to
> persistent storage, or store the new RDD of that output itself using
> checkpoint?
>
> Thanks
> Sachin
>
>
> On Mon, Aug 29, 2016 at 1:39 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> You just save the data in the RDD in whatever form you want to
>> whatever persistent storage you want, and then re-read it from another
>> job. This could be Parquet format on HDFS, for example. Parquet is just
>> a common file format. There is no need to keep the job running just to
>> keep an RDD alive.
>>
>> On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
>> > Hi,
>> > I would need some thoughts or inputs or any starting point to achieve
>> > the following scenario.
>> > I submit a job using spark-submit with a certain set of parameters.
>> >
>> > It reads data from a source, does some processing on RDDs, generates
>> > some output and completes.
>> >
>> > Then I submit the same job again with the next set of parameters.
>> > It should also read data from a source and do the same processing, and
>> > at the same time read the data from the result generated by the
>> > previous job, merge the two and again store the results.
>> >
>> > This process goes on and on.
>> >
>> > So I need to store the RDD or the output of the RDD of the previous job
>> > into some storage to make it available to the next job.
>> >
>> > What are my options?
>> > 1. Use checkpoint
>> > Can I use checkpoint on the final stage of the RDD and then load the
>> > same RDD again by specifying the checkpoint path in the next job? Is
>> > checkpoint right for this kind of situation?
>> >
>> > 2. Save the output of the previous job into some JSON file and then
>> > create a data frame from that in the next job.
>> > Have I got this right? Is this option better than option 1?
>> >
>> > 3. I have heard a lot about Parquet files. However, I don't know how it
>> > integrates with Spark.
>> > Can I use that here as intermediate storage?
>> > Is this available in Spark 1.6?
>> >
>> > Any other thoughts or ideas?
>> >
>> > Thanks
>> > Sachin