Hi, I need some thoughts, inputs, or a starting point for the following scenario. I submit a job using spark-submit with a certain set of parameters.
It reads data from a source, does some processing on RDDs, generates some output, and completes. Then I submit the same job again with the next set of parameters. It should read data from the source, do the same processing, and at the same time read the results generated by the previous job, merge the two, and store the combined results. This process goes on and on. So I need to persist the final RDD (or its output) from each job in some storage so that it is available to the next job. What are my options?

1. Use checkpointing. Can I checkpoint the final RDD and then load that RDD in the next job by specifying the checkpoint path? Is checkpointing right for this kind of situation?

2. Save the output of the previous job as a JSON file and then create a DataFrame from it in the next job. Have I got this right, and is this option better than option 1?

3. I have heard a lot about Parquet files, but I don't know how they integrate with Spark. Can I use Parquet here as intermediate storage? Is it available in Spark 1.6?

Any other thoughts or ideas?

Thanks,
Sachin
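To make the question concrete, here is a minimal sketch of what I have in mind for options 2 and 3, using the Spark 1.6 DataFrame API (the paths and the app name are placeholders, and I am assuming the previous run has already written its output):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MergeJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("merge-job"))
    val sqlContext = new SQLContext(sc)

    // Results of the current run as a DataFrame (placeholder source)
    val current = sqlContext.read.json("hdfs:///input/current")

    // Option 2: read the previous job's output saved as JSON
    // val previous = sqlContext.read.json("hdfs:///output/previous")
    // Option 3: or read it saved as Parquet (also supported in 1.6)
    val previous = sqlContext.read.parquet("hdfs:///output/previous")

    // Merge the two and write the combined result for the next run
    // (in 1.6 the DataFrame union method is called unionAll)
    current.unionAll(previous).write.parquet("hdfs:///output/next")
  }
}
```

Is this roughly the right shape, or would checkpointing (option 1) be a better fit?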