Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Mich Talebzadeh
Can someone correct me on this:
1. Jobs run and finish independently of each other. There is no correlation between job 1 and job 2.
2. If job 2 depends on job 1's output, then persistent storage, like a Parquet file on HDFS, can be used to save the outcome of job 1, and job 2 can start by reading it back.
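A minimal sketch of that handoff, assuming Spark 1.6 with a SQLContext; the HDFS paths and column names are illustrative, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Job 1: compute a result and persist it as Parquet on HDFS.
    val sc = new SparkContext(new SparkConf().setAppName("job1"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val result = sc.textFile("hdfs:///data/input")
      .map(line => (line, line.length))
      .toDF("value", "length")
    result.write.parquet("hdfs:///data/job1-output")

    // Job 2: a separate spark-submit run that picks up where job 1 left off.
    val previous = sqlContext.read.parquet("hdfs:///data/job1-output")
    previous.show()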

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
If you mean to persist data in an RDD, then you should do just that -- persist the RDD to durable storage so it can be read later by any other app. Checkpointing is not a way to store RDDs, but a specific way to recover the same application in some cases. Parquet has been supported for a long time.
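To illustrate the distinction: checkpointing writes an RDD's contents to the checkpoint directory so lineage can be truncated for fault recovery within one application; it is not a hand-off mechanism between separately submitted jobs. A sketch (the directory path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))
    // Checkpoint files land here, but they are meant for recovery,
    // not for sharing data with an independently submitted job.
    sc.setCheckpointDir("hdfs:///tmp/checkpoints")

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)
    rdd.checkpoint()   // marks the RDD for checkpointing
    rdd.count()        // an action materializes it; files are written now
    // Lineage is truncated within THIS application; other applications
    // have no supported API for rebuilding an RDD from these files.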

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sachin Mittal
I understood the approach. Does Spark 1.6 support the Parquet format, i.e. saving to and loading from a Parquet file? Also, if I use checkpoint, my understanding is that the RDD's location on the filesystem is not removed when the job is over, so I could read that RDD in the next job. Is that one of the use cases of checkpointing?

Re: How can we connect RDD from previous job to next job

2016-08-29 Thread Sean Owen
You just save the data in the RDD, in whatever form you want, to whatever persistent storage you want, and then re-read it from another job. This could be Parquet format on HDFS, for example; Parquet is just a common file format. There is no need to keep the job running just to keep an RDD alive.
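For a plain RDD with no schema, the same pattern works with the core RDD save/load calls instead of Parquet; a sketch with illustrative paths:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-handoff"))

    // Job 1: write the RDD out in a simple serialized form.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
    pairs.saveAsObjectFile("hdfs:///data/pairs")   // Java-serialized SequenceFile

    // Job 2: read it back in a later, separate application.
    val restored = sc.objectFile[(String, Int)]("hdfs:///data/pairs")
    restored.collect().foreach(println)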

Re: How can we connect RDD from previous job to next job

2016-08-28 Thread Roger Marin
Hi Sachin, have a look at the Spark Job Server project. It allows you to share RDDs and DataFrames between Spark jobs running in the same context; the catch is that you have to implement your job as a Spark Job Server job.
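A hedged sketch of that approach, based on spark-jobserver's NamedRddSupport trait as I recall it; the job names and the "shared" key are illustrative, so check the project's docs before relying on this:

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

    // First job: compute an RDD and cache it under a name in the shared context.
    object ProducerJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any = {
        val rdd = sc.parallelize(1 to 100)
        this.namedRdds.update("shared", rdd)
        rdd.count()
      }
    }

    // Second job, submitted later to the SAME context: fetch the RDD by name.
    object ConsumerJob extends SparkJob with NamedRddSupport {
      override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid
      override def runJob(sc: SparkContext, config: Config): Any = {
        val rdd = this.namedRdds.get[Int]("shared").get
        rdd.sum()
      }
    }

Note this only works while the shared context stays up; once it is torn down, the named RDDs are gone, which is why the durable-storage answers above still apply.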

How can we connect RDD from previous job to next job

2016-08-28 Thread Sachin Mittal
Hi, I would need some thoughts or inputs or any starting point to achieve the following scenario. I submit a job using spark-submit with a certain set of parameters. It reads data from a source, does some processing on RDDs, generates some output, and completes. Then I submit the same job again with a different set of parameters, and I would like it to reuse the RDDs produced by the previous run.