Can someone correct me on this:

   1. Jobs run and finish independently of each other. There is no
   correlation between job 1 and job 2.
   2. If job 2 depends on job 1's output, then persistent storage such as a
   Parquet file on HDFS can be used to save the outcome of job 1, and job 2
   can start by reading that file. That is the only pipeline (see the
   sketch below).
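
A minimal sketch of that handoff, assuming Spark 1.6 with the DataFrame
API and hypothetical HDFS paths:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// job 1 (one spark-submit): do its processing, then write the final
// result to durable storage
val sc = new SparkContext(new SparkConf().setAppName("job1"))
val sqlContext = new SQLContext(sc)
val result = sqlContext.read.json("hdfs:///input/source")  // stand-in for the real source and processing
result.write.parquet("hdfs:///output/job1")                // the handoff point

// job 2 (a separate spark-submit, with its own SparkContext/SQLContext):
// start by reading job 1's output
val previous = sqlContext.read.parquet("hdfs:///output/job1")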

I read an article about the Spark server at Bloomberg. Has anyone looked
at this? I could not make much sense of it:

http://www.slideshare.net/JenAman/spark-at-bloomberg-dynamically-composable-analytics

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 August 2016 at 09:43, Sean Owen <so...@cloudera.com> wrote:

> If you mean to persist data in an RDD, then you should do just that --
> persist the RDD to durable storage so it can be read later by any
> other app. Checkpointing is not a way to store RDDs, but a specific
> way to recover the same application in some cases. Parquet has been
> supported for a long while, yes. It's the most common binary format.
> You could also literally store the serialized form of your objects.
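>
> A minimal sketch of the "serialized form" route (untested; the path,
> the RDD, and its element type are hypothetical, and an existing
> SparkContext `sc` is assumed):
>
> // job 1: write the RDD's elements as SequenceFiles of serialized objects
> myRdd.saveAsObjectFile("hdfs:///output/job1-rdd")
>
> // a later job: read them back, supplying the same element type
> val restored = sc.objectFile[MyRecord]("hdfs:///output/job1-rdd")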
>
> On Mon, Aug 29, 2016 at 9:27 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
> > I understood the approach.
> > Does Spark 1.6 support the Parquet format, i.e. saving to and loading
> > from a Parquet file?
> >
> > Also, if I use checkpoint, my understanding is that the RDD's location
> > on the filesystem is not removed when the job is over, so I can read
> > that RDD in the next job.
> > Is that one of the use cases of checkpoint? Basically, can my current
> > problem be solved using checkpoint?
> >
> > Also, which option would be better: store the output of the RDD to
> > persistent storage, or store the new RDD of that output itself using
> > checkpoint?
> >
> > Thanks
> > Sachin
> >
> >
> >
> >
> > On Mon, Aug 29, 2016 at 1:39 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> You just save the data in the RDD in whatever form you want to
> >> whatever persistent storage you want, and then re-read it from another
> >> job. This could be Parquet format on HDFS for example. Parquet is just
> >> a common file format. There is no need to keep the job running just to
> >> keep an RDD alive.
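> >>
> >> For instance, in the simplest form (a sketch; the path and the RDD are
> >> hypothetical, and an existing SparkContext `sc` is assumed):
> >>
> >> // one job: write the RDD out as plain text lines on HDFS
> >> someRdd.map(_.toString).saveAsTextFile("hdfs:///output/job1-text")
> >>
> >> // another job: read the lines back and parse as needed
> >> val lines = sc.textFile("hdfs:///output/job1-text")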
> >>
> >> On Mon, Aug 29, 2016 at 5:30 AM, Sachin Mittal <sjmit...@gmail.com> wrote:
> >> > Hi,
> >> > I would need some thoughts, inputs, or any starting point to achieve
> >> > the following scenario.
> >> > I submit a job using spark-submit with a certain set of parameters.
> >> >
> >> > It reads data from a source, does some processing on RDDs, generates
> >> > some output, and completes.
> >> >
> >> > Then I submit the same job again with the next set of parameters.
> >> > It should again read data from a source and do the same processing,
> >> > and at the same time read the result generated by the previous job,
> >> > merge the two, and store the results again.
> >> >
> >> > This process goes on and on.
> >> >
> >> > So I need to store the RDD, or the output of the RDD, of the previous
> >> > job in some storage, to make it available to the next job.
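> >> >
> >> > A minimal sketch of that merge-and-store loop (untested; assumes Spark
> >> > 1.6 DataFrames, Parquet on HDFS, and hypothetical paths passed in as
> >> > the job's parameters):
> >> >
> >> > val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> >> > val current  = sqlContext.read.json(sourcePath)            // this run's input, after processing
> >> > val previous = sqlContext.read.parquet(previousOutputPath) // result stored by the last run
> >> > val merged   = current.unionAll(previous)                  // schemas/column order must match
> >> > merged.write.parquet(newOutputPath)                        // write to a fresh path for the next run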
> >> >
> >> > What are my options?
> >> > 1. Use checkpoint.
> >> > Can I checkpoint the final RDD and then load that same RDD again by
> >> > specifying the checkpoint path in the next job? Is checkpoint right
> >> > for this kind of situation? (See the checkpoint sketch below.)
> >> >
> >> > 2. Save the output of the previous job into a JSON file and then
> >> > create a data frame from it in the next job (see the JSON sketch
> >> > below).
> >> > Have I got this right? Is this option better than option 1?
> >> >
> >> > 3. I have heard a lot about Parquet files. However, I don't know how
> >> > they integrate with Spark.
> >> > Can I use that here as intermediate storage?
> >> > Is this available in Spark 1.6?
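> >> >
> >> > For reference, minimal sketches of options 1 and 2 (untested; paths
> >> > and names are hypothetical, Spark 1.6 APIs assumed):
> >> >
> >> > // option 1 (checkpoint), for reference only: the files land under an
> >> > // application-specific subdirectory, and there is no public API to
> >> > // load them back from a different job
> >> > sc.setCheckpointDir("hdfs:///tmp/checkpoints")
> >> > finalRdd.checkpoint()   // must be called before any action on this RDD
> >> > finalRdd.count()        // the checkpoint is written when an action runs
> >> >
> >> > // option 2 (JSON handoff), assuming a DataFrame `result` in one run
> >> > result.write.json("hdfs:///output/run-1")
> >> > // ... and in the next run:
> >> > val previous = sqlContext.read.json("hdfs:///output/run-1")  // schema inferred from the JSON
> >> >
> >> > Option 3 (Parquet) works the same way with write.parquet / read.parquet
> >> > and is available in Spark 1.6; being columnar and schema-aware, it is
> >> > usually more compact and faster to read back than JSON.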
> >> >
> >> > Any other thoughts or ideas?
> >> >
> >> > Thanks
> >> > Sachin
> >> >
> >> >
> >> >
> >> >
> >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
