I do not think you can share data across Spark contexts, so as long as you can pass everything around within a single context you should be good.

On 23 Apr 2015 17:12, "Suraj Shetiya" <surajshet...@gmail.com> wrote:
> Hi,
>
> I have come across ways of building input/transform/output pipelines with
> Java (Google Dataflow, Spark, etc.). I also understand that Spark itself
> provides ways of creating a pipeline within MLlib for ML transforms
> (primarily fit). Both of the above are available in the Java/Scala
> environment, and the latter is supported on Python as well.
>
> However, if my understanding is correct, pipelines within ML transforms
> do not create a complete dataflow transform for non-ML scenarios (e.g. I/O
> transforms, DataFrame/graph transforms). Correct me if otherwise. I would
> like to know the best way to create a Spark dataflow pipeline in a generic
> way. I have a use case where my input files are in different formats; I
> would like to convert them to RDDs, build the DataFrame transforms, and
> finally stream or store the results. I hope to avoid disk I/O between
> pipeline tasks.
>
> I also came across Luigi (http://luigi.readthedocs.org/en/latest/) for
> Python, but I found that it stores the contents on disk and reloads them
> for the next phase of the pipeline.
>
> Appreciate it if you can share your thoughts.
>
> --
> Regards,
> Suraj