I do not think you can share data across Spark contexts, so as long as you can pass everything around within a single context you should be good.

On 23 Apr 2015 17:12, "Suraj Shetiya" <surajshet...@gmail.com> wrote:
> Hi,
>
> I have come across ways of building input/transform/output pipelines with
> Java (Google Dataflow, Spark, etc.). I also understand that Spark itself
> provides ways of creating a pipeline within MLlib for ML transforms
> (primarily fit). Both of the above are available in the Java/Scala
> environment, and the latter is supported on Python as well.
>
> However, if my understanding is correct, pipelines within ML transforms
> do not create a complete dataflow transform for non-ML scenarios (e.g. I/O
> transforms, DataFrame/graph transforms). Correct me if otherwise. I would
> like to know the best way to create a Spark dataflow pipeline in a generic
> way. I have a use case where my input files are in different formats; I
> would like to convert them to RDDs, build the DataFrame transforms, and
> finally stream or store the results. I hope to avoid disk I/O between
> pipeline tasks.
>
> I also came across Luigi (http://luigi.readthedocs.org/en/latest/) for
> Python, but I found that it stores the contents on disk and reloads them
> for the next phase of the pipeline.
>
> Appreciate it if you can share your thoughts.
>
> --
> Regards,
> Suraj