Hi,

I came across the documentation for creating a pipeline in the MLlib
library of PySpark. I wanted to know whether something similar exists for
PySpark input transformations. My use case is this: I have input files in
different formats, and I would like to convert them to RDDs, keep them in
memory, and perform certain custom tasks in a pipeline without writing
anything back to disk at any step. I came across Luigi
(http://luigi.readthedocs.org/en/latest/), but I found that it stores the
contents on disk and reloads them for the next phase of the pipeline.
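
For concreteness, here is a rough sketch of the kind of pipeline I have
in mind, built from plain RDD transformations. The file paths and record
shapes are made up, and I'm assuming the JSON input is one object per
line:

from pyspark import SparkContext
import json

sc = SparkContext(appName="in-memory-pipeline")

# Stage 1: load inputs in different formats into RDDs.
csv_rdd = sc.textFile("data/input.csv").map(lambda line: line.split(","))
json_rdd = sc.textFile("data/input.json").map(json.loads)

# Stage 2: normalize both sources to a common record shape
# (the two-field CSV layout here is just an example).
records = csv_rdd.map(lambda f: {"id": f[0], "value": f[1]}).union(json_rdd)

# Keep the intermediate result in memory; nothing is written to disk
# unless an action like saveAsTextFile is explicitly called.
records.cache()

# Later stages chain lazily on the cached RDD.
cleaned = records.filter(lambda r: r.get("value") is not None)
print(cleaned.count())

Chaining transformations like this keeps everything in memory, but it
doesn't give me the named, reusable pipeline stages that the MLlib
Pipeline API provides, which is what I'm hoping exists for input
transformations.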

-- 
Thanks and regards,
Suraj
