Hi,

I came across the documentation for creating a pipeline with PySpark's MLlib library, and I wanted to know if something similar exists for PySpark input transformations. My use case is this: I have input files in different formats, and I would like to convert them to RDDs, keep them in memory, and run certain custom tasks as a pipeline without writing anything back to disk at any step. I came across Luigi (http://luigi.readthedocs.org/en/latest/), but I found that it stores intermediate results to disk and reloads them for the next phase of the pipeline.
--
Thanks and regards,
Suraj