You can persist the transformed DataFrames, for example:

    val data: DataFrame = ...
    val hashedData = hashingTF.transform(data)
    hashedData.cache() // cache the DataFrame in memory
Future uses of hashedData will now read from the in-memory cache. You can also persist to disk, e.g.:

    hashedData.write.parquet(FilePath) // write the DataFrame to disk in Parquet format
    ...
    val savedHashedData = sqlContext.read.parquet(FilePath)

Future uses of savedHashedData will read from disk. Like my earlier response, this still requires you to call each PipelineStage's `transform` method individually (i.e. to NOT use the overall Pipeline.setStages API).

On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> Thanks for your response, but I'm afraid "model save/load" is not exactly
> the feature I'm looking for.
>
> What I need to cache and reuse are the intermediate outputs of the
> transformations, not the transformers themselves. Do you know of any
> related dev activities or plans?
>
> Best,
> Lewis
>
> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> Lewis,
>>
>> Many pipeline stages implement save/load methods, which can be used if
>> you instantiate and call the underlying pipeline stages' `transform`
>> methods individually (instead of using the Pipeline.setStages API). See
>> the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>
>> Pipeline persistence is on the 1.6 roadmap; JIRA here
>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>
>> Feynman
>>
>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a question regarding the ability of ML pipelines to cache
>>> intermediate results. I've posted this question on Stack Overflow
>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>> but got no answer, so I hope someone here can help me out.
>>>
>>> ===========
>>> Lately I've been planning to migrate my standalone Python ML code to
>>> Spark. The ML pipeline in spark.ml turns out to be quite handy, with a
>>> streamlined API for chaining up algorithm stages and hyper-parameter
>>> grid search.
>>>
>>> Still, I found its support for one important feature obscure in the
>>> existing documentation: caching of intermediate results. The importance
>>> of this feature arises when the pipeline involves computation-intensive
>>> stages.
>>>
>>> For example, in my case I use a huge sparse matrix to perform multiple
>>> moving averages on time-series data in order to form input features.
>>> The structure of the matrix is determined by a hyper-parameter. This
>>> step turns out to be a bottleneck for the entire pipeline because I
>>> have to construct the matrix at runtime.
>>>
>>> During parameter search, I usually have other parameters to examine in
>>> addition to this "structure parameter". So if I can reuse the huge
>>> matrix when the "structure parameter" is unchanged, I can save tons of
>>> time. For this reason, I intentionally structured my code to cache and
>>> reuse these intermediate results.
>>>
>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>> automatically? Or do I have to write code to do so manually? If so, is
>>> there any best practice to learn from?
>>>
>>> P.S. I have looked into the official documentation and some other
>>> material, but none of it seems to discuss this topic.
>>>
>>> Best,
>>> Lewis
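To make the reuse pattern from the thread concrete, here is a minimal plain-Scala sketch (no Spark dependency; all names are hypothetical stand-ins): memoize the expensive, structure-dependent step and loop the cheaper hyper-parameters inside it, so each distinct "structure parameter" is built only once across the whole grid.

```scala
object CacheSketch {
  /** Expensive, structure-dependent step (hypothetical stand-in for the
    * sparse moving-average matrix described in the thread). */
  def buildMatrix(structureParam: Int): Vector[Double] =
    Vector.tabulate(4)(i => (i * structureParam).toDouble)

  /** Cheap per-parameter evaluation (hypothetical stand-in for fitting
    * and scoring a model on the precomputed features). */
  def scoreModel(matrix: Vector[Double], reg: Double): Double =
    matrix.sum * reg

  /** Grid search that memoizes the expensive step keyed by the structure
    * parameter. Returns (number of expensive builds, score per grid point). */
  def gridSearch(structures: Seq[Int],
                 regs: Seq[Double]): (Int, Map[(Int, Double), Double]) = {
    var builds = 0
    val cache = scala.collection.mutable.Map.empty[Int, Vector[Double]]
    val scores = for {
      s <- structures
      r <- regs
    } yield {
      // getOrElseUpdate evaluates the expensive step only on a cache miss
      val matrix = cache.getOrElseUpdate(s, { builds += 1; buildMatrix(s) })
      ((s, r), scoreModel(matrix, r))
    }
    (builds, scores.toMap)
  }

  def main(args: Array[String]): Unit = {
    // 6 grid points, but only 2 distinct structures => 2 expensive builds
    val (builds, scores) = gridSearch(Seq(1, 2), Seq(0.1, 1.0, 10.0))
    println(s"expensive builds: $builds")
    scores.toSeq.sorted.foreach(println)
  }
}
```

The same shape carries over to Spark by replacing buildMatrix with the manually invoked stage transforms plus `.cache()` (or a Parquet write/read), as suggested above.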