Hey Feynman,

Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for.
What I need to cache and reuse are the intermediate outputs of transformations, not the transformers themselves. Do you know of any related development activities or plans? (I've put a rough sketch of the pattern I mean at the bottom of this mail.)

Best,
Lewis

2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> Lewis,
>
> Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages' `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>
> Pipeline persistence is on the 1.6 roadmap, JIRA here <https://issues.apache.org/jira/browse/SPARK-6725>.
>
> Feynman
>
> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a question regarding the ability of the ML pipeline to cache intermediate results. I've posted this question on Stack Overflow <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline> but got no answer; I hope someone here can help me out.
>>
>> ===========
>> Lately I've been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining up algorithm stages and hyper-parameter grid search.
>>
>> Still, I found its support for one important feature obscure in the existing documentation: caching of intermediate results. The importance of this feature arises when the pipeline involves computation-intensive stages.
>>
>> For example, in my case I use a huge sparse matrix to perform multiple moving averages on time-series data in order to form input features. The structure of the matrix is determined by some hyper-parameter. This step turns out to be a bottleneck for the entire pipeline because I have to construct the matrix at runtime.
>>
>> During parameter search, I usually have other parameters to examine in addition to this "structure parameter". So if I can reuse the huge matrix when the "structure parameter" is unchanged, I can save tons of time. For this reason, I intentionally structured my code to cache and reuse these intermediate results.
>>
>> So my question is: can Spark's ML pipeline handle intermediate caching automatically? Or do I have to manually write code to do so? If so, is there any best practice to learn from?
>>
>> P.S. I have looked into the official documentation and some other material, but none of them seems to discuss this topic.
>>
>> Best,
>> Lewis
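
P.S. For concreteness, here is a rough, untested sketch of the manual caching I currently do, written against the pyspark.ml API. `build_features` and `raw_df` are just placeholders for my expensive matrix-based feature construction and input data; the only point is that the intermediate DataFrame is materialized and cached once per "structure parameter" and then reused across the rest of the grid search:

# Rough sketch only -- `build_features` and `raw_df` are hypothetical
# placeholders for the expensive, structure-parameter-dependent feature step.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()

for window_size in [5, 10, 20]:        # the expensive "structure parameter"
    # Build the feature DataFrame once per structure parameter and cache it,
    # so the inner hyper-parameter sweep can reuse it without recomputation.
    features = build_features(raw_df, window_size)
    features.cache()

    for reg in [0.01, 0.1, 1.0]:       # cheap downstream hyper-parameters
        lr = LogisticRegression(regParam=reg)
        model = lr.fit(features)
        # Evaluating on the training data here only to keep the sketch short;
        # my real code uses a separate validation split.
        score = evaluator.evaluate(model.transform(features))
        print(window_size, reg, score)

    features.unpersist()

What I'm hoping for is that the pipeline / grid-search machinery could do this kind of reuse for me instead of my hand-rolling the loops.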