If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>.
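For example, a grid search over both a "structure" parameter (numFeatures) and a model parameter (regParam) might look roughly like the minimal sketch below. The stages, column names, and grid values are illustrative assumptions, not from this thread; "training" is assumed to be a DataFrame with "words" (array of strings) and "label" (double) columns:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(hashingTF, lr))

// Grid over a "structure" parameter and a model parameter.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 16))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Each fold's training/validation split is cached once and every parameter
// map is evaluated against the cached split.
val cvModel = cv.fit(training)

Since CrossValidator caches each fold once and sweeps all parameter maps against it, the common grid-search case is covered without manual caching.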
Otherwise, perhaps you can elaborate more on your particular use case for caching intermediate results; if the current API doesn't support it, we can create a JIRA for it.

On Tue, Sep 15, 2015 at 10:26 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Yeah, I understand that on the low level we should do as you said.
>
> But since the ML pipeline is a high-level API, it is pretty natural to
> expect it to recognize overlapping parameters between successive runs.
> (Actually, this happens A LOT when we have lots of hyper-params to search
> for.)
>
> I can also imagine the implementation by appending parameter information
> to the cached results. Let's say we implemented an "equal" method for
> param1. By comparing param1 with the previous run, the program would know
> that data1 is reusable, and the time spent generating data1 could be
> saved.
>
> Best,
> Lewis
>
> 2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> Nope, and that's intentional. There is no guarantee that rawData did not
>> change between intermediate calls to searchRun, so reusing a cached
>> data1 would be incorrect.
>>
>> If you want data1 to be cached between multiple runs, you have a few
>> options:
>> * cache it first and pass it in as an argument to searchRun
>> * use a creational pattern like a singleton to ensure only one
>> instantiation
>>
>> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hey Feynman,
>>>
>>> I doubt DF persistence will work in my case. Let's use the following
>>> code:
>>> ==============================================
>>> def searchRun(params=[param1, param2]):
>>>     data1 = hashing1.transform(rawData, param1)
>>>     data1.cache()
>>>     data2 = hashing2.transform(data1, param2)
>>>     data2.someAction()
>>> ==============================================
>>> Say we run "searchRun()" twice with the same "param1" but a different
>>> "param2". Will Spark recognize that the two local variables "data1" in
>>> consecutive runs have the same content?
>>>
>>> Best,
>>> Lewis
>>>
>>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>
>>>> You can persist the transformed DataFrames, for example:
>>>>
>>>> val data: DF = ...
>>>> val hashedData = hashingTF.transform(data)
>>>> hashedData.cache() // to cache the DataFrame in memory
>>>>
>>>> Future usages of hashedData will now read from the in-memory cache.
>>>>
>>>> You can also persist to disk, e.g.:
>>>>
>>>> hashedData.write.parquet(FilePath) // to write the DataFrame to disk
>>>> in Parquet format
>>>> ...
>>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>>
>>>> Future uses of savedHashedData will read from the Parquet file on
>>>> disk.
>>>>
>>>> As in my earlier response, this will still require you to call each
>>>> PipelineStage's `transform` method individually (i.e. to NOT use the
>>>> overall Pipeline.setStages API).
>>>>
>>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Feynman,
>>>>>
>>>>> Thanks for your response, but I'm afraid "model save/load" is not
>>>>> exactly the feature I'm looking for.
>>>>>
>>>>> What I need to cache and reuse are the intermediate outputs of
>>>>> transformations, not the transformers themselves. Do you know of any
>>>>> related dev. activities or plans?
>>>>>
>>>>> Best,
>>>>> Lewis
>>>>>
>>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>>
>>>>>> Lewis,
>>>>>>
>>>>>> Many pipeline stages implement save/load methods, which can be used
>>>>>> if you instantiate and call the underlying pipeline stages'
>>>>>> `transform` methods individually (instead of using the
>>>>>> Pipeline.setStages API). See the associated JIRAs
>>>>>> <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>>
>>>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>>>
>>>>>> Feynman
>>>>>>
>>>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a question regarding the ability of the ML pipeline to cache
>>>>>>> intermediate results. I've posted this question on stackoverflow
>>>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>>>> but got no answer; I hope someone here can help me out.
>>>>>>>
>>>>>>> ===========
>>>>>>> Lately I've been planning to migrate my standalone Python ML code
>>>>>>> to Spark. The ML pipeline in spark.ml turns out to be quite handy,
>>>>>>> with a streamlined API for chaining up algorithm stages and
>>>>>>> hyper-parameter grid search.
>>>>>>>
>>>>>>> Still, I found its support for one important feature obscure in the
>>>>>>> existing documentation: caching of intermediate results. The
>>>>>>> importance of this feature arises when the pipeline involves
>>>>>>> computation-intensive stages.
>>>>>>>
>>>>>>> For example, in my case I use a huge sparse matrix to perform
>>>>>>> multiple moving averages on time-series data in order to form input
>>>>>>> features. The structure of the matrix is determined by some
>>>>>>> hyper-parameter. This step turns out to be a bottleneck for the
>>>>>>> entire pipeline because I have to construct the matrix at runtime.
>>>>>>>
>>>>>>> During parameter search, I usually have other parameters to examine
>>>>>>> in addition to this "structure parameter". So if I can reuse the
>>>>>>> huge matrix when the "structure parameter" is unchanged, I can save
>>>>>>> tons of time. For this reason, I intentionally structured my code
>>>>>>> to cache and reuse these intermediate results.
>>>>>>>
>>>>>>> So my question is: can Spark's ML pipeline handle such intermediate
>>>>>>> caching automatically? Or do I have to write code to do so
>>>>>>> manually? If so, is there any best practice to learn from?
>>>>>>>
>>>>>>> P.S. I have looked into the official documentation and some other
>>>>>>> material, but none of them seems to discuss this topic.
>>>>>>>
>>>>>>> Best,
>>>>>>> Lewis
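For reference, Feynman's first option above (cache data1 once and pass it into searchRun) might look roughly like the following Scala sketch. The HashingTF stages, column names, and parameter values are hypothetical stand-ins for the thread's hashing1/hashing2 and param1/param2 (HashingTF itself is cheap and merely stands in for an expensive transformation), and sqlContext is assumed to be in scope as in spark-shell:

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame

// Hypothetical stand-ins for the thread's hashing1/hashing2 transformers.
val hashing1 = new HashingTF().setInputCol("words").setOutputCol("feats1")
val hashing2 = new HashingTF().setInputCol("words").setOutputCol("feats2")

// Tiny example input standing in for the thread's rawData.
val rawData: DataFrame = sqlContext.createDataFrame(Seq(
  (0L, Seq("a", "b", "c")),
  (1L, Seq("b", "c", "d"))
)).toDF("id", "words")

def searchRun(data1: DataFrame, param2: Int): Unit = {
  val data2 = hashing2.setNumFeatures(param2).transform(data1)
  data2.count() // stand-in for data2.someAction()
}

for (param1 <- Seq(1 << 18, 1 << 20)) {
  val data1 = hashing1.setNumFeatures(param1).transform(rawData)
  data1.cache() // the expensive intermediate is materialized once per param1
  for (param2 <- Seq(1 << 10, 1 << 12)) {
    searchRun(data1, param2) // every param2 setting reuses the cached data1
  }
  data1.unpersist()
}

Because data1 is cached outside the inner loop, the sweep over param2 reuses the materialized intermediate instead of recomputing it, which is exactly the reuse asked about in the original question.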