Nope, and that's intentional. There is no guarantee that rawData did not change between successive calls to searchRun, so reusing a cached data1 would be incorrect.
If you want data1 to be cached between multiple runs, you have a few options:

* cache it first and pass it in as an argument to searchRun
* use a creational pattern like singleton to ensure only one instantiation

On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> I doubt DF persistence will work in my case. Let's use the following code:
>
> ==============================================
> def searchRun( params = [param1, param2] )
>     data1 = hashing1.transform(rawData, param1)
>     data1.cache()
>     data2 = hashing2.transform(data1, param2)
>     data2.someAction()
> ==============================================
>
> Say we run searchRun() twice with the same param1 but different param2.
> Will Spark recognize that the two local variables data1 in consecutive
> runs have the same content?
>
> Best,
> Lewis
>
> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> You can persist the transformed DataFrames, for example:
>>
>> val data : DF = ...
>> val hashedData = hashingTF.transform(data)
>> hashedData.cache() // to cache DataFrame in memory
>>
>> Future uses of hashedData read from the in-memory cache now.
>>
>> You can also persist to disk, e.g.:
>>
>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet format to disk
>> ...
>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>
>> Future uses of savedHashedData read from disk now.
>>
>> As in my earlier response, this will still require you to call each
>> PipelineStage's `transform` method individually (i.e. to NOT use the
>> overall Pipeline.setStages API).
>>
>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>
>>> Hey Feynman,
>>>
>>> Thanks for your response, but I'm afraid "model save/load" is not
>>> exactly the feature I'm looking for.
>>>
>>> What I need to cache and reuse are the intermediate outputs of
>>> transformations, not the transformers themselves. Do you know of any
>>> related dev. activities or plans?
>>>
>>> Best,
>>> Lewis
>>>
>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>
>>>> Lewis,
>>>>
>>>> Many pipeline stages implement save/load methods, which can be used if
>>>> you instantiate the underlying pipeline stages and call their `transform`
>>>> methods individually (instead of using the Pipeline.setStages API). See
>>>> the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>
>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>
>>>> Feynman
>>>>
>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a question regarding the ability of the ML pipeline to cache
>>>>> intermediate results. I've posted this question on stackoverflow
>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>> but got no answer; hope someone here can help me out.
>>>>>
>>>>> ===========
>>>>> Lately I've been planning to migrate my standalone Python ML code to
>>>>> Spark. The ML pipeline in spark.ml turns out to be quite handy, with a
>>>>> streamlined API for chaining up algorithm stages and hyper-parameter
>>>>> grid search.
>>>>>
>>>>> Still, I found its support for one important feature obscure in the
>>>>> existing documents: caching of intermediate results. The importance of
>>>>> this feature arises when the pipeline involves computation-intensive
>>>>> stages.
>>>>>
>>>>> For example, in my case I use a huge sparse matrix to perform multiple
>>>>> moving averages on time-series data in order to form input features.
>>>>> The structure of the matrix is determined by some hyper-parameter. This
>>>>> step turns out to be a bottleneck for the entire pipeline because I
>>>>> have to construct the matrix at runtime.
>>>>>
>>>>> During parameter search, I usually have other parameters to examine in
>>>>> addition to this "structure parameter".
>>>>> So if I can reuse the huge matrix when the "structure parameter" is
>>>>> unchanged, I can save tons of time. For this reason, I intentionally
>>>>> formed my code to cache and reuse these intermediate results.
>>>>>
>>>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>>>> automatically? Or do I have to manually form code to do so? If so, is
>>>>> there any best practice to learn from?
>>>>>
>>>>> P.S. I have looked into the official document and some other material,
>>>>> but none of them seems to discuss this topic.
>>>>>
>>>>> Best,
>>>>> Lewis
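[Editor's note] The first option above ("cache it first and pass it in as an argument to searchRun") can be sketched outside of Spark as plain memoization keyed on the structure parameter. This is a minimal illustration in standalone Python, not Spark code: the names search_run, get_data1, and the two transform functions are hypothetical stand-ins for Lewis's pseudocode, and the reuse is only safe here because RAW_DATA is held fixed, matching the caveat at the top of the thread.

```python
from functools import lru_cache

RAW_DATA = (1, 2, 3)  # fixed input; memoization is only valid while this does not change

def expensive_transform(raw_data, param1):
    # Stand-in for the costly stage (e.g. building the huge sparse matrix).
    return [x * param1 for x in raw_data]

def cheap_transform(data1, param2):
    # Stand-in for the downstream stage that varies with param2.
    return sum(data1) + param2

@lru_cache(maxsize=None)
def get_data1(param1):
    # data1 is computed once per distinct param1 and reused afterwards.
    return tuple(expensive_transform(RAW_DATA, param1))

def search_run(param1, param2):
    data1 = get_data1(param1)  # reused across runs with the same param1
    return cheap_transform(data1, param2)

# Two runs with the same param1 but different param2: the second run
# reuses the cached data1 instead of recomputing it.
search_run(10, 1)
search_run(10, 2)
```

The same idea applies with DataFrames: compute and cache() data1 once per distinct param1 outside searchRun, then pass it in, rather than relying on Spark to recognize that two independently derived DataFrames have identical contents.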