Hey Feynman,

I doubt DF persistence will work in my case. Let's use the following code:

==============================================
def searchRun(param1, param2):
    data1 = hashing1.transform(rawData, param1)
    data1.cache()
    data2 = hashing2.transform(data1, param2)
    data2.someAction()
==============================================

Say we run searchRun() twice with the same param1 but a different param2. Will Spark recognize that the local variable data1 holds the same content across the two consecutive runs?
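To be concrete, here is the kind of manual reuse I suspect I'd have to write myself, assuming Spark caches by DataFrame identity/lineage rather than by content. This is a minimal pure-Python sketch; all names are hypothetical stand-ins (list operations in place of the real transforms), not real spark.ml calls:

```python
# Sketch: memoize the expensive first stage myself, keyed on param1,
# since a second transform() call would produce a *new* DataFrame that
# Spark's cache would not connect to the first one.
# stage1/raw values are stand-ins for hashing1.transform(rawData, param1).

_stage1_cache = {}
calls = {"stage1": 0}

def stage1(raw_data, param1):
    calls["stage1"] += 1                     # count expensive constructions
    return [x * param1 for x in raw_data]    # stand-in for hashing1.transform

def search_run(raw_data, param1, param2):
    if param1 not in _stage1_cache:          # reuse when param1 is unchanged
        _stage1_cache[param1] = stage1(raw_data, param1)
    data1 = _stage1_cache[param1]
    data2 = [x + param2 for x in data1]      # stand-in for hashing2.transform
    return data2

raw = [1, 2, 3]
search_run(raw, 10, 0)
search_run(raw, 10, 5)   # same param1: stage1 is not recomputed
```

With this structure, two runs that share param1 construct data1 only once, regardless of param2.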
Best,
Lewis

2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> You can persist the transformed DataFrames, for example:
>
> val data : DF = ...
> val hashedData = hashingTF.transform(data)
> hashedData.cache() // to cache DataFrame in memory
>
> Future uses of hashedData will read from the in-memory cache.
>
> You can also persist to disk, e.g.:
>
> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet format to disk
> ...
> val savedHashedData = sqlContext.read.parquet(FilePath)
>
> Future uses of savedHashedData will read from disk.
>
> As in my earlier response, this will still require you to call each
> PipelineStage's `transform` method individually (i.e., NOT use the overall
> Pipeline.setStages API).
>
> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Hey Feynman,
>>
>> Thanks for your response, but I'm afraid "model save/load" is not exactly
>> the feature I'm looking for.
>>
>> What I need to cache and reuse are the intermediate outputs of the
>> transformations, not the transformers themselves. Do you know of any
>> related dev activities or plans?
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> Lewis,
>>>
>>> Many pipeline stages implement save/load methods, which can be used if
>>> you instantiate the underlying pipeline stages and call their `transform`
>>> methods individually (instead of using the Pipeline.setStages API). See
>>> the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>
>>> Pipeline persistence is on the 1.6 roadmap; JIRA here
>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>
>>> Feynman
>>>
>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a question regarding the ability of the ML pipeline to cache
>>>> intermediate results.
>>>> I've posted this question on Stack Overflow
>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>> but got no answer; I hope someone here can help me out.
>>>>
>>>> ===========
>>>> Lately I've been planning to migrate my standalone Python ML code to
>>>> Spark. The ML pipeline in spark.ml turns out to be quite handy, with a
>>>> streamlined API for chaining up algorithm stages and hyper-parameter
>>>> grid search.
>>>>
>>>> Still, I found its support for one important feature unclear in the
>>>> existing documentation: caching of intermediate results. The importance
>>>> of this feature arises when the pipeline involves computation-intensive
>>>> stages.
>>>>
>>>> For example, in my case I use a huge sparse matrix to perform multiple
>>>> moving averages on time-series data in order to form input features.
>>>> The structure of the matrix is determined by a hyper-parameter. This
>>>> step turns out to be a bottleneck for the entire pipeline because I
>>>> have to construct the matrix at runtime.
>>>>
>>>> During parameter search, I usually have other parameters to examine in
>>>> addition to this "structure parameter". So if I can reuse the huge
>>>> matrix when the "structure parameter" is unchanged, I can save tons of
>>>> time. For this reason, I intentionally structured my code to cache and
>>>> reuse these intermediate results.
>>>>
>>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>>> automatically? Or do I have to manually write code to do so? If so, is
>>>> there any best practice to learn from?
>>>>
>>>> P.S. I have looked into the official documentation and some other
>>>> material, but none of them seems to discuss this topic.
>>>>
>>>> Best,
>>>> Lewis
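To make the reuse described in my quoted question concrete: the saving comes from ordering the grid search so the "structure parameter" drives the outer loop and the expensive matrix is built once per outer iteration. A minimal pure-Python sketch, with build_matrix and evaluate as hypothetical stand-ins for my real pipeline stages:

```python
# Sketch: sweep the expensive "structure parameter" in the outer loop so
# the costly matrix construction runs once per structure value, while the
# cheap hyper-parameters are swept in the inner loop against the cached
# matrix. All names here are illustrative stand-ins.

build_count = 0

def build_matrix(structure_param):
    global build_count
    build_count += 1                 # the expensive, runtime-constructed step
    return [[structure_param] * 3 for _ in range(3)]

def evaluate(matrix, other_param):
    return sum(sum(row) for row in matrix) * other_param

results = {}
for structure_param in [1, 2]:            # expensive axis: built once each
    matrix = build_matrix(structure_param)
    for other_param in [0.1, 0.2, 0.3]:   # cheap axis: reuses the matrix
        results[(structure_param, other_param)] = evaluate(matrix, other_param)

# build_matrix runs 2 times here instead of 6
```

The same loop ordering applies whether the per-structure result is held as a cached DataFrame in memory or persisted to disk between runs.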