Nope, and that's intentional. There is no guarantee that rawData did not change between successive calls to searchRun, so reusing a cached data1 would be incorrect.
If you want data1 to be cached between multiple runs, you have a few options:

* cache it first and pass it in as an argument to searchRun
* use a creational pattern like singleton to ensure only one instantiation

On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> I doubt DF persistence will work in my case. Let's use the following code:
>
> ==============================================
> def searchRun( params = [param1, param2] )
>     data1 = hashing1.transform(rawData, param1)
>     data1.cache()
>     data2 = hashing2.transform(data1, param2)
>     data2.someAction()
> ==============================================
>
> Say we run searchRun() twice with the same param1 but different param2.
> Will Spark recognize that the two local variables data1 in consecutive
> runs have the same content?
>
> Best,
> Lewis
>
> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> You can persist the transformed DataFrames, for example:
>>
>> val data : DF = ...
>> val hashedData = hashingTF.transform(data)
>> hashedData.cache() // to cache DataFrame in memory
>>
>> Future uses of hashedData read from the in-memory cache now.
>>
>> You can also persist to disk, e.g.:
>>
>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet format to disk
>> ...
>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>
>> Future uses of savedHashedData read from disk now.
>>
>> As in my earlier response, this will still require you to call each
>> PipelineStage's `transform` method individually (i.e. to NOT use the
>> overall Pipeline.setStages API).
>>
>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>
>>> Hey Feynman,
>>>
>>> Thanks for your response, but I'm afraid "model save/load" is not
>>> exactly the feature I'm looking for.
>>>
>>> What I need to cache and reuse are the intermediate outputs of
>>> transformations, not the transformers themselves. Do you know of any
>>> related dev. activities or plans?
>>>
>>> Best,
>>> Lewis
>>>
>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>
>>>> Lewis,
>>>>
>>>> Many pipeline stages implement save/load methods, which can be used if
>>>> you instantiate the underlying pipeline stages and call their `transform`
>>>> methods individually (instead of using the Pipeline.setStages API). See
>>>> the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>
>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>
>>>> Feynman
>>>>
>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I have a question regarding the ability of the ML pipeline to cache
>>>>> intermediate results. I've posted this question on stackoverflow
>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>> but got no answer; hope someone here can help me out.
>>>>>
>>>>> ===========
>>>>> Lately I've been planning to migrate my standalone Python ML code to
>>>>> Spark. The ML pipeline in spark.ml turns out to be quite handy, with a
>>>>> streamlined API for chaining up algorithm stages and hyper-parameter
>>>>> grid search.
>>>>>
>>>>> Still, I found its support for one important feature obscure in the
>>>>> existing documents: caching of intermediate results. The importance of
>>>>> this feature arises when the pipeline involves computation-intensive
>>>>> stages.
>>>>>
>>>>> For example, in my case I use a huge sparse matrix to perform multiple
>>>>> moving averages on time-series data in order to form input features.
>>>>> The structure of the matrix is determined by some hyper-parameter. This
>>>>> step turns out to be a bottleneck for the entire pipeline because I
>>>>> have to construct the matrix at runtime.
>>>>>
>>>>> During parameter search, I usually have other parameters to examine in
>>>>> addition to this "structure parameter".
>>>>> So if I can reuse the huge matrix when the "structure parameter" is
>>>>> unchanged, I can save tons of time. For this reason, I intentionally
>>>>> formed my code to cache and reuse these intermediate results.
>>>>>
>>>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>>>> automatically? Or do I have to manually form code to do so? If so, is
>>>>> there any best practice to learn from?
>>>>>
>>>>> P.S. I have looked into the official document and some other material,
>>>>> but none of them seems to discuss this topic.
>>>>>
>>>>> Best,
>>>>> Lewis
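[Editor's note] The first option above ("cache it first and pass it in as an argument to searchRun") can be sketched outside of Spark as plain memoization keyed on the structure parameter. This is a minimal illustration in standalone Python, not Spark code: the names search_run, get_data1, and the two transform functions are hypothetical stand-ins for Lewis's pseudocode, and the reuse is only safe here because RAW_DATA is held fixed, matching the caveat at the top of the thread.

```python
from functools import lru_cache

RAW_DATA = (1, 2, 3)  # fixed input; memoization is only valid while this does not change

def expensive_transform(raw_data, param1):
    # Stand-in for the costly stage (e.g. building the huge sparse matrix).
    return [x * param1 for x in raw_data]

def cheap_transform(data1, param2):
    # Stand-in for the downstream stage that varies with param2.
    return sum(data1) + param2

@lru_cache(maxsize=None)
def get_data1(param1):
    # data1 is computed once per distinct param1 and reused afterwards.
    return tuple(expensive_transform(RAW_DATA, param1))

def search_run(param1, param2):
    data1 = get_data1(param1)  # reused across runs with the same param1
    return cheap_transform(data1, param2)

# Two runs with the same param1 but different param2: the second run
# reuses the cached data1 instead of recomputing it.
search_run(10, 1)
search_run(10, 2)
```

The same idea applies with DataFrames: compute and cache() data1 once per distinct param1 outside searchRun, then pass it in, rather than relying on Spark to recognize that two independently derived DataFrames have identical contents.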