Re: Caching intermediate results in Spark ML pipeline?

Jingchu Liu Tue, 15 Sep 2015 22:27:12 -0700

Yeah I understand on the low-level we should do as you said.

But since ML pipeline is a high-level API, it is pretty natural to expect
the ability to recognize overlapping parameters between successive runs.
(Actually, this happen A LOT when we have lots of hyper-params to search
for)


I can also imagine the implementation by appending parameter information to
the cached results. Let's say if we implemented an "equal" method for
param1. By comparing param1 with the previous run, the program will know
data1 is reusable. And time used for generating data1 can be saved.

Best,
Lewis

2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> Nope, and that's intentional. There is no guarantee that rawData did not
> change between intermediate calls to searchRun, so reusing a cached data1
> would be incorrect.
>
> If you want data1 to be cached between multiple runs, you have a few
> options:
> * cache it first and pass it in as an argument to searchRun
> * use a creational pattern like singleton to ensure only one instantiation
>
> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com>
> wrote:
>
>> Hey Feynman,
>>
>> I doubt DF persistence will work in my case. Let's use the following code:
>> ==============================================
>> def searchRun( params = [param1, param2] )
>>       data1 = hashing1.transform(rawData, param1)
>>       data1.cache()
>>       data2 = hashing2.transform(data1, param2)
>>       data2.someAction()
>> ==============================================
>> Say if we run "searchRun()" for 2 times with the same "param1" but
>> different "param2". Will spark recognize that the two local variables
>> "data1" in consecutive runs has the same content?
>>
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> You can persist the transformed Dataframes, for example
>>>
>>> val data : DF = ...
>>> val hashedData = hashingTF.transform(data)
>>> hashedData.cache() // to cache DataFrame in memory
>>>
>>> Future usage of hashedData read from an in-memory cache now.
>>>
>>> You can also persist to disk, eg:
>>>
>>> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet
>>> format to disk
>>> ...
>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>
>>> Future uses of hash
>>>
>>> Like my earlier response, this will still require you call each
>>> PipelineStage's `transform` method (i.e. to NOT use the overall
>>> Pipeline.setStages API)
>>>
>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>>> wrote:
>>>
>>>> Hey Feynman,
>>>>
>>>> Thanks for your response, but I'm afraid "model save/load" is not
>>>> exactly the feature I'm looking for.
>>>>
>>>> What I need to cache and reuse are the intermediate outputs of
>>>> transformations, not transformer themselves. Do you know any related dev.
>>>> activities or plans?
>>>>
>>>> Best,
>>>> Lewis
>>>>
>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>
>>>>> Lewis,
>>>>>
>>>>> Many pipeline stages implement save/load methods, which can be used if
>>>>> you instantiate and call the underlying pipeline stages `transform` 
>>>>> methods
>>>>> individually (instead of using the Pipeline.setStages API). See associated
>>>>> JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>
>>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>>
>>>>> Feynman
>>>>>
>>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have a question regarding the ability of ML pipeline to cache
>>>>>> intermediate results. I've posted this question on stackoverflow
>>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>>> but got no answer, hope someone here can help me out.
>>>>>>
>>>>>> ===========
>>>>>> Lately I'm planning to migrate my standalone python ML code to spark.
>>>>>> The ML pipeline in spark.ml turns out quite handy, with streamlined
>>>>>> API for chaining up algorithm stages and hyper-parameter grid search.
>>>>>>
>>>>>> Still, I found its support for one important feature obscure in
>>>>>> existing documents: caching of intermediate results. The importance of 
>>>>>> this
>>>>>> feature arise when the pipeline involves computation intensive stages.
>>>>>>
>>>>>> For example, in my case I use a huge sparse matrix to perform
>>>>>> multiple moving averages on time series data in order to form input
>>>>>> features. The structure of the matrix is determined by some
>>>>>> hyper-parameter. This step turns out to be a bottleneck for the entire
>>>>>> pipeline because I have to construct the matrix in runtime.
>>>>>>
>>>>>> During parameter search, I usually have other parameters to examine
>>>>>> in addition to this "structure parameter". So if I can reuse the huge
>>>>>> matrix when the "structure parameter" is unchanged, I can save tons of
>>>>>> time. For this reason, I intentionally formed my code to cache and reuse
>>>>>> these intermediate results.
>>>>>>
>>>>>> So my question is: can Spark's ML pipeline handle intermediate
>>>>>> caching automatically? Or do I have to manually form code to do so? If 
>>>>>> so,
>>>>>> is there any best practice to learn from?
>>>>>>
>>>>>> P.S. I have looked into the official document and some other
>>>>>> material, but none of them seems to discuss this topic.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Lewis
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Caching intermediate results in Spark ML pipeline?

Reply via email to