If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>.
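For example, a grid search over both a "structure" parameter (numFeatures) and a model parameter (regParam) might look roughly like the minimal sketch below. The stages, column names, and grid values are illustrative assumptions, not from this thread; "training" is assumed to be a DataFrame with "words" (array of strings) and "label" (double) columns:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(hashingTF, lr))

// Grid over a "structure" parameter and a model parameter.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 10, 1 << 16))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Each fold's training/validation split is cached once and every parameter
// map is evaluated against the cached split.
val cvModel = cv.fit(training)

Since CrossValidator caches each fold once and sweeps all parameter maps against it, the common grid-search case is covered without manual caching.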
Otherwise, perhaps you can elaborate more on your particular use case for caching intermediate results; if the current API doesn't support it, we can create a JIRA for it.

On Tue, Sep 15, 2015 at 10:26 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Yeah, I understand that on the low level we should do as you said.
>
> But since the ML pipeline is a high-level API, it is pretty natural to
> expect it to recognize overlapping parameters between successive runs.
> (Actually, this happens A LOT when we have lots of hyper-params to search
> for.)
>
> I can also imagine the implementation by appending parameter information
> to the cached results. Let's say we implemented an "equal" method for
> param1. By comparing param1 with the previous run, the program would know
> that data1 is reusable, and the time spent generating data1 could be
> saved.
>
> Best,
> Lewis
>
> 2015-09-15 23:05 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> Nope, and that's intentional. There is no guarantee that rawData did not
>> change between intermediate calls to searchRun, so reusing a cached
>> data1 would be incorrect.
>>
>> If you want data1 to be cached between multiple runs, you have a few
>> options:
>> * cache it first and pass it in as an argument to searchRun
>> * use a creational pattern like a singleton to ensure only one
>> instantiation
>>
>> On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hey Feynman,
>>>
>>> I doubt DF persistence will work in my case. Let's use the following
>>> code:
>>> ==============================================
>>> def searchRun(params=[param1, param2]):
>>>     data1 = hashing1.transform(rawData, param1)
>>>     data1.cache()
>>>     data2 = hashing2.transform(data1, param2)
>>>     data2.someAction()
>>> ==============================================
>>> Say we run "searchRun()" twice with the same "param1" but a different
>>> "param2". Will Spark recognize that the two local variables "data1" in
>>> consecutive runs have the same content?
>>>
>>> Best,
>>> Lewis
>>>
>>> 2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>
>>>> You can persist the transformed DataFrames, for example:
>>>>
>>>> val data: DF = ...
>>>> val hashedData = hashingTF.transform(data)
>>>> hashedData.cache() // to cache the DataFrame in memory
>>>>
>>>> Future usages of hashedData will now read from the in-memory cache.
>>>>
>>>> You can also persist to disk, e.g.:
>>>>
>>>> hashedData.write.parquet(FilePath) // to write the DataFrame to disk
>>>> in Parquet format
>>>> ...
>>>> val savedHashedData = sqlContext.read.parquet(FilePath)
>>>>
>>>> Future uses of savedHashedData will read from the Parquet file on
>>>> disk.
>>>>
>>>> As in my earlier response, this will still require you to call each
>>>> PipelineStage's `transform` method individually (i.e. to NOT use the
>>>> overall Pipeline.setStages API).
>>>>
>>>> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey Feynman,
>>>>>
>>>>> Thanks for your response, but I'm afraid "model save/load" is not
>>>>> exactly the feature I'm looking for.
>>>>>
>>>>> What I need to cache and reuse are the intermediate outputs of
>>>>> transformations, not the transformers themselves. Do you know of any
>>>>> related dev. activities or plans?
>>>>>
>>>>> Best,
>>>>> Lewis
>>>>>
>>>>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>>>>
>>>>>> Lewis,
>>>>>>
>>>>>> Many pipeline stages implement save/load methods, which can be used
>>>>>> if you instantiate and call the underlying pipeline stages'
>>>>>> `transform` methods individually (instead of using the
>>>>>> Pipeline.setStages API). See the associated JIRAs
>>>>>> <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>>>>
>>>>>> Pipeline persistence is on the 1.6 roadmap, JIRA here
>>>>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>>>>
>>>>>> Feynman
>>>>>>
>>>>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have a question regarding the ability of the ML pipeline to cache
>>>>>>> intermediate results. I've posted this question on stackoverflow
>>>>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>>>>> but got no answer; I hope someone here can help me out.
>>>>>>>
>>>>>>> ===========
>>>>>>> Lately I've been planning to migrate my standalone Python ML code
>>>>>>> to Spark. The ML pipeline in spark.ml turns out to be quite handy,
>>>>>>> with a streamlined API for chaining up algorithm stages and
>>>>>>> hyper-parameter grid search.
>>>>>>>
>>>>>>> Still, I found its support for one important feature obscure in the
>>>>>>> existing documentation: caching of intermediate results. The
>>>>>>> importance of this feature arises when the pipeline involves
>>>>>>> computation-intensive stages.
>>>>>>>
>>>>>>> For example, in my case I use a huge sparse matrix to perform
>>>>>>> multiple moving averages on time-series data in order to form input
>>>>>>> features. The structure of the matrix is determined by some
>>>>>>> hyper-parameter. This step turns out to be a bottleneck for the
>>>>>>> entire pipeline because I have to construct the matrix at runtime.
>>>>>>>
>>>>>>> During parameter search, I usually have other parameters to examine
>>>>>>> in addition to this "structure parameter". So if I can reuse the
>>>>>>> huge matrix when the "structure parameter" is unchanged, I can save
>>>>>>> tons of time. For this reason, I intentionally structured my code
>>>>>>> to cache and reuse these intermediate results.
>>>>>>>
>>>>>>> So my question is: can Spark's ML pipeline handle such intermediate
>>>>>>> caching automatically? Or do I have to write code to do so
>>>>>>> manually? If so, is there any best practice to learn from?
>>>>>>>
>>>>>>> P.S. I have looked into the official documentation and some other
>>>>>>> material, but none of them seems to discuss this topic.
>>>>>>>
>>>>>>> Best,
>>>>>>> Lewis
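For reference, Feynman's first option above (cache data1 once and pass it into searchRun) might look roughly like the following Scala sketch. The HashingTF stages, column names, and parameter values are hypothetical stand-ins for the thread's hashing1/hashing2 and param1/param2 (HashingTF itself is cheap and merely stands in for an expensive transformation), and sqlContext is assumed to be in scope as in spark-shell:

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame

// Hypothetical stand-ins for the thread's hashing1/hashing2 transformers.
val hashing1 = new HashingTF().setInputCol("words").setOutputCol("feats1")
val hashing2 = new HashingTF().setInputCol("words").setOutputCol("feats2")

// Tiny example input standing in for the thread's rawData.
val rawData: DataFrame = sqlContext.createDataFrame(Seq(
  (0L, Seq("a", "b", "c")),
  (1L, Seq("b", "c", "d"))
)).toDF("id", "words")

def searchRun(data1: DataFrame, param2: Int): Unit = {
  val data2 = hashing2.setNumFeatures(param2).transform(data1)
  data2.count() // stand-in for data2.someAction()
}

for (param1 <- Seq(1 << 18, 1 << 20)) {
  val data1 = hashing1.setNumFeatures(param1).transform(rawData)
  data1.cache() // the expensive intermediate is materialized once per param1
  for (param2 <- Seq(1 << 10, 1 << 12)) {
    searchRun(data1, param2) // every param2 setting reuses the cached data1
  }
  data1.unpersist()
}

Because data1 is cached outside the inner loop, the sweep over param2 reuses the materialized intermediate instead of recomputing it, which is exactly the reuse asked about in the original question.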