Hey Feynman,

I doubt DF persistence will work in my case. Let's use the following code:

==============================================
def searchRun(param1, param2):
    data1 = hashing1.transform(rawData, param1)
    data1.cache()
    data2 = hashing2.transform(data1, param2)
    data2.someAction()
==============================================

Say we run searchRun() twice with the same param1 but a different param2. Will Spark recognize that the local variable data1 holds the same content across the two consecutive runs?
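To be concrete, here is the kind of manual reuse I suspect I'd have to write myself, assuming Spark caches by DataFrame identity/lineage rather than by content. This is a minimal pure-Python sketch; all names are hypothetical stand-ins (list operations in place of the real transforms), not real spark.ml calls:

```python
# Sketch: memoize the expensive first stage myself, keyed on param1,
# since a second transform() call would produce a *new* DataFrame that
# Spark's cache would not connect to the first one.
# stage1/raw values are stand-ins for hashing1.transform(rawData, param1).

_stage1_cache = {}
calls = {"stage1": 0}

def stage1(raw_data, param1):
    calls["stage1"] += 1                     # count expensive constructions
    return [x * param1 for x in raw_data]    # stand-in for hashing1.transform

def search_run(raw_data, param1, param2):
    if param1 not in _stage1_cache:          # reuse when param1 is unchanged
        _stage1_cache[param1] = stage1(raw_data, param1)
    data1 = _stage1_cache[param1]
    data2 = [x + param2 for x in data1]      # stand-in for hashing2.transform
    return data2

raw = [1, 2, 3]
search_run(raw, 10, 0)
search_run(raw, 10, 5)   # same param1: stage1 is not recomputed
```

With this structure, two runs that share param1 construct data1 only once, regardless of param2.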
Best,
Lewis

2015-09-15 13:58 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> You can persist the transformed DataFrames, for example:
>
> val data : DF = ...
> val hashedData = hashingTF.transform(data)
> hashedData.cache() // to cache DataFrame in memory
>
> Future uses of hashedData will read from the in-memory cache.
>
> You can also persist to disk, e.g.:
>
> hashedData.write.parquet(FilePath) // to write DataFrame in Parquet format to disk
> ...
> val savedHashedData = sqlContext.read.parquet(FilePath)
>
> Future uses of savedHashedData will read from disk.
>
> As in my earlier response, this will still require you to call each
> PipelineStage's `transform` method individually (i.e., NOT use the overall
> Pipeline.setStages API).
>
> On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Hey Feynman,
>>
>> Thanks for your response, but I'm afraid "model save/load" is not exactly
>> the feature I'm looking for.
>>
>> What I need to cache and reuse are the intermediate outputs of the
>> transformations, not the transformers themselves. Do you know of any
>> related dev activities or plans?
>>
>> Best,
>> Lewis
>>
>> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>>
>>> Lewis,
>>>
>>> Many pipeline stages implement save/load methods, which can be used if
>>> you instantiate the underlying pipeline stages and call their `transform`
>>> methods individually (instead of using the Pipeline.setStages API). See
>>> the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>>
>>> Pipeline persistence is on the 1.6 roadmap; JIRA here
>>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>>
>>> Feynman
>>>
>>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have a question regarding the ability of the ML pipeline to cache
>>>> intermediate results.
>>>> I've posted this question on Stack Overflow
>>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>>> but got no answer; I hope someone here can help me out.
>>>>
>>>> ===========
>>>> Lately I've been planning to migrate my standalone Python ML code to
>>>> Spark. The ML pipeline in spark.ml turns out to be quite handy, with a
>>>> streamlined API for chaining up algorithm stages and hyper-parameter
>>>> grid search.
>>>>
>>>> Still, I found its support for one important feature unclear in the
>>>> existing documentation: caching of intermediate results. The importance
>>>> of this feature arises when the pipeline involves computation-intensive
>>>> stages.
>>>>
>>>> For example, in my case I use a huge sparse matrix to perform multiple
>>>> moving averages on time-series data in order to form input features.
>>>> The structure of the matrix is determined by a hyper-parameter. This
>>>> step turns out to be a bottleneck for the entire pipeline because I
>>>> have to construct the matrix at runtime.
>>>>
>>>> During parameter search, I usually have other parameters to examine in
>>>> addition to this "structure parameter". So if I can reuse the huge
>>>> matrix when the "structure parameter" is unchanged, I can save tons of
>>>> time. For this reason, I intentionally structured my code to cache and
>>>> reuse these intermediate results.
>>>>
>>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>>> automatically? Or do I have to manually write code to do so? If so, is
>>>> there any best practice to learn from?
>>>>
>>>> P.S. I have looked into the official documentation and some other
>>>> material, but none of them seems to discuss this topic.
>>>>
>>>> Best,
>>>> Lewis
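To make the reuse described in my quoted question concrete: the saving comes from ordering the grid search so the "structure parameter" drives the outer loop and the expensive matrix is built once per outer iteration. A minimal pure-Python sketch, with build_matrix and evaluate as hypothetical stand-ins for my real pipeline stages:

```python
# Sketch: sweep the expensive "structure parameter" in the outer loop so
# the costly matrix construction runs once per structure value, while the
# cheap hyper-parameters are swept in the inner loop against the cached
# matrix. All names here are illustrative stand-ins.

build_count = 0

def build_matrix(structure_param):
    global build_count
    build_count += 1                 # the expensive, runtime-constructed step
    return [[structure_param] * 3 for _ in range(3)]

def evaluate(matrix, other_param):
    return sum(sum(row) for row in matrix) * other_param

results = {}
for structure_param in [1, 2]:            # expensive axis: built once each
    matrix = build_matrix(structure_param)
    for other_param in [0.1, 0.2, 0.3]:   # cheap axis: reuses the matrix
        results[(structure_param, other_param)] = evaluate(matrix, other_param)

# build_matrix runs 2 times here instead of 6
```

The same loop ordering applies whether the per-structure result is held as a cached DataFrame in memory or persisted to disk between runs.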