Hey Feynman,

Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for.
What I need to cache and reuse are the intermediate outputs of transformations, not the transformers themselves. Do you know of any related development activities or plans? (I've put a rough sketch of the pattern I mean at the bottom of this mail.)

Best,
Lewis

2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:

> Lewis,
>
> Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages' `transform` methods individually (instead of using the Pipeline.setStages API). See the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>
> Pipeline persistence is on the 1.6 roadmap, JIRA here <https://issues.apache.org/jira/browse/SPARK-6725>.
>
> Feynman
>
> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a question regarding the ability of the ML pipeline to cache intermediate results. I've posted this question on Stack Overflow <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline> but got no answer; I hope someone here can help me out.
>>
>> ===========
>> Lately I've been planning to migrate my standalone Python ML code to Spark. The ML pipeline in spark.ml turns out to be quite handy, with a streamlined API for chaining up algorithm stages and hyper-parameter grid search.
>>
>> Still, I found its support for one important feature obscure in the existing documentation: caching of intermediate results. The importance of this feature arises when the pipeline involves computation-intensive stages.
>>
>> For example, in my case I use a huge sparse matrix to perform multiple moving averages on time-series data in order to form input features. The structure of the matrix is determined by some hyper-parameter. This step turns out to be a bottleneck for the entire pipeline because I have to construct the matrix at runtime.
>>
>> During parameter search, I usually have other parameters to examine in addition to this "structure parameter". So if I can reuse the huge matrix when the "structure parameter" is unchanged, I can save tons of time. For this reason, I intentionally structured my code to cache and reuse these intermediate results.
>>
>> So my question is: can Spark's ML pipeline handle intermediate caching automatically? Or do I have to manually write code to do so? If so, is there any best practice to learn from?
>>
>> P.S. I have looked into the official documentation and some other material, but none of them seems to discuss this topic.
>>
>> Best,
>> Lewis
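
P.S. For concreteness, here is a rough, untested sketch of the manual caching I currently do, written against the pyspark.ml API. `build_features` and `raw_df` are just placeholders for my expensive matrix-based feature construction and input data; the only point is that the intermediate DataFrame is materialized and cached once per "structure parameter" and then reused across the rest of the grid search:

# Rough sketch only -- `build_features` and `raw_df` are hypothetical
# placeholders for the expensive, structure-parameter-dependent feature step.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()

for window_size in [5, 10, 20]:        # the expensive "structure parameter"
    # Build the feature DataFrame once per structure parameter and cache it,
    # so the inner hyper-parameter sweep can reuse it without recomputation.
    features = build_features(raw_df, window_size)
    features.cache()

    for reg in [0.01, 0.1, 1.0]:       # cheap downstream hyper-parameters
        lr = LogisticRegression(regParam=reg)
        model = lr.fit(features)
        # Evaluating on the training data here only to keep the sketch short;
        # my real code uses a separate validation split.
        score = evaluator.evaluate(model.transform(features))
        print(window_size, reg, score)

    features.unpersist()

What I'm hoping for is that the pipeline / grid-search machinery could do this kind of reuse for me instead of my hand-rolling the loops.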