You can persist the transformed DataFrames, for example:

    val data: DataFrame = ...
    val hashedData = hashingTF.transform(data)
    hashedData.cache() // cache the DataFrame in memory
Future uses of hashedData will now read from the in-memory cache. You can also persist to disk, e.g.:

    hashedData.write.parquet(FilePath) // write the DataFrame to disk in Parquet format
    ...
    val savedHashedData = sqlContext.read.parquet(FilePath)

Future uses of savedHashedData will read from disk. Like my earlier response, this still requires you to call each PipelineStage's `transform` method individually (i.e. to NOT use the overall Pipeline.setStages API).

On Mon, Sep 14, 2015 at 10:45 PM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> Thanks for your response, but I'm afraid "model save/load" is not exactly
> the feature I'm looking for.
>
> What I need to cache and reuse are the intermediate outputs of the
> transformations, not the transformers themselves. Do you know of any
> related dev activities or plans?
>
> Best,
> Lewis
>
> 2015-09-15 13:03 GMT+08:00 Feynman Liang <fli...@databricks.com>:
>
>> Lewis,
>>
>> Many pipeline stages implement save/load methods, which can be used if
>> you instantiate and call the underlying pipeline stages' `transform`
>> methods individually (instead of using the Pipeline.setStages API). See
>> the associated JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
>>
>> Pipeline persistence is on the 1.6 roadmap; JIRA here
>> <https://issues.apache.org/jira/browse/SPARK-6725>.
>>
>> Feynman
>>
>> On Mon, Sep 14, 2015 at 9:20 PM, Jingchu Liu <liujing...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have a question regarding the ability of ML pipelines to cache
>>> intermediate results. I've posted this question on Stack Overflow
>>> <http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
>>> but got no answer, so I hope someone here can help me out.
>>>
>>> ===========
>>> Lately I've been planning to migrate my standalone Python ML code to
>>> Spark. The ML pipeline in spark.ml turns out to be quite handy, with a
>>> streamlined API for chaining up algorithm stages and hyper-parameter
>>> grid search.
>>>
>>> Still, I found its support for one important feature obscure in the
>>> existing documentation: caching of intermediate results. The importance
>>> of this feature arises when the pipeline involves computation-intensive
>>> stages.
>>>
>>> For example, in my case I use a huge sparse matrix to perform multiple
>>> moving averages on time-series data in order to form input features.
>>> The structure of the matrix is determined by a hyper-parameter. This
>>> step turns out to be a bottleneck for the entire pipeline because I
>>> have to construct the matrix at runtime.
>>>
>>> During parameter search, I usually have other parameters to examine in
>>> addition to this "structure parameter". So if I can reuse the huge
>>> matrix when the "structure parameter" is unchanged, I can save tons of
>>> time. For this reason, I intentionally structured my code to cache and
>>> reuse these intermediate results.
>>>
>>> So my question is: can Spark's ML pipeline handle intermediate caching
>>> automatically? Or do I have to write code to do so manually? If so, is
>>> there any best practice to learn from?
>>>
>>> P.S. I have looked into the official documentation and some other
>>> material, but none of it seems to discuss this topic.
>>>
>>> Best,
>>> Lewis
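To make the reuse pattern from the thread concrete, here is a minimal plain-Scala sketch (no Spark dependency; all names are hypothetical stand-ins): memoize the expensive, structure-dependent step and loop the cheaper hyper-parameters inside it, so each distinct "structure parameter" is built only once across the whole grid.

```scala
object CacheSketch {
  /** Expensive, structure-dependent step (hypothetical stand-in for the
    * sparse moving-average matrix described in the thread). */
  def buildMatrix(structureParam: Int): Vector[Double] =
    Vector.tabulate(4)(i => (i * structureParam).toDouble)

  /** Cheap per-parameter evaluation (hypothetical stand-in for fitting
    * and scoring a model on the precomputed features). */
  def scoreModel(matrix: Vector[Double], reg: Double): Double =
    matrix.sum * reg

  /** Grid search that memoizes the expensive step keyed by the structure
    * parameter. Returns (number of expensive builds, score per grid point). */
  def gridSearch(structures: Seq[Int],
                 regs: Seq[Double]): (Int, Map[(Int, Double), Double]) = {
    var builds = 0
    val cache = scala.collection.mutable.Map.empty[Int, Vector[Double]]
    val scores = for {
      s <- structures
      r <- regs
    } yield {
      // getOrElseUpdate evaluates the expensive step only on a cache miss
      val matrix = cache.getOrElseUpdate(s, { builds += 1; buildMatrix(s) })
      ((s, r), scoreModel(matrix, r))
    }
    (builds, scores.toMap)
  }

  def main(args: Array[String]): Unit = {
    // 6 grid points, but only 2 distinct structures => 2 expensive builds
    val (builds, scores) = gridSearch(Seq(1, 2), Seq(0.1, 1.0, 10.0))
    println(s"expensive builds: $builds")
    scores.toSeq.sorted.foreach(println)
  }
}
```

The same shape carries over to Spark by replacing buildMatrix with the manually invoked stage transforms plus `.cache()` (or a Parquet write/read), as suggested above.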