Re: Caching intermediate results in Spark ML pipeline?

2015-09-18 Thread Jingchu Liu
Thanks buddy, I'll try it out in my project.

Best,
Lewis


Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
If you're doing hyperparameter grid search, consider using
ml.tuning.CrossValidator, which does cache the dataset
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L85>.

Otherwise, perhaps you can elaborate more on your particular use case for
caching intermediate results; if the current API doesn't support it, we
can create a JIRA for it.
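
For reference, a rough sketch of that setup (the Tokenizer/HashingTF/LogisticRegression
stages, the column names, and the trainingData DataFrame below are illustrative
placeholders, not something from this thread):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid over a feature-extraction parameter and a model parameter.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(1 << 16, 1 << 18))
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// trainingData is assumed to be a DataFrame with "text" and "label" columns.
val cvModel = cv.fit(trainingData)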


Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Jingchu Liu
Yeah, I understand that at the low level we should do as you said.

But since the ML pipeline is a high-level API, it is pretty natural to expect
the ability to recognize overlapping parameters between successive runs.
(Actually, this happens A LOT when we have lots of hyper-params to search
over.)

I can also imagine an implementation that appends parameter information to
the cached results. Say we implemented an "equals" method for param1: by
comparing param1 with the previous run, the program would know data1 is
reusable, and the time spent generating data1 could be saved.
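
To illustrate, a rough sketch of doing that by hand today (HashingTF and its
numFeatures stand in for whatever the real first stage and param1 are, and
rawData is assumed to be a DataFrame with a "words" column):

import scala.collection.mutable
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame

// Memoize the first transformation keyed by its parameter value, so repeated
// searchRun calls with the same param1 reuse the cached DataFrame. This is
// only safe if rawData itself is unchanged between runs.
val hashing1 = new HashingTF().setInputCol("words").setOutputCol("feat1")
val data1Cache = mutable.Map.empty[Int, DataFrame]

def getData1(rawData: DataFrame, numFeatures: Int): DataFrame =
  data1Cache.getOrElseUpdate(numFeatures, {
    val d = hashing1.setNumFeatures(numFeatures).transform(rawData)
    d.cache() // materialized lazily on the first action
    d
  })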

Best,
Lewis


Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
Nope, and that's intentional. There is no guarantee that rawData did not
change between intermediate calls to searchRun, so reusing a cached data1
would be incorrect.

If you want data1 to be cached between multiple runs, you have a few
options:
* cache it first and pass it in as an argument to searchRun (see the sketch below)
* use a creational pattern like singleton to ensure only one instantiation
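
For the first option, a rough sketch mirroring the pseudocode quoted below
(the HashingTF stages, the "words" column, and the parameter values are
placeholders; rawData is assumed to already exist):

import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.DataFrame

// Stand-ins for the two transformation stages in the pseudocode.
val hashing1 = new HashingTF().setInputCol("words").setOutputCol("feat1")
val hashing2 = new HashingTF().setInputCol("words").setOutputCol("feat2")

// Build and cache data1 once, outside the search, then pass it in so that
// only param2 varies between runs.
val data1 = hashing1.setNumFeatures(1 << 18).transform(rawData) // param1 fixed here
data1.cache()

def searchRun(data1: DataFrame, param2: Int): Unit = {
  val data2 = hashing2.setNumFeatures(param2).transform(data1)
  data2.count() // any action; data1 is now served from the in-memory cache
}

Seq(1 << 10, 1 << 12).foreach(p2 => searchRun(data1, p2)) // the param2 grid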

On Tue, Sep 15, 2015 at 12:49 AM, Jingchu Liu <liujing...@gmail.com> wrote:

> Hey Feynman,
>
> I doubt DF persistence will work in my case. Let's use the following code:
> ==
> def searchRun( params = [param1, param2] )
>   data1 = hashing1.transform(rawData, param1)
>   data1.cache()
>   data2 = hashing2.transform(data1, param2)
>   data2.someAction()
> ==
> Say we run "searchRun()" twice with the same "param1" but different
> "param2". Will Spark recognize that the two local variables "data1" in
> consecutive runs have the same content?
>
>
> Best,
> Lewis

Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hi all,

I have a question regarding the ability of ML pipeline to cache
intermediate results. I've posted this question on stackoverflow
<http://stackoverflow.com/questions/32561687/caching-intermediate-results-in-spark-ml-pipeline>
but got no answer, hope someone here can help me out.

===
Lately I'm planning to migrate my standalone Python ML code to Spark. The
ML pipeline in spark.ml turns out quite handy, with a streamlined API for
chaining up algorithm stages and hyper-parameter grid search.

Still, I found its support for one important feature obscure in the existing
documentation: caching of intermediate results. The importance of this feature
arises when the pipeline involves computation-intensive stages.

For example, in my case I use a huge sparse matrix to perform multiple
moving averages on time-series data in order to form input features. The
structure of the matrix is determined by some hyper-parameter. This step
turns out to be a bottleneck for the entire pipeline because I have to
construct the matrix at runtime.

During parameter search, I usually have other parameters to examine in
addition to this "structure parameter". So if I can reuse the huge matrix
when the "structure parameter" is unchanged, I can save tons of time. For
this reason, I intentionally structured my code to cache and reuse these
intermediate results.

So my question is: can Spark's ML pipeline handle intermediate caching
automatically? Or do I have to write the caching code manually? If so, is
there any best practice to learn from?

P.S. I have looked into the official documentation and some other material,
but none of them seems to discuss this topic.



Best,
Lewis


Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis,

Many pipeline stages implement save/load methods, which can be used if you
instantiate and call the underlying pipeline stages' `transform` methods
individually (instead of using the Pipeline.setStages API). See the associated
JIRAs <https://issues.apache.org/jira/browse/SPARK-4587>.
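
For instance, a minimal sketch of stage-level save/load, assuming a Spark
release in which the stage implements the ML persistence API (still being
rolled out at the time of this thread); the path and input DataFrame are
placeholders:

import org.apache.spark.ml.feature.{IDF, IDFModel}

// hashedData is assumed to be a DataFrame with a "rawFeatures" vector column.
val idfModel: IDFModel = new IDF()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .fit(hashedData)

idfModel.write.overwrite().save("/tmp/idfModel") // persist the fitted transformer
val restored = IDFModel.load("/tmp/idfModel")    // reload it in a later session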

Pipeline persistence is on the 1.6 roadmap, JIRA here
<https://issues.apache.org/jira/browse/SPARK-6725>.

Feynman


Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
You can persist the transformed DataFrames, for example:

val data: DataFrame = ...
val hashedData = hashingTF.transform(data)
hashedData.cache() // cache the DataFrame in memory

Future usage of hashedData will read from the in-memory cache.

You can also persist to disk, e.g.:

hashedData.write.parquet(FilePath) // write the DataFrame to disk in Parquet format
...
val savedHashedData = sqlContext.read.parquet(FilePath)

Future uses of savedHashedData will then read from the Parquet files on disk.

As in my earlier response, this will still require you to call each
PipelineStage's `transform` method (i.e. NOT use the overall
Pipeline.setStages API).


Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hey Feynman,

Thanks for your response, but I'm afraid "model save/load" is not exactly
the feature I'm looking for.

What I need to cache and reuse are the intermediate outputs of the
transformations, not the transformers themselves. Do you know of any related
dev activities or plans?

Best,
Lewis
