Re: Caching intermediate results in Spark ML pipeline?

2015-09-18 Thread Jingchu Liu
Thanks buddy, I'll try it out in my project.

Best,
Lewis

2015-09-16 13:29 GMT+08:00 Feynman Liang:
> If you're doing hyperparameter grid search, consider using
> ml.tuning.CrossValidator, which does cache the dataset.

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
If you're doing hyperparameter grid search, consider using ml.tuning.CrossValidator, which does cache the dataset. Otherwise, perhaps you can elaborate more on your particular …
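A minimal sketch of this suggestion, assuming the Spark 1.5-era spark.ml API and a DataFrame `training` with "text" and "label" columns; the stages and grid values are illustrative, not from the thread:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression()
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Grid of hyperparameters to search over.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(1000, 10000))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    // CrossValidator caches each fold's training/validation split,
    // so the data is not recomputed for every parameter combination.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel = cv.fit(training)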

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Jingchu Liu
Yeah, I understand that at the low level we should do as you said. But since ML pipeline is a high-level API, it is pretty natural to expect the ability to recognize overlapping parameters between successive runs. (Actually, this happens A LOT when we have lots of hyper-params to search over.) I can also …

Re: Caching intermediate results in Spark ML pipeline?

2015-09-15 Thread Feynman Liang
Nope, and that's intentional. There is no guarantee that rawData did not change between intermediate calls to searchRun, so reusing a cached data1 would be incorrect. If you want data1 to be cached between multiple runs, you have a few options:

* cache it first and pass it in as an argument to … (see the sketch below)
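A minimal sketch of the "cache first, pass it in" option; rawData, data1, and searchRun are names from the thread, while their types and bodies are assumptions for illustration:

    import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
    import org.apache.spark.sql.DataFrame

    // Materialize the transformed data once, outside the search loop.
    val data1: DataFrame = hashingTF.transform(rawData)
    data1.cache()

    // Each run receives the already-cached DataFrame explicitly,
    // so Spark reuses the in-memory copy instead of recomputing it.
    def searchRun(data1: DataFrame, regParam: Double): LogisticRegressionModel =
      new LogisticRegression().setRegParam(regParam).fit(data1)

    val models = Seq(0.01, 0.1, 1.0).map(searchRun(data1, _))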

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
Lewis, Many pipeline stages implement save/load methods, which can be used if you instantiate and call the underlying pipeline stages' `transform` methods individually (instead of using the Pipeline.setStages API). See associated JIRAs. Pipeline …
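A minimal sketch of running the stages individually rather than through Pipeline.setStages, so each intermediate DataFrame can be cached and reused; the column names are illustrative assumptions:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

    // Apply each stage's transform yourself instead of pipeline.fit(...).
    val tokenized = tokenizer.transform(rawData)
    val hashed = hashingTF.transform(tokenized)
    hashed.cache() // intermediate result, reused across model fits

    // Fit as many models as needed against the cached intermediate.
    val model = new LogisticRegression().fit(hashed) // expects a "label" column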

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Feynman Liang
You can persist the transformed DataFrames, for example:

    val data: DataFrame = ...
    val hashedData = hashingTF.transform(data)
    hashedData.cache() // to cache DataFrame in memory

Future usages of hashedData now read from the in-memory cache. You can also persist to disk, e.g.: …
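The original message is truncated at the disk example; a minimal sketch of what disk persistence could look like, where the storage level and output path are assumptions:

    import org.apache.spark.storage.StorageLevel

    hashedData.persist(StorageLevel.DISK_ONLY) // spill to disk instead of memory

    // Alternatively, write the DataFrame out and read it back in later jobs:
    hashedData.write.parquet("/tmp/hashedData.parquet") // hypothetical path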

Re: Caching intermediate results in Spark ML pipeline?

2015-09-14 Thread Jingchu Liu
Hey Feynman, Thanks for your response, but I'm afraid "model save/load" is not exactly the feature I'm looking for. What I need to cache and reuse are the intermediate outputs of the transformations, not the transformers themselves. Do you know of any related development activities or plans?

Best,
Lewis