Re: ML Pipeline question about caching

Peter Rudenko Tue, 17 Mar 2015 16:00:13 -0700

Hi Cesar,

I had a similar issue. Yes, for now it's better to do A, B, C outside the cross validator. Take a look at my comment <https://issues.apache.org/jira/browse/SPARK-4766?focusedCommentId=14320038&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14320038> and this JIRA <https://issues.apache.org/jira/browse/SPARK-5844>. The problem is that transformers could also have hyperparameters in the future (like a word2vec transformer). Then the cross validator would need to find the best parameters for both the transformer and the estimator. That would blow up the number of combinations (number of parameters for the transformer x number of parameters for the estimator x number of folds).
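
As a rough illustration of how the number of combinations grows, here is a small plain-Python sketch (no Spark required). The grids and their values are hypothetical and chosen only for the arithmetic. It also assumes that the cross validator fits the whole pipeline once per parameter combination per fold and that nothing is cached between fits, so A, B, C would be re-run on every fit if they stay inside the pipeline.

```python
def grid_size(*param_grids):
    """Count parameter combinations across all grids (Cartesian product)."""
    combos = 1
    for grid in param_grids:
        for values in grid.values():
            combos *= len(values)
    return combos


# Hypothetical grids, for illustration only.
estimator_grid = {"regParam": [0.01, 0.1, 1.0], "maxIter": [10, 50]}
transformer_grid = {"vectorSize": [50, 100, 200]}
num_folds = 3

# Only the estimator D is tuned: 3 * 2 combinations, times 3 folds.
fits_estimator_only = grid_size(estimator_grid) * num_folds

# A transformer is tuned as well: the grids multiply.
fits_with_transformer = grid_size(estimator_grid, transformer_grid) * num_folds

# Assumption: with A, B, C inside the pipeline and no caching, feature
# extraction runs once per fit; done outside, it runs once in total.
feature_runs_inside = fits_estimator_only
feature_runs_outside = 1

print(fits_estimator_only, fits_with_transformer)
print(feature_runs_inside, feature_runs_outside)
```

Under these assumptions, moving A, B, C outside the cross validator replaces one feature-extraction pass per fit with a single pass whose output can be persisted and reused.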


Thanks,
Peter Rudenko

On 2015-03-18 00:26, Cesar Flores wrote:

Hello all:
I am using the ML Pipeline, which I consider very powerful. I have the following use case:
  * I have three transformers, which I will call A,B,C, that basically
    extract features from text files, with no parameters.
  * I have a final stage D, which is the logistic regression estimator.
  * I am creating a pipeline with the sequence A,B,C,D.
  * Finally, I am using this pipeline as the estimator parameter of the
    CrossValidator class.
I have some concerns about how data persistence inside the cross validator works. For example, if only D has multiple parameters to tune using the cross validator, my concern is that the transformation A->B->C is being performed multiple times. Is that the case, or is Spark smart enough to realize that it is possible to persist the output of C? Would it be better to leave A, B, and C outside the cross validator pipeline?
Thanks a lot
--
Cesar Flores
