Github user leifker commented on the issue:

    https://github.com/apache/spark/pull/17306
  
    Looking over the contributing link, it seems I should open a JIRA issue?
    
    The intent is, as you said, to run the CrossValidator with different 
pipelines.
    
    The same could be done using an external iterative approach: build 
different pipelines, pass each one to its own CrossValidator, take the best 
model from each of those CrossValidators, and finally pick the best of 
those. This is the initial approach I explored. It resulted in a lot of 
boilerplate code that felt like it shouldn't need to exist if the API simply 
allowed for arrays of estimators and their parameters.
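
    To make the boilerplate concrete, here is a minimal plain-Python sketch of that external loop. It does not use Spark: `make_folds`, `cross_validate`, `fit_mean` and `fit_line` are hypothetical stand-ins for a CrossValidator run and for two pipelines, and the toy data is made up for illustration.

```python
import random
from statistics import mean

def make_folds(rows, k, seed):
    """Shuffle the rows and split them into k folds of similar size."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(fit, rows, k, seed):
    """Stand-in for one CrossValidator run: builds its OWN folds each call."""
    folds = make_folds(rows, k, seed)
    errors = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        predict = fit(train)
        errors.append(mean((predict(x) - y) ** 2 for x, y in held_out))
    return mean(errors)  # mean squared error averaged over the k folds

# Two hypothetical "pipelines": each fits on rows and returns a predictor.
def fit_mean(train):
    avg = mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    # Least-squares line through the origin.
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: slope * x

data = [(float(x), 2.0 * x) for x in range(1, 21)]  # toy data, y = 2x
candidates = {"mean": fit_mean, "line": fit_line}

# The external loop: one cross-validation run per pipeline, each with its
# own freshly built (and then discarded) folds, followed by a manual
# comparison of the per-pipeline results.
scores = {}
for run, (name, fit) in enumerate(candidates.items()):
    scores[name] = cross_validate(fit, data, k=4, seed=run)
best = min(scores, key=scores.get)
print(best, scores)
```

    The loop and the final comparison are the parts that have to be rewritten by hand for every experiment in this approach.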
    
    A couple of advantages of this implementation come from keeping the 
functional interface to the CrossValidator.
    
    1. The caching of the folds is better utilized. An external iterative 
approach creates a new set of k folds for each CrossValidator fit, and the 
folds are discarded after each CrossValidator run. In this implementation a 
single set of k folds is created and cached for all of the pipelines.
    2. A potential advantage of this implementation is future 
parallelization of the pipelines within the CrossValidator. It is of course 
possible to handle the parallelization outside of the CrossValidator here 
too. However, I believe there is already work in progress to parallelize the 
grid parameters, and that could be extended to multiple pipelines.
    
    Both of those behind-the-scenes optimizations are possible because the 
CrossValidator is given the data and the complete set of 
pipelines/estimators to evaluate up front, which allows one to abstract away 
the implementation.
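
    As a rough illustration of the fold-sharing idea, the plain-Python sketch below builds the k folds once and evaluates every candidate on the same splits. It is not the code in this PR and uses no Spark API; `cross_validate_all` and the two toy "pipelines" are hypothetical names chosen for the example.

```python
import random
from statistics import mean

def make_folds(rows, k, seed=0):
    """Shuffle the rows and split them into k folds of similar size."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate_all(candidates, rows, k, seed=0):
    """Evaluate every candidate on ONE shared set of k train/validation splits."""
    folds = make_folds(rows, k, seed)  # built once, reused by all candidates
    splits = []
    for i, held_out in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        splits.append((train, held_out))

    scores = {}
    # Each candidate is independent of the others, so this outer loop is the
    # place where parallel evaluation could be added later.
    for name, fit in candidates.items():
        errors = []
        for train, held_out in splits:
            predict = fit(train)
            errors.append(mean((predict(x) - y) ** 2 for x, y in held_out))
        scores[name] = mean(errors)
    best = min(scores, key=scores.get)  # lowest mean squared error wins
    return best, scores

# Two hypothetical "pipelines": each fits on rows and returns a predictor.
def fit_mean(train):
    avg = mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    # Least-squares line through the origin.
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)
    return lambda x: slope * x

data = [(float(x), 2.0 * x) for x in range(1, 21)]  # toy data, y = 2x
best, scores = cross_validate_all({"mean": fit_mean, "line": fit_line}, data, k=4)
print(best, scores)
```

    Because the caller hands over the data and all candidates in one call, how the splits are cached or scheduled stays an internal detail of the function.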


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
