Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62229656
  
    Thanks for the replies. I'm summarizing the discussion so far and adding some responses of my own. I think we are getting close to a good API! IMHO it would be good to have the developer API updates in as well and to test a couple more pipelines before we push this out.
    
    Dataset
    1. We all seem to agree that there is a need for a higher-level Dataset API or more flexible functions in SchemaRDD for new operations. `appendCol`, for one, seems like it will be very useful.
    2. We also seem to have agreement that there will be an additional, simpler Transformer API from RDD to RDD; the one using Datasets will still be there for cases where the simpler API isn't enough (a rough sketch of what I mean is below).
    Also, I am not sure I fully understand the difference between the User API and the Developer API (aren't developers the main users of Pipelines?).
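
    To make (2) concrete, here is a minimal sketch of the kind of RDD-to-RDD Transformer I have in mind; the class names (`SimpleTransformer`, `SimpleTokenizer`) are made up for illustration and are not part of this PR:

    ```scala
    import org.apache.spark.rdd.RDD

    // A transformer that maps a typed RDD directly to another typed RDD,
    // without going through SchemaRDD columns.
    abstract class SimpleTransformer[In, Out] extends Serializable {
      def transform(input: RDD[In]): RDD[Out]
    }

    // Example: tokenize raw text lines into lowercase token sequences.
    class SimpleTokenizer extends SimpleTransformer[String, Seq[String]] {
      override def transform(input: RDD[String]): RDD[Seq[String]] =
        input.map(_.toLowerCase.split("\\s+").toSeq)
    }
    ```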
    
    Pipelines
    1. Batching, loops, etc.: @mengxr -- I'll try out the parallel pipelines + feature assembler approach and let you know how it goes (a rough sketch of my understanding is below). One other requirement we have for large pipelines is lazily evaluating each batch so that it fits in memory, and using boosting. I'll try to see if we can do something similar here and get back to you.
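
    For reference, this is roughly how I read the parallel pipelines + feature assembler suggestion; all of the names here (`FeatureBranch`, `FeatureAssembler`) are hypothetical and only meant to pin down the shape:

    ```scala
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // One feature-extraction branch over the raw input.
    trait FeatureBranch extends Serializable {
      def transform(input: RDD[String]): RDD[Vector]
    }

    // Runs several branches over the same input and concatenates their outputs
    // row by row (assumes each branch is a row-wise map, so zip lines up).
    // For very large pipelines we would want to evaluate branches lazily,
    // one batch at a time, rather than materializing everything at once.
    class FeatureAssembler(branches: Seq[FeatureBranch]) extends Serializable {
      def transform(input: RDD[String]): RDD[Vector] = {
        branches.map(_.transform(input)).reduce { (a, b) =>
          a.zip(b).map { case (x, y) => Vectors.dense(x.toArray ++ y.toArray) }
        }
      }
    }
    ```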
    
    2. Constructors, params: @mateiz -- Completely agree on the binary compatibility problems and the trouble we had with default constructor args. However, I think some members are a required part of the class rather than optional parameters. For example, the regularization value is definitely a parameter with a default value of 0.0 whose value we want to tune. On the other hand, in say [HadamardProduct](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/RandomSignNode.scala#L19) or even [LinearModel](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MultiClassLinearRegressionEstimator.scala#L81), the signs and weights are part of the object. You almost never want to replace these values, as doing so would create a completely different pipeline. So I think there are roles for both (a small sketch of the distinction is below), and we need to be especially careful when saving and loading pipelines, etc.
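
    Here is a minimal sketch of that distinction; the class names are only illustrative (not the PR's API, and not exactly what my example repo does):

    ```scala
    import org.apache.spark.mllib.linalg.Vector

    // A tunable knob with a sensible default: it belongs in the parameter
    // machinery, since we want to search over it during tuning.
    class LinearRegressionEstimator {
      private var regParam: Double = 0.0
      def setRegParam(value: Double): this.type = { regParam = value; this }
      def getRegParam: Double = regParam
    }

    // Fitted state: the weights and intercept define what this model *is*.
    // Swapping them out yields a different model, so they are constructor
    // members rather than settable params.
    class LinearModel(val weights: Vector, val intercept: Double)
    ```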
    
    3. Evaluators: I can see that these are useful for model selection inside estimators, but as @jkbradley said, we need to figure out a better way to chain them onto a pipeline. FWIW, my example was very simple: it just computed test error for a single model and did no model selection (see the snippet below).
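
    Concretely, the only thing my use case needs is something like this standalone helper (a sketch, not tied to the PR's Evaluator interface):

    ```scala
    import org.apache.spark.rdd.RDD

    // Given a model's (prediction, label) pairs on a held-out set,
    // report the misclassification rate -- no model selection involved.
    def testError(predictionsAndLabels: RDD[(Double, Double)]): Double = {
      val total = predictionsAndLabels.count().toDouble
      val wrong = predictionsAndLabels.filter { case (pred, label) => pred != label }.count()
      wrong / total
    }
    ```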
    
    4. Parameter setters, passing, maps, etc. -- We seem to have reached a nice design point on this! I agree that the implicit mapping was a bit tedious, and `map(param)` is fine.
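
    Just to spell out what I mean by `map(param)`, here is a toy version of the explicit lookup (a sketch, not the PR's actual ParamMap):

    ```scala
    // Typed parameter key.
    class Param[T](val name: String)

    // Explicit, typed lookup via apply, i.e. map(param).
    class ParamMap(values: Map[Param[_], Any]) {
      def apply[T](param: Param[T]): T = values(param).asInstanceOf[T]
    }

    object ParamMapExample {
      val regParam = new Param[Double]("regParam")
      val paramMap = new ParamMap(Map[Param[_], Any](regParam -> 0.1))
      val reg: Double = paramMap(regParam)  // explicit rather than implicit
    }
    ```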
    
    5. Parameter traits like HasInputCol -- This is the one issue where we don't have great ideas so far, I guess. On the one hand, having too many traits seems wasteful; on the other hand, the amount of cruft code without them is also tedious. One idea I had was to try out annotations (like `@Param featureCol: String`) and auto-generate the setter/getter code (a sketch of both options is below). More knowledgeable Java/Scala people may know more. (@JoshRosen?)
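
    A sketch of the two options being weighed; both the trait and the annotation here are hypothetical, and the annotation approach would need a macro or codegen step to actually expand into setters/getters:

    ```scala
    import scala.annotation.StaticAnnotation

    // Option (a): one small trait per shared parameter, mixed into each stage.
    trait HasInputCol {
      private var inputCol: String = "input"
      def getInputCol: String = inputCol
      def setInputCol(value: String): this.type = { inputCol = value; this }
    }

    // Option (b): a marker annotation; a macro or compiler plugin would
    // generate the getter/setter and ParamMap plumbing from it.
    class Param extends StaticAnnotation

    class Tokenizer extends HasInputCol {
      @Param val featureCol: String = "features"
    }
    ```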

