Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62229656

Thanks for the replies. I'm summarizing the discussion so far and including some responses. I think we are getting close to a good API! IMHO it would be good to have the developer API updates as well and to test a couple more pipelines before we push this out.

Dataset

1. We all seem to agree that we need either a higher-level Dataset API or more flexible functions in SchemaRDD for new operations. For one, `appendCol` seems like it'll be very useful (first sketch below).
2. We also seem to agree that we will have an additional, simpler Transformer API from RDD to RDD (second sketch below). The one using Datasets will still be there for cases where the simpler API isn't enough. Also, I am not sure I fully understand the difference between the User API and the Developer API (aren't developers the main users of Pipelines?).

Pipelines

1. Batching, loops, etc.: @mengxr -- I'll try out the parallel pipeline + feature assembler and let you know how it goes. One other requirement we have for large pipelines is lazily evaluating each batch so it fits in memory and using boosting. I'll see if we can do something similar here and get back to you.
2. Constructors, Params: @mateiz -- I completely agree about the binary-compatibility problems and the trouble we had with default constructor args. However, I think some members are required parts of the class rather than optional parameters. For example, the regularization value is definitely a parameter with a default value of 0.0 whose value we want to tune. On the other hand, in say [HadamardProduct](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/RandomSignNode.scala#L19) or even [LinearModel](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MultiClassLinearRegressionEstimator.scala#L81), the signs and weights are part of the object. You almost never want to replace these values, as doing so would create a completely different pipeline. So I think there are roles for both (third sketch below) -- and we need to be careful, especially when we are saving and loading pipelines.
3. Evaluators: I can see that these are useful for model selection inside estimators, but as @jkbradley said, we need to figure out a better way to chain them to a pipeline. FWIW my example was very simple: it just computed test error for a single model and did no model selection.
4. Parameter setters, passing, maps, etc.: We seem to have reached a nice design point on this! I agree that the implicit mapping was a bit tedious and `map(param)` is fine (fourth sketch below).
5. Parameter traits like HasInputCol: This is the one issue where we don't have great ideas so far, I guess. On the one hand, having too many traits seems wasteful; on the other hand, the amount of cruft code without them is also tedious. One idea I had was to try out annotations (like `@Param featureCol: String`) and auto-generate the setter/getter code (last sketch below). More knowledgeable Java/Scala people may know more. (@JoshRosen?)
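To make Dataset (1) concrete, here is a toy model of what I mean by `appendCol` -- the types and names are made up for illustration and are not SchemaRDD's actual API:

```scala
// Toy stand-in for a dataset: a sequence of column-name -> value rows.
case class ToyDataset(rows: Seq[Map[String, Any]]) {
  // appendCol derives one new column from the existing row, leaving the
  // rest of the schema untouched.
  def appendCol(outputCol: String)(f: Map[String, Any] => Any): ToyDataset =
    ToyDataset(rows.map(row => row + (outputCol -> f(row))))
}

object AppendColExample extends App {
  val data = ToyDataset(Seq(Map("text" -> "a b c"), Map("text" -> "d e")))
  // Append a numTokens column computed from the existing text column.
  val withCounts = data.appendCol("numTokens") { row =>
    row("text").asInstanceOf[String].split(" ").length
  }
  withCounts.rows.foreach(println)
}
```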
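And for Dataset (2), a sketch of the simpler RDD-to-RDD Transformer; the trait name and shape here are just my assumptions, not the PR's actual design:

```scala
import org.apache.spark.rdd.RDD

// Assumed shape for the simpler API: no schemas, just a typed RDD in, RDD out.
trait SimpleTransformer[I, O] extends Serializable {
  def transform(input: RDD[I]): RDD[O]
}

// Example: a tokenizer that needs none of the Dataset machinery.
class SimpleTokenizer extends SimpleTransformer[String, Seq[String]] {
  override def transform(input: RDD[String]): RDD[Seq[String]] =
    input.map(_.toLowerCase.split("\\s+").toSeq)
}
```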
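Here is what I mean in Pipelines (2) about required members vs. tunable parameters, with illustrative names rather than the PR's classes: the regularization value gets a default and a setter, while the weights are constructor state that callers never "set" afterwards.

```scala
// weights are part of the model's identity: immutable constructor state,
// not a settable parameter. Replacing them would give a different pipeline.
class LinearModelSketch(val weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}

// regParam is a genuine parameter: default 0.0, tuned via a setter.
class RidgeEstimatorSketch {
  private var regParam: Double = 0.0
  def setRegParam(value: Double): this.type = { regParam = value; this }

  def fit(data: Seq[(Array[Double], Double)]): LinearModelSketch = {
    val dim = data.head._1.length
    // Placeholder fit; a real implementation would use regParam here.
    new LinearModelSketch(Array.fill(dim)(0.0))
  }
}
```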
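On Pipelines (4), the explicit `map(param)` lookup style as I understand it, with toy Param/ParamMap stand-ins for the PR's classes:

```scala
// Toy stand-ins for the PR's Param / ParamMap classes.
case class Param[T](name: String, default: T)

class ParamMap(values: Map[Param[_], Any] = Map.empty) {
  def put[T](param: Param[T], value: T): ParamMap =
    new ParamMap(values + (param -> value))
  // Explicit lookup -- map(param) -- falling back to the declared default.
  def apply[T](param: Param[T]): T =
    values.getOrElse(param, param.default).asInstanceOf[T]
}

object ParamMapExample extends App {
  val regParam = Param("regParam", 0.0)
  val map = new ParamMap().put(regParam, 0.1)
  println(map(regParam))  // 0.1, looked up explicitly rather than implicitly
}
```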
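And for Pipelines (5), the trait approach as I understand it: one small mixin per shared parameter, so each concrete class gets the setters/getters for free. The annotation idea would replace all of this with generated code, but would need a macro or a codegen step.

```scala
// One small mixin per shared parameter.
trait HasInputCol {
  private var inputCol: String = "input"
  def getInputCol: String = inputCol
  def setInputCol(value: String): this.type = { inputCol = value; this }
}

trait HasOutputCol {
  private var outputCol: String = "output"
  def getOutputCol: String = outputCol
  def setOutputCol(value: String): this.type = { outputCol = value; this }
}

// A transformer mixes in exactly the parameters it has -- no per-class cruft.
class TokenizerSketch extends HasInputCol with HasOutputCol
```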