Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62229656

Thanks for the replies. I'm summarizing the discussion so far and including some responses. I think we are getting close to a good API! IMHO it would be good to have the developer API updates as well and to test a couple more pipelines before we push this out.

Dataset

1. We all seem to agree that we need either a higher-level Dataset API or more flexible functions in SchemaRDD for new operations. For one, `appendCol` seems like it'll be very useful (first sketch below).
2. We also seem to agree that we will have an additional, simpler Transformer API from RDD to RDD (second sketch below). The one using Datasets will still be there for cases where the simpler API isn't enough. Also, I am not sure I fully understand the difference between the User API and the Developer API (aren't developers the main users of Pipelines?).

Pipelines

1. Batching, loops, etc.: @mengxr -- I'll try out the parallel pipeline + feature assembler and let you know how it goes. One other requirement we have for large pipelines is lazily evaluating each batch so it fits in memory and using boosting. I'll see if we can do something similar here and get back to you.
2. Constructors, Params: @mateiz -- I completely agree about the binary-compatibility problems and the trouble we had with default constructor args. However, I think some members are required parts of the class rather than optional parameters. For example, the regularization value is definitely a parameter with a default value of 0.0 whose value we want to tune. On the other hand, in say [HadamardProduct](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/RandomSignNode.scala#L19) or even [LinearModel](https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MultiClassLinearRegressionEstimator.scala#L81), the signs and weights are part of the object. You almost never want to replace these values, as doing so would create a completely different pipeline. So I think there are roles for both (third sketch below) -- and we need to be careful, especially when we are saving and loading pipelines.
3. Evaluators: I can see that these are useful for model selection inside estimators, but as @jkbradley said, we need to figure out a better way to chain them to a pipeline. FWIW my example was very simple: it just computed test error for a single model and did no model selection.
4. Parameter setters, passing, maps, etc.: We seem to have reached a nice design point on this! I agree that the implicit mapping was a bit tedious and `map(param)` is fine (fourth sketch below).
5. Parameter traits like HasInputCol: This is the one issue where we don't have great ideas so far, I guess. On the one hand, having too many traits seems wasteful; on the other hand, the amount of cruft code without them is also tedious. One idea I had was to try out annotations (like `@Param featureCol: String`) and auto-generate the setter/getter code (last sketch below). More knowledgeable Java/Scala people may know more. (@JoshRosen?)
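To make Dataset (1) concrete, here is a toy model of what I mean by `appendCol` -- the types and names are made up for illustration and are not SchemaRDD's actual API:

```scala
// Toy stand-in for a dataset: a sequence of column-name -> value rows.
case class ToyDataset(rows: Seq[Map[String, Any]]) {
  // appendCol derives one new column from the existing row, leaving the
  // rest of the schema untouched.
  def appendCol(outputCol: String)(f: Map[String, Any] => Any): ToyDataset =
    ToyDataset(rows.map(row => row + (outputCol -> f(row))))
}

object AppendColExample extends App {
  val data = ToyDataset(Seq(Map("text" -> "a b c"), Map("text" -> "d e")))
  // Append a numTokens column computed from the existing text column.
  val withCounts = data.appendCol("numTokens") { row =>
    row("text").asInstanceOf[String].split(" ").length
  }
  withCounts.rows.foreach(println)
}
```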
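And for Dataset (2), a sketch of the simpler RDD-to-RDD Transformer; the trait name and shape here are just my assumptions, not the PR's actual design:

```scala
import org.apache.spark.rdd.RDD

// Assumed shape for the simpler API: no schemas, just a typed RDD in, RDD out.
trait SimpleTransformer[I, O] extends Serializable {
  def transform(input: RDD[I]): RDD[O]
}

// Example: a tokenizer that needs none of the Dataset machinery.
class SimpleTokenizer extends SimpleTransformer[String, Seq[String]] {
  override def transform(input: RDD[String]): RDD[Seq[String]] =
    input.map(_.toLowerCase.split("\\s+").toSeq)
}
```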
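Here is what I mean in Pipelines (2) about required members vs. tunable parameters, with illustrative names rather than the PR's classes: the regularization value gets a default and a setter, while the weights are constructor state that callers never "set" afterwards.

```scala
// weights are part of the model's identity: immutable constructor state,
// not a settable parameter. Replacing them would give a different pipeline.
class LinearModelSketch(val weights: Array[Double]) {
  def predict(features: Array[Double]): Double =
    weights.zip(features).map { case (w, x) => w * x }.sum
}

// regParam is a genuine parameter: default 0.0, tuned via a setter.
class RidgeEstimatorSketch {
  private var regParam: Double = 0.0
  def setRegParam(value: Double): this.type = { regParam = value; this }

  def fit(data: Seq[(Array[Double], Double)]): LinearModelSketch = {
    val dim = data.head._1.length
    // Placeholder fit; a real implementation would use regParam here.
    new LinearModelSketch(Array.fill(dim)(0.0))
  }
}
```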
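On Pipelines (4), the explicit `map(param)` lookup style as I understand it, with toy Param/ParamMap stand-ins for the PR's classes:

```scala
// Toy stand-ins for the PR's Param / ParamMap classes.
case class Param[T](name: String, default: T)

class ParamMap(values: Map[Param[_], Any] = Map.empty) {
  def put[T](param: Param[T], value: T): ParamMap =
    new ParamMap(values + (param -> value))
  // Explicit lookup -- map(param) -- falling back to the declared default.
  def apply[T](param: Param[T]): T =
    values.getOrElse(param, param.default).asInstanceOf[T]
}

object ParamMapExample extends App {
  val regParam = Param("regParam", 0.0)
  val map = new ParamMap().put(regParam, 0.1)
  println(map(regParam))  // 0.1, looked up explicitly rather than implicitly
}
```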
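And for Pipelines (5), the trait approach as I understand it: one small mixin per shared parameter, so each concrete class gets the setters/getters for free. The annotation idea would replace all of this with generated code, but would need a macro or a codegen step.

```scala
// One small mixin per shared parameter.
trait HasInputCol {
  private var inputCol: String = "input"
  def getInputCol: String = inputCol
  def setInputCol(value: String): this.type = { inputCol = value; this }
}

trait HasOutputCol {
  private var outputCol: String = "output"
  def getOutputCol: String = outputCol
  def setOutputCol(value: String): this.type = { outputCol = value; this }
}

// A transformer mixes in exactly the parameters it has -- no per-class cruft.
class TokenizerSketch extends HasInputCol with HasOutputCol
```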