[ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744508#comment-14744508 ]
Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

Apologies for being AWOL!  I'd definitely appreciate help with designing this improvement.

For API (Vector vs. Map): I prefer sticking with a Vector API.  I see the appeal of keeping columns separate, but DataFrames are not yet meant to handle too many columns (hundreds at most, I'd say).  We can still keep feature names and metadata using ML attributes (which describe each feature in Vector columns in DataFrames).

For sharing code, we should definitely do option 2.  For backwards compatibility, we should not modify current Params, but we could add a new one for multiple inputs (and check for conflicting settings when running).  I would hope we could share code in this multi-value transformation so that each transformer only needs to specify how to transform a single value.  I hope we can do this, rather than implementing option 1 as the default.

Would you mind sketching up a quick design doc?  That should help clarify the different options and help us choose a simple but flexible API.  If you'd like to follow existing examples, here are some ones you could look at:
* Classification threshold (shorter doc): [https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
* R-like stats for model (long doc): [https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]

These items we've discussed can be sketched out in the doc.  After you link it from this JIRA, others can give you feedback on this JIRA (better than on the doc since some people have trouble viewing Google docs).

Thanks very much!

> Add single- and multi-value support to ML Transformers
> ------------------------------------------------------
>
>                 Key: SPARK-8418
>                 URL: https://issues.apache.org/jira/browse/SPARK-8418
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
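
The code-sharing idea in the comment above (each transformer specifies only how to transform a single value, a new Param covers multiple inputs, and conflicting settings are checked when running) could be sketched roughly as follows. This is plain Python, not Spark code: the class names, the `input_col` / `input_cols` parameters, and the dict-based rows are all hypothetical stand-ins chosen for illustration, and a Python list stands in for a Vector column.

```python
from typing import Dict, List, Optional, Sequence, Union

# A column value is either one number or several (a stand-in for a Vector).
Value = Union[float, Sequence[float]]


class SingleValueTransformer:
    """Hypothetical base class: subclasses only implement transform_one."""

    def __init__(self, input_col: Optional[str] = None,
                 input_cols: Optional[Sequence[str]] = None) -> None:
        # input_col mimics an existing single-input setting that is left as-is;
        # input_cols mimics a newly added setting for multiple inputs.
        self.input_col = input_col
        self.input_cols = list(input_cols) if input_cols is not None else None

    def _selected_columns(self) -> List[str]:
        # Conflicts are detected at run time, not when the settings are made.
        if self.input_col is not None and self.input_cols is not None:
            raise ValueError("set either input_col or input_cols, not both")
        if self.input_cols is not None:
            return self.input_cols
        if self.input_col is not None:
            return [self.input_col]
        raise ValueError("no input column configured")

    def transform_one(self, x: float) -> float:
        raise NotImplementedError

    def _transform_value(self, value: Value) -> Value:
        # Shared lifting logic: apply transform_one to a scalar or elementwise.
        if isinstance(value, (int, float)):
            return self.transform_one(float(value))
        return [self.transform_one(float(v)) for v in value]

    def transform(self, row: Dict[str, Value]) -> Dict[str, Value]:
        # Returns only the transformed columns, keyed by input column name.
        return {c: self._transform_value(row[c]) for c in self._selected_columns()}


class ToyBinarizer(SingleValueTransformer):
    """Toy example: values above the threshold map to 1.0, others to 0.0."""

    def __init__(self, threshold: float = 0.0, **kwargs) -> None:
        super().__init__(**kwargs)
        self.threshold = threshold

    def transform_one(self, x: float) -> float:
        return 1.0 if x > self.threshold else 0.0
```

With this shape, `ToyBinarizer(threshold=0.5, input_col="x").transform({"x": [0.1, 0.9]})` handles a multi-value column without the subclass containing any multi-value logic, and setting both `input_col` and `input_cols` fails only when `transform` is called. How output columns are named and how schemas are validated are left open here, since those are questions the design doc is meant to answer.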