[jira] [Commented] (SPARK-8418) Add single- and multi-value support to ML Transformers

Joseph K. Bradley (JIRA) Mon, 14 Sep 2015 16:42:12 -0700

    [ https://issues.apache.org/jira/browse/SPARK-8418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744508#comment-14744508 ]


Joseph K. Bradley commented on SPARK-8418:
------------------------------------------

Apologies for being AWOL!  I'd definitely appreciate help with designing this 
improvement.

For API (Vector vs. Map): I prefer sticking with a Vector API.  I see the 
appeal of keeping columns separate, but DataFrames are not yet meant to handle 
too many columns (hundreds at most, I'd say).  We can still keep feature names 
and metadata using ML attributes (which describe each feature in Vector columns 
in DataFrames).
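
As a rough illustration of the ML attributes mentioned above, here is a minimal sketch of attaching per-feature names to a single Vector column through {{AttributeGroup}} from {{org.apache.spark.ml.attribute}}. The column name "features" and the feature names "age" and "income" are made up for the example, and it assumes spark-mllib is on the classpath (e.g. in spark-shell).

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}

// Hypothetical feature names; each attribute describes one slot of the Vector.
val attrs: Array[Attribute] = Array(
  NumericAttribute.defaultAttr.withName("age"),
  NumericAttribute.defaultAttr.withName("income"))

// Group the attributes under the name of a single Vector column.
val group = new AttributeGroup("features", attrs)

// The group serializes into the metadata of a StructField for that column...
val field = group.toStructField()

// ...and can be read back from the schema later, so names survive
// even though the values live in one Vector column.
val recovered = AttributeGroup.fromStructField(field)
val names = recovered.attributes.get.map(_.name.get).toSeq
```

The point of the sketch is only that names and per-feature metadata travel with the schema, so separate columns are not needed to keep them.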

For sharing code, we should definitely do option 2.  For backwards 
compatibility, we should not modify current Params, but we could add a new one 
for multiple inputs (and check for conflicting settings when running).  I would 
hope we could share code in this multi-value transformation so that each 
transformer only needs to specify how to transform a single value.  I hope we 
can do this, rather than implementing option 1 as the default.
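
To make the code-sharing idea concrete, here is a minimal plain-Scala sketch, not Spark's actual API. All names ({{SingleValueLogic}}, {{InputSettings}}, {{inputCols}}, {{Binarize}}) are hypothetical, and the design doc may well settle on something different. It shows a transformer that defines only the single-value case, a shared multi-value path built on top of it, and a run-time check for conflicting input settings.

```scala
// Hypothetical names throughout; this is a sketch, not Spark's API.
trait SingleValueLogic {
  /** The only method a concrete transformer has to implement. */
  def transformValue(x: Double): Double

  /** Shared multi-value path: apply the single-value logic element-wise. */
  final def transformValues(xs: Array[Double]): Array[Double] =
    xs.map(transformValue)
}

/** The existing single-input setting is left untouched; a new optional
  * multi-input setting is added beside it. */
final case class InputSettings(
    inputCol: Option[String] = None,
    inputCols: Option[Seq[String]] = None) {

  /** Resolve the columns at run time, rejecting conflicting or missing settings. */
  def resolve: Seq[String] = (inputCol, inputCols) match {
    case (Some(_), Some(_)) =>
      throw new IllegalArgumentException("Set only one of inputCol and inputCols")
    case (Some(c), None) => Seq(c)
    case (None, Some(cs)) => cs
    case (None, None) =>
      throw new IllegalArgumentException("Set inputCol or inputCols")
  }
}

/** Example transformer: only the single-value rule is written here. */
object Binarize extends SingleValueLogic {
  val threshold = 0.5
  def transformValue(x: Double): Double = if (x > threshold) 1.0 else 0.0
}
```

With this shape, {{Binarize.transformValues(Array(0.2, 0.7))}} reuses the single-value rule for every element, and setting both inputs fails when {{resolve}} is called rather than when the settings are made.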

Would you mind sketching out a quick design doc?  That should help clarify the 
different options and help us choose a simple but flexible API.  If you'd like 
to follow existing examples, here are a couple you could look at:
* Classification threshold (shorter doc): 
[https://docs.google.com/document/d/1nV6m7sqViHkEpawelq1S5_QLWWAouSlv81eiEEjKuJY/edit?usp=sharing]
* R-like stats for model (long doc): 
[https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing]

The items we've discussed above (the API choice and the code-sharing approach) 
can be sketched out in the doc.

After you link it from this JIRA, others can give you feedback on this JIRA 
(better than on the doc since some people have trouble viewing Google docs).

Thanks very much!

> Add single- and multi-value support to ML Transformers
> ------------------------------------------------------
>
>                 Key: SPARK-8418
>                 URL: https://issues.apache.org/jira/browse/SPARK-8418
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> It would be convenient if all feature transformers supported transforming 
> columns of single values and multiple values, specifically:
> * one column with one value (e.g., type {{Double}})
> * one column with multiple values (e.g., {{Array[Double]}} or {{Vector}})
> We could go as far as supporting multiple columns, but that may not be 
> necessary since VectorAssembler could be used to handle that.
> Estimators under {{ml.feature}} should also support this.
> This will likely require a short design doc to describe:
> * how input and output columns will be specified
> * schema validation
> * code sharing to reduce duplication



