[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5888:
---

Assignee: Apache Spark  (was: Sandy Ryza)

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Apache Spark

 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info as binary (0/1) values.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
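
 As a rough illustration of the naming question, here is a minimal sketch in plain Scala (no Spark dependency). The `FeatureNaming` object, the `featureNames` helper, and the `<inputCol>_<label>` scheme are all assumptions made for illustration; the issue leaves the actual default naming scheme open, and the labels are assumed to have already been read from the input column's metadata.

```scala
// Hypothetical helper for illustration only; not part of Spark's API.
object FeatureNaming {
  // Builds one output feature name per category label, using an assumed
  // "<inputCol>_<label>" scheme. The labels are taken to come from the
  // input column's metadata, in category-index order.
  def featureNames(inputCol: String, labels: Seq[String]): Seq[String] =
    labels.map(label => s"${inputCol}_$label")
}
```

 For example, `FeatureNaming.featureNames("country", Seq("US", "FR"))` yields one name per category, in the same order as the labels.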
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Alternatively, this could be an 
 option that is turned on by default.
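
 To make the linear-independence point concrete: if every category gets its own 0/1 column, the columns of each row always sum to 1, so they are collinear with an intercept term. Dropping one category removes that dependency. The following is a minimal plain-Scala sketch of that idea, not the proposed Transformer; the `OneHotSketch` object and `encode` method are hypothetical names, and breaking frequency ties by the lowest index is an assumption.

```scala
// Hypothetical sketch for illustration only; not Spark's implementation.
object OneHotSketch {
  // Encodes each category index as a 0/1 array with (numCategories - 1)
  // entries, omitting the column of the most frequent category.
  def encode(indices: Seq[Int], numCategories: Int): Seq[Array[Double]] = {
    require(indices.forall(i => i >= 0 && i < numCategories), "index out of range")
    // Count occurrences of each category index.
    val counts = indices.groupBy(identity).map { case (k, v) => (k, v.size) }
    // Most frequent category; ties go to the lowest index (an assumption).
    val dropped =
      if (counts.isEmpty) 0
      else counts.toSeq.sortBy { case (k, c) => (-c, k) }.head._1
    // Remaining categories, in index order, become the output columns.
    val kept = (0 until numCategories).filter(_ != dropped)
    indices.map(i => kept.map(k => if (k == i) 1.0 else 0.0).toArray)
  }
}
```

 With this scheme, a row belonging to the dropped category is encoded as all zeros, so it remains distinguishable from every other category.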



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5888:
---

Assignee: Sandy Ryza  (was: Apache Spark)

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza







[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza reassigned SPARK-5888:
-

Assignee: Sandy Ryza

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza



