[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-5888:
-----------------------------------

    Assignee: Apache Spark  (was: Sandy Ryza)

Add OneHotEncoder as a Transformer
----------------------------------

         Key: SPARK-5888
         URL: https://issues.apache.org/jira/browse/SPARK-5888
     Project: Spark
  Issue Type: Sub-task
  Components: ML
    Reporter: Xiangrui Meng
    Assignee: Apache Spark

`OneHotEncoder` takes a categorical column and outputs a vector column, which stores the category information as binary (0/1) values.

{code}
val ohe = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countries")
{code}

It should read the category information from the column metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time.

One category (the most frequent one) should be removed from the output to make the output columns linearly independent. Alternatively, this could be an option that is turned on by default.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
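
As an illustration of the encoding the issue description asks for, here is a minimal plain-Scala sketch that does not depend on Spark. The object `OneHotSketch`, the method `encode`, and the `dropped` parameter are hypothetical names chosen for this example; they are not part of the proposed `OneHotEncoder` API. Treating index 0 as the most frequent category is also an assumption that depends on how the input column was indexed.

```scala
// Hypothetical helper, not Spark's API: encodes a category index as a 0/1 vector.
object OneHotSketch {
  // index:         the category index of one row (0 until numCategories)
  // numCategories: total number of distinct categories
  // dropped:       optional reference category to omit from the output, so that
  //                the remaining columns are linearly independent
  def encode(index: Int, numCategories: Int, dropped: Option[Int] = None): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    dropped.foreach { d =>
      require(d >= 0 && d < numCategories, s"dropped index $d out of range")
    }
    (0 until numCategories)
      .filterNot(i => dropped.contains(i))      // skip the reference category, if any
      .map(i => if (i == index) 1.0 else 0.0)   // 1.0 only in the row's own category slot
      .toVector
  }
}
```

With all categories kept, every row's entries sum to 1, so the columns are linearly dependent once an intercept term is present. Dropping one category removes that dependency, and a row belonging to the dropped category is encoded as all zeros.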
[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-5888:
-----------------------------------

    Assignee: Sandy Ryza  (was: Apache Spark)
[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandy Ryza reassigned SPARK-5888:
--------------------------------

    Assignee: Sandy Ryza