
[jira] [Updated] (SPARK-14862) Tree and ensemble classification: do not require label metadata

2016-04-25 Thread Joseph K. Bradley (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-14862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14862:
--
Target Version/s: 2.0.0

> Tree and ensemble classification: do not require label metadata
> ---
>
> Key: SPARK-14862
> URL: https://issues.apache.org/jira/browse/SPARK-14862
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
>
> spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
> require that the labelCol have metadata specifying the number of classes.  
> Instead, if the number of classes is not specified, we should automatically 
> scan the column to identify numClasses.
>
> This differs from [SPARK-7126] in that this requires labels to be indexed 
> (but without metadata).  This issue is not for supporting String labels.
>
> Note: This could cause problems with very small datasets + cross validation 
> if there are k classes but class index k-1 does not appear in the training 
> data.  We should make sure the error thrown helps the user understand the 
> solution, which is probably to use StringIndexer to index the whole dataset's 
> labelCol before doing cross validation.
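
The proposed column scan can be sketched in plain Python, without Spark, to show the idea. This is only an illustration under stated assumptions: the helper name infer_num_classes and its error messages are hypothetical and are not Spark's implementation. It assumes labels are already indexed as 0, 1, ..., k-1, so the number of classes is taken to be the largest label plus one.

```python
from typing import Iterable


def infer_num_classes(labels: Iterable[float]) -> int:
    """Scan already-indexed labels and return (max label + 1).

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    max_label = -1
    for raw in labels:
        value = float(raw)
        # Indexed labels must be non-negative whole numbers: 0.0, 1.0, 2.0, ...
        if value < 0 or not value.is_integer():
            raise ValueError(
                f"Labels must be non-negative integers (indexed), got {raw!r}. "
                "Consider indexing the label column first, e.g. with StringIndexer."
            )
        max_label = max(max_label, int(value))
    if max_label < 0:
        raise ValueError("Cannot infer the number of classes from an empty label column.")
    return max_label + 1
```

Calling infer_num_classes([0.0, 1.0, 2.0, 1.0]) yields 3, while a value such as 1.5 is rejected with a message that points the user toward indexing the column.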



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-14862) Tree and ensemble classification: do not require label metadata

2016-04-22 Thread Joseph K. Bradley (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-14862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14862:
--
Description: 
spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
require that the labelCol have metadata specifying the number of classes.  
Instead, if the number of classes is not specified, we should automatically 
scan the column to identify numClasses.

This differs from [SPARK-7126] in that this requires labels to be indexed (but 
without metadata).  This issue is not for supporting String labels.

Note: This could cause problems with very small datasets + cross validation if 
there are k classes but class index k-1 does not appear in the training data.  
We should make sure the error thrown helps the user understand the solution, 
which is probably to use StringIndexer to index the whole dataset's labelCol 
before doing cross validation.


  was:
spark.ml DecisionTreeClassifier, RandomForestClassifier, and GBTClassifier 
require that the labelCol have metadata specifying the number of classes.  
Instead, if the number of classes is not specified, we should automatically 
scan the column to identify numClasses.

Note: This could cause problems with very small datasets + cross validation if 
there are k classes but class index k-1 does not appear in the training data.  
We should make sure the error thrown helps the user understand the solution, 
which is probably to use StringIndexer to index the whole dataset's labelCol 
before doing cross validation.
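
The cross-validation caveat in the note can be illustrated with a toy example in plain Python. The data, the fold split, and the helper infer_num_classes are all made up for illustration and do not reflect Spark's code. The sketch assumes numClasses is inferred as the largest indexed label plus one, so a training fold that is missing class index k-1 yields a smaller count than the whole dataset does.

```python
def infer_num_classes(labels):
    """Hypothetical stand-in for the column scan: max indexed label + 1."""
    return int(max(labels)) + 1


# Toy dataset with k = 3 classes; class index 2 (k-1) appears only once.
all_labels = [0.0, 1.0, 0.0, 1.0, 0.0, 2.0]

# A two-way split in which the single class-2 row lands in the validation fold.
train_fold = all_labels[:4]
validation_fold = all_labels[4:]

num_classes_from_fold = infer_num_classes(train_fold)  # scan sees only 0 and 1
num_classes_from_all = infer_num_classes(all_labels)   # scan sees 0, 1 and 2

# Validation labels that a model sized from the training fold cannot represent.
unseen = [y for y in validation_fold if y >= num_classes_from_fold]
```

Here the training fold suggests two classes while the full dataset has three, which is why the description recommends indexing the whole dataset's labelCol (so the class count travels with the column as metadata) before splitting into folds.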





