[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976098#comment-15976098 ]

Yan Facai (颜发才) commented on SPARK-20081:
-----------------------------------------

By the way, for StringIndexer a numerical label column will be cast to string 
and the labels re-sorted by descending frequency (count).

If you already know all possible labels and want a custom order (say, 
lexicographical or natural order), it's better to construct a StringIndexerModel 
yourself.
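As a sketch of the difference (plain Scala, no Spark dependency; the object and method names below are hypothetical, not Spark API): StringIndexer's fit orders labels by descending frequency, whereas a hand-built StringIndexerModel can take any fixed label array.

```scala
// Hypothetical helpers illustrating the two orderings (not Spark API).
object LabelOrdering {
  // Frequency order, as StringIndexer.fit infers it; ties broken
  // alphabetically here for determinism.
  def byFrequency(labels: Seq[String]): Array[String] =
    labels.groupBy(identity).toSeq
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)
      .toArray

  // A custom, fixed order for a known label set (lexicographical here);
  // an array like this could be handed to a StringIndexerModel directly.
  def lexicographic(labels: Set[String]): Array[String] =
    labels.toArray.sorted
}
```

With the frequency ordering, the most common label gets index 0; with a hand-built model, the indices follow whatever order you supply.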

> RandomForestClassifier doesn't seem to support more than 100 labels
> -------------------------------------------------------------------
>
>                 Key: SPARK-20081
>                 URL: https://issues.apache.org/jira/browse/SPARK-20081
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>         Environment: Java
>            Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit() 
> (from Java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column 
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) 
> allowed to be inferred from values.  
>   To avoid this error for labels with > 100 classes, specify numClasses 
> explicitly in the metadata; this can be done by applying StringIndexer to the 
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a 
> difference. Looking at the code, this is not surprising, since 
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
>   def getNumClasses(labelSchema: StructField): Option[Int] = {
>     Attribute.fromStructField(labelSchema) match {
>       case binAttr: BinaryAttribute => Some(2)
>       case nomAttr: NominalAttribute => nomAttr.getNumValues
>       case _: NumericAttribute | UnresolvedAttribute => None
>     }
>   }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the 
> classifier, so that Classifier#getNumClasses() allows a larger number of 
> auto-detected labels. However, RandomForestClassifier#train() calls 
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the 
> default of 100:
> {code:language=scala}
>   override protected def train(dataset: Dataset[_]): 
> RandomForestClassificationModel = {
>     val categoricalFeatures: Map[Int, Int] =
>       MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
>     val numClasses: Int = getNumClasses(dataset)
> // ...
> {code}
> My scala skills are pretty sketchy, so please correct me if I misinterpreted 
> something. But as it seems right now, there is no way to learn from data with 
> more than 100 labels via RandomForestClassifier.
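To make the quoted getNumClasses dispatch concrete, here is a plain-Scala mirror of that match (simplified stand-in types, no Spark dependency, names hypothetical): whatever extra metadata the column carries, only a NominalAttribute's value count ever reaches the result, which is why setting a "numClasses" metadata key has no effect.

```scala
// Simplified stand-ins for org.apache.spark.ml.attribute (illustration only;
// these are NOT the real Spark classes).
sealed trait Attr
case object BinaryAttr extends Attr
case class NominalAttr(numValues: Option[Int]) extends Attr
case object NumericAttr extends Attr
case object UnresolvedAttr extends Attr

object NumClasses {
  // Mirrors the shape of MetadataUtils.getNumClasses: a "numClasses"
  // metadata key never enters this match, so it is silently ignored.
  def get(attr: Attr): Option[Int] = attr match {
    case BinaryAttr                   => Some(2)
    case NominalAttr(n)               => n
    case NumericAttr | UnresolvedAttr => None
  }
}
```

Only the nominal branch can yield an arbitrary class count, which is why the error message steers users toward StringIndexer: it attaches nominal attribute metadata to the label column.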



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
