[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967347#comment-15967347 ]
Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:42 AM:
------------------------------------------------------------------

Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. It is a little hacky, and it might not work. I am not familiar with the Metadata and Attribute classes at present. Some experts may have a better solution; unfortunately, I have no idea. If you would like to dig deeper, see:

org.apache.spark.ml.attribute.Attribute
org.apache.spark.ml.attribute.NominalAttribute

Using StringIndexer on your label column should work well, since it takes care of this itself, I guess.

> RandomForestClassifier doesn't seem to support more than 100 labels
> -------------------------------------------------------------------
>
>                 Key: SPARK-20081
>                 URL: https://issues.apache.org/jira/browse/SPARK-20081
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>        Environment: Java
>           Reporter: Christian Reiniger
>
> When feeding data with more than 100 labels into RandomForestClassifier#fit()
> (from Java code), I get the following error message:
> {code}
> Classifier inferred 143 from label values in column
> rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100)
> allowed to be inferred from values.
> To avoid this error for labels with > 100 classes, specify numClasses
> explicitly in the metadata; this can be done by applying StringIndexer to the
> label column.
> {code}
> Setting "numClasses" in the metadata for the label column doesn't make a
> difference.
> Looking at the code, this is not surprising, since
> MetadataUtils.getNumClasses() ignores this setting:
> {code:language=scala}
> def getNumClasses(labelSchema: StructField): Option[Int] = {
>   Attribute.fromStructField(labelSchema) match {
>     case binAttr: BinaryAttribute => Some(2)
>     case nomAttr: NominalAttribute => nomAttr.getNumValues
>     case _: NumericAttribute | UnresolvedAttribute => None
>   }
> }
> {code}
> The alternative would be to pass a proper "maxNumClasses" parameter to the
> classifier, so that Classifier#getNumClasses() allows a larger number of
> auto-detected labels. However, RandomForestClassifier#train() calls
> #getNumClasses without the "maxNumClasses" parameter, causing it to use the
> default of 100:
> {code:language=scala}
> override protected def train(dataset: Dataset[_]): RandomForestClassificationModel = {
>   val categoricalFeatures: Map[Int, Int] =
>     MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
>   val numClasses: Int = getNumClasses(dataset)
>   // ...
> {code}
> My Scala skills are pretty sketchy, so please correct me if I misinterpreted
> something. But as it seems right now, there is no way to learn from data with
> more than 100 labels via RandomForestClassifier.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
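A sketch of the two workarounds discussed above, assuming Spark 2.x's ml.attribute and ml.feature APIs. The column names ("label", "indexedLabel") and the class count 143 are illustrative, not from the original report; the `df` DataFrame is assumed to exist and is left commented out. Untested against a live cluster.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.feature.StringIndexer

// Workaround 1: attach nominal metadata to the label column yourself, so that
// MetadataUtils.getNumClasses() resolves it as a NominalAttribute instead of
// falling back to value-based inference (which is capped at 100).
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(143) // hypothetical class count for this dataset
  .toMetadata()       // nests the attribute under the "ml_attr" metadata key
// val withMeta = df.withColumn("label", df("label").as("label", labelMeta))

// Workaround 2 (the one the error message suggests): let StringIndexer index
// the label column; it writes the nominal metadata as part of its output.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
// val indexed = indexer.fit(df).transform(df)
// Then train RandomForestClassifier with setLabelCol("indexedLabel").
```

Workaround 1 mirrors what the `builder.putLong("num_vals", ...).putString("type", "nominal")` suggestion in the comment does by hand, but goes through the NominalAttribute API rather than building the Metadata keys directly.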