AnChe Kuo created SPARK-22034:
---------------------------------

Summary: CrossValidator's training and testing set with different set of labels, resulting in encoder transform error
Key: SPARK-22034
URL: https://issues.apache.org/jira/browse/SPARK-22034
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 2.2.0
Environment: Ubuntu 16.04
             Scala 2.11
             Spark 2.2.0
Reporter: AnChe Kuo

Let's say we have a VectorIndexer with maxCategories set to 13, and the training data has a column containing a month label. In CrossValidator, the DataFrame is split into training and testing sets automatically. It can happen that a training set lacks month 2 (by chance, or quite frequently if the labels are unbalanced).

When the pipeline is trained within the cross validator, it is fitted on the training set only, which leaves the VectorIndexer with a partial key map. When this pipeline is then used to transform the held-out (testing) set, VectorIndexer throws a "key not found" error.

Making CrossValidator an estimator as well, so that it can be connected into a whole pipeline, is a cool idea, but a bug like this is unexpected.

The solution, I am guessing, would be to check each stage in the pipeline and, when we see an encoder-type stage, fit that stage's model on the complete dataset.
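
To make the failure mode concrete, here is a minimal sketch in plain Python rather than Spark code. It assumes the indexer behaves like a dictionary built from the categories seen at fit time. The helper names fit_category_map and transform are hypothetical and are not part of the MLlib API, and the split is hand-picked to reproduce the unlucky case.

```python
# Plain-Python stand-in for the behaviour described above; none of this is Spark API.

def fit_category_map(values):
    """Map each distinct category seen during fitting to an index (hypothetical helper)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def transform(category_map, values):
    """Look up each value; a category unseen at fit time raises KeyError."""
    return [category_map[v] for v in values]


# Full dataset: every month 1..12 appears twice.
months = list(range(1, 13)) * 2

# One unlucky split: both month-2 rows land in the held-out fold.
train = [m for m in months if m != 2]
held_out = [2, 2]

# Fitting on the training fold only yields a partial key map.
partial_map = fit_category_map(train)
try:
    transform(partial_map, held_out)
    failed = False
except KeyError:
    failed = True  # month 2 was never seen while fitting

# Proposed workaround: fit the encoder-type stage on the complete dataset.
full_map = fit_category_map(months)
indexed = transform(full_map, held_out)

print(failed, indexed)
```

The partial map has no entry for month 2, so the lookup fails, while the map fitted on the complete dataset covers all twelve months. Whether fitting an encoder on held-out rows is acceptable depends on the stage: it only exposes the set of category values, not the labels being predicted.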