[ https://issues.apache.org/jira/browse/SPARK-22034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16179938#comment-16179938 ]
Bryan Cutler edited comment on SPARK-22034 at 9/25/17 11:18 PM:
----------------------------------------------------------------

You would normally fit the VectorIndexer on the entire dataset and then put the resulting transformer in the pipeline for cross validation. This is not a bug unless I'm mistaken.

For example: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala#L52


was (Author: bryanc):
You would normally fit the VectorIndexer on the entire dataset and then put the resulting transformer in the pipeline for cross validation. This is not a bug unless I'm mistaken.

> CrossValidator's training and testing set with different set of labels, resulting in encoder transform error
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22034
>                 URL: https://issues.apache.org/jira/browse/SPARK-22034
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.2.0
>         Environment: Ubuntu 16.04
> Scala 2.11
> Spark 2.2.0
>            Reporter: AnChe Kuo
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Let's say we have a VectorIndexer with maxCategories set to 13, and the
> training set has a column containing month labels.
> In CrossValidator, the DataFrame is split into training and testing sets
> automatically. It could happen that the training set lacks month 2
> (this could happen by chance, or quite frequently if the labels are
> unbalanced).
> When the model is trained within the cross validator, the pipeline
> is fitted on the training set only, resulting in a partial key map in
> VectorIndexer. When this pipeline is then used to transform the testing set,
> VectorIndexer will throw a "key not found" error.
> Making CrossValidator an estimator as well, so that it can be connected to a
> whole pipeline, is a cool idea, but a bug like this occurs, which is unexpected.
> The solution, I am guessing, would be to check each stage in the pipeline
> and, when we see an encoder-type stage, fit that stage's model on the
> complete dataset.
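
To make the failure mode and the suggested workaround concrete, here is a minimal sketch in plain Python. `ToyIndexer` and the sample data are hypothetical stand-ins, not Spark's VectorIndexer or its API; the sketch only assumes that the indexer builds its key map from the categories it sees while fitting.

```python
class ToyIndexer:
    """Hypothetical stand-in for a category indexer; not Spark's VectorIndexer."""

    def fit(self, values):
        # Build the key map only from the categories seen during fitting.
        self.key_map = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # A category that is absent from the key map raises KeyError.
        return [self.key_map[v] for v in values]


# Made-up month column; month 2 is rare, so a fold can easily miss it.
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 3, 5]
train = [m for m in months if m != 2]  # training fold that lacks month 2
test = [2, 3]                          # held-out fold that contains month 2

# Variant 1: fit the indexer inside the fold. The key map is partial,
# so transforming the held-out fold fails on month 2.
try:
    ToyIndexer().fit(train).transform(test)
    fold_fit_failed = False
except KeyError:
    fold_fit_failed = True

# Variant 2: fit the indexer once on the full dataset, then reuse the
# fitted indexer for every fold. Month 2 is now in the key map.
full_indexer = ToyIndexer().fit(months)
indexed_test = full_indexer.transform(test)
```

In Spark terms, the second variant corresponds to fitting the VectorIndexer on the whole DataFrame and placing the fitted model, rather than the unfitted estimator, in the pipeline, which is what the linked RandomForestClassifierExample does for its pipeline.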