Hi, I encountered errors fitting a model using a CrossValidator. The training set contained a feature which was initially a String with many unique values. I used a StringIndexer to transform this feature column into label indices. Fitting a model with a regular pipeline worked fine, but I ran into the following error when I introduced the CrossValidator:

15/06/18 16:30:18 ERROR Executor: Exception in task 1.0 in stage 70.0 (TID 156)
org.apache.spark.SparkException: Unseen label: 20000456.
    at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:120)
    at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:115)
    at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:71)
    at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
    at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)

I think the pipeline with cross validation is fitting the StringIndexer on the training folds only, not on the test fold. When the fitted pipeline then encounters a previously unseen label in the test fold, it breaks down. When I whittled down the feature set to only contain low-cardinality categorical features, the pipeline behaved.

Is this behavior desired? If I'm understanding this correctly, it would be great to have some more graceful error handling.

My code is at https://gist.github.com/chelseaz/7ead2c0f25e2dd7fe5d9

Thanks,
Chelsea
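
For reference, the failure mode suspected above can be sketched without Spark. The following is a toy illustration in plain Python, not Spark's implementation: `ToyStringIndexer` and `k_fold_failures` are hypothetical names invented for this sketch, and the fold assignment (every k-th row) is an assumption that differs from how CrossValidator actually splits data. It only shows why an indexer fitted on the training folds can fail on the held-out fold when a column has many unique values.

```python
class ToyStringIndexer:
    """Toy stand-in for a label indexer (hypothetical, not Spark's code).

    fit() learns a label -> index map; transform() raises on any label
    that was not present when fit() was called.
    """

    def fit(self, labels):
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        # Most frequent label gets index 0; the alphabetical tie-break
        # is an assumption of this sketch.
        ordered = sorted(counts, key=lambda l: (-counts[l], l))
        self.index = {label: i for i, label in enumerate(ordered)}
        return self

    def transform(self, labels):
        out = []
        for label in labels:
            if label not in self.index:
                raise ValueError("Unseen label: %s." % label)
            out.append(float(self.index[label]))
        return out


def k_fold_failures(labels, k):
    """Fit on k-1 folds, transform the held-out fold, return the folds that fail."""
    folds = [labels[i::k] for i in range(k)]  # simplistic split: every k-th row
    failed = []
    for held_out in range(k):
        train = [l for i, fold in enumerate(folds) if i != held_out for l in fold]
        indexer = ToyStringIndexer().fit(train)
        try:
            indexer.transform(folds[held_out])
        except ValueError:
            failed.append(held_out)
    return failed


# High-cardinality column: every value is unique, so each held-out fold
# necessarily contains labels the indexer never saw during fit().
high_cardinality = [str(20000000 + i) for i in range(30)]

# Low-cardinality column: two values, both present in every fold.
low_cardinality = ["red", "green"] * 15

print(k_fold_failures(high_cardinality, 3))  # [0, 1, 2]
print(k_fold_failures(low_cardinality, 3))   # []
```

In this sketch the high-cardinality column fails on every fold while the low-cardinality column passes, which matches the behavior reported above.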