Hi,

I encountered errors fitting a model using a CrossValidator. The training
set contained a feature which was initially a String with many unique
values. I used a StringIndexer to transform this feature column into label
indices. Fitting a model with a regular pipeline worked fine, but I ran into
the following error when I introduced the CrossValidator:

15/06/18 16:30:18 ERROR Executor: Exception in task 1.0 in stage 70.0 (TID 156)
org.apache.spark.SparkException: Unseen label: 20000456.
  at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:120)
  at org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:115)
  at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:71)
  at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
  at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)

I think the pipeline with cross validation is fitting the StringIndexer on
the training folds only, so its label-to-index mapping never sees values
that occur only in the test fold. When the pipeline then transforms the
test fold and encounters such a previously unseen label, it fails with the
error above. When I whittled down the feature set to contain only
low-cardinality categorical features, the pipeline ran without errors.
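
To make the suspected mechanism concrete, here is a minimal plain-Python
sketch. It is not Spark code: fit_indexer and transform are hypothetical
helpers that only mimic a label-to-index mapping, the column values are
made up, and the alphabetical tie-break is an assumption of the sketch.

```python
def fit_indexer(values):
    """Build a label -> index map, most frequent label first.

    Ties are broken alphabetically; that tie-break is an assumption
    of this sketch, not a statement about Spark's behavior.
    """
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: idx for idx, label in enumerate(ordered)}


def transform(index, values):
    """Replace each value by its index; fail on labels absent at fit time."""
    out = []
    for v in values:
        if v not in index:
            raise ValueError("Unseen label: %s." % v)
        out.append(float(index[v]))
    return out


# Hypothetical high-cardinality column: most values occur only once.
column = ["20000123", "20000123", "20000456", "20000789"]

# One cross-validation split in which "20000456" lands only in the test fold.
train_fold = ["20000123", "20000123", "20000789"]
test_fold = ["20000456"]

# Regular pipeline: fit and transform on the same data, so every label is known.
full_index = fit_indexer(column)
full_result = transform(full_index, column)

# Cross validation: fit on the training fold, then transform the test fold.
fold_index = fit_indexer(train_fold)
try:
    transform(fold_index, test_fold)
    fold_error = None
except ValueError as exc:
    fold_error = str(exc)

print(full_result)
print(fold_error)
```

With a low-cardinality column, every label is likely to appear in each
training fold, which would explain why the reduced feature set worked.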

Is this behavior intended? If I'm understanding this correctly, it would be
great to have more graceful handling of unseen labels.

My code is at https://gist.github.com/chelseaz/7ead2c0f25e2dd7fe5d9

Thanks,

Chelsea