[jira] [Created] (SPARK-22034) CrossValidator's training and testing set with different set of labels, resulting in encoder transform error

AnChe Kuo (JIRA) Fri, 15 Sep 2017 19:20:14 -0700

AnChe Kuo created SPARK-22034:
---------------------------------

             Summary: CrossValidator's training and testing set with different 
set of labels, resulting in encoder transform error
                 Key: SPARK-22034
                 URL: https://issues.apache.org/jira/browse/SPARK-22034
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.2.0
         Environment: Ubuntu 16.04
Scala 2.11
Spark 2.2.0
            Reporter: AnChe Kuo



Let's say we have a VectorIndexer with maxCategories set to 13, and the training 
set has a column containing month labels.

In CrossValidator, the dataframe is split into training and testing sets 
automatically. It could happen that the training set lacks month 2 (this could 
happen by chance, or quite frequently if the labels are unbalanced).

When a model is trained within the cross validator, the pipeline is fitted on 
the training set only, resulting in a partial key map in VectorIndexer. When 
this pipeline is then used to transform the testing set, VectorIndexer throws 
a "key not found" error.
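
The failure mode described above can be sketched with a toy model in plain 
Python. This is not Spark's VectorIndexer; the class ToyCategoryIndexer, its 
category_map attribute, and the month values are all hypothetical and exist 
only to show how a map built from one split can miss a key that appears in 
the other split.

```python
# Toy model of a category indexer; NOT Spark's VectorIndexer implementation.
class ToyCategoryIndexer:
    def fit(self, values):
        # Map each distinct value seen during fitting to a dense index.
        self.category_map = {v: i for i, v in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # A plain dict lookup raises KeyError for any value not seen in fit().
        return [self.category_map[v] for v in values]


# Hypothetical fold: month 2 is rare and lands only in the held-out split.
train_months = [1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 3]
test_months = [2, 5, 7]

indexer = ToyCategoryIndexer().fit(train_months)

try:
    indexer.transform(test_months)
    failed_key = None
except KeyError as err:
    # The missing key is the category absent from the training split.
    failed_key = err.args[0]

print("unseen category:", failed_key)
```

The lookup fails only because the map was built from the training split 
alone; the same data fitted as a whole would contain all twelve months.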

Making CrossValidator an estimator, so that it can be connected to a whole 
pipeline, is a cool idea, but a bug like this occurs and is not expected.

The solution, I am guessing, would be to check each stage in the pipeline and, 
when we see an encoder-type stage, fit that stage's model with the complete 
dataset.
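
A minimal sketch of that idea, again as a plain-Python toy rather than Spark 
code: the category map is built once from the complete dataset and then reused 
for every fold. The helper names (build_category_map, encode, k_fold_splits) 
and the sample months are assumptions made for illustration, and the fold 
assignment is a simple deterministic stride, not CrossValidator's random split.

```python
def build_category_map(values):
    # Map each distinct value to a dense index.
    return {v: i for i, v in enumerate(sorted(set(values)))}


def encode(category_map, values):
    # Raises KeyError if a value is missing from the map.
    return [category_map[v] for v in values]


def k_fold_splits(rows, k):
    # Fold i holds out every k-th row starting at position i.
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        yield train, test


# Hypothetical month column; several months occur only once.
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 3, 5, 7, 9, 11]

# Folds whose held-out rows contain a category absent from their training rows.
folds_with_unseen = [
    i for i, (train, test) in enumerate(k_fold_splits(months, 3))
    if set(test) - set(train)
]

# Proposed approach: fit the encoder on the complete dataset, once.
full_map = build_category_map(months)

encoded_folds = [
    (encode(full_map, train), encode(full_map, test))
    for train, test in k_fold_splits(months, 3)
]

print("folds with unseen categories:", folds_with_unseen)
print("folds encoded without error:", len(encoded_folds))
```

One trade-off to keep in mind: fitting the encoder on the complete dataset 
lets the set of category values from the held-out rows influence the fitted 
pipeline, although no labels or targets are involved.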



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
