[jira] [Commented] (SPARK-49615) Feature transformers are case sensitive when unintented

Weichen Xu (Jira) Thu, 17 Oct 2024 00:13:49 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-49615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890345#comment-17890345
 ]


Weichen Xu commented on SPARK-49615:
------------------------------------

[~chhavibansal] 

You need to test against a Spark nightly build 
(https://spark.apache.org/developer-tools.html) or wait for the next Spark 
release.

> Feature transformers are case sensitive when unintented
> -------------------------------------------------------
>
>                 Key: SPARK-49615
>                 URL: https://issues.apache.org/jira/browse/SPARK-49615
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, Spark Core
>    Affects Versions: 3.4.3
>            Reporter: Chhavi Bansal
>            Assignee: Weichen Xu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Hi team,
> https://spark.apache.org/docs/latest/ml-features
> The feature transformers are case sensitive even though the configuration 
>  
> {code:java}
> spark.conf.get("spark.sql.caseSensitive") {code}
>  
> is set to false. Users of all these transformers are forced to match the 
> case of the column name in the DataFrame.
>  
> {code:java}
> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
> import scala.collection.JavaConverters._
>
> val data = List(
>   Row("the movie was great", "positive", 10, "greatest of all time"),
>   Row("the movie was average", "negative", 11, "just average things, average storyline"),
>   Row("movie was fun", "positive", 2, "superb screen play"))
> val schema = new StructType()
>   .add("comments", StringType, true)
>   .add("reviews", StringType, true)
>   .add("counts", IntegerType, true)
>   .add("Additional_COMMENTS", StringType, true)
> val df = spark.createDataFrame(data.asJava, schema)
> // The input column name differs from the schema only by case.
> val si = new StringIndexer()
>   .setInputCol("additional_comments")
>   .setOutputCol("si_additional_comments")
> si.fit(df).transform(df).show() {code}
> The above code fails with 
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column additional_comments does not exist.
>     at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>     at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>     at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>     at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>     at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
>     at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
>     at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>     at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema(StringIndexer.scala:123)
>     at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema$(StringIndexer.scala:115)
>  {code}
> This means that the column "additional_comments" must be provided in the 
> same case as in the DataFrame. 
>  
> I think that when spark.sql.caseSensitive is set to false, we should be able 
> to reference the column name in any case.
>  
> Can someone please help fix this bug for all transformers?
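
On releases that do not yet contain the fix, one possible workaround is to 
resolve the column name against the DataFrame's actual columns before passing 
it to the transformer. The following is a minimal sketch under that 
assumption. ColumnResolver and resolveColumn are hypothetical names introduced 
here for illustration; they are not part of Spark and are not how the fix in 
SPARK-49615 is implemented.

```scala
// Hypothetical helper (not a Spark API): look up the actual column name that
// matches `requested`, honoring a case-sensitivity flag.
object ColumnResolver {
  def resolveColumn(
      columns: Seq[String],
      requested: String,
      caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      // Exact match only.
      columns.find(_ == requested)
    } else {
      // First column whose name matches when case is ignored.
      columns.find(_.equalsIgnoreCase(requested))
    }
  }
}
```

With the DataFrame from the description, the inputs could come from df.columns 
and spark.conf.get("spark.sql.caseSensitive").toBoolean, and the resolved name 
(here "Additional_COMMENTS") would then be passed to setInputCol. Note that 
this sketch returns the first match and does not report ambiguity when two 
columns differ only by case.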



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

