[ https://issues.apache.org/jira/browse/SPARK-49615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890345#comment-17890345 ]
Weichen Xu commented on SPARK-49615:
------------------------------------

[~chhavibansal] You need to test against a Spark nightly build (https://spark.apache.org/developer-tools.html) or wait for the next Spark release.

> Feature transformers are case sensitive when unintented
> -------------------------------------------------------
>
>                 Key: SPARK-49615
>                 URL: https://issues.apache.org/jira/browse/SPARK-49615
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, Spark Core
>    Affects Versions: 3.4.3
>            Reporter: Chhavi Bansal
>            Assignee: Weichen Xu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Hi team,
> https://spark.apache.org/docs/latest/ml-features
> The feature transformers are case sensitive even though the configuration
>
> {code:java}
> spark.conf.get("spark.sql.caseSensitive") {code}
>
> is set to false. Users of all these transformers are forced to match the
> case of the column name in the dataframe.
>
> {code:java}
> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import scala.collection.JavaConverters._
>
> // Assumes an active SparkSession named `spark`.
> val data = List(
>   Row("the movie was great", "positive", 10, "greatest of all time"),
>   Row("the movie was average", "negative", 11, "just average things, average storyline"),
>   Row("movie was fun", "positive", 2, "superb screen play"))
> val schema = new StructType()
>   .add("comments", StringType, true)
>   .add("reviews", StringType, true)
>   .add("counts", IntegerType, true)
>   .add("Additional_COMMENTS", StringType, true)
> val df = spark.createDataFrame(data.asJava, schema)
> // The input column differs from the schema only in case.
> val si = new StringIndexer()
>   .setInputCol("additional_comments")
>   .setOutputCol("si_additional_comments")
> si.fit(df).transform(df).show() {code}
> The above code fails with
> {code:java}
> Exception in thread "main" org.apache.spark.SparkException: Input column additional_comments does not exist.
>       at org.apache.spark.ml.feature.StringIndexerBase.$anonfun$validateAndTransformSchema$2(StringIndexer.scala:128)
>       at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
>       at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>       at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>       at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>       at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
>       at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
>       at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
>       at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema(StringIndexer.scala:123)
>       at org.apache.spark.ml.feature.StringIndexerBase.validateAndTransformSchema$(StringIndexer.scala:115)
> {code}
> This means that the column "additional_comments" needs to be provided in the
> same case as in the dataframe.
>
> I think that when the caseSensitive setting is set to false, we should be
> able to use the column name in any case.
>
> Can someone please help fix this bug for all transformers?
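
On releases that do not include the fix, one possible workaround is to look up the column's actual name before passing it to the transformer. The sketch below is only an illustration: `ColumnNameResolver` and `resolve` are hypothetical names, not part of Spark, and this is not how the fix for SPARK-49615 is implemented. It takes the list of column names (for example `df.columns.toSeq`) and returns the single name that matches when case is ignored.

```scala
// Hypothetical helper (not a Spark API): resolve a requested column name
// against the actual column names, ignoring case.
object ColumnNameResolver {
  // Returns Some(actualName) when exactly one column matches case-insensitively,
  // and None when the column is missing or the match is ambiguous.
  def resolve(columns: Seq[String], requested: String): Option[String] = {
    val matches = columns.filter(_.equalsIgnoreCase(requested))
    matches match {
      case Seq(single) => Some(single)
      case _           => None
    }
  }
}
```

With the dataframe from the report, `ColumnNameResolver.resolve(df.columns.toSeq, "additional_comments")` would yield the schema's own spelling, which can then be passed to `setInputCol`. Returning `None` on an ambiguous match is a deliberate choice in this sketch, so that two columns differing only in case are never silently confused.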