Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column
Any update on this? On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, wrote: > This issue is related to CharVarcharCodegenUtils readSidePadding method . > > Appending white spaces while reading ENUM data from mysql > > Causing issue in querying , writing the same data to Cassandra. > > On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, > wrote: > >> I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am >> querying to Mysql Database and applying >> >> `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working >> as expected in spark 3.3.1 , but not working with 3.5.0. >> >> Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR >> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*` >> >> The *st *column is ENUM in the database and it is causing the issue. >> >> Below is the Physical Plan of *FILTER* phase : >> >> For 3.3.1 : >> >> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR >> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED))) >> >> For 3.5.0 : >> >> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class >> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, >> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR >> (upper(staticinvoke(class >> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, >> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR >> (upper(staticinvoke(class >> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, >> readSidePadding, st#42, 13, true, false, true)) = CLOSED))) >> >> - >> >> I have debug it and found that Spark added a property in version 3.4.0 , >> i.e. **spark.sql.readSideCharPadding** which has default value **true**. >> >> Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697 >> >> Added a new method in Class **CharVarcharCodegenUtils** >> >> public static UTF8String readSidePadding(UTF8String inputStr, int limit) { >> int numChars = inputStr.numChars(); >> if (numChars == limit) { >> return inputStr; >> } else if (numChars < limit) { >> return inputStr.rpad(limit, SPACE); >> } else { >> return inputStr; >> } >> } >> >> >> **This method is appending some whitespace padding to the ENUM values >> while reading and causing the Issue.** >> >> --- >> >> When I am removing the UPPER function from the where condition the >> **FILTER** Phase looks like this : >> >> +- Filter (((staticinvoke(class >> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, >> StringType, readSidePadding, st#42, 13, true, false, true) = OPEN >> ) OR (staticinvoke(class >> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, >> readSidePadding, st#42, 13, true, false, true) = REOPEN )) OR >> (staticinvoke(class >> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, >> readSidePadding, st#42, 13, true, false, true) = CLOSED )) >> >> >> **You can see it has added some white space after the value and the query >> runs fine giving the correct result.** >> >> But with the UPPER function I am not getting the data. >> >> -- >> >> I have also tried to disable this Property *spark.sql.readSideCharPadding >> = false* with following cases : >> >> 1. With Upper function in where clause : >>It is not pushing the filters to Database and the *query works fine*. >> >> >> +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR >> (upper(st#42) = CLOSED)) >> >> 2. But when I am removing the upper function >> >> *It is pushing the filter to Mysql with the white spaces and I am not >> getting the data. (THIS IS A CAUSING VERY BIG ISSUE)* >> >> PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON), >> *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN >> )),EqualTo(st,CLOSED ))] >> >> I cannot move this filter to JDBC read query , also I can't remove this >> UPPER function in the where clause. >> >> >> >> >> Also I found same data getting written to CASSANDRA with *PADDING .* >> >
Re: [ SPARK SQL ]: UPPER in WHERE condition is not working in Apache Spark 3.5.0 for Mysql ENUM Column
This issue is related to CharVarcharCodegenUtils readSidePadding method . Appending white spaces while reading ENUM data from mysql Causing issue in querying , writing the same data to Cassandra. On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, wrote: > I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am > querying to Mysql Database and applying > > `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working > as expected in spark 3.3.1 , but not working with 3.5.0. > > Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR > upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*` > > The *st *column is ENUM in the database and it is causing the issue. > > Below is the Physical Plan of *FILTER* phase : > > For 3.3.1 : > > +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR > (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED))) > > For 3.5.0 : > > +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true)) = OPEN) OR > (upper(staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR > (upper(staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true)) = CLOSED))) > > - > > I have debug it and found that Spark added a property in version 3.4.0 , > i.e. **spark.sql.readSideCharPadding** which has default value **true**. > > Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697 > > Added a new method in Class **CharVarcharCodegenUtils** > > public static UTF8String readSidePadding(UTF8String inputStr, int limit) { > int numChars = inputStr.numChars(); > if (numChars == limit) { > return inputStr; > } else if (numChars < limit) { > return inputStr.rpad(limit, SPACE); > } else { > return inputStr; > } > } > > > **This method is appending some whitespace padding to the ENUM values > while reading and causing the Issue.** > > --- > > When I am removing the UPPER function from the where condition the > **FILTER** Phase looks like this : > > +- Filter (((staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, > StringType, readSidePadding, st#42, 13, true, false, true) = OPEN > ) OR (staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true) = REOPEN )) OR > (staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true) = CLOSED )) > > > **You can see it has added some white space after the value and the query > runs fine giving the correct result.** > > But with the UPPER function I am not getting the data. > > -- > > I have also tried to disable this Property *spark.sql.readSideCharPadding > = false* with following cases : > > 1. With Upper function in where clause : >It is not pushing the filters to Database and the *query works fine*. > > +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR > (upper(st#42) = CLOSED)) > > 2. But when I am removing the upper function > > *It is pushing the filter to Mysql with the white spaces and I am not > getting the data. (THIS IS A CAUSING VERY BIG ISSUE)* > > PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON), > *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN > )),EqualTo(st,CLOSED ))] > > I cannot move this filter to JDBC read query , also I can't remove this > UPPER function in the where clause. > > > > > Also I found same data getting written to CASSANDRA with *PADDING .* >