Any update on this?
On Fri, 13 Oct, 2023, 12:56 pm Suyash Ajmera, <suyashajmera941...@gmail.com>
wrote:

> This issue is related to the CharVarcharCodegenUtils readSidePadding
> method. It appends whitespace while reading ENUM data from MySQL, which
> causes issues when querying that data and when writing it to Cassandra.
>
> On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, <suyashajmera941...@gmail.com>
> wrote:
>
>> I have upgraded my Spark job from Spark 3.3.1 to Spark 3.5.0. I query a
>> MySQL database and apply `*UPPER(col) = UPPER(value)*` in the subsequent
>> SQL query. It works as expected in Spark 3.3.1, but not in 3.5.0.
>>
>> Where condition: `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR
>> upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*`
>>
>> The *st* column is an ENUM in the database, and it is what causes the
>> issue.
>>
>> Below is the physical plan of the *FILTER* phase:
>>
>> For 3.3.1:
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR
>> (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED)))
>>
>> For 3.5.0:
>>
>> +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = OPEN) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR
>> (upper(staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true)) = CLOSED)))
>>
>> -----------------------------------------------------------------
>>
>> I have debugged this and found that Spark 3.4.0 added a property,
>> **spark.sql.readSideCharPadding**, whose default value is **true**.
>>
>> Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-40697
>>
>> It added a new method to the class **CharVarcharCodegenUtils**:
>>
>> public static UTF8String readSidePadding(UTF8String inputStr, int limit) {
>>   int numChars = inputStr.numChars();
>>   if (numChars == limit) {
>>     return inputStr;
>>   } else if (numChars < limit) {
>>     return inputStr.rpad(limit, SPACE);
>>   } else {
>>     return inputStr;
>>   }
>> }
>>
>> **This method appends whitespace padding to the ENUM values while
>> reading, which causes the issue.**
>>
>> -----------------------------------------------------------------------
>>
>> When I remove the UPPER function from the where condition, the **FILTER**
>> phase looks like this:
>>
>> +- Filter (((staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = OPEN         ) OR
>> (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = REOPEN       )) OR
>> (staticinvoke(class
>> org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType,
>> readSidePadding, st#42, 13, true, false, true) = CLOSED       ))
>>
>> **You can see it has added whitespace after the value, and the query runs
>> fine, giving the correct result.**
>>
>> But with the UPPER function I am not getting the data.
>>
>> ------------------------------------------------------------------
>>
>> I have also tried disabling the property (*spark.sql.readSideCharPadding
>> = false*) in the following cases:
>>
>> 1. With the UPPER function in the where clause: the filters are not
>> pushed down to the database, and the *query works fine*.
>>
>> +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR
>> (upper(st#42) = CLOSED))
>>
>> 2. But when I remove the UPPER function, *it pushes the filter to MySQL
>> with the whitespace, and I am not getting the data. (THIS IS CAUSING A
>> VERY BIG ISSUE)*
>>
>> PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON),
>> *Or(Or(EqualTo(st,OPEN         ),EqualTo(st,REOPEN       )),
>> EqualTo(st,CLOSED       ))]
>>
>> I cannot move this filter into the JDBC read query, and I also cannot
>> remove the UPPER function from the where clause.
>>
>> --------------------------------------------------------------------------------------------------------
>>
>> I also found the same data getting written to Cassandra *with padding*.
>>
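To make the failure mode concrete, here is a plain-Python sketch of the padding semantics described above (an illustration only, not Spark code; the `read_side_padding` helper mimics `CharVarcharCodegenUtils.readSidePadding`, and the trim-before-UPPER workaround at the end is my own assumption, not something verified against Spark 3.5.0):

```python
SPACE = " "

def read_side_padding(input_str: str, limit: int) -> str:
    """Mimic readSidePadding: right-pad to `limit` chars when shorter."""
    if len(input_str) < limit:
        return input_str.ljust(limit, SPACE)
    return input_str

# Value of the CHAR(13)-typed ENUM column after read-side padding:
st = read_side_padding("OPEN", 13)          # "OPEN" + 9 trailing spaces

# Without UPPER, Spark pads the literal side as well, so equality holds:
print(st == read_side_padding("OPEN", 13))  # True

# With UPPER, the literal 'OPEN' is not padded, so no row ever matches:
print(st.upper() == "OPEN")                 # False

# Possible workaround (assumption): strip the padding before comparing.
print(st.rstrip().upper() == "OPEN")        # True
```

This suggests the filter fails not because of UPPER itself, but because the padded column value is compared against an unpadded literal once a function call prevents Spark from padding both sides consistently.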