This issue is related to CharVarcharCodegenUtils readSidePadding method . Appending white spaces while reading ENUM data from mysql
Causing issue in querying , writing the same data to Cassandra. On Thu, 12 Oct, 2023, 7:46 pm Suyash Ajmera, <suyashajmera941...@gmail.com> wrote: > I have upgraded my spark job from spark 3.3.1 to spark 3.5.0, I am > querying to Mysql Database and applying > > `*UPPER(col) = UPPER(value)*` in the subsequent sql query. It is working > as expected in spark 3.3.1 , but not working with 3.5.0. > > Where Condition :: `*UPPER(vn) = 'ERICSSON' AND (upper(st) = 'OPEN' OR > upper(st) = 'REOPEN' OR upper(st) = 'CLOSED')*` > > The *st *column is ENUM in the database and it is causing the issue. > > Below is the Physical Plan of *FILTER* phase : > > For 3.3.1 : > > +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(st#42) = OPEN) OR > (upper(st#42) = REOPEN)) OR (upper(st#42) = CLOSED))) > > For 3.5.0 : > > +- Filter ((upper(vn#11) = ERICSSON) AND (((upper(staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true)) = OPEN) OR > (upper(staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true)) = REOPEN)) OR > (upper(staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true)) = CLOSED))) > > ----------------------------------------------------------------- > > I have debug it and found that Spark added a property in version 3.4.0 , > i.e. **spark.sql.readSideCharPadding** which has default value **true**. > > Link to the JIRA : https://issues.apache.org/jira/browse/SPARK-40697 > > Added a new method in Class **CharVarcharCodegenUtils** > > public static UTF8String readSidePadding(UTF8String inputStr, int limit) { > int numChars = inputStr.numChars(); > if (numChars == limit) { > return inputStr; > } else if (numChars < limit) { > return inputStr.rpad(limit, SPACE); > } else { > return inputStr; > } > } > > > **This method is appending some whitespace padding to the ENUM values > while reading and causing the Issue.** > > ----------------------------------------------------------------------- > > When I am removing the UPPER function from the where condition the > **FILTER** Phase looks like this : > > +- Filter (((staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, > StringType, readSidePadding, st#42, 13, true, false, true) = OPEN > ) OR (staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true) = REOPEN )) OR > (staticinvoke(class > org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, > readSidePadding, st#42, 13, true, false, true) = CLOSED )) > > > **You can see it has added some white space after the value and the query > runs fine giving the correct result.** > > But with the UPPER function I am not getting the data. > > ------------------------------------------------------------------ > > I have also tried to disable this Property *spark.sql.readSideCharPadding > = false* with following cases : > > 1. With Upper function in where clause : > It is not pushing the filters to Database and the *query works fine*. > > +- Filter (((upper(st#42) = OPEN) OR (upper(st#42) = REOPEN)) OR > (upper(st#42) = CLOSED)) > > 2. But when I am removing the upper function > > *It is pushing the filter to Mysql with the white spaces and I am not > getting the data. (THIS IS A CAUSING VERY BIG ISSUE)* > > PushedFilters: [*IsNotNull(vn), *EqualTo(vn,ERICSSON), > *Or(Or(EqualTo(st,OPEN ),EqualTo(st,REOPEN > )),EqualTo(st,CLOSED ))] > > I cannot move this filter to JDBC read query , also I can't remove this > UPPER function in the where clause. > > > -------------------------------------------------------------------------------------------------------- > > Also I found same data getting written to CASSANDRA with *PADDING .* >