Re: [PR] [SPARK-48652][SQL] Fix casting issue in Spark SQL when comparing string column to integer value [spark]

via GitHub Wed, 09 Oct 2024 08:42:42 -0700


zeal-thinker commented on PR #47246:
URL: https://github.com/apache/spark/pull/47246#issuecomment-2402680233


   > > Why does spark SQL have to address such a type casting issue? Isn't it 
user's responsibility to mention correct datatype in a query (string instead of 
integer in this case)?
   > > Why does spark SQL have to behave the way MySQL or Redshift behaves? 
There are many SQL systems. I don't know if we can be compatible with all of 
them.
   > > If we decide to address such a type casting issue, then should we also 
address type casting issue where comparing a string column to a boolean, float, 
etc. value results in an empty result set?
   > 
   > While it's true that users should specify the correct data types, Spark 
SQL might handle this implicit type casting for a few reasons:
   > 
   > Many people expect behavior similar to SQL systems like MySQL or Redshift, 
where implicit casting is allowed, making migrations easier and queries less 
error-prone.
   > 
   > While we can’t align with all SQL systems, following common behavior 
improves predictability and usability.
   > 
   > ### Take a string-to-integer comparison as an example:
   > `**SELECT * FROM table WHERE string_col = 123;**` If string_col contains 
values like "123", most users would expect Spark SQL to match those rows, since 
systems like MySQL or Redshift allow this kind of conversion. Handling this 
case makes sense and aligns with user expectations, avoiding unnecessary query 
failures/empty results.
   > 
   > ### Now, consider a string-to-boolean comparison:
   > `**SELECT * FROM table WHERE string_col = true;**` This is where things 
get tricky. Should "true" match true? What about strings like "yes", "1", or 
"false"?
   > 
   > ### A similar issue arises with a string-to-float comparison:
   > `**SELECT * FROM table WHERE string_col = 5.5;**` Should "5.5" match 5.5? 
How about "05.5" or "5.500"? These variations make it unclear what should be 
considered a match, adding complexity and increasing the chances of unexpected 
outcomes.
   > 
   > Imo, while handling straightforward cases like string-to-integer 
conversions is practical, trying to support more ambiguous comparisons like 
string-to-boolean or string-to-float could cause more confusion than it's worth
   
   It becomes tricky in the case of string-to-integer conversion too. Integer 1 
can be string "1", "01", "001", "0001", "1.0", "1.00", "1.000", "one" 
(English), "uno" (Spanish), "un" (French), "one dot zero", etc. The question we 
face now is if we should support all these variations or just some. If we 
decide to support just some variations, the query results will be wrong in some 
cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-48652][SQL] Fix casting issue in Spark SQL when comparing string column to integer value [spark]

Reply via email to