[jira] [Comment Edited] (SPARK-41266) Spark does not parse timestamp strings when using the IN operator

2022-12-06 Thread huldar chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643706#comment-17643706
 ] 

huldar chen edited comment on SPARK-41266 at 12/6/22 8:11 AM:
--

You can try enabling ANSI compliance:
{code:java}
spark.sql.ansi.enabled=true {code}
Under the default (Hive-compatible) type coercion, Spark promotes both sides of the comparison all the way to StringType.

Under ANSI type coercion, Spark instead promotes the StringType operand to the other side's data type.
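The difference between the two coercion directions can be sketched in plain Java (the class and method names below are made up for illustration; this is not Spark code, just the comparison logic each mode effectively applies):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class CoercionDemo {
    // Approximation of how a timestamp is rendered as a string (UTC session zone).
    static final DateTimeFormatter RENDER =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                             .withZone(ZoneId.of("UTC"));

    // Hive-style coercion: render the timestamp as a string and compare text.
    // An ISO-8601 literal like "2019-08-11T19:33:05Z" can never equal the
    // "yyyy-MM-dd HH:mm:ss" rendering, so the filter silently drops all rows.
    static boolean hiveStyle(Instant ts, String literal) {
        return RENDER.format(ts).equals(literal);
    }

    // ANSI-style coercion: parse the string into a timestamp, compare instants.
    static boolean ansiStyle(Instant ts, String literal) {
        return ts.equals(Instant.parse(literal));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2019-08-11T19:33:05Z");
        System.out.println(hiveStyle(ts, "2019-08-11T19:33:05Z")); // false
        System.out.println(ansiStyle(ts, "2019-08-11T19:33:05Z")); // true
    }
}
```

This mirrors the reported behavior: the plain `'2019-08-11 19:33:05'` literal happens to match the string rendering, while the `'…Z'` form only matches once the string side is parsed as a timestamp.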



> Spark does not parse timestamp strings when using the IN operator
> -
>
> Key: SPARK-41266
> URL: https://issues.apache.org/jira/browse/SPARK-41266
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: Windows 10, Spark 3.2.1 with Java 11
>Reporter: Laurens Versluis
>Priority: Major
>
> Likely affects more versions, tested only with 3.2.1.
>  
> Summary:
> Spark will convert a timestamp string to a timestamp when using the equal 
> operator (=), yet won't do this when using the IN operator.
>  
> Details:
> While debugging why a query returned no results, we found out that 
> when using the equals symbol `=` in the WHERE clause combined with a 
> TimestampType column, Spark will convert the string to a timestamp and 
> filter.
> However, when using the IN operator (our query), it will not do so, and 
> instead performs a cast to string. We expected the behavior to be similar, 
> or at least that Spark would recognize the IN clause operates on a 
> TimestampType column and attempt to convert to timestamp first before 
> falling back to string comparison.
>  
> *Minimal reproducible example:*
> Suppose we have a one-line dataset with the following contents and schema:
>  
> {noformat}
> +-------------------+
> |starttime          |
> +-------------------+
> |2019-08-11 19:33:05|
> +-------------------+
> root
>  |-- starttime: timestamp (nullable = true){noformat}
> Then if we fire the following queries, we get no results for the IN 
> clause that uses a timestamp string with timezone information:
>  
>  
> {code:java}
> // Works - Spark casts the argument to a string and the internal
> // representation of the time seems to match it...
> singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
> // Works
> singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
> // Works
> singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
> // Doesn't work
> singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
> // Works
> singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show(); {code}
>  
> We can see from the output that a cast to string is taking place:
> {noformat}
> [...] isnotnull(starttime#59),(cast(starttime#59 as string) = 2019-08-11 19:33:05){noformat}
> Since the = operator works, operators such as IN should behave 
> consistently with it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41266) Spark does not parse timestamp strings when using the IN operator

2022-12-05 Thread Laurens Versluis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643212#comment-17643212
 ] 

Laurens Versluis edited comment on SPARK-41266 at 12/5/22 10:11 AM:


Hi [~huldar], I get a different output:
{noformat}
Input data (singleCol):
+-------------------+
|starttime          |
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
+---------+
|starttime|
+---------+
+---------+

singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+{noformat}
Make sure to set 
{code:java}
.config("spark.sql.session.timeZone", "UTC") {code}
Otherwise, your JVM timezone is used.
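The session-zone dependence can be sketched with plain java.time (the render helper below is hypothetical, but mirrors how show() prints a timestamp under a given session time zone):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class SessionZoneDemo {
    // Render a UTC instant the way a timestamp is shown in the given session zone.
    static String render(String isoInstant, String zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneId.of(zone))
                .format(Instant.parse(isoInstant));
    }

    public static void main(String[] args) {
        // With the session zone set to UTC, the rendering matches the stored value.
        System.out.println(render("2019-08-11T19:33:05Z", "UTC"));
        // With a JVM default such as Europe/Amsterdam (CEST, UTC+2), it shifts.
        System.out.println(render("2019-08-11T19:33:05Z", "Europe/Amsterdam"));
    }
}
```

Prints `2019-08-11 19:33:05` for UTC but `2019-08-11 21:33:05` for Europe/Amsterdam, which is why results differ if the session time zone is left at the JVM default.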


