[jira] [Comment Edited] (SPARK-41266) Spark does not parse timestamp strings when using the IN operator
[ https://issues.apache.org/jira/browse/SPARK-41266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643706#comment-17643706 ]

huldar chen edited comment on SPARK-41266 at 12/6/22 8:11 AM:
--------------------------------------------------------------

You can try to use ANSI compliance:
{code:java}
spark.sql.ansi.enabled=true
{code}
Under the default (Hive-style) type coercion, Spark promotes both sides of the comparison all the way to StringType. Under ANSI compliance, Spark instead promotes the StringType literal to the other data type.

was (Author: huldar):
You can try to use ANSI compliance:
{code:java}
spark.sql.ansi.enabled=true
{code}

> Spark does not parse timestamp strings when using the IN operator
> -----------------------------------------------------------------
>
>                 Key: SPARK-41266
>                 URL: https://issues.apache.org/jira/browse/SPARK-41266
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>        Environment: Windows 10, Spark 3.2.1 with Java 11
>            Reporter: Laurens Versluis
>            Priority: Major
>
> Likely affects more versions, tested only with 3.2.1.
>
> Summary:
> Spark will convert a timestamp string to a timestamp when using the equal operator (=), yet won't do this when using the IN operator.
>
> Details:
> While debugging a query that returned no results, we found that when the WHERE clause combines the equal symbol `=` with a TimestampType column, Spark converts the string to a timestamp and filters on it. However, when using the IN operator (our query), it does not do so, and instead performs a cast to string. We expected the behavior to be similar, or at least that Spark would realize the IN clause operates on a TimestampType column and thus attempt a timestamp conversion before falling back to string comparison.
> *Minimal reproducible example:*
> Suppose we have a one-line dataset with the following contents and schema:
>
> {noformat}
> +-------------------+
> |starttime          |
> +-------------------+
> |2019-08-11 19:33:05|
> +-------------------+
>
> root
>  |-- starttime: timestamp (nullable = true){noformat}
> Then if we fire the following queries, we will not get results for the IN-clause one using a timestamp string with timezone information:
>
> {code:java}
> // Works - Spark casts the argument to a string and the internal representation of the time seems to match it...
> singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
> // Works
> singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
> // Works
> singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
> // Doesn't work
> singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
> // Works
> singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show(); {code}
>
> We can see from the output that a cast to string is taking place:
> {noformat}
> [...] isnotnull(starttime#59),(cast(starttime#59 as string) = 2019-08-11 19:33:05){noformat}
> Since the = operator does work, it would be consistent if operators such as the IN operator had similar behavior.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
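The coercion difference described above can be sketched outside Spark with plain java.time (an illustrative sketch, not Spark's actual coercion code; class and variable names are invented). Rendering the stored timestamp the way Spark casts TimestampType to string shows why the string comparison matches the Hive-format literal but not the ISO-8601 one:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class InCoercionSketch {
    public static void main(String[] args) {
        // The stored value; when cast to string, Spark renders a
        // TimestampType as "yyyy-MM-dd HH:mm:ss" in the session timezone
        Instant stored = Instant.parse("2019-08-11T19:33:05Z");
        String asString = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
                .withZone(ZoneOffset.UTC)
                .format(stored);

        // `=` path: the string literal is parsed as a timestamp,
        // so both literal forms denote the same instant
        System.out.println(Instant.parse("2019-08-11T19:33:05Z").equals(stored)); // true

        // IN path under default coercion: plain string comparison against the literal
        System.out.println(asString.equals("2019-08-11 19:33:05"));  // true  -> row returned
        System.out.println(asString.equals("2019-08-11T19:33:05Z")); // false -> empty result
    }
}
```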
[jira] [Comment Edited] (SPARK-41266) Spark does not parse timestamp strings when using the IN operator
[ https://issues.apache.org/jira/browse/SPARK-41266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643212#comment-17643212 ]

Laurens Versluis edited comment on SPARK-41266 at 12/5/22 10:11 AM:
--------------------------------------------------------------------

Hi [~huldar], I get a different output:
{noformat}
Input data (singleCol):
+-------------------+
|starttime          |
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
+---------+
|starttime|
+---------+
+---------+

singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+{noformat}
Make sure to set
{code:java}
.config("spark.sql.session.timeZone", "UTC")
{code}
Otherwise, your JVM timezone is used.
was (Author: JIRAUSER284645):
Hi [~huldar], I get a different output:
{noformat}
Input data (singleCol):
+-------------------+
|starttime          |
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11 19:33:05'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime = '2019-08-11T19:33:05Z'").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime IN ('2019-08-11 19:33:05')").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+

singleCol.filter("starttime IN ('2019-08-11T19:33:05Z')").show();
+---------+
|starttime|
+---------+
+---------+

singleCol.filter("starttime IN (to_timestamp('2019-08-11T19:33:05Z'))").show();
+-------------------+
|          starttime|
+-------------------+
|2019-08-11 19:33:05|
+-------------------+{noformat}
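The timezone caveat above matters because a TimestampType value is a fixed instant; which wall-clock string it renders to depends on the session timezone. A small java.time illustration (zone names chosen for the example, not taken from the issue):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class TimeZoneRendering {
    public static void main(String[] args) {
        // '2019-08-11T19:33:05Z' is one fixed instant; how it prints depends on
        // the zone, just as Spark's output depends on spark.sql.session.timeZone
        Instant t = Instant.parse("2019-08-11T19:33:05Z");
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        System.out.println(f.withZone(ZoneId.of("UTC")).format(t));              // 2019-08-11 19:33:05
        System.out.println(f.withZone(ZoneId.of("Europe/Amsterdam")).format(t)); // 2019-08-11 21:33:05
    }
}
```

With a non-UTC JVM default zone, a literal such as '2019-08-11 19:33:05' is interpreted in that zone and no longer denotes the stored instant, which is why the results differ without the UTC setting.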