[ https://issues.apache.org/jira/browse/SPARK-47063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kent Yao resolved SPARK-47063. ------------------------------ Fix Version/s: 3.4.3 3.5.2 4.0.0 Resolution: Fixed Issue resolved by pull request 45294 [https://github.com/apache/spark/pull/45294] > CAST long to timestamp has different behavior for codegen vs interpreted > ------------------------------------------------------------------------ > > Key: SPARK-47063 > URL: https://issues.apache.org/jira/browse/SPARK-47063 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.4.2 > Reporter: Robert Joseph Evans > Assignee: Pablo Langa Blanco > Priority: Major > Labels: pull-request-available > Fix For: 3.4.3, 3.5.2, 4.0.0 > > > It probably impacts a lot more versions of the code than this, but I verified > it on 3.4.2. This also appears to be related to > https://issues.apache.org/jira/browse/SPARK-39209 > {code:java} > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > +--------------------+-----------------------------+--------------------+ > |v |ts |unix_micros(ts) | > +--------------------+-----------------------------+--------------------+ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |1990000000 | > +--------------------+-----------------------------+--------------------+ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > +--------------------+-------------------+---------------+ > |v |ts |unix_micros(ts)| > +--------------------+-------------------+---------------+ > |9223372036854775807 |1969-12-31 23:59:59|-1000000 | > |-9223372036854775808|1970-01-01 00:00:00|0 | > |0 |1970-01-01 00:00:00|0 | > |1990 |1970-01-01 00:33:10|1990000000 | > +--------------------+-------------------+---------------+ > {code} > It looks like InMemoryTableScanExec is not doing code generation for the > expressions, but the ProjectExec after the repartition is. > If I disable code gen I get the same answer in both cases. > {code:java} > scala> spark.conf.set("spark.sql.codegen.wholeStage", false) > scala> spark.conf.set("spark.sql.codegen.factoryMode", "NO_CODEGEN") > scala> Seq(Long.MaxValue, Long.MinValue, 0L, 1990L).toDF("v").selectExpr("*", > "CAST(v AS timestamp) as ts").selectExpr("*", "unix_micros(ts)").show(false) > +--------------------+-----------------------------+--------------------+ > |v |ts |unix_micros(ts) | > +--------------------+-----------------------------+--------------------+ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |1990000000 | > +--------------------+-----------------------------+--------------------+ > scala> Seq(Long.MaxValue, Long.MinValue, 0L, > 1990L).toDF("v").repartition(1).selectExpr("*", "CAST(v AS timestamp) as > ts").selectExpr("*", "unix_micros(ts)").show(false) > +--------------------+-----------------------------+--------------------+ > |v |ts |unix_micros(ts) | > +--------------------+-----------------------------+--------------------+ > |9223372036854775807 |+294247-01-10 04:00:54.775807|9223372036854775807 | > |-9223372036854775808|-290308-12-21 19:59:05.224192|-9223372036854775808| > |0 |1970-01-01 00:00:00 |0 | > |1990 |1970-01-01 00:33:10 |1990000000 | > +--------------------+-----------------------------+--------------------+ > {code} > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L1627] > Is the code used in codegen, but > [https://github.com/apache/spark/blob/e2cd71a4cd54bbdf5af76d3edfbb2fc8c1b067b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L687] > is what is used outside of code gen. > Apparently `SECONDS.toMicros` truncates the value on an overflow, but the > codegen does not. > {code:java} > scala> Long.MaxValue > res11: Long = 9223372036854775807 > scala> java.util.concurrent.TimeUnit.SECONDS.toMicros(Long.MaxValue) > res12: Long = 9223372036854775807 > scala> Long.MaxValue * (1000L * 1000L) > res13: Long = -1000000 > {code} > Ideally these would be consistent with each other. I personally would prefer > the truncation as it feels more accurate, but I am fine either way. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org