Markus created SPARK-32720:
------------------------------

             Summary: Spark 3 Fails to Cast DateType to StringType when comparing result of Max
                 Key: SPARK-32720
                 URL: https://issues.apache.org/jira/browse/SPARK-32720
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Markus
When reading from Parquet with the basePath option, the swh_date partition column is inferred and read as a DateType column. I then cast it to StringType with withColumn. A where clause comparing swh_date to MAX(swh_date) in a subquery fails while casting from DateType to StringType. The same code works in Spark 2.4. The Spark 3 session is shown first, followed by the Spark 2.4 session.

{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.option("basePath", "hdfs://<hdfs_path>/kafka_log_files/parquet").parquet("hdfs://<hdfs_path>/kafka_log_files/parquet/swh_env=IMPL/log_name=prism_lens_builds/swh_date=2020-08-15/log_version=34000/swh_dc=DUB").withColumn("swh_date", col("swh_date").cast(StringType)).createOrReplaceTempView("testPrism")
20/08/26 23:04:47 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
scala> spark.sql("select swh_date from testPrism t1 where swh_date = (select MAX(t2.swh_date) from testPrism t2)")
res2: org.apache.spark.sql.DataFrame = [swh_date: string]

scala> res2.count
20/08/26 23:05:31 WARN UnsafeProjection: Expr codegen error and falling back to interpreter mode
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
	at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
	at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
	at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
	at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
	at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$6(Cast.scala:295)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$6$adapted(Cast.scala:295)
	at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285)
	at org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$5(Cast.scala:295)
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461)
	at org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.apply(InterpretedUnsafeProjection.scala:76)
	at org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:329)
	at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:113)
	at org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:926)
	at org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:914)
	at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:96)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:182)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}

{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/

Using Scala version 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> spark.read.option("basePath", "hdfs://<hdfsPath>/kafka_log_files/parquet").parquet("<hdfsPath>/kafka_log_files/parquet/swh_env=IMPL/log_name=prism_lens_builds/swh_date=2020-08-15/log_version=34000/swh_dc=DUB").withColumn("swh_date", col("swh_date").cast(StringType)).createOrReplaceTempView("testPrism")
20/08/26 16:20:13 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.

scala> spark.sql("select swh_date from testPrism t1 where swh_date = (select MAX(t2.swh_date) from testPrism t2)")
res1: org.apache.spark.sql.DataFrame = [swh_date: string]

scala> res1.count
res2: Long = 10
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
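The failure above may be reproducible without access to the original HDFS data by building the DateType column in memory, then applying the same cast and subquery. This is a hedged sketch, not a confirmed minimal repro: it assumes a spark-shell session on the affected version (where `spark` and its implicits are in scope), the view name `testRepro` and the sample dates are illustrative, and an in-memory DataFrame may not exercise the exact broadcast path that a DateType partition column does.

```scala
import java.sql.Date
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType
import spark.implicits._ // `spark` is the SparkSession provided by spark-shell

// Build an in-memory DateType column and cast it to StringType,
// mirroring the withColumn cast in the report.
val df = Seq(Date.valueOf("2020-08-15"), Date.valueOf("2020-08-14"))
  .toDF("swh_date")
  .withColumn("swh_date", col("swh_date").cast(StringType))
df.createOrReplaceTempView("testRepro")

// Same shape of query as the failing one: compare the column to the
// MAX of the same column computed in a scalar subquery.
spark.sql(
  "select swh_date from testRepro t1 " +
    "where swh_date = (select MAX(t2.swh_date) from testRepro t2)"
).count
```

If the in-memory variant does not trigger the exception, writing a small Parquet dataset partitioned by a date-valued column and re-reading it with basePath should match the original setup more closely.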