[ https://issues.apache.org/jira/browse/SPARK-32720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186196#comment-17186196 ]
Hyukjin Kwon commented on SPARK-32720: -------------------------------------- I can't reproduce it by doing below: {code} Seq(java.sql.Date.valueOf("2000-01-01")).toDF.write.parquet("/tmp/foo") val df = spark.read.parquet("/tmp/foo") df.selectExpr("cast(value as string)").createOrReplaceTempView("testView") spark.sql("select value from testView t1 where value = (select MAX(t2.value) from testView t2)").count {code} > Spark 3 Fails to Cast DateType to StringType when comparing result of Max > ------------------------------------------------------------------------- > > Key: SPARK-32720 > URL: https://issues.apache.org/jira/browse/SPARK-32720 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: Markus > Priority: Major > > When reading from parquet, using basepath option, the some_date is inferred > and read as a DateType column. I do a withColumn and cast as StringType. Then > I do a where clause comparing swh_date to Max(swh_date) where it fails on > casting from DateType to String. > However, this works in spark 2.4. First showing what I did in spark 3, then > in spark 2.4 below. > {code:java} > Welcome to > ____ __ > / __/__ ___ _____/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 3.0.0 > /_/ > Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92) > Type in expressions to have them evaluated. > Type :help for more information. > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> spark.read.option("basePath", > "hdfs://<hdfs_path>").parquet("hdfs://<hdfs_path>/some_date=2020-08-15").withColumn("some_date", > col("some_date").cast(StringType)).createOrReplaceTempView("testView") > 20/08/26 23:04:47 WARN package: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.sql.debug.maxToStringFields'. > scala> spark.sql("select some_date from testView t1 where some_date = (select > MAX(t2.some_date) from testView t2)") > res2: org.apache.spark.sql.DataFrame = [some_date: string] > scala> res2.count > 20/08/26 23:05:31 WARN UnsafeProjection: Expr codegen error and falling back > to interpreter mode > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) > at > org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) > at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) > at > org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$6(Cast.scala:295) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$6$adapted(Cast.scala:295) > at > org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:285) > at > org.apache.spark.sql.catalyst.expressions.CastBase.$anonfun$castToString$5(Cast.scala:295) > at > org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:815) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:461) > at > org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection.apply(InterpretedUnsafeProjection.scala:76) > at > org.apache.spark.sql.execution.joins.UnsafeHashedRelation$.apply(HashedRelation.scala:329) > at > org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:113) > at > org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:926) > at > org.apache.spark.sql.execution.joins.HashedRelationBroadcastMode.transform(HashedRelation.scala:914) > at > org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:96) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:182) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) {code} > {code:java} > Welcome to > ____ __ > / __/__ ___ _____/ /__ > _\ \/ _ \/ _ `/ __/ '_/ > /___/ .__/\_,_/_/ /_/\_\ version 2.4.0 > /_/ > Using Scala version 2.12.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_92) > Type in expressions to have them evaluated. > Type :help for more information. > scala> import org.apache.spark.sql.types._ > import org.apache.spark.sql.types._ > scala> spark.read.option("basePath", > "hdfs://<hdfsPath>").parquet("<hdfsPath>/some_date=2020-08-15").withColumn("some_date", > col("some_date").cast(StringType)).createOrReplaceTempView("testView") > 20/08/26 16:20:13 WARN Utils: Truncated the string representation of a plan > since it was too large. This behavior can be adjusted by setting > 'spark.debug.maxToStringFields' in SparkEnv.conf. > scala> spark.sql("select some_date from testView t1 where some_date = (select > MAX(t2.some_date) from testView t2)") > res1: org.apache.spark.sql.DataFrame = [some_date: string] > scala> res1.count > res2: Long = 10{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org