[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19331#discussion_r140936933

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---

```diff
@@ -470,15 +471,15 @@ object PartitioningUtils {
    * Given a collection of [[Literal]]s, resolves possible type conflicts by up-casting "lower"
    * types.
    */
-  private def resolveTypeConflicts(literals: Seq[Literal]): Seq[Literal] = {
+  private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
     val desiredType = {
       val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
```

--- End diff ---

I see. Actually, I believe this PR fixes only a (corner-case) regression between 2.1.0 and 2.2.0: if I understood correctly, we started to support inferring timestamps in partition columns in 2.1.0 ([SPARK-17388](https://issues.apache.org/jira/browse/SPARK-17388)), and this corner-case regression was introduced in 2.2.0 by [SPARK-18939](https://issues.apache.org/jira/browse/SPARK-18939) (not tested). As for the issue described above, I think the same treatment should also be applied to `DecimalType`, `DateType` and `TimestampType`, whose support was added in [SPARK-17388](https://issues.apache.org/jira/browse/SPARK-17388); that work is closely related but orthogonal to this fix.
To be thorough, I guess we should port this logic: https://github.com/apache/spark/blob/183d4cb71fbcbf484fc85d8621e1fe04cbbc8195/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L135-L150

The problem is that `TimestampType`, `DateType` and `DecimalType` can easily be upcast to other types by comparing numeric precedence, though a few cases need special handling, because we currently only infer decimals when `scale <= 0` (e.g., not `1.1`) and the value is castable to a decimal, before trying a double: https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L384-L393

This could also be a problem for `DateType` and `TimestampType`, which should be upcastable (from date to timestamp). Let me take a closer look and probably make a fix soon.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
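[Editor's note] The widening rule described in the comment above can be sketched in isolation. This is a hypothetical, self-contained illustration, not Spark's actual code: plain strings stand in for Spark's `DataType` objects, and the `precedence` list is an assumption modeled on the idea of `TypeCoercion`'s numeric precedence.

```scala
// Hypothetical sketch of the widening described above: pick the "widest"
// type across inferred partition-value types via a precedence list.
// Date upcasts to timestamp; any unknown type falls back to string.
object TypePrecedenceSketch {
  // Assumed precedence, lowest to highest (NOT Spark's actual upCastingOrder).
  private val precedence = Seq("NullType", "IntegerType", "LongType",
    "DecimalType", "DoubleType", "DateType", "TimestampType", "StringType")

  def widerType(types: Seq[String]): String =
    types.reduce { (a, b) =>
      val (ia, ib) = (precedence.indexOf(a), precedence.indexOf(b))
      if (ia < 0 || ib < 0) "StringType"   // unknown type: fall back to string
      else precedence(math.max(ia, ib))    // otherwise take the higher type
    }
}
```

Under this sketch, `widerType(Seq("DateType", "TimestampType"))` yields `"TimestampType"`, matching the date-to-timestamp upcast the comment asks for.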
[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19331#discussion_r140890429

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---

```diff
@@ -470,15 +471,15 @@ object PartitioningUtils {
    * Given a collection of [[Literal]]s, resolves possible type conflicts by up-casting "lower"
    * types.
    */
-  private def resolveTypeConflicts(literals: Seq[Literal]): Seq[Literal] = {
+  private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
     val desiredType = {
       val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
```

--- End diff ---

Thanks for such a quick fix! But don't we also need to update [`upCastingOrder`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L430)? It seems this happens to work only because `upCastingOrder.indexOf(TimestampType) == -1`. But I think that doesn't compare correctly against `NullType`, so if you add this to [`ParquetPartitionDiscoverySuite.test("parse partitions")`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala#L241), it fails:

```scala
check(Seq(
    s"hdfs://host:9000/path/a=$defaultPartitionName/b=blah",
    s"hdfs://host:9000/path/a=2014-01-01 00%3A00%3A00.0/b=foo"),
  PartitionSpec(
    StructType(Seq(
      StructField("a", TimestampType),
      StructField("b", StringType))),
    Seq(
      Partition(InternalRow(null, "blah"),
        s"hdfs://host:9000/path/a=$defaultPartitionName/b=blah"),
      Partition(InternalRow(Timestamp.valueOf("2014-01-01 00:00:00.0"), "foo"),
        s"hdfs://host:9000/path/a=2014-01-01 00%3A00%3A00.0/b=foo"))))
```

(I have to admit I don't totally understand the ramifications of that failure: the behavior in the resulting dataframe seems fine to me, but I figure there is probably some case this would mess up...)
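[Editor's note] The `indexOf` pitfall squito points out can be reproduced outside Spark. A minimal sketch, assuming plain strings as stand-ins for `DataType`s and a trimmed, hypothetical version of `upCastingOrder` that omits `TimestampType`, as the code under review did:

```scala
// Sketch of the pitfall: a type missing from the order gets index -1 from
// indexOf, which compares *below* every listed type, including NullType at
// index 0, so maxBy can never select the missing type.
object UpCastingOrderPitfall {
  // Hypothetical order; TimestampType is deliberately absent.
  val upCastingOrder = Seq("NullType", "IntegerType", "LongType", "DoubleType", "StringType")

  def desiredType(types: Seq[String]): String =
    types.maxBy(upCastingOrder.indexOf(_))
}
```

Here `desiredType(Seq("TimestampType", "NullType"))` returns `"NullType"` (index 0 beats index -1), which is the shape of the failure in the null-partition test case above.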
[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19331
[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19331

[SPARK-22109][SQL] Resolves type conflicts between strings and timestamps in partition column

## What changes were proposed in this pull request?

This PR proposes to resolve type conflicts between strings and timestamps in partition column values. It looks like we need to set the time zone, since the cast between strings and timestamps requires one.

```scala
val df = Seq((1, "2015-01-01 00:00:00"), (2, "2014-01-01 00:00:00"), (3, "blah")).toDF("i", "str")
val path = "/tmp/test.parquet"
df.write.format("parquet").partitionBy("str").save(path)
spark.read.parquet(path).show()
```

**Before**

```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)
  at org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$3$$anonfun$apply$16.apply(Cast.scala:208)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$3$$anonfun$apply$16.apply(Cast.scala:208)
  at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$3.apply(Cast.scala:207)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:331)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningUtils$$resolveTypeConflicts$1.apply(PartitioningUtils.scala:481)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningUtils$$resolveTypeConflicts$1.apply(PartitioningUtils.scala:480)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
```

**After**

```
+---+-------------------+
|  i|                str|
+---+-------------------+
|  2|2014-01-01 00:00:00|
|  1|2015-01-01 00:00:00|
|  3|               blah|
+---+-------------------+
```

## How was this patch tested?

Unit tests added in `ParquetPartitionDiscoverySuite` and manual tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-22109

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19331.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19331

commit 946a0d93ef7baf71500e85156e097a3b9a05888c
Author: hyukjinkwon
Date: 2017-09-23T10:24:34Z

    Resolve type conflicts between strings and timestamps in partition column
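[Editor's note] The `None.get` in the "Before" trace comes from a `Cast` evaluated without a time zone. Why the string/timestamp cast needs one can be seen outside Spark: rendering the same instant as a string depends on the chosen zone. A minimal, self-contained sketch using `java.time` (the object and method names here are illustrative, not Spark APIs):

```scala
// Minimal sketch (not Spark code): rendering a timestamp as a string is
// only well-defined once a time zone is chosen, which is why the fixed
// resolveTypeConflicts threads a TimeZone down to the Cast.
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object TimeZoneCastSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Same instant, different string per zone.
  def timestampToString(epochMillis: Long, zone: String): String =
    fmt.format(Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zone)))
}
```

For example, `timestampToString(0L, "UTC")` gives `"1970-01-01 00:00:00"` while `timestampToString(0L, "Asia/Seoul")` gives `"1970-01-01 09:00:00"`: without a zone there is no single correct answer, matching the `TimeZoneAwareExpression` failure above.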