[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...

2017-09-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19331#discussion_r140936933
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -470,15 +471,15 @@ object PartitioningUtils {
    * Given a collection of [[Literal]]s, resolves possible type conflicts by up-casting "lower"
    * types.
    */
-  private def resolveTypeConflicts(literals: Seq[Literal]): Seq[Literal] = {
+  private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
     val desiredType = {
       val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
--- End diff --

I see. Actually, I believe this PR fixed only the (corner-case) regression 
between 2.1.0 and 2.2.0 because, if I understood correctly, we started to 
support inferring timestamps in partition columns from 2.1.0, 
[SPARK-17388](https://issues.apache.org/jira/browse/SPARK-17388), and this 
corner-case regression was introduced in 2.2.0 by 
[SPARK-18939](https://issues.apache.org/jira/browse/SPARK-18939) (not tested).

As for the issue described above, I think the same fix should also be applied 
to `DecimalType`, `DateType` and `TimestampType`, whose inference was added in 
[SPARK-17388](https://issues.apache.org/jira/browse/SPARK-17388); strictly 
speaking, that is related but orthogonal to this PR.

To be thorough, I guess we should port this logic:


https://github.com/apache/spark/blob/183d4cb71fbcbf484fc85d8621e1fe04cbbc8195/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala#L135-L150

because the problem is that `TimestampType`, `DateType` and `DecimalType` 
could easily be up-cast to other types by comparing numeric precedence, though 
with a bit of special handling, because we currently infer decimals only when 
the value has `scale <= 0` (e.g., not `1.1`) and is castable to a decimal 
before trying a double:


https://github.com/apache/spark/blob/04975a68b583a6175f93da52374108e5d4754d9a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L384-L393
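For context, the gist of that inference step is roughly the following (a paraphrased sketch, not the exact code at the link):

```scala
import scala.util.Try
import org.apache.spark.sql.catalyst.expressions.Literal

// Paraphrased sketch: a raw partition value is inferred as a decimal only
// when it parses as one with scale <= 0, so "11" may become a decimal but
// "1.1" falls through to the double (and eventually string) attempts.
def decimalTry(raw: String): Try[Literal] = Try {
  val bigDecimal = new java.math.BigDecimal(raw)
  // Reduces the decimal cases by disallowing values with a positive scale.
  require(bigDecimal.scale <= 0)
  Literal(bigDecimal)
}
```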

This could actually also be a problem between `DateType` and 
`TimestampType`, which should be up-castable (from date to timestamp).
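
Roughly, the ported logic could look something like this (a hedged sketch with an assumed helper name, not Spark's actual implementation; decimals would additionally need the `scale <= 0` handling above):

```scala
import org.apache.spark.sql.types._

// Sketch of precedence-based widening for partition column types.
def findWiderPartitionType(t1: DataType, t2: DataType): DataType = (t1, t2) match {
  case _ if t1 == t2 => t1
  case (NullType, t) => t
  case (t, NullType) => t
  // Dates can always be widened to timestamps.
  case (DateType, TimestampType) | (TimestampType, DateType) => TimestampType
  // Non-decimal numeric types widen to the one with higher precedence.
  case (n1: NumericType, n2: NumericType)
      if !n1.isInstanceOf[DecimalType] && !n2.isInstanceOf[DecimalType] =>
    val precedence =
      Seq[DataType](ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType)
    if (precedence.indexOf(n1) >= precedence.indexOf(n2)) n1 else n2
  // Partition values are parsed from path strings, so string is a safe fallback.
  case _ => StringType
}
```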

Let me take a closer look and probably make a fix soon.



---




[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...

2017-09-25 Thread squito
Github user squito commented on a diff in the pull request:

https://github.com/apache/spark/pull/19331#discussion_r140890429
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala ---
@@ -470,15 +471,15 @@ object PartitioningUtils {
    * Given a collection of [[Literal]]s, resolves possible type conflicts by up-casting "lower"
    * types.
    */
-  private def resolveTypeConflicts(literals: Seq[Literal]): Seq[Literal] = {
+  private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
     val desiredType = {
       val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
--- End diff --

thanks for such a quick fix!

but, don't we also need to update 
[`upCastingOrder`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala#L430)? 
It seems this happens to work because `upCastingOrder.indexOf(TimestampType) == -1`, 
but I think that doesn't compare the right way against `NullType`, so if you 
add this to [`ParquetPartitionDiscoverySuite.test("parse partitions")`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala#L241), 
it fails:

```
check(Seq(
    s"hdfs://host:9000/path/a=$defaultPartitionName/b=blah",
    s"hdfs://host:9000/path/a=2014-01-01 00%3A00%3A00.0/b=foo"),
  PartitionSpec(
    StructType(Seq(
      StructField("a", TimestampType),
      StructField("b", StringType))),
    Seq(
      Partition(InternalRow(null, "blah"),
        s"hdfs://host:9000/path/a=$defaultPartitionName/b=blah"),
      Partition(InternalRow(Timestamp.valueOf("2014-01-01 00:00:00.0"), "foo"),
        s"hdfs://host:9000/path/a=2014-01-01 00%3A00%3A00.0/b=foo"))))
```

(I have to admit, I don't totally understand what the ramifications of that 
failure are -- the behavior in the resulting DataFrame seems fine to me, but I 
figure there is probably some case this would mess up ...)
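
For what it's worth, here's a quick sketch of why that comparison goes wrong (the ordering below is assumed to mirror the linked `upCastingOrder`; illustrative only):

```scala
import org.apache.spark.sql.types._

// Assumed to mirror the linked upCastingOrder, which starts with NullType.
val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

// TimestampType is absent from the ordering, so indexOf returns -1, which
// sorts *below* NullType's index 0 -- maxBy then picks NullType as the
// "top" type whenever a null partition value is present.
Seq(TimestampType, NullType).maxBy(upCastingOrder.indexOf(_))  // => NullType
```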


---




[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...

2017-09-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19331


---




[GitHub] spark pull request #19331: [SPARK-22109][SQL] Resolves type conflicts betwee...

2017-09-23 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19331

[SPARK-22109][SQL] Resolves type conflicts between strings and timestamps 
in partition column

## What changes were proposed in this pull request?

This PR proposes to resolve type conflicts between strings and timestamps in 
partition column values.
It looks like we need to set the time zone, as casting between strings and 
timestamps requires one. The following reproduces the issue:

```scala
val df = Seq((1, "2015-01-01 00:00:00"), (2, "2014-01-01 00:00:00"), (3, 
"blah")).toDF("i", "str")
val path = "/tmp/test.parquet"
df.write.format("parquet").partitionBy("str").save(path)
spark.read.parquet(path).show()
```

**Before**

```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
  at org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)
  at org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$3$$anonfun$apply$16.apply(Cast.scala:208)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$3$$anonfun$apply$16.apply(Cast.scala:208)
  at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToString$3.apply(Cast.scala:207)
  at org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
  at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:331)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningUtils$$resolveTypeConflicts$1.apply(PartitioningUtils.scala:481)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$org$apache$spark$sql$execution$datasources$PartitioningUtils$$resolveTypeConflicts$1.apply(PartitioningUtils.scala:480)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
```

**After**

```
+---+-------------------+
|  i|                str|
+---+-------------------+
|  2|2014-01-01 00:00:00|
|  1|2015-01-01 00:00:00|
|  3|               blah|
+---+-------------------+
```
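
For reference, the core of the change is threading a time zone into the cast used for up-casting. A rough sketch (paraphrased around the review diff above, not the exact committed code):

```scala
import java.util.TimeZone
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types._

// The existing ordering in PartitioningUtils (assumed here for self-containment).
val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

// Sketch of the patched method: the new `timeZone` parameter is threaded
// into the Cast so the eager `eval()` no longer hits the None.get above.
def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
  val desiredType = {
    val topType = literals.map(_.dataType).maxBy(upCastingOrder.indexOf(_))
    // Falls back to string if all values of this column are null.
    if (topType == NullType) StringType else topType
  }
  literals.map { l =>
    Literal.create(Cast(l, desiredType, Some(timeZone.getID)).eval(), desiredType)
  }
}
```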

## How was this patch tested?

Unit tests added in `ParquetPartitionDiscoverySuite` and manual tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-22109

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19331.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19331


commit 946a0d93ef7baf71500e85156e097a3b9a05888c
Author: hyukjinkwon 
Date:   2017-09-23T10:24:34Z

Resolve type conflicts between strings and timestamps in partition column




---
