[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-10 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r240148392
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// LocalTableScan(nodeId = 2)
+val df = Seq(1, 3, 2).toDF("id").sort('id)
+testSparkPlanMetrics(df, 2, Map.empty)
--- End diff --

This makes sense! Let me try. 


---




[GitHub] spark pull request #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of...

2018-12-10 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/23273

[SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation 
rule

## What changes were proposed in this pull request?
There are some comment issues left over from when the `ConvertToLocalRelation` rule was
added (#22205/[SPARK-25212](https://issues.apache.org/jira/browse/SPARK-25212)). This
PR fixes those comment issues.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark ConvertToLocalRelation-doc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23273.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23273


commit dfd0f71afb8d95253ea4f64d00cea53c306b6e1c
Author: seancxmao 
Date:   2018-12-10T09:41:54Z

[SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation 
rule




---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-09 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r240038550
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// LocalTableScan(nodeId = 2)
+val df = Seq(1, 3, 2).toDF("id").sort('id)
+testSparkPlanMetrics(df, 2, Map.empty)
--- End diff --

@cloud-fan This case tries to check the metrics of `SortExec`; however, these
metrics (`sortTime`, `peakMemory`, `spillSize`) change each time the query is
executed, so they are not fixed. So far, what I did was check whether `SortExec`
exists. Do you mean we should further check whether these metric names exist,
even though we can't know their values beforehand?
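For instance, a minimal sketch of checking only the metric names (assuming `SortExec` is imported in this suite and `df` is the sorted DataFrame above):

```
// Check that a SortExec node exists and exposes the expected metric names;
// the values (sortTime, peakMemory, spillSize) vary from run to run.
val sortExec = df.queryExecution.executedPlan.collectFirst { case s: SortExec => s }
assert(sortExec.isDefined)
assert(Set("sortTime", "peakMemory", "spillSize").subsetOf(sortExec.get.metrics.keySet))
```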


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-08 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r240024723
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// LocalTableScan(nodeId = 2)
+val df = Seq(1, 3, 2).toDF("id").sort('id)
+testSparkPlanMetrics(df, 2, Map.empty)
--- End diff --

@mgaido91 This case tries to check the metrics of `Sort` (nodeId=0), rather than
`LocalTableScan`. The second parameter (`2`) of `testSparkPlanMetrics(df, 2,
Map.empty)` means `expectedNumOfJobs`, not `nodeId`. The third parameter,
`expectedMetrics`, is what carries the `nodeId` together with the corresponding
expected metrics. Because the metrics of the Sort node (including `sortTime`,
`peakMemory`, `spillSize`) may change on each execution, unlike deterministic
metrics such as `numOutputRows`, we have no way to check their values.
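For illustration, a hedged sketch of how `expectedMetrics` ties a `nodeId` to a node name and fixed metric values in other cases of this suite (the node id, node name, and value here are illustrative; the parameter names are the ones mentioned above):

```
// expectedNumOfJobs counts Spark jobs; expectedMetrics maps nodeId -> (node name,
// metric name -> expected value). This only works for deterministic metrics such
// as "number of output rows", not for sortTime/peakMemory/spillSize.
testSparkPlanMetrics(df, expectedNumOfJobs = 1,
  expectedMetrics = Map(0L -> (("Filter", Map("number of output rows" -> 5L)))))
```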


---




[GitHub] spark issue #23238: [SPARK-25132][SQL][FOLLOWUP][DOC] Add migration doc for ...

2018-12-08 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/23238
  
Thank you! @dongjoon-hyun 


---




[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...

2018-12-08 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/23258
  
retest this please.


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239995952
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
--- End diff --

Using `Seq` instead of `Range` makes things simpler and more readable.
I'll change it to use `Seq`.


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239995901
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
   }
 
   test("Sort metrics") {
-// Assume the execution plan is
-// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1))
-val ds = spark.range(10).sort('id)
-testSparkPlanMetrics(ds.toDF(), 2, Map.empty)
+// Assume the execution plan with node id is
+// Sort(nodeId = 0)
+//   Exchange(nodeId = 1)
+// Range(nodeId = 2)
+val df = spark.range(9, -1, -1).sort('id).toDF()
+testSparkPlanMetrics(df, 2, Map.empty)
+
df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).getOrElse(assert(false))
--- End diff --

OK, I think this is more readable. Fixed.


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23258#discussion_r239995882
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -26,7 +26,7 @@ import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql._
 import org.apache.spark.sql.catalyst.expressions.aggregate.{Final, Partial}
 import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
-import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan, 
WholeStageCodegenExec}
+import org.apache.spark.sql.execution._
--- End diff --

Ok, unfolded.


---




[GitHub] spark pull request #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metric...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23224#discussion_r239992863
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala
 ---
@@ -80,8 +80,10 @@ class SQLMetricsSuite extends SparkFunSuite with 
SQLMetricsTestUtils with Shared
 // Assume the execution plan is
 // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Filter(nodeId = 
1))
 // TODO: update metrics in generated operators
-val ds = spark.range(10).filter('id < 5)
-testSparkPlanMetrics(ds.toDF(), 1, Map.empty)
+val df = spark.range(10).filter('id < 5).toDF()
+testSparkPlanMetrics(df, 1, Map.empty, true)
+
df.queryExecution.executedPlan.find(_.isInstanceOf[WholeStageCodegenExec])
+  .getOrElse(assert(false))
--- End diff --

Thank you @viirya. Very good suggestions.

After investigating, besides the whole-stage codegen issue, I found another one.
#20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375)
introduced an optimizer rule to eliminate redundant Sorts. For the test case named
"Sort metrics" in `SQLMetricsSuite`, because the range output is already sorted,
the Sort is removed by `RemoveRedundantSorts`, which makes that test case
meaningless. This seems to be quite a different issue, so I opened a new PR.
See #23258 for details.
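A small sketch of the symptom (assuming `spark.implicits._` is in scope, as in the suite):

```
// spark.range(10) is already sorted by id, so RemoveRedundantSorts drops the
// Sort and the executed plan contains no SortExec node at all.
val ds = spark.range(10).sort('id)
assert(ds.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).isEmpty)
```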


---




[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...

2018-12-07 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/23258

[SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing

## What changes were proposed in this pull request?
#20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375)
introduced an optimizer rule to eliminate redundant Sorts. For the test case named
"Sort metrics" in `SQLMetricsSuite`, because the range output is already sorted,
the Sort is removed by `RemoveRedundantSorts`, which makes this test case meaningless.

This PR modifies the query used for testing Sort metrics and checks that a Sort node
exists in the plan.
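Concretely, the modified test looks roughly like this (see the review diffs quoted earlier in this digest; `spark.implicits._` is assumed to be in scope):

```
// Use a local Seq so the Sort cannot be optimized away, keep the metrics call,
// and additionally assert that a SortExec node shows up in the executed plan.
val df = Seq(1, 3, 2).toDF("id").sort('id)
testSparkPlanMetrics(df, 2, Map.empty)
assert(df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).isDefined)
```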

## How was this patch tested?
Modify the existing test case.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark sort-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23258.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23258


commit 408ccf88325c23c6f2f6119cd6ba99c5056f95a2
Author: seancxmao 
Date:   2018-12-08T03:58:21Z

[SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing




---




[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...

2018-12-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/23224
  
@felixcheung Yes, that makes sense. I have added a commit to check that.


---




[GitHub] spark pull request #23238: [SPARK-25132][SQL][FOLLOWUP][DOC] Add migration d...

2018-12-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/23238#discussion_r239752238
  
--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -141,6 +141,8 @@ displayTitle: Spark SQL Upgrading Guide
 
   - In Spark version 2.3 and earlier, HAVING without GROUP BY is treated 
as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as 
`SELECT 1 FROM range(10) WHERE true`  and returns 10 rows. This violates SQL 
standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without 
GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) 
HAVING true` will return only one row. To restore the previous behavior, set 
`spark.sql.legacy.parser.havingWithoutGroupByAsWhere` to `true`.
 
+  - In version 2.3 and earlier, when reading from a Parquet data source 
table, Spark always returns null for any column whose column names in Hive 
metastore schema and Parquet schema are in different letter cases, no matter 
whether `spark.sql.caseSensitive` is set to true or false. Since 2.4, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched. This change also applies to Parquet Hive tables when 
`spark.sql.hive.convertMetastoreParquet` is set to true.
--- End diff --

@dongjoon-hyun Good suggestions. I have fixed them with a new commit.
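For reference, a hedged illustration of the behavior change described in the quoted doc text (the table and column names are placeholders):

```
// Since Spark 2.4, with spark.sql.caseSensitive=false, a Hive metastore column
// "id" is matched against a Parquet field written as "ID"; if more than one
// Parquet field matches, an exception is thrown instead of returning nulls.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.table("parquet_tbl").select("id").show()
```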


---




[GitHub] spark issue #23238: [SPARK-25132][SQL][FOLLOWUP] Add migration doc for case-...

2018-12-06 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/23238
  
@HyukjinKwon Would you please kindly take a look at this when you have time?


---




[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging

2018-12-05 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/23237
  
@HyukjinKwon Closing this PR. Thank you!


---




[GitHub] spark pull request #23237: [SPARK-26279][CORE] Remove unused method in Loggi...

2018-12-05 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/23237


---




[GitHub] spark pull request #23238: [SPARK-25132][SQL][FOLLOWUP] Add migration doc fo...

2018-12-05 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/23238

[SPARK-25132][SQL][FOLLOWUP] Add migration doc for case-insensitive field 
resolution when reading from Parquet

## What changes were proposed in this pull request?
#22148 introduced a behavior change. Following the discussion at #22184,
this PR updates the migration guide for upgrading from Spark 2.3 to 2.4.

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-doc-2.4

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23238.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23238


commit 5bbcf41f34f2ca160da7ef4ebe4c54d15a2d09b5
Author: seancxmao 
Date:   2018-12-05T15:05:38Z

[SPARK-25132][SQL][FOLLOWUP] Update migration doc for case-insensitive 
field resolution when reading from Parquet




---




[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-12-05 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@srowen Sorry for the late reply! I'd like to close this PR and file a new 
one since our SQL doc has changed a lot. Thank you all for your comments and 
time!


---




[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-12-05 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22184


---




[GitHub] spark pull request #23237: [SPARK-26279][CORE] Remove unused method in Loggi...

2018-12-05 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/23237

[SPARK-26279][CORE] Remove unused method in Logging

## What changes were proposed in this pull request?
The method `Logging.isTraceEnabled` is not used anywhere. We should remove 
it to avoid confusion.

## How was this patch tested?
Tested locally with existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark clean-logging

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23237


commit 90b111f900d8f11e4d730e0cfbe56a1683f96faa
Author: seancxmao 
Date:   2018-12-05T14:07:49Z

[SPARK-26279][CORE] Remove unused methods in Logging




---




[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...

2018-12-05 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/23224
  
@HyukjinKwon Thank you for your comments! I have filed a JIRA and updated 
the PR title accordingly.


---




[GitHub] spark pull request #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics shou...

2018-12-04 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/23224

[MINOR][SQL][TEST] WholeStageCodegen metrics should be tested with 
whole-stage codegen enabled

## What changes were proposed in this pull request?
In `org.apache.spark.sql.execution.metric.SQLMetricsSuite`, there's a test
case named "WholeStageCodegen metrics". However, it is executed with
whole-stage codegen disabled. This PR fixes that by enabling whole-stage codegen
for this test case.
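The change boils down to something like the following (mirroring the review diff quoted earlier in this digest; the extra boolean argument enables whole-stage codegen for the metrics check):

```
// Run the metrics check with whole-stage codegen enabled, then make sure a
// WholeStageCodegenExec node is actually present in the executed plan.
val df = spark.range(10).filter('id < 5).toDF()
testSparkPlanMetrics(df, 1, Map.empty, true)
assert(df.queryExecution.executedPlan
  .find(_.isInstanceOf[WholeStageCodegenExec]).isDefined)
```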

## How was this patch tested?
Tested locally using existing test cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark codegen-metrics

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23224.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23224


commit 021728ccc70cf971592c560cfc5492dedbdc362a
Author: seancxmao 
Date:   2018-12-05T06:28:02Z

[MINOR][SQL][TEST] WholeStageCodegen metrics should be tested with 
whole-stage codegen enabled




---




[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-11-11 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@HyukjinKwon Thank you for your comments. Yes, this is only valid when
upgrading from Spark 2.3 to 2.4. I will do it.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229156006
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
+```
+
+Instead, you should create `v1` as below with column aliases 
explicitly specified.
+
+```
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS 
upper_c2 FROM t1) t2;
--- End diff --

Sure, updated as well.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229155030
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * View
   * If column aliases are not specified in view definition queries, both 
Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be 
able to read views created
-by Hive, users should explicitly specify column aliases in view 
definition queries.
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

It seems Hive 1.x does not allow `(` to follow `CREATE VIEW ... AS`, while
Hive 2.x works fine. The following works on Hive 1.2.1, 1.2.2 and 2.3.3.

```
CREATE VIEW v1 AS SELECT c1 + 1, upper(c2) FROM t1;
```

Another finding is that the above view is readable by Spark even though the view
column names are weird (`_c0`, `_c1`), because Spark adds a `Project` between the
`View` and the view definition query if their output attributes do not match.

```
spark-sql> explain extended v1;
...
== Analyzed Logical Plan ==
_c0: int, _c1: string
Project [_c0#44, _c1#45]
+- SubqueryAlias v1
   +- View (`default`.`v1`, [_c0#44,_c1#45])
      +- Project [cast((c1 + 1)#48 as int) AS _c0#44, cast(upper(c2)#49 as string) AS _c1#45] // this is added by AliasViewChild rule
         +- Project [(c1#46 + 1) AS (c1 + 1)#48, upper(c2#47) AS upper(c2)#49]
            +- SubqueryAlias t1
               +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#46, c2#47]
...
```

But, if column aliases in subqueries of the view definition query are 
missing, Spark will not be able to read the view.
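As the doc change above suggests, the fix is to give every subquery output an explicit alias, for example (a sketch run from Spark here; the same statement can be run in Hive):

```
// With explicit aliases the subquery produces stable column names, so the view's
// output attributes can be resolved without relying on generated names.
spark.sql("""CREATE VIEW v1 AS
  SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2""")
```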


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229150147
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

Good ideas. I have simplified the example and tested it using Hive 2.3.3 and
Spark 2.3.1.


---




[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...

2018-10-29 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22868
  
@dongjoon-hyun Do you mean SPARK-25833? Since SPARK-24864 is resolved as 
Won't Fix, I updated type, priority and title of SPARK-25833.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r228845332
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, 
such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both 
Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be 
able to read views created
+by Hive, users should explicitly specify column aliases in view 
definition queries.
--- End diff --

Good idea. I have added an example.


---




[GitHub] spark pull request #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migrat...

2018-10-29 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22851


---




[GitHub] spark issue #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc...

2018-10-29 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22851
  
Closing this. Thank you @dongjoon-hyun @felixcheung 


---




[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...

2018-10-28 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22868
  
retest this please


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-27 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22868

[SPARK-25833][SQL][DOCS] Update migration guide for Hive view compatibility

## What changes were proposed in this pull request?
Both Spark and Hive support views. However, in some cases views created by
Hive are not readable by Spark. For example, if column aliases are not
specified in view definition queries, both Spark and Hive will generate alias
names, but in different ways. In order for Spark to be able to read views
created by Hive, users should explicitly specify column aliases in view
definition queries.

Given that it is not uncommon for Hive and Spark to be used together in
enterprise data warehouses, this PR aims to explicitly describe this
compatibility issue to help users troubleshoot it easily.

## How was this patch tested?
Docs are manually generated and checked locally.

```
SKIP_API=1 jekyll serve
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25833

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22868


commit e5b3a11c2cedcbbe528cc72d465ab6e27f5215e3
Author: seancxmao 
Date:   2018-10-28T06:46:10Z

update migration guide for hive view compatibility




---




[GitHub] spark issue #22846: [SPARK-25797][SQL][DOCS] Add migration doc for solving i...

2018-10-26 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22846
  
@cloud-fan PR for 2.3 is submitted. Please see #22851.


---




[GitHub] spark pull request #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migrat...

2018-10-26 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22851

[SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc for solving issues 
caused by view canonicalization approach change

## What changes were proposed in this pull request?
Since Spark 2.2, view definitions are stored in a different way from prior
versions. This may leave Spark unable to read views created by prior versions.
See [SPARK-25797](https://issues.apache.org/jira/browse/SPARK-25797) for more 
details.

Basically, we have 2 options.
1) Make Spark 2.2+ able to get older view definitions back. Since the 
expanded text is buggy and unusable, we have to use original text (this is 
possible with 
[SPARK-25459](https://issues.apache.org/jira/browse/SPARK-25459)). However, 
because older Spark versions don't save the current database context, we cannot
always get correct view definitions back without the view's default database.
2) Recreate the views by `ALTER VIEW AS` or `CREATE OR REPLACE VIEW AS`.

This PR aims to add a migration doc to help users troubleshoot this issue using
option 2 above.

## How was this patch tested?
N/A. 

Docs are generated and checked locally

```
cd docs
SKIP_API=1 jekyll serve --watch
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25797-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22851.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22851


commit e0e71bcf2e825a021e102e4305f6bf0fc25e8b88
Author: seancxmao 
Date:   2018-10-26T11:35:39Z

add migration doc




---




[GitHub] spark issue #22846: [SPARK-25797][SQL][DOCS] Add migration doc for solving i...

2018-10-26 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22846
  
@cloud-fan Sure, I will send a new PR for 2.3. Thank you for reviewing this.


---




[GitHub] spark issue #22846: [SPARK-25797][SQL][DOCS] Add migration doc for solving i...

2018-10-26 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22846
  
@jiangxb1987 @cloud-fan @gatorsmile Could you please kindly review this 
when you have time?


---




[GitHub] spark pull request #22846: [SPARK-25797][SQL][DOCS] Add migration doc for so...

2018-10-26 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22846

[SPARK-25797][SQL][DOCS] Add migration doc for solving issues caused by 
view canonicalization approach change

## What changes were proposed in this pull request?
Since Spark 2.2, view definitions are stored in a different way from prior
versions. This may leave Spark unable to read views created by prior versions.
See [SPARK-25797](https://issues.apache.org/jira/browse/SPARK-25797) for more 
details.

Basically, we have 2 options.
1) Make Spark 2.2+ able to get older view definitions back. Since the 
expanded text is buggy and unusable, we have to use original text (this is 
possible with 
[SPARK-25459](https://issues.apache.org/jira/browse/SPARK-25459)). However, 
because older Spark versions don't save the current database context, we cannot
always get correct view definitions back without the view's default database.
2) Recreate the views by `ALTER VIEW AS` or `CREATE OR REPLACE VIEW AS`.

This PR aims to add a migration doc to help users troubleshoot this issue using
option 2 above.
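A hedged sketch of option 2 (the view name and definition are placeholders):

```
// Recreating the view stores its definition in the post-2.2 canonical form, so
// Spark 2.2+ can read it again; any equivalent recreation of the view works.
spark.sql("CREATE OR REPLACE VIEW v1 AS SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1")
```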

## How was this patch tested?
N/A. 

Docs are generated and checked locally

```
cd docs
SKIP_API=1 jekyll serve --watch
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25797

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22846.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22846


commit a6f4d540d948a887d9dd88b5fc83d3dcf065491f
Author: seancxmao 
Date:   2018-10-26T06:08:42Z

add migration doc




---




[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...

2018-09-29 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22461
  
Retest this please.


---




[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...

2018-09-28 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22461#discussion_r221411618
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1489,7 +1489,7 @@ See the [Apache Avro Data Source 
Guide](avro-data-source-guide.html).
 
  * The JDBC driver class must be visible to the primordial class loader on 
the client session and on all executors. This is because Java's DriverManager 
class does a security check that results in it ignoring all drivers not visible 
to the primordial class loader when one goes to open a connection. One 
convenient way to do this is to modify compute_classpath.sh on all worker nodes 
to include your driver JARs.
  * Some databases, such as H2, convert all names to upper case. You'll 
need to use upper case to refer to those names in Spark SQL.
-
+ * Users can specify vendor-specific JDBC connection properties in the 
data source options to do special treatment. For example, 
`spark.read.format("jdbc").option("url", 
oracleJdbcUrl).option("oracle.jdbc.mapDateToTimestamp", "false")`. 
`oracle.jdbc.mapDateToTimestamp` defaults to true, users often need to disable 
this flag to avoid Oracle date being resolved as timestamp.
--- End diff --

@maropu @gatorsmile Could you please suggest something else to do here?


---




[GitHub] spark issue #22531: [SPARK-25415][SQL][FOLLOW-UP] Add Locale.ROOT when toUpp...

2018-09-28 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22531
  
@HyukjinKwon Please go ahead since you've already been working on this.


---




[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...

2018-09-26 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22453
  
Retest this please.


---




[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-25 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r220407692
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
+  false
+  
+This configuration indicates whether we should use legacy Parquet 
format adopted by Spark 1.4
+and prior versions or the standard format defined in parquet-format 
specification to write
+Parquet files. This is not only related to compatibility with old 
Spark ones, but also other
+systems like Hive, Impala, Presto, etc. This is especially important 
for decimals. If this
+configuration is not enabled, decimals will be written in int-based 
format in Spark 1.5 and
+above, other systems that only support legacy decimal format (fixed 
length byte array) will not
+be able to read what Spark has written. Note other systems may have 
added support for the
+standard format in more recent versions, which will make this 
configuration unnecessary. Please
--- End diff --

If we must call it "legacy", I'd think of it as a legacy implementation on the
Spark side, rather than a legacy format on the Parquet side.
As comment in 
[SPARK-20297](https://issues.apache.org/jira/browse/SPARK-20297?focusedCommentId=15975559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15975559)
> The standard doesn't say that smaller decimals have to be stored in 
int32/int64, it just is an option for subset of decimal types. int32 and int64 
are valid representations for a subset of decimal types. fixed_len_byte_array 
and binary are a valid representation of any decimal type.
> 
>The int32/int64 options were present in the original version of the 
decimal spec, they just weren't widely implemented. So its not a new/old 
version thing, it was just an alternative representation that many systems 
didn't implement.

Anyway, it really leads to confusion.

I really appreciate your suggestion @srowen to make the doc shorter; the doc
you suggested is more concise and to the point.

One more thing I want to discuss: after investigating the usage of this
option, I found it is related not only to decimals but also to complex
types (Array, Map); see the source code below. Should we mention this in the doc?


https://github.com/apache/spark/blob/473d0d862de54ec1c7a8f0354fa5e06f3d66e455/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L450-L458



---




[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-24 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r220042478
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
+  false
+  
+This configuration indicates whether we should use legacy Parquet 
format adopted by Spark 1.4
+and prior versions or the standard format defined in parquet-format 
specification to write
+Parquet files. This is not only related to compatibility with old 
Spark ones, but also other
+systems like Hive, Impala, Presto, etc. This is especially important 
for decimals. If this
+configuration is not enabled, decimals will be written in int-based 
format in Spark 1.5 and
+above, other systems that only support legacy decimal format (fixed 
length byte array) will not
+be able to read what Spark has written. Note other systems may have 
added support for the
+standard format in more recent versions, which will make this 
configuration unnecessary. Please
--- End diff --

Thanks for your suggestion. I have updated the doc in SQLConf.


---




[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-24 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r220038438
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
+  false
+  
+This configuration indicates whether we should use legacy Parquet 
format adopted by Spark 1.4
+and prior versions or the standard format defined in parquet-format 
specification to write
+Parquet files. This is not only related to compatibility with old 
Spark ones, but also other
+systems like Hive, Impala, Presto, etc. This is especially important 
for decimals. If this
+configuration is not enabled, decimals will be written in int-based 
format in Spark 1.5 and
+above, other systems that only support legacy decimal format (fixed 
length byte array) will not
+be able to read what Spark has written. Note other systems may have 
added support for the
+standard format in more recent versions, which will make this 
configuration unnecessary. Please
--- End diff --

Hive and Impala do NOT support the new Parquet format yet.

* [HIVE-19069](https://jira.apache.org/jira/browse/HIVE-19069): Hive can't 
read int32 and int64 Parquet decimal. This issue is not resolved yet. This is 
consistent with source code check by @HyukjinKwon 
* [IMPALA-5542](https://issues.apache.org/jira/browse/IMPALA-5542): Impala 
cannot scan Parquet decimal stored as int64_t/int32_t. This is resolved, but it is
targeted at Impala 3.1.0, which has not been released yet. The
latest release is 3.0.0 (https://impala.apache.org/downloads.html).

Presto has supported the new Parquet format since 0.182.

* [issues/7533](https://github.com/prestodb/presto/issues/7533): Improve 
decimal type support in the new Parquet reader. This patch is included in
[0.182](https://prestodb.io/docs/current/release/release-0.182.html). Below is
the excerpt:

> Fix reading decimal values in the optimized Parquet reader when they are 
backed by the int32 or int64 types.



---




[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...

2018-09-23 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22453
  
FYI, I did a brief survey of Parquet decimal support across computing engines
at the time of writing.

Hive
* [HIVE-19069](https://jira.apache.org/jira/browse/HIVE-19069) Hive can't 
read int32 and int64 Parquet decimal. Not resolved yet.

Impala:
* [IMPALA-5628](https://issues.apache.org/jira/browse/IMPALA-5628) Parquet 
support for additional valid decimal representations. This is an umbrella JIRA.
* [IMPALA-2494](https://issues.apache.org/jira/browse/IMPALA-2494) Impala 
Unable to scan a Decimal column stored as Bytes. Fix Version/s: Impala 2.11.0.
* [IMPALA-5542](https://issues.apache.org/jira/browse/IMPALA-5542) Impala 
cannot scan Parquet decimal stored as int64_t/int32_t. Fix Version/s: Impala 
3.1.0, not released yet.

Presto

* [issues/7232](https://github.com/prestodb/presto/issues/7232). Can't read 
decimal type in parquet files written by spark and referenced as external in 
the hive metastore
* [issues/7533](https://github.com/prestodb/presto/issues/7533). Improve 
decimal type support in the new Parquet reader. Fixed Version: 
[0.182](https://prestodb.io/docs/current/release/release-0.182.html)


---




[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-23 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r219729166
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,15 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
--- End diff --

OK, I will update the doc and describe scenarios and reasons why we need 
this flag.


---




[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-23 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22453#discussion_r219721110
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1002,6 +1002,15 @@ Configuration of Parquet can be done using the 
`setConf` method on `SparkSession
 
   
 
+
+  spark.sql.parquet.writeLegacyFormat
--- End diff --

I'd like to add my 2 cents. We use both Spark and Hive in our Hadoop/Spark
clusters, and we have two types of tables: working tables and target tables.
Working tables are only used by Spark jobs, while target tables are populated
by Spark and exposed to downstream jobs, including Hive jobs. Our data engineers
frequently run into this issue when they use Hive to read target tables.
In the end, we decided to set spark.sql.parquet.writeLegacyFormat=true as the
default for target tables and to describe this explicitly in our internal
developer guide.
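For illustration, the setting described here (the DataFrame and write path are placeholders):

```
// Write Parquet in the legacy format so engines that only understand the
// fixed_len_byte_array decimal representation (e.g. older Hive/Impala) can read it.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
df.write.parquet("/warehouse/target_table")  // placeholder path
```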


---




[GitHub] spark pull request #22499: [SPARK-25489][ML][TEST] Refactor UDTSerialization...

2018-09-23 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22499#discussion_r219698800
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala
 ---
@@ -18,52 +18,52 @@
 package org.apache.spark.mllib.linalg
 
 import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
-import org.apache.spark.util.Benchmark
+import org.apache.spark.util.{Benchmark, BenchmarkBase => 
FileBenchmarkBase}
 
 /**
  * Serialization benchmark for VectorUDT.
+ * To run this benchmark:
+ * 1. without sbt: bin/spark-submit --class  
--- End diff --

I have rebased this PR onto the latest master and also fixed the docs. I am
away for the Mid-Autumn Festival, sorry for the late reply. BTW, happy
Mid-Autumn Festival!


---




[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...

2018-09-21 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22461#discussion_r219539062
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1287,8 +1287,18 @@ bin/spark-shell --driver-class-path 
postgresql-9.4.1207.jar --jars postgresql-9.
 Tables from the remote database can be loaded as a DataFrame or Spark SQL 
temporary view using
 the Data Sources API. Users can specify the JDBC connection properties in 
the data source options.
 user and password are normally provided as 
connection properties for
-logging into the data sources. In addition to the connection properties, 
Spark also supports
-the following case-insensitive options:
+logging into the data sources. Vendor-specific connection properties can 
also be passed to the
+underlying JDBC driver in the same way. For example:
+
+{% highlight scala %}
+// oracle.jdbc.mapDateToTimestamp defaults to true. If this flag is not 
disabled, a column of Oracle
+// DATE type will be resolved as Catalyst TimestampType, which is probably 
not the desired behavior.
+spark.read.format("jdbc")
+  .option("url", oracleJdbcUrl)
+  .option("oracle.jdbc.mapDateToTimestamp", "false")
+{% endhighlight %}
+
--- End diff --

I have moved this description to the `Troubleshooting` section and also tried to
brush it up. Writing good documentation is sometimes more difficult than writing
code. I really need your help :)


---




[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...

2018-09-21 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22461#discussion_r219537942
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -462,6 +468,12 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
   .option("lowerBound", "2018-07-04 03:30:00.0")
   .option("upperBound", "2018-07-27 14:11:05.0")
   .option("numPartitions", 2)
+  // oracle.jdbc.mapDateToTimestamp defaults to true. If this flag is 
not disabled, column d
+  // (Oracle DATE) will be resolved as Catalyst Timestamp, which will 
fail test result
+  // comparison. E.g. Expected 2018-07-06 while Actual 2018-07-06 
00:00:00.0.
--- End diff --

OK, I will remove the duplicate description.


---




[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...

2018-09-21 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22461#discussion_r219403696
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -462,6 +464,9 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
   .option("lowerBound", "2018-07-04 03:30:00.0")
   .option("upperBound", "2018-07-27 14:11:05.0")
   .option("numPartitions", 2)
+  .option("oracle.jdbc.mapDateToTimestamp", "false")
--- End diff --

Yes, we need this. Otherwise, Spark will read column `d` values as Catalyst 
type timestamp, which will fail the test.

https://user-images.githubusercontent.com/12194089/45865915-9e730800-bdb1-11e8-9a42-a1394c601166.png
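For context, a hedged sketch of the read discussed here (the URL and table name are placeholders, mirroring the doc example quoted earlier in this digest):

```
// Without disabling oracle.jdbc.mapDateToTimestamp, an Oracle DATE column is
// resolved as Catalyst TimestampType (e.g. 2018-07-06 00:00:00.0 vs 2018-07-06).
val jdbcDF = spark.read.format("jdbc")
  .option("url", oracleJdbcUrl)
  .option("dbtable", "datetimePartitionTest")
  .option("oracle.jdbc.mapDateToTimestamp", "false")
  .load()
```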



---




[GitHub] spark issue #22497: [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmar...

2018-09-20 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22497
  
@kiszk @wangyum Thank you!


---




[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...

2018-09-20 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22461#discussion_r219386919
  
--- Diff: 
external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala
 ---
@@ -442,6 +442,8 @@ class OracleIntegrationSuite extends 
DockerJDBCIntegrationSuite with SharedSQLCo
   .option("lowerBound", "2018-07-06")
   .option("upperBound", "2018-07-20")
   .option("numPartitions", 3)
+  .option("oracle.jdbc.mapDateToTimestamp", "false")
--- End diff --

ok. I will add notes to 
http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases,
 and will also add comments to the code.


---




[GitHub] spark pull request #22499: [SPARK-25489][ML][TEST] Refactor UDTSerialization...

2018-09-20 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22499

[SPARK-25489][ML][TEST] Refactor UDTSerializationBenchmark

## What changes were proposed in this pull request?
Refactor `UDTSerializationBenchmark` to use a main method and print the output
to a separate file.

Run the command below to generate the benchmark results:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "mllib/test:runMain 
org.apache.spark.mllib.linalg.UDTSerializationBenchmark"
```

## How was this patch tested?
Manual tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25489

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22499.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22499


commit e43dda9c279fdffa3293e3ef40897cf06fb7cbfa
Author: seancxmao 
Date:   2018-09-20T15:32:17Z

[SPARK-25489] Refactor UDTSerializationBenchmark




---




[GitHub] spark pull request #22497: [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayB...

2018-09-20 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22497

[SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark

## What changes were proposed in this pull request?
Refactor `PrimitiveArrayBenchmark` to use a main method and print the output
to a separate file.

Run the command below to generate the benchmark results:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain 
org.apache.spark.sql.execution.benchmark.PrimitiveArrayBenchmark"
```

## How was this patch tested?
Manual tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25487

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22497


commit 631df4b1944c73a4c63b7dcb9873358671a07baf
Author: seancxmao 
Date:   2018-09-20T14:48:21Z

[SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark




---




[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...

2018-09-19 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22461
  
I tested in my mac, following guidance of 
http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites.

```
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl 
:spark-docker-integration-tests_2.11
```

The results are as blow:

https://user-images.githubusercontent.com/12194089/45753213-a4020e00-bc4a-11e8-897f-111fc9dbaba5.png";>


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...

2018-09-19 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22461
  
Jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...

2018-09-18 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22461
  
@gatorsmile Thanks a lot!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...

2018-09-18 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22461
  
Jenkins, retest this please.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgum...

2018-09-18 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22461

[SPARK-25453] OracleIntegrationSuite IllegalArgumentException: Timestamp 
format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

## What changes were proposed in this pull request?
This PR aims to fix the failing test in `OracleIntegrationSuite`.

## How was this patch tested?
Existing integration tests.
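
For context, the underlying failure can be reproduced in isolation; this is the standard JDK behavior of `java.sql.Timestamp.valueOf`, which only accepts full timestamp literals:

```
scala> java.sql.Timestamp.valueOf("2018-07-06")
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

scala> java.sql.Timestamp.valueOf("2018-07-06 00:00:00")
res1: java.sql.Timestamp = 2018-07-06 00:00:00.0
```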

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25453

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22461.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22461


commit fb7b9d22812f789b970f8dae8a9cd3763c0eccff
Author: seancxmao 
Date:   2018-09-19T03:18:38Z

[SPARK-25453] OracleIntegrationSuite IllegalArgumentException: Timestamp 
format must be yyyy-mm-dd hh:mm:ss[.fffffffff]




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...

2018-09-18 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22453
  
@HyukjinKwon Could you please help review this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...

2018-09-18 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22453

[SPARK-20937][DOCS] Describe spark.sql.parquet.writeLegacyFormat property 
in Spark SQL, DataFrames and Datasets Guide

## What changes were proposed in this pull request?
Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, 
DataFrames and Datasets Guide.
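
For example, a minimal usage sketch (the path is a placeholder); with the legacy format, decimals are written in the fixed-length byte array layout that older consumers such as Hive and Impala expect:

```
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")
spark.range(10).selectExpr("cast(id as decimal(10, 2)) as amount")
  .write.mode("overwrite").parquet("/tmp/legacy_parquet")   // placeholder path
```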

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-20937

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22453.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22453


commit 3af33a31f528059b5f4a66e8ba10bf945eb6fa53
Author: seancxmao 
Date:   2018-09-18T14:32:18Z

[SPARK-20937][DOCS] Describe spark.sql.parquet.writeLegacyFormat property 
in Spark SQL, DataFrames and Datasets Guide




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

2018-09-15 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22343


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-15 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Sure, close this PR. Thank you all for your time and insights.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-12 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
I agree that correctness is more important. If we should not make behaviors 
consistent when doing the conversion, I will close this PR. @cloud-fan @gatorsmile 
what do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Setting spark.sql.hive.convertMetastoreParquet=false keeps Hive compatibility 
but loses the performance benefit of the conversion. We can do better by enabling 
the conversion while still keeping Hive compatibility. Though this makes our 
implementation more complex, I guess most end users keep the default values 
`spark.sql.hive.convertMetastoreParquet=true` and `spark.sql.caseSensitive=false`, 
so this brings benefits to end users.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-11 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Could we see this as a behavior change? We can add a legacy conf (e.g. 
`spark.sql.hive.legacy.convertMetastoreParquet`, possibly defined in HiveUtils) 
to let users revert to the previous behavior for backward compatibility. If this 
legacy conf is set to true, behaviors are reverted in both case-sensitive and 
case-insensitive mode. A rough declaration sketch follows the table below.

|caseSensitive|legacy behavior|new behavior|
|-|-|-|
|true|convert anyway|skip conversion, log warning message|
|false|convert, fail if there's ambiguity|convert, first match if there's ambiguity|
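
A rough sketch of how the flag might be declared (assuming `buildConf` is in scope, as it is for the existing confs in HiveUtils; nothing here is merged code):

```
val HIVE_LEGACY_CONVERT_METASTORE_PARQUET =
  buildConf("spark.sql.hive.legacy.convertMetastoreParquet")
    .doc("When true, revert to the previous (pre-SPARK-25132) behavior when " +
      "converting Hive metastore Parquet tables.")
    .booleanConf
    .createWithDefault(false)
```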


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
@dongjoon-hyun It is a little complicated. There has been a discussion 
about this in #22184. Below are some key comments from @cloud-fan and 
@gatorsmile, just FYI.

* https://github.com/apache/spark/pull/22184#discussion_r212834477
* https://github.com/apache/spark/pull/22184#issuecomment-416885728


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-10 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
Hi, @dongjoon-hyun
When we find duplicated field names in the case of convertMetastoreXXX, we 
have 2 options:
(1) Raise an exception, as the parquet data source does. Most end users do not 
know the difference between a hive parquet table and the parquet data source. If 
the conversion leads to different behaviors, they may be confused, and in some 
cases this can even cause subtle data issues silently.
(2) Adjust the behavior of the parquet data source to keep behaviors consistent. 
This seems friendlier to end users and avoids any potential issues introduced by 
the conversion. A small illustration follows below.

BTW, for a parquet data source table that is not converted from a hive parquet 
table, we raise an exception when there is ambiguity, since this is more intuitive 
and reasonable.
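
To make the ambiguity concrete, a small sketch (paths are placeholders): the parquet files contain both `id` and `ID`, while the requested schema has a single lower-case `id`.

```
spark.conf.set("spark.sql.caseSensitive", "true")
spark.range(5).selectExpr("id", "id * 2 as ID")
  .write.mode("overwrite").parquet("/tmp/dup_fields")       // placeholder path

spark.conf.set("spark.sql.caseSensitive", "false")
// With #22148 this read is expected to fail because both "id" and "ID" match
// the requested field; a hive parquet table over the same files would instead
// silently return the first matched field.
spark.read.schema("id LONG").parquet("/tmp/dup_fields").show()
```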


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-09-09 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@cloud-fan @gatorsmile I think the old `Upgrading From Spark SQL 2.3.1 to 
2.3.2 and above` is not needed since we do not backport SPARK-25132 to 
branch-2.3. I'm wondering if we need `Upgrading From Spark SQL 2.3 to 2.4 and 
above`. What do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...

2018-09-09 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22343#discussion_r216212552
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala
 ---
@@ -1390,7 +1395,11 @@ class ParquetSchemaSuite extends ParquetSchemaTest {
 catalystSchema = new StructType(),
 
 expectedSchema = ParquetSchemaConverter.EMPTY_MESSAGE,
-caseSensitive = true)
+conf = {
+  val conf = new Configuration()
+  conf.setBoolean(SQLConf.CASE_SENSITIVE.key, true)
--- End diff --

@cloud-fan There is no default value for `spark.sql.caseSensitive` in 
Configuration. Let me explain in more detail below.

This is one of the overloaded methods of `testSchemaClipping`. I tried to 
give this `testSchemaClipping` method a default conf, however Scalac complains 
that 
> in class ParquetSchemaSuite, multiple overloaded alternatives of 
testSchemaClipping define default arguments


https://github.com/apache/spark/blob/95673cdacb1da65d7ac45c212a365e167ae0d713/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L1014-L1019


https://github.com/apache/spark/blob/95673cdacb1da65d7ac45c212a365e167ae0d713/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L1028-L1033

It seems a little confusing, because these two methods have different 
parameter types. After a brief investigation, I found that the Scala compiler 
simply disallows default arguments on overloaded methods, even when the methods 
have different parameter types.


https://stackoverflow.com/questions/4652095/why-does-the-scala-compiler-disallow-overloaded-methods-with-default-arguments
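
A stripped-down illustration of the limitation and the workaround (hypothetical names, not the real test code):

```
object OverloadSketch {
  // These two would NOT compile together:
  //   def clip(schema: String, caseSensitive: Boolean = true): String = ...
  //   def clip(schema: Seq[String], caseSensitive: Boolean = true): String = ...
  // error: multiple overloaded alternatives of method clip define default arguments

  // Workaround: drop the defaults and pass the flag (or a Configuration) explicitly.
  def clip(schema: String, caseSensitive: Boolean): String = schema
  def clip(schema: Seq[String], caseSensitive: Boolean): String = schema.mkString(",")
}
```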


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...

2018-09-09 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
@dongjoon-hyun @HyukjinKwon I created a new JIRA ticket and try to use a 
more complete and clear title for this PR. What do you think?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-09 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
@dongjoon-hyun Thank you!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
> ... we need this duplication check in case-sensitive mode ...

Do you mean we may define an ORC/Parquet schema with identical field names 
(even in the same letter case)? Could you please explain a bit more about this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...

2018-09-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22343
  
@HyukjinKwon @cloud-fan @gatorsmile Could you please kindly help review 
this if you have time?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...

2018-09-07 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22262#discussion_r216121510
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala
 ---
@@ -116,6 +116,14 @@ object OrcUtils extends Logging {
 })
   } else {
 val resolver = if (isCaseSensitive) caseSensitiveResolution else 
caseInsensitiveResolution
+// Need to fail if there is ambiguity, i.e. more than one field is 
matched.
+requiredSchema.fieldNames.foreach { requiredFieldName =>
+  val matchedOrcFields = orcFieldNames.filter(resolver(_, 
requiredFieldName))
--- End diff --

That's all right :)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-07 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
@dongjoon-hyun That's all right :). I have reverted to the first commit and 
adjusted the indentation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-06 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
I updated the PR description. Thank you for pointing out that the PR description 
should stay focused. I also think it's clearer now.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-09-05 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
@dongjoon-hyun I have updated the PR description to explain in more detail. As 
you mentioned, this PR is specific to the case of reading from a data source 
table persisted in the metastore.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-09-05 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@cloud-fan I've just sent a PR (#22343) for this.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must b...

2018-09-05 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22343

[SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the 
conversion

## What changes were proposed in this pull request?
Parquet data source tables and hive parquet tables have different behaviors for 
parquet field resolution. So, when `spark.sql.hive.convertMetastoreParquet` is true, 
users might face inconsistent behaviors. The differences are:
* Whether spark.sql.caseSensitive is respected. Without #22148, neither data source 
tables nor hive tables respect spark.sql.caseSensitive; data source tables always do 
case-sensitive parquet field resolution, while hive tables always do case-insensitive 
parquet field resolution, no matter whether spark.sql.caseSensitive is set to true or 
false. #22148 lets data source tables respect spark.sql.caseSensitive, while hive 
serde table behavior is not changed.
* How ambiguity is resolved in case-insensitive mode. Without #22148, data source 
tables do case-sensitive resolution and return columns with the corresponding letter 
cases, while hive tables always return the first matched column, ignoring case. 
#22148 lets data source tables throw an exception when there is ambiguity, while 
hive table behavior is not changed.

This PR aims to make behaviors consistent when converting a hive table to a data 
source table (a minimal sketch of the intended rule follows below):
* The behavior must be consistent to do the conversion, so we skip the conversion 
in case-sensitive mode because hive parquet tables always do case-insensitive field 
resolution.
* When converting a hive parquet table to the parquet data source, we switch the 
duplicated-field resolution mode so that the parquet data source picks the first 
matched field (the same behavior as hive parquet tables) to keep behaviors 
consistent.
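
A minimal sketch of the intended rule, just to make it concrete (illustrative only, not the actual HiveStrategies/HiveMetastoreCatalog code):

```
val caseSensitive = spark.conf.get("spark.sql.caseSensitive").toBoolean
val convertEnabled = spark.conf.get("spark.sql.hive.convertMetastoreParquet").toBoolean

// 1. Skip the conversion entirely in case-sensitive mode, since the hive serde
//    path is always case-insensitive and the two paths could not agree.
val doConvert = convertEnabled && !caseSensitive

// 2. When converting, ask the parquet data source to resolve duplicate fields
//    by first match (a hypothetical knob, mirroring the hive serde behavior).
val duplicatedFieldsMode = if (doConvert) "FIRST_MATCH" else "FAIL"
```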

## How was this patch tested?
Unit tests added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-CONSISTENCY

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22343.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22343


commit 95673cdacb1da65d7ac45c212a365e167ae0d713
Author: seancxmao 
Date:   2018-09-04T08:00:31Z

[SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the 
conversion




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22183: [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive...

2018-09-04 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22183


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-08-29 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
> My proposal is, parquet data source should provide an option(not SQL 
conf) to ...

You mentioned this option is not a SQL conf. Could you give me some advice 
about where this option should be defined? I had thought to define it in SQLConf 
as something like `spark.sql.parquet.onDuplicatedFields` = FAIL, FIRST_MATCH, 
since I see a bunch of options starting with `spark.sql.parquet` in SQLConf.
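
Just to illustrate the difference I have in mind (the reader option name below is hypothetical and does not exist in Spark): a data source option travels with a single read, while a SQL conf is session-wide.

```
val df = spark.read
  .option("onDuplicatedFields", "FIRST_MATCH")   // hypothetical per-read option
  .parquet("/tmp/some_parquet")                  // placeholder path

spark.conf.set("spark.sql.caseSensitive", "false")   // an existing session-level SQL conf
```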


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-08-29 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@cloud-fan OK, I will do it. 

Just to confirm: when reading from a hive parquet table, if 
`spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both 
set to true, we throw an exception to tell users they should not do this 
because it could lead to inconsistent results. Is my understanding correct?

Another thing to confirm: when there is ambiguity in case-insensitive mode, the 
native parquet reader throws an exception while the hive serde reader returns 
the first matched field, which is not consistent. Is that OK?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...

2018-08-28 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22262
  
@dongjoon-hyun @cloud-fan @gatorsmile Could you please kindly review this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...

2018-08-28 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22262

[SPARK-25175][SQL] Field resolution should fail if there is ambiguity for 
ORC native reader

## What changes were proposed in this pull request?
This PR aims to make ORC data source native implementation consistent with 
Parquet data source. The gap is that field resolution should fail if there is 
ambiguity in case-insensitive mode when reading from ORC.

See #22148 for more details about parquet data source reader.

## How was this patch tested?
Unit tests added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25175

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22262.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22262


commit 366bb35ad62edb8e6707e65c681a5b9001cc868e
Author: seancxmao 
Date:   2018-08-17T10:06:28Z

[SPARK-25175][SQL] Field resolution should fail if there's ambiguity for 
ORC native reader




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-27 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r213020789
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

As a follow-up to cloud-fan's point, I did a deep dive into the read path of 
parquet hive serde tables. The following is a rough invocation chain:

```
org.apache.spark.sql.hive.execution.HiveTableScanExec
org.apache.spark.sql.hive.HadoopTableReader (extends 
org.apache.spark.sql.hive.TableReader)
org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat (extends 
org.apache.hadoop.mapred.FileInputFormat)
org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper 
(extends org.apache.hadoop.mapred.RecordReader)
parquet.hadoop.ParquetRecordReader
parquet.hadoop.InternalParquetRecordReader
org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport (extends 
parquet.hadoop.api.ReadSupport)
```

Finally, `DataWritableReadSupport#getFieldTypeIgnoreCase` is invoked. 


https://github.com/JoshRosen/hive/blob/release-1.2.1-spark2/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L79-L95

This is why parquet hive serde tables always do case-insensitive field 
resolution. However, this class lives inside 
`org.spark-project.hive:hive-exec:1.2.1.spark2`.

I also found the related Hive JIRA ticket:
[HIVE-7554: Parquet Hive should resolve column names in case insensitive 
manner](https://issues.apache.org/jira/browse/HIVE-7554)

BTW:
* org.apache.hadoop.hive.ql = org.spark-project.hive:hive-exec:1.2.1.spark2
* parquet.hadoop = com.twitter:parquet-hadoop-bundle:1.6.0
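
For reference, what that lookup amounts to, paraphrased in Scala (not the verbatim Hive code): the first case-insensitive match wins, regardless of spark.sql.caseSensitive.

```
def getFieldTypeIgnoreCase(fileFields: Seq[(String, String)], wanted: String): Option[String] =
  fileFields.collectFirst { case (name, tpe) if name.equalsIgnoreCase(wanted) => tpe }
```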


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-27 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212894532
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

As a follow-up, I also investigated ORC. Below are some results, just FYI.

* 
https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185
* 
https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593194&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593194


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-23 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22184#discussion_r212405373
  
--- Diff: docs/sql-programming-guide.md ---
@@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
   - Since Spark 2.4, File listing for compute statistics is done in 
parallel by default. This can be disabled by setting 
`spark.sql.parallelFileListingInStatsComputation.enabled` to `False`.
   - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and 
temporary files are not counted as data files when calculating table size 
during Statistics computation.
 
+## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above
+
+  - In version 2.3.1 and earlier, when reading from a Parquet table, Spark 
always returns null for any column whose column names in Hive metastore schema 
and Parquet schema are in different letter cases, no matter whether 
`spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when 
`spark.sql.caseSensitive` is set to false, Spark does case insensitive column 
name resolution between Hive metastore schema and Parquet schema, so even 
column names are in different letter cases, Spark returns corresponding column 
values. An exception is thrown if there is ambiguity, i.e. more than one 
Parquet column is matched.
--- End diff --

Following your advice, I did a thorough comparison between data source 
tables and hive serde tables.

Parquet data and tables are created via the following code:

```
val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as 
c", "id * 4 as C")
spark.conf.set("spark.sql.caseSensitive", true)

data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data")

CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING 
parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING 
parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS 
parquet LOCATION '/user/hive/warehouse/parquet_data'
CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS 
parquet LOCATION '/user/hive/warehouse/parquet_data'
```
spark.sql.hive.convertMetastoreParquet is set to false:

```
spark.conf.set("spark.sql.hive.convertMetastoreParquet", false)
```

Below are the comparison results both without #22148 and with #22148.

The comparison result without #22148:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|resolved by SPARK-25132|
|-|-|-|-|-|-|-|-|
|1|true|a, b, c|a| a|a |Y | |
|2| | |b|null|B|NG| |
|3| | |c|c |c |Y | |
|4| | |A|AnalysisException|AnalysisException|Y | |
|5| | |B|AnalysisException|AnalysisException|Y | |
|6| | |C|AnalysisException|AnalysisException|Y | |
|7| |A, B, C|a|AnalysisException |AnalysisException|Y | |
|8| | |b|AnalysisException |AnalysisException |Y | |
|9| | |c|AnalysisException |AnalysisException |Y | |
|10| | |A|null |a |NG | |
|11| | |B|B |B|Y | |
|12| | |C|C |c |NG | |
|13|false|a, b, c|a|a |a |Y | |
|14| | |b|null |B |NG |Y|
|15| | |c|c |c |Y | |
|16| | |A|a |a |Y | |
|17| | |B|null |B |NG |Y|
|18| | |C|c |c |Y | |
|19| |A, B, C|a|null |a |NG |Y|
|20| | |b|B |B |Y | |
|21| | |c|C |c |NG | |
|22| | |A|null |a |NG |Y|
|23| | |B|B |B |Y | |
|24| | |C|C |c |NG | |

The comparison result with #22148 applied:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|introduced by SPARK-25132|
|-|-|-|-|-|-|-|-|
|1|true|a, b, c|a|a |a |Y | |
|2| | |b|null |B |NG | |
|3| | |c|c |c |Y | |
|4| | |A|AnalysisException |AnalysisException |Y | |
|5| | |B|AnalysisException |AnalysisException |Y | |
|6| | |C|AnalysisException |AnalysisException |Y | |
|7| |A, B, C|a|AnalysisException |AnalysisException |Y | |
|8| | |b|AnalysisException |AnalysisException |Y | |
|9| | |c|AnalysisException |AnalysisException |Y | |
|10| | |A|null |a |NG | |
|11| | |B|B |B |Y | |
|12| | |C|C |c |NG | |
|13|false|a, b, c|a|a |a |Y | |
|14| | |b|B |B |Y | |
|15| | |c|RuntimeException |c |NG |Y|
|16| | |A|a |a |Y | |
|17| | |B

[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...

2018-08-22 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22184
  
@gatorsmile Could you kindly help trigger Jenkins and review?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...

2018-08-22 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22184

[SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field 
resolution when reading from Parquet

## What changes were proposed in this pull request?
#22148 introduces a behavior change. We need to document it in the 
migration guide.

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-DOC

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22184.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22184


commit eae8a3c98f146765d25bbf529421ce3c7a92639b
Author: seancxmao 
Date:   2018-08-22T09:17:55Z

[SPARK-25132][SQL][DOC] Case-insensitive field resolution when reading from 
Parquet




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22183: [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive...

2018-08-21 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22183

[SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive field resolution when 
reading from Parquet

## What changes were proposed in this pull request?
This is a backport of https://github.com/apache/spark/pull/22148

Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, regardless of whether spark.sql.caseSensitive 
is set to true or false. This PR aims to add case-insensitive field resolution for 
ParquetFileFormat.
* Do case-insensitive resolution only if Spark is in case-insensitive mode.
* Field resolution should fail if there is ambiguity, i.e. more than one 
field is matched.

## How was this patch tested?
Unit tests added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-2.3

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22183


commit 28315888eaae5a9c9160ea53eb6eb9a9af712958
Author: seancxmao 
Date:   2018-08-21T02:34:23Z

[SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive field resolution when 
reading from Parquet




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22148: [SPARK-25132][SQL] Case-insensitive field resolution whe...

2018-08-20 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22148
  
Thanks!


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22148: [SPARK-25132][SQL] Case-insensitive field resolut...

2018-08-19 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22148

[SPARK-25132][SQL] Case-insensitive field resolution when reading from 
Parquet

## What changes were proposed in this pull request?
Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, regardless of whether spark.sql.caseSensitive 
is set to true or false. This PR aims to add case-insensitive field resolution for 
ParquetFileFormat.
* Do case-insensitive resolution only if Spark is in case-insensitive mode.
* Field resolution should fail if there is ambiguity, i.e. more than one 
field is matched.
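
Roughly, the resolution rule looks like the following sketch (simplified, not the actual ParquetReadSupport code):

```
def resolveField(
    parquetFieldNames: Seq[String],
    required: String,
    caseSensitive: Boolean): Option[String] = {
  val matched =
    if (caseSensitive) parquetFieldNames.filter(_ == required)
    else parquetFieldNames.filter(_.equalsIgnoreCase(required))
  if (matched.size > 1) {
    throw new RuntimeException(
      s"""Found duplicate field(s) "$required": ${matched.mkString(", ")} in case-insensitive mode""")
  }
  matched.headOption   // None means the column is read as null
}
```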

## How was this patch tested?
Unit tests added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132-Parquet

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22148.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22148


commit e0a555354436b0099a7b48b3269b8c5ccb73571b
Author: seancxmao 
Date:   2018-08-17T10:06:28Z

[SPARK-25132][SQL] Case-insensitive field resolution when reading from 
Parquet




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22142: [SPARK-25132][SQL] Case-insensitive field resolution whe...

2018-08-19 Thread seancxmao
Github user seancxmao commented on the issue:

https://github.com/apache/spark/pull/22142
  
Split this into two PRs, one for Parquet and one for ORC.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22142: [SPARK-25132][SQL] Case-insensitive field resolut...

2018-08-19 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/22142


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #22142: [SPARK-25132][SQL] case-insensitive field resolut...

2018-08-19 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22142

[SPARK-25132][SQL] case-insensitive field resolution when reading from 
Parquet/ORC

## What changes were proposed in this pull request?
Spark SQL returns NULL for a column whose Hive metastore schema and Parquet 
schema are in different letter cases, regardless of whether spark.sql.caseSensitive 
is set to true or false. This applies not only to Parquet, but also to ORC. A brief 
summary:
* ParquetFileFormat doesn't support case-insensitive field resolution.
* native OrcFileFormat supports case-insensitive field resolution, however 
it cannot handle duplicate fields.
* hive OrcFileFormat doesn't support case-insensitive field resolution.

https://github.com/apache/spark/pull/15799 reverted case-insensitive 
resolution for ParquetFileFormat and hive OrcFileFormat. This PR brings it back 
and improves it to do case-insensitive resolution only when Spark is in 
case-insensitive mode. Field resolution fails if there is ambiguity, 
i.e. more than one field is matched. ParquetFileFormat, native OrcFileFormat 
and hive OrcFileFormat are all supported.

## How was this patch tested?
Unit tests added.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25132

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22142.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22142


commit 5c3d20b654609c86de9c24c9751ec34916f3aabd
Author: seancxmao 
Date:   2018-08-17T10:06:28Z

SPARK-25132: case-insensitive field resolution when reading from Parquet/ORC

* Fix ParquetFileFormat
* More than one Parquet column is matched
* Fix OrcFileFormat (both native and hive implementations)
* Fix issues according to review results: refactor test cases, code style, 
...
* Test cases: change parquet/orc file schema from a to A
* Test cases: let different columns have different value series
* Refine error message
* Split multi-format test suite
* Simplify test cases for ambiguous resolution
* Simplify test cases to reduce code lines
* Refine tests and  comments




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21113: [SPARK-13136][SQL][FOLLOW-UP] Fix comment

2018-04-20 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/21113

[SPARK-13136][SQL][FOLLOW-UP] Fix comment

## What changes were proposed in this pull request?
Fix comment. Change `BroadcastHashJoin.broadcastFuture` to 
`BroadcastExchange.relationFuture`: 
https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L66

## How was this patch tested?
N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-13136

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21113


commit 6aceb43c56641a38b97b5e090379ef5143f17dd0
Author: seancxmao 
Date:   2018-04-20T07:12:09Z

fix comment issue




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveE...

2018-02-24 Thread seancxmao
Github user seancxmao closed the pull request at:

https://github.com/apache/spark/pull/20597


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveE...

2018-02-13 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/20597

[MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?
In `HiveExternalCatalogVersionsSuite`, latest version of 2.2.x is updated 
from 2.2.0 to 2.2.1

## How was this patch tested?
N/A

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark wrong_versions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20597


commit 047e11913872454be03e353c4ba2f793b74ca3ae
Author: seancxmao 
Date:   2018-02-13T08:23:53Z

Latest version of 2.2.x is changed from 2.2.0 to 2.2.1




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org


