[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r240148392 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } test("Sort metrics") { -// Assume the execution plan is -// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1)) -val ds = spark.range(10).sort('id) -testSparkPlanMetrics(ds.toDF(), 2, Map.empty) +// Assume the execution plan with node id is +// Sort(nodeId = 0) +// Exchange(nodeId = 1) +// LocalTableScan(nodeId = 2) +val df = Seq(1, 3, 2).toDF("id").sort('id) +testSparkPlanMetrics(df, 2, Map.empty) --- End diff -- This makes sense! Let me try. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23273: [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/23273 [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule ## What changes were proposed in this pull request? There are some comment issues left from when the `ConvertToLocalRelation` rule was added (#22205/[SPARK-25212](https://issues.apache.org/jira/browse/SPARK-25212)). This PR fixes those comment issues. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark ConvertToLocalRelation-doc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23273.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23273 commit dfd0f71afb8d95253ea4f64d00cea53c306b6e1c Author: seancxmao Date: 2018-12-10T09:41:54Z [SPARK-25212][SQL][FOLLOWUP][DOC] Fix comments of ConvertToLocalRelation rule --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r240038550 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } test("Sort metrics") { -// Assume the execution plan is -// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1)) -val ds = spark.range(10).sort('id) -testSparkPlanMetrics(ds.toDF(), 2, Map.empty) +// Assume the execution plan with node id is +// Sort(nodeId = 0) +// Exchange(nodeId = 1) +// LocalTableScan(nodeId = 2) +val df = Seq(1, 3, 2).toDF("id").sort('id) +testSparkPlanMetrics(df, 2, Map.empty) --- End diff -- @cloud-fan This case tries to check the metrics of `SortExec`; however, these metrics (`sortTime`, `peakMemory`, `spillSize`) change each time the query is executed, so they are not fixed. So far what I have done is check whether `SortExec` exists. Do you mean we should further check whether these metric names exist, even though we can't know their values beforehand? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
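The check discussed above can be sketched as follows. This is only an illustration (not the final test code), and it assumes it runs inside `SQLMetricsSuite`, where `spark` and the DataFrame implicits are in scope; only the metric names are asserted, since `sortTime`, `peakMemory` and `spillSize` vary from run to run.

```scala
import org.apache.spark.sql.execution.SortExec

// Build a query whose Sort cannot be optimized away (the input is not pre-sorted).
val df = Seq(1, 3, 2).toDF("id").sort('id)
df.collect() // run the query so the executed plan (and its metrics) exists

// Find the Sort node and check that the expected metric names are registered;
// their values change between executions, so they are not compared.
val sortNode = df.queryExecution.executedPlan.collectFirst { case s: SortExec => s }
assert(sortNode.isDefined, "SortExec should be present in the executed plan")
assert(Set("sortTime", "peakMemory", "spillSize").subsetOf(sortNode.get.metrics.keySet))
```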
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r240024723 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } test("Sort metrics") { -// Assume the execution plan is -// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1)) -val ds = spark.range(10).sort('id) -testSparkPlanMetrics(ds.toDF(), 2, Map.empty) +// Assume the execution plan with node id is +// Sort(nodeId = 0) +// Exchange(nodeId = 1) +// LocalTableScan(nodeId = 2) +val df = Seq(1, 3, 2).toDF("id").sort('id) +testSparkPlanMetrics(df, 2, Map.empty) --- End diff -- @mgaido91 This case tries to check `Sort` (nodeId=0) metrics, rather than `LocalTableScan`. The second parameter (`2`) of `testSparkPlanMetrics(df, 2, Map.empty)` means `expectedNumOfJobs`, not `nodeId`. The third parameter, `expectedMetrics`, passes the `nodeId` together with the corresponding expected metrics. Because the metrics of the Sort node (including `sortTime`, `peakMemory`, `spillSize`) may change with each execution, unlike metrics such as `numOutputRows`, we have no way to check their exact values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23238: [SPARK-25132][SQL][FOLLOWUP][DOC] Add migration doc for ...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/23238 Thank you! @dongjoon-hyun --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics whi...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/23258 retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239995952 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } test("Sort metrics") { -// Assume the execution plan is -// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1)) -val ds = spark.range(10).sort('id) -testSparkPlanMetrics(ds.toDF(), 2, Map.empty) +// Assume the execution plan with node id is +// Sort(nodeId = 0) +// Exchange(nodeId = 1) +// Range(nodeId = 2) +val df = spark.range(9, -1, -1).sort('id).toDF() --- End diff -- Using `Seq` instead of `Range` makes things simpler and more readable. I'll change it to use `Seq`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239995901 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -182,10 +182,13 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared } test("Sort metrics") { -// Assume the execution plan is -// WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Sort(nodeId = 1)) -val ds = spark.range(10).sort('id) -testSparkPlanMetrics(ds.toDF(), 2, Map.empty) +// Assume the execution plan with node id is +// Sort(nodeId = 0) +// Exchange(nodeId = 1) +// Range(nodeId = 2) +val df = spark.range(9, -1, -1).sort('id).toDF() +testSparkPlanMetrics(df, 2, Map.empty) + df.queryExecution.executedPlan.find(_.isInstanceOf[SortExec]).getOrElse(assert(false)) --- End diff -- OK, I think this is more readable. fixed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23258#discussion_r239995882 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -26,7 +26,7 @@ import org.apache.spark.SparkFunSuite import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.expressions.aggregate.{Final, Partial} import org.apache.spark.sql.catalyst.plans.logical.LocalRelation -import org.apache.spark.sql.execution.{FilterExec, RangeExec, SparkPlan, WholeStageCodegenExec} +import org.apache.spark.sql.execution._ --- End diff -- Ok, unfolded. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metric...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23224#discussion_r239992863 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/metric/SQLMetricsSuite.scala --- @@ -80,8 +80,10 @@ class SQLMetricsSuite extends SparkFunSuite with SQLMetricsTestUtils with Shared // Assume the execution plan is // WholeStageCodegen(nodeId = 0, Range(nodeId = 2) -> Filter(nodeId = 1)) // TODO: update metrics in generated operators -val ds = spark.range(10).filter('id < 5) -testSparkPlanMetrics(ds.toDF(), 1, Map.empty) +val df = spark.range(10).filter('id < 5).toDF() +testSparkPlanMetrics(df, 1, Map.empty, true) + df.queryExecution.executedPlan.find(_.isInstanceOf[WholeStageCodegenExec]) + .getOrElse(assert(false)) --- End diff -- Thank you @viirya. Very good suggestions. After investigation, besides whole-stage codegen related issue, I found another issue. #20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375) introduced an optimizer rule to eliminate redundant Sort. For a test case named "Sort metrics" in `SQLMetricsSuite`, because range is already sorted, sort is removed by the `RemoveRedundantSorts`, which makes the test case meaningless. This seems to be a pretty different issue, so I opened a new PR. See #23258 for details. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23258: [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metr...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/23258 [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing ## What changes were proposed in this pull request? #20560/[SPARK-23375](https://issues.apache.org/jira/browse/SPARK-23375) introduced an optimizer rule to eliminate redundant Sort. For a test case named "Sort metrics" in `SQLMetricsSuite`, because range is already sorted, sort is removed by the `RemoveRedundantSorts`, which makes this test case meaningless. This PR modifies the query for testing Sort metrics and checks Sort exists in the plan. ## How was this patch tested? Modify the existing test case. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark sort-metrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23258.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23258 commit 408ccf88325c23c6f2f6119cd6ba99c5056f95a2 Author: seancxmao Date: 2018-12-08T03:58:21Z [SPARK-23375][SQL][FOLLOWUP][TEST] Test Sort metrics while Sort is missing --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
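The effect of `RemoveRedundantSorts` described above can be seen with a quick, hypothetical spark-shell check (assuming `spark.implicits._` is imported): sorting the output of `range` leaves no Sort node in the plan, while sorting an unsorted local relation keeps it.

```scala
import org.apache.spark.sql.execution.SortExec

// range(10) already produces rows ordered by id, so the optimizer drops the Sort
// and the old test never exercised Sort metrics at all.
val rangeSorted = spark.range(10).sort('id)
assert(rangeSorted.queryExecution.executedPlan.collectFirst { case s: SortExec => s }.isEmpty)

// An unsorted local relation keeps the Sort node, so its metrics can actually be tested.
val localSorted = Seq(1, 3, 2).toDF("id").sort('id)
assert(localSorted.queryExecution.executedPlan.collectFirst { case s: SortExec => s }.isDefined)
```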
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/23224 @felixcheung Yes, that makes sense. I have added a commit to check that. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23238: [SPARK-25132][SQL][FOLLOWUP][DOC] Add migration d...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/23238#discussion_r239752238 --- Diff: docs/sql-migration-guide-upgrade.md --- @@ -141,6 +141,8 @@ displayTitle: Spark SQL Upgrading Guide - In Spark version 2.3 and earlier, HAVING without GROUP BY is treated as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true` and returns 10 rows. This violates SQL standard, and has been fixed in Spark 2.4. Since Spark 2.4, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` will return only one row. To restore the previous behavior, set `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` to `true`. + - In version 2.3 and earlier, when reading from a Parquet data source table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.4, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched. This change also applies to Parquet Hive tables when `spark.sql.hive.convertMetastoreParquet` is set to true. --- End diff -- @dongjoon-hyun Good suggestions. I have fixed them with a new commit. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
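To make the documented behavior concrete, here is a hedged sketch of the scenario the migration note describes; the table name and path are made up, and the exact results depend on the Spark version as noted in the comments.

```scala
// Parquet files whose physical column name is upper-case "ID".
spark.range(5).toDF("ID").write.mode("overwrite").parquet("/tmp/mixed_case_parquet")

// A table over those files whose declared schema uses lower-case "id".
spark.sql("DROP TABLE IF EXISTS t_mixed_case")
spark.sql("CREATE TABLE t_mixed_case (id BIGINT) USING parquet " +
  "LOCATION '/tmp/mixed_case_parquet'")

// Spark 2.3 and earlier: nulls are returned regardless of spark.sql.caseSensitive.
// Spark 2.4+: with caseSensitive=false the values come back; an exception is thrown
// only if more than one Parquet column matches the same schema field.
spark.conf.set("spark.sql.caseSensitive", "false")
spark.table("t_mixed_case").show()
```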
[GitHub] spark issue #23238: [SPARK-25132][SQL][FOLLOWUP] Add migration doc for case-...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/23238 @HyukjinKwon Would you please kindly take a look at this when you have time? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23237: [SPARK-26279][CORE] Remove unused method in Logging
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/23237 @HyukjinKwon Close this PR. Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23237: [SPARK-26279][CORE] Remove unused method in Loggi...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/23237 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23238: [SPARK-25132][SQL][FOLLOWUP] Add migration doc fo...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/23238 [SPARK-25132][SQL][FOLLOWUP] Add migration doc for case-insensitive field resolution when reading from Parquet ## What changes were proposed in this pull request? #22148 introduces a behavior change. Following the discussion at #22184, this PR updates the migration guide for upgrading from Spark 2.3 to 2.4. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25132-doc-2.4 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23238.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23238 commit 5bbcf41f34f2ca160da7ef4ebe4c54d15a2d09b5 Author: seancxmao Date: 2018-12-05T15:05:38Z [SPARK-25132][SQL][FOLLOWUP] Update migration doc for case-insensitive field resolution when reading from Parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @srowen Sorry for the late reply! I'd like to close this PR and file a new one since our SQL doc has changed a lot. Thank you all for your comments and time! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/22184 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23237: [SPARK-26279][CORE] Remove unused method in Loggi...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/23237 [SPARK-26279][CORE] Remove unused method in Logging ## What changes were proposed in this pull request? The method `Logging.isTraceEnabled` is not used anywhere. We should remove it to avoid confusion. ## How was this patch tested? Test locally with existing tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark clean-logging Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23237.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23237 commit 90b111f900d8f11e4d730e0cfbe56a1683f96faa Author: seancxmao Date: 2018-12-05T14:07:49Z [SPARK-26279][CORE] Remove unused methods in Logging --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23224: [SPARK-26277][SQL][TEST] WholeStageCodegen metrics shoul...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/23224 @HyukjinKwon Thank you for your comments! I have filed a JIRA and updated the PR title accordingly. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23224: [MINOR][SQL][TEST] WholeStageCodegen metrics shou...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/23224 [MINOR][SQL][TEST] WholeStageCodegen metrics should be tested with whole-stage codegen enabled ## What changes were proposed in this pull request? In `org.apache.spark.sql.execution.metric.SQLMetricsSuite`, there's a test case named "WholeStageCodegen metrics". However, it is executed with whole-stage codegen disabled. This PR fixes this by enabling whole-stage codegen for this test case. ## How was this patch tested? Tested locally using existing test cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark codegen-metrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23224.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23224 commit 021728ccc70cf971592c560cfc5492dedbdc362a Author: seancxmao Date: 2018-12-05T06:28:02Z [MINOR][SQL][TEST] WholeStageCodegen metrics should be tested with whole-stage codegen enabled --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
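For context, one way to pin whole-stage codegen on for just this test is a `withSQLConf` block. This is only a sketch of the idea (the diff quoted elsewhere in this thread instead passes an extra boolean argument to `testSparkPlanMetrics`), and it assumes the suite's `SQLTestUtils`/`SharedSQLContext` helpers and implicits are in scope.

```scala
import org.apache.spark.sql.execution.WholeStageCodegenExec
import org.apache.spark.sql.internal.SQLConf

// Force whole-stage codegen on for the scope of the test, so the
// WholeStageCodegen node (and its metrics) actually appears in the executed plan.
withSQLConf(SQLConf.WHOLE_STAGE_CODEGEN_ENABLED.key -> "true") {
  val df = spark.range(10).filter('id < 5).toDF()
  testSparkPlanMetrics(df, 1, Map.empty)
  assert(df.queryExecution.executedPlan.find(_.isInstanceOf[WholeStageCodegenExec]).isDefined)
}
```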
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @HyukjinKwon Thank you for your comments. Yes, this is only valid when upgrading from Spark 2.3 to 2.4. I will do it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r229156006 --- Diff: docs/sql-migration-guide-hive-compatibility.md --- @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as: * Explain * Partitioned tables including dynamic partition insertion * View + * If column aliases are not specified in view definition queries, both Spark and Hive will +generate alias names, but in different ways. In order for Spark to be able to read views created +by Hive, users should explicitly specify column aliases in view definition queries. As an +example, Spark cannot read `v1` created as below by Hive. + +``` +CREATE TABLE t1 (c1 INT, c2 STRING); +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2; +``` + +Instead, you should create `v1` as below with column aliases explicitly specified. + +``` +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2; --- End diff -- Sure, updated as well. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r229155030 --- Diff: docs/sql-migration-guide-hive-compatibility.md --- @@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as: * View * If column aliases are not specified in view definition queries, both Spark and Hive will generate alias names, but in different ways. In order for Spark to be able to read views created -by Hive, users should explicitly specify column aliases in view definition queries. +by Hive, users should explicitly specify column aliases in view definition queries. As an +example, Spark cannot read `v1` created as below by Hive. + +``` +CREATE TABLE t1 (c1 INT, c2 STRING); +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2; --- End diff -- It seems Hive 1.x does not allow `(` following `CREATE VIEW ... AS`, while Hive 2.x just works well. The following works on Hive 1.2.1, 1.2.2 and 2.3.3. ``` CREATE VIEW v1 AS SELECT c1 + 1, upper(c2) FROM t1; ``` Another finding is that the above view is readable by Spark though view column names are weird (`_c0`, `_c1`). Because Spark will add a `Project` between `View` and view definition query if their output attributes do not match. ``` spark-sql> explain extended v1; ... == Analyzed Logical Plan == _c0: int, _c1: string Project [_c0#44, _c1#45] +- SubqueryAlias v1 +- View (`default`.`v1`, [_c0#44,_c1#45]) +- Project [cast((c1 + 1)#48 as int) AS _c0#44, cast(upper(c2)#49 as string) AS _c1#45] // this is added by AliasViewChild rule +- Project [(c1#46 + 1) AS (c1 + 1)#48, upper(c2#47) AS upper(c2)#49] +- SubqueryAlias t1 +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#46, c2#47] ... ``` But, if column aliases in subqueries of the view definition query are missing, Spark will not be able to read the view. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r229150147 --- Diff: docs/sql-migration-guide-hive-compatibility.md --- @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as: * Explain * Partitioned tables including dynamic partition insertion * View + * If column aliases are not specified in view definition queries, both Spark and Hive will +generate alias names, but in different ways. In order for Spark to be able to read views created +by Hive, users should explicitly specify column aliases in view definition queries. As an +example, Spark cannot read `v1` created as below by Hive. + +``` +CREATE TABLE t1 (c1 INT, c2 STRING); +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2; --- End diff -- Good ideas. I have simplified the example and tested it using Hive 2.3.3 and Spark 2.3.1. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22868 @dongjoon-hyun Do you mean SPARK-25833? Since SPARK-24864 is resolved as Won't Fix, I updated type, priority and title of SPARK-25833. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22868#discussion_r228845332 --- Diff: docs/sql-migration-guide-hive-compatibility.md --- @@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as: * Explain * Partitioned tables including dynamic partition insertion * View + * If column aliases are not specified in view definition queries, both Spark and Hive will +generate alias names, but in different ways. In order for Spark to be able to read views created +by Hive, users should explicitly specify column aliases in view definition queries. --- End diff -- Good idea. I have added an example. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migrat...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/22851 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22851 Closing this. Thank you @dongjoon-hyun @felixcheung --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22868: [SPARK-25833][SQL][DOCS] Update migration guide for Hive...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22868 retest this please --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22868 [SPARK-25833][SQL][DOCS] Update migration guide for Hive view compatibility ## What changes were proposed in this pull request? Both Spark and Hive support views. However in some cases views created by Hive are not readable by Spark. For example, if column aliases are not specified in view definition queries, both Spark and Hive will generate alias names, but in different ways. In order for Spark to be able to read views created by Hive, users should explicitly specify column aliases in view definition queries. Given that it's not uncommon that Hive and Spark are used together in enterprise data warehouse, this PR aims to explicitly describe this compatibility issue to help users troubleshoot this issue easily. ## How was this patch tested? Docs are manually generated and checked locally. ``` SKIP_API=1 jekyll serve ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25833 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22868.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22868 commit e5b3a11c2cedcbbe528cc72d465ab6e27f5215e3 Author: seancxmao Date: 2018-10-28T06:46:10Z update migration guide for hive view compatibility --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22846: [SPARK-25797][SQL][DOCS] Add migration doc for solving i...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22846 @cloud-fan PR for 2.3 is submitted. Please see #22851. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22851: [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migrat...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22851 [SPARK-25797][SQL][DOCS][BACKPORT-2.3] Add migration doc for solving issues caused by view canonicalization approach change ## What changes were proposed in this pull request? Since Spark 2.2, view definitions are stored in a different way from prior versions. This may leave Spark unable to read views created by prior versions. See [SPARK-25797](https://issues.apache.org/jira/browse/SPARK-25797) for more details. Basically, we have 2 options. 1) Make Spark 2.2+ able to get older view definitions back. Since the expanded text is buggy and unusable, we have to use the original text (this is possible with [SPARK-25459](https://issues.apache.org/jira/browse/SPARK-25459)). However, because older Spark versions don't save the context for the database, we cannot always get correct view definitions without the view default database. 2) Recreate the views by `ALTER VIEW AS` or `CREATE OR REPLACE VIEW AS`. This PR aims to add a migration doc to help users troubleshoot this issue via option 2 above. ## How was this patch tested? N/A. Docs are generated and checked locally ``` cd docs SKIP_API=1 jekyll serve --watch ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25797-2.3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22851.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22851 commit e0e71bcf2e825a021e102e4305f6bf0fc25e8b88 Author: seancxmao Date: 2018-10-26T11:35:39Z add migration doc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22846: [SPARK-25797][SQL][DOCS] Add migration doc for solving i...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22846 @cloud-fan Sure, I will send a new PR for 2.3. Thank you for reviewing this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22846: [SPARK-25797][SQL][DOCS] Add migration doc for solving i...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22846 @jiangxb1987 @cloud-fan @gatorsmile Could you please kindly review this when you have time? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22846: [SPARK-25797][SQL][DOCS] Add migration doc for so...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22846 [SPARK-25797][SQL][DOCS] Add migration doc for solving issues caused by view canonicalization approach change ## What changes were proposed in this pull request? Since Spark 2.2, view definitions are stored in a different way from prior versions. This may leave Spark unable to read views created by prior versions. See [SPARK-25797](https://issues.apache.org/jira/browse/SPARK-25797) for more details. Basically, we have 2 options. 1) Make Spark 2.2+ able to get older view definitions back. Since the expanded text is buggy and unusable, we have to use the original text (this is possible with [SPARK-25459](https://issues.apache.org/jira/browse/SPARK-25459)). However, because older Spark versions don't save the context for the database, we cannot always get correct view definitions without the view default database. 2) Recreate the views by `ALTER VIEW AS` or `CREATE OR REPLACE VIEW AS`. This PR aims to add a migration doc to help users troubleshoot this issue via option 2 above. ## How was this patch tested? N/A. Docs are generated and checked locally ``` cd docs SKIP_API=1 jekyll serve --watch ``` You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25797 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22846.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22846 commit a6f4d540d948a887d9dd88b5fc83d3dcf065491f Author: seancxmao Date: 2018-10-26T06:08:42Z add migration doc --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
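As a concrete, hedged illustration of option 2 above (the database, view and column names are placeholders, not taken from the PR), recreating the view makes Spark 2.2+ store it with the new canonicalization approach:

```scala
// Redefine a view originally created by Spark 2.1 or earlier.
spark.sql("ALTER VIEW my_db.my_view AS SELECT c1, c2 FROM my_db.my_table")

// Equivalent for this purpose, and also usable if the view definition itself changes.
spark.sql("CREATE OR REPLACE VIEW my_db.my_view AS SELECT c1, c2 FROM my_db.my_table")
```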
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22461#discussion_r221411618 --- Diff: docs/sql-programming-guide.md --- @@ -1489,7 +1489,7 @@ See the [Apache Avro Data Source Guide](avro-data-source-guide.html). * The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java's DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs. * Some databases, such as H2, convert all names to upper case. You'll need to use upper case to refer to those names in Spark SQL. - + * Users can specify vendor-specific JDBC connection properties in the data source options to do special treatment. For example, `spark.read.format("jdbc").option("url", oracleJdbcUrl).option("oracle.jdbc.mapDateToTimestamp", "false")`. `oracle.jdbc.mapDateToTimestamp` defaults to true, users often need to disable this flag to avoid Oracle date being resolved as timestamp. --- End diff -- @maropu @gatorsmile Could you please suggest something else to do here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22531: [SPARK-25415][SQL][FOLLOW-UP] Add Locale.ROOT when toUpp...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22531 @HyukjinKwon Please go ahead since you've already been working on this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22453 Retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22453#discussion_r220407692 --- Diff: docs/sql-programming-guide.md --- @@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession + + spark.sql.parquet.writeLegacyFormat + false + +This configuration indicates whether we should use legacy Parquet format adopted by Spark 1.4 +and prior versions or the standard format defined in parquet-format specification to write +Parquet files. This is not only related to compatibility with old Spark ones, but also other +systems like Hive, Impala, Presto, etc. This is especially important for decimals. If this +configuration is not enabled, decimals will be written in int-based format in Spark 1.5 and +above, other systems that only support legacy decimal format (fixed length byte array) will not +be able to read what Spark has written. Note other systems may have added support for the +standard format in more recent versions, which will make this configuration unnecessary. Please --- End diff -- If we must call it "legacy", I'd think of it as the legacy implementation on the Spark side, rather than a legacy format on the Parquet side. As commented in [SPARK-20297](https://issues.apache.org/jira/browse/SPARK-20297?focusedCommentId=15975559&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15975559) > The standard doesn't say that smaller decimals have to be stored in int32/int64, it just is an option for subset of decimal types. int32 and int64 are valid representations for a subset of decimal types. fixed_len_byte_array and binary are a valid representation of any decimal type. > >The int32/int64 options were present in the original version of the decimal spec, they just weren't widely implemented. So its not a new/old version thing, it was just an alternative representation that many systems didn't implement. Anyway, it really leads to confusion. I really appreciate your suggestion @srowen to make the doc shorter; the doc you suggested is more concise and to the point. One more thing I want to discuss: after investigating the usage of this option, I found it is not only related to decimals, but also to complex types (Array, Map); see the source code below. Should we mention this in the doc? https://github.com/apache/spark/blob/473d0d862de54ec1c7a8f0354fa5e06f3d66e455/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L450-L458 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22453#discussion_r220042478 --- Diff: docs/sql-programming-guide.md --- @@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession + + spark.sql.parquet.writeLegacyFormat + false + +This configuration indicates whether we should use legacy Parquet format adopted by Spark 1.4 +and prior versions or the standard format defined in parquet-format specification to write +Parquet files. This is not only related to compatibility with old Spark ones, but also other +systems like Hive, Impala, Presto, etc. This is especially important for decimals. If this +configuration is not enabled, decimals will be written in int-based format in Spark 1.5 and +above, other systems that only support legacy decimal format (fixed length byte array) will not +be able to read what Spark has written. Note other systems may have added support for the +standard format in more recent versions, which will make this configuration unnecessary. Please --- End diff -- Thanks for your suggestion. I have updated the doc in SQLConf. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22453#discussion_r220038438 --- Diff: docs/sql-programming-guide.md --- @@ -1002,6 +1002,21 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession + + spark.sql.parquet.writeLegacyFormat + false + +This configuration indicates whether we should use legacy Parquet format adopted by Spark 1.4 +and prior versions or the standard format defined in parquet-format specification to write +Parquet files. This is not only related to compatibility with old Spark ones, but also other +systems like Hive, Impala, Presto, etc. This is especially important for decimals. If this +configuration is not enabled, decimals will be written in int-based format in Spark 1.5 and +above, other systems that only support legacy decimal format (fixed length byte array) will not +be able to read what Spark has written. Note other systems may have added support for the +standard format in more recent versions, which will make this configuration unnecessary. Please --- End diff -- Hive and Impala do NOT support the new Parquet format yet. * [HIVE-19069](https://jira.apache.org/jira/browse/HIVE-19069): Hive can't read int32 and int64 Parquet decimal. This issue is not resolved yet, which is consistent with the source code check by @HyukjinKwon. * [IMPALA-5542](https://issues.apache.org/jira/browse/IMPALA-5542): Impala cannot scan Parquet decimal stored as int64_t/int32_t. This is resolved, but targeted at Impala 3.1.0, which has not been released yet. The latest release is 3.0.0 (https://impala.apache.org/downloads.html). Presto has supported the new Parquet format since 0.182. * [issues/7533](https://github.com/prestodb/presto/issues/7533): Improve decimal type support in the new Parquet reader. This patch is included in [0.182](https://prestodb.io/docs/current/release/release-0.182.html). Below is the excerpt: > Fix reading decimal values in the optimized Parquet reader when they are backed by the int32 or int64 types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22453 FYI. I had a brief survey on Parquet decimal support of computing engines at the time of writing. Hive * [HIVE-19069](https://jira.apache.org/jira/browse/HIVE-19069) Hive can't read int32 and int64 Parquet decimal. Not resolved yet. Impala: * [IMPALA-5628](https://issues.apache.org/jira/browse/IMPALA-5628) Parquet support for additional valid decimal representations. This is an umbrella JIRA. * [IMPALA-2494](https://issues.apache.org/jira/browse/IMPALA-2494) Impala Unable to scan a Decimal column stored as Bytes. Fix Version/s: Impala 2.11.0. * [IMPALA-5542](https://issues.apache.org/jira/browse/IMPALA-5542) Impala cannot scan Parquet decimal stored as int64_t/int32_t. Fix Version/s: Impala 3.1.0, not released yet. Presto * [issues/7232](https://github.com/prestodb/presto/issues/7232). Can't read decimal type in parquet files written by spark and referenced as external in the hive metastore * [issues/7533](https://github.com/prestodb/presto/issues/7533). Improve decimal type support in the new Parquet reader. Fixed Version: [0.182](https://prestodb.io/docs/current/release/release-0.182.html) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22453#discussion_r219729166 --- Diff: docs/sql-programming-guide.md --- @@ -1002,6 +1002,15 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession + + spark.sql.parquet.writeLegacyFormat --- End diff -- OK, I will update the doc and describe scenarios and reasons why we need this flag. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22453#discussion_r219721110 --- Diff: docs/sql-programming-guide.md --- @@ -1002,6 +1002,15 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession + + spark.sql.parquet.writeLegacyFormat --- End diff -- I'd like to add my 2 cents. We use both Spark and Hive in our Hadoop/Spark clusters. And we have 2 types of tables, working tables and target tables. Working tables are only used by Spark jobs, while target tables are populated by Spark and exposed to downstream jobs including Hive jobs. Our data engineers frequently meet with this issue when they use Hive to read target tables. Finally we decided to set spark.sql.parquet.writeLegacyFormat=true as the default value for target tables and explicitly describe this in our internal developer guide. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
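For reference, a minimal sketch of the job setup described above; the table and path names are invented, and the flag is set per job only to keep the example small.

```scala
// Write the Hive-facing target table using the pre-1.5 ("legacy") Parquet layout,
// so that Hive and older Impala releases can read its decimal columns.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

val target = spark.table("working_db.orders") // hypothetical intermediate table
target.write.mode("overwrite").parquet("/warehouse/target_db/orders")
```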
[GitHub] spark pull request #22499: [SPARK-25489][ML][TEST] Refactor UDTSerialization...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22499#discussion_r219698800 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala --- @@ -18,52 +18,52 @@ package org.apache.spark.mllib.linalg import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder -import org.apache.spark.util.Benchmark +import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase} /** * Serialization benchmark for VectorUDT. + * To run this benchmark: + * 1. without sbt: bin/spark-submit --class --- End diff -- I have rebased this PR onto the latest master and also fixed the docs. I am out for the Mid-Autumn Festival, sorry for the late reply. BTW, happy Mid-Autumn Festival. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22461#discussion_r219539062 --- Diff: docs/sql-programming-guide.md --- @@ -1287,8 +1287,18 @@ bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9. Tables from the remote database can be loaded as a DataFrame or Spark SQL temporary view using the Data Sources API. Users can specify the JDBC connection properties in the data source options. user and password are normally provided as connection properties for -logging into the data sources. In addition to the connection properties, Spark also supports -the following case-insensitive options: +logging into the data sources. Vendor-specific connection properties can also be passed to the +underlying JDBC driver in the same way. For example: + +{% highlight scala %} +// oracle.jdbc.mapDateToTimestamp defaults to true. If this flag is not disabled, a column of Oracle +// DATE type will be resolved as Catalyst TimestampType, which is probably not the desired behavior. +spark.read.format("jdbc") + .option("url", oracleJdbcUrl) + .option("oracle.jdbc.mapDateToTimestamp", "false") +{% endhighlight %} + --- End diff -- I have moved this description to the `Troubleshooting` section. I also tried to brush up the description. Writing good documentation is sometimes more difficult than writing code; I really need your help :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22461#discussion_r219537942 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -462,6 +468,12 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo .option("lowerBound", "2018-07-04 03:30:00.0") .option("upperBound", "2018-07-27 14:11:05.0") .option("numPartitions", 2) + // oracle.jdbc.mapDateToTimestamp defaults to true. If this flag is not disabled, column d + // (Oracle DATE) will be resolved as Catalyst Timestamp, which will fail test result + // comparison. E.g. Expected 2018-07-06 while Actual 2018-07-06 00:00:00.0. --- End diff -- OK, I will remove duplicate description. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22461#discussion_r219403696 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -462,6 +464,9 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo .option("lowerBound", "2018-07-04 03:30:00.0") .option("upperBound", "2018-07-27 14:11:05.0") .option("numPartitions", 2) + .option("oracle.jdbc.mapDateToTimestamp", "false") --- End diff -- Yes, we need this. Otherwise, Spark will read column `d` values as the Catalyst timestamp type, which will fail the test. (Screenshot: https://user-images.githubusercontent.com/12194089/45865915-9e730800-bdb1-11e8-9a42-a1394c601166.png) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
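Spelled out, the kind of read in question looks roughly like this; the URL, table name and bounds are placeholders modelled on the test diffs in this thread, and the last option is the point.

```scala
// Without oracle.jdbc.mapDateToTimestamp=false, the Oracle driver maps DATE to
// TIMESTAMP, so the partition column `d` is resolved as TimestampType and the
// date-only bounds below no longer match what the test expects.
val df = spark.read.format("jdbc")
  .option("url", oracleJdbcUrl)                  // placeholder connection URL
  .option("dbtable", "datetime_partition_test")  // placeholder table name
  .option("partitionColumn", "d")
  .option("lowerBound", "2018-07-06")
  .option("upperBound", "2018-07-20")
  .option("numPartitions", 3)
  .option("oracle.jdbc.mapDateToTimestamp", "false")
  .load()
```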
[GitHub] spark issue #22497: [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmar...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22497 @kiszk @wangyum Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite I...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22461#discussion_r219386919 --- Diff: external/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/OracleIntegrationSuite.scala --- @@ -442,6 +442,8 @@ class OracleIntegrationSuite extends DockerJDBCIntegrationSuite with SharedSQLCo .option("lowerBound", "2018-07-06") .option("upperBound", "2018-07-20") .option("numPartitions", 3) + .option("oracle.jdbc.mapDateToTimestamp", "false") --- End diff -- ok. I will add notes to http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases, and will also add comments to the code. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22499: [SPARK-25489][ML][TEST] Refactor UDTSerialization...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22499 [SPARK-25489][ML][TEST] Refactor UDTSerializationBenchmark ## What changes were proposed in this pull request? Refactor `UDTSerializationBenchmark` to use a main method and print the output to a separate file. Run the command below to generate benchmark results: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "mllib/test:runMain org.apache.spark.mllib.linalg.UDTSerializationBenchmark" ``` ## How was this patch tested? Manual tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25489 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22499.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22499 commit e43dda9c279fdffa3293e3ef40897cf06fb7cbfa Author: seancxmao Date: 2018-09-20T15:32:17Z [SPARK-25489] Refactor UDTSerializationBenchmark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22497: [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayB...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22497 [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark ## What changes were proposed in this pull request? Refactor PrimitiveArrayBenchmark to use a main method and print the output to a separate file. Run the command below to generate benchmark results: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.PrimitiveArrayBenchmark" ``` ## How was this patch tested? Manual tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25487 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22497.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22497 commit 631df4b1944c73a4c63b7dcb9873358671a07baf Author: seancxmao Date: 2018-09-20T14:48:21Z [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 I tested on my Mac, following the guidance at http://spark.apache.org/docs/latest/building-spark.html#running-docker-based-integration-test-suites. ``` ./build/mvn install -DskipTests ./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11 ``` The results are as below: (Screenshot: https://user-images.githubusercontent.com/12194089/45753213-a4020e00-bc4a-11e8-897f-111fc9dbaba5.png) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 @gatorsmile Thanks a lot! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22461: [SPARK-25453][SQL][TEST] OracleIntegrationSuite IllegalA...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22461 Jenkins, retest this please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgum...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22461 [SPARK-25453] OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] ## What changes were proposed in this pull request? This PR aims to fix the failed test of `OracleIntegrationSuite`. ## How was this patch tested? Existing integration tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25453 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22461.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22461 commit fb7b9d22812f789b970f8dae8a9cd3763c0eccff Author: seancxmao Date: 2018-09-19T03:18:38Z [SPARK-25453] OracleIntegrationSuite IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLega...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22453 @HyukjinKwon Could you please help review this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22453: [SPARK-20937][DOCS] Describe spark.sql.parquet.wr...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22453 [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide ## What changes were proposed in this pull request? Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-20937 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22453.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22453 commit 3af33a31f528059b5f4a66e8ba10bf945eb6fa53 Author: seancxmao Date: 2018-09-18T14:32:18Z [SPARK-20937][DOCS] Describe spark.sql.parquet.writeLegacyFormat property in Spark SQL, DataFrames and Datasets Guide --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/22343 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Sure, I will close this PR. Thank you all for your time and insights. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 I agree that correctness is more important. If we should not make behaviors consistent when doing the conversion, I will close this PR. @cloud-fan @gatorsmile what do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Setting spark.sql.hive.convertMetastoreParquet=false keeps Hive compatibility but loses the performance benefit of the conversion. We can do better by enabling the conversion while still keeping Hive compatibility. Though this makes our implementation more complex, I guess most end users keep `spark.sql.hive.convertMetastoreParquet=true` and `spark.sql.caseSensitive=false`, which are the default values, so this brings benefits to end users. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Could we see this as a behavior change? We can add a legacy conf (e.g. `spark.sql.hive.legacy.convertMetastoreParquet`, which may be defined in HiveUtils) to enable users to revert to the previous behavior for backward compatibility. If this legacy conf is set to true, behaviors will be reverted in both case-sensitive and case-insensitive mode.

|caseSensitive|legacy behavior|new behavior|
|-|-|-|
|true|convert anyway|skip conversion, log warning message|
|false|convert, fail if there's ambiguity|convert, first match if there's ambiguity|

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
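If that direction were taken, the flag could be declared alongside the existing convertMetastore* entries; below is a minimal sketch, assuming the proposed (hypothetical) conf name above and the usual `buildConf` helper used in HiveUtils. It is not an actual Spark conf.

```scala
import org.apache.spark.sql.internal.SQLConf.buildConf

// Hypothetical legacy flag, modeled on the existing entries in
// org.apache.spark.sql.hive.HiveUtils; it does not exist in Spark today.
val HIVE_LEGACY_CONVERT_METASTORE_PARQUET =
  buildConf("spark.sql.hive.legacy.convertMetastoreParquet")
    .doc("When true, restore the previous conversion behavior: convert Hive parquet " +
      "tables even in case-sensitive mode, and fail on ambiguous parquet field names " +
      "in case-insensitive mode.")
    .booleanConf
    .createWithDefault(false)
```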
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 @dongjoon-hyun It is a little complicated. There has been a discussion about this in #22184. Below are some key comments from @cloud-fan and @gatorsmile, just FYI. * https://github.com/apache/spark/pull/22184#discussion_r212834477 * https://github.com/apache/spark/pull/22184#issuecomment-416885728 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 Hi, @dongjoon-hyun When we find duplicated field names in the convertMetastoreXXX case, we have 2 options: (1) Raise an exception, as the parquet data source does. Most end users do not know the difference between a hive parquet table and a parquet data source table. If the conversion leads to different behaviors, they may be confused, and in some cases this may even lead to tricky data issues silently. (2) Adjust the behavior of the parquet data source to keep behaviors consistent. This seems more friendly to end users and avoids any potential issues introduced by the conversion. BTW, for a parquet data source table which is not converted from a hive parquet table, we raise an exception when there is ambiguity, since this is more intuitive and reasonable. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
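To make the ambiguity concrete, here is a small illustration (paths, table names and data are made up; the exact outcome depends on the Spark version and on whether #22148 is applied): a parquet file whose schema contains both `c` and `C`, exposed through a table that declares only a lower-case column.

```scala
// Illustrative only. The data file ends up with both 'c' and 'C', while the table
// schema declares a single lower-case column, so resolution happens in the reader.
spark.conf.set("spark.sql.caseSensitive", true)
spark.range(5).selectExpr("id AS c", "id * 2 AS C")
  .write.mode("overwrite").parquet("/tmp/dup_case_parquet")

spark.sql(
  """CREATE TABLE dup_case_ds (c LONG) USING parquet
    |LOCATION '/tmp/dup_case_parquet'""".stripMargin)

spark.conf.set("spark.sql.caseSensitive", false)
// Option (1): the parquet data source fails because both 'c' and 'C' match.
// Option (2): a hive serde table over the same file would silently return the first match.
spark.sql("SELECT c FROM dup_case_ds").show()
```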
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @cloud-fan @gatorsmile I think the old `Upgrading From Spark SQL 2.3.1 to 2.3.2 and above` is not needed since we do not backport SPARK-25132 to branch-2.3. I'm wondering if we need `Upgrading From Spark SQL 2.3 to 2.4 and above`. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22343: [SPARK-25391][SQL] Make behaviors consistent when...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22343#discussion_r216212552 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala --- @@ -1390,7 +1395,11 @@ class ParquetSchemaSuite extends ParquetSchemaTest { catalystSchema = new StructType(), expectedSchema = ParquetSchemaConverter.EMPTY_MESSAGE, -caseSensitive = true) +conf = { + val conf = new Configuration() + conf.setBoolean(SQLConf.CASE_SENSITIVE.key, true) --- End diff -- @cloud-fan There is no default value for `spark.sql.caseSensitive` in Configuration. Let me explain in more detail below. This is one of the overloaded methods of `testSchemaClipping`. I tried to give this `testSchemaClipping` method a default conf, however scalac complains that > in class ParquetSchemaSuite, multiple overloaded alternatives of testSchemaClipping define default arguments https://github.com/apache/spark/blob/95673cdacb1da65d7ac45c212a365e167ae0d713/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L1014-L1019 https://github.com/apache/spark/blob/95673cdacb1da65d7ac45c212a365e167ae0d713/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L1028-L1033 It seems a little confusing, because these two methods have different parameter types. After a brief investigation, I found that the Scala compiler simply disallows overloaded methods with default arguments, even when these methods have different parameter types. https://stackoverflow.com/questions/4652095/why-does-the-scala-compiler-disallow-overloaded-methods-with-default-arguments --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
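A tiny self-contained reproduction of that compiler restriction, outside the Spark codebase (names are made up):

```scala
// scalac rejects this even though the overloads have different parameter types:
// "multiple overloaded alternatives of method clip define default arguments"
object OverloadDemo {
  def clip(name: String, caseSensitive: Boolean = true): Unit = ()
  def clip(name: String, conf: Map[String, String] = Map.empty): Unit = ()
}
```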
[GitHub] spark issue #22343: [SPARK-25391][SQL] Make behaviors consistent when conver...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 @dongjoon-hyun @HyukjinKwon I created a new JIRA ticket and tried to use a more complete and clearer title for this PR. What do you think? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 @dongjoon-hyun Thank you! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 > ... we need this duplication check in case-sensitive mode ... Do you mean we may define ORC/Parquet schema with identical field names (even in the same letter case)? Would you please explain a bit more on this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consi...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22343 @HyukjinKwon @cloud-fan @gatorsmile Could you please kindly help review this if you have time? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22262#discussion_r216121510 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala --- @@ -116,6 +116,14 @@ object OrcUtils extends Logging { }) } else { val resolver = if (isCaseSensitive) caseSensitiveResolution else caseInsensitiveResolution +// Need to fail if there is ambiguity, i.e. more than one field is matched. +requiredSchema.fieldNames.foreach { requiredFieldName => + val matchedOrcFields = orcFieldNames.filter(resolver(_, requiredFieldName)) --- End diff -- That's all right :) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 @dongjoon-hyun That's all right :). I have reverted to the first commit and adjusted the indentation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 I updated the PR description. Thank you for pointing out that the PR description should stay focused. I also think it's clearer now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 @dongjoon-hyun I have updated the PR description to explain in more detail. As you mentioned, this PR is specific to the case of reading from a data source table persisted in the metastore. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @cloud-fan I've just sent a PR (#22343) for this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22343: [SPARK-25132][SQL][FOLLOW-UP] The behavior must b...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22343 [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the conversion ## What changes were proposed in this pull request? Parquet data source tables and hive parquet tables have different behaviors regarding parquet field resolution. So, when `spark.sql.hive.convertMetastoreParquet` is true, users might face inconsistent behaviors. The differences are: * Whether to respect spark.sql.caseSensitive. Without #22148, both data source tables and hive tables do NOT respect spark.sql.caseSensitive. However, data source tables always do case-sensitive parquet field resolution, while hive tables always do case-insensitive parquet field resolution, no matter whether spark.sql.caseSensitive is set to true or false. #22148 lets data source tables respect spark.sql.caseSensitive, while hive serde table behavior is not changed. * How to resolve ambiguity in case-insensitive mode. Without #22148, data source tables do case-sensitive resolution and return columns with the corresponding letter cases, while hive tables always return the first matched column, ignoring case. #22148 lets data source tables throw an exception when there is ambiguity, while hive table behavior is not changed. This PR aims to make behaviors consistent when converting a hive table to a data source table: * The behavior must be consistent to do the conversion, so we skip the conversion in case-sensitive mode because hive parquet tables always do case-insensitive field resolution. * When converting a hive parquet table to the parquet data source, we switch the duplicated-field resolution mode so that the parquet data source picks the first matched field - the same behavior as a hive parquet table - to keep behaviors consistent. ## How was this patch tested? Unit tests added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25132-CONSISTENCY Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22343.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22343 commit 95673cdacb1da65d7ac45c212a365e167ae0d713 Author: seancxmao Date: 2018-09-04T08:00:31Z [SPARK-25132][SQL][FOLLOW-UP] The behavior must be consistent to do the conversion --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
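Condensed into a hedged sketch (a hypothetical helper, not the actual RelationConversions code), the rule described above is:

```scala
// Sketch of the proposed rule: skip the conversion in case-sensitive mode because the
// Hive parquet reader is always case-insensitive; otherwise convert, resolving
// duplicate parquet fields by "first match" to stay consistent with Hive.
def shouldConvertHiveParquetTable(
    convertMetastoreParquet: Boolean,
    caseSensitive: Boolean): Boolean = {
  if (!convertMetastoreParquet) {
    false
  } else if (caseSensitive) {
    false // fall back to the Hive serde reader and log a warning
  } else {
    true // convert, with duplicate fields resolved as "first match"
  }
}
```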
[GitHub] spark pull request #22183: [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/22183 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 > My proposal is, parquet data source should provide an option(not SQL conf) to ... You mentioned this option is not a SQL conf. Could you give me some advice about where this option should be defined? I had thought to define this option in SQLConf as something like `spark.sql.parquet.onDuplicatedFields` = FAIL, FIRST_MATCH, since I see a bunch of options starting with `spark.sql.parquet` in SQLConf. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
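If it ended up as a per-read data source option rather than a SQL conf, usage might look like the sketch below; the option name `onDuplicatedFields` is only the suggestion from this thread, not an existing Spark option.

```scala
// Hypothetical: pass the duplicate-field behavior as a parquet data source option.
val df = spark.read
  .option("onDuplicatedFields", "FIRST_MATCH") // or "FAIL"
  .parquet("/path/to/parquet_table")
```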
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @cloud-fan OK, I will do it. Just to confirm: when reading from a hive parquet table, if `spark.sql.hive.convertMetastoreParquet` and `spark.sql.caseSensitive` are both set to true, we throw an exception to tell users they should not do this because it could lead to inconsistent results. Is my understanding correct? Another thing to confirm: when there is ambiguity in case-insensitive mode, the native parquet reader throws an exception while the hive serde reader returns the first matched field, which is not consistent. Is that OK? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22262: [SPARK-25175][SQL] Field resolution should fail if there...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22262 @dongjoon-hyun @cloud-fan @gatorsmile Could you please kindly review this? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22262: [SPARK-25175][SQL] Field resolution should fail i...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22262 [SPARK-25175][SQL] Field resolution should fail if there is ambiguity for ORC native reader ## What changes were proposed in this pull request? This PR aims to make ORC data source native implementation consistent with Parquet data source. The gap is that field resolution should fail if there is ambiguity in case-insensitive mode when reading from ORC. See #22148 for more details about parquet data source reader. ## How was this patch tested? Unit tests added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25175 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22262.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22262 commit 366bb35ad62edb8e6707e65c681a5b9001cc868e Author: seancxmao Date: 2018-08-17T10:06:28Z [SPARK-25175][SQL] Field resolution should fail if there's ambiguity for ORC native reader --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r213020789 --- Diff: docs/sql-programming-guide.md --- @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above + + - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched. --- End diff -- As a follow-up to cloud-fan's point, I did a deep dive into the read path of parquet hive serde tables. Following is a rough invocation chain: ``` org.apache.spark.sql.hive.execution.HiveTableScanExec org.apache.spark.sql.hive.HadoopTableReader (extends org.apache.spark.sql.hive.TableReader) org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat (extends org.apache.hadoop.mapred.FileInputFormat) org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper (extends org.apache.hadoop.mapred.RecordReader) parquet.hadoop.ParquetRecordReader parquet.hadoop.InternalParquetRecordReader org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport (extends parquet.hadoop.api.ReadSupport) ``` Finally, `DataWritableReadSupport#getFieldTypeIgnoreCase` is invoked. https://github.com/JoshRosen/hive/blob/release-1.2.1-spark2/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java#L79-L95 This is why parquet hive serde tables always do case-insensitive field resolution. However, this is a class inside `org.spark-project.hive:hive-exec:1.2.1.spark2`. I also found the related Hive JIRA ticket: [HIVE-7554: Parquet Hive should resolve column names in case insensitive manner](https://issues.apache.org/jira/browse/HIVE-7554) BTW: * org.apache.hadoop.hive.ql = org.spark-project.hive:hive-exec:1.2.1.spark2 * parquet.hadoop = com.twitter:parquet-hadoop-bundle:1.6.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
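Paraphrased in Scala, the lookup performed by `DataWritableReadSupport#getFieldTypeIgnoreCase` is roughly the following; this is a simplification of the Hive code linked above, not a verbatim port.

```scala
// Roughly what the Hive reader does per requested column: scan the parquet fields and
// return the first one whose name matches ignoring case. This is why hive serde tables
// are always case-insensitive and silently pick the first match on duplicates.
def fieldTypeIgnoreCase(parquetFieldNames: Seq[String], hiveColumnName: String): Option[String] =
  parquetFieldNames.find(_.equalsIgnoreCase(hiveColumnName))
```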
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212894532 --- Diff: docs/sql-programming-guide.md --- @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above + + - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched. --- End diff -- As a followup, I also did investigation about ORC. Below are some results. Just FYI. * https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593185&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593185 * https://issues.apache.org/jira/browse/SPARK-25175?focusedCommentId=16593194&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16593194 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
Github user seancxmao commented on a diff in the pull request: https://github.com/apache/spark/pull/22184#discussion_r212405373 --- Diff: docs/sql-programming-guide.md --- @@ -1895,6 +1895,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see - Since Spark 2.4, File listing for compute statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `False`. - Since Spark 2.4, Metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during Statistics computation. +## Upgrading From Spark SQL 2.3.1 to 2.3.2 and above + + - In version 2.3.1 and earlier, when reading from a Parquet table, Spark always returns null for any column whose column names in Hive metastore schema and Parquet schema are in different letter cases, no matter whether `spark.sql.caseSensitive` is set to true or false. Since 2.3.2, when `spark.sql.caseSensitive` is set to false, Spark does case insensitive column name resolution between Hive metastore schema and Parquet schema, so even column names are in different letter cases, Spark returns corresponding column values. An exception is thrown if there is ambiguity, i.e. more than one Parquet column is matched. --- End diff -- Following your advice, I did a thorough comparison between data source tables and hive serde tables. Parquet data and tables are created via the following code: ``` val data = spark.range(5).selectExpr("id as a", "id * 2 as B", "id * 3 as c", "id * 4 as C") spark.conf.set("spark.sql.caseSensitive", true) data.write.format("parquet").mode("overwrite").save("/user/hive/warehouse/parquet_data") CREATE TABLE parquet_data_source_lower (a LONG, b LONG, c LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data' CREATE TABLE parquet_data_source_upper (A LONG, B LONG, C LONG) USING parquet LOCATION '/user/hive/warehouse/parquet_data' CREATE TABLE parquet_hive_serde_lower (a LONG, b LONG, c LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data' CREATE TABLE parquet_hive_serde_upper (A LONG, B LONG, C LONG) STORED AS parquet LOCATION '/user/hive/warehouse/parquet_data' ``` spark.sql.hive.convertMetastoreParquet is set to false: ``` spark.conf.set("spark.sql.hive.convertMetastoreParquet", false) ``` Below are the comparison results both without #22148 and with #22148.
The comparison result without #22148:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|resolved by SPARK-25132|
|-|-|-|-|-|-|-|-|
|1|true|a, b, c|a|a|a|Y| |
|2| | |b|null|B|NG| |
|3| | |c|c|c|Y| |
|4| | |A|AnalysisException|AnalysisException|Y| |
|5| | |B|AnalysisException|AnalysisException|Y| |
|6| | |C|AnalysisException|AnalysisException|Y| |
|7| |A, B, C|a|AnalysisException|AnalysisException|Y| |
|8| | |b|AnalysisException|AnalysisException|Y| |
|9| | |c|AnalysisException|AnalysisException|Y| |
|10| | |A|null|a|NG| |
|11| | |B|B|B|Y| |
|12| | |C|C|c|NG| |
|13|false|a, b, c|a|a|a|Y| |
|14| | |b|null|B|NG|Y|
|15| | |c|c|c|Y| |
|16| | |A|a|a|Y| |
|17| | |B|null|B|NG|Y|
|18| | |C|c|c|Y| |
|19| |A, B, C|a|null|a|NG|Y|
|20| | |b|B|B|Y| |
|21| | |c|C|c|NG| |
|22| | |A|null|a|NG|Y|
|23| | |B|B|B|Y| |
|24| | |C|C|c|NG| |

The comparison result with #22148 applied:

|no.|caseSensitive|table columns|select column|parquet column (select via data source table)|parquet column (select via hive serde table)|consistent?|introduced by SPARK-25132|
|-|-|-|-|-|-|-|-|
|1|true|a, b, c|a|a|a|Y| |
|2| | |b|null|B|NG| |
|3| | |c|c|c|Y| |
|4| | |A|AnalysisException|AnalysisException|Y| |
|5| | |B|AnalysisException|AnalysisException|Y| |
|6| | |C|AnalysisException|AnalysisException|Y| |
|7| |A, B, C|a|AnalysisException|AnalysisException|Y| |
|8| | |b|AnalysisException|AnalysisException|Y| |
|9| | |c|AnalysisException|AnalysisException|Y| |
|10| | |A|null|a|NG| |
|11| | |B|B|B|Y| |
|12| | |C|C|c|NG| |
|13|false|a, b, c|a|a|a|Y| |
|14| | |b|B|B|Y| |
|15| | |c|RuntimeException|c|NG|Y|
|16| | |A|a|a|Y| |
|17| | |B
[GitHub] spark issue #22184: [SPARK-25132][SQL][DOC] Add migration doc for case-insen...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22184 @gatorsmile Could you kindly help trigger Jenkins and review? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22184: [SPARK-25132][SQL][DOC] Add migration doc for cas...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22184 [SPARK-25132][SQL][DOC] Add migration doc for case-insensitive field resolution when reading from Parquet ## What changes were proposed in this pull request? #22148 introduces a behavior change. We need to document it in the migration guide. ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25132-DOC Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22184.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22184 commit eae8a3c98f146765d25bbf529421ce3c7a92639b Author: seancxmao Date: 2018-08-22T09:17:55Z [SPARK-25132][SQL][DOC] Case-insensitive field resolution when reading from Parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22183: [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22183 [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive field resolution when reading from Parquet ## What changes were proposed in this pull request? This is a backport of https://github.com/apache/spark/pull/22148. Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, regardless of whether spark.sql.caseSensitive is set to true or false. This PR aims to add case-insensitive field resolution for ParquetFileFormat. * Do case-insensitive resolution only if Spark is in case-insensitive mode. * Field resolution should fail if there is ambiguity, i.e. more than one field is matched. ## How was this patch tested? Unit tests added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25132-2.3 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22183.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22183 commit 28315888eaae5a9c9160ea53eb6eb9a9af712958 Author: seancxmao Date: 2018-08-21T02:34:23Z [SPARK-25132][SQL][BACKPORT-2.3] Case-insensitive field resolution when reading from Parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #22148: [SPARK-25132][SQL] Case-insensitive field resolution whe...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22148 Thanks! --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22148: [SPARK-25132][SQL] Case-insensitive field resolut...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22148 [SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet ## What changes were proposed in this pull request? Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, regardless of whether spark.sql.caseSensitive is set to true or false. This PR aims to add case-insensitive field resolution for ParquetFileFormat. * Do case-insensitive resolution only if Spark is in case-insensitive mode. * Field resolution should fail if there is ambiguity, i.e. more than one field is matched. ## How was this patch tested? Unit tests added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25132-Parquet Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22148.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22148 commit e0a555354436b0099a7b48b3269b8c5ccb73571b Author: seancxmao Date: 2018-08-17T10:06:28Z [SPARK-25132][SQL] Case-insensitive field resolution when reading from Parquet --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
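The resolution rule described above can be sketched as follows; this is a simplification with made-up names, not the actual ParquetReadSupport clipping code.

```scala
// For each column required by the catalyst schema, find the matching parquet field.
// Case-sensitive mode: exact match only. Case-insensitive mode: ignore case, but fail
// if more than one parquet field matches, since silently picking one could return wrong data.
def resolveParquetField(
    parquetFieldNames: Seq[String],
    requiredName: String,
    caseSensitive: Boolean): Option[String] = {
  if (caseSensitive) {
    parquetFieldNames.find(_ == requiredName)
  } else {
    val matched = parquetFieldNames.filter(_.equalsIgnoreCase(requiredName))
    if (matched.size > 1) {
      throw new RuntimeException(
        s"Found duplicate field(s) '$requiredName': ${matched.mkString(", ")} in case-insensitive mode")
    }
    matched.headOption
  }
}
```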
[GitHub] spark issue #22142: [SPARK-25132][SQL] Case-insensitive field resolution whe...
Github user seancxmao commented on the issue: https://github.com/apache/spark/pull/22142 Split this into 2 PRs, one for Parquet and one for ORC. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22142: [SPARK-25132][SQL] Case-insensitive field resolut...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/22142 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22142: [SPARK-25132][SQL] case-insensitive field resolut...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/22142 [SPARK-25132][SQL] case-insensitive field resolution when reading from Parquet/ORC ## What changes were proposed in this pull request? Spark SQL returns NULL for a column whose Hive metastore schema and Parquet schema are in different letter cases, regardless of whether spark.sql.caseSensitive is set to true or false. This applies not only to Parquet, but also to ORC. Following is a brief summary: * ParquetFileFormat doesn't support case-insensitive field resolution. * native OrcFileFormat supports case-insensitive field resolution, however it cannot handle duplicate fields. * hive OrcFileFormat doesn't support case-insensitive field resolution. https://github.com/apache/spark/pull/15799 reverted case-insensitive resolution for ParquetFileFormat and hive OrcFileFormat. This PR brings it back and improves it to do case-insensitive resolution only if Spark is in case-insensitive mode. And field resolution will fail if there is ambiguity, i.e. more than one field is matched. ParquetFileFormat, native OrcFileFormat and hive OrcFileFormat are all supported. ## How was this patch tested? Unit tests added. You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-25132 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22142.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22142 commit 5c3d20b654609c86de9c24c9751ec34916f3aabd Author: seancxmao Date: 2018-08-17T10:06:28Z SPARK-25132: case-insensitive field resolution when reading from Parquet/ORC * Fix ParquetFileFormat * More than one Parquet column is matched * Fix OrcFileFormat (both native and hive implementations) * Fix issues according to review results: refactor test cases, code style, ... * Test cases: change parquet/orc file schema from a to A * Test cases: let different columns have different value series * Refine error message * Split multi-format test suite * Simplify test cases for ambiguous resolution * Simplify test cases to reduce code lines * Refine tests and comments --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21113: [SPARK-13136][SQL][FOLLOW-UP] Fix comment
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/21113 [SPARK-13136][SQL][FOLLOW-UP] Fix comment ## What changes were proposed in this pull request? Fix comment. Change `BroadcastHashJoin.broadcastFuture` to `BroadcastExchange.relationFuture`: https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L66 ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark SPARK-13136 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21113.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21113 commit 6aceb43c56641a38b97b5e090379ef5143f17dd0 Author: seancxmao Date: 2018-04-20T07:12:09Z fix comment issue --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveE...
Github user seancxmao closed the pull request at: https://github.com/apache/spark/pull/20597 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #20597: [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveE...
GitHub user seancxmao opened a pull request: https://github.com/apache/spark/pull/20597 [MINOR][TEST] Update from 2.2.0 to 2.2.1 in HiveExternalCatalogVersionsSuite ## What changes were proposed in this pull request? In `HiveExternalCatalogVersionsSuite`, latest version of 2.2.x is updated from 2.2.0 to 2.2.1 ## How was this patch tested? N/A You can merge this pull request into a Git repository by running: $ git pull https://github.com/seancxmao/spark wrong_versions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/20597.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #20597 commit 047e11913872454be03e353c4ba2f793b74ca3ae Author: seancxmao Date: 2018-02-13T08:23:53Z Latest version of 2.2.x is changed from 2.2.0 to 2.2.1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org