[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22868

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229156006

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    +```
    +
    +Instead, you should create `v1` as below with column aliases explicitly specified.
    +
    +```
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2;
    --- End diff --

    Sure, updated as well.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229155030

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
     * View
       * If column aliases are not specified in view definition queries, both Spark and Hive will
     generate alias names, but in different ways. In order for Spark to be able to read views created
    -by Hive, users should explicitly specify column aliases in view definition queries.
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    It seems Hive 1.x does not allow `(` following `CREATE VIEW ... AS`, while Hive 2.x accepts it. The following works on Hive 1.2.1, 1.2.2 and 2.3.3.

    ```
    CREATE VIEW v1 AS SELECT c1 + 1, upper(c2) FROM t1;
    ```

    Another finding is that the above view is readable by Spark, though the view column names are odd (`_c0`, `_c1`). This is because Spark adds a `Project` between `View` and the view definition query if their output attributes do not match.

    ```
    spark-sql> explain extended v1;
    ...
    == Analyzed Logical Plan ==
    _c0: int, _c1: string
    Project [_c0#44, _c1#45]
    +- SubqueryAlias v1
       +- View (`default`.`v1`, [_c0#44,_c1#45])
          +- Project [cast((c1 + 1)#48 as int) AS _c0#44, cast(upper(c2)#49 as string) AS _c1#45] // this is added by AliasViewChild rule
             +- Project [(c1#46 + 1) AS (c1 + 1)#48, upper(c2#47) AS upper(c2)#49]
                +- SubqueryAlias t1
                   +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#46, c2#47]
    ...
    ```

    However, if column aliases are missing in subqueries of the view definition query, Spark will not be able to read the view.
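To make the difference between the two cases concrete, here is a minimal Python sketch. It is a toy model, not Spark or Hive code. Only the generated names (`_c0`, `_c1`, `(c1 + 1)`, `upper(c2)`) come from the comment above. The helper names `read_top_level_view` and `read_nested_view` are hypothetical. It also assumes that the view text stored by Hive refers to the subquery's columns by Hive's generated names, which the thread does not confirm.

```python
# Toy model of column-name resolution for a Hive-created view.
# It is NOT Spark or Hive code; only the generated names come from the thread.

HIVE_NAMES = ["_c0", "_c1"]               # names Hive generates for unaliased expressions
SPARK_NAMES = ["(c1 + 1)", "upper(c2)"]   # names Spark generates for the same expressions


def read_top_level_view(view_schema, query_output):
    """Map query output onto the view schema by position, like the added Project."""
    if len(view_schema) != len(query_output):
        raise ValueError("view schema and query output differ in length")
    return [f"{src} AS {dst}" for src, dst in zip(query_output, view_schema)]


def read_nested_view(outer_references, subquery_output):
    """Resolve the outer query's column references by name against the subquery."""
    missing = [name for name in outer_references if name not in subquery_output]
    if missing:
        raise LookupError(f"cannot resolve {missing} given input columns {subquery_output}")
    return list(outer_references)


# Top level: a positional rename hides the naming difference.
print(read_top_level_view(HIVE_NAMES, SPARK_NAMES))

# Nested: ASSUMPTION - the stored view text refers to t2's columns by Hive's names.
try:
    read_nested_view(HIVE_NAMES, SPARK_NAMES)
except LookupError as err:
    print("unreadable:", err)

# With explicit aliases both engines agree on the names, so resolution succeeds.
print(read_nested_view(["inc_c1", "upper_c2"], ["inc_c1", "upper_c2"]))
```

The point of the sketch is that a positional rename can paper over differing generated names at the top level, while a by-name reference into a subquery cannot.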
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229150147

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    Good ideas. I have simplified the example and tested the example above using Hive 2.3.3 and Spark 2.3.1.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229064730

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    Thanks for the finding. I'd like to remove `upper(c)` like the following.

    ```sql
    CREATE VIEW v1 AS SELECT * FROM (SELECT c + 1 FROM (SELECT 1 c) t1) t2;
    ```
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229063489

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
     * View
       * If column aliases are not specified in view definition queries, both Spark and Hive will
     generate alias names, but in different ways. In order for Spark to be able to read views created
    -by Hive, users should explicitly specify column aliases in view definition queries.
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    @dongjoon-hyun Oh, thanks. Is that because it requires an explicit correlation? Sorry, I don't have a 1.2.2 environment to try it out.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229062732

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    @dongjoon-hyun I was thinking that calling `upper` on an int column is probably not very intuitive :-) What do you think about adding a string literal in the projection?

    ```
    SELECT c + 1, upper(d) FROM (SELECT 1 c, 'test' AS d) t1
    ```
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229060706

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
     * View
       * If column aliases are not specified in view definition queries, both Spark and Hive will
     generate alias names, but in different ways. In order for Spark to be able to read views created
    -by Hive, users should explicitly specify column aliases in view definition queries.
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    BTW, @dilipbiswal, the above query `CREATE VIEW v1 AS (SELECT c1 + 1, upper(c2) FROM t1);` seems to fail on Hive 1.2.2.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229059209

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    +```
    +
    +Instead, you should create `v1` as below with column aliases explicitly specified.
    +
    +```
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2;
    --- End diff --

    Also, let's update this one together.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229059043

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    Could you simplify more by removing `CREATE TABLE` and using the following view creation?

    ```sql
    CREATE VIEW v1 AS SELECT * FROM (SELECT c + 1, upper(c) FROM (SELECT 1 c) t1) t2;
    ```
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r229051205

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
     * View
       * If column aliases are not specified in view definition queries, both Spark and Hive will
     generate alias names, but in different ways. In order for Spark to be able to read views created
    -by Hive, users should explicitly specify column aliases in view definition queries.
    +by Hive, users should explicitly specify column aliases in view definition queries. As an
    +example, Spark cannot read `v1` created as below by Hive.
    +
    +```
    +CREATE TABLE t1 (c1 INT, c2 STRING);
    +CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
    --- End diff --

    nit: We could perhaps simplify the query to:

    ```
    CREATE VIEW v1 AS (SELECT c1 + 1, upper(c2) FROM t1);
    ```

    What do you think?
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user seancxmao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r228845332

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries.
    --- End diff --

    Good idea. I have added an example.
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r228776349

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries.
    --- End diff --

    +1
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
Github user dilipbiswal commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22868#discussion_r228766091

    --- Diff: docs/sql-migration-guide-hive-compatibility.md ---
    @@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
     * Explain
     * Partitioned tables including dynamic partition insertion
     * View
    +  * If column aliases are not specified in view definition queries, both Spark and Hive will
    +generate alias names, but in different ways. In order for Spark to be able to read views created
    +by Hive, users should explicitly specify column aliases in view definition queries.
    --- End diff --

    @seancxmao Thanks for adding the doc. Can a small example here help illustrate this better?
[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...
GitHub user seancxmao opened a pull request:

    https://github.com/apache/spark/pull/22868

    [SPARK-25833][SQL][DOCS] Update migration guide for Hive view compatibility

    ## What changes were proposed in this pull request?
    Both Spark and Hive support views. However, in some cases views created by Hive are not readable by Spark. For example, if column aliases are not specified in view definition queries, both Spark and Hive will generate alias names, but in different ways. In order for Spark to be able to read views created by Hive, users should explicitly specify column aliases in view definition queries. Given that Hive and Spark are often used together in enterprise data warehouses, this PR aims to describe this compatibility issue explicitly so that users can troubleshoot it easily.

    ## How was this patch tested?
    Docs are manually generated and checked locally.

    ```
    SKIP_API=1 jekyll serve
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/seancxmao/spark SPARK-25833

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22868

commit e5b3a11c2cedcbbe528cc72d465ab6e27f5215e3
Author: seancxmao
Date:   2018-10-28T06:46:10Z

    update migration guide for hive view compatibility
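The advice to specify column aliases explicitly could be checked mechanically before a view is created in Hive. Below is a rough Python sketch, assuming a simplified select list. The helper names `split_top_level` and `unaliased` are hypothetical and belong to neither Spark nor Hive. The check only recognizes aliases written with `AS`, and it does not handle string literals that contain commas or parentheses.

```python
import re


def split_top_level(select_list):
    """Split a select list on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in select_list:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def unaliased(select_list):
    """Return expressions that are neither a bare column name nor end in `AS alias`."""
    flagged = []
    for expr in split_top_level(select_list):
        is_bare_column = re.fullmatch(r"[A-Za-z_][A-Za-z0-9_.]*", expr) is not None
        has_alias = re.search(r"\s+AS\s+[A-Za-z_][A-Za-z0-9_]*$", expr, re.IGNORECASE) is not None
        if not (is_bare_column or has_alias):
            flagged.append(expr)
    return flagged


# Select list from the view that Spark cannot read.
print(unaliased("c1 + 1, upper(c2)"))

# Select list from the corrected view with explicit aliases.
print(unaliased("c1 + 1 AS inc_c1, upper(c2) AS upper_c2"))
```

Bare column references are not flagged because both engines keep the column's own name for them. Only computed expressions need an explicit alias here.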