[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-27 Thread seancxmao
GitHub user seancxmao opened a pull request:

https://github.com/apache/spark/pull/22868

[SPARK-25833][SQL][DOCS] Update migration guide for Hive view compatibility

## What changes were proposed in this pull request?
Both Spark and Hive support views. However, in some cases views created by Hive
are not readable by Spark. For example, if column aliases are not specified in
view definition queries, both Spark and Hive will generate alias names, but in
different ways. In order for Spark to be able to read views created by Hive,
users should explicitly specify column aliases in view definition queries.

Given that it is not uncommon for Hive and Spark to be used together in
enterprise data warehouses, this PR explicitly describes this compatibility
issue to help users troubleshoot it easily.
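
For illustration, here is a minimal sketch of the incompatibility, drawn from the examples discussed in the review comments below (the second view is named `v2` here only so that both statements can be run in the same session):

```
CREATE TABLE t1 (c1 INT, c2 STRING);

-- Created by Hive without explicit column aliases: Hive and Spark generate
-- different alias names for `c1 + 1` and `upper(c2)`, so Spark may fail to read this view.
CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;

-- With column aliases explicitly specified, the view is readable by both engines.
CREATE VIEW v2 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2;
```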

## How was this patch tested?
Docs are manually generated and checked locally.

```
SKIP_API=1 jekyll serve
```

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/seancxmao/spark SPARK-25833

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22868.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22868


commit e5b3a11c2cedcbbe528cc72d465ab6e27f5215e3
Author: seancxmao 
Date:   2018-10-28T06:46:10Z

update migration guide for hive view compatibility




---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-28 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r228766091
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries.
--- End diff --

@seancxmao Thanks for adding the doc. Can a small example here help illustrate this better?


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-28 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r228776349
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries.
--- End diff --

+1


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r228845332
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,9 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries.
--- End diff --

Good idea. I have added an example.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229051205
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
 * View
   * If column aliases are not specified in view definition queries, both Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be able to read views created
-by Hive, users should explicitly specify column aliases in view definition queries.
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

nit: We could perhaps simplify the query to:
```
CREATE VIEW v1 AS (SELECT c1 + 1, upper(c2) FROM t1);
```
What do you think?


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229059043
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

Could you simplify more by removing `CREATE TABLE` and using the following view creation?
```sql
CREATE VIEW v1 AS SELECT * FROM (SELECT c + 1, upper(c) FROM (SELECT 1 c) t1) t2;
```


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229059209
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
+```
+
+Instead, you should create `v1` as below with column aliases explicitly specified.
+
+```
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2;
--- End diff --

Also, let's update this one together.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229060706
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
 * View
   * If column aliases are not specified in view definition queries, both Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be able to read views created
-by Hive, users should explicitly specify column aliases in view definition queries.
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

BTW, @dilipbiswal. The above query `CREATE VIEW v1 AS (SELECT c1 + 1, upper(c2) FROM t1);` seems to fail at Hive 1.2.2.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229062732
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

@dongjoon-hyun I was thinking, calling upper on an int column is probably not very intuitive :-)
What do you think about adding a string literal in the projection?

```
SELECT c + 1, upper(d) FROM (SELECT 1 c, 'test' AS d) t
```


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dilipbiswal
Github user dilipbiswal commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229063489
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
 * View
   * If column aliases are not specified in view definition queries, both Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be able to read views created
-by Hive, users should explicitly specify column aliases in view definition queries.
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

@dongjoon-hyun oh.. thanks.. because it requires an explicit correlation? Sorry, I don't have a 1.2.2 env to try it out.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread dongjoon-hyun
Github user dongjoon-hyun commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229064730
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

Thanks for the finding. I'd like to remove `upper(c)` like the following.
```sql
CREATE VIEW v1 AS SELECT * FROM (SELECT c + 1 FROM (SELECT 1 c) t1) t2;
```


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229150147
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

Good ideas. I have simplified the example and tested it using Hive 2.3.3 and Spark 2.3.1.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229155030
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -53,7 +53,20 @@ Spark SQL supports the vast majority of Hive features, such as:
 * View
   * If column aliases are not specified in view definition queries, both Spark and Hive will
 generate alias names, but in different ways. In order for Spark to be able to read views created
-by Hive, users should explicitly specify column aliases in view definition queries.
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
--- End diff --

It seems Hive 1.x does not allow `(` following `CREATE VIEW ... AS`, while Hive 2.x handles it fine. The following works on Hive 1.2.1, 1.2.2 and 2.3.3.

```
CREATE VIEW v1 AS SELECT c1 + 1, upper(c2) FROM t1;
```

Another finding is that the above view is readable by Spark, though the view column names are weird (`_c0`, `_c1`), because Spark will add a `Project` between the `View` and the view definition query if their output attributes do not match.

```
spark-sql> explain extended v1;
...
== Analyzed Logical Plan ==
_c0: int, _c1: string
Project [_c0#44, _c1#45]
+- SubqueryAlias v1
   +- View (`default`.`v1`, [_c0#44,_c1#45])
      +- Project [cast((c1 + 1)#48 as int) AS _c0#44, cast(upper(c2)#49 as string) AS _c1#45] // this is added by AliasViewChild rule
         +- Project [(c1#46 + 1) AS (c1 + 1)#48, upper(c2#47) AS upper(c2)#49]
            +- SubqueryAlias t1
               +- HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c1#46, c2#47]
...
```

But, if column aliases in subqueries of the view definition query are 
missing, Spark will not be able to read the view.
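
For example (a sketch based on the statements above; the second view is renamed `v2` here just so both definitions can coexist), Hive-created views behave like this when read by Spark:

```
-- Readable by Spark: aliases are only missing at the top level of the view query,
-- so Spark adds a compensating Project (as shown in the plan above).
CREATE VIEW v1 AS SELECT c1 + 1, upper(c2) FROM t1;

-- Not readable by Spark: the column aliases are missing inside the subquery t2.
CREATE VIEW v2 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
```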


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-29 Thread seancxmao
Github user seancxmao commented on a diff in the pull request:

https://github.com/apache/spark/pull/22868#discussion_r229156006
  
--- Diff: docs/sql-migration-guide-hive-compatibility.md ---
@@ -51,6 +51,22 @@ Spark SQL supports the vast majority of Hive features, such as:
 * Explain
 * Partitioned tables including dynamic partition insertion
 * View
+  * If column aliases are not specified in view definition queries, both Spark and Hive will
+generate alias names, but in different ways. In order for Spark to be able to read views created
+by Hive, users should explicitly specify column aliases in view definition queries. As an
+example, Spark cannot read `v1` created as below by Hive.
+
+```
+CREATE TABLE t1 (c1 INT, c2 STRING);
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1, upper(c2) FROM t1) t2;
+```
+
+Instead, you should create `v1` as below with column aliases explicitly specified.
+
+```
+CREATE VIEW v1 AS SELECT * FROM (SELECT c1 + 1 AS inc_c1, upper(c2) AS upper_c2 FROM t1) t2;
--- End diff --

Sure, updated as well.


---




[GitHub] spark pull request #22868: [SPARK-25833][SQL][DOCS] Update migration guide f...

2018-10-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/22868


---
