[GitHub] [spark] cloud-fan opened a new pull request #31930: [SPARK-34719][SQL] Correctly resolve the view query with duplicated column names

GitBox Mon, 22 Mar 2021 09:20:52 -0700


cloud-fan opened a new pull request #31930:
URL: https://github.com/apache/spark/pull/31930

forward-port https://github.com/apache/spark/pull/31811 to master

### What changes were proposed in this pull request?

For permanent views (and the new SQL temp views in Spark 3.1), we store the
view SQL text and re-parse and re-analyze it when reading the view. For a query
like `SELECT * FROM ...`, the view schema should not change when the referenced
table changes its schema, so we record the view query's output column names
when creating the view. When reading the view, we then add a
`SELECT recorded_column_names FROM ...` on top to retain the original view
query schema.
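
The recorded-column-names mechanism can be sketched in plain Scala, without Spark. This is only a simplified model: `StoredView`, `ViewSchemaRetention`, and `readView` are hypothetical names chosen for illustration, and column matching here is a plain string comparison rather than Spark's resolver.

```scala
// Hypothetical model, not a Spark API: a stored view keeps its SQL text and
// the output column names recorded when the view was created.
final case class StoredView(sqlText: String, recordedColumns: Seq[String])

object ViewSchemaRetention {
  // Models the extra `SELECT recorded_column_names FROM ...` step.
  // `currentQueryOutput` stands in for the column names produced by
  // re-analyzing the view SQL text against the table's current schema.
  def readView(
      view: StoredView,
      currentQueryOutput: Seq[String]): Either[String, Seq[String]] =
    view.recordedColumns.find(c => !currentQueryOutput.contains(c)) match {
      case Some(missing) =>
        Left(s"column '$missing' is missing from the view query output")
      case None =>
        // Only the recorded columns survive, in the recorded order.
        Right(view.recordedColumns)
    }
}
```

In this model, a view recorded with `[a]` still exposes only `a` after the referenced table gains a column `b`, because the added projection drops everything that was not recorded.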

In Spark 3.1 and before, the final SELECT is added after the analysis phase:
https://github.com/apache/spark/blob/branch-3.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/view.scala#L67

If the view query has duplicated output column names, we always pick the
first column when reading a view. A simple repro:
```
scala> sql("create view c(x, y) as select 1 a, 2 a")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("select * from c").show
+---+---+
|  x|  y|
+---+---+
|  1|  1|
+---+---+
```

In the master branch, reading the view fails instead. The commit
https://github.com/apache/spark/commit/b891862fb6b740b103d5a09530626ee4e0e8f6e3
adds the final SELECT during analysis, so the query fails with
`Reference 'a' is ambiguous`.

This PR proposes to resolve the view query output column names from the
matching attributes by ordinal.

For example, with `create view c(x, y) as select 1 a, 2 a`, the view query
output column names are `[a, a]`. When reading the view, there are 2
matching attributes (e.g. `[a#1, a#2]`) and we can simply match them by ordinal.
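
A minimal sketch of matching by ordinal, in plain Scala rather than against Catalyst. `Attr`, `OrdinalMatching`, and `resolveByOrdinal` are hypothetical names, and names are compared with `==` instead of Spark's resolver; this is an illustration of the idea, not the code in this PR.

```scala
// Hypothetical stand-in for a resolved attribute: a name plus a unique ID,
// so that `a#1` and `a#2` can be told apart.
final case class Attr(name: String, exprId: Int)

object OrdinalMatching {
  // For each recorded column name, pick the k-th attribute with that name,
  // where k is how many times the name already appeared earlier in `recorded`.
  // Returns None for a column that has no attribute at that ordinal.
  def resolveByOrdinal(
      recorded: Seq[String],
      queryOutput: Seq[Attr]): Seq[Option[Attr]] =
    recorded.zipWithIndex.map { case (name, i) =>
      val ordinal = recorded.take(i).count(_ == name)
      queryOutput.filter(_.name == name).lift(ordinal)
    }
}
```

For the recorded names `[a, a]` and the attributes `[a#1, a#2]`, the first `a` resolves to `a#1` and the second to `a#2`, instead of both resolving to the first match.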

A negative example is
```
create table t(a int)
create view v as select *, 1 as col from t
replace table t(a int, col int)
```
When reading the view, the recorded view query output column names are
`[a, col]`, but there are now two matching attributes for `col`, so we should
fail the query. See the tests for details.
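
One way to detect this case is to compare, for each recorded name, how often it was recorded with how many attributes match it now. The sketch below assumes that rule; the description above only states that two matches for `col` should fail, so the exact check, along with the names `CandidateCountCheck` and `check`, is an assumption made for illustration.

```scala
object CandidateCountCheck {
  // Assumed rule: a name recorded n times must match exactly n columns of the
  // re-analyzed view query output; any other count is reported as an error.
  def check(
      recorded: Seq[String],
      queryOutput: Seq[String]): Either[String, Unit] = {
    val mismatches = recorded.distinct.flatMap { name =>
      val expected = recorded.count(_ == name)
      val found = queryOutput.count(_ == name)
      if (expected == found) None
      else Some(s"column '$name': expected $expected matching attribute(s), found $found")
    }
    if (mismatches.isEmpty) Right(()) else Left(mismatches.mkString("; "))
  }
}
```

With recorded names `[a, col]` and a current query output of `[a, col, col]`, the check reports a mismatch for `col`, while `[a, a]` against `[a, a]` passes.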

### Why are the changes needed?

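Bug fix: a view whose query has duplicated output column names returns wrong
results in Spark 3.1 and fails to resolve in the master branch.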

### Does this PR introduce _any_ user-facing change?

yes

### How was this patch tested?

new test
