[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

Cheng Lian (JIRA) Tue, 21 Jun 2016 05:13:36 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341658#comment-15341658
 ]


Cheng Lian commented on SPARK-16032:
------------------------------------

Hey [~rdblue], [~yhuai] and [~cloud_fan] have already covered a lot of the
details. I think one problem here is that we made all the changes within a
relatively short time and across quite a few PRs, so it can be hard to grasp
the main idea as well as to track all the details.

I agree that by-name resolution can be more intuitive for users, but the most
important issue we would like to address here is *inconsistency*. The attached
report shows that table insertion behavior can be inconsistent across the
following dimensions:

# Table type: Hive SerDe table or data source table
# Partition type: static or dynamic
# Write mode: overwriting or appending
# API type: SQL or DataFrame DSL
# Resolution manner: by name or by position
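
To give a rough sense of how large this matrix is, here is a minimal Python
sketch that enumerates every combination of the five dimensions above. The
labels are illustrative strings taken from the list, not Spark identifiers,
and not every combination is necessarily a meaningful case in Spark.

```python
from itertools import product

# The five dimensions from the list above. The labels are illustrative
# strings, not Spark identifiers.
DIMENSIONS = {
    "table type": ["Hive SerDe", "data source"],
    "partition type": ["static", "dynamic"],
    "write mode": ["overwrite", "append"],
    "API type": ["SQL", "DataFrame DSL"],
    "resolution": ["by name", "by position"],
}


def all_cases():
    """Return one dict per combination of dimension values."""
    names = list(DIMENSIONS)
    return [dict(zip(names, values)) for values in product(*DIMENSIONS.values())]


ALL_CASES = all_cases()
print(len(ALL_CASES))  # 2 ** 5 = 32 combinations
```

Even with only two values per dimension, there are 32 cells to audit.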

The subtasks listed here mostly aim to ensure consistent and predictable
behavior. Unfortunately, we noticed all these issues too late, and it is hard
to cover all the dimensions at once at this stage. From my perspective, the
reasons why we couldn't cover by-name resolution here are:

# Hive and conventional SQL databases have clearly defined semantics for
by-position resolution.
# By-name resolution can be useful, but it takes more time to arrive at
reasonable and consistent syntax and semantics for both SQL and the DataFrame
API.
# [PR #12313|https://github.com/apache/spark/pull/12313] only covers the code
path of Hive SerDe tables, and it could be hard to cover all the cases if the
inconsistencies across the first 3 dimensions are not properly addressed first.

For these reasons, we decided to follow the conventional, standard SQL
semantics, i.e. by-position resolution, to ensure consistent semantics
first.
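
To make the difference between the two resolution manners concrete, here is a
toy Python model. It is only a sketch and not Spark's analyzer: the function
names, the example columns and the error handling are all hypothetical.

```python
def resolve_by_position(table_columns, query_columns, row):
    """Assign the i-th value to the i-th table column.

    The query's own column names are ignored (hypothetical helper).
    """
    if len(row) != len(table_columns):
        raise ValueError("column count mismatch")
    return dict(zip(table_columns, row))


def resolve_by_name(table_columns, query_columns, row):
    """Assign each value to the table column named like its query column.

    Hypothetical helper; requires the two sets of names to match exactly.
    """
    if sorted(query_columns) != sorted(table_columns):
        raise ValueError("column names do not match")
    named = dict(zip(query_columns, row))
    return {col: named[col] for col in table_columns}


# Made-up example: the query produces the same columns as the table,
# but in a different order.
table_columns = ["id", "value", "part"]
query_columns = ["part", "id", "value"]
row = ("2016-06-21", 1, "a")

by_position = resolve_by_position(table_columns, query_columns, row)
by_name = resolve_by_name(table_columns, query_columns, row)

print(by_position)  # "id" receives "2016-06-21": names are ignored
print(by_name)      # "id" receives 1: values follow the column names
```

In this model the same input row ends up in different columns depending on
the resolution manner, which is why mixing the two across code paths leads
to surprising results.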

You may argue that we dropped the existing by-name resolution feature in some
cases. However, this feature wasn't well defined in the first place, and can
be error-prone and misleading. On the other hand, once all existing
inconsistencies are fixed, it will be a lot easier to introduce by-name
resolution in 2.1 after we come up with a better design.

> Audit semantics of various insertion operations related to partitioned tables
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16032
>                 URL: https://issues.apache.org/jira/browse/SPARK-16032
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Wenchen Fan
>            Priority: Blocker
>         Attachments: [SPARK-16032] Spark SQL table insertion auditing - 
> Google Docs.pdf
>
>
> We found that the semantics of various insertion operations related to
> partitioned tables can be inconsistent. This is an umbrella ticket for all
> related tickets.



