[jira] [Commented] (SPARK-16032) Audit semantics of various insertion operations related to partitioned tables

Ryan Blue (JIRA) Tue, 21 Jun 2016 09:42:18 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342135#comment-15342135
 ]


Ryan Blue commented on SPARK-16032:
-----------------------------------

I agree with the push to unify the Hive and DataSource implementations, but I 
don't think this work was done in time to make it into the 2.0 release.

I know there's a lot of tech debt here, and I don't like the 
inconsistencies either, but we rely on the Hive integration, and disabling it 
instead of fixing it isn't a good option. Disabling adds tech debt, and rushing 
these changes adds even more. The pre-insert rule, for example, has special cases for 
{{HadoopFsRelation}} and {{InsertableRelation}} linked in by 
{{LogicalRelation}}, is now located in a confusing package, 
{{execution.datasources}}, and is "a little mess". I'm glad you guys caught 
cases that I didn't have covered, but shouldn't the special cases get fixed? 
Shouldn't new tests be applied to all sources if this is targeted at fixing 
inconsistencies?

bq. For example, before our changes, {{INSERT OVERWRITE TABLE hive_src 
PARTITION (b, c) SELECT 7, 8 AS c, 9 AS b;}} and {{INSERT OVERWRITE TABLE 
hive_src SELECT 7, 8 AS c, 9 AS b;}} have totally different behaviors.

The second version isn't valid in Hive and I think there should be discussion 
and a design for how it works. Partition data is explicit in Hive via the 
{{PARTITION (b, c)}} syntax, not a convention based on its position in a column 
list. This set of changes makes all Hive writes rely on position, but 
previously you could use {{partitionBy}} to set the partitions explicitly, 
without relying on the convention. Why is that not a valid use, when it matches 
Hive's behavior more closely?

bq. seems it is not necessary to use {{partitionBy}}

A common use case is moving data from a temp table into a partitioned table 
using data frames by pulling out a date column. Right now that works by simply 
running 
{{sqlContext.table("src").write.partitionBy("utc_date").insertInto("dest")}}, 
with no knowledge of where you have to put the date column for it to work. Jobs 
that explicitly set the partition data like this will all break if this patch 
set goes into 2.0.
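
To make the breakage concrete, here is a minimal sketch in plain Python (not 
Spark code) of what a purely positional insert does with such a job. The column 
names {{id}} and {{value}}, the column orders, and the helpers 
{{insert_by_position}} and {{move_partition_columns_last}} are assumptions for 
illustration only:

```python
# Assumed column order of the source temp table and of the destination table.
# In dest, utc_date is the partition column and therefore comes last.
SRC_COLUMNS = ["utc_date", "id", "value"]
DEST_COLUMNS = ["id", "value", "utc_date"]


def insert_by_position(dest_columns, row):
    """Pair the i-th value of the row with the i-th destination column."""
    if len(row) != len(dest_columns):
        raise ValueError("column count mismatch")
    return dict(zip(dest_columns, row))


def move_partition_columns_last(src_columns, row, partition_columns):
    """Reorder a row so that the partition columns trail the data columns."""
    data = [c for c in src_columns if c not in partition_columns]
    order = data + list(partition_columns)
    return [row[src_columns.index(c)] for c in order]


src_row = ["2016-06-21", 42, "payload"]

# Written as-is, the date lands in the id column.
misplaced = insert_by_position(DEST_COLUMNS, src_row)

# The write is only correct once the caller knows to reorder the columns.
reordered = move_partition_columns_last(SRC_COLUMNS, src_row, ["utc_date"])
correct = insert_by_position(DEST_COLUMNS, reordered)

print(misplaced)
print(correct)
```

The reordering step is exactly the knowledge that {{partitionBy}} lets a job 
avoid today.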

bq. {{saveAsTable}} actually never works with Hive SerDe tables

The doc says "the support in 1.6 use byposition resolution" for Hive, which 
was implemented using {{InsertIntoTable}}. Since we have an implementation that 
can do by-name resolution with {{InsertIntoTable}}, why not use it here to fix 
the behavior instead of disabling it?
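
Conceptually, by-name resolution amounts to something like the sketch below. 
This is plain Python, not the actual {{InsertIntoTable}} code; the helper 
{{align_by_name}} and the column names are hypothetical:

```python
def align_by_name(table_columns, row):
    """Order the values of a {column: value} row to match the table schema."""
    missing = [c for c in table_columns if c not in row]
    extra = [c for c in row if c not in table_columns]
    if missing or extra:
        raise ValueError(
            f"missing columns {missing}, unexpected columns {extra}"
        )
    return [row[c] for c in table_columns]


# Assumed table schema, partition column last.
table_columns = ["id", "value", "utc_date"]

# Incoming row whose columns are in a different order than the table.
row = {"utc_date": "2016-06-21", "id": 42, "value": "payload"}

aligned = align_by_name(table_columns, row)
print(aligned)
```

With this kind of alignment the order of the incoming columns no longer 
matters, and a mismatch in column names fails loudly instead of writing values 
into the wrong columns.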

> Audit semantics of various insertion operations related to partitioned tables
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16032
>                 URL: https://issues.apache.org/jira/browse/SPARK-16032
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Wenchen Fan
>            Priority: Blocker
>         Attachments: [SPARK-16032] Spark SQL table insertion auditing - 
> Google Docs.pdf
>
>
> We found that semantics of various insertion operations related to partition 
> tables can be inconsistent. This is an umbrella ticket for all related 
> tickets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
