[ https://issues.apache.org/jira/browse/SPARK-16032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342135#comment-15342135 ]
Ryan Blue commented on SPARK-16032:
-----------------------------------

I agree with the push to unify the Hive and DataSource implementations, but I think that this was not done in time to make it into the 2.0 release. I know there's a lot of tech debt here and I also don't like the inconsistencies, but we rely on the Hive integration, and disabling it instead of fixing it isn't a good option. That's more tech debt, and rushing these changes adds even more. The pre-insert rule, for example, has special cases for {{HadoopFsRelation}} and {{InsertableRelation}} linked in by {{LogicalRelation}}, is now located in a confusing package, {{execution.datasources}}, and is "a little mess". I'm glad you guys caught cases that I didn't have covered, but shouldn't the special cases get fixed? Shouldn't new tests be applied to all sources if this is targeted at fixing inconsistencies?

bq. For example, before our changes, {{INSERT OVERWRITE TABLE hive_src PARTITION (b, c) SELECT 7, 8 AS c, 9 AS b;}} and {{INSERT OVERWRITE TABLE hive_src SELECT 7, 8 AS c, 9 AS b;}} have totally different behaviors.

The second version isn't valid in Hive, and I think there should be discussion and a design for how it works. Partition data is explicit in Hive via the {{PARTITION (b, c)}} syntax, not a convention based on its position in a column list. This set of changes makes all Hive writes rely on position, but before you could use {{partitionBy}} to explicitly set the partitions without the convention. Why is that not a valid use, when it matches Hive's behavior more closely?

bq. seems it is not necessary to use {{partitionBy}}

A common use case is moving data from a temp table into a partitioned table using data frames by pulling out a date column. Right now that works by simply running {{sqlContext.table("src").write.partitionBy("utc_date").insertInto("dest")}}, with no knowledge of where you have to put the date column for it to work.
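To make the difference concrete, here is a minimal Python sketch of by-position versus by-name resolution. It is a toy model, not Spark's implementation: the helpers {{resolve_by_position}} and {{resolve_by_name}}, the {{id}} and {{value}} columns, and the sample row are all assumptions made for illustration; only the {{utc_date}} column name comes from the use case above.

```python
from typing import Dict, List, Sequence


def resolve_by_position(table_columns: Sequence[str],
                        row: Sequence[object]) -> Dict[str, object]:
    """Pair the i-th source value with the i-th table column.

    Source column names are ignored entirely; only order matters.
    """
    if len(row) != len(table_columns):
        raise ValueError("column count mismatch")
    return dict(zip(table_columns, row))


def resolve_by_name(table_columns: Sequence[str],
                    source_columns: Sequence[str],
                    row: Sequence[object]) -> Dict[str, object]:
    """Look each table column up by name in the source row."""
    named = dict(zip(source_columns, row))
    missing: List[str] = [c for c in table_columns if c not in named]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {c: named[c] for c in table_columns}


# Hypothetical destination table: data columns first, partition column last.
DEST_COLUMNS = ["id", "value", "utc_date"]

# Hypothetical source whose date column happens to come first.
SRC_COLUMNS = ["utc_date", "id", "value"]
SRC_ROW = ["2016-06-21", 7, "a"]

by_position = resolve_by_position(DEST_COLUMNS, SRC_ROW)
by_name = resolve_by_name(DEST_COLUMNS, SRC_COLUMNS, SRC_ROW)

# By position, the partition value is whatever happens to sit last in the row.
print(by_position["utc_date"])
# By name, the partition value is the source's utc_date, wherever it sits.
print(by_name["utc_date"])
```

In this model, by-position resolution only gives the right answer if the writer already knows the partition column has to come last, while by-name resolution is independent of the source column order.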
Jobs that explicitly set the partition data like this will all break if this patch set goes into 2.0.

bq. {{saveAsTable}} actually never works with Hive SerDe tables

The doc says "the support in 1.6 use byposition resolution" for Hive, which was implemented using {{InsertIntoTable}}. Since we have an implementation that can do by-name resolution with {{InsertIntoTable}}, why not use it here to fix the behavior instead of disabling it?

> Audit semantics of various insertion operations related to partitioned tables
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16032
>                 URL: https://issues.apache.org/jira/browse/SPARK-16032
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Cheng Lian
>            Assignee: Wenchen Fan
>            Priority: Blocker
>         Attachments: [SPARK-16032] Spark SQL table insertion auditing - Google Docs.pdf
>
>
> We found that semantics of various insertion operations related to partition
> tables can be inconsistent. This is an umbrella ticket for all related
> tickets.