[ https://issues.apache.org/jira/browse/HIVE-16266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584850#comment-16584850 ]
Sushanth Sowmyan commented on HIVE-16266:
-----------------------------------------

Hi [~akolb], apologies if this reply is no longer accurate ([~anishek] or [~sankarh] might be able to clarify if things have changed - I have not been active with Hive for a year now), but at the time the repl subsystem was written, that's correct, and by intention.

The basic idea is this - Hive has two types of tables: MANAGED, where Hive is responsible for the storage, and EXTERNAL, where some other external program is responsible for the storage. A key way to think about this distinction is what happens when you do a DROP TABLE. For MANAGED tables, if a DROP TABLE is issued, Hive should delete the data on HDFS, since we own and manage the data as well. For EXTERNAL tables, we are guests, and some other tool is managing the data, and thus we should not touch it - we can drop the metadata, but we leave the data on HDFS alone.

Now, in the case where we're replicating from a primary to a secondary, if the table is an EXTERNAL table on the primary, then an external tool is managing it on the primary. But what about the secondary? Since the secondary is being "managed" by Hive Replication, and thus by Hive, we own and manage it, keeping it in sync with the primary. Thus, by definition, the copy is MANAGED even if the source is EXTERNAL.

If we kept it EXTERNAL, we would start having some weird midway behaviour that we'd have to add complex rules for - consider the same deletion scenario: if we have a DROP PARTITION on the source table, then by definition we do not delete the data on the source HDFS. The user will likely do an hdfs rm, refresh the data and might do an ADD PARTITION of new data.

Now, what about the destination? Should we delete the data corresponding to that DROP PARTITION on the destination? If so, then it is consistent with the behaviour for MANAGED, rather than EXTERNAL, and thus we should keep it as MANAGED.
If not, then we have leftover data sitting in HDFS in the same location, and if new data gets added as a result of an upcoming ADD PARTITION, then the resulting behaviour is unpredictable and depends on what the user did - it can be the correct new data, it can be a partial merge or a weird append. That gets messy fast.

So, for this problem and other possible unexpected problems, we decided to be consistent with the meaning of MANAGED and EXTERNAL, and always make repl destinations MANAGED. :)

> Enable function metadata to be written during bootstrap
> -------------------------------------------------------
>
>                 Key: HIVE-16266
>                 URL: https://issues.apache.org/jira/browse/HIVE-16266
>             Project: Hive
>          Issue Type: Sub-task
>          Components: repl
>    Affects Versions: 2.2.0
>            Reporter: anishek
>            Assignee: anishek
>            Priority: Major
>             Fix For: 3.0.0
>
>         Attachments: HIVE-16266.1.patch, HIVE-16266.2.patch, HIVE-16266.3.patch
>
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)