[ 
https://issues.apache.org/jira/browse/HIVE-13348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15513804#comment-15513804
 ] 

Sushanth Sowmyan commented on HIVE-13348:
-----------------------------------------

Removing the gsoc tag, as this was not taken up for GSoC.

> Add Event Nullification support for Replication
> -----------------------------------------------
>
>                 Key: HIVE-13348
>                 URL: https://issues.apache.org/jira/browse/HIVE-13348
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Import/Export
>            Reporter: Sushanth Sowmyan
>
> Replication, as implemented by HIVE-7973, works as follows:
> a) For every single modification to the Hive metastore, an event gets 
> triggered that logs a notification object.
> b) Replication tools such as Falcon can consume these notification objects as 
> an HCatReplicationTaskIterator from 
> HCatClient.getReplicationTasks(lastEventId, maxEvents, dbName, tableName).
> c) For each event, we generate statements and distcp requirements for Falcon 
> to export, distcp and import to do the replication (along with requisite 
> changes to export and import that would allow state management).
> The big thing missing from this picture is that while it works, it is pretty 
> dumb about how it works: it exhaustively processes every single event 
> generated, and tries to do the export-distcp-import cycle for all 
> modifications, irrespective of whether or not the result will actually get 
> used at import time.
> We need to build some sort of filtering logic which can process a batch of 
> events to identify events that will result in effective no-ops, and to 
> nullify those events in the stream before passing it on. The goal is to 
> minimize the number of events that tools like Falcon would actually have 
> to process.
> Examples of cases where event nullification would take place:
> a) CREATE-DROP cases: If an object is being created in event#34 that will 
> eventually get dropped in event#47, then there is no point in replicating 
> it. We simply null out both of these events, and also any other event 
> that references this object between event#34 and event#47.
> b) APPEND-APPEND: Some objects are replicated wholesale, which means every 
> APPEND that occurs would cause a full export of the object in question. At 
> this point, the prior APPENDs would all be supplanted by the last APPEND. 
> Thus, we could nullify all the prior such events.
> Additional such cases can be inferred by analysis of the Export-Import replay 
> protocol definition at 
> https://issues.apache.org/jira/secure/attachment/12725999/EXIMReplicationReplayProtocol.pdf
>  or by reasoning out the various event processing orders possible.
> Replication, as implemented by HIVE-7973, is merely a first step for 
> functional support. This work is needed for replication to be efficient at 
> all, and thus, usable.
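
The two nullification cases in the description above (CREATE-DROP and APPEND-APPEND) could be sketched roughly as follows. This is only an illustration of the filtering idea, not Hive code: the Event record, its fields and the nullify() function are hypothetical names, and the sketch assumes that every object in the batch is replicated wholesale and that a matching CREATE and DROP both fall inside the same batch.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for a metastore notification event."""
    event_id: int
    kind: str  # e.g. "CREATE", "DROP", "APPEND"
    obj: str   # name of the object the event refers to


def nullify(events: List[Event]) -> List[Event]:
    """Return the events that still need replicating, in original order."""
    dead: Set[int] = set()  # ids of events that are effective no-ops

    # CREATE-DROP: when an object is created and later dropped within the
    # batch, null out the CREATE, the DROP, and every event in between
    # that references the same object.
    open_create: Dict[str, int] = {}  # object name -> index of its CREATE
    for i, ev in enumerate(events):
        if ev.kind == "CREATE":
            open_create[ev.obj] = i
        elif ev.kind == "DROP" and ev.obj in open_create:
            start = open_create.pop(ev.obj)
            for other in events[start:i + 1]:
                if other.obj == ev.obj:
                    dead.add(other.event_id)

    # APPEND-APPEND: for wholesale-replicated objects only the last
    # surviving APPEND matters, so every earlier APPEND is nulled out.
    last_append: Dict[str, int] = {}  # object name -> id of last APPEND
    for ev in events:
        if ev.kind == "APPEND" and ev.event_id not in dead:
            last_append[ev.obj] = ev.event_id
    for ev in events:
        if ev.kind == "APPEND" and last_append.get(ev.obj) != ev.event_id:
            dead.add(ev.event_id)

    return [ev for ev in events if ev.event_id not in dead]


if __name__ == "__main__":
    batch = [
        Event(34, "CREATE", "db.tmp"),
        Event(35, "APPEND", "db.sales"),
        Event(36, "APPEND", "db.tmp"),
        Event(40, "APPEND", "db.sales"),
        Event(47, "DROP", "db.tmp"),
    ]
    for ev in nullify(batch):
        print(ev)
```

In this sample batch, events 34, 36 and 47 cancel out under CREATE-DROP, and event 35 is supplanted by event 40 under APPEND-APPEND, so only event 40 would be passed on. A DROP whose CREATE happened before the batch is deliberately left alone, since the object already exists on the destination and still has to be dropped there.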



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
