Also, while I have not yet wiki-ized the documentation for the above, I have uploaded slides from talks I've given at Hive user group meetups on the subject, along with a doc that describes the replication protocol followed for the EXIM replication; both are attached at https://issues.apache.org/jira/browse/HIVE-10264
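To make the plug-in mechanism above concrete, here is a minimal, self-contained sketch of the pattern: a consumer drains notification events and turns each into a replication task via a pluggable factory. The real types live in org.apache.hive.hcatalog.api.repl (HCatClient.getReplicationTasks(...) returns an iterator of ReplicationTask objects); the Event, Task, and TaskFactory interfaces below are simplified stand-ins, and the "DistCp-style" factory is a hypothetical example of custom logic replacing the stock EXPORT/IMPORT implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReplicationSketch {

    // Stand-in for a metastore notification event (type + object it concerns).
    static final class Event {
        final String type;   // e.g. "CREATE_TABLE", "ADD_PARTITION"
        final String object; // e.g. "db.table" or "db.table/partition"
        Event(String type, String object) { this.type = type; this.object = object; }
    }

    // Stand-in for org.apache.hive.hcatalog.api.repl.ReplicationTask:
    // one unit of replication work derived from one event.
    interface Task {
        String describe();
    }

    // Stand-in for ReplicationTask.Factory: maps events to tasks. Supplying
    // your own factory is how custom replication logic is injected.
    interface TaskFactory {
        Task create(Event e);
    }

    // A hypothetical custom factory, e.g. one that would shell out to DistCp
    // instead of using Hive EXPORT/IMPORT.
    static final class DistCpStyleFactory implements TaskFactory {
        @Override
        public Task create(Event e) {
            return () -> "distcp-replicate " + e.object + " (" + e.type + ")";
        }
    }

    // The consumer loop: drain pending events, turning each into a task
    // via the pluggable factory, and collect the resulting actions.
    static List<String> drain(Iterator<Event> events, TaskFactory factory) {
        List<String> actions = new ArrayList<>();
        while (events.hasNext()) {
            actions.add(factory.create(events.next()).describe());
        }
        return actions;
    }

    public static void main(String[] args) {
        Iterator<Event> events = Arrays.asList(
                new Event("CREATE_TABLE", "sales.orders"),
                new Event("ADD_PARTITION", "sales.orders/dt=2015-12-17")).iterator();
        for (String action : drain(events, new DistCpStyleFactory())) {
            System.out.println(action);
        }
    }
}
```

In the real API you would track the last event id you processed and pass it back on the next poll, so the sync process can run continuously and resume after failures.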
On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <khorg...@gmail.com> wrote:
> Hi,
>
> I think that the replication work added with https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this alley.
>
> Per Eugene's suggestion of MetaStoreEventListener, this replication system plugs into that and gets you a stream of notification events from HCatClient for the exact purpose you mention.
>
> There's some work still outstanding on this task, most notably documentation (sorry!), but please have a look at HCatClient.getReplicationTasks(...) and org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in your implementation of ReplicationTask.Factory to inject your own logic for how to handle the replication according to your needs. (Currently there exists an implementation that uses Hive EXPORT/IMPORT to perform replication - you can look at the code for this, and at the tests for these classes, to see how that is achieved. Falcon already uses this to perform cross-hive-warehouse replication.)
>
> Thanks,
>
> -Sushanth
>
> On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman <ekoif...@hortonworks.com> wrote:
>> Metastore supports MetaStoreEventListener and MetaStorePreEventListener, which may be useful here.
>>
>> Eugene
>>
>> From: Elliot West <tea...@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Thursday, December 17, 2015 at 8:21 AM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Synchronizing Hive metastores across clusters
>>
>> Hello,
>>
>> I'm thinking about the steps required to repeatedly push Hive datasets out from a traditional Hadoop cluster into a parallel cloud-based cluster. This is not a one-off; it needs to be a constantly running sync process. As new tables and partitions are added in one cluster, they need to be synced to the cloud cluster.
>> Assuming for a moment that I have the HDFS data syncing working, I'm wondering what steps I need to take to reliably ship the HCatalog metadata across. I use HCatalog as the point of truth as to when data is available and where it is located, and so I think that metadata is a critical element to replicate in the cloud-based cluster.
>>
>> Does anyone have any recommendations on how to achieve this in practice? One issue (of many, I suspect) is that Hive appears to store table/partition locations internally as absolute, fully qualified URLs, so unless the target cloud cluster is similarly named and configured, some path transformation step will be needed as part of the synchronisation process.
>>
>> I'd appreciate any suggestions, thoughts, or experiences related to this.
>>
>> Cheers - Elliot.
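The path-transformation step Elliot describes can be sketched with java.net.URI: keep the path from the fully qualified location Hive stores, and swap in the target cluster's scheme and authority. The hostnames, the bucket name, and the choice of the s3a scheme below are illustrative assumptions, not anything the thread specifies.

```java
import java.net.URI;
import java.net.URISyntaxException;

class LocationRewriter {

    // Rewrite a fully qualified table/partition location so it points at the
    // target cluster's filesystem, preserving the warehouse-relative path.
    public static String rewrite(String location,
                                 String targetScheme,
                                 String targetAuthority) throws URISyntaxException {
        URI src = new URI(location);
        // Swap scheme and authority (namenode host:port, or bucket name for
        // object stores); keep the path component unchanged.
        URI dst = new URI(targetScheme, targetAuthority, src.getPath(), null, null);
        return dst.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // A location as Hive stores it internally: absolute and fully qualified.
        // Hostname and bucket below are made up for illustration.
        String src = "hdfs://nn1.onprem.example.com:8020/warehouse/sales.db/orders/dt=2015-12-17";
        System.out.println(rewrite(src, "s3a", "my-cloud-warehouse"));
        // -> s3a://my-cloud-warehouse/warehouse/sales.db/orders/dt=2015-12-17
    }
}
```

In a real sync pipeline this rewrite would be applied to the location in each replicated table/partition definition before it is registered in the target metastore.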