Himanshu, please keep digging, though. This will be mission critical for us, and we'll be testing this heavily. If you find anything strange, by all means file a jira; squashing bugs here is critical.
-- Lars

----- Original Message -----
From: lars hofhansl <[email protected]>
To: "[email protected]" <[email protected]>
Cc:
Sent: Thursday, April 12, 2012 3:12 PM
Subject: Re: HBase Replication use cases

I think it's like J-D said. stop_replication is a kill switch. In 0.94+ we have start/stop_peer, which suspends replication but still keeps track of the logs to replicate.

It would complicate the code a lot (IMHO) to start replicating from partial logs, or to roll each and every log and then consider replication started only after the last log was rolled.

----- Original Message -----
From: Jesse Yates <[email protected]>
To: "[email protected]" <[email protected]>
Cc: "[email protected]" <[email protected]>
Sent: Thursday, April 12, 2012 2:56 PM
Subject: Re: HBase Replication use cases

On Apr 12, 2012, at 2:50 PM, lars hofhansl <[email protected]> wrote:

> Thanks Himanshu,
>
> we're planning to use replication for cross-DC replication for DR (and we
> added a bunch of stuff and fixed bugs in replication).
>
> We'll have it always on (and only use stop/start_peer, which is new in
> 0.94+, to temporarily stop replication, rather than stop/start_replication).
>
> HBASE-2611 is a problem. We did not have time recently to work on this.
>
> i) and ii) can be worked around by forcing a log roll on all region servers
> after replication was enabled. Replication would be considered started
> after the logs were rolled... But that is quite annoying.

Should we consider adding this as part of the replication code proper? Is
there a smarter way to go about it?

- Jesse

> Is iii) still a problem in 0.92+? I thought we fixed that together with a).
>
> -- Lars
>
> ________________________________
> From: Himanshu Vashishtha <[email protected]>
> To: [email protected]
> Sent: Thursday, April 12, 2012 12:11 PM
> Subject: HBase Replication use cases
>
> Hello All,
>
> I have been doing testing on HBase replication (the 0.90.4 and 0.92
> variants).
>
> Here are some of the findings:
>
> a) 0.90+ is not that great at handling znode changes; during ongoing
> replication, if I delete a peer and a region server goes to the znode to
> update the log status, the region server aborts itself when it sees the
> missing znode.
>
> RecoverableZooKeeper seems to have fixed this in 0.92+?
>
> 0.92 has a lot of new features (start/stop handle, master-master, cyclic).
>
> But there are corner cases with the start/stop switches:
>
> i) A log is enqueued when the replication state is set to true. When we
> start the cluster, it is true and the starting region server takes the
> new log into the queue. If I do a stop_replication, there is a log roll,
> and then I do a start_replication, the current log will not be
> replicated, as it has missed the opportunity of being added to the queue.
>
> ii) If I _start_ a region server when the replication state is set to
> false, its log will not be added to the queue. Now, if I do a
> start_replication, its log will not be replicated.
>
> iii) Removing a peer doesn't result in a master region server abort, but
> if zk is down and there is a log roll, it will abort. Not a serious one,
> as zk is down so the cluster is not healthy anyway.
>
> I was looking for jiras (including HBASE-2611) and stumbled upon
> HBASE-2223. I don't think there is anything like time-based partition
> behavior (as mentioned in the jira description), though the patch has a
> lot of other nice things which are indeed in the existing code. Please
> correct me if I missed anything.
>
> Having said that, I wonder how other folks out there use it: their
> experience, and the common issues (minor and major) they come across.
> I did find a ppt by Jean-Daniel at OSCON mentioning its use in SU
> production.
>
> I plan to file jiras for the above and will start digging in.
>
> Looking forward to your responses.
>
> Thanks,
> Himanshu
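[Editor's note: the distinction Lars draws (stop_replication as a kill switch vs. start/stop_peer as a suspend) can be made concrete with a toy model. This is not HBase code; all class and method names below are illustrative, and it only sketches the queueing semantics described in the thread.]

```python
class RegionServerModel:
    """Toy model of a region server's replication log queue (illustrative)."""

    def __init__(self):
        self.replication_on = True   # cluster-wide kill switch (stop_replication)
        self.peer_suspended = False  # per-peer suspend (0.94+ start/stop_peer)
        self.queue = []              # logs tracked for replication
        self.shipped = []            # logs actually shipped to the peer

    def roll_log(self, name):
        # A new log is enqueued only if the kill switch is on. This is the
        # root of corner cases i) and ii): logs created while replication
        # is off are never tracked, even after start_replication.
        if self.replication_on:
            self.queue.append(name)

    def ship(self):
        # A suspended peer still *tracks* logs; it just stops shipping them,
        # so nothing is lost when the peer is resumed.
        if not self.peer_suspended:
            self.shipped.extend(self.queue)
            self.queue.clear()


rs = RegionServerModel()
rs.roll_log("log-1")

# stop_peer: shipping pauses, tracking continues.
rs.peer_suspended = True
rs.roll_log("log-2")
rs.ship()                    # nothing ships while suspended
rs.peer_suspended = False
rs.ship()                    # log-1 and log-2 both ship after resume

# stop_replication: kill switch. A log rolled now is never tracked.
rs.replication_on = False
rs.roll_log("log-3")         # missed: never enters the queue
rs.replication_on = True
rs.ship()                    # log-3 is silently absent
```

In this model, suspending the peer loses nothing, while toggling the kill switch around a log roll silently drops `log-3`, matching the behavior Himanshu observed.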
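[Editor's note: the workaround Lars mentions, forcing a log roll on every region server after enabling replication, can be sketched with a similar toy model. Again, this is illustrative Python, not HBase API; the point is only that a forced roll gives every server a fresh log created while the replication state is true, so each current log is guaranteed to be tracked.]

```python
class RegionServer:
    """Toy region server: its log is tracked only if created while replication is on."""

    def __init__(self, name, replication_on):
        self.name = name
        self.n = 0
        self.current_log = f"{name}-log-0"
        self.queue = []
        if replication_on:            # case ii): started while off -> untracked
            self.queue.append(self.current_log)

    def roll(self, replication_on):
        self.n += 1
        self.current_log = f"{self.name}-log-{self.n}"
        if replication_on:            # case i): rolled while off -> untracked
            self.queue.append(self.current_log)


# Servers started while replication is off: none of their logs are tracked.
replication_on = False
servers = [RegionServer(f"rs{i}", replication_on) for i in range(3)]

replication_on = True                 # start_replication
# Without a forced roll, every current log predates the switch and would be
# missed. Rolling all servers makes each new current log enter the queue;
# replication is then "considered started" from this point.
for rs in servers:
    rs.roll(replication_on)
```

This also shows why the workaround is annoying: the pre-roll logs (`*-log-0`) remain unreplicated, which is why Jesse asks whether the roll should be part of the replication code proper.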
