Thank you all for replying.

@JD: you asked about logs for 0.90. I ran that test two weeks back and don't have the logs at the moment, but you described the same behaviour: when an RS talks to ZK and hits a problem, it aborts itself. That looks like what I saw.
@Lars/@Jesse: Yes, rolling the log on every start/stop of replication is fairly disruptive. I agree that enabling/disabling a particular peer is more appropriate, since we keep enqueuing the new logs at the ReplicationSource. But at the moment there is no limit on the number of logs it keeps (a PriorityBlockingQueue is unbounded; its capacity is effectively Integer.MAX_VALUE).

For iii), on a log roll, ReplicationSourceManager tries to add the new log to the znodes of the peers, and throws an IOException when that fails. With ZK down, HBase is effectively down as well (the Master aborts itself, and the RS keeps waiting for both the Master and the ZK quorum), but the RS can still serve reads/writes to existing clients, with no splits, obviously. Not a serious issue, though.

Yeah, start/stop_replication begets interesting scenarios, which may lead to incomplete replication. It should be used only in extreme conditions.

Still looking at it...

Thanks,
Himanshu

On Thu, Apr 12, 2012 at 3:37 PM, lars hofhansl <[email protected]> wrote:
> Himanshu,
>
> please keep digging, though. This will be mission critical for us, and we'll
> be testing this heavily.
> If you find anything strange, by all means file a jira, squashing bugs here
> is critical.
>
>
> -- Lars
>
>
> ----- Original Message -----
> From: lars hofhansl <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc:
> Sent: Thursday, April 12, 2012 3:12 PM
> Subject: Re: HBase Replication use cases
>
> I think it's like J-D said. stop_replication is a kill switch.
> In 0.94+ we have start/stop_peer which suspends replication, but still keeps
> track of the logs to replicate.
>
>
> It would complicate the code a lot (IMHO) to start replicating from partial
> logs or to roll each and every log and then consider replication started only
> after the last log was rolled.
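
To make the unbounded-queue concern concrete (a suspended peer still "keeps track of the logs to replicate"), here is a minimal sketch in plain Java. It is not HBase code: the class name, the method name and the "hlog.NNNNNN" log names are made up for illustration, and plain strings stand in for whatever the real queue holds. The only thing it relies on is the documented behaviour of java.util.concurrent.PriorityBlockingQueue, which never rejects an offer() for capacity reasons.

```java
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical illustration only; none of these names exist in HBase.
class UnboundedLogQueueSketch {

    // Simulates `rolls` log rolls while nothing drains the queue
    // (e.g. the peer is suspended) and returns the number of pending logs.
    static int enqueueWhilePeerDisabled(PriorityBlockingQueue<String> queue, int rolls) {
        for (int i = 0; i < rolls; i++) {
            // offer() on a PriorityBlockingQueue never blocks and never
            // fails because of capacity: the queue simply grows.
            queue.offer(String.format("hlog.%06d", i));
        }
        return queue.size();
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<String> queue = new PriorityBlockingQueue<>();
        int pending = enqueueWhilePeerDisabled(queue, 10_000);
        System.out.println("pending logs: " + pending);
        // Always Integer.MAX_VALUE, because the queue is not capacity constrained.
        System.out.println("remaining capacity: " + queue.remainingCapacity());
        // Natural String ordering, so the oldest (lowest-numbered) log comes first.
        System.out.println("oldest log: " + queue.peek());
    }
}
```

If a cap on pending logs were wanted, it would have to be enforced by the caller (or by a different queue type), since this queue will not push back on its own.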
>
>
> ----- Original Message -----
> From: Jesse Yates <[email protected]>
> To: "[email protected]" <[email protected]>
> Cc: "[email protected]" <[email protected]>
> Sent: Thursday, April 12, 2012 2:56 PM
> Subject: Re: HBase Replication use cases
>
>
>
> On Apr 12, 2012, at 2:50 PM, lars hofhansl <[email protected]> wrote:
>
>> Thanks Himanshu,
>>
>> we're planning to use Replication for cross DC replication for DR (and we
>> added a bunch of stuff and fixed bugs in replication).
>>
>>
>> We'll have it always on (and only use stop/start_peer, which is new in 0.94+
>> to temporarily stop replication, rather than stop/start_replication)
>> HBASE-2611 is a problem. We did not have time recently to work on this.
>>
>> i) and ii) can be worked around by forcing a log roll on all region servers
>> after replication was enabled. Replication would be considered started after
>> the logs were rolled... But that is quite annoying.
>>
>
> Should we consider adding this as part of the replication code proper? Is
> there a smarter way to go about it?
>
> - Jesse
>> Is iii) still a problem in 0.92+? I thought we fixed that together with a).
>>
>> -- Lars
>>
>> ________________________________
>> From: Himanshu Vashishtha <[email protected]>
>> To: [email protected]
>> Sent: Thursday, April 12, 2012 12:11 PM
>> Subject: HBase Replication use cases
>>
>> Hello All,
>>
>> I have been doing testing on the HBase replication (0.90.4, and 0.92
>> variants).
>>
>> Here are some of the findings:
>>
>> a) 0.90+ is not that great in handling znode changes; in an
>> ongoing replication, if I delete a peer and a region server goes to
>> the znode to update the log status, the region server aborts itself
>> when it sees a missing znode.
>>
>> Recoverable Zookeeper seems to have fixed this in 0.92+?
>>
>> 0.92 has a lot of new features (start/stop handle, master master, cyclic).
>>
>> But there are corner cases with the start/stop switches.
>>
>> i) A log is enqueued when the replication state is set to true. When we
>> start the cluster, it is true and the starting region server takes the
>> new log into the queue. If I do a stop_replication, and there is a log
>> roll, and then I do a start_replication, the current log will not be
>> replicated, as it has missed the opportunity of being added to the queue.
>>
>> ii) If I _start_ a region server when the replication state is set to
>> false, its log will not be added to the queue. Now, if I do a
>> start_replication, its log will not be replicated.
>>
>> iii) Removing a peer doesn't result in master region server abort, but
>> in case zk is down and there is a log roll, it will abort. Not a
>> serious one as zk is down so the cluster is not healthy anyway.
>>
>> I was looking for jiras (including 2611), and stumbled upon 2223. I
>> don't think there is anything like time based partition behavior (as
>> mentioned in the jira description). Though the patch has a lot of other
>> nice things which indeed are in existing code. Please correct me if I
>> miss anything.
>>
>> Having said that, I wonder how other folks out there use it:
>> their experience, common issues (minor + major) they come across.
>> I did find a ppt by Jean Daniel at oscon mentioning using it in
>> SU production.
>>
>> I plan to file jiras for the above ones and will start digging in.
>>
>> Looking forward to your responses.
>>
>> Thanks,
>> Himanshu
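
For anyone who wants to reason about cases i) and ii) without a cluster, here is a toy model in plain Java. It is only a sketch of the behaviour as described above, not HBase code: the class, its methods and the "log-N" names are all hypothetical, and the single rule it assumes is the one stated in i), namely that a log is enqueued only if the replication state is true at the moment the log is opened.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of one region server; none of these names exist in HBase.
class ReplicationToggleSketch {
    private boolean replicating;
    private final List<String> queue = new ArrayList<>();

    ReplicationToggleSketch(boolean replicating, String firstLog) {
        this.replicating = replicating;
        openLog(firstLog);
    }

    // Assumed rule from case i): a log joins the queue only if replication
    // is enabled at the moment the log is opened.
    private void openLog(String name) {
        if (replicating) {
            queue.add(name);
        }
    }

    void rollLog(String nextLog) { openLog(nextLog); }

    void setReplicating(boolean on) { replicating = on; }

    boolean isQueued(String log) { return queue.contains(log); }

    public static void main(String[] args) {
        // Case i): stop_replication, log roll, start_replication.
        ReplicationToggleSketch rs1 = new ReplicationToggleSketch(true, "log-1");
        rs1.setReplicating(false);
        rs1.rollLog("log-2");
        rs1.setReplicating(true);
        System.out.println("i)  log-2 queued: " + rs1.isQueued("log-2")); // false

        // Case ii): region server started while replication is off.
        ReplicationToggleSketch rs2 = new ReplicationToggleSketch(false, "log-1");
        rs2.setReplicating(true);
        System.out.println("ii) log-1 queued: " + rs2.isQueued("log-1")); // false

        // Workaround discussed in the thread: force a roll after enabling.
        rs2.rollLog("log-2");
        System.out.println("after forced roll, log-2 queued: " + rs2.isQueued("log-2")); // true
    }
}
```

In this model the forced roll only helps from the new log onward; whatever was written to the skipped log after start_replication is still not queued, which matches the remark that replication would only be considered started after the logs were rolled.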
