Thanks J-D!

I disabled replication because at the time, every time I started it, the entire cluster would shut itself down. Is there any reason why the servers do not create the HLog immediately when they receive the start_replication command? Is there a less destructive way to stop and start the replication? Would removing the peer yield better results? (By the way, it would be nice if the shell had a "show_peers" command.)

The way I planned on using the replication is a kind of "manual bi-directional" operation. The idea is to have each cluster in a different data center, one primary and one backup. Initially the replication is configured from primary to backup. If the primary data center goes down, we will switch traffic to the backup DC and reverse the direction of the replication, so new writes will eventually sync back to the primary when it comes back online. Right now I see two problems with this plan:
1) It seems that the servers crash if they can't talk to the peer ZK ensemble, which is really a huge problem.
2) I can't be certain when the HLogs will actually start being written unless I restart the entire secondary cluster after reversing the replication direction.

Am I right in my understanding of the current state of things?

Really appreciate your help!

-eran

On Mon, Mar 28, 2011 at 20:29, Jean-Daniel Cryans <jdcry...@apache.org> wrote:

> Ah turns out the issue was way simpler than I thought. One example:
>
> 2011-03-25 13:55:02,103 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Replication is disabled, sleeping 1000 times 10
> 2011-03-25 13:55:07,762 INFO
> org.apache.hadoop.hbase.replication.ReplicationZookeeper: Replication
> is now started
> 2011-03-25 13:55:13,111 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
>
> (BTW, pro tip when debugging an issue with HBase is to go back to its
> first occurrence)
>
> The issue is that the cluster had replication disabled, which is
> REALLY disruptive as it disables every replication feature including
> adding new logs to replicate, meaning that if the server starts with
> replication disabled it won't even add a single log to replicate.
>
> Here's an example of when a new log was finally added after a long
> time of "No log to process":
>
> 2011-03-24 03:56:20,538 DEBUG
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: No
> log to process, sleeping 1000 times 10
> 2011-03-24 03:56:22,848 DEBUG
> org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period
> 3600000ms elapsed
> 2011-03-24 03:56:22,911 INFO
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using
> syncFs -- HDFS-200
> 2011-03-24 03:56:22,974 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog: Roll
> /hbase/.logs/hadoop1-s04.farm-ny.etc
>
> That was the previous day; on the 25th, when replication was enabled,
> no other log with data in it was rolled, so none was added to
> replicate.
>
> Bottom line, disabling replication is a kill switch and should only
> be used with that functionality in mind. Starting the cluster with
> replication enabled should make it work right away for you.
>
> Thx!
>
> J-D
>
> On Sun, Mar 27, 2011 at 2:21 AM, Eran Kutner <e...@notspammer.com> wrote:
>> Had more time to look into it and verified that indeed data is not replicated
>> because the server doesn't see it in the log. So I tried restarting the RS
>> and sure enough when the table (which has only one region) transitioned to
>> another RS the replication started working (for new data only).
>> So I tried with another table, and same thing: replication doesn't work and
>> the logs say "No log to process", but after restarting the RS and a table
>> transition the replication started working for that table too. Is there
>> something that gets initialized during a transition that could be missing
>> before?
>>
>> -eran
>>
>
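
For anyone following the thread, the "go back to its first occurrence" tip quoted above can be scripted. This is only a minimal Python sketch: it assumes each entry in the region server log sits on a single line (the excerpts in this thread were wrapped by the mail client) and that the file is in chronological order. The names `first_occurrences` and `LINE_RE` are made up for illustration and are not part of HBase.

```python
import re

# Line layout inferred from the log excerpts quoted in this thread:
# "<date> <time>,<millis> <LEVEL> <logger class>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<logger>\S+): (?P<msg>.*)$"
)


def first_occurrences(lines, needles):
    """Map each needle to the timestamp of the first log line whose
    message contains it. Lines that do not match the layout are skipped."""
    found = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        for needle in needles:
            if needle not in found and needle in m.group("msg"):
                found[needle] = m.group("ts")
    return found


# Sample entries copied from the excerpt above, re-joined onto one line each.
sample = [
    "2011-03-25 13:55:02,103 DEBUG org.apache.hadoop.hbase.replication."
    "regionserver.ReplicationSource: Replication is disabled, sleeping 1000 times 10",
    "2011-03-25 13:55:07,762 INFO org.apache.hadoop.hbase.replication."
    "ReplicationZookeeper: Replication is now started",
    "2011-03-25 13:55:13,111 DEBUG org.apache.hadoop.hbase.replication."
    "regionserver.ReplicationSource: No log to process, sleeping 1000 times 10",
]

needles = ["Replication is disabled", "Replication is now started", "No log to process"]
for needle, ts in first_occurrences(sample, needles).items():
    print(ts, "first:", needle)
```

To run it against a real log, pass an open file object instead of `sample`, since the function only iterates over lines.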