It sounds like your master cluster resolves bk1-4 as ds1-4. Could you check that by doing a ping on those hostnames from those machines? Otherwise I can't see what the error could be at the moment...

J-D
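(A quick way to check what those hostnames resolve to from the JVM itself, rather than relying only on ping, is a short JRuby snippet run the same way as the other scripts in this thread. This is only a sketch; the checkdns.rb file name and the hostname arguments are just examples.)

#------------------------------------------------------
# Sketch only: run on a master-cluster machine as
#   hbase org.jruby.Main checkdns.rb bk1 bk2 bk3 bk4
# (checkdns.rb is an ad-hoc name, not something shipped with HBase)
include Java
import java.net.InetAddress

# Print the address each hostname resolves to on this machine.
ARGV.each do |host|
  addr = InetAddress.getByName(host)
  puts "#{host} -> #{addr.getHostAddress}"
end
#------------------------------------------------------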
On Mon, Dec 13, 2010 at 3:55 PM, Nathaniel Cook <[email protected]> wrote:
> Running the 'ls /hbase/rs' cmd through zkcli on the master I get:
>
> [ds2.internal,60020,1292278767510, ds3.internal,60020,1292278776930,
> ds1.internal,60020,1292278759087, ds4.internal,60020,1292278792724]
>
> On my slave cluster I get:
>
> [bk1.internal,60020,1292278881467, bk3.internal,60020,1292278895189,
> bk2.internal,60020,1292278888034, bk4.internal,60020,1292278905096]
>
> But as I mentioned, the peer it chooses is ds4 from the master cluster.
>
> Could it be that for some reason the Configuration passed to
> ZooKeeperWrapper.createInstance for the slave cluster isn't honored
> and is defaulting to the local connection settings? I am running a
> QuorumPeer on the same machine as the RegionServers for these test
> clusters. Could it be finding the zoo.cfg file on that machine that
> points to the local quorum?
>
> To test this I wrote a quick jruby script:
>
> #------------------------------------------------------
> include Java
> import org.apache.hadoop.hbase.HBaseConfiguration
> import org.apache.hadoop.hbase.HConstants
> import org.apache.hadoop.conf.Configuration
> import org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper
>
> parts1 = ARGV[0].split(":")
>
> # Build a Configuration that points explicitly at the quorum,
> # client port, and parent znode given on the command line.
> c1 = HBaseConfiguration.create()
> c1.set(HConstants::ZOOKEEPER_QUORUM, parts1[0])
> c1.set("hbase.zookeeper.property.clientPort", parts1[1])
> c1.set(HConstants::ZOOKEEPER_ZNODE_PARENT, parts1[2])
>
> zkw = ZooKeeperWrapper.createInstance(c1, "ZK")
>
> # Write a throwaway child znode named 'test' under the parent znode.
> zkw.writeZNode(parts1[2], "test", "")
> #------------------------------------------------------------
>
> I ran it from the master cluster and gave it the address of the slave
> quorum with this command:
>
> hbase org.jruby.Main testZK.rb bk1,bk2,bk3:2181:/hbase
>
> The slave ZK quorum didn't have the '/hbase/test' node but the master
> ZK quorum did. The script didn't honor the specified configuration.
> Any thoughts?
>
>
> On Mon, Dec 13, 2010 at 4:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> Interesting... the fact that it says it's connecting to
>> bk1,bk2,bk3 means that it's looking at the right zookeeper ensemble.
>> What it does next is read all the znodes in /hbase/rs/ (which is
>> the list of live region servers) and choose a subset of them.
>>
>> Using the zkcli utility, could you check the value of those znodes and
>> see if they make sense? You can run it like this:
>>
>> bin/hbase zkcli
>>
>> And it will be run against the ensemble that that cluster is using.
>>
>> J-D
>>
>> On Mon, Dec 13, 2010 at 2:03 PM, Nathaniel Cook <[email protected]> wrote:
>>> When the master cluster chooses a peer, it is supposed to choose a peer
>>> from the slave cluster, correct?
>>>
>>> This is what I am seeing in the master cluster logs:
>>>
>>> Added new peer cluster bk1,bk2,bk3,2181,/hbase
>>> Getting 1 rs from peer cluster # test
>>> Choosing peer 192.168.1.170:60020
>>>
>>> But 192.168.1.170 is an address in the master cluster. I think this
>>> may be related to the problem I had while running the add_peer.rb
>>> script. When I ran that script it would only talk to the ZK quorum
>>> running on that machine and would not talk to the slave ZK quorum.
>>> Could it be that when it is trying to choose a peer, instead of going
>>> to the slave ZK quorum running on a different machine, it is talking
>>> only to the ZK quorum running on its localhost?
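(Another way to narrow down whether the supplied quorum is actually being contacted is to bypass ZooKeeperWrapper altogether and list /hbase/rs with the plain ZooKeeper client, which only knows about the connect string it is handed. This is a rough sketch, not part of HBase: the listrs.rb name and the NullWatcher class are placeholders.)

#------------------------------------------------------
# Sketch only: run as
#   hbase org.jruby.Main listrs.rb bk1:2181,bk2:2181,bk3:2181 /hbase
# (listrs.rb is an ad-hoc name; NullWatcher is defined here, not by HBase)
include Java
import org.apache.zookeeper.ZooKeeper

# Minimal Watcher implementation that ignores all events.
class NullWatcher
  include org.apache.zookeeper.Watcher
  def process(event); end
end

quorum, parent = ARGV[0], ARGV[1]
zk = ZooKeeper.new(quorum, 30000, NullWatcher.new)
sleep 2  # crude: give the session a moment to establish

# List the live region server znodes registered under <parent>/rs.
zk.getChildren("#{parent}/rs", false).each { |child| puts child }
zk.close
#------------------------------------------------------

(If this prints the bk*.internal entries when run from a master-cluster machine, the connect string itself reaches the right ensemble, which would point the finger at the Configuration/ZooKeeperWrapper side instead.)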
>>>
>>>
>>>
>>> On Mon, Dec 13, 2010 at 2:51 PM, Nathaniel Cook <[email protected]> wrote:
>>>> Thanks for looking into this with me.
>>>>
>>>> Ok, so on the master region servers I am getting the two statements
>>>> 'Replicating x' and 'Replicated in total: y'.
>>>>
>>>> Nothing on the slave cluster.
>>>>
>>>> On Mon, Dec 13, 2010 at 12:28 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>> Hi Nathaniel,
>>>>>
>>>>> Thanks for trying out replication, let's make it work for you.
>>>>>
>>>>> So on the master side there are two lines that are important to make sure
>>>>> that replication works. First it has to say:
>>>>>
>>>>> Replicating x
>>>>>
>>>>> where x is the number of edits it's going to ship, and then:
>>>>>
>>>>> Replicated in total: y
>>>>>
>>>>> where y is the total number it replicated. Seeing the second line
>>>>> means that replication was successful, at least from the master's point
>>>>> of view.
>>>>>
>>>>> On the slave, one node should have:
>>>>>
>>>>> Total replicated: z
>>>>>
>>>>> where z is the number of edits that that region server applied on
>>>>> its cluster. It could be on any region server, since the sink for
>>>>> replication is chosen at random.
>>>>>
>>>>> Do you see those? Any exceptions around those logs apart from EOFs?
>>>>>
>>>>> Thx,
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Dec 13, 2010 at 10:52 AM, Nathaniel Cook <[email protected]> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to set up replication for my HBase clusters. I have two
>>>>>> small clusters for testing, each with 4 machines. The setup for the two
>>>>>> clusters is identical. Each machine runs a DataNode and an
>>>>>> HRegionServer. Three of the machines run a ZK peer and one machine
>>>>>> runs the HMaster and NameNode. The master cluster machines have
>>>>>> hostnames (ds1, ds2, ...) and the slave cluster's are (bk1, bk2, ...). I set
>>>>>> the replication scope to 1 for my test table column families and set
>>>>>> the hbase.replication property to true for both clusters. Next I ran
>>>>>> the add_peer.rb script with the following command on the ds1 machine:
>>>>>>
>>>>>> hbase org.jruby.Main /usr/lib/hbase/bin/replication/add_peer.rb ds1:2181:/hbase bk1:2181:/hbase
>>>>>>
>>>>>> After the script finishes, ZK for the master cluster has the
>>>>>> replication znode with children peers, master, and state. The slave
>>>>>> ZK didn't have a replication znode. I fixed that problem by rerunning
>>>>>> the script on the bk1 machine and commenting out the code that writes to
>>>>>> the master ZK. Now the slave ZK has the /hbase/replication/master
>>>>>> znode with data (ds1:2181:/hbase). Everything looked to be configured
>>>>>> correctly. I restarted the clusters. The logs of the master
>>>>>> regionservers stated:
>>>>>>
>>>>>> This cluster (ds1:2181:/hbase) is a master for replication, compared
>>>>>> with (ds1:2181:/hbase)
>>>>>>
>>>>>> The logs on the slave cluster stated:
>>>>>>
>>>>>> This cluster (bk1:2181:/hbase) is a slave for replication, compared
>>>>>> with (ds1:2181:/hbase)
>>>>>>
>>>>>> Using the hbase shell I put a row into the test table.
>>>>>>
>>>>>> The regionserver for that table had a log statement like:
>>>>>>
>>>>>> Going to report log #192.168.1.166%3A60020.1291757445179 for position
>>>>>> 15828 in
>>>>>> hdfs://ds1:9000/hbase/.logs/ds1.internal,60020,1291757445059/192.168.1.166%3A60020.1291757445179
>>>>>>
>>>>>> (192.168.1.166 is ds1)
>>>>>>
>>>>>> I waited, and even after several minutes the row still does not appear in
>>>>>> the slave cluster's table.
>>>>>>
>>>>>> Any help with what the problem might be is greatly appreciated.
>>>>>>
>>>>>> Both clusters are using CDH3b3. The HBase version is exactly
>>>>>> 0.89.20100924+28.
>>>>>>
>>>>>> -Nathaniel Cook
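(For reference, the per-column-family scope setting described above corresponds to hbase shell steps roughly like the following; 'test_table', 'cf', and the row/value here are placeholder names, not the actual table from this thread.)

#------------------------------------------------------
# Sketch only, in the hbase shell on the master cluster
# ('test_table' and 'cf' are placeholders for the real table and family):
disable 'test_table'
alter 'test_table', {NAME => 'cf', REPLICATION_SCOPE => '1'}
enable 'test_table'

# Then insert a test row that should get shipped to the slave:
put 'test_table', 'row1', 'cf:q', 'replication test value'
#------------------------------------------------------

(If replication is wired up correctly, a put like this should show up in the master region server log as the 'Replicating x' line mentioned earlier in the thread, followed by 'Total replicated' on one of the slave region servers.)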
