So, I got it working :) Because of these strange connection/configuration issues I decided to just serve both clusters from one ZK quorum. I set zookeeper.znode.parent to hbase_bk, set up the replication again, and it is all working. It is even keeping up with some initial load testing. Thanks.
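A minimal sketch of what that looks like in hbase-site.xml on the cluster that gets the non-default parent; the absolute-path form /hbase_bk is an assumption here, only the property name and the default /hbase are standard:

<!-- hbase-site.xml: park this cluster's state under its own parent znode so both
     clusters can share one ZK quorum without colliding (the default parent is /hbase). -->
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase_bk</value>
</property>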
I think we should still look into why it couldn't talk to two different ZK quorums, but this works for now.

On Mon, Dec 13, 2010 at 5:38 PM, Nathaniel Cook <[email protected]> wrote:
> Yes, correct IP address.
>
> On Mon, Dec 13, 2010 at 5:24 PM, Jean-Daniel Cryans <[email protected]> wrote:
>> Just to be clear, does ping show the right IP address too? That's the real concern here.
>>
>> Thx
>>
>> J-D
>>
>> On Mon, Dec 13, 2010 at 4:16 PM, Nathaniel Cook <[email protected]> wrote:
>>> The hostnames are resolving fine. I can ping bk1-4 from ds1-4 and vice versa.
>>>
>>> On Mon, Dec 13, 2010 at 5:11 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>> It sounds like your master cluster resolves bk1-4 as ds1-4. Could you check that by doing a ping on those hostnames from those machines? Otherwise I can't see what the error could be at the moment...
>>>>
>>>> J-D
>>>>
>>>> On Mon, Dec 13, 2010 at 3:55 PM, Nathaniel Cook <[email protected]> wrote:
>>>>> Running the 'ls /hbase/rs' command through zkcli on the master I get:
>>>>>
>>>>> [ds2.internal,60020,1292278767510, ds3.internal,60020,1292278776930, ds1.internal,60020,1292278759087, ds4.internal,60020,1292278792724]
>>>>>
>>>>> On my slave cluster I get:
>>>>>
>>>>> [bk1.internal,60020,1292278881467, bk3.internal,60020,1292278895189, bk2.internal,60020,1292278888034, bk4.internal,60020,1292278905096]
>>>>>
>>>>> But as I mentioned, the peer it chooses is ds4, from the master cluster.
>>>>>
>>>>> Could it be that for some reason the Configuration passed to ZooKeeperWrapper.createInstance for the slave cluster isn't honored and it defaults to the local connection settings? I am running a QuorumPeer on the same machine as the RegionServers for these test clusters. Could it be finding the zoo.cfg file on that machine that points to the local quorum?
>>>>>
>>>>> To test this I wrote a quick jruby script:
>>>>>
>>>>> #------------------------------------------------------
>>>>> include Java
>>>>> import org.apache.hadoop.hbase.HBaseConfiguration
>>>>> import org.apache.hadoop.hbase.HConstants
>>>>> import org.apache.hadoop.conf.Configuration
>>>>> import org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper
>>>>>
>>>>> # ARGV[0] is of the form quorum:clientPort:znodeParent, e.g. bk1,bk2,bk3:2181:/hbase
>>>>> parts1 = ARGV[0].split(":")
>>>>>
>>>>> # Build a Configuration that explicitly points at the given quorum, port, and parent znode
>>>>> c1 = HBaseConfiguration.create()
>>>>> c1.set(HConstants::ZOOKEEPER_QUORUM, parts1[0])
>>>>> c1.set("hbase.zookeeper.property.clientPort", parts1[1])
>>>>> c1.set(HConstants::ZOOKEEPER_ZNODE_PARENT, parts1[2])
>>>>>
>>>>> # Connect with that Configuration and write a throwaway child znode named 'test'
>>>>> zkw = ZooKeeperWrapper.createInstance(c1, "ZK")
>>>>> zkw.writeZNode(parts1[2], "test", "")
>>>>> #------------------------------------------------------------
>>>>>
>>>>> I ran it from the master cluster and gave it the address of the slave quorum with this command:
>>>>>
>>>>> hbase org.jruby.Main testZK.rb bk1,bk2,bk3:2181:/hbase
>>>>>
>>>>> The slave ZK quorum didn't have the '/hbase/test' node but the master ZK quorum did, so the script didn't honor the specified configuration. Any thoughts?
>>>>>
>>>>> On Mon, Dec 13, 2010 at 4:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>> Interesting... the fact that it says it's connecting to bk1,bk2,bk3 means that it's looking at the right zookeeper ensemble. What it does next is read all the znodes in /hbase/rs/ (which is the list of live region servers) and choose a subset of them.
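A variant of the check above that bypasses HBaseConfiguration entirely (and therefore anything a stray zoo.cfg on the classpath could inject) is to list the /hbase/rs znodes with the plain ZooKeeper client against an explicit connect string. A rough jruby sketch; the connect string, session timeout, and the no-op watcher class are placeholder choices, not something from the thread:

#------------------------------------------------------
include Java
import org.apache.zookeeper.ZooKeeper

# No-op watcher so the client can be constructed without pulling in any HBase classes.
class NullWatcher
  include org.apache.zookeeper.Watcher
  def process(event); end
end

# Explicit connect string for the slave ensemble (placeholder hosts/port).
quorum = ARGV[0] || "bk1:2181,bk2:2181,bk3:2181"

# Raw ZooKeeper client: nothing here reads zoo.cfg or hbase-site.xml.
zk = ZooKeeper.new(quorum, 30000, NullWatcher.new)

# These are the same znodes the master inspects when it picks replication sinks.
zk.getChildren("/hbase/rs", false).each { |rs| puts rs }
zk.close
#------------------------------------------------------

It can be run the same way as testZK.rb above (the listRS.rb filename is just a placeholder): hbase org.jruby.Main listRS.rb bk1:2181,bk2:2181,bk3:2181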
>>>>>> Using the zkcli utility, could you check the value of those znodes and see if they make sense? You can run it like this:
>>>>>>
>>>>>> bin/hbase zkcli
>>>>>>
>>>>>> It will be run against the ensemble that that cluster is using.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Dec 13, 2010 at 2:03 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>> When the master cluster chooses a peer, it is supposed to choose one from the slave cluster, correct?
>>>>>>>
>>>>>>> This is what I am seeing in the master cluster logs:
>>>>>>>
>>>>>>> Added new peer cluster bk1,bk2,bk3,2181,/hbase
>>>>>>> Getting 1 rs from peer cluster # test
>>>>>>> Choosing peer 192.168.1.170:60020
>>>>>>>
>>>>>>> But 192.168.1.170 is an address in the master cluster. I think this may be related to the problem I had while running the add_peer.rb script: when I ran that script it would only talk to the ZK quorum running on that machine and would not talk to the slave ZK quorum. Could it be that when it is trying to choose a peer, instead of going to the slave ZK quorum running on a different machine, it is talking only to the ZK quorum running on its localhost?
>>>>>>>
>>>>>>> On Mon, Dec 13, 2010 at 2:51 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>> Thanks for looking into this with me.
>>>>>>>>
>>>>>>>> OK, so on the master region servers I am getting the two statements 'Replicating x' and 'Replicated in total: y'.
>>>>>>>>
>>>>>>>> Nothing on the slave cluster.
>>>>>>>>
>>>>>>>> On Mon, Dec 13, 2010 at 12:28 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>> Hi Nathaniel,
>>>>>>>>>
>>>>>>>>> Thanks for trying out replication, let's make it work for you.
>>>>>>>>>
>>>>>>>>> So on the master side there are two log lines that are important for making sure replication works. First it has to say:
>>>>>>>>>
>>>>>>>>> Replicating x
>>>>>>>>>
>>>>>>>>> where x is the number of edits it's going to ship, and then:
>>>>>>>>>
>>>>>>>>> Replicated in total: y
>>>>>>>>>
>>>>>>>>> where y is the total number it replicated. Seeing the second line means that replication was successful, at least from the master's point of view.
>>>>>>>>>
>>>>>>>>> On the slave, one node should have:
>>>>>>>>>
>>>>>>>>> Total replicated: z
>>>>>>>>>
>>>>>>>>> where z is the number of edits that region server applied on its cluster. It could be on any region server, since the sink for replication is chosen at random.
>>>>>>>>>
>>>>>>>>> Do you see those? Any exceptions around those logs apart from EOFs?
>>>>>>>>>
>>>>>>>>> Thx,
>>>>>>>>>
>>>>>>>>> J-D
>>>>>>>>>
>>>>>>>>> On Mon, Dec 13, 2010 at 10:52 AM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to set up replication for my HBase clusters. I have two small clusters for testing, each with 4 machines, and the setup for the two clusters is identical. Each machine runs a DataNode and an HRegionServer, three of the machines run a ZK peer, and one machine runs the HMaster and NameNode. The master cluster machines have hostnames ds1, ds2, ... and the slave cluster machines are bk1, bk2, .... I set the replication scope to 1 for my test table column families and set the hbase.replication property to true for both clusters.
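Those two steps look roughly like the following; 'test_table' and 'cf' are placeholder names, and hbase.replication=true goes into hbase-site.xml on both clusters before they are restarted:

# In the hbase shell (the table must be disabled before it can be altered):
disable 'test_table'
alter 'test_table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'test_table'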
>>>>>>>>>> Next I ran the add_peer.rb script with the following command on the ds1 machine:
>>>>>>>>>>
>>>>>>>>>> hbase org.jruby.Main /usr/lib/hbase/bin/replication/add_peer.rb ds1:2181:/hbase bk1:2181:/hbase
>>>>>>>>>>
>>>>>>>>>> After the script finished, ZK for the master cluster had the replication znode with children peers, master, and state. The slave ZK didn't have a replication znode. I fixed that problem by rerunning the script on the bk1 machine with the code that writes to the master ZK commented out. Now the slave ZK has the /hbase/replication/master znode with data (ds1:2181:/hbase). Everything looked to be configured correctly, so I restarted the clusters. The logs of the master regionservers stated:
>>>>>>>>>>
>>>>>>>>>> This cluster (ds1:2181:/hbase) is a master for replication, compared with (ds1:2181:/hbase)
>>>>>>>>>>
>>>>>>>>>> The logs on the slave cluster stated:
>>>>>>>>>>
>>>>>>>>>> This cluster (bk1:2181:/hbase) is a slave for replication, compared with (ds1:2181:/hbase)
>>>>>>>>>>
>>>>>>>>>> Using the hbase shell I put a row into the test table. The regionserver for that table had a log statement like:
>>>>>>>>>>
>>>>>>>>>> Going to report log #192.168.1.166%3A60020.1291757445179 for position 15828 in hdfs://ds1:9000/hbase/.logs/ds1.internal,60020,1291757445059/192.168.1.166%3A60020.1291757445179
>>>>>>>>>>
>>>>>>>>>> (192.168.1.166 is ds1)
>>>>>>>>>>
>>>>>>>>>> I waited, and even after several minutes the row still did not appear in the slave cluster's table.
>>>>>>>>>>
>>>>>>>>>> Any help with what the problem might be is greatly appreciated.
>>>>>>>>>>
>>>>>>>>>> Both clusters are running CDH3b3; the HBase version is exactly 0.89.20100924+28.
>>>>>>>>>>
>>>>>>>>>> -Nathaniel Cook

--
-Nathaniel Cook
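For anyone retracing this setup, the zkcli session mentioned earlier in the thread is also a quick way to compare what each cluster's ensemble actually holds once add_peer.rb has run. A rough sequence, run once against each cluster, using only the znode names described above (output omitted):

# Opens an interactive ZooKeeper shell against the ensemble this cluster is configured for
bin/hbase zkcli

# Then, inside that shell:
ls /hbase/rs
ls /hbase/replication
ls /hbase/replication/peers
get /hbase/replication/master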
