Sounds good. I'll do some digging around.

On Tue, Dec 14, 2010 at 11:31 AM, Jean-Daniel Cryans <[email protected]> wrote:
> Good!
>
> I'm not sure why it's not working for you with two ensembles... here it works between two clusters that are in two different datacenters using different ZK ensembles. You could try inserting debug statements in the code and see where the mix-up happens.
>
> Thx,
>
> J-D
>
> On Tue, Dec 14, 2010 at 10:26 AM, Nathaniel Cook <[email protected]> wrote:
>> So, I got it working :)
>>
>> Because of these strange connection/configuration issues I decided to just serve both clusters from one ZK quorum. I set the zookeeper.znode.parent to hbase_bk, set up the replication again, and it is all working. It is even keeping up with some initial load testing. Thanks.
>>
>> I think we should still look into why it couldn't talk to two different ZK quorums, but this works for now.
>>
>> On Mon, Dec 13, 2010 at 5:38 PM, Nathaniel Cook <[email protected]> wrote:
>>> Yes, the correct IP address.
>>>
>>> On Mon, Dec 13, 2010 at 5:24 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>> Just to be clear, does ping show the right IP address too? That's the real concern here.
>>>>
>>>> Thx
>>>>
>>>> J-D
>>>>
>>>> On Mon, Dec 13, 2010 at 4:16 PM, Nathaniel Cook <[email protected]> wrote:
>>>>> The hostnames are resolving fine. I can ping bk1-4 from ds1-4 and vice versa.
>>>>>
>>>>> On Mon, Dec 13, 2010 at 5:11 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>> It sounds like your master cluster resolves bk1-4 as ds1-4. Could you check that by doing a ping on those hostnames from those machines? Else... I can't see what could be the error at the moment...
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Mon, Dec 13, 2010 at 3:55 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>> Running the 'ls /hbase/rs' command through zkcli on the master I get:
>>>>>>>
>>>>>>> [ds2.internal,60020,1292278767510, ds3.internal,60020,1292278776930, ds1.internal,60020,1292278759087, ds4.internal,60020,1292278792724]
>>>>>>>
>>>>>>> On my slave cluster I get:
>>>>>>>
>>>>>>> [bk1.internal,60020,1292278881467, bk3.internal,60020,1292278895189, bk2.internal,60020,1292278888034, bk4.internal,60020,1292278905096]
>>>>>>>
>>>>>>> But as I mentioned, the peer it chooses is ds4 from the master cluster.
>>>>>>>
>>>>>>> Could it be that for some reason the Configuration passed to ZooKeeperWrapper.createInstance for the slave cluster isn't honored and is defaulting to the local connection settings? I am running a QuorumPeer on the same machine as the RegionServers for these test clusters. Could it be finding the zoo.cfg file on that machine that points to the local quorum?
>>>>>>>
>>>>>>> To test this I wrote a quick JRuby script...
>>>>>>> #------------------------------------------------------
>>>>>>> include Java
>>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration
>>>>>>> import org.apache.hadoop.hbase.HConstants
>>>>>>> import org.apache.hadoop.conf.Configuration
>>>>>>> import org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper
>>>>>>>
>>>>>>> parts1 = ARGV[0].split(":")
>>>>>>>
>>>>>>> c1 = HBaseConfiguration.create()
>>>>>>> c1.set(HConstants::ZOOKEEPER_QUORUM, parts1[0])
>>>>>>> c1.set("hbase.zookeeper.property.clientPort", parts1[1])
>>>>>>> c1.set(HConstants::ZOOKEEPER_ZNODE_PARENT, parts1[2])
>>>>>>>
>>>>>>> zkw = ZooKeeperWrapper.createInstance(c1, "ZK")
>>>>>>>
>>>>>>> zkw.writeZNode(parts1[2], "test", "")
>>>>>>> #------------------------------------------------------------
>>>>>>>
>>>>>>> I ran it from the master cluster and gave it the address of the slave quorum with this command:
>>>>>>>
>>>>>>> hbase org.jruby.Main testZK.rb bk1,bk2,bk3:2181:/hbase
>>>>>>>
>>>>>>> The slave ZK quorum didn't have the '/hbase/test' node but the master ZK quorum did. The script didn't honor the specified configuration. Any thoughts?
>>>>>>>
>>>>>>> On Mon, Dec 13, 2010 at 4:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>> Interesting... the fact that it says that it's connecting to bk1,bk2,bk3 means that it's looking at the right zookeeper ensemble. What it does next is read all the znodes in /hbase/rs/ (which is the list of live region servers) and choose a subset of them.
>>>>>>>>
>>>>>>>> Using the zkcli utility, could you check the value of those znodes and see if it makes sense? You can run it like this:
>>>>>>>>
>>>>>>>> bin/hbase zkcli
>>>>>>>>
>>>>>>>> And it will be run against the ensemble that that cluster is using.
>>>>>>>>
>>>>>>>> J-D
>>>>>>>>
>>>>>>>> On Mon, Dec 13, 2010 at 2:03 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>>> When the master cluster chooses a peer, it is supposed to choose a peer from the slave cluster, correct?
>>>>>>>>>
>>>>>>>>> This is what I am seeing in the master cluster logs:
>>>>>>>>>
>>>>>>>>> Added new peer cluster bk1,bk2,bk3,2181,/hbase
>>>>>>>>> Getting 1 rs from peer cluster # test
>>>>>>>>> Choosing peer 192.168.1.170:60020
>>>>>>>>>
>>>>>>>>> But 192.168.1.170 is an address in the master cluster. I think this may be related to the problem I had while running the add_peer.rb script. When I ran that script it would only talk to the ZK quorum running on that machine and would not talk to the slave ZK quorum. Could it be that when it is trying to choose a peer, instead of going to the slave ZK quorum running on a different machine, it is talking only to the ZK quorum running on its localhost?
>>>>>>>>>
>>>>>>>>> On Mon, Dec 13, 2010 at 2:51 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>>>> Thanks for looking into this with me.
>>>>>>>>>>
>>>>>>>>>> OK, so on the master region servers I am getting the two statements 'Replicating x' and 'Replicated in total: y'.
>>>>>>>>>>
>>>>>>>>>> Nothing on the slave cluster.
>>>>>>>>>>
>>>>>>>>>> On Mon, Dec 13, 2010 at 12:28 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>>> Hi Nathaniel,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for trying out replication, let's make it work for you.
>>>>>>>>>>>
>>>>>>>>>>> So on the master side there are two lines that are important for making sure that replication works. First it has to say:
>>>>>>>>>>>
>>>>>>>>>>> Replicating x
>>>>>>>>>>>
>>>>>>>>>>> Where x is the number of edits it's going to ship, and then:
>>>>>>>>>>>
>>>>>>>>>>> Replicated in total: y
>>>>>>>>>>>
>>>>>>>>>>> Where y is the total number it replicated. Seeing the second line means that replication was successful, at least from the master's point of view.
>>>>>>>>>>>
>>>>>>>>>>> On the slave, one node should have:
>>>>>>>>>>>
>>>>>>>>>>> Total replicated: z
>>>>>>>>>>>
>>>>>>>>>>> And that z is the number of edits that that region server applied on its cluster. It could be on any region server, since the sink for replication is chosen at random.
>>>>>>>>>>>
>>>>>>>>>>> Do you see those? Any exceptions around those logs apart from EOFs?
>>>>>>>>>>>
>>>>>>>>>>> Thx,
>>>>>>>>>>>
>>>>>>>>>>> J-D
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Dec 13, 2010 at 10:52 AM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to set up replication for my HBase clusters. I have two small clusters for testing, each with 4 machines, and the setup for the two clusters is identical. Each machine runs a DataNode and an HRegionServer. Three of the machines run a ZK peer, and one machine runs the HMaster and NameNode. The master cluster machines have hostnames (ds1, ds2, ...) and the slave cluster's are (bk1, bk2, ...). I set the replication scope to 1 for my test table's column families and set the hbase.replication property to true for both clusters. Next I ran the add_peer.rb script with the following command on the ds1 machine:
>>>>>>>>>>>>
>>>>>>>>>>>> hbase org.jruby.Main /usr/lib/hbase/bin/replication/add_peer.rb ds1:2181:/hbase bk1:2181:/hbase
>>>>>>>>>>>>
>>>>>>>>>>>> After the script finishes, ZK for the master cluster has the replication znode with children peers, master, and state. The slave ZK didn't have a replication znode. I fixed that problem by rerunning the script on the bk1 machine and commenting out the code that writes to the master ZK. Now the slave ZK has the /hbase/replication/master znode with data (ds1:2181:/hbase). Everything looked to be configured correctly. I restarted the clusters. The logs of the master region servers stated:
>>>>>>>>>>>>
>>>>>>>>>>>> This cluster (ds1:2181:/hbase) is a master for replication, compared with (ds1:2181:/hbase)
>>>>>>>>>>>>
>>>>>>>>>>>> The logs on the slave cluster stated:
>>>>>>>>>>>>
>>>>>>>>>>>> This cluster (bk1:2181:/hbase) is a slave for replication, compared with (ds1:2181:/hbase)
>>>>>>>>>>>>
>>>>>>>>>>>> Using the hbase shell I put a row into the test table.
>>>>>>>>>>>>
>>>>>>>>>>>> The region server for that table had a log statement like:
>>>>>>>>>>>>
>>>>>>>>>>>> Going to report log #192.168.1.166%3A60020.1291757445179 for position 15828 in hdfs://ds1:9000/hbase/.logs/ds1.internal,60020,1291757445059/192.168.1.166%3A60020.1291757445179
>>>>>>>>>>>>
>>>>>>>>>>>> (192.168.1.166 is ds1)
>>>>>>>>>>>>
>>>>>>>>>>>> I waited, and even after several minutes the row still does not appear in the slave cluster's table.
>>>>>>>>>>>>
>>>>>>>>>>>> Any help with what the problem might be is greatly appreciated.
>>>>>>>>>>>>
>>>>>>>>>>>> Both clusters are using CDH3b3. The HBase version is exactly 0.89.20100924+28.
>>>>>>>>>>>>
>>>>>>>>>>>> -Nathaniel Cook
--
-Nathaniel Cook
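
The question that stayed open above is why the master cluster could not talk to a second ZK ensemble. One way to narrow that down is to bypass HBase's configuration handling entirely and point a raw ZooKeeper client at the slave connect string from one of the master-cluster machines. The sketch below is only an illustration under the thread's assumptions (slave quorum on bk1-bk3, client port 2181, default /hbase parent); the file name and the NullWatcher helper are made up for the example. If this lists the bk region servers while ZooKeeperWrapper still ends up on a ds host, the mix-up is in how the wrapper builds its connection (for instance by picking up the local zoo.cfg), not in the ensemble itself.

#------------------------------------------------------
# checkSlaveQuorum.rb (hypothetical name)
# Run from a master-cluster machine with: hbase org.jruby.Main checkSlaveQuorum.rb
include Java
import org.apache.zookeeper.ZooKeeper
import org.apache.zookeeper.Watcher

# Minimal watcher so the client has somewhere to deliver connection events.
class NullWatcher
  include Watcher
  def process(event)
  end
end

# Slave ensemble as given on the command line in the thread (assumed hostnames/port).
connect = "bk1:2181,bk2:2181,bk3:2181"

zk = ZooKeeper.new(connect, 30000, NullWatcher.new)
sleep 2  # crude wait for the session to establish; good enough for a manual check

# /hbase/rs holds one ephemeral znode per live region server.
puts "Region servers registered at #{connect}:"
zk.getChildren("/hbase/rs", false).each { |rs| puts "  #{rs}" }
zk.close
#------------------------------------------------------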

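Similarly, the single-quorum workaround Nathaniel ended up with (both clusters on one ensemble, separated only by zookeeper.znode.parent, with the backup cluster under hbase_bk) can be sanity-checked from any machine by listing what is registered under each parent on the shared quorum. Again this is just a sketch: it assumes the shared ensemble is the one on the ds hosts and that the backup parent is /hbase_bk, and NullWatcher is the same throwaway helper as above. If the workaround is wired up correctly, the ds region servers should appear under /hbase/rs and the bk ones under /hbase_bk/rs.

#------------------------------------------------------
include Java
import org.apache.zookeeper.ZooKeeper
import org.apache.zookeeper.Watcher

class NullWatcher
  include Watcher
  def process(event)
  end
end

# The single shared ensemble (assumed to be the ds quorum).
zk = ZooKeeper.new("ds1:2181,ds2:2181,ds3:2181", 30000, NullWatcher.new)
sleep 2  # crude wait for the session to establish

# Each cluster keeps its own znode parent on the shared quorum.
["/hbase/rs", "/hbase_bk/rs"].each do |parent|
  puts "#{parent}:"
  zk.getChildren(parent, false).each { |rs| puts "  #{rs}" }
end
zk.close
#------------------------------------------------------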