Interesting... the fact that it says it's connecting to
bk1,bk2,bk3 means it's looking at the right ZooKeeper ensemble.
What it does next is read all the znodes in /hbase/rs/ (which is
the list of live region servers) and choose a subset of them.

Using the zkcli utility, could you check the value of those znodes and
see if they make sense? You can run it like this:

bin/hbase zkcli

It will run against the ensemble used by that cluster.
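
For example, a quick session might look like this (the znode names in
the listing are made up; you should see the slave cluster's actual
region servers):

bin/hbase zkcli
ls /hbase/rs
[bk1.internal,60020,1292274055123, bk2.internal,60020,1292274055456]
get /hbase/rs/bk1.internal,60020,1292274055123

If the children of /hbase/rs show ds* addresses instead of bk*, that
would explain the wrong peer choice.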

J-D

On Mon, Dec 13, 2010 at 2:03 PM, Nathaniel Cook
<[email protected]> wrote:
> When the master cluster chooses a peer it is supposed to choose a peer
> from the slave cluster correct?
>
> This is what I am seeing in the master cluster logs.
>
>
> Added new peer cluster bk1,bk2,bk3,2181,/hbase
> Getting 1 rs from peer cluster # test
> Choosing peer 192.168.1.170:60020
>
> But 192.168.1.170 is an address in the master cluster. I think this
> may be related to the problem I had while running the add_peer.rb
> script. When I ran that script it would only talk to the ZK quorum
> running on that machine and would not talk to the slave ZK quorum.
> Could it be that when it is trying to choose a peer, instead of going
> to the slave ZK quorum running on a different machine, it is talking
> only to the ZK quorum running on its localhost?
>
>
>
> On Mon, Dec 13, 2010 at 2:51 PM, Nathaniel Cook
> <[email protected]> wrote:
>> Thanks for looking into this with me.
>>
>> Ok so on the master region servers I am getting the two statements
>> 'Replicating x' and 'Replicated in total: y'
>>
>> Nothing on the slave cluster.
>>
>> On Mon, Dec 13, 2010 at 12:28 PM, Jean-Daniel Cryans
>> <[email protected]> wrote:
>>> Hi Nathaniel,
>>>
>>> Thanks for trying out replication, let's make it work for you.
>>>
>>> So on the master side there are two lines that are important for
>>> making sure replication works. First it has to say:
>>>
>>> Replicating x
>>>
>>> Where x is the number of edits it's going to ship, and then
>>>
>>> Replicated in total: y
>>>
>>> Where y is the total number it has replicated. Seeing the second line
>>> means that replication was successful, at least from the master's
>>> point of view.
>>>
>>> On the slave, one node should have:
>>>
>>> Total replicated: z
>>>
>>> And z is the number of edits that region server applied on its
>>> cluster. It could be on any region server, since the sink for
>>> replication is chosen at random.
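>>>
>>> A quick way to check is to grep the region server logs on both sides
>>> (the log path here is just an example, adjust it for your install).
>>> On the master-side region servers:
>>>
>>> grep -E 'Replicating|Replicated in total' /var/log/hbase/*regionserver*.log
>>>
>>> And on each slave-side region server:
>>>
>>> grep 'Total replicated' /var/log/hbase/*regionserver*.log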
>>>
>>> Do you see those? Any exceptions around those logs apart from EOFs?
>>>
>>> Thx,
>>>
>>> J-D
>>>
>>> On Mon, Dec 13, 2010 at 10:52 AM, Nathaniel Cook
>>> <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I am trying to set up replication for my HBase clusters. I have two
>>>> small clusters for testing, each with 4 machines. The setup for the
>>>> two clusters is identical. Each machine runs a DataNode and an
>>>> HRegionServer. Three of the machines run a ZK peer and one machine
>>>> runs the HMaster and NameNode. The master cluster machines have
>>>> hostnames (ds1, ds2, ...) and the slave cluster machines are
>>>> (bk1, bk2, ...). I set the replication scope to 1 for my test table
>>>> column families and set the hbase.replication property to true for
>>>> both clusters (a sketch of both steps follows the command below).
>>>> Next I ran the add_peer.rb script with the following command on the
>>>> ds1 machine:
>>>>
>>>> hbase org.jruby.Main /usr/lib/hbase/bin/replication/add_peer.rb
>>>> ds1:2181:/hbase bk1:2181:/hbase
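>>>>
>>>> For reference, here is a sketch of the two configuration steps
>>>> mentioned above (the family name 'cf' stands in for my actual
>>>> column families):
>>>>
>>>> <!-- hbase-site.xml, on every node of both clusters -->
>>>> <property>
>>>>   <name>hbase.replication</name>
>>>>   <value>true</value>
>>>> </property>
>>>>
>>>> # in the hbase shell, for each family to replicate
>>>> disable 'test'
>>>> alter 'test', {NAME => 'cf', REPLICATION_SCOPE => 1}
>>>> enable 'test'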
>>>>
>>>> After the script finishes ZK for the master cluster has the
>>>> replication znode and children of peers, master, and state. The slave
>>>> ZK didn't have a replication znode. I fixed that problem by rerunning
>>>> the script on the bk1 machine and commenting out the code that writes
>>>> to the master ZK. Now the slave ZK has the /hbase/replication/master
>>>> znode with data (ds1:2181:/hbase). Everything looked to be configured
>>>> correctly. I restarted the clusters. The logs of the master
>>>> regionservers stated:
>>>>
>>>> This cluster (ds1:2181:/hbase) is a master for replication, compared
>>>> with (ds1:2181:/hbase)
>>>>
>>>> The logs on the slave cluster stated:
>>>>
>>>> This cluster (bk1:2181:/hbase) is a slave for replication, compared
>>>> with (ds1:2181:/hbase)
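>>>>
>>>> (To double-check the znode layout described above, I ran the ZK
>>>> shell against the master cluster's ensemble:)
>>>>
>>>> bin/hbase zkcli
>>>> ls /hbase/replication
>>>> [peers, master, state]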
>>>>
>>>> Using the hbase shell I put a row into the test table.
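>>>>
>>>> Something like this (the column family name is illustrative):
>>>>
>>>> put 'test', 'row1', 'cf:a', 'value1'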
>>>>
>>>> The regionserver for that table had a log statement like:
>>>>
>>>> Going to report log #192.168.1.166%3A60020.1291757445179 for position
>>>> 15828 in
>>>> hdfs://ds1:9000/hbase/.logs/ds1.internal,60020,1291757445059/192.168.1.166%3A60020.1291757445179
>>>>
>>>> (192.168.1.166 is ds1)
>>>>
>>>> I waited, and even after several minutes the row still had not
>>>> appeared in the slave cluster's table.
>>>>
>>>> Any help with what the problem might be is greatly appreciated.
>>>>
>>>> Both clusters are using CDH3b3. The HBase version is exactly
>>>> 0.89.20100924+28.
>>>>
>>>> -Nathaniel Cook
>>>>
>>>
>>
>>
>>
>> --
>> -Nathaniel Cook
>>
>
>
>
> --
> -Nathaniel Cook
>
