Good! I'm not sure why it's not working for you with two ensembles... here it works between two clusters that are in two different datacenters using different ZK ensembles. You could try inserting debug statements in the code and see where the mix-up happens.

Thx,

J-D
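For reference, the single-ensemble workaround described in the quoted message just below comes down to a small hbase-site.xml change on the backup cluster. This is only a sketch: the quorum hosts are assumed, and the parent znode name is taken from the thread.

<!-- hbase-site.xml on the backup cluster (sketch of the workaround below) -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <!-- assumed: the hosts of the one shared ensemble -->
  <value>ds1,ds2,ds3</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <!-- the message below says "hbase_bk"; leading slash assumed -->
  <value>/hbase_bk</value>
</property>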
On Tue, Dec 14, 2010 at 10:26 AM, Nathaniel Cook <[email protected]> wrote:
> So, I got it working :)
>
> Because of these strange connection/configuration issues I decided to just serve both clusters from one ZK quorum. I set the zookeeper.znode.parent to hbase_bk and then set up the replication again, and it is all working. It is even keeping up with some initial load testing. Thanks.
>
> I think we should still look into why it couldn't talk to two different ZK quorums, but this works for now.
>
> On Mon, Dec 13, 2010 at 5:38 PM, Nathaniel Cook <[email protected]> wrote:
>> Yes, correct IP address.
>>
>> On Mon, Dec 13, 2010 at 5:24 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>> Just to be clear, does ping show the right IP address too? That's the real concern here.
>>>
>>> Thx
>>>
>>> J-D
>>>
>>> On Mon, Dec 13, 2010 at 4:16 PM, Nathaniel Cook <[email protected]> wrote:
>>>> The hostnames are resolving fine. I can ping bk1-4 from ds1-4 and vice versa.
>>>>
>>>> On Mon, Dec 13, 2010 at 5:11 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>> It sounds like your master cluster resolves bk1-4 as ds1-4. Could you check that by doing a ping on those hostnames from those machines? Else... I can't see what could be the error at the moment...
>>>>>
>>>>> J-D
>>>>>
>>>>> On Mon, Dec 13, 2010 at 3:55 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>> Running the 'ls /hbase/rs' cmd through zkcli on the master I get:
>>>>>>
>>>>>> [ds2.internal,60020,1292278767510, ds3.internal,60020,1292278776930, ds1.internal,60020,1292278759087, ds4.internal,60020,1292278792724]
>>>>>>
>>>>>> On my slave cluster I get:
>>>>>>
>>>>>> [bk1.internal,60020,1292278881467, bk3.internal,60020,1292278895189, bk2.internal,60020,1292278888034, bk4.internal,60020,1292278905096]
>>>>>>
>>>>>> But as I mentioned, the peer it chooses is ds4 from the master cluster.
>>>>>>
>>>>>> Could it be that for some reason the Configuration passed to ZooKeeperWrapper.createInstance for the slave cluster isn't honored and is defaulting to the local connection settings? I am running a QuorumPeer on the same machine as the RegionServers for these test clusters. Could it be finding the zoo.cfg file on that machine that points to the local quorum?
>>>>>>
>>>>>> To test this I wrote a quick jruby script:
>>>>>>
>>>>>> #------------------------------------------------------
>>>>>> # Usage: hbase org.jruby.Main testZK.rb <quorum>:<clientPort>:<znodeParent>
>>>>>> include Java
>>>>>> import org.apache.hadoop.hbase.HBaseConfiguration
>>>>>> import org.apache.hadoop.hbase.HConstants
>>>>>> import org.apache.hadoop.conf.Configuration
>>>>>> import org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper
>>>>>>
>>>>>> parts1 = ARGV[0].split(":")
>>>>>>
>>>>>> # Point a fresh Configuration explicitly at the given ensemble
>>>>>> c1 = HBaseConfiguration.create()
>>>>>> c1.set(HConstants::ZOOKEEPER_QUORUM, parts1[0])
>>>>>> c1.set("hbase.zookeeper.property.clientPort", parts1[1])
>>>>>> c1.set(HConstants::ZOOKEEPER_ZNODE_PARENT, parts1[2])
>>>>>>
>>>>>> # Connect and write a 'test' child under the given parent znode
>>>>>> zkw = ZooKeeperWrapper.createInstance(c1, "ZK")
>>>>>> zkw.writeZNode(parts1[2], "test", "")
>>>>>> #------------------------------------------------------------
>>>>>>
>>>>>> I ran it from the master cluster and gave it the address of the slave quorum with this command:
>>>>>>
>>>>>> hbase org.jruby.Main testZK.rb bk1,bk2,bk3:2181:/hbase
>>>>>>
>>>>>> The slave ZK quorum didn't have the '/hbase/test' node but the master ZK quorum did. The script didn't honor the specified configuration.
>>>>>>
>>>>>> Any thoughts?
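One way to confirm which ensemble actually received that write is to point ZooKeeper's own CLI at each quorum directly. This is only a sketch: the hostnames come from the thread, and the location of zkCli.sh depends on the install.

zkCli.sh -server bk1:2181   # slave ensemble: at the prompt run 'ls /hbase'; the 'test' child is missing
zkCli.sh -server ds1:2181   # master ensemble: 'ls /hbase' lists 'test', i.e. the write went to the local quorum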
>>>>>> On Mon, Dec 13, 2010 at 4:04 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>> Interesting... the fact that it says it's connecting to bk1,bk2,bk3 means that it's looking at the right zookeeper ensemble. What it does next is read all the znodes in /hbase/rs/ (which is the list of live region servers) and choose a subset of them.
>>>>>>>
>>>>>>> Using the zkcli utility, could you check the value of those znodes and see if they make sense? You can run it like this:
>>>>>>>
>>>>>>> bin/hbase zkcli
>>>>>>>
>>>>>>> And it will be run against the ensemble that that cluster is using.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Mon, Dec 13, 2010 at 2:03 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>> When the master cluster chooses a peer, it is supposed to choose a peer from the slave cluster, correct?
>>>>>>>>
>>>>>>>> This is what I am seeing in the master cluster logs:
>>>>>>>>
>>>>>>>> Added new peer cluster bk1,bk2,bk3,2181,/hbase
>>>>>>>> Getting 1 rs from peer cluster # test
>>>>>>>> Choosing peer 192.168.1.170:60020
>>>>>>>>
>>>>>>>> But 192.168.1.170 is an address in the master cluster. I think this may be related to the problem I had while running the add_peer.rb script. When I ran that script it would only talk to the ZK quorum running on that machine and would not talk to the slave ZK quorum. Could it be that when it is trying to choose a peer, instead of going to the slave ZK quorum running on a different machine, it is talking only to the ZK quorum running on its localhost?
>>>>>>>>
>>>>>>>> On Mon, Dec 13, 2010 at 2:51 PM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>>> Thanks for looking into this with me.
>>>>>>>>>
>>>>>>>>> OK, so on the master region servers I am getting the two statements 'Replicating x' and 'Replicated in total: y'.
>>>>>>>>>
>>>>>>>>> Nothing on the slave cluster.
>>>>>>>>>
>>>>>>>>> On Mon, Dec 13, 2010 at 12:28 PM, Jean-Daniel Cryans <[email protected]> wrote:
>>>>>>>>>> Hi Nathaniel,
>>>>>>>>>>
>>>>>>>>>> Thanks for trying out replication, let's make it work for you.
>>>>>>>>>>
>>>>>>>>>> So on the master side there are two lines that are important to make sure that replication works. First it has to say:
>>>>>>>>>>
>>>>>>>>>> Replicating x
>>>>>>>>>>
>>>>>>>>>> where x is the number of edits it's going to ship, and then:
>>>>>>>>>>
>>>>>>>>>> Replicated in total: y
>>>>>>>>>>
>>>>>>>>>> where y is the total number it replicated. Seeing the second line means that replication was successful, at least from the master's point of view.
>>>>>>>>>>
>>>>>>>>>> On the slave, one node should have:
>>>>>>>>>>
>>>>>>>>>> Total replicated: z
>>>>>>>>>>
>>>>>>>>>> where z is the number of edits that that region server applied on its cluster. It could be on any region server, since the sink for replication is chosen at random.
>>>>>>>>>>
>>>>>>>>>> Do you see those? Any exceptions around those logs apart from EOFs?
>>>>>>>>>>
>>>>>>>>>> Thx,
>>>>>>>>>>
>>>>>>>>>> J-D
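A quick way to look for those three lines is to grep the region server logs on each side. This is only a sketch; the log directory is assumed and depends on the install.

# on each master-cluster region server (log path assumed)
grep -E "Replicating [0-9]+|Replicated in total" /var/log/hbase/*regionserver*.log
# on each slave-cluster region server (log path assumed)
grep "Total replicated" /var/log/hbase/*regionserver*.log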
>>>>>>>>>> On Mon, Dec 13, 2010 at 10:52 AM, Nathaniel Cook <[email protected]> wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am trying to set up replication for my HBase clusters. I have two small clusters for testing, each with 4 machines. The setup for the two clusters is identical. Each machine runs a DataNode and an HRegionServer. Three of the machines run a ZK peer, and one machine runs the HMaster and NameNode. The master cluster machines have hostnames (ds1, ds2, ...) and the slave cluster machines are (bk1, bk2, ...). I set the replication scope to 1 for my test table column families and set the hbase.replication property to true for both clusters. Next I ran the add_peer.rb script with the following command on the ds1 machine:
>>>>>>>>>>>
>>>>>>>>>>> hbase org.jruby.Main /usr/lib/hbase/bin/replication/add_peer.rb ds1:2181:/hbase bk1:2181:/hbase
>>>>>>>>>>>
>>>>>>>>>>> After the script finishes, ZK for the master cluster has the replication znode with children peers, master, and state. The slave ZK didn't have a replication znode. I fixed that problem by rerunning the script on the bk1 machine and commenting out the code that writes to the master ZK. Now the slave ZK has the /hbase/replication/master znode with data (ds1:2181:/hbase). Everything looked to be configured correctly. I restarted the clusters. The logs of the master regionservers stated:
>>>>>>>>>>>
>>>>>>>>>>> This cluster (ds1:2181:/hbase) is a master for replication, compared with (ds1:2181:/hbase)
>>>>>>>>>>>
>>>>>>>>>>> The logs on the slave cluster stated:
>>>>>>>>>>>
>>>>>>>>>>> This cluster (bk1:2181:/hbase) is a slave for replication, compared with (ds1:2181:/hbase)
>>>>>>>>>>>
>>>>>>>>>>> Using the hbase shell I put a row into the test table.
>>>>>>>>>>>
>>>>>>>>>>> The regionserver for that table had a log statement like:
>>>>>>>>>>>
>>>>>>>>>>> Going to report log #192.168.1.166%3A60020.1291757445179 for position 15828 in hdfs://ds1:9000/hbase/.logs/ds1.internal,60020,1291757445059/192.168.1.166%3A60020.1291757445179
>>>>>>>>>>>
>>>>>>>>>>> (192.168.1.166 is ds1)
>>>>>>>>>>>
>>>>>>>>>>> I wait, and even after several minutes the row still does not appear in the slave cluster table.
>>>>>>>>>>>
>>>>>>>>>>> Any help with what the problem might be is greatly appreciated.
>>>>>>>>>>>
>>>>>>>>>>> Both clusters are using CDH3b3. The HBase version is exactly 0.89.20100924+28.
>>>>>>>>>>>
>>>>>>>>>>> -Nathaniel Cook
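For reference, the two settings described above amount to roughly the following. This is only a sketch; the table and column family names are placeholders.

# hbase shell, on the master cluster ('test_table' and 'cf' are placeholders)
disable 'test_table'
alter 'test_table', {NAME => 'cf', REPLICATION_SCOPE => 1}
enable 'test_table'

and, in hbase-site.xml on every node of both clusters:

<property>
  <name>hbase.replication</name>
  <value>true</value>
</property>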
