As a follow-up to this saga: HBase seems to be healthy at this time, modulo the
WARN below, which I have not figured out how to ameliorate. I believe that some
of the issues with HDFS corruption were caused by the large write buffer I was
using in the MapReduce job (32,000 Puts would be buffered before a commit). I
had tried many write buffer values on smaller jobs and had determined 32,000 to
be optimal. However, when I scaled up the MapReduce job, the 32K write buffer
was just way too high. I scaled it way down to 100, and I no longer get any
errors or HDFS corruption.
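For reference, a minimal sketch of the buffering pattern described above, assuming
the HBase 0.20.x client API (the table, family, and qualifier names are
placeholders); the flush threshold is the value that went from 32,000 down to 100:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPutSketch {
        private static final int FLUSH_EVERY = 100;   // was 32,000

        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            HTable table = new HTable(conf, "mytable");   // "mytable" is a placeholder
            table.setAutoFlush(false);                    // buffer Puts on the client

            int pending = 0;
            for (int i = 0; i < 1000000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
                table.put(put);
                if (++pending >= FLUSH_EVERY) {           // commit the buffered batch
                    table.flushCommits();
                    pending = 0;
                }
            }
            table.flushCommits();                         // flush whatever is left
        }
    }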
Finally, sometimes one of my two region servers seems to disappear (running
'status' in the HBase shell shows only one region server). However, when I
restart HBase, the dead region server comes back.
Thanks for the advice and pointers.
-geoff
-----Original Message-----
From: Geoff Hendrey
Sent: Thursday, April 15, 2010 10:26 AM
To: [email protected]
Subject: RE: Region server goes away
After making all the recommended config changes, the only issue I see is this,
in the ZooKeeper logs. It happens repeatedly. The HBase shell seems to work fine,
running on the same machine as ZooKeeper. Any ideas? I reviewed a thread on the
mailing list about this topic, but it seemed inconclusive:
2010-04-15 04:14:36,048 WARN org.apache.zookeeper.server.PrepRequestProcessor: Got exception when processing sessionid:0x128012c809c0000 type:create cxid:0x4 zxid:0xfffffffffffffffe txntype:unknown n/a
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
        at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:245)
        at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
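For context, this WARN appears to be the server-side logging of a create() on a
znode that already exists; a client that issues the create and simply catches the
resulting NodeExistsException will trigger it without anything being wrong. A
minimal sketch of that pattern, with a placeholder quorum address and znode path:

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs.Ids;
    import org.apache.zookeeper.ZooKeeper;

    public class CreateIfAbsent {
        public static void main(String[] args) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {   // placeholder quorum
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            connected.await();
            try {
                // A create() against an existing path is what the server logs as a WARN.
                zk.create("/demo", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } catch (KeeperException.NodeExistsException e) {
                // Expected when the znode is already there; safe to ignore on the client.
            }
            zk.close();
        }
    }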
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: Wednesday, April 14, 2010 8:45 PM
To: [email protected]
Cc: Paul Mahon; Bill Brune; Shaheen Bahauddin; Rohit Nigam
Subject: Re: Region server goes away
On Wed, Apr 14, 2010 at 8:27 PM, Geoff Hendrey<[email protected]> wrote:
Hi,
I have posted previously about issues I was having with HDFS when I
was running HBase and HDFS on the same box, both pseudo-clustered. Now I
have two very capable servers. I've set up HDFS with a datanode on each box.
I've set up the namenode on one box, and ZooKeeper and the HBase master
on the other box. Both boxes are region servers. I am using Hadoop
0.20.2 and HBase 0.20.3.
What do you have for replication? With two datanodes, have you set it to two
rather than the default of 3?
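The setting normally lives in hdfs-site.xml as dfs.replication. As a rough sanity
check, something like the sketch below (the path is a placeholder) prints the
replication the loaded client configuration resolves to, and lowers it on an
already-written path, since existing files keep the replication they were created
with:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // The value the loaded configuration resolves to (3 unless overridden).
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
            System.out.println("default replication = " + fs.getDefaultReplication());

            // Files written before the change keep their old replication factor;
            // "/some/existing/file" is a placeholder for a real path.
            fs.setReplication(new Path("/some/existing/file"), (short) 2);
        }
    }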
I have set dfs.datanode.socket.write.timeout to 0 in hbase-site.xml.
This is probably not necessary.
I am running a MapReduce job with about 200 concurrent reducers, each
of which writes into HBase with a 32,000-row flush buffer.
Why don't you try with just a few reducers first and then build it up?
See if that works?
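A sketch of dialing the reducer count down, assuming the org.apache.hadoop.mapreduce
API; the job name is a placeholder and the real mapper/reducer/table setup is elided:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerCountSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new HBaseConfiguration();   // also picks up hbase-site.xml
            Job job = new Job(conf, "hbase-load-test");      // job name is a placeholder

            // ... mapper, reducer, and table output setup for the real job goes here ...

            job.setNumReduceTasks(4);   // start with a handful of reducers instead of ~200
            System.out.println("reducers = " + job.getNumReduceTasks());
        }
    }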
About 40% of the way through my job, HDFS started showing that one of the
datanodes was dead (the one *not* on the same machine as the namenode).
Do you think it was actually dead -- what did a thread dump say? -- or was it just
that you couldn't get into it? Any errors in the datanode logs complaining about
the xceiver count, or perhaps you need to up the number of handlers?
I stopped HBase, and magically the datanode came back to life.
Any suggestions on how to increase the robustness?
I see errors like this in the datanode's log:
2010-04-14 12:54:58,692 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.241.6.80:50010, storageID=DS-642079670-10.241.6.80-50010-1271178858027, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.241.6.80:50010 remote=/10.241.6.80:48320]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:
I believe this is harmless. It's just the DN timing out the socket -- you set the
timeout to 0 in hbase-site.xml rather than in hdfs-site.xml, where it would have
an effect. See HADOOP-3831 for detail.
Here I show the output of 'hadoop dfsadmin -report'. The first time it is
invoked, all is well. The second time, one datanode is dead. The third time,
the dead datanode has come back to life:
[had...@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 1277248323584 (1.16 TB)
Present Capacity: 1208326105528 (1.1 TB)
DFS Remaining: 1056438108160 (983.88 GB)
DFS Used: 151887997368 (141.46 GB)
DFS Used%: 12.57%
Under replicated blocks: 3479
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 75694104268 (70.5 GB)
Non DFS Used: 35150238004 (32.74 GB)
DFS Remaining: 532889628672 (496.29 GB)
DFS Used%: 11.76%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:20:59 PDT 2010
Yeah, my guess as per above is that the reporting client couldn't get on to the
datanode because handlers were full or xceivers exceeded.
Let us know how it goes.
St.Ack
Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 633514352640 (590.01 GB)
DFS Used: 76193893100 (70.96 GB)
Non DFS Used: 33771980052 (31.45 GB)
DFS Remaining: 523548479488 (487.59 GB)
DFS Used%: 12.03%
DFS Remaining%: 82.64%
Last contact: Wed Apr 14 11:14:37 PDT 2010
[had...@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 643733970944 (599.52 GB)
Present Capacity: 609294929920 (567.45 GB)
DFS Remaining: 532876144640 (496.28 GB)
DFS Used: 76418785280 (71.17 GB)
DFS Used%: 12.54%
Under replicated blocks: 3247
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 1 (2 total, 1 dead)
Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 76418785280 (71.17 GB)
Non DFS Used: 34439041024 (32.07 GB)
DFS Remaining: 532876144640 (496.28 GB)
DFS Used%: 11.87%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:28:38 PDT 2010
Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 0 (0 KB)
DFS Used: 0 (0 KB)
Non DFS Used: 0 (0 KB)
DFS Remaining: 0(0 KB)
DFS Used%: 100%
DFS Remaining%: 0%
Last contact: Wed Apr 14 11:14:37 PDT 2010
[had...@dt1 ~]$ hadoop dfsadmin -report
Configured Capacity: 1277248323584 (1.16 TB)
Present Capacity: 1210726427080 (1.1 TB)
DFS Remaining: 1055440003072 (982.96 GB)
DFS Used: 155286424008 (144.62 GB)
DFS Used%: 12.83%
Under replicated blocks: 3338
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 10.241.6.79:50010
Decommission Status : Normal
Configured Capacity: 643733970944 (599.52 GB)
DFS Used: 77775145981 (72.43 GB)
Non DFS Used: 33086850051 (30.81 GB)
DFS Remaining: 532871974912 (496.28 GB)
DFS Used%: 12.08%
DFS Remaining%: 82.78%
Last contact: Wed Apr 14 11:29:44 PDT 2010
Name: 10.241.6.80:50010
Decommission Status : Normal
Configured Capacity: 633514352640 (590.01 GB)
DFS Used: 77511278027 (72.19 GB)
Non DFS Used: 33435046453 (31.14 GB)
DFS Remaining: 522568028160 (486.68 GB)
DFS Used%: 12.24%
DFS Remaining%: 82.49%
Last contact: Wed Apr 14 11:29:44 PDT 2010