Hello,
    I adjusted the option "zookeeper.session.timeout" to 120000, then
restarted the HBase cluster and the test program. After running normally
for 14 hours, one of the datanodes shut down. When I restarted Hadoop and
HBase and checked the row count of the table 'webpage', I got a result of
6625, while the test program log says there should be at least 885000.
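    (For reference, the row count can be reproduced with a plain
client-side scan; below is a minimal sketch against the 0.20 client API --
the class name is made up, and the shell's "count 'webpage'" command
reports the same number.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// Count the rows of 'webpage' by iterating a full table scan.
public class CountRows {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "webpage");
        ResultScanner scanner = table.getScanner(new Scan());
        long rows = 0;
        for (Result r : scanner) {
            rows++;                        // one Result per row key
        }
        scanner.close();
        System.out.println("row count: " + rows);
    }
}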
    Too much data has been lost. Following is the end part of the datanode
log on that server.

2009-08-06 04:28:32,214 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.33.9:45465, dest: /192.168.33.6:50010, bytes: 1214, op: HDFS_WRITE, cliID: DFSClient_1777493426, srvID: DS-1028185837-192.168.33.6-50010-1249268609430, blockid: blk_-402434507207277902_27468
2009-08-06 04:28:32,214 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_-402434507207277902_27468 terminating
2009-08-06 04:28:32,606 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.33.6:50010, dest: /192.168.33.5:44924, bytes: 446, op: HDFS_READ, cliID: DFSClient_-255011821, srvID: DS-1028185837-192.168.33.6-50010-1249268609430, blockid: blk_-2647720945992878390_27447
2009-08-06 04:28:32,612 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.33.6:50010, dest: /192.168.33.5:44925, bytes: 277022, op: HDFS_READ, cliID: DFSClient_-255011821, srvID: DS-1028185837-192.168.33.6-50010-1249268609430, blockid: blk_-2647720945992878390_27447
2009-08-06 04:28:32,770 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5186903983646527212_27469 src: /192.168.33.5:44941 dest: /192.168.33.6:50010
2009-08-06 04:29:35,672 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder blk_1888582734643135148_27447 1 Exception java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.33.6:35418 remote=/192.168.33.5:50010]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:853)
        at java.lang.Thread.run(Thread.java:619)

2009-08-06 04:29:35,673 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 1 for block blk_1888582734643135148_27447 terminating
2009-08-06 04:29:35,683 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1888582734643135148_27447 java.io.EOFException: while trying to read 65557 bytes
2009-08-06 04:29:35,689 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1888582734643135148_27447 received exception java.io.EOFException: while trying to read 65557 bytes
2009-08-06 04:29:35,689 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.33.6:50010, storageID=DS-1028185837-192.168.33.6-50010-1249268609430, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
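
    By the way, the "60000 millis timeout" in the trace looks like HDFS's
default socket read timeout (dfs.socket.timeout, 60 s). Would it make sense
to also raise that, together with the datanode write timeout, in
hdfs-site.xml? Something like (the doubled values here are just a guess):

    <property>
      <name>dfs.socket.timeout</name>
      <value>120000</value>
      <!-- socket read timeout; the default is 60000 ms -->
    </property>

    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>960000</value>
      <!-- socket write timeout; the default is 480000 ms -->
    </property>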




    *************************************




    And the following is part of the test program log.

insertting 880000 webpages need 51920792 ms.
insertting 881000 webpages need 51972741 ms.
insertting 882000 webpages need 52024775 ms.
09/08/06 04:32:20 WARN zookeeper.ClientCnxn: Exception closing session 0x222e91bb6b90002 to sun.nio.ch.selectionkeyi...@527809c6
java.io.IOException: TIMED OUT
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:858)
09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Attempting connection to server ubuntu3/192.168.33.8:2222
09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/192.168.33.7:52496 remote=ubuntu3/192.168.33.8:2222]
09/08/06 04:32:21 INFO zookeeper.ClientCnxn: Server connection successful
insertting 883000 webpages need 52246380 ms.
insertting 884000 webpages need 52298370 ms.
insertting 885000 webpages need 52380479 ms.
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=true, tries=9, numtries=10, i=0, listsize=1, location=address: 192.168.33.5:60020, regioninfo: REGION => {NAME => 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420', STARTKEY => 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696', ENDKEY => '', ENCODED => 1607113409, TABLE => {{NAME => 'webpage', FAMILIES => [{NAME => 'CF_CONTENT', COMPRESSION => 'NONE', VERSIONS => '2', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'CF_INFORMATION', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, region=webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420 for region webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420, row 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504668723_885781', but failed after 10 attempts.
Exceptions:

        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1041)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:584)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:450)
        at hbasetest.HBaseWebpage.insert(HBaseWebpage.java:82)
        at hbasetest.InsertThread.run(InsertThread.java:26)
org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server Some server, retryOnlyOne=true, index=0, islastrow=true, tries=9, numtries=10, i=0, listsize=1, location=address: 192.168.33.5:60020, regioninfo: REGION => {NAME => 'webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420', STARTKEY => 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696', ENDKEY => '', ENCODED => 1607113409, TABLE => {{NAME => 'webpage', FAMILIES => [{NAME => 'CF_CONTENT', COMPRESSION => 'NONE', VERSIONS => '2', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'CF_INFORMATION', COMPRESSION => 'NONE', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}, region=webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420 for region webpage,http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504151762_879696,1249504267420, row 'http:\x2F\x2Fnews.163.com\x2F09\x2F0803\x2F01\x2F5FOO155J0001124J.html1249504754735_885782', but failed after 10 attempts.
Exceptions:

        at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1041)
        at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:584)
        at org.apache.hadoop.hbase.client.HTable.put(HTable.java:450)
        at hbasetest.HBaseWebpage.insert(HBaseWebpage.java:82)
        at hbasetest.InsertThread.run(InsertThread.java:26)
...
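
    In case it helps, the failing insert is just a standard client-side
put; HBaseWebpage.insert() is essentially the following (a simplified
sketch -- the real row key is URL + timestamp + sequence number as in the
log above, and the column qualifiers here are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Simplified sketch of the test program's insert path: one Put per
// webpage, buffered on the client and flushed in batches.
public class HBaseWebpage {
    private final HTable table;

    public HBaseWebpage() throws Exception {
        table = new HTable(new HBaseConfiguration(), "webpage");
        table.setAutoFlush(false);     // buffer puts client-side
    }

    public void insert(String rowKey, byte[] content, byte[] info)
            throws Exception {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("CF_CONTENT"), Bytes.toBytes("page"), content);
        put.add(Bytes.toBytes("CF_INFORMATION"), Bytes.toBytes("meta"), info);
        // With autoflush off, put() flushes the write buffer when it
        // fills; the RetriesExhaustedException above is thrown from the
        // flushCommits() that put() triggers internally.
        table.put(put);
    }
}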



    Any suggestions?
    Thanks a lot,
    LvZheng

2009/8/5 Zheng Lv <[email protected]>

> Hi Stack,
>     Thank you very much for your explanation.
>     We just adjusted the value of the property "zookeeper.session.timeout"
> to 120000, and we are observing the system now.
>     "Are nodes running on same nodes as hbase? " --Do you mean we should
> have several servers running exclusively for zk cluster? But I'm afraid that
> we can not have that many servers. Any suggestion?
>     We don't configure ZooKeeper in zoo.cfg, but in hbase-site.xml. The
> following is the ZooKeeper-related part of our hbase-site.xml.
>     <property>
>       <name>hbase.zookeeper.property.clientPort</name>
>       <value>2222</value>
>     </property>
>
>     <property>
>       <name>hbase.zookeeper.quorum</name>
>       <value>ubuntu2,ubuntu3,ubuntu7,ubuntu9,ubuntu6</value>
>     </property>
>
>     <property>
>       <name>zookeeper.session.timeout</name>
>       <value>120000</value>
>     </property>
>
>     Thanks a lot,
>     LvZheng
>
>
