Hi,
I'm using HBase 0.20.5 and Hadoop 0.20.1. Some region servers are crashing,
saying that a file cannot be found and that a lease has expired (log detail
below). I searched this mailing list for this exact problem but was not
successful. These are the symptoms:
- Typically I see higher swap usage on the servers that crash than on the
ones that don't.
- I don't see any "xceiver count exceeded" messages in the DataNode logs.
- Please find log snippets as well as configuration settings below.
- We don't write to WALs (a sketch of how we turn that off is right below).
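For context, this is roughly how we skip the WAL on our writes. It's a
minimal sketch against the 0.20 client API; the row key and qualifier are
placeholders ("d" is the real column family from the logs below), and I'm
quoting setWriteToWAL from memory, so correct me if that isn't the right
call on this version:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalSkippingWrite {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        HTable table = new HTable(conf, "usertrails_all");

        // Row key and qualifier are made-up examples; "d" is the column
        // family that appears in the region server log further down.
        Put put = new Put(Bytes.toBytes("example-row-key"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("q"), Bytes.toBytes("value"));

        // Skip the write-ahead log for this edit. Anything still in the
        // memstore (not yet flushed to an HFile) is lost if the region
        // server dies.
        put.setWriteToWAL(false);

        table.put(put);
    }
}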
I would greatly appreciate any hints in this regard.
Many thanks,
Martin
======================
These are some of my settings which I think might be relevant:
hdfs-site:
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>10000</value>
</property>
hbase-site:
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>50</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.upperLimit</name>
  <value>0.3</value>
</property>
<property>
  <name>hfile.block.cache.size</name>
  <value>0.3</value>
</property>
ulimit -n
100000
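In case it matters, this is how I sanity-checked that the xceiver setting
above is the one actually being loaded. It's a quick sketch and assumes the
hdfs-site.xml on my classpath is the same file the DataNodes read:

import org.apache.hadoop.conf.Configuration;

public class CheckXcievers {
    public static void main(String[] args) {
        // Reads whatever hdfs-site.xml is on the classpath; 256 is the
        // DataNode's built-in default if the property is missing.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        System.out.println("dfs.datanode.max.xcievers = "
                + conf.getInt("dfs.datanode.max.xcievers", 256));
    }
}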
LOGS:
Region Server:
2010-12-27 08:35:31,932 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Finished snapshotting, commencing flushing stores
2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.Store: Added
hdfs://ls-nn05.netw.domain01.net:50001/hbase/usertrails_all/1777872268/d/2971304353461036329,
entries=22319, sequenceid=1717292326, memsize=16.1m, filesize=13.5m to
usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,426 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Caches flushed, doing commit now (which includes update scanners)
2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Finished memstore flush of ~16.1m for region
usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
in 494ms, sequence id=1717292326, compaction requested=false
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Updates disabled for region, no outstanding scanners on
usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: No
more row locks outstanding on region
usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.Store:
closed d
2010-12-27 08:35:32,427 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Closed
usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HLog:
closing hlog writer in
hdfs://ls-nn05.netw.domain01.net:50001/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
Exception: org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
File does not exist. Holder DFSClient_52782511 does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
at org.apache.hadoop.ipc.Client.call(Client.java:739)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy1.addBlock(Unknown Source)
at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.addBlock(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery
for block null bad datanode[0] nodes == null
2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Could not get
block locations. Source file
"/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959"
- Aborting...
2010-12-27 08:35:32,499 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete failed
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
File does not exist. Holder DFSClient_52782511 does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
at
org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
at
org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:626)
at java.lang.Thread.run(Thread.java:619)
2010-12-27 08:35:32,500 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: telling master that region
server is shutting down at: ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,508 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server at:
ls-hdfs24.netw.domain01.net,60020,1292875777337
2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ZooKeeper: Closing session:
0x2d05625f9b0008
2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ClientCnxn: Closing
ClientCnxn for session: 0x2d05625f9b0008
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: Disconnecting
ClientCnxn for session: 0x2d05625f9b0008
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ZooKeeper: Session:
0x2d05625f9b0008 closed
2010-12-27 08:35:32,512 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with
ZooKeeper
2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: EventThread shut
down
2010-12-27 08:35:32,614 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer:
regionserver/10.32.20.70:60020 exiting
2010-12-27 08:35:49,448 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-4947199215968218273 lease expired
2010-12-27 08:35:49,538 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 3282292946771297386
lease expired
2010-12-27 08:35:49,582 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 7882165461457343907
lease expired
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases:
regionserver/10.32.20.70:60020.leaseChecker closing leases
2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases:
regionserver/10.32.20.70:60020.leaseChecker closed leases
2010-12-27 08:35:49,607 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2010-12-27 08:35:49,607 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
Namenode:
2010-12-27 08:35:32,427 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7
on 50001, call
addBlock(/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959,
DFSClient_52782511) from 10.32.20.70:44967: error:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
File does not exist. Holder DFSClient_52782511 does not have any open files.
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
File does not exist. Holder DFSClient_52782511 does not have any open files.
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
Datanode:
Nothing too unusual; just some entries like the following:
2010-12-27 08:35:52,519 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.32.20.70:50010,
storageID=DS-2137188581-173.192.229.228-50010-1273698157328, infoPort=50075,
ipcPort=50020):DataXceiver
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:261)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
at
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
at
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:619)
2010-12-27 09:35:52,728 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
writeBlock blk_-4708329089501359494_30620223 received exception
java.net.SocketException: Connection reset