First, never let a Java process swap. Disable swap on your machines.
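
For example, on a typical Linux box it would look something like this
(just a sketch; the commands are standard, but check your distro's
conventions before making it permanent):

  # turn swap off right away (lasts until the next reboot)
  sudo swapoff -a

  # or keep swap mounted but tell the kernel to avoid using it
  sudo sysctl vm.swappiness=0

  # to make either stick, comment out the swap entries in /etc/fstab
  # and/or add "vm.swappiness=0" to /etc/sysctl.conf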

Second, to get to the bottom of this issue you need to look at where the
exceptions first start showing up in the logs. In your case it seems we
only see indirect symptoms of a forceful failover by the HBase master.
Somewhere before that there should be a dump of the metrics, and the
root cause is probably a long garbage collection pause (made worse by
the swapping).
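
To confirm the GC theory, enable GC logging for the region server JVM
and correlate it with the exceptions. A sketch, assuming the stock
scripts (the HotSpot flags are standard, but the file names below are
just examples for illustration):

  # in conf/hbase-env.sh
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc.log"

  # after the next incident, look for multi-second pauses
  grep "Full GC" /var/log/hbase/gc.log

  # and see what the region server was doing before the lease errors
  grep -B 20 "LeaseExpiredException" hbase-regionserver.log | less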

J-D

On Sat, Jan 1, 2011 at 2:10 PM, Martin Arnandze <[email protected]> wrote:
> Hi All,
>  I was just wondering if anyone had a chance to look at this issue, or
> experienced something similar. I'm stuck on this and would appreciate
> any hint.
> Many thanks,
>   Martin
>
> On Dec 27, 2010, at 11:27 AM, Martin Arnandze wrote:
>
>> Hi,
>>   I'm using hbase 0.20.5 and hadoop 0.20.1. Some region servers are
>> crashing, saying that a file cannot be found and that a lease has
>> expired (log details below). I searched this mailing list for the exact
>> problem but was not successful. These are the symptoms:
>>
>> - Typically I see higher swap usage on these servers than on the ones
>> that didn't crash.
>> - I don't see any Xciever count exceeded message on the DataNode logs.
>> - Please find log snippets as well as configuration settings below.
>> - We don't write to WALs
>>
>> I would greatly appreciate any hint in this regard.
>>
>> Many thanks,
>>    Martin
>>
>> ======================
>>
>> These are some of my settings which I think might be relevant:
>>
>> hdfs-site:
>>  <property>
>>    <name>dfs.datanode.max.xcievers</name>
>>    <value>10000</value>
>>  </property>
>>
>> hbase-site:
>>
>>  <property>
>>    <name>hbase.regionserver.handler.count</name>
>>    <value>50</value>
>>  </property>
>>
>>  <property>
>>    <name>hbase.regionserver.global.memstore.upperLimit</name>
>>    <value>0.3</value>
>>  </property>
>>  <property>
>>    <name>hfile.block.cache.size</name>
>>    <value>0.3</value>
>>  </property>
>>
>>
>> ulimit -n
>> 100000
>>
>>
>> LOGS:
>>
>> Region Server:
>>
>> 2010-12-27 08:35:31,932 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
>> Finished snapshotting, commencing flushing stores
>> 2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.Store:
>> Added
>> hdfs://ls-nn05.netw.domain01.net:50001/hbase/usertrails_all/1777872268/d/2971304353461036329,
>> entries=22319, sequenceid=1717292326, memsize=16.1m, filesize=13.5m to
>> usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
>> 2010-12-27 08:35:32,426 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: 
>> Caches flushed, doing commit now (which includes update scanners)
>> 2010-12-27 08:35:32,426 INFO org.apache.hadoop.hbase.regionserver.HRegion: 
>> Finished memstore flush of ~16.1m for region 
>> usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
>>  in 494ms, sequence id=1717292326, compaction requested=false
>> 2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> Updates disabled for region, no outstanding scanners on
>> usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
>> 2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
>> No more row locks outstanding on region
>> usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
>> 2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.Store: 
>> closed d
>> 2010-12-27 08:35:32,427 INFO org.apache.hadoop.hbase.regionserver.HRegion:
>> Closed
>> usertrails_all,1b005e71-cd19-4d82-87a3-3fa269f4f1d1\x019223370750718307807\x015692970,1292964088406
>> 2010-12-27 08:35:32,427 DEBUG org.apache.hadoop.hbase.regionserver.HLog:
>> closing hlog writer in
>> hdfs://ls-nn05.netw.domain01.net:50001/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337
>> 2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
>> Exception: org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>> /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
>> File does not exist. Holder DFSClient_52782511 does not have any open files.
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
>>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>>         at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
>>         at org.apache.hadoop.ipc.Client.call(Client.java:739)
>>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>>         at $Proxy1.addBlock(Unknown Source)
>>         at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>>         at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at 
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>>         at 
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>>         at $Proxy1.addBlock(Unknown Source)
>>         at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2906)
>>         at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2788)
>>         at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2078)
>>         at 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2264)
>>
>> 2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> Recovery for block null bad datanode[0] nodes == null
>> 2010-12-27 08:35:32,497 WARN org.apache.hadoop.hdfs.DFSClient: Could not
>> get block locations. Source file
>> "/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959"
>> - Aborting...
>> 2010-12-27 08:35:32,499 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Close and delete
>> failed
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on
>> /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
>> File does not exist. Holder DFSClient_52782511 does not have any open files.
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
>>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>>         at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
>>
>>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
>> Method)
>>         at 
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>>         at 
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>>         at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>>         at 
>> org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
>>         at 
>> org.apache.hadoop.hbase.RemoteExceptionHandler.checkThrowable(RemoteExceptionHandler.java:48)
>>         at 
>> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:626)
>>         at java.lang.Thread.run(Thread.java:619)
>> 2010-12-27 08:35:32,500 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: telling master that
>> region server is shutting down at:
>> ls-hdfs24.netw.domain01.net,60020,1292875777337
>> 2010-12-27 08:35:32,508 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: stopping server at: 
>> ls-hdfs24.netw.domain01.net,60020,1292875777337
>> 2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ZooKeeper: Closing 
>> session: 0x2d05625f9b0008
>> 2010-12-27 08:35:32,509 INFO org.apache.zookeeper.ClientCnxn: Closing 
>> ClientCnxn for session: 0x2d05625f9b0008
>> 2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: Disconnecting 
>> ClientCnxn for session: 0x2d05625f9b0008
>> 2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ZooKeeper: Session: 
>> 0x2d05625f9b0008 closed
>> 2010-12-27 08:35:32,512 DEBUG 
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with 
>> ZooKeeper
>> 2010-12-27 08:35:32,512 INFO org.apache.zookeeper.ClientCnxn: EventThread 
>> shut down
>> 2010-12-27 08:35:32,614 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>> regionserver/10.32.20.70:60020 exiting
>> 2010-12-27 08:35:49,448 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 
>> -4947199215968218273 lease expired
>> 2010-12-27 08:35:49,538 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 
>> 3282292946771297386 lease expired
>> 2010-12-27 08:35:49,582 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner 
>> 7882165461457343907 lease expired
>> 2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases: 
>> regionserver/10.32.20.70:60020.leaseChecker closing leases
>> 2010-12-27 08:35:49,582 INFO org.apache.hadoop.hbase.Leases: 
>> regionserver/10.32.20.70:60020.leaseChecker closed leases
>> 2010-12-27 08:35:49,607 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
>> 2010-12-27 08:35:49,607 INFO 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
>>
>>
>>
>>
>> Namenode:
>>
>>
>> 2010-12-27 08:35:32,427 INFO org.apache.hadoop.ipc.Server: IPC Server 
>> handler 7 on 50001, call 
>> addBlock(/hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959,
>>  DFSClient_52782511) from 10.32.20.70:44967: error: 
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on 
>> /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
>>  File does not exist. Holder DFSClient_52782511 does not have any open files.
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on 
>> /hbase/.logs/ls-hdfs24.netw.domain01.net,60020,1292875777337/hlog.dat.1293458980959
>>  File does not exist. Holder DFSClient_52782511 does not have any open files.
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1328)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1319)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1247)
>>         at 
>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:434)
>>         at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>>         at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
>>
>>
>>
>> Datanode:
>>
>> Nothing too unusual. Some entries like the following:
>>
>> 2010-12-27 08:35:52,519 ERROR 
>> org.apache.hadoop.hdfs.server.datanode.DataNode: 
>> DatanodeRegistration(10.32.20.70:50010, 
>> storageID=DS-2137188581-173.192.229.228-50010-1273698157328, infoPort=50075, 
>> ipcPort=50020):DataXceiver
>> java.net.SocketException: Connection reset
>>         at java.net.SocketInputStream.read(SocketInputStream.java:168)
>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>         at java.io.DataInputStream.read(DataInputStream.java:132)
>>         at 
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:261)
>>         at 
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
>>         at 
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
>>         at 
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
>>         at 
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>>         at 
>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>>         at java.lang.Thread.run(Thread.java:619)
>> 2010-12-27 09:35:52,728 INFO 
>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock 
>> blk_-4708329089501359494_30620223 received exception 
>> java.net.SocketException: Connection reset
>>
>>
>
>
