[jira] Updated: (HBASE-2042) ChecksumException generates thousands of open sockets

Richard Lackey (JIRA) Thu, 12 Aug 2010 07:08:48 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Richard Lackey updated HBASE-2042:
----------------------------------

    Attachment: hbase-2042.patch

This change applies to trunk. It is a common problem highlighted in hbase-2042 
(https://issues.apache.org/jira/browse/HBASE-2042).

The change is isolated to Hadoop common: org.apache.hadoop.ipc.Client.java 
Connection.close() svn: 982681

In general, a RuntimeException in close() will percolate up the stack eliciting 
unexpected errors. In this case, the file descriptor ulimit was exceeded, 
resulting in a NPE. 
The change is to protect close(). In addition to the general case, close() 
checks for a partially built Connection object as needed.

The proposed patch is attached.



> ChecksumException generates thousands of open sockets
> -----------------------------------------------------
>
>                 Key: HBASE-2042
>                 URL: https://issues.apache.org/jira/browse/HBASE-2042
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.92.0
>
>         Attachments: hbase-2042.patch
>
>
> Below is from Stefan Will:
> {code}
> I just ran into another case where a Regionserver wouldn't properly shut down 
> (this is on my development cluster, so not too critical). In this case, it is 
> also holding over 27000 datanode sockets (port 50010) in CLOSE_WAIT state. I 
> was actually using so many that the SSH daemon wouldn't let me log in to the 
> machine due to too many open file handles (I can probably figure out how to 
> increase that limit). So I shut down the Master and the other regionservers, 
> which might have been the cause for it to be in "handleConnectionFailure()". 
> The last few lines of the log were:
> 009-12-06 10:30:12,609 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: 
> 10.1.20.144:60020 <http://10.1.20.144:60020>
> - Hide quoted text -
> 2009-12-06 10:30:12,646 INFO org.apache.hadoop.hbase.regionserver.LogFlusher: 
> regionserver/10.1.20.144:60020.logFlusher exiting
> 2009-12-06 10:30:12,646 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer$MajorCompactionChecker: 
> regionserver/10.1.20.144:60020.majorCompactionChecker exiting
> 2009-12-06 10:30:14,871 INFO org.apache.hadoop.hbase.Leases: 
> regionserver/10.1.20.144:60020.leaseChecker closing leases
> 2009-12-06 10:30:14,871 INFO org.apache.hadoop.hbase.Leases: 
> regionserver/10.1.20.144:60020.leaseChecker closed leases
> 2009-12-06 10:30:17,366 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: worker thread exiting
> 2009-12-06 10:30:17,366 INFO org.apache.zookeeper.ZooKeeper: Closing session: 
> 0x2255b24f9f1003a
> 2009-12-06 10:30:17,366 INFO org.apache.zookeeper.ClientCnxn: Closing 
> ClientCnxn for session: 0x2255b24f9f1003a
> 2009-12-06 10:30:17,387 INFO org.apache.zookeeper.ClientCnxn: Exception while 
> closing send thread for session 0x2255b24f9f1003a : Read error rc = -1 
> java.nio.DirectByteBuffer[
> pos=0 lim=4 cap=4]
> 2009-12-06 10:30:17,489 INFO org.apache.zookeeper.ClientCnxn: Disconnecting 
> ClientCnxn for session: 0x2255b24f9f1003a
> 2009-12-06 10:30:17,489 INFO org.apache.zookeeper.ZooKeeper: Session: 
> 0x2255b24f9f1003a closed
> 2009-12-06 10:30:17,548 INFO org.apache.zookeeper.ClientCnxn: EventThread 
> shut down
> 2009-12-06 20:35:25,725 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
> And here is the stacktrace taken after I was able to log back into the 
> machine and send a kill to the process (at the last timestamp). It is now 15 
> minutes later, and the process is still running with no further log output.
> Between midnight and 10:30, there are also over 6700 checksum errors in the 
> log, always with the same log file name:
> org.apache.hadoop.fs.ChecksumException: Checksum error: 
> /blk_4234778497757442733:of:/hbase/post/1926133619/oldlogfile.log at 62521344
>         at 
> org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
>         at 
> org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
> Here is the stack dump:
> Exception in thread "LeaseChecker" java.lang.NullPointerException
>     at 
> org.apache.hadoop.ipc.Client$Connection.handleConnectionFailure(Client.java:351)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:313)
>     at org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
>     at org.apache.hadoop.ipc.Client.call(Client.java:719)
>     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
>     at $Proxy1.renewLease(Unknown Source)
>     at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>     at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>     at $Proxy1.renewLease(Unknown Source)
>     at 
> org.apache.hadoop.hdfs.DFSClient$LeaseChecker.renew(DFSClient.java:1046)
>     at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1058)
>     at java.lang.Thread.run(Thread.java:619)
> 2009-12-06 20:43:32
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.0-b16 mixed mode):
> "Thread-28" prio=10 tid=0x00000000539e5800 nid=0x742e in Object.wait() 
> [0x0000000040a45000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00002aaab7803738> (a java.lang.Thread)
>     at java.lang.Thread.join(Thread.java:1143)
>     - locked <0x00002aaab7803738> (a java.lang.Thread)
>     at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:77)
>     at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:66)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread.run(HRegionServer.java:992)
> "SIGTERM handler" daemon prio=10 tid=0x0000000053ca6000 nid=0x742d in 
> Object.wait() [0x0000000043b02000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00002aaab7941978> (a 
> org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread)
>     at java.lang.Thread.join(Thread.java:1143)
>     - locked <0x00002aaab7941978> (a 
> org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread)
>     at java.lang.Thread.join(Thread.java:1196)
>     at 
> java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
>     at 
> java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
>     at java.lang.Shutdown.runHooks(Shutdown.java:79)
>     at java.lang.Shutdown.sequence(Shutdown.java:123)
>     at java.lang.Shutdown.exit(Shutdown.java:168)
>     - locked <0x00002aaaed6f2ef0> (a java.lang.Class for java.lang.Shutdown)
>     at java.lang.Terminator$1.handle(Terminator.java:35)
>     at sun.misc.Signal$1.run(Signal.java:195)
>     at java.lang.Thread.run(Thread.java:619)
> "Thread-23" prio=10 tid=0x00002aab08019800 nid=0x2036 
> sleeping[0x00000000413de000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>     at java.lang.Thread.sleep(Native Method)
>     at org.apache.hadoop.ipc.Client.stop(Client.java:679)
>     at org.apache.hadoop.ipc.RPC$ClientCache.stopClient(RPC.java:192)
>     at org.apache.hadoop.ipc.RPC$ClientCache.access$400(RPC.java:141)
>     at org.apache.hadoop.ipc.RPC$Invoker.close(RPC.java:234)
>     - locked <0x00002aaab7955f10> (a org.apache.hadoop.ipc.RPC$Invoker)
>     at org.apache.hadoop.ipc.RPC$Invoker.access$500(RPC.java:199)
>     at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:393)
>     at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:241)
>     - locked <0x00002aaab78e5078> (a org.apache.hadoop.hdfs.DFSClient)
>     at 
> org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:269)
>     at org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1419)
>     - locked <0x00002aaab79563f0> (a org.apache.hadoop.fs.FileSystem$Cache)
>     at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:212)
>     at 
> org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:197)
>     - locked <0x00002aaab78037e0> (a 
> org.apache.hadoop.fs.FileSystem$ClientFinalizer)
> "LruBlockCache.EvictionThread" daemon prio=10 tid=0x0000000053920800 
> nid=0x31a4 in Object.wait() [0x0000000043900000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00002aaab7a56160> (a 
> org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread)
>     at java.lang.Object.wait(Object.java:485)
>     at 
> org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread.run(LruBlockCache.java:512)
>     - locked <0x00002aaab7a56160> (a 
> org.apache.hadoop.hbase.io.hfile.LruBlockCache$EvictionThread)
> "DestroyJavaVM" prio=10 tid=0x00002aaaf800f000 nid=0x2f53 waiting on 
> condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "regionserver/10.1.20.144:60020 <http://10.1.20.144:60020> " prio=10 
> tid=0x00002aaaf800e000 nid=0x2f6c waiting for monitor entry 
> [0x00000000415e2000]
> - Hide quoted text -
>    java.lang.Thread.State: BLOCKED (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00002aaab78037e0> (a 
> org.apache.hadoop.fs.FileSystem$ClientFinalizer)
>     at java.lang.Thread.join(Thread.java:1151)
>     - locked <0x00002aaab78037e0> (a 
> org.apache.hadoop.fs.FileSystem$ClientFinalizer)
>     at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:77)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.runThread(HRegionServer.java:722)
>     at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:677)
>     at java.lang.Thread.run(Thread.java:619)
> "Low Memory Detector" daemon prio=10 tid=0x00000000536b5000 nid=0x2f66 
> runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "CompilerThread1" daemon prio=10 tid=0x00000000536b2800 nid=0x2f65 waiting on 
> condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "CompilerThread0" daemon prio=10 tid=0x00000000536ae000 nid=0x2f64 waiting on 
> condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "Signal Dispatcher" daemon prio=10 tid=0x00000000536ac000 nid=0x2f63 runnable 
> [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x00000000536aa000 
> nid=0x2f62 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
> "Finalizer" daemon prio=10 tid=0x0000000053687800 nid=0x2f61 in Object.wait() 
> [0x00000000421e9000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00002aaab77f0108> (a java.lang.ref.ReferenceQueue$Lock)
>     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
>     - locked <0x00002aaab77f0108> (a java.lang.ref.ReferenceQueue$Lock)
>     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
>     at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
> "Reference Handler" daemon prio=10 tid=0x0000000053685800 nid=0x2f60 in 
> Object.wait() [0x00000000420e8000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait(Native Method)
>     - waiting on <0x00002aaab77f32b8> (a java.lang.ref.Reference$Lock)
>     at java.lang.Object.wait(Object.java:485)
>     at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>     - locked <0x00002aaab77f32b8> (a java.lang.ref.Reference$Lock)
> "VM Thread" prio=10 tid=0x000000005367e800 nid=0x2f5f runnable
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00000000534c1800 
> nid=0x2f54 runnable
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00000000534c3800 
> nid=0x2f55 runnable
> "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x00000000534c5800 
> nid=0x2f56 runnable
> "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x00000000534c7000 
> nid=0x2f57 runnable
> "Gang worker#4 (Parallel GC Threads)" prio=10 tid=0x00000000534c9000 
> nid=0x2f58 runnable
> "Gang worker#5 (Parallel GC Threads)" prio=10 tid=0x00000000534cb000 
> nid=0x2f59 runnable
> "Gang worker#6 (Parallel GC Threads)" prio=10 tid=0x00000000534cc800 
> nid=0x2f5a runnable
> "Gang worker#7 (Parallel GC Threads)" prio=10 tid=0x00000000534ce800 
> nid=0x2f5b runnable
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x000000005359e800 nid=0x2f5e 
> runnable
> "Gang worker#0 (Parallel CMS Threads)" prio=10 tid=0x000000005359b000 
> nid=0x2f5c runnable
> "Gang worker#1 (Parallel CMS Threads)" prio=10 tid=0x000000005359c800 
> nid=0x2f5d runnable
> "VM Periodic Task Thread" prio=10 tid=0x00000000536b7800 nid=0x2f67 waiting 
> on condition
> JNI global references: 960
> Heap
>  par new generation   total 115200K, used 67320K [0x00002aaaae6f0000, 
> 0x00002aaab63f0000, 0x00002aaab63f0000)
>   eden space 102400K,  65% used [0x00002aaaae6f0000, 0x00002aaab28a6130, 
> 0x00002aaab4af0000)
>   from space 12800K,   0% used [0x00002aaab5770000, 0x00002aaab5777f10, 
> 0x00002aaab63f0000)
>   to   space 12800K,   0% used [0x00002aaab4af0000, 0x00002aaab4af0000, 
> 0x00002aaab5770000)
>  concurrent mark-sweep generation total 896000K, used 705982K 
> [0x00002aaab63f0000, 0x00002aaaecef0000, 0x00002aaaecef0000)
>  concurrent-mark-sweep perm gen total 27204K, used 16361K 
> [0x00002aaaecef0000, 0x00002aaaee981000, 0x00002aaaf22f0000)
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-2042) ChecksumException generates thousands of open sockets

Reply via email to