[ 
https://issues.apache.org/jira/browse/HBASE-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266310#comment-14266310
 ] 

Cosmin Lehene commented on HBASE-5079:
--------------------------------------

[~saint....@gmail.com],  please review, this is old and doesn' but seems nasty 

> DistributedLogSplitter interrupt can be hazardous to regionserver health
> ------------------------------------------------------------------------
>
>                 Key: HBASE-5079
>                 URL: https://issues.apache.org/jira/browse/HBASE-5079
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>              Labels: delete
>
> The DLS interrupt can kill the regionserver if happens while conversation w/ 
> namenode is going on.
> The interrupt is used to end a task on regionserver when done whether 
> successful or to interrupt an ongoing split since assumed by another server.
> I saw this issue testing because I was killing servers.  I also was suffering 
> "HBASE-5078 DistributedLogSplitter failing to split file because it has edits 
> for lots of regions" which made it more likely to happen.
> Here is what it looks like on the regionserver that died:
> {code}
> 2011-12-20 17:54:58,009 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: task 
> /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403495463
>  preempted from sv4r13s38,7003,1324365396583, current task state and 
> owner=owned sv4r27s44,7003,1324365396664
> 2011-12-20 17:54:58,009 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to 
> stop the worker thread
> 2011-12-20 17:54:59,133 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: task 
> /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403495463
>  preempted from sv4r13s38,7003,1324365396583, current task state and 
> owner=owned sv4r27s44,7003,1324365396664
> 2011-12-20 17:54:59,134 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to 
> stop the worker thread
> ...
> 2011-12-20 17:55:25,505 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: task 
> /hbase/splitlog/hdfs%3A%2F%2Fsv4r11s38%3A7000%2Fhbase%2F.logs%2Fsv4r31s44%2C7003%2C1324365396770-splitting%2Fsv4r31s44%252C7003%252C1324365396770.1324403495463
>  preempted from sv4r13s38,7003,1324365396583, current task state and 
> owner=unassigned sv4r11s38,7001,1324365395047
> 2011-12-20 17:55:25,505 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to 
> stop the worker thread
> {code}
> Three interrupts are sent over period of 31 seconds or so.
> Eventually the interrupt has an effect and I get:
> {code}
> 2011-12-20 17:55:25,505 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to 
> stop the worker thread
> 2011-12-20 17:55:48,022 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: 
> HLog roll requested
> 2011-12-20 17:55:58,070 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer 
> Exception: java.io.IOException: Call to sv4r11s38/10.4.11.38:7000 failed on 
> local exception: java.nio.channels.ClosedByInterruptException
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1103)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy9.addBlock(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy9.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826)
> Caused by: java.nio.channels.ClosedByInterruptException
>         at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
>         at 
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>         at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:779)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1047)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy9.getFileInfo(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy9.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:875)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:768)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.convertRegionEditsToTemp(HLogSplitter.java:1097)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.createWAP(HLogSplitter.java:1066)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.convertRegionEditsToTemp(HLogSplitter.java:1097)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.createWAP(HLogSplitter.java:1066)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:410)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:351)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:266)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> 2011-12-20 17:55:58,070 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not prepare temp 
> staging area
> java.io.IOException: Call to sv4r11s38/10.4.11.38:7000 failed on local 
> exception: java.nio.channels.ClosedByInterruptException
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1103)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy9.getFileInfo(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy9.getFileInfo(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:875)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:513)
>         at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:768)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.convertRegionEditsToTemp(HLogSplitter.java:1097)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.createWAP(HLogSplitter.java:1066)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:410)
>         at 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFileToTemp(HLogSplitter.java:351)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:113)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:266)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:197)
>         at 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:165)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.nio.channels.ClosedByInterruptException
>         at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
>         at 
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>         at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:779)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1047)
>         ... 20 more
> {code}
> Now here is the wacky part.  Above we are trying to go to the namenode it 
> looks like and the interrupt is closing socket.  At same, time, I'm trying to 
> flush and it fails with same stack trace ... and because we failed a flush, 
> regionserver goes down:
> {code}
> 2011-12-20 17:55:58,071 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery 
> for block blk_-6239131583587622790_187561 bad datanode[0] nodes == null
> 2011-12-20 17:55:58,073 WARN org.apache.hadoop.hdfs.DFSClient: Could not get 
> block locations. Source file 
> "/hbase/TestTable/9f98e6764d322832c845b740336e5750/.tmp/0df1673c96274028bda24c9cb49e9c3e"
>  - Aborting...
> 2011-12-20 17:55:58,074 FATAL 
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 
> sv4r13s38,7003,1324365396583: Replay of HLog required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region: 
> TestTable,0134394898,1323822783216.9f98e6764d322832c845b740336e5750.
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1276)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1160)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1102)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:400)
>  
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:374)
>         at 
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:243)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.IOException: Call to sv4r11s38/10.4.11.38:7000 failed on 
> local exception: java.nio.channels.ClosedByInterruptException
>         at org.apache.hadoop.ipc.Client.wrapException(Client.java:1103)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1071)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy9.addBlock(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy9.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3507)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3370)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2700(DFSClient.java:2586)
>         at 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2826)
> Caused by: java.nio.channels.ClosedByInterruptException
>         at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:341)
>         at 
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:106)
>         at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:779)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1047)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>         at $Proxy9.getFileInfo(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy9.getFileInfo(Unknown Source)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to