Over in hbase-space, we trigger a hang in DFSOutputStream.writeChunk. Input appreciated.

Here are the two pertinent extracts from the hbase regionserver thread dump:

"IPC Server handler 9 on 60020" daemon prio=10 tid=0x00007fef1c3f0400 nid=0x7470 waiting for monitor entry [0x0000000042d18000..0x0000000042d189f0]
  java.lang.Thread.State: BLOCKED (on object monitor)
   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2486)
   - waiting to lock <0x00007fef38ecc138> (a java.util.LinkedList)
   - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
   at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
   at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
   - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
   at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
   - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
   at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
   at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
   - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
   at java.io.DataOutputStream.write(DataOutputStream.java:107)
   - locked <0x00007fef38e09fc0> (a org.apache.hadoop.fs.FSDataOutputStream)
   at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
   - locked <0x00007fef38e09f30> (a org.apache.hadoop.io.SequenceFile$Writer)
   at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:980)
   - locked <0x00007fef38e09f30> (a org.apache.hadoop.io.SequenceFile$Writer)
   at org.apache.hadoop.hbase.regionserver.HLog.doWrite(HLog.java:461)
   at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:421)
   - locked <0x00007fef29ad9588> (a java.lang.Integer)
   at org.apache.hadoop.hbase.regionserver.HRegion.update(HRegion.java:1676)
   at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1439)
   at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1378)
   at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1184)
   at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:616)
   at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:622)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)


Here is the trace from the accompanying DataStreamer thread:


"DataStreamer for file /hbase/log_72.34.249.212_1225407466779_60020/hlog.dat.1227075571390 block blk_-7436808403424765554_553837" daemon prio=10 tid=0x0000000001c84c00 nid=0x7125 in Object.wait() [0x00000000409b3000..0x00000000409b3d70]
  java.lang.Thread.State: WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   at java.lang.Object.wait(Object.java:502)
   at org.apache.hadoop.ipc.Client.call(Client.java:709)
   - locked <0x00007fef39520bb8> (a org.apache.hadoop.ipc.Client$Call)
   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
   at org.apache.hadoop.dfs.$Proxy4.getProtocolVersion(Unknown Source)
   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
   at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
   at org.apache.hadoop.dfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:139)
   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2185)
   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
   - locked <0x00007fef38ecc138> (a java.util.LinkedList)

Repeated thread dumps show that we never move on from this state. Note the cycle: the DataStreamer holds the java.util.LinkedList monitor <0x00007fef38ecc138> while parked in Object.wait(), and that is exactly the monitor the IPC handler is waiting to lock in writeChunk.
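The shape of the hang in the two dumps — one thread holding a monitor while parked in an unbounded wait, a second thread blocked trying to enter that monitor — can be reproduced in isolation. This is an illustrative sketch only (the class and thread names are made up, none of this is Hadoop code):

```java
import java.util.LinkedList;

public class StuckRpcDemo {
    static final LinkedList<byte[]> dataQueue = new LinkedList<>();
    static final Object rpcCall = new Object(); // stands in for the ipc.Client$Call

    public static void main(String[] args) throws Exception {
        // "DataStreamer": holds the dataQueue monitor while waiting, with no
        // timeout, on an RPC response that will never arrive (peer vanished).
        Thread streamer = new Thread(() -> {
            synchronized (dataQueue) {
                synchronized (rpcCall) {
                    try {
                        rpcCall.wait(); // unbounded wait, like Client.call()
                    } catch (InterruptedException ignored) {}
                }
            }
        }, "DataStreamer");
        streamer.start();
        Thread.sleep(200); // let the streamer take the dataQueue monitor

        // "IPC handler": writeChunk() needs the same dataQueue monitor.
        Thread handler = new Thread(() -> {
            synchronized (dataQueue) {
                dataQueue.add(new byte[0]); // only reached once the lock frees
            }
        }, "IPC Server handler");
        handler.start();
        Thread.sleep(200);

        System.out.println("streamer: " + streamer.getState()); // WAITING
        System.out.println("handler:  " + handler.getState());  // BLOCKED
        streamer.interrupt(); // only way out in this toy; see below for the real issue
    }
}
```

Running it prints the same two thread states the dumps show; nothing progresses until the waiting thread is interrupted from outside.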

Looking at where we're stuck in DataStreamer, we're down in processDatanodeError, trying to set up an RPC proxy to the 'least' datanode as our new 'primary':

2178         // Tell the primary datanode to do error recovery
2179         // by stamping appropriate generation stamps.
2180         //
2181         Block newBlock = null;
2182         ClientDatanodeProtocol primary =  null;
2183         try {
2184           // Pick the "least" datanode as the primary datanode to avoid deadlock.
2185           primary = createClientDatanodeProtocolProxy(
2186               Collections.min(Arrays.asList(newnodes)), conf);
2187           newBlock = primary.recoverBlock(block, newnodes);
2188         } catch (IOException e) {


If I read this right, the problem is that since the RPC has no timeout, we never return from createClientDatanodeProtocolProxy (at the time, a machine had vanished from the network).
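One way to avoid the hang would be to bound the wait on the RPC response with a deadline instead of waiting forever. A minimal sketch of that idea (this is not the actual ipc.Client code; the Call class and timeout value here are invented for illustration):

```java
import java.net.SocketTimeoutException;

public class BoundedCallDemo {
    // Toy stand-in for an in-flight RPC call object.
    static class Call { volatile boolean done = false; }

    // Wait for the call to complete, but give up after timeoutMs.
    static void waitForCall(Call call, long timeoutMs) throws Exception {
        synchronized (call) {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (!call.done) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0)
                    throw new SocketTimeoutException("RPC timed out");
                call.wait(remaining); // bounded wait instead of wait()
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Call call = new Call(); // the response never arrives
        try {
            waitForCall(call, 300);
        } catch (SocketTimeoutException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

With a deadline the caller gets a SocketTimeoutException it can handle (retry, pick another datanode), instead of holding the dataQueue monitor indefinitely and deadlocking the handler threads.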

Thanks for any insight.
St.Ack
