Over in hbase-space, we are triggering a hang in DFSOutputStream.writeChunk.
Input appreciated.
Here are the two pertinent extracts from the hbase regionserver thread dump:
"IPC Server handler 9 on 60020" daemon prio=10 tid=0x00007fef1c3f0400 nid=0x7470 waiting for monitor entry [0x0000000042d18000..0x0000000042d189f0]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2486)
        - waiting to lock <0x00007fef38ecc138> (a java.util.LinkedList)
        - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
        - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
        - locked <0x00007fef38ecbdb8> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        - locked <0x00007fef38e09fc0> (a org.apache.hadoop.fs.FSDataOutputStream)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
        - locked <0x00007fef38e09f30> (a org.apache.hadoop.io.SequenceFile$Writer)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:980)
        - locked <0x00007fef38e09f30> (a org.apache.hadoop.io.SequenceFile$Writer)
        at org.apache.hadoop.hbase.regionserver.HLog.doWrite(HLog.java:461)
        at org.apache.hadoop.hbase.regionserver.HLog.append(HLog.java:421)
        - locked <0x00007fef29ad9588> (a java.lang.Integer)
        at org.apache.hadoop.hbase.regionserver.HRegion.update(HRegion.java:1676)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1439)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchUpdate(HRegion.java:1378)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1184)
        at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:616)
        at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:622)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
Here is the trace from the accompanying DataStreamer:
"DataStreamer for file /hbase/log_72.34.249.212_1225407466779_60020/hlog.dat.1227075571390 block blk_-7436808403424765554_553837" daemon prio=10 tid=0x0000000001c84c00 nid=0x7125 in Object.wait() [0x00000000409b3000..0x00000000409b3d70]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.ipc.Client.call(Client.java:709)
        - locked <0x00007fef39520bb8> (a org.apache.hadoop.ipc.Client$Call)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy4.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
        at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
        at org.apache.hadoop.dfs.DFSClient.createClientDatanodeProtocolProxy(DFSClient.java:139)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2185)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
        - locked <0x00007fef38ecc138> (a java.util.LinkedList)
Repeated thread dumps show that we never move on from this state.
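Note the cycle the two dumps above describe: the DataStreamer has locked the dataQueue LinkedList (<0x00007fef38ecc138>) and is waiting indefinitely on the RPC Call, while the handler's writeChunk is waiting to lock that same LinkedList. The shape of the hang can be reproduced with a small standalone sketch (all names here are mine, not DFSClient's):

```java
import java.util.LinkedList;
import java.util.concurrent.CountDownLatch;

// Sketch of the hang's shape: the "streamer" thread locks a shared queue
// and then waits forever on a second monitor (standing in for the
// no-timeout RPC call); the "writer" thread then blocks forever trying to
// enter the queue's monitor, as writeChunk does on the dataQueue.
public class HangSketch {
  static final LinkedList<byte[]> dataQueue = new LinkedList<>();
  static final Object pendingRpcCall = new Object();

  static Thread.State demo() throws InterruptedException {
    CountDownLatch queueLocked = new CountDownLatch(1);

    Thread streamer = new Thread(() -> {
      synchronized (dataQueue) {          // like DataStreamer.run()
        queueLocked.countDown();
        synchronized (pendingRpcCall) {
          try {
            pendingRpcCall.wait();        // unbounded wait, like Client.call
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      }
    }, "DataStreamer");
    streamer.setDaemon(true);
    streamer.start();
    queueLocked.await();                  // queue lock is now held

    Thread writer = new Thread(() -> {
      synchronized (dataQueue) {          // like writeChunk()
        dataQueue.add(new byte[0]);
      }
    }, "IPC-handler");
    writer.setDaemon(true);
    writer.start();

    Thread.sleep(500);                    // give the writer time to block
    return writer.getState();             // BLOCKED, as in the dump
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("writer state = " + demo());
  }
}
```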
Looking at where we're stuck in the DataStreamer, we're down in processDatanodeError, trying to set up an RPC proxy to the 'least' datanode as our new 'primary':
2178     // Tell the primary datanode to do error recovery
2179     // by stamping appropriate generation stamps.
2180     //
2181     Block newBlock = null;
2182     ClientDatanodeProtocol primary = null;
2183     try {
2184       // Pick the "least" datanode as the primary datanode to avoid deadlock.
2185       primary = createClientDatanodeProtocolProxy(
2186           Collections.min(Arrays.asList(newnodes)), conf);
2187       newBlock = primary.recoverBlock(block, newnodes);
2188     } catch (IOException e) {
If I read this right, the problem is that since the RPC has no timeout, we never return from createClientDatanodeProtocolProxy (at the time, a machine had vanished from the network).
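One generic way to avoid hanging the DataStreamer forever would be to bound the blocking call with a deadline. A minimal sketch of the pattern, using a hypothetical wrapper of my own (not Hadoop's API), assuming the stuck proxy setup can be run on a helper thread:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: run a blocking call (e.g. proxy creation) on a
// separate thread and give up after a deadline instead of waiting forever.
public class BoundedCall {
  public static <T> T callWithTimeout(Callable<T> call, long millis)
      throws Exception {
    ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
      Thread t = new Thread(r, "bounded-call");
      t.setDaemon(true);    // don't keep the JVM alive if the call hangs
      return t;
    });
    try {
      return pool.submit(call).get(millis, TimeUnit.MILLISECONDS);
    } finally {
      pool.shutdownNow();   // interrupt the worker if it is still stuck
    }
  }

  public static void main(String[] args) throws Exception {
    // A fast call returns normally.
    System.out.println(callWithTimeout(() -> 42, 1000));  // prints 42
    // A call that never returns (like the vanished datanode) now surfaces
    // as a TimeoutException instead of hanging the caller.
    try {
      callWithTimeout(() -> { Thread.sleep(60_000); return 0; }, 200);
    } catch (TimeoutException e) {
      System.out.println("timed out");
    }
  }
}
```

Note the limitation: this only unblocks the caller. The worker thread stuck in a non-interruptible socket read would not actually die on interrupt, so a real fix would need a timeout inside the RPC client itself.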
Thanks for any insight.
St.Ack