On Hudson, we've been seeing tests sporadically hang on an ipc Client flush of params. I'm writing the list for suggestions or opinions on what folks think might be happening, or ideas on what to try next. See below for the latest example, a thread dump from a recent patch build.

The usual scenario is that we are trying to simulate failed servers in a mini-cluster. All servers -- hbase + dfs servers -- are up and running inside the same JVM. The remote ipc Server will all of a sudden have its stop method run to simulate a server crash. The Client, unawares, tries to go about its usual business.

   [junit] "HMaster.metaScanner" daemon prio=10 tid=0x091ecde0 nid=0x4a runnable [0xe2af9000..0xe2af9b38]
   [junit]      at java.net.SocketOutputStream.socketWrite0(Native Method)
   [junit]      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
   [junit]      at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
   [junit]      at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:190)
   [junit]      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
   [junit]      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
   [junit]      - locked <0xf7bb40e0> (a java.io.BufferedOutputStream)
   [junit]      at java.io.DataOutputStream.flush(DataOutputStream.java:106)
   [junit]      at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:325)
   [junit]      - locked <0xf7bb3f68> (a java.io.DataOutputStream)
   [junit]      at org.apache.hadoop.ipc.Client.call(Client.java:462)
   [junit]      - locked <0xf7bb3fa8> (a org.apache.hadoop.ipc.Client$Call)
   [junit]      at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
   [junit]      at $Proxy8.openScanner(Unknown Source)
   [junit]      at org.apache.hadoop.hbase.HMaster$BaseScanner.scanRegion(HMaster.java:207)
   [junit]      at org.apache.hadoop.hbase.HMaster$MetaScanner.scanOneMetaRegion(HMaster.java:643)
   [junit]      - locked <0xf7b6b460> (a java.lang.Integer)
   [junit]      at org.apache.hadoop.hbase.HMaster$MetaScanner.maintenanceScan(HMaster.java:694)
   [junit]      at org.apache.hadoop.hbase.HMaster$BaseScanner.chore(HMaster.java:188)
   [junit]      at org.apache.hadoop.hbase.Chore.run(Chore.java:59)


Other threads in the thread dump are parked at the DataOutputStream synchronized block.

Please correct me if I am wrong, but it is my understanding that writes do not time out, nor is this type of I/O interruptible. The connection is probably already established, else it would have timed out trying to connect to the non-existent server; besides, the ipc Client pattern seems to be to keep up the connection, multiplexing 'commands' to the remote server...
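That behavior is easy to demonstrate outside of Hadoop. Below is a minimal sketch, not the ipc code itself: Socket.setSoTimeout only covers reads, so the only way I know of to break a write that is blocked on a full send buffer is to have another thread close the socket out from under it, which makes the blocked write throw. The class and method names here are illustrative, and the buffer sizes and 500 ms delay are arbitrary assumptions for the demo.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class StuckWriteDemo {

    // Fills the TCP send/receive buffers so the write blocks (the peer never
    // reads), then closes the socket from a watchdog thread. Returns the
    // class name of the exception that unblocks the write.
    static String interruptBlockedWrite() throws Exception {
        try (ServerSocket server = new ServerSocket()) {
            server.setReceiveBufferSize(4096);      // keep buffers small so we block fast
            server.bind(new InetSocketAddress("127.0.0.1", 0));

            final Socket client = new Socket();
            client.setSendBufferSize(4096);
            client.connect(server.getLocalSocketAddress());
            Socket accepted = server.accept();      // accepted but never read from

            // Watchdog: close the client socket after 500 ms -- the kind of
            // write-side timeout the ipc Client currently does not have.
            Thread watchdog = new Thread(() -> {
                try {
                    Thread.sleep(500);
                    client.close();
                } catch (Exception ignored) {
                }
            });
            watchdog.start();

            byte[] chunk = new byte[64 * 1024];
            try {
                OutputStream out = client.getOutputStream();
                while (true) {
                    out.write(chunk);               // buffers fill, then this blocks
                }
            } catch (IOException expected) {
                return expected.getClass().getSimpleName();
            } finally {
                watchdog.join();
                accepted.close();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("write unblocked with: " + interruptBlockedWrite());
    }
}
```

If the diagnosis is right, something along these lines (a timer that closes connections whose calls have been outstanding too long) may be the only way to get the metaScanner thread unstuck.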

I'm wondering why we don't get an exception on the client side when the remote end of the socket goes away.
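My guess at the answer is plain TCP buffering: a write just hands bytes to the local TCP stack, so the first write after the peer has closed succeeds silently, and the failure only surfaces on a later write once the RST has come back. A minimal sketch of that, assuming a local loopback connection (class and method names are made up for the demo):

```java
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class PeerGoneDemo {

    // The peer accepts and immediately closes. The client's first write
    // afterwards does NOT throw: the bytes are accepted by the local TCP
    // stack, and the error can only show up on a subsequent write.
    static boolean firstWriteAfterPeerCloseSucceeds() throws Exception {
        try (ServerSocket server =
                new ServerSocket(0, 1, InetAddress.getByName("127.0.0.1"))) {
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort())) {
                server.accept().close();   // remote side goes away
                Thread.sleep(100);         // let the FIN arrive

                OutputStream out = client.getOutputStream();
                out.write(1);              // no exception here -- just buffered/sent
                out.flush();
                return true;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("first write succeeded: " + firstWriteAfterPeerCloseSucceeds());
    }
}
```

Which would mean the hang isn't the Client missing an exception so much as the exception never being generated until TCP has something to complain about.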

Am unable to reproduce locally.

Thanks for any input,
St.Ack
