[ https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon updated HADOOP-6762:
--------------------------------
    Attachment: hadoop-6762.txt

Here's an updated patch against trunk.

I ran all of the unit tests in the ipc package locally and they passed. I also tried the new unit tests _without_ the patch, and they failed as expected. Given that there was a deadlock found in an early rev of this patch, I also ran all of the IPC unit tests under jcarder to look for lock inversions and it found none.

I ran the RPCCallBenchmark for 30 seconds with and without the patch, with the following results:

With patch:

====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 24668.0
CPU time per call on client: 58639 ns
CPU time per call on server: 64893 ns

Without patch:

====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 27881.0
CPU time per call on client: 68079 ns
CPU time per call on server: 62582 ns

As expected, the CPU time on the client was increased and the throughput went down by about 13%, since the RPC calls are now being shuttled between threads on the client side. That's unfortunate, but given that this fixes an important bug, and given that _client_ side RPC throughput is rarely a bottleneck in common usage scenarios, I think it is acceptable.

This patch is also nearly identical to a patch that we've shipped in CDH since June 2010, so I'm fairly confident that the approach is correct.

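For readers who want to see the failure mode in isolation, here is a minimal sketch. It is not code from the patch. It assumes a java.nio Pipe standing in for the shared RPC connection, and the class and method names (InterruptSharedChannelSketch, openAfterDirectWrite, openAfterHandoffWrite) are made up for illustration. The first method writes from a thread whose interrupt flag is already set, which is roughly what happens when close() interrupts the lease checker during an RPC. The second hands the write to a separate sender thread, so the interrupt only affects the caller's wait and not the channel.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; a Pipe stands in for the shared RPC socket.
class InterruptSharedChannelSketch {

    // The interrupted caller performs the channel write itself.
    static boolean openAfterDirectWrite() throws Exception {
        Pipe pipe = Pipe.open();
        Thread caller = new Thread(() -> {
            // Simulates close() interrupting a thread that is about to do RPC I/O.
            Thread.currentThread().interrupt();
            try {
                pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
            } catch (IOException e) {
                // Interruptible channels close themselves when the writing
                // thread is interrupted (ClosedByInterruptException).
            }
        });
        caller.start();
        caller.join();
        boolean open = pipe.sink().isOpen();
        pipe.sink().close();
        pipe.source().close();
        return open;
    }

    // The interrupted caller hands the write to a dedicated sender thread.
    static boolean openAfterHandoffWrite() throws Exception {
        Pipe pipe = Pipe.open();
        ExecutorService sender = Executors.newSingleThreadExecutor();
        Callable<Integer> send =
            () -> pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
        Thread caller = new Thread(() -> {
            Thread.currentThread().interrupt();
            Future<Integer> pending = sender.submit(send);
            try {
                pending.get();
            } catch (InterruptedException | ExecutionException e) {
                // Only the caller's wait is disturbed; the sender thread
                // was never interrupted, so the channel is left alone.
            }
        });
        caller.start();
        caller.join();
        sender.shutdown();
        sender.awaitTermination(5, TimeUnit.SECONDS);
        boolean open = pipe.sink().isOpen();
        pipe.sink().close();
        pipe.source().close();
        return open;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("direct write:  channel open = " + openAfterDirectWrite());
        System.out.println("handoff write: channel open = " + openAfterHandoffWrite());
    }
}
```

With the direct write the channel should end up closed for every user of the connection, while with the handoff it should stay open. The handoff is also where the extra cross-thread cost measured in the benchmark would come from.
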
> exception while doing RPC I/O closes channel
> --------------------------------------------
>
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-6762-10.txt, hadoop-6762-1.txt,
> hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt, hadoop-6762-6.txt,
> hadoop-6762-7.txt, hadoop-6762-8.txt, hadoop-6762-9.txt, HADOOP-6762.patch,
> hadoop-6762.txt, hadoop-6762.txt, hadoop-6762.txt
>
>
> If a single process creates two unique FileSystem instances to the same NN
> using FileSystem.newInstance(), and one of them issues a close(), the
> leasechecker thread is interrupted. This interrupt races with the RPC
> namenode.renew() and can cause a ClosedByInterruptException. This closes the
> underlying channel, and the other filesystem, which shares the connection,
> will get errors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira