[ 
https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HADOOP-6762:
--------------------------------

    Attachment: hadoop-6762.txt

Here's an updated patch against trunk.

I ran all of the unit tests in the ipc package locally and they passed. I also 
tried the new unit tests _without_ the patch, and they failed as expected.

Given that there was a deadlock found in an early rev of this patch, I also ran 
all of the IPC unit tests under jcarder to look for lock inversions and it 
found none.

I ran the RPCCallBenchmark for 30 seconds with and without the patch, with the 
following results:

With patch:
====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 24668.0
CPU time per call on client: 58639 ns
CPU time per call on server: 64893 ns


Without patch:
====== Results ======
Options:
rpcEngine=class org.apache.hadoop.ipc.ProtobufRpcEngine
serverThreads=30
serverReaderThreads=4
clientThreads=30
host=0.0.0.0
port=12345
secondsToRun=30
msgSize=1024
Total calls per second: 27881.0
CPU time per call on client: 68079 ns
CPU time per call on server: 62582 ns

As expected, the CPU time on the client increased and the throughput went 
down by about 12% (from 27881 to 24668 calls per second), since the RPC calls 
are now being shuttled between threads on the client side. That's unfortunate, 
but given that this fixes an important bug, and given that _client_ side RPC 
throughput is rarely a bottleneck in common usage scenarios, I think it is 
acceptable.
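
To illustrate what "shuttled between threads" means here, the following is a 
minimal sketch and not the code from the attached patch. It assumes the 
channel write is handed to a dedicated sender thread, so that an interrupt 
delivered to the calling thread only interrupts the wait and never reaches 
the thread doing channel I/O. The class name SenderThreadSketch is 
hypothetical, and a java.nio Pipe stands in for the shared RPC socket.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: not the actual HADOOP-6762 patch.
class SenderThreadSketch {

    /** Returns true if the shared channel is still open after the caller was interrupted. */
    static boolean channelSurvivesCallerInterrupt() {
        // All channel writes happen on this single thread, never on the caller.
        ExecutorService sender = Executors.newSingleThreadExecutor();
        try {
            Pipe pipe = Pipe.open(); // stand-in for the shared RPC connection
            try (Pipe.SinkChannel shared = pipe.sink();
                 Pipe.SourceChannel source = pipe.source()) {
                Callable<Integer> write =
                        () -> shared.write(ByteBuffer.wrap(new byte[] {1, 2, 3}));

                // Simulate close() interrupting the calling thread mid-call.
                Thread.currentThread().interrupt();
                Future<Integer> pending = sender.submit(write);
                try {
                    pending.get();
                } catch (InterruptedException e) {
                    // Only the wait was interrupted; the write runs on the sender thread.
                }
                Thread.interrupted(); // clear the flag before waiting again

                int written = pending.get();
                return shared.isOpen() && written == 3;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            sender.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("channel open after caller interrupt: "
                + channelSurvivesCallerInterrupt());
    }
}
```

The extra hand-off between the caller and the sender thread is the kind of 
per-call overhead that would show up as lower client-side throughput.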

This patch is also nearly identical to a patch that we've shipped in CDH since 
June 2010, so I'm fairly confident that the approach is correct.
                
> exception while doing RPC I/O closes channel
> --------------------------------------------
>
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-6762-10.txt, hadoop-6762-1.txt, 
> hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt, hadoop-6762-6.txt, 
> hadoop-6762-7.txt, hadoop-6762-8.txt, hadoop-6762-9.txt, HADOOP-6762.patch, 
> hadoop-6762.txt, hadoop-6762.txt, hadoop-6762.txt
>
>
> If a single process creates two distinct FileSystem instances for the same 
> NN using FileSystem.newInstance(), and one of them issues a close(), the 
> leasechecker thread is interrupted.  This interrupt races with the RPC call 
> namenode.renew() and can cause a ClosedByInterruptException.  This closes 
> the underlying channel, and the other filesystem, which shares the 
> connection, will get errors.
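
The underlying java.nio behaviour can be seen in isolation with a small 
sketch. It is not Hadoop code: the class name InterruptClosesChannel is 
hypothetical, and a Pipe stands in for the shared RPC socket. It relies on 
the documented InterruptibleChannel contract that a thread whose interrupt 
status is set when it starts a blocking I/O operation receives a 
ClosedByInterruptException and the channel is closed for every user.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;

// Hypothetical sketch: a Pipe stands in for the shared RPC connection.
class InterruptClosesChannel {

    /** Returns true if an interrupted read left the shared channel closed. */
    static boolean interruptedReadClosesChannel() {
        try {
            Pipe pipe = Pipe.open();
            Pipe.SourceChannel shared = pipe.source();
            try {
                // Simulate the interrupt landing just before the I/O call.
                Thread.currentThread().interrupt();
                try {
                    shared.read(ByteBuffer.allocate(16));
                    return false; // not expected: the read should not complete
                } catch (ClosedByInterruptException e) {
                    // The channel is now closed for all of its users.
                    return !shared.isOpen();
                }
            } finally {
                Thread.interrupted(); // clear the interrupt flag
                pipe.sink().close();
                shared.close();       // no-op if already closed
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("channel closed by interrupt: "
                + interruptedReadClosesChannel());
    }
}
```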
