[ 
http://issues.apache.org/jira/browse/HADOOP-255?page=comments#action_12413498 ] 

Naveen Nalam commented on HADOOP-255:
-------------------------------------

Well, the problem I was seeing is that a getFile RPC request for, say, 1GB was 
issued, but then the Call object timed out on the client. Yet the 1GB was still 
transferred in full to the client and then discarded, since no Call object was 
waiting for it. My system got into a situation where thousands of getFile 
requests were queued up per node in the server's TCP receive buffers, so no 
progress could ever be made.

Are you suggesting that this is not going to be a problem because all the large 
response body RPCs will now be done over HTTP?  I can't see how leaving the 
code as it is would be fine, unless all RPCs are going to be very quick to 
service and have small response bodies.

You mentioned that search queries are an example of an RPC that could be 
broadcast out. What would happen if the queries were taking too long to service 
and the client-side RPC requests were already timing out? I could see it getting 
into a similar situation, where the servers become busy processing stale query 
requests.

So if all the RPCs will have small response bodies, then it seems fine to keep 
the connection always open and just read in the response and throw it away. And 
then why not add a CANCEL-RPC request type that can be sent whenever the 
client request has timed out?
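
To make the CANCEL-RPC idea concrete, here is a rough sketch of how a server 
could skip calls whose client has already given up. None of these names 
(CancelRpcSketch, Message, Server, serviceNext) exist in Hadoop's ipc package; 
they are made up for illustration, and there is no real networking here.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical sketch only: these classes are not part of Hadoop's ipc code.
class CancelRpcSketch {
    enum Type { CALL, CANCEL }

    // A wire message: either a normal call or a cancellation of an earlier call id.
    static final class Message {
        final Type type;
        final int callId;

        Message(Type type, int callId) {
            this.type = type;
            this.callId = callId;
        }
    }

    // Server side: queue incoming calls and drop cancelled ones before doing any work.
    static final class Server {
        private final Queue<Integer> pending = new ArrayDeque<>();
        private final Set<Integer> cancelled = new HashSet<>();

        void receive(Message m) {
            if (m.type == Type.CALL) {
                pending.add(m.callId);
            } else {
                cancelled.add(m.callId);
            }
        }

        // Returns the id of the next call that was not cancelled, or -1 if none is left.
        int serviceNext() {
            Integer id;
            while ((id = pending.poll()) != null) {
                if (cancelled.remove(id)) {
                    continue; // the caller timed out, so skip the stale request
                }
                return id;
            }
            return -1;
        }
    }

    public static void main(String[] args) {
        Server server = new Server();
        server.receive(new Message(Type.CALL, 1));
        server.receive(new Message(Type.CALL, 2));
        server.receive(new Message(Type.CANCEL, 1)); // client timed out on call 1
        System.out.println(server.serviceNext());    // call 2 is serviced, call 1 is skipped
        System.out.println(server.serviceNext());    // nothing left
    }
}
```

This only helps for calls that are still queued when the cancel arrives. A 
call that is already being serviced, or whose response is already on the wire, 
would need something more, and a cancel that arrives after its call completed 
would have to be expired from the cancelled set somehow.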

> Client Calls are not cancelled after a call timeout
> ---------------------------------------------------
>
>          Key: HADOOP-255
>          URL: http://issues.apache.org/jira/browse/HADOOP-255
>      Project: Hadoop
>         Type: Bug

>   Components: ipc
>     Versions: 0.2.1
>  Environment: Tested on Linux 2.6
>     Reporter: Naveen Nalam

>
> In ipc/Client.java, if a call times out, a SocketTimeoutException is thrown 
> but the Call object still exists on the queue.
> What I found was that when transferring very large amounts of data, it's 
> common for queued up calls to time out. Yet even though the caller is no 
> longer waiting, the request is still serviced on the server and the data is 
> sent to the client. The client, after receiving the full response, calls 
> callComplete(), which is a no-op since nobody is waiting.
> The problem is that the calls that time out will retry, and the system gets 
> into a situation where data is being transferred around, but it's all data 
> for timed out requests and no progress is ever made.
> My quick solution to this was to add a "boolean timedout" to the Call object 
> which I set to true whenever the queued caller times out. Then, when the 
> client starts to pull over the response data (in Connection::run), it first 
> checks whether the Call is timedout and, if so, immediately closes the connection.
> I think a good fix for this is to queue requests on the client, and do a 
> single sendParam only when there is no outstanding request. This will allow 
> closing the connection when receiving a response for a request we no longer 
> have pending, reopen the connection, and resend the next queued request. I 
> can provide a patch for this, but I've seen a lot of recent activity in this 
> area so I'd like to get some feedback first.
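
The quick fix described in the issue above (a timed-out flag that is checked 
before the response body is read) might look roughly like the sketch below. 
The class and method names are invented for illustration and are not the 
actual ipc/Client.java code; the "closed" field just stands in for closing the 
real socket.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: not the real org.apache.hadoop.ipc.Client classes.
class TimedOutCallSketch {
    static final class Call {
        final int id;
        volatile boolean timedOut; // set by the caller thread when it gives up

        Call(int id) {
            this.id = id;
        }
    }

    static final class Connection {
        private final Map<Integer, Call> calls = new ConcurrentHashMap<>();
        volatile boolean closed; // stand-in for closing the underlying socket

        Call register(int id) {
            Call call = new Call(id);
            calls.put(id, call);
            return call;
        }

        // Called by the waiting thread when its wait on the call times out.
        void onCallerTimeout(Call call) {
            call.timedOut = true;
        }

        // Called when a response header for the given id arrives, before the body is read.
        // Returns true if the body should be read, false if the connection was closed instead.
        boolean shouldReadResponse(int id) {
            Call call = calls.remove(id);
            if (call == null || call.timedOut) {
                closed = true; // nobody is waiting, so don't pull the data over
                return false;
            }
            return true;
        }
    }

    public static void main(String[] args) {
        Connection conn = new Connection();
        Call live = conn.register(1);
        Call stale = conn.register(2);
        conn.onCallerTimeout(stale);
        System.out.println(conn.shouldReadResponse(live.id));  // caller still waiting
        System.out.println(conn.shouldReadResponse(stale.id)); // caller gave up
        System.out.println(conn.closed);
    }
}
```

Closing the connection also drops any other responses in flight on it, which 
is why the issue goes on to suggest having only one outstanding request per 
connection.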

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
