[ http://issues.apache.org/jira/browse/HADOOP-312?page=comments#action_12426312 ]

Devaraj Das commented on HADOOP-312:
------------------------------------

I agree with this. In the current code there is a timeout of 10 minutes, and 
the JobTracker assumes a TaskTracker is dead only after it has been out of 
contact for that long. Unfortunately, even with this large timeout, an unlucky 
TaskTracker sometimes fails to get through. Yes, the accept queue can be made 
longer, but we will hit the same problem later when we have more clients. So, 
in addition to increasing the accept queue size, do you think it makes sense 
to have a two-way heartbeat here? That is, if a server hasn't received a 
heartbeat from a client and the expiry timeout is about to elapse, the server 
contacts the client itself, probably by invoking a GETSTATUS or similar method 
on the client. If that method returns a valid response, the server keeps the 
client alive for another expiry-timeout interval, and so on. We can also look 
at other approaches; some of them are outlined in HADOOP-362.
By the way, the patch for HADOOP-181 should handle the lost tracker problem, 
but this kind of problem might turn up in any client-server interaction.
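
To make the two-way heartbeat idea concrete, here is a minimal sketch in Java. 
It is not Hadoop code: the class name, the probe margin, and the statusProbe 
callback (standing in for a GETSTATUS-style call from server to client) are 
all assumptions made for illustration. Timestamps are passed in explicitly so 
the logic is easy to follow.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical sketch of a server-side two-way heartbeat; not Hadoop code. */
final class TwoWayHeartbeat {
    private final long expiryMillis;             // expiry timeout, e.g. 10 minutes
    private final long probeMarginMillis;        // how early before expiry the server probes (assumed)
    private final Predicate<String> statusProbe; // stands in for a GETSTATUS-style call (assumed)
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    TwoWayHeartbeat(long expiryMillis, long probeMarginMillis,
                    Predicate<String> statusProbe) {
        this.expiryMillis = expiryMillis;
        this.probeMarginMillis = probeMarginMillis;
        this.statusProbe = statusProbe;
    }

    /** Normal path: a heartbeat from the client reached the server. */
    void heartbeat(String client, long now) {
        lastSeen.put(client, now);
    }

    /** Periodic server-side check; returns true while the client counts as alive. */
    boolean check(String client, long now) {
        Long seen = lastSeen.get(client);
        if (seen == null) {
            return false; // unknown or already expired client
        }
        long silent = now - seen;
        if (silent < expiryMillis - probeMarginMillis) {
            return true; // heard from the client recently, nothing to do
        }
        // The timeout is about to elapse: the server contacts the client itself.
        if (statusProbe.test(client)) {
            lastSeen.put(client, now); // valid response renews the client for another interval
            return true;
        }
        if (silent >= expiryMillis) {
            lastSeen.remove(client); // no heartbeat and no probe response: declare it dead
            return false;
        }
        return true; // probe failed, but the timeout has not fully elapsed yet
    }
}
```

With this shape, a client whose heartbeats are dropped at the accept queue is 
still kept alive as long as the server can reach it in the other direction.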

> Connections should not be cached
> --------------------------------
>
>                 Key: HADOOP-312
>                 URL: http://issues.apache.org/jira/browse/HADOOP-312
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: ipc
>            Reporter: Devaraj Das
>         Assigned To: Devaraj Das
>         Attachments: no_connection_caching.patch, no_connection_caching.patch
>
>
> Servers and clients (clients include datanodes, tasktrackers, DFSClients & 
> tasks) should not cache connections, or should cache them only for very short 
> periods of time. Clients should set up & tear down connections to the servers 
> every time they need to contact the servers (including for heartbeats). If a 
> connection is cached, then reuse the existing connection for a few subsequent 
> transactions until the connection expires. The heartbeat interval should be 
> longer so that many more clients (on the order of tens of thousands) can be 
> accommodated within 1 heartbeat interval.
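
The "cache only for very short periods" variant described above could look 
roughly like the following Java sketch. It is not the attached patch: the 
class name, the maxAgeMillis parameter, and the connector function are 
hypothetical, and the real ipc client may differ considerably. A connection 
is reused only while it is younger than maxAgeMillis; after that it is closed 
and a fresh one is opened on the next request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of a short-lived connection cache; not the attached patch. */
final class ShortLivedConnectionCache<C extends AutoCloseable> {
    /** A cached connection together with the time it was opened. */
    private static final class Entry<C> {
        final C conn;
        final long openedAt;

        Entry(C conn, long openedAt) {
            this.conn = conn;
            this.openedAt = openedAt;
        }
    }

    private final long maxAgeMillis;              // how long a connection may be reused (assumed)
    private final Function<String, C> connector;  // opens a connection to a server (assumed)
    private final Map<String, Entry<C>> cache = new HashMap<>();

    ShortLivedConnectionCache(long maxAgeMillis, Function<String, C> connector) {
        this.maxAgeMillis = maxAgeMillis;
        this.connector = connector;
    }

    /** Returns a connection to the server, reusing a cached one only if it has not expired. */
    synchronized C get(String server, long now) {
        Entry<C> entry = cache.get(server);
        if (entry != null && now - entry.openedAt < maxAgeMillis) {
            return entry.conn; // still fresh: reuse for this transaction
        }
        if (entry != null) {
            cache.remove(server);
            try {
                entry.conn.close(); // tear down the expired connection
            } catch (Exception ignored) {
                // best-effort close of a stale connection
            }
        }
        C fresh = connector.apply(server);
        cache.put(server, new Entry<>(fresh, now));
        return fresh;
    }
}
```

Setting maxAgeMillis to 0 gives the strict "set up and tear down every time" 
behaviour, since no cached connection is ever young enough to be reused.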

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
