[ https://issues.apache.org/jira/browse/HDFS-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13069574#comment-13069574 ]

John George commented on HDFS-1880:
-----------------------------------

{quote}
> Uma Maheswara Rao G commented on HADOOP-6889:
> ---------------------------------------------
>
> Hi John,
>
> I have seen waitForProxy is passing 0 as rpcTimeOut. It is hardcoded value.
>
> {code}
> return waitForProtocolProxy(protocol, clientVersion, addr, conf, 0,
> connTimeout);
> {code}
{quote}

If you want to control this value, you could use the waitForProtocolProxy() 
overload that accepts "rpcTimeout" as an argument. You could pass in any value 
(e.g. the "DFS_CLIENT_SOCKET_TIMEOUT_KEY" value) as rpcTimeout (though that 
means the call will time out within that time instead of retrying).
{code}
  public static <T> ProtocolProxy<T> waitForProtocolProxy(Class<T> protocol,
                               long clientVersion,
                               InetSocketAddress addr, Configuration conf,
                               int rpcTimeout,
                               long timeout) throws IOException {
{code}
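
For example, something like the sketch below (just a sketch: nnAddr and connTimeout are placeholders standing in for the address and connection timeout in the snippet above, and the key name/default are what I believe DFS_CLIENT_SOCKET_TIMEOUT_KEY resolves to):
{code}
// sketch: pass an explicit rpcTimeout instead of the hard-coded 0
int rpcTimeout = conf.getInt("dfs.client.socket-timeout", 60 * 1000); // ms
ProtocolProxy<ClientProtocol> proxy =
    RPC.waitForProtocolProxy(ClientProtocol.class, ClientProtocol.versionID,
                             nnAddr, conf, rpcTimeout, connTimeout);
{code}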

{quote}
> If the user wants to control this value, how can he configure it?
{quote}

HADOOP-6889 ensures that any communication to/from a DN (DFSClient->DN and 
DN->DN) times out within rpcTimeout. If a user wants to control this value 
through configuration, it can be done the same way it is done today: for 
example, both of these paths use the "DFS_CLIENT_SOCKET_TIMEOUT_KEY" 
configuration value as the timeout, as shown below. Like you said, this change 
does not change any timeout mechanism for NN communication.
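
For instance, the usual pattern is just (key name written out here; in the code it is DFSConfigKeys.DFS_CLIENT_SOCKET_TIMEOUT_KEY with a 60 second default, as far as I remember):
{code}
conf.getInt("dfs.client.socket-timeout", 60 * 1000)
{code}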

{quote}
>
> Here we have a situation where clients are waiting for a long time. HDFS-1880.
{quote}

Based on the attached trace, I can see that the DN is trying to reconnect to 
the NN because it wants to send heartbeats to the NN. When you say client, do 
you mean the DFSClient is also waiting, doing the same thing and trying to 
communicate with the NN? For connection timeouts, the maximum time a client 
should wait during each retry cycle is close to 15 minutes (45 retries, with 
each connect() taking 20 seconds). For IOExceptions, it should not keep trying 
for more than 4 minutes or so.
In the trace attached here, you can see that it is an "IOException" and not a 
"SocketTimeoutException". Whenever an IOException is encountered, the client 
retries "ipc.client.connect.max.retries" times before it gives up, which can 
be controlled through conf. As you can see, it does give up after 10 retries, 
but since the DN keeps trying to send heartbeats, it keeps reconnecting even 
after it fails.

{code}
conf.getInt("ipc.client.connect.max.retries", 10)
{code}
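
Just to put numbers on the two paths above (back-of-the-envelope, using the defaults mentioned in this comment, not values read out of the code):
{code}
int connectTimeoutMs = 20 * 1000;  // per-attempt connect() timeout
int timeoutRetries = 45;           // retries on SocketTimeoutException
int ioRetries = 10;                // "ipc.client.connect.max.retries" default
System.out.println("timeout path: " + (connectTimeoutMs * timeoutRetries) / 60000
    + " minutes, IOException path: " + ioRetries + " retries"); // 15 minutes, 10 retries
{code}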


{quote}
>
> I thought, HADOOP-6889 can solve that problem. But how this can be controlled
> by the user in Hadoop (looks no configuration parameters available).
>

> I plan to add a new configuration ipc.client.max.pings that specifies the max
> number of pings that a client could try. If a response can not be received
> after the specified max number of pings, a SocketTimeoutException is thrown.
> If this configuration property is not set, a client maintains the current
> semantics, waiting forever.

>
> We have chosen this implementation for our cluster.
>
> I am just checking , whether i can use rpcTimeOut itself to control. ( since
> this change already committed).
>
> Can you please clarify more?
{quote}

If you just want to fail the call after a certain number of pings, introducing 
this new "max.pings" value might be a good idea. With rpcTimeout, all that 
happens is that the socket timeout is set to "rpcTimeout"; no pings are sent 
at all.
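
Roughly, the effect is just the following (a paraphrase of the behaviour, not the exact Client.java code):
{code}
// with a positive rpcTimeout the read simply times out and the call fails;
// with rpcTimeout == 0 the client pings every pingInterval and waits forever
if (rpcTimeout > 0) {
  socket.setSoTimeout(rpcTimeout);   // SocketTimeoutException once rpcTimeout elapses
} else {
  socket.setSoTimeout(pingInterval); // read wakes up only to send a ping, then waits again
}
{code}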

{quote}
>
> Can you just check HDFS-1880.
>
>
> @Hairong
> I thought about introducing a configuration parameter. But clients or
> DataNodes want to have timeout for RPCs to DataNodes but no timeout for RPCs
> to NameNodes. Adding a rpcTimeout parameter makes this easy.
>  I think considering HA, clients and NameNode also requires some timeout.
>  If Active goes down, then clients should not wait in timeouts right?
{quote}
I do not know enough about HA to comment on this.


> When Namenode network is unplugged, DFSClient operations waits for ever
> -----------------------------------------------------------------------
>
>                 Key: HDFS-1880
>                 URL: https://issues.apache.org/jira/browse/HDFS-1880
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>            Reporter: Uma Maheswara Rao G
>
> When the NN/DN is shut down gracefully, DFSClient operations that are waiting 
> for a response from the NN/DN throw an exception and return quickly.
> But when the NN/DN network is unplugged, DFSClient operations that are 
> waiting for a response from the NN/DN wait forever.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
