Please refer to line 307 of src/core/org/apache/hadoop/ipc/Client.java:
          } catch (SocketTimeoutException toe) {
            /* The max number of retries is 45,
             * which amounts to 20s*45 = 15 minutes retries.
             */
            handleConnectionFailure(timeoutFailures++, 45, toe);
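
To make the arithmetic in that comment concrete, here is a minimal, self-contained sketch of the retry pattern. It is not the Hadoop source: the class name ConnectRetrySketch and the methods worstCaseSeconds, handleFailure and connectWithRetries are hypothetical names made up for illustration, and I am assuming the 20s in the comment is the per-attempt connect timeout.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical sketch, NOT the actual org.apache.hadoop.ipc.Client code.
public class ConnectRetrySketch {

    // Time spent waiting on connect timeouts if every counted retry times out.
    // With the values from the comment above: 20 * 45 = 900 seconds = 15 minutes.
    static long worstCaseSeconds(int connectTimeoutSeconds, int maxRetries) {
        return (long) connectTimeoutSeconds * maxRetries;
    }

    // Rethrows the failure once curRetries has reached maxRetries;
    // otherwise logs and returns so that the caller can try again.
    static void handleFailure(int curRetries, int maxRetries, IOException cause)
            throws IOException {
        if (curRetries >= maxRetries) {
            throw cause;
        }
        System.err.println("Retrying connect. Already tried " + curRetries + " time(s).");
    }

    // Keeps trying to connect until it succeeds or handleFailure gives up.
    static Socket connectWithRetries(String host, int port, int timeoutMs, int maxRetries)
            throws IOException {
        int timeoutFailures = 0;
        while (true) {
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return socket;
            } catch (SocketTimeoutException toe) {
                socket.close();
                // Post-increment: the first failure is reported as "tried 0 time(s)".
                handleFailure(timeoutFailures++, maxRetries, toe);
            }
        }
    }
}
```

Because the counter is passed with a post-increment, the first failure is logged with a count of 0, which is consistent with the "Already tried 0 time" line in the original report.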


On Wed, Apr 21, 2010 at 4:48 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

>
> On Apr 21, 2010, at 12:37 AM, Sridhar Chellappa wrote:
>
> > Anyone? Please ?
> > On 4/19/10 3:49 PM, Sridhar Chellappa wrote:
> >> Hi,
> >>
> >> I have an issue where my client's code (which links to libhdfs.a) is
> >> throwing out the following errors :
> >>
> >> ipc.Clnt Retrying connect to server name-node-A/98.137.17.156:4600
> >> Already tried 0 time
> >>
> >> The node in question is the name node.
> >>
> >> What I want to understand is :
> >>
> >> 1. I suspect this is coming from libhdfs.a. Is that true ?
>
> Correct
>
> >>
> >> 2. What is the retry logic, at a high level, from libhdfs.a to name
> >> node? Pointers to documentation will be really appreciated.
>
> I believe it's generic retry logic - it attempts to connect X times with Y
> seconds between attempts.
>
> I don't know of any place where this internal is documented other than the
> code.  This is a young, fast-paced project - I resort to the (quite
> readable) code a lot.
>
> >>
> >> 3. How do I control the behaviour ?
>
> I'd bet both are configurable with some obscure configuration variable.
>  I'm not the right expert to know it off the top of my head.
>
> >>
> >> 4. Most probably, the retries are happening because of an unresponsive
> >> Name Node. Are there any logs on the Name Node that I can look at to
> >> pin-point why the Name Node is unresponsive ?
>
> Over the last 1.5 years of running in production, I'm not sure I've seen an
> unresponsive NN at our site.  Unless you have a very big one, if you get
> that message, it means:
> 1) The namenode is down
> 2) The namenode is up, but you have an iptables firewall on it.
>
> Try to see if you can open the port manually ("telnet name-node-A 4600")
> from the client node and from the namenode to see if there's any difference.
> This will allow you to distinguish between the two possible error cases.
>
> Brian
>
>
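
The port check Brian suggests above can also be scripted if telnet is not installed on the client. This is only a rough sketch: the class PortProbe and its probe method are hypothetical names, and the mapping from result to cause is a rule of thumb rather than a guarantee (a firewall configured to reject instead of drop will also show up as "refused").

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper for illustration; not part of Hadoop or libhdfs.
public class PortProbe {

    // Attempts a single TCP connect and classifies the outcome.
    //   "open"    : something accepted the connection on that port
    //   "refused" : the host answered but nothing is listening (e.g. the daemon is down)
    //   "timeout" : no answer within timeoutMs (e.g. packets dropped by a firewall)
    static String probe(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return "open";
        } catch (ConnectException e) {
            return "refused";
        } catch (SocketTimeoutException e) {
            return "timeout";
        } catch (IOException e) {
            // Covers everything else, such as an unresolvable host name.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Host and port taken from the log line in the original report.
        System.out.println(probe("name-node-A", 4600, 5000));
    }
}
```

Running it once from the client node and once on the name node itself gives the same comparison as the two telnet attempts.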
