On Apr 21, 2010, at 12:37 AM, Sridhar Chellappa wrote:

> Anyone? Please ?
> On 4/19/10 3:49 PM, Sridhar Chellappa wrote:
>> Hi,
>> 
>> I have an issue where my client's code (which links to libhdfs.a) is
>> throwing out the following errors :
>> 
>> ipc.Clnt Retrying connect to server name-node-A/98.137.17.156:4600
>> Already tried 0 time
>> 
>> The node in question is the name node.
>> 
>> What I want to understand is :
>> 
>> 1. I suspect this is coming from libhdfs.a. Is that true ?

Correct

>> 
>> 2. What is the retry logic, at a high level, from libhdfs.a to name
>> node? Pointers to documentation will be really appreciated.

I believe it's generic retry logic - it attempts to connect X times with Y
seconds between attempts.
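
As a rough illustration of that "X times, Y seconds apart" pattern, here is a
minimal Python sketch. It is not the Hadoop code: the function name, the
defaults (10 retries, 1 second) and the injected `connect`/`sleep` callables
are all assumptions made for the example, and the real values should be
checked in the IPC client source.

```python
import time


def connect_with_retries(connect, max_retries=10, sleep_seconds=1.0,
                         sleep=time.sleep):
    """Call connect() until it succeeds or max_retries retries are used up.

    connect, max_retries and sleep_seconds are illustrative placeholders,
    not names or defaults taken from Hadoop.
    """
    tried = 0
    while True:
        try:
            return connect()
        except OSError:
            # Give up once the retry budget is spent and surface the error.
            if tried >= max_retries:
                raise
            # Mirrors the shape of the message in the original report.
            print(f"Retrying connect to server. Already tried {tried} time(s)")
            sleep(sleep_seconds)
            tried += 1
```

With this shape, "Already tried 0 time" is simply the message printed after
the first failed attempt, before the first retry.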

I don't know of any place where this internal is documented other than the 
code.  This is a young, fast-paced project - I resort to the (quite readable) 
code a lot.

>> 
>> 3. How do I control the behaviour ?

I'd bet both are configurable with some obscure configuration variable.  I'm 
not the right expert to know it off the top of my head.

>> 
>> 4. Most probably, the retries are happening because of an unresponsive
>> Name Node. Are there any logs on the Name Node that I can look at to
>> pinpoint why the Name Node is unresponsive?

Over the last 1.5 years of running in production, I'm not sure I've seen an 
unresponsive NN at our site.  Unless you have a very big one, if you get that 
message, it means:
1) The namenode is down
2) The namenode is up, but you have an iptables firewall on it.

Try to see if you can open the port manually ("telnet name-node-A 4600")
from both the client node and the namenode itself to see if there's any
difference.  This will allow you to distinguish between the two possible
error cases.
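
If telnet isn't installed, the same check can be approximated in a few lines
of Python. This is only a sketch: the function name `can_connect` and the
5-second timeout are made up for the example; the host and port are the ones
from the error message.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Run this once on the client node and once on the namenode itself.
    print(can_connect("name-node-A", 4600))
```

If it succeeds on the namenode but fails from the client, a firewall in
between is the likely culprit; if it fails in both places, the namenode is
probably not listening on that port at all.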

Brian
