On Apr 21, 2010, at 12:37 AM, Sridhar Chellappa wrote:

> Anyone? Please?
>
> On 4/19/10 3:49 PM, Sridhar Chellappa wrote:
>> Hi,
>>
>> I have an issue where my client's code (which links to libhdfs.a) is
>> throwing out the following errors:
>>
>> ipc.Clnt Retrying connect to server name-node-A/98.137.17.156:4600
>> Already tried 0 time
>>
>> The node in question is the name node.
>>
>> What I want to understand is:
>>
>> 1. I suspect this is coming from libhdfs.a. Is that true?

Correct.

>> 2. What is the retry logic, at a high level, from libhdfs.a to name
>> node? Pointers to documentation will be really appreciated.

I believe it's generic retry logic - it attempts to connect X times with Y seconds between attempts. I don't know of any place where these internals are documented other than the code. This is a young, fast-paced project - I resort to the (quite readable) code a lot.

>> 3. How do I control the behaviour?

I'd bet both are configurable with some obscure configuration variable. I'm not the right expert to know it off the top of my head.

>> 4. Most probably, the retries are happening because of an unresponsive
>> Name Node. Are there any logs that I can look at on the Name Node using
>> which I can pin-point why the Name Node is unresponsive?

Over the last 1.5 years of running in production, I'm not sure I've seen an unresponsive NN at our site. Unless you have a very big one, if you get that message, it means one of:

1) The namenode is down.
2) The namenode is up, but you have an iptables firewall on it.

Try to open the port manually ("telnet name-node-A 4600") from both the client node and the namenode itself to see if there's any difference. This will allow you to distinguish between the two possible error cases.

Brian
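
For illustration, here is a rough Python sketch of the "X attempts, Y seconds apart" pattern described above, combined with the same reachability check the telnet test performs. It is not the Hadoop or libhdfs implementation; the function names, `max_retries`, and `interval` are hypothetical stand-ins for X and Y. In the Hadoop releases I know of, the retry count appears to be governed by the `ipc.client.connect.max.retries` property, but check the core-default.xml of your version before relying on that.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def connect_with_retries(host, port, max_retries=10, interval=1.0,
                         sleep=time.sleep):
    """Try to connect up to max_retries times, pausing between attempts.

    Returns the number of the attempt that succeeded (starting at 1),
    or 0 if every attempt failed.
    """
    for attempt in range(1, max_retries + 1):
        if can_connect(host, port):
            return attempt
        print(f"Connect to {host}:{port} failed; tried {attempt} time(s)")
        if attempt < max_retries:
            # Wait only between attempts, not after the last one.
            sleep(interval)
    return 0
```

Running `can_connect("name-node-A", 4600)` once on the client node and once on the namenode host gives the same information as the telnet test: if it succeeds locally but fails from the client, a firewall is the likely cause; if it fails in both places, the namenode process is probably not listening.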