Please refer to line 307 of src/core/org/apache/hadoop/ipc/Client.java:

    } catch (SocketTimeoutException toe) {
      /* The max number of retries is 45,
       * which amounts to 20s*45 = 15 minutes retries.
       */
      handleConnectionFailure(timeoutFailures++, 45, toe);

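To make the arithmetic in that comment concrete, here is a minimal Java sketch of the same retry pattern. It is not the Hadoop code: ConnectRetrySketch, Connector, connectWithRetries and worstCaseSeconds are names made up for illustration, the 20-second per-attempt timeout is taken from the comment above, and the sketch assumes the failure handler rethrows once the failure count reaches the maximum. Any pause between attempts is left out.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

class ConnectRetrySketch {
    /** Hypothetical hook for a single connect attempt, so it can be faked. */
    interface Connector {
        void connect() throws IOException;
    }

    /**
     * Calls connector.connect() until it succeeds or has timed out more than
     * maxRetries times. Returns the number of timeouts seen before success.
     */
    static int connectWithRetries(Connector connector, int maxRetries) throws IOException {
        int timeoutFailures = 0;
        while (true) {
            try {
                connector.connect();
                return timeoutFailures;
            } catch (SocketTimeoutException toe) {
                // Assumption: give up once the retry budget is used up.
                if (timeoutFailures >= maxRetries) {
                    throw toe;
                }
                // Illustrative message only, modelled on the log line in the thread.
                System.err.println("Retrying connect to server. Already tried "
                        + timeoutFailures + " time(s).");
                timeoutFailures++;
            }
        }
    }

    /** Upper bound on time spent in retries: per-attempt timeout times retry count. */
    static long worstCaseSeconds(int timeoutSeconds, int maxRetries) {
        return (long) timeoutSeconds * maxRetries;
    }

    public static void main(String[] args) {
        long seconds = worstCaseSeconds(20, 45);
        System.out.println("Worst-case retry window: " + seconds + " s ("
                + (seconds / 60) + " min)");
    }
}
```

Under those assumptions a client whose connect attempts always time out makes 46 attempts in total (the first try plus 45 retries) before the exception reaches the caller.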
On Wed, Apr 21, 2010 at 4:48 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

>
> On Apr 21, 2010, at 12:37 AM, Sridhar Chellappa wrote:
>
> > Anyone? Please?
> > On 4/19/10 3:49 PM, Sridhar Chellappa wrote:
> >> Hi,
> >>
> >> I have an issue where my client's code (which links to libhdfs.a) is
> >> throwing the following errors:
> >>
> >> ipc.Clnt Retrying connect to server name-node-A/98.137.17.156:4600
> >> Already tried 0 time
> >>
> >> The node in question is the name node.
> >>
> >> What I want to understand is:
> >>
> >> 1. I suspect this is coming from libhdfs.a. Is that true?
>
> Correct
>
> >>
> >> 2. What is the retry logic, at a high level, from libhdfs.a to name
> >> node? Pointers to documentation will be really appreciated.
>
> I believe it's generic retry logic - it attempts to connect X times with Y
> seconds between attempts.
>
> I don't know of any place where this internal is documented other than the
> code. This is a young, fast-paced project - I resort to the (quite
> readable) code a lot.
>
> >>
> >> 3. How do I control the behaviour?
>
> I'd bet both are configurable with some obscure configuration variable.
> I'm not the right expert to know it off the top of my head.
>
> >>
> >> 4. Most probably, the retries are happening because of an unresponsive
> >> Name Node. Are there any logs on the Name Node that I can look at to
> >> pinpoint why the Name Node is unresponsive?
>
> Over the last 1.5 years of running in production, I'm not sure I've seen an
> unresponsive NN at our site. Unless you have a very big one, if you get
> that message, it means:
> 1) The namenode is down
> 2) The namenode is up, but you have an iptables firewall on it.
>
> Try to see if you can open the port manually ("telnet name-node-A
> 4600") from the client node and namenode to see if there's any difference.
> This will allow you to distinguish between two possible error cases.
>
> Brian
>
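
If telnet is not available on the client node, the manual port check Brian suggests in the quoted text can be approximated from Java. This is only a sketch: PortProbe and probe are hypothetical names, the 5-second timeout is an arbitrary choice, and the reading of the results is a rule of thumb (a refused connection usually means nothing is listening on the port, while a timeout often points at a firewall silently dropping packets), not a guarantee.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

class PortProbe {
    /** Tries one TCP connect and reports how it ended. */
    static String probe(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return "open";
        } catch (SocketTimeoutException e) {
            // No answer at all within the timeout.
            return "timed out";
        } catch (ConnectException e) {
            // The remote side actively rejected the connection.
            return "refused";
        } catch (IOException e) {
            // Anything else, e.g. the host name did not resolve.
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Host and port default to the values from the log line in the thread.
        String host = args.length > 0 ? args[0] : "name-node-A";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 4600;
        System.out.println(host + ":" + port + " -> " + probe(host, port, 5000));
    }
}
```

Running it once from the client node and once on the namenode itself gives the same comparison as the two telnet runs.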