Hello,

This looks like a bug...
I noticed that HTTPClient hangs in some cases.  I turned on HTTPClient logging and it 
looks like it likes to hang when the server does not close the connection.
Check these logs, they show things pretty clearly.

First, a healthy interaction with a web server.  httpclient.wire logging turns this up 
when a normal response is received:

INFO  [Crawler-5] httpclient.wire - >> "GET  HTTP/1.1
": 06 Jan 2002 00:01:42,505
INFO  [Crawler-5] httpclient.wire - >> "Host: www.faz.de
": 06 Jan 2002 00:01:42,507
INFO  [Crawler-5] httpclient.wire - >> "Content-Length: 0
": 06 Jan 2002 00:01:42,507
INFO  [Crawler-5] httpclient.wire - >> "user-agent: None
": 06 Jan 2002 00:01:42,507
INFO  [Crawler-5] httpclient.wire - >> \r\n: 06 Jan 2002 00:01:42,508
INFO  [Crawler-5] httpclient.wire - << "HTTP/1.1 400 Bad Request" [\r\n]: 06 Jan 2002 
00:01:42,702
INFO  [Crawler-5] httpclient.wire - << "Server: Microsoft-IIS/5.0" [\r\n]: 06 Jan 2002 
00:01:42,703
INFO  [Crawler-5] httpclient.wire - << "Date: Sun, 06 Jan 2002 05:01:33 GMT" [\r\n]: 
06 Jan 2002 00:01:42,703
INFO  [Crawler-5] httpclient.wire - << "Set-Cookie: 
BIGipServerfazde=906039724.20480.0000; path=/" [\r\n]: 06 Jan 2002 00:01:42,704
INFO  [Crawler-5] httpclient.wire - << "Content-Type: text/html" [\r\n]: 06 Jan 2002 
00:01:42,704
INFO  [Crawler-5] httpclient.wire - << "Content-Length: 87" [\r\n]: 06 Jan 2002 
00:01:42,705
INFO  [Crawler-5] httpclient.wire - << 
"<html><head><title>Error</title></head><body>The parameter is incorrect. 
</body></html>": 06 Jan 2002 00:01:42,713


The above includes a nice GET request, response headers, and the body.
It happens to be a 400 Bad Request, but that doesn't matter here.


On the other hand, when HTTPClient hung its interaction with the web server looked 
something like this:

INFO  [Crawler-5] httpclient.wire - >> "GET  HTTP/1.1
": 06 Jan 2002 00:01:42,950
INFO  [Crawler-5] httpclient.wire - >> "Host: www.focus.de
": 06 Jan 2002 00:01:42,951
INFO  [Crawler-5] httpclient.wire - >> "Content-Length: 0
": 06 Jan 2002 00:01:42,951
INFO  [Crawler-5] httpclient.wire - >> "user-agent: Librisia {OG}/0.1
": 06 Jan 2002 00:01:42,951
INFO  [Crawler-5] httpclient.wire - >> \r\n: 06 Jan 2002 00:01:42,952
INFO  [Crawler-5] httpclient.wire - << "<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 
3.0//EN"> <HTML>" [\r\n]: 06 Jan 2002 00:01:43,477
INFO  [Crawler-5] httpclient.wire - << "<HEAD> <TITLE>FOCUS Online | Rubriken | Inhalt 
| Fehler</TITLE>" [\r\n]: 06 Jan 2002 00:01:43,478
...
more HTML here
...
INFO  [Crawler-5] httpclient.wire - << "Da wir unsere Seiten mehrmals t&auml;glich" 
[\r\n]: 06 Jan 2002 00:01:43,881


Notice a few things:
1. There is no response header, the first thing that HTTPClient gets back is the HTML 
body.

2. This server encountered some error (see 'Fehler' in the TITLE - Fehler means error 
auf Deutsch), maybe that's why it didn't respond properly.

3. The very last data sent from the server is some text, but note that it looks like 
it was cut off (there are no </BODY> nor </HTML> tags, even though the response 
contained a nice, correct HTML up to that point).
So it looks like the server choked, started hanging or something, and HTTPClient 
stopped there as well and remained listening for the rest of the server's response, 
which never came.

Finally:

netstat -a|grep focus
tcp        0      0 77-108-148-98.nyc:34555 focus.de:http           ESTABLISHED


Obviously, HTTPClient is still holding onto this connection to www.focus.de server!


I noticed this behaviour with other servers, too.  I don't have their httpclient.wire 
logs, but see this from netstat -a:

tcp        0      0 77-108-148-98.nyc:34819 www.ananzi.co.za:http   CLOSE_WAIT


This looks like a HTTPClient bug to me.
I don't know enough about the HTTPClient internals yet to fix it and send a diff.
Notice that this also implies the that connection/socket timeout fix from a few weeks 
ago doesn't really work, which is what I observed, too.  No matter what timeout I use, 
if connections do time out they invariably time out after a bit more than 3 minutes 
which, I presume, is some Java's internal, default socket timeout value.

Does anyone know how to fix this 'hang bug'?

Thanks,
Otis
P.S.
If it helps, this is the last log line from HttpConnection, which seems to show that 
readLine() method is the method that got stuck.

DEBUG [Crawler-5] org.apache.commons.httpclient.HttpConnection - 
HttpConnection.readLine(): 06 Jan 2002 00:01:43,881

_________________________________________________________________
iVillage.com: Solutions for Your Life 
Check out the most exciting women's community on the Web   
http://www.ivillage.com

--
To unsubscribe, e-mail:   <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>

Reply via email to