Re: keep alive connections
Alain Bench <[EMAIL PROTECTED]> writes:

> | /* Return if we have no intention of further downloading.  */
> | if (!(*dt & RETROKF) || (*dt & HEAD_ONLY))
> |   {
> |     /* In case the caller cares to look...  */
> |     hs->len = 0L;
> |     hs->res = 0;
> |     FREE_MAYBE (type);
> |     FREE_MAYBE (all_headers);
> |     CLOSE_INVALIDATE (sock);  /* would be CLOSE_FINISH, but there
> |                                  might be more bytes in the body. */
> |     return RETRFINISHED;
> |   }
>
> ...changing CLOSE_INVALIDATE to CLOSE_FINISH.

That's exactly the right change.  As the comment implies, the only
reason for using CLOSE_INVALIDATE is fear that a misbehaving CGI might
send more data, thus confusing the next request or even causing
deadlock while writing the request to the server.  When keep-alive
connections are not in use (which can be forced with
--no-http-keep-alive), CLOSE_INVALIDATE and CLOSE_FINISH are pretty
much identical.
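[Editor's note: the difference between the two close styles can be sketched as
below. The macro names come from wget's http.c, but the one-slot registry here
is a simplified stand-in for illustration, not wget's actual implementation.]

```c
#include <unistd.h>

static int persistent_fd = -1;   /* fd registered for keep-alive reuse */

/* Remember a socket so the next request to the same host can reuse it. */
static void register_persistent (int fd) { persistent_fd = fd; }

/* CLOSE_FINISH: the response was consumed completely, so the connection
   is in a clean state; keep it registered for the next request. */
static void close_finish (int fd)
{
  if (fd != persistent_fd)
    close (fd);                  /* not registered for reuse: really close */
  /* else: leave it open so the next request reuses it */
}

/* CLOSE_INVALIDATE: unknown bytes may still be in flight (e.g. a CGI
   that sent a body in reply to HEAD); drop the socket entirely so a
   stale body can never be mistaken for the next response. */
static void close_invalidate (int fd)
{
  if (fd == persistent_fd)
    persistent_fd = -1;          /* forget it for reuse */
  close (fd);
}
```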
Re: keep alive connections
On Thursday, November 13, 2003 at 2:49:41 PM +0100, Hrvoje Niksic wrote:

> Maybe it's time to stop erring on the side of caution, and expect HEAD
> to work.

I experimented in gethttp(), line 1474 of wget-1.9.1/src/http.c:

| /* Return if we have no intention of further downloading.  */
| if (!(*dt & RETROKF) || (*dt & HEAD_ONLY))
|   {
|     /* In case the caller cares to look...  */
|     hs->len = 0L;
|     hs->res = 0;
|     FREE_MAYBE (type);
|     FREE_MAYBE (all_headers);
|     CLOSE_INVALIDATE (sock);  /* would be CLOSE_FINISH, but there
|                                  might be more bytes in the body. */
|     return RETRFINISHED;
|   }

...changing CLOSE_INVALIDATE to CLOSE_FINISH.  It seems to work, but
I'm not familiar with the source and don't know what the RETROKF case
is.

Anyway, since then, checking whether a site is up to date, as in my
previous example (a long series of HEADs only), is much faster with
timestamping and persistent connections (more than twice as fast),
especially over an already overloaded network.

>>> it would actually be a better idea to read (and discard) the body
>> But when do you abort reading if no body comes in clean cases?
> I'm not sure I understand the question.

Forget it.  The question was valid only inside the read-and-discard
hypothesis you expressed 5 mails up in the thread.  Hypothesis
abandoned, question also :-)

Bye! Alain.
-- 
Microsoft Outlook Express users concerned about readability: for much
better viewing of quotes in your messages, check the little freeware
program OE-QuoteFix by Dominik Jain at http://flash.to/oblivion/.
It'll change your life. :-)  It now exists for Outlook, too.
Re: keep alive connections
Alain Bench <[EMAIL PROTECTED]> writes:

> There is no such problem.  Sorry: I made a stupid log interpretation
> mistake yesterday.  The connection is always closed after a HEAD, OK,
> but is reused after a GET, whatever comes after.
>
> This makes a typical --timestamping session use connections for only
> one request (file not updated) or two requests (updated file and
> next file).  Only when there is a series of new files are they
> fetched on a longer connection.
>
> So for example I mirror every day a site with 300 files, typically 5
> updated and 1 new per day.  There is no CGI or redirection involved,
> all plain static files, and the server replies cleanly to HEADs,
> without spurious bodies.  This will use nearly 300 connections, for
> mostly series of short HEADs, in a situation where the relative
> benefit of persistent connections would be the greatest.

Alain, I find your arguments quite persuasive.  Maybe it's time to
stop erring on the side of caution, and expect HEAD to work.  After
all, it works by default in all "framework"-type environments, such as
Apache, PHP, Servlet/JSP, AOLserver, CGI.pm, etc.  The only problem is
the home-grown, garden-variety CGIs, and those can be worked around
simply by using `--no-http-keep-alive'.

>> it would actually be a better idea to read (and discard) the body
>
> But when do you abort reading if no body comes in clean cases?

I'm not sure I understand the question.  Wget doesn't wait for a body
to arrive in response to HEAD at all; it simply closes the connection
and goes on doing its business.  And that's why persistent connections
are not used in that case.

>> What Wget does only sacrifices persistent connections at times, but
>> does the right thing with all kinds of responses and doesn't
>> introduce artificial delays.
>
> And it's the safe side: suboptimal, but it never fails.  I begin to
> see there is no way to deal automatically better with possible
> spurious HEAD bodies.  What about yet another obscure option for
> cases like my example?  --naively-believe-head-gives-header-only?

See above; I think these days it makes sense to expect HEAD to be
respected by default, and to reserve `--no-http-keep-alive' for when
it isn't.

> Reading TODO: Please don't tell me some bad servers send a body to
> GET If-Modified-Since 304 unmodified requests... Oh no!

Umm, no, it was mere laziness on my part that I didn't use
If-Modified-Since.
Re: keep alive connections
On Wednesday, November 12, 2003 at 2:28:04 PM +0100, Hrvoje Niksic wrote:

> Alain Bench <[EMAIL PROTECTED]> writes:
>> Wget also closes the connection between a GET (with body) and the
>> HEAD for the next file.
> I wasn't aware of this problem

There is no such problem.  Sorry: I made a stupid log interpretation
mistake yesterday.  The connection is always closed after a HEAD, OK,
but is reused after a GET, whatever comes after.

This makes a typical --timestamping session use connections for only
one request (file not updated) or two requests (updated file and next
file).  Only when there is a series of new files are they fetched on a
longer connection.

So for example I mirror every day a site with 300 files, typically 5
updated and 1 new per day.  There is no CGI or redirection involved,
all plain static files, and the server replies cleanly to HEADs,
without spurious bodies.  This will use nearly 300 connections, for
mostly series of short HEADs, in a situation where the relative
benefit of persistent connections would be the greatest.

>>> it would actually be a better idea to read (and discard) the body

But when do you abort reading if no body comes in clean cases?

>> Would it be possible to close/reopen only if, and as soon as, first
>> byte of spurious body comes?
> How exactly do you propose to detect the unwanted body?

Humm... Is Tarot efficient? ;-)

> the purpose of the persistent connections (speed).

And load on the network and server.

> What Wget does only sacrifices persistent connections at times, but
> does the right thing with all kinds of responses and doesn't
> introduce artificial delays.

And it's the safe side: suboptimal, but it never fails.  I begin to
see there is no way to deal automatically better with possible
spurious HEAD bodies.  What about yet another obscure option for cases
like my example?  --naively-believe-head-gives-header-only?

Reading TODO: Please don't tell me some bad servers send a body to GET
If-Modified-Since 304 unmodified requests... Oh no!

Bye! Alain.
-- 
« If you believe the Content-Length header, I've got a bridge to sell you. »
Re: keep alive connections
Alain Bench <[EMAIL PROTECTED]> writes:

> OK, wasn't aware of the spurious HEAD bodies problem.  But Wget also
> closes the connection between a GET (with body) and the HEAD for the
> next file.

Could you post a URL for which this happens?  I wasn't aware of this
problem and would like to fix it.

>> But maybe it would actually be a better idea to read (and discard)
>> the body than to close the connection and reopen it.
>
> Hum... Would it be possible to close/reopen only if, and as soon as,
> the first byte of a spurious body comes?

This is harder than it seems.  How exactly do you propose to detect
the unwanted body?  If you wait for an arbitrary time for the body
data to start arriving, you slow down all downloads and defeat the
purpose of the persistent connections (speed).  If you don't wait, the
detection doesn't work, because the body data can start arriving a bit
later (which is frequently the case with CGIs).  In either case, you
lose.

What Wget does only sacrifices persistent connections at times, but
does the right thing with all kinds of responses and doesn't introduce
artificial delays.

>>> | Keep-Alive: timeout=15, max=5
>>> Without --timestamping Wget keeps "Reusing fd 3." and closing it
>>> only once every 6 files (first + 5 more).
>> This might be due to redirections.
>
> No redirections involved: that closure is normal, due to the "max=5"
> the server sends in response to the first request.  At the second GET
> it's "max=4", and it gets decremented each time.  Finally, at the 6th
> request there are no more "Connection:" nor "Keep-Alive:" fields.

Oh, I see, it's a server setting.  Why do they use such a limit?
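[Editor's note: the "if you don't wait, the detection doesn't work" argument
can be illustrated with a zero-timeout poll().  This POSIX-only sketch is an
illustration of the race, not wget code: a clean-looking socket proves nothing,
because a slow CGI may write its spurious body a moment later.]

```c
#include <poll.h>
#include <unistd.h>

/* Returns 1 if at least one byte is readable on fd right now, else 0.
   A 0 result is not proof of a clean HEAD response: the spurious body
   may simply not have arrived yet, which is why this kind of check
   cannot reliably detect a misbehaving server. */
static int body_bytes_pending (int fd)
{
  struct pollfd pfd = { .fd = fd, .events = POLLIN };
  return poll (&pfd, 1, 0) > 0 && (pfd.revents & POLLIN) != 0;
}
```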
Re: keep alive connections
On Tuesday, November 11, 2003 at 2:41:31 PM +0100, Hrvoje Niksic wrote:

> Alain Bench <[EMAIL PROTECTED]> writes:
>> with --timestamping: Each HEAD and each possible GET uses a new
>> connection.
> I think the difference is that Wget closes the connection when it
> decides not to read the request body.

OK, I wasn't aware of the spurious HEAD bodies problem.  But Wget also
closes the connection between a GET (with body) and the HEAD for the
next file.

> But maybe it would actually be a better idea to read (and discard)
> the body than to close the connection and reopen it.

Hum... Would it be possible to close/reopen only if, and as soon as,
the first byte of a spurious body comes?  I guess it could be
difficult to deal cleanly with the next file in limit cases...

>> | Keep-Alive: timeout=15, max=5
>> Without --timestamping Wget keeps "Reusing fd 3." and closing it
>> only once every 6 files (first + 5 more).
> This might be due to redirections.

No redirections involved: that closure is normal, due to the "max=5"
the server sends in response to the first request.  At the second GET
it's "max=4", and it gets decremented each time.  Finally, at the 6th
request there are no more "Connection:" nor "Keep-Alive:" fields.  The
/etc/apache/httpd.conf says:

| # KeepAlive: The number of Keep-Alive persistent requests to accept
| # per connection. Set to 0 to deactivate Keep-Alive support
| KeepAlive 5
|
| # KeepAliveTimeout: Number of seconds to wait for the next request
| KeepAliveTimeout 15

Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a
digest.  This often builds incorrect references and breaks threads.
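[Editor's note: a client that honors the "max=" countdown above has to pull
the two fields out of the Keep-Alive header value.  A minimal sketch, assuming
only the "timeout=15, max=5" shape quoted in this thread; real servers vary
the order and whitespace, and this is not wget's actual parser.]

```c
#include <stdio.h>
#include <string.h>

/* Extract the timeout and max fields from a Keep-Alive header value
   such as "timeout=15, max=5".  Each output is written only if its
   field is present.  Returns the number of fields found (0..2). */
static int parse_keep_alive (const char *value, int *timeout, int *max)
{
  int found = 0;
  const char *p;
  if ((p = strstr (value, "timeout=")) != NULL
      && sscanf (p, "timeout=%d", timeout) == 1)
    found++;
  if ((p = strstr (value, "max=")) != NULL
      && sscanf (p, "max=%d", max) == 1)
    found++;
  return found;
}
```

A caller would decrement its own copy of max after each request and stop
reusing the connection when it reaches zero, matching the server-side
countdown Alain observed.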
Re: keep alive connections
Herold Heiko <[EMAIL PROTECTED]> writes:

>> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
>> With the HEAD method you never know when you'll stumble upon a CGI
>> that doesn't understand it and that will send the body anyway.  But
>> maybe it would actually be a better idea to read (and discard) the
>> body than to close the connection and reopen it.
>
> Wouldn't that be suboptimal in case that page is huge (and/or the
> connection slow)?

You are right, it would.  But it might make good sense for
redirections, which typically have very small bodies.
RE: keep alive connections
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
> With the HEAD method you never know when you'll stumble upon a CGI
> that doesn't understand it and that will send the body anyway.  But
> maybe it would actually be a better idea to read (and discard) the
> body than to close the connection and reopen it.

Wouldn't that be suboptimal in case that page is huge (and/or the
connection slow)?

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax
Re: keep alive connections
On Tue, 11 Nov 2003, Hrvoje Niksic wrote:

> I think the difference is that Wget closes the connection when it
> decides not to read the request body.  For example, it closes on
> redirections because it (intentionally) ignores the body.

Another approach could be to read and just ignore the body of redirect
pages.  You'd gain a close/connect but lose the transfer time.

> With the HEAD method you never know when you'll stumble upon a CGI
> that doesn't understand it and that will send the body anyway.  But
> maybe it would actually be a better idea to read (and discard) the
> body than to close the connection and reopen it.

That approach is just as hard, only depending on different things to
work correctly.  Since we're talking about silly servers, they could
just as well return a body to the HEAD request where the response is
said to be persistent and the Content-Length: is set.  But the
Content-Length in a HEAD response is the size of the body that would
be returned if GET were requested, so you'd have no idea how much data
to read.

Been there.  Seen it happen.

There's just no good way to deal with HEAD requests that send back a
body.  I mean, besides yelling at the author of the server side.

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol
Re: keep alive connections
Alain Bench <[EMAIL PROTECTED]> writes:

> Hello Hrvoje,
>
> On Friday, November 7, 2003 at 11:50:53 PM +0100, Hrvoje Niksic wrote:
>
>> Wget uses the `Keep-Alive' request header to request persistent
>> connections, and understands both the HTTP/1.0 `Keep-Alive' and the
>> HTTP/1.1 `Connection: keep-alive' response header.
>
> This doesn't seem to work together with --timestamping: Each HEAD
> and each possible GET uses a new connection.

I think the difference is that Wget closes the connection when it
decides not to read the request body.  For example, it closes on
redirections because it (intentionally) ignores the body.

With the HEAD method you never know when you'll stumble upon a CGI
that doesn't understand it and that will send the body anyway.  But
maybe it would actually be a better idea to read (and discard) the
body than to close the connection and reopen it.

> Without --timestamping Wget keeps "Reusing fd 3." and closing it
> only once every 6 files (first + 5 more).

This might be due to redirections.  Look out for the exact
circumstances when Wget closes (or doesn't reuse) connections and
you'll probably notice it.
Re: keep alive connections
Hello Hrvoje,

On Friday, November 7, 2003 at 11:50:53 PM +0100, Hrvoje Niksic wrote:

> Wget uses the `Keep-Alive' request header to request persistent
> connections, and understands both the HTTP/1.0 `Keep-Alive' and the
> HTTP/1.1 `Connection: keep-alive' response header.

This doesn't seem to work together with --timestamping: each HEAD and
each possible GET uses a new connection.  The server keeps responding:

| HTTP/1.0 200 OK
| [...]
| Connection: Keep-Alive
| Keep-Alive: timeout=15, max=5

But Wget 1.9 does each time:

| Created socket 3.
| [snip request/response]
| Registered fd 3 for persistent reuse.
| Closing fd 3
| Invalidating fd 3 from further reuse.
| Remote file is newer, retrieving.
| Created socket 3.
| [and so on]

Tcpdump confirms the TCP session is FIN-closed by Wget.

Without --timestamping Wget keeps "Reusing fd 3." and closes it only
once every 6 files (the first + 5 more).  At that moment the FIN would
in any case be initiated by the server if not by Wget.

The test was made on an old Apache 1.1.3, but it seems the same with
other servers.

BTW, it's nice to see you back and active, Hrvoje! :-)

Bye! Alain.
-- 
Mutt 1.5.5.1 is released.
Re: keep alive connections
Daniel Stenberg <[EMAIL PROTECTED]> writes:

> On Fri, 7 Nov 2003, Hrvoje Niksic wrote:
>
>> Persistent connections were available prior to HTTP/1.1, although
>> they were not universally implemented.  Wget uses the `Keep-Alive'
>> request header to request persistent connections, and understands
>> both the HTTP/1.0 `Keep-Alive' and the HTTP/1.1 `Connection:
>> keep-alive' response header.
>
> HTTP 1.1 servers don't (normally) use "Connection: keep-alive".

Some of them do when talking to HTTP/1.0 clients, where persistent
connections aren't the default.  For example:

$ wget http://www.apache.org
--16:01:44--  http://www.apache.org/
[...]
  8 Keep-Alive: timeout=5, max=100
  9 Connection: Keep-Alive

To be on the safe side, Wget supports both response headers.
Re: keep alive connections
On Fri, 7 Nov 2003, Hrvoje Niksic wrote:

> Persistent connections were available prior to HTTP/1.1, although
> they were not universally implemented.  Wget uses the `Keep-Alive'
> request header to request persistent connections, and understands
> both the HTTP/1.0 `Keep-Alive' and the HTTP/1.1 `Connection:
> keep-alive' response header.

HTTP 1.1 servers don't (normally) use "Connection: keep-alive".  Since
1.1 assumes persistent connections by default, they only send
"Connection: close" when they aren't.

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol
Re: keep alive connections
Laura Worthington <[EMAIL PROTECTED]> writes:

> When I have been testing with wget, wget requests a keep alive
> connection, but the responses it receives say "connection close".
>
> So, it does exactly this... it creates a socket, closes it, reopens
> it with the next URL

`Connection: close' means that the server will close the connection
after the request, i.e. that it has chosen to disregard Wget's
keep-alive request.  That happens with some servers.

> Is it possible to force the situation where you open a connection,
> and it doesn't close until completion?

You cannot force the server not to close a connection, but if the
server cooperates, that's exactly what happens.  For example:

$ ./wget http://www.apache.org http://www.apache.org/foundation/
--00:53:42--  http://www.apache.org/
...
Resolving www.apache.org... 209.237.227.195
...
--00:53:44--  http://www.apache.org/foundation/
...
Reusing connection to www.apache.org:80.
...

Wget has kept the connection to www.apache.org open and is reusing it.

That can even work for multiple virtual hosts, as long as they share a
network interface.  For example:

$ wget www.apache.org httpd.apache.org
--00:50:38--  http://www.apache.org/
...
Connecting to www.apache.org|209.237.227.195|:80... connected.
...
--00:50:40--  http://httpd.apache.org/
...
Reusing connection to httpd.apache.org:80.
...

In this case, Wget has figured out that `www.apache.org' and
`httpd.apache.org' resolve to the same address, and that it therefore
doesn't need to reconnect.

> Example:
> Can I do something where
>
> wget www.yahoo.com weather.yahoo.com
>
> reuses the connection between these two URLs?

Those two hosts have different IP addresses, so they cannot share a
connection.

> Do you have an idea of when HTTP 1.1 support will be available?

It might be there for the 1.10 release, but I can't guarantee that.
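[Editor's note: the virtual-host reuse test described above boils down to
"do the two names resolve to the same address?".  A minimal sketch using
getaddrinfo(), comparing only the first IPv4 result; wget's real logic is
more involved, so treat this as an illustration of the idea only.]

```c
#include <string.h>
#include <netdb.h>
#include <netinet/in.h>

/* Return 1 if both names resolve and their first IPv4 addresses match,
   0 otherwise.  Two hosts passing this test can share one connection. */
static int same_first_ipv4 (const char *host_a, const char *host_b)
{
  struct addrinfo hints, *a = NULL, *b = NULL;
  int same = 0;
  memset (&hints, 0, sizeof hints);
  hints.ai_family = AF_INET;
  hints.ai_socktype = SOCK_STREAM;
  if (getaddrinfo (host_a, NULL, &hints, &a) == 0
      && getaddrinfo (host_b, NULL, &hints, &b) == 0)
    {
      struct in_addr ia = ((struct sockaddr_in *) a->ai_addr)->sin_addr;
      struct in_addr ib = ((struct sockaddr_in *) b->ai_addr)->sin_addr;
      same = (ia.s_addr == ib.s_addr);
    }
  if (a) freeaddrinfo (a);
  if (b) freeaddrinfo (b);
  return same;
}
```

By this test, www.apache.org and httpd.apache.org (which shared an IP at
the time) could reuse a connection, while www.yahoo.com and
weather.yahoo.com could not.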
Re: keep alive connections
Laura Worthington <[EMAIL PROTECTED]> writes:

> I am confused as to how wget supports keep alive.  I am using 1.8.2.
> Persistent connections are part of HTTP 1.1, but wget is using HTTP
> 1.0.

Persistent connections were available prior to HTTP/1.1, although they
were not universally implemented.  Wget uses the `Keep-Alive' request
header to request persistent connections, and understands both the
HTTP/1.0 `Keep-Alive' and the HTTP/1.1 `Connection: keep-alive'
response header.

Most servers support both styles of persistent connections because
HTTP/1.0 clients were in extremely wide use when those servers were
designed and implemented.  (This is the case with Apache, but probably
also with IIS and other major players in the field.)  A new generation
of servers might be expected to only support persistent connections
when talking to HTTP/1.1 clients, but then again Wget will get
HTTP/1.1 support at some point too.