Re: keep alive connections

2003-11-25 Thread Hrvoje Niksic
Alain Bench <[EMAIL PROTECTED]> writes:

> |  /* Return if we have no intention of further downloading.  */
> |  if (!(*dt & RETROKF) || (*dt & HEAD_ONLY))
> |{
> |  /* In case the caller cares to look...  */
> |  hs->len = 0L;
> |  hs->res = 0;
> |  FREE_MAYBE (type);
> |  FREE_MAYBE (all_headers);
> |  CLOSE_INVALIDATE (sock);   /* would be CLOSE_FINISH, but there
> |might be more bytes in the body. */
> |  return RETRFINISHED;
> |}
>
> ...changing CLOSE_INVALIDATE to CLOSE_FINISH.

That's exactly the right change.  As the comment implies, the only
reason for using CLOSE_INVALIDATE is fear that a misbehaving CGI might
send more data, thus confusing the next request or even causing
deadlock while writing the request to the server.

When keep-alive connections are not in use (which can be forced with
--no-http-keep-alive), CLOSE_INVALIDATE and CLOSE_FINISH are pretty
much identical.
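
In case it helps, the difference boils down to something like this
(an illustrative sketch with made-up helper names, not the actual
macros from src/http.c):

#include <unistd.h>

/* Sketch only.  CLOSE_FINISH keeps a keep-alive socket registered
   for reuse; CLOSE_INVALIDATE always tears the connection down.  */

static int persistent_fd = -1;     /* the one socket kept for reuse */

static void
close_finish (int sock, int keep_alive)
{
  if (keep_alive)
    persistent_fd = sock;          /* leave open, reuse next time */
  else
    close (sock);
}

static void
close_invalidate (int sock)
{
  persistent_fd = -1;              /* never reuse this socket */
  close (sock);
}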



Re: keep alive connections

2003-11-22 Thread Alain Bench
 On Thursday, November 13, 2003 at 2:49:41 PM +0100, Hrvoje Niksic wrote:

> Maybe it's time to stop erring on the side of caution, and expect HEAD
> to work.

I experimented in gethttp() line 1474 of wget-1.9.1/src/http.c:

|  /* Return if we have no intention of further downloading.  */
|  if (!(*dt & RETROKF) || (*dt & HEAD_ONLY))
|{
|  /* In case the caller cares to look...  */
|  hs->len = 0L;
|  hs->res = 0;
|  FREE_MAYBE (type);
|  FREE_MAYBE (all_headers);
|  CLOSE_INVALIDATE (sock); /* would be CLOSE_FINISH, but there
|  might be more bytes in the body. */
|  return RETRFINISHED;
|}

...changing CLOSE_INVALIDATE to CLOSE_FINISH. It seems to work, but I'm
not familiar with the source and don't know what the RETROKF case is.
Anyway, since then, verifying the up-to-dateness of a site as in my
previous example (a long series of HEADs only) is much faster with
timestamping and persistent connections (more than 2 times faster),
especially over an already overloaded network.


>>> it would actually be a better idea to read (and discard) the body
>> But when do you abort reading if no body comes in clean cases?
> I'm not sure I understand the question.

Forget it. The question was valid only inside the read-discard
hypothesis you expressed 5 mails up in the thread. Hypothesis abandoned,
question also :-)


Bye! Alain.
-- 
Microsoft Outlook Express users concerned about readability: For much
better viewing of quotes in your messages, check out the little freeware
program OE-QuoteFix by Dominik Jain at http://flash.to/oblivion/.
It'll change your life. :-) It now also exists for Outlook.


Re: keep alive connections

2003-11-13 Thread Hrvoje Niksic
Alain Bench <[EMAIL PROTECTED]> writes:

> There is no such problem. Sorry: I made a stupid log interpretation
> mistake yesterday. The connection is always closed after a HEAD, OK,
> but is reused after a GET, whatever comes after.
>
> This makes a typical --timestamping session use connections for only
> one request (file not updated) or two requests (updated file and
> next file). Only when there is a series of new files are they
> fetched on a longer connection.
>
> So for example, I mirror a site with 300 files every day, typically 5
> updated and 1 new per day. There is no CGI or redirection involved, all
> plain static files, and the server replies cleanly to HEADs, without
> spurious bodies. Yet this will use nearly 300 connections, mostly for
> series of short HEADs, in a situation where the relative benefit of
> persistent connections would be the greatest.

Alain, I find your arguments quite persuasive.  Maybe it's time to
stop erring on the side of caution, and expect HEAD to work.  After
all, it works by default in all "framework"-type environments, such as
Apache, PHP, Servlet/JSP, AOLserver, CGI.pm, etc.  The only problems
are the home-grown, garden-variety CGIs, and those can be worked
around simply by using `--no-http-keep-alive'.

>> it would actually be a better idea to read (and discard) the body
>
> But when do you abort reading if no body comes in clean cases?

I'm not sure I understand the question.  Wget doesn't wait for a body
to arrive in response to HEAD at all; it simply closes the connection
and goes on doing its business.  And that's why persistent connections
are not used in that case.

>> What Wget does only sacrifices persistent connections at times, but
>> does the right thing with all kinds of responses and doesn't introduce
>> artificial delays.
>
> And it's on the safe side: suboptimal, but it never fails. I begin to
> see there is no way to automatically deal better with possible
> spurious HEAD bodies.  What about yet another obscure option for
> cases like my example?  --naively-believe-head-gives-header-only?

See above; I think these days it makes sense to expect HEAD to be
respected by default, and to reserve `--no-http-keep-alive' for when
it isn't.

> Reading the TODO: Please don't tell me some bad servers send a body
> with a 304 Not Modified response to GET If-Modified-Since... Oh no!

Umm, no, it was mere laziness on my part that I didn't use
If-Modified-Since.


Re: keep alive connections

2003-11-13 Thread Alain Bench
 On Wednesday, November 12, 2003 at 2:28:04 PM +0100, Hrvoje Niksic wrote:

> Alain Bench <[EMAIL PROTECTED]> writes:
>> Wget also closes the connection between a GET (with body) and the
>> HEAD for the next file.
> I wasn't aware of this problem

There is no such problem. Sorry: I made a stupid log interpretation
mistake yesterday. The connection is always closed after a HEAD, OK, but
is reused after a GET, whatever comes after.

This makes a typical --timestamping session use connections for only
one request (file not updated) or two requests (updated file and next
file). Only when there is a series of new files are they fetched on
a longer connection.

So for example, I mirror a site with 300 files every day, typically 5
updated and 1 new per day. There is no CGI or redirection involved, all
plain static files, and the server replies cleanly to HEADs, without
spurious bodies. Yet this will use nearly 300 connections, mostly for
series of short HEADs, in a situation where the relative benefit of
persistent connections would be the greatest.


>>> it would actually be a better idea to read (and discard) the body

But when do you abort reading if no body comes in clean cases?


>> Would it be possible to close/reopen only if, and as soon as, the
>> first byte of a spurious body arrives?
> How exactly do you propose to detect the unwanted body?

Humm... Is Tarot efficient? ;-)


> the purpose of the persistent connections (speed).

And the load on the network and server.


> What Wget does only sacrifices persistent connections at times, but
> does the right thing with all kinds of responses and doesn't introduce
> artificial delays.

And it's on the safe side: suboptimal, but it never fails. I begin to
see there is no way to automatically deal better with possible spurious
HEAD bodies. What about yet another obscure option for cases like my
example? --naively-believe-head-gives-header-only?


Reading the TODO: Please don't tell me some bad servers send a body
with a 304 Not Modified response to GET If-Modified-Since... Oh no!


Bye! Alain.
-- 
« if you believe the Content-Length header, I've got a bridge to sell you. »


Re: keep alive connections

2003-11-12 Thread Hrvoje Niksic
Alain Bench <[EMAIL PROTECTED]> writes:

> OK, wasn't aware of the spurious HEAD bodies problem. But Wget also
> closes the connection between a GET (with body) and the HEAD for the
> next file.

Could you post a URL for which this happens?  I wasn't aware of this
problem and would like to fix it.

>> But maybe it would actually be a better idea to read (and discard)
>> the body than to close the connection and reopen it.
>
> Hum... Would it be possible to close/reopen only if, and as soon as,
> the first byte of a spurious body arrives?

This is harder than it seems.  How exactly do you propose to detect
the unwanted body?  If you wait for an arbitrary time for the body
data to start arriving, you slow down all downloads and defeat the
purpose of the persistent connections (speed).  If you don't wait, the
detection doesn't work because the body data can start arriving a bit
later (which is frequently the case with CGIs).  Either way, you
lose.
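
To make the dilemma concrete: "waiting" would amount to something like
the sketch below (made-up names, not wget code), and whatever value
you pick for the timeout is exactly the artificial delay in question.

#include <sys/select.h>
#include <sys/time.h>

/* Returns 1 if data (a spurious body?) becomes readable on SOCK
   within WAIT_MS milliseconds, else 0.  A short timeout misses slow
   CGIs; a long one delays every clean HEAD.  */
static int
body_seems_to_follow (int sock, long wait_ms)
{
  fd_set rfds;
  struct timeval tv;

  FD_ZERO (&rfds);
  FD_SET (sock, &rfds);
  tv.tv_sec = wait_ms / 1000;
  tv.tv_usec = (wait_ms % 1000) * 1000;

  return select (sock + 1, &rfds, NULL, NULL, &tv) > 0;
}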

What Wget does only sacrifices persistent connections at times, but
does the right thing with all kinds of responses and doesn't introduce
artificial delays.

>>>| Keep-Alive: timeout=15, max=5
>>> Without --timestamping Wget keeps "Reusing fd 3." and closes it only
>>> once every 6 files (the first + 5 more).
>> This might be due to redirections.
>
> No redirections involved: That closure is normal, due to the "max=5"
> the server sends in response to the first request. At the second GET
> it's "max=4", and it gets decremented each time. Finally, at the 6th
> request there are no more "Connection:" or "Keep-Alive:" fields.

Oh, I see, it's a server setting.  Why do they use such a limit?


Re: keep alive connections

2003-11-12 Thread Alain Bench
 On Tuesday, November 11, 2003 at 2:41:31 PM +0100, Hrvoje Niksic wrote:

> Alain Bench <[EMAIL PROTECTED]> writes:
>> with --timestamping: Each HEAD and each possible GET uses a new
>> connection.
> I think the difference is that Wget closes the connection when it
> decides not to read the response body.

OK, wasn't aware of the spurious HEAD bodies problem. But Wget also
closes the connection between a GET (with body) and the HEAD for the
next file.


> But maybe it would actually be a better idea to read (and discard) the
> body than to close the connection and reopen it.

Hum... Would it be possible to close/reopen only if, and as soon as,
the first byte of a spurious body arrives? I guess it could be difficult
to deal cleanly with the next file in limit cases...


>>| Keep-Alive: timeout=15, max=5
>> Without --timestamping Wget keeps "Reusing fd 3." and closes it only
>> once every 6 files (the first + 5 more).
> This might be due to redirections.

No redirections involved: That closure is normal, due to the "max=5"
the server sends in response to the first request. At the second GET
it's "max=4", and it gets decremented each time. Finally, at the 6th
request there are no more "Connection:" or "Keep-Alive:" fields. The
/etc/apache/httpd.conf says:

| # KeepAlive: The number of Keep-Alive persistent requests to accept
| # per connection. Set to 0 to deactivate Keep-Alive support
| KeepAlive 5
|
| # KeepAliveTimeout: Number of seconds to wait for the next request
| KeepAliveTimeout 15


Bye! Alain.
-- 
When you want to reply to a mailing list, please avoid doing so from a
digest. This often builds incorrect references and breaks threads.


Re: keep alive connections

2003-11-11 Thread Hrvoje Niksic
Herold Heiko <[EMAIL PROTECTED]> writes:

>> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
>
>> With the HEAD method you never know when you'll stumble upon a CGI
>> that doesn't understand it and that will send the body anyway.  But
>> maybe it would actually be a better idea to read (and discard) the
>> body than to close the connection and reopen it.
>
> Wouldn't that be suboptimal in case the page is huge (and/or the
> connection is slow)?

You are right, it would.  But it might make good sense for
redirections, which typically have very small bodies.


RE: keep alive connections

2003-11-11 Thread Herold Heiko
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]

> With the HEAD method you never know when you'll stumble upon a CGI
> that doesn't understand it and that will send the body anyway.  But
> maybe it would actually be a better idea to read (and discard) the
> body than to close the connection and reopen it.

Wouldn't that be suboptimal in case the page is huge (and/or the
connection is slow)?
Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax


Re: keep alive connections

2003-11-11 Thread Daniel Stenberg
On Tue, 11 Nov 2003, Hrvoje Niksic wrote:

> I think the difference is that Wget closes the connection when it decides
> not to read the response body.  For example, it closes on redirections
> because it (intentionally) ignores the body.

Another approach could be to read and just ignore the body of redirect
pages. You'd save a close/connect but lose the transfer time.

> With the HEAD method you never know when you'll stumble upon a CGI that
> doesn't understand it and that will send the body anyway.  But maybe it
> would actually be a better idea to read (and discard) the body than to close
> the connection and reopen it.

That approach is just as hard; it only depends on different things
working correctly.

Since we're talking about silly servers, they could just as well return
a body to the HEAD request while the response claims to be persistent
and the Content-Length: is set. The Content-Length in a HEAD response is
the size of the body that would be returned if GET were requested, so
you'd have no idea how much data to read.
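
An invented exchange of that sort (not a real capture) would look like:

| HEAD /script.cgi HTTP/1.0
| Connection: Keep-Alive
|
| HTTP/1.0 200 OK
| Connection: Keep-Alive
| Keep-Alive: timeout=15, max=100
| Content-Length: 10240
|
| [some body follows anyway -- and its length need not be 10240,
|  since Content-Length describes the body a GET would return]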

Been there. Seen it happen. There's just no good way to deal with HEAD
requests that send back a body. I mean, besides yelling at the author of
the server side.

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol


Re: keep alive connections

2003-11-11 Thread Hrvoje Niksic
Alain Bench <[EMAIL PROTECTED]> writes:

> Hello Hrvoje,
>
>  On Friday, November 7, 2003 at 11:50:53 PM +0100, Hrvoje Niksic wrote:
>
>> Wget uses the `Keep-Alive' request header to request persistent
>> connections, and understands both the HTTP/1.0 `Keep-Alive' and the
>> HTTP/1.1 `Connection: keep-alive' response header.
>
> This doesn't seem to work together with --timestamping: Each HEAD
> and each possible GET uses a new connection.

I think the difference is that Wget closes the connection when it
decides not to read the response body.  For example, it closes on
redirections because it (intentionally) ignores the body.

With the HEAD method you never know when you'll stumble upon a CGI
that doesn't understand it and that will send the body anyway.  But
maybe it would actually be a better idea to read (and discard) the
body than to close the connection and reopen it.
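
Such a "read and discard" helper might look like the following sketch
(my code, not anything in wget).  Note that it trusts Content-Length,
which a misbehaving server can render meaningless for HEAD:

#include <unistd.h>

/* Drain CONTENT_LENGTH bytes from SOCK so the connection stays
   usable for the next request.  Returns 0 on success, -1 on error
   or premature EOF.  */
static int
drain_body (int sock, long content_length)
{
  char buf[4096];

  while (content_length > 0)
    {
      long want = content_length < (long) sizeof buf
                  ? content_length : (long) sizeof buf;
      ssize_t n = read (sock, buf, want);
      if (n <= 0)
        return -1;
      content_length -= n;
    }
  return 0;
}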

> Without --timestamping Wget keeps "Reusing fd 3." and closes it
> only once every 6 files (the first + 5 more).

This might be due to redirections.  Look out for the exact
circumstances when Wget closes (or doesn't reuse) connections and
you'll probably notice it.


Re: keep alive connections

2003-11-11 Thread Alain Bench
Hello Hrvoje,

 On Friday, November 7, 2003 at 11:50:53 PM +0100, Hrvoje Niksic wrote:

> Wget uses the `Keep-Alive' request header to request persistent
> connections, and understands both the HTTP/1.0 `Keep-Alive' and the
> HTTP/1.1 `Connection: keep-alive' response header.

This doesn't seem to work together with --timestamping: Each HEAD
and each possible GET uses a new connection. The server keeps
responding:

| HTTP/1.0 200 OK
| [...]
| Connection: Keep-Alive
| Keep-Alive: timeout=15, max=5

But Wget 1.9 does each time:

| Created socket 3.
| [snip request/response]
| Registered fd 3 for persistent reuse.
| Closing fd 3
| Invalidating fd 3 from further reuse.
| Remote file is newer, retrieving.
| Created socket 3.
| [and so on]

Tcpdump confirms the TCP session is FIN closed by Wget.

Without --timestamping Wget keeps "Reusing fd 3." and closes it
only once every 6 files (the first + 5 more). At that moment the FIN
would in any case be initiated by the server if not by Wget. The test
was made on an old Apache 1.1.3, but it seems the same with other
servers.


BTW, it's nice to see you back and active, Hrvoje! :-)


Bye! Alain.
-- 
Mutt 1.5.5.1 is released.


Re: keep alive connections

2003-11-08 Thread Hrvoje Niksic
Daniel Stenberg <[EMAIL PROTECTED]> writes:

> On Fri, 7 Nov 2003, Hrvoje Niksic wrote:
>
>> Persistent connections were available prior to HTTP/1.1, although they were
>> not universally implemented.  Wget uses the `Keep-Alive' request header to
>> request persistent connections, and understands both the HTTP/1.0
>> `Keep-Alive' and the HTTP/1.1 `Connection: keep-alive' response header.
>
> HTTP 1.1 servers don't (normally) use "Connection: keep-alive".

Some of them do when talking to HTTP/1.0 clients, where persistent
connections aren't the default.  For example:

$ wget http://www.apache.org
--16:01:44--  http://www.apache.org/
[...]
 8 Keep-Alive: timeout=5, max=100
 9 Connection: Keep-Alive

To be on the safe side, Wget supports both response headers.



Re: keep alive connections

2003-11-08 Thread Daniel Stenberg
On Fri, 7 Nov 2003, Hrvoje Niksic wrote:

> Persistent connections were available prior to HTTP/1.1, although they were
> not universally implemented.  Wget uses the `Keep-Alive' request header to
> request persistent connections, and understands both the HTTP/1.0
> `Keep-Alive' and the HTTP/1.1 `Connection: keep-alive' response header.

HTTP 1.1 servers don't (normally) use "Connection: keep-alive". Since 1.1
assumes persistent connections by default, they only send "Connection:
close" when the connection is not persistent.

-- 
 -=- Daniel Stenberg -=- http://daniel.haxx.se -=-
  ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol


Re: keep alive connections

2003-11-07 Thread Hrvoje Niksic
Laura Worthington <[EMAIL PROTECTED]> writes:

> When I have been testing with wget, wget requests a keep-alive
> connection, but the responses it receives say "connection close".
>
> So it does exactly this... it creates a socket, closes it, and
> reopens it with the next URL

`Connection: close' means that the server will close the connection
after the request, i.e. that it has chosen to disregard Wget's
keep-alive request.  That happens with some servers.

> Is it possible to force the situation where you open a connection,
> and it doesn't close until completion?

You cannot force the server not to close a connection, but if the
server cooperates, that's exactly what happens.  For example:

$ ./wget http://www.apache.org http://www.apache.org/foundation/
--00:53:42--  http://www.apache.org/
...
Resolving www.apache.org... 209.237.227.195
...
--00:53:44--  http://www.apache.org/foundation/
...
Reusing connection to www.apache.org:80.
...

Wget has kept the connection to www.apache.org and is reusing it.
That can even work for multiple virtual hosts, as long as they share
an IP address.  For example:

$ wget www.apache.org httpd.apache.org
--00:50:38--  http://www.apache.org/
...
Connecting to www.apache.org|209.237.227.195|:80... connected.
...
--00:50:40--  http://httpd.apache.org/
...
Reusing connection to httpd.apache.org:80.
...

In this case, Wget has figured out that `www.apache.org' and
`httpd.apache.org' are one and the same and that it doesn't need to
reconnect to the same IP address.
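
Conceptually the check is something like this sketch (made-up code;
Wget's internal bookkeeping differs):

#include <netdb.h>
#include <string.h>

/* Two host names can share a connection if they resolve to the same
   address.  gethostbyname() returns static storage, so the first
   result must be copied before the second call.  */
static int
same_address (const char *host_a, const char *host_b)
{
  struct hostent *he;
  char addr_a[16];
  int len;

  he = gethostbyname (host_a);
  if (!he)
    return 0;
  len = he->h_length;
  memcpy (addr_a, he->h_addr_list[0], len);

  he = gethostbyname (host_b);
  if (!he || he->h_length != len)
    return 0;
  return memcmp (addr_a, he->h_addr_list[0], len) == 0;
}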

> Example:
> Can I do something where
>
> wget www.yahoo.com weather.yahoo.com
>
> reuses the connection between these two URL's?

Those two hosts have different IP addresses, so they cannot share a
connection.

> Do you have an idea of when HTTP 1.1 support will be available?

It might be there for the 1.10 release, but I can't guarantee that.



Re: keep alive connections

2003-11-07 Thread Hrvoje Niksic
Laura Worthington <[EMAIL PROTECTED]> writes:

> I am confused as to how wget supports keep alive.  I am using 1.8.2.
> Persistent connections are part of HTTP 1.1, but wget is using HTTP
> 1.0.

Persistent connections were available prior to HTTP/1.1, although they
were not universally implemented.  Wget uses the `Keep-Alive' request
header to request persistent connections, and understands both the
HTTP/1.0 `Keep-Alive' and the HTTP/1.1 `Connection: keep-alive'
response header.

Most servers support both styles of persistent connections because
HTTP/1.0 clients were in extremely wide use when those servers were
designed and implemented.  (This is the case with Apache, but probably
also with IIS and other major players in the field.)  A new generation
of servers might be expected to only support persistent connections
when talking to HTTP/1.1 clients, but then again Wget will get
HTTP/1.1 support at some point too.
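
For illustration, an HTTP/1.0-style keep-alive exchange (invented,
not a capture) looks something like:

| GET /index.html HTTP/1.0
| Connection: Keep-Alive
|
| HTTP/1.0 200 OK
| Content-Length: 1234
| Connection: Keep-Alive
| Keep-Alive: timeout=15, max=100

An HTTP/1.1 server, by contrast, keeps the connection open by default
and sends `Connection: close' only when it intends to drop it.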