Re: Serious bug in recursive retrieval behaviour occurred in v. 1.8

2002-04-05 Thread Robert Mücke

Dear Ian,

 I couldn't reproduce this with wget 1.8 and a local Apache server
 (but I didn't try reconfiguring Apache in an attempt to
 reproduce it).

 A few recursive retrieval bugs were fixed in wget 1.8.1. Is it
 possible for you to test that version? (You may want to limit the
 recursion depth and the maximum amount to download if repeating the
 test!)

I found out that more circumstances have to be fulfilled in order to
reproduce this bug. I have a local copy of the website where it
occurred, and it seems necessary to have nearly all the files in the
web server directory, not only the corrupted one(s). I tested wget
1.8.1 and it worked well, without the bug.

Regards, Robert




Serious bug in recursive retrieval behaviour occurred in v. 1.8

2002-04-04 Thread Robert Mücke

Dear wget team,

I recently found a bug in version 1.8 of the wget program (recursive
retrieval) that did not occur in earlier versions (at least as far as
I can see, 1.7 is definitely not affected).

The new wget version treats single ?xxx hrefs the same way as hrefs to
anchors (#xxx). So, for example, a misplaced <a href="?xx"></a>
reference leads to an HTTP request for
"http://www.xxx.xxx/currentfile.html" (in contrast to earlier versions,
which treated the "?xx" as a single file name). Now, while this in
itself is not a bad thing, wget 1.8 then starts to send requests of the
form "http://www.xxx.xxx/currentfile.html/anotherfile.html", although
anotherfile.html is, e.g., also in the root dir, or at least
"http://www.xxx.xxx//currentfile.html", which can cause wget to send a
retrieval request for the file a second time and, if no time stamp is
available, to download it twice.
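
For illustration only (the host and file names are made up, and this
just mirrors the behaviour described above using ordinary relative-URL
resolution, not wget's actual code):

    from urllib.parse import urljoin

    base = "http://www.xxx.xxx/currentfile.html"

    # a query-only href resolves back onto the current page
    print(urljoin(base, "?xx"))
    # -> http://www.xxx.xxx/currentfile.html?xx

    # correct resolution of a sibling link against the file
    print(urljoin(base, "anotherfile.html"))
    # -> http://www.xxx.xxx/anotherfile.html

    # if the current page is mistakenly taken for a directory, the
    # same link resolves one level deeper, as reported above
    print(urljoin(base + "/", "anotherfile.html"))
    # -> http://www.xxx.xxx/currentfile.html/anotherfile.html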

I encountered a server that did not answer such erroneous recursive
requests with 404 (file not found), but instead sent the contents of
currentfile.html again, now under another URL at another directory
level, which ended up in an infinite request loop, diving deeper and
deeper into directories that do not actually exist on the server.
(This actually got me into serious trouble; unfortunately, the person
affected is considering legal steps, because the uncontrolled wget
downloaded the site about 20 times over before it was shut down.)
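
A rough sketch of that runaway, again with made-up names and under the
assumption that the server keeps answering the bogus deeper URLs with
the same page, which still links to anotherfile.html:

    from urllib.parse import urljoin

    url = "http://www.xxx.xxx/currentfile.html"
    for _ in range(4):
        # each round, the previous URL is taken for a directory and
        # the same relative link is followed again, one level deeper
        url = urljoin(url + "/", "anotherfile.html")
        print(url)
    # http://www.xxx.xxx/currentfile.html/anotherfile.html
    # http://www.xxx.xxx/currentfile.html/anotherfile.html/anotherfile.html
    # ... and so on, never terminating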

So it seems important to correct this behaviour. I think you only need
to set up a test site (maybe with some subdirs) containing one file
with an erroneous href tag to reproduce this (maybe only partially,
depending on your server configuration).
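
Just as a sketch of what such a test site could look like (all
directory and file names are made up; any local web server will do,
and limiting depth and quota, as suggested earlier, is advisable):

    import os

    # write a tiny two-file test site containing one malformed,
    # query-only href of the kind described above
    os.makedirs("testsite", exist_ok=True)

    with open("testsite/currentfile.html", "w") as f:
        f.write('<html><body>\n'
                '<a href="?xx"></a>\n'
                '<a href="anotherfile.html">next</a>\n'
                '</body></html>\n')

    with open("testsite/anotherfile.html", "w") as f:
        f.write('<html><body>another file</body></html>\n')

    # then serve the directory with a local web server and run, e.g.:
    #   wget -r -l 3 -Q 10m http://localhost/testsite/currentfile.html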

Sincerely,
Robert Muecke