Dear wget team,
I recently found a bug in version 1.8 of wget (recursive retrieval) that
did not occur in earlier versions (as far as I can see, 1.7 is definitely
not affected).
The new wget version treats query-only hrefs ("?xxx") the same way as hrefs
to anchors ("#xxx"). So, for example, a misplaced "?xxx" reference leads to
an HTTP request for "http://www.xxx.xxx/currentfile.html" (unlike earlier
versions, which treated the "?xxx" as a plain file name). That in itself is
not a bad thing, but wget 1.8 then starts to send requests of the form
"http://www.xxx.xxx/currentfile.html/anotherfile.html", although
anotherfile.html is, for example, also in the root dir, or at least
"http://www.xxx.xxx//currentfile.html", which can cause wget to request the
file a second time and, if the time stamp is missing, to download it twice.
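To illustrate the resolution I would have expected, here is a minimal sketch
in Python using the standard urllib.parse.urljoin. It is not wget's code,
only an illustration under the assumption that relative references are
resolved as described in RFC 3986. The host and file names are the
placeholders from this report, and wrong_url merely reproduces the faulty
form of request by treating the file name as if it were a directory.

```python
from urllib.parse import urljoin

# Placeholder URL from this report; not a real host.
base = "http://www.xxx.xxx/currentfile.html"

# A query-only reference keeps the base path and only sets the query.
query_url = urljoin(base, "?xxx")

# A fragment-only reference keeps the base path and only sets the fragment.
anchor_url = urljoin(base, "#xxx")

# A relative link found in the page fetched via "?xxx" should replace the
# last path segment of the base, not be appended below it.
sibling_url = urljoin(query_url, "anotherfile.html")

# Illustration only: treating the file name as a directory (trailing slash)
# yields the faulty form of request described above.
wrong_url = urljoin(base + "/", "anotherfile.html")

for name, url in [("query", query_url), ("anchor", anchor_url),
                  ("sibling", sibling_url), ("wrong", wrong_url)]:
    print(f"{name:8} {url}")
```

With this resolution, anotherfile.html stays in the root dir
("http://www.xxx.xxx/anotherfile.html"), so the chain of ever deeper
requests cannot start.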
I encountered a server that did not answer such erroneous recursive requests
with 404 (file not found), but instead sent the contents of currentfile.html
again, now under another URL at another directory level. This ended in an
infinite request loop, diving deeper and deeper into directories that do not
actually exist on the server. (This caused me serious problems:
unfortunately, the person affected is considering legal steps, because the
uncontrolled wget downloaded the site about 20 times over, until it was shut
down.)
So it seems important to correct this behaviour. I think you only need to
set up a test site (maybe with some subdirs) containing one file with an
erroneous href="?xxx" tag to reproduce this (maybe only in part, depending
on your server configuration).
Sincerely,
Robert Muecke