wget mirroring busted

Jamie Zawinski Wed, 14 Nov 2001 01:28:47 -0800

I can't make any sense of what's happening, but when I try to use wget
to mirror a particular pair of URLs, it doesn't download everything.


I'm doing this:

    wget -nv -m -nH -np \
       http://www.dnalounge.com/flyers/ \
       http://www.dnalounge.com/gallery/

It's downloading about every 4th subdirectory under gallery/2001/;
if you look at the index.html file there, you'll see that all links
are in identical syntax, so I don't see why it's downloading 07-13/
but skipping 07-14/.
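
One way to check the "identical syntax" claim is to pull the link targets out of the saved index page and compare them side by side. This is only a rough sketch: the helper name list_hrefs is made up for illustration, and it assumes every link is written as a double-quoted href="..." attribute, which a real HTML page need not do.

```shell
# list_hrefs FILE: print each href value found in FILE, one per line.
# Assumption: attributes are double-quoted (href="..."); unquoted or
# single-quoted links would be missed by this pattern.
list_hrefs() {
    grep -oi 'href="[^"]*"' "$1" | sed 's/^[^"]*"//; s/"$//'
}

# Example (the path is where "wget -nH" would be expected to save the page):
#   list_hrefs gallery/2001/index.html | sort -u
```

If 07-13/ and 07-14/ come out looking the same here, the difference in treatment is unlikely to be caused by the markup itself.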

And then, strangely, if I leave off the flyers/ URL on the command
line, it downloads more of the gallery/ directories -- but not all
of them.  

It's acting as if there's some maximum number of URLs it's willing
to try, or something like that?

I tried this on Linux with both wget 1.5.3 and 1.7.  I've tried it
on two different machines.  With the above command line, I always
get this set of directories:

    FINISHED --01:20:47--
    Downloaded: 23,344,299 bytes in 720 files
    % find flyers gallery -type d | sort
    flyers
    flyers/2001
    flyers/2001/07
    flyers/2001/08
    flyers/2001/09
    flyers/2001/10
    flyers/2001/11
    flyers/2001/12
    gallery
    gallery/2001
    gallery/2001/07-13
    gallery/2001/08-17
    gallery/2001/09-16
    gallery/2001/09-20
    gallery/2001/10-05

If it were working properly, I'd get this set of directories:

    flyers
    flyers/2001
    flyers/2001/07
    flyers/2001/08
    flyers/2001/09
    flyers/2001/10
    flyers/2001/11
    flyers/2001/12
    gallery
    gallery/2001
    gallery/2001/07-13
    gallery/2001/07-14
    gallery/2001/07-28
    gallery/2001/08-01
    gallery/2001/08-04
    gallery/2001/08-10
    gallery/2001/08-17
    gallery/2001/08-31
    gallery/2001/09-01
    gallery/2001/09-16
    gallery/2001/09-20
    gallery/2001/09-23
    gallery/2001/10-05
    gallery/2001/10-14
    gallery/2001/10-31

I added "-d" to the command line, and saved the output to a file,
in case you're interested.  Here are the lines matching one of the 
directories it chose to ignore:

    % grep 10-31 LOG
    flyers/2001/10/31-halloween.html: merge("http://www.dnalounge.com/flyers/2001/10/31-halloween.html", "../../../gallery/2001/10-31/") -> http://www.dnalounge.com/flyers/2001/10/../../../gallery/2001/10-31/
    parseurl ("http://www.dnalounge.com/flyers/2001/10/../../../gallery/2001/10-31/") -> host www.dnalounge.com -> opath flyers/2001/10/../../../gallery/2001/10-31/ -> dir flyers/2001/10/../../../gallery/2001/10-31 -> file  -> ndir gallery/2001/10-31
    newpath: /gallery/2001/10-31/
    http://www.dnalounge.com/gallery/2001/10-31/ already in list, so we don't load.
    gallery/2001/index.html: merge("http://www.dnalounge.com/gallery/2001/", "10-31/") -> http://www.dnalounge.com/gallery/2001/10-31/
    parseurl ("http://www.dnalounge.com/gallery/2001/10-31/") -> host www.dnalounge.com -> opath gallery/2001/10-31/ -> dir gallery/2001/10-31 -> file  -> ndir gallery/2001/10-31
    newpath: /gallery/2001/10-31/
    http://www.dnalounge.com/gallery/2001/10-31/ already in list, so we don't load.
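
Both references to 10-31/ end with the same "already in list" message, so it may be worth listing every URL the log reports that way and checking it against the directories that never showed up. A rough sketch: the helper name skipped_urls is made up, and it assumes the message format is exactly as in the excerpt above, with the URL as the first field on the line.

```shell
# skipped_urls LOGFILE: print each distinct URL that the debug log
# reports as "already in list".
# Assumption: the URL is the first whitespace-separated field on
# those lines, as in the excerpt above.
skipped_urls() {
    grep 'already in list' "$1" | awk '{ print $1 }' | sort -u
}

# Example (LOG is the file the "-d" output was saved to):
#   skipped_urls LOG
```

If a URL shows up here but its directory was never fetched, that would suggest it was marked as seen without ever being downloaded.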

The only "ERROR" in the log is about the nonexistent robots.txt.

Any suggestions?

-- 
Jamie Zawinski
[EMAIL PROTECTED]             http://www.jwz.org/
[EMAIL PROTECTED]       http://www.dnalounge.com/
