Re: wget checks first HTML-document against -A

Hrvoje Niksic Wed, 14 Sep 2005 08:37:54 -0700

Dennis Heuer <[EMAIL PROTECTED]> writes:

> Your answer fits only half because I still have to choose -Ahtml,pdf
> and I still get *at least* the first HTML page on my disk


The first HTML page will only be saved temporarily: Wget needs it to
extract the links and deletes it afterwards.  You still shouldn't need
to use -Ahtml,pdf instead of just -Apdf.

> (try a page like this and you will see that you get a lot of
> unwanted pages on your disk:
> http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258)

The first problem with this page is that the PDFs are off-site, so
you need to use -H to have Wget retrieve them.  To avoid creating
spurious directories, I recommend -nd, and to avoid deep recursion,
-l1 is needed.  This amounts to:

    wget -H -rl1 -nd -A.pdf \
      'http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258'
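
For readability, the same invocation can be spelled with Wget's long
option names.  The following is only a sketch: the function name
fetch_pdfs and the WGET override variable are made up for illustration
(the override exists so the command line can be inspected without
touching the network), and only the options themselves come from Wget.

```shell
# Hypothetical wrapper: fetch_pdfs and WGET are illustrative names, not part of Wget.
fetch_pdfs() {
    # $1 is the start URL.  Long options and their short equivalents:
    #   --span-hosts      = -H   follow links to other hosts
    #   --recursive       = -r   recursive retrieval
    #   --level=1         = -l1  go only one level deep
    #   --no-directories  = -nd  do not recreate the directory hierarchy
    #   --accept=.pdf     = -A.pdf
    "${WGET:-wget}" \
        --span-hosts \
        --recursive \
        --level=1 \
        --no-directories \
        --accept=.pdf \
        "$1"
}
```

Setting WGET=echo before calling fetch_pdfs prints the arguments
instead of running Wget.  Remember to quote the URL, since it contains
"&" and "?" characters that the shell would otherwise interpret.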

The other problem with this page is that it links to a lot of
pages without a ".html" suffix in their URLs, such as
http://www.worldbank.org/.  -A wrongly fails to reject these because
it considers them to be "directories" rather than files.  I'm not sure
if that's exactly a bug, but it certainly doesn't look like a feature.
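
To make the distinction concrete, here is a rough illustration of the
behaviour described above.  It is not Wget's actual code; the helper
name accept_check and its three outcomes are assumptions made for the
example.  The idea is that a suffix rule can only be applied when the
URL has a file name after its last slash.

```shell
# Hypothetical helper, not part of Wget: shows why a URL ending in "/"
# gives a suffix-based accept rule nothing to match against.
accept_check() {
    url=$1
    suffix=$2
    path=${url#*://}      # drop the scheme
    path=${path%%\?*}     # drop the query string, if any
    case $path in
        */*) name=${path##*/} ;;   # text after the last slash
        *)   name= ;;              # bare host with no path at all
    esac
    if [ -z "$name" ]; then
        echo directory             # no file name, so the suffix rule is skipped
    else
        case $name in
            *"$suffix") echo accept ;;
            *)          echo reject ;;
        esac
    fi
}
```

With this sketch, "accept_check http://www.worldbank.org/ .pdf" reports
"directory", which is the case that slips past -A, while a URL ending
in "index.html" is reported as "reject".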
