Re: wget checks first HTML-document against -A

2005-09-14 Thread Dennis Heuer
Your answer only half applies, because I still have to choose -Ahtml,pdf and I 
still get *at least* the first HTML page on my disk (try a page like this and 
you will see that you get a lot of unwanted pages on your disk: 
http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258)

(I am always using the latest version)

Dennis

On Wed, 14 Sep 2005 16:40:22 +0200
Hrvoje Niksic [EMAIL PROTECTED] wrote:

 Dennis Heuer [EMAIL PROTECTED] writes:
 
  For example: if I want to grab a series of pdf's from a list that is
  part of an HTML-document, I want to just set -Apdf. This does not
  work, though, because the HTML-document gets rejected. I have to set
  -Ahtml,pdf.
 
 Really?  Which version of Wget are you using?  I use `wget -rl1
 http://.../something.html -A.extension' a lot and it works for me.
 
 


Re: wget checks first HTML-document against -A

2005-09-14 Thread Hrvoje Niksic
Dennis Heuer [EMAIL PROTECTED] writes:

 Your answer only half applies, because I still have to choose -Ahtml,pdf
 and I still get *at least* the first HTML page on my disk

The first HTML page will only be saved temporarily.  You still
shouldn't need to use -Ahtml,pdf instead of just -Apdf.
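
For example, with an invocation along these lines (the URL is
hypothetical), the listing page is downloaded, parsed for links, and
then deleted because it does not match the accept list; only the PDFs
remain on disk:

wget -rl1 -Apdf http://example.com/papers/list.html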

 (try a page like this and you will see that you get a lot of
 unwanted pages on your disk:
 http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258)

The first problem with this page is that the PDF's are off-site, so
you need to use -H to have Wget retrieve them.  To avoid creating
spurious directories, I recommend -nd, and to avoid deep recursion,
-l1 is needed.  This amounts to:

wget -H -rl1 -nd -A.pdf 
'http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258'
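
For reference, here is the same command again with each option
annotated (nothing new, just comments):

# -H      span hosts: the PDFs live on servers other than web.worldbank.org
# -r -l1  recurse, but only one level deep from the starting page
# -nd     don't recreate the server's directory layout locally
# -A.pdf  accept (keep) only files whose names end in .pdf
wget -H -rl1 -nd -A.pdf \
'http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258'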

The other problem with this page is that it links to a lot of
pages without a .html suffix in their URLs, such as
http://www.worldbank.org/.  -A bogusly doesn't reject these because it
considers them to be directories rather than files.  I'm not sure if
that's exactly a bug, but it certainly doesn't look like a feature.


with recursive wget status code does not reflect success/failure of operation

2005-09-14 Thread Owen Cliffe
I'm not sure if this is a bug or a feature, but with recursive
operation, if a get fails and retrieve_tree bails out, no sensible
error code is returned to main.c (errors are only passed up if the
user's quota was exceeded, the URL was invalid, or there was a write
error), so retrieve_tree always returns RETROK, causing the main
function to return with an exit status of 0 even if there was an
error. 

With a single, non-recursive download you get a non-zero exit status
if the operation failed. 

This is kind of annoying if you are trying to determine, in a shell
script, whether a recursive operation completed successfully. 

Is there a good reason why retrieve_tree doesn't just return the status
of the last failed operation on failure?

Even if this weren't the default behaviour, it would be useful, as at
the moment there is no way to find out whether a recursive get failed
or succeeded. 
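
A small script fragment illustrates the problem (the URL is made up;
assume some of the pages linked from it fail to download):

wget -rl1 http://example.com/index.html
status=$?
# status is 0 here even though retrievals inside the recursion failed,
# so the test below never detects the failure
if [ "$status" -ne 0 ]; then
    echo "recursive download failed" >&2
fi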

owen cliffe
---




Re: with recursive wget status code does not reflect success/failure of operation

2005-09-14 Thread Hrvoje Niksic
Owen Cliffe [EMAIL PROTECTED] writes:

 Is there a good reason why retrieve_tree doesn't just return the
 status of the last failed operation on failure?

The original reason (which I don't claim to be good) is that Wget
doesn't stop on error, it continues.  Because of this, returning a
non-zero error code felt wrong, because the download has, after all,
finished successfully.  The quota-exceeded case is an exception
consistent with this logic because, when the quota is exceeded, Wget
really terminates the entire download.

But you present a convincing argument otherwise.  Maybe Wget should
use different error codes for different cases, like:

0 -- all files downloaded successfully (not counting errors in
 robots.txt and such)
1 -- some errors encountered
2 -- fatal errors encountered (such as the quota exceeded case),
 download aborted

What do others think about this?
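
With such a scheme, a script could tell the cases apart along these
lines (a sketch assuming the numbering above; the URL is made up):

wget -rl1 -A.pdf http://example.com/docs/
case $? in
    0) echo "all files downloaded successfully" ;;
    1) echo "finished, but some downloads failed" ;;
    2) echo "fatal error, download aborted" ;;
esac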


Re: with recursive wget status code does not reflect success/failure of operation

2005-09-14 Thread Mauro Tortonesi
At 18:58 on Wednesday, 14 September 2005, Hrvoje Niksic wrote:
 Owen Cliffe [EMAIL PROTECTED] writes:
  Is there a good reason why retrieve_tree doesn't just return the
  status of the last failed operation on failure?

 The original reason (which I don't claim to be good) is that Wget
 doesn't stop on error, it continues.  Because of this, returning a
 non-zero error code felt wrong, because the download has, after all,
 finished successfully.  The quota-exceeded case is an exception
 consistent with this logic because, when the quota is exceeded, Wget
 really terminates the entire download.

 But you present a convincing argument otherwise.  Maybe Wget should
 use different error codes for different cases, like:

 0 -- all files downloaded successfully (not counting errors in
  robots.txt and such)
 1 -- some errors encountered
 2 -- fatal errors encountered (such as the quota exceeded case),
  download aborted

 What do others think about this?

I think that's a good point.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it


Re: wget checks first HTML-document against -A

2005-09-14 Thread Hrvoje Niksic
Dennis Heuer [EMAIL PROTECTED] writes:

 I've checked that on a different site and it worked. However, my
 main point (why I called this a (design) bug) is still valid. When I
 target a page and say -Apdf, it is clear that only the pdf links are
 valid choices. The options -rl1 should not be necessary.

Well, it's not, strictly speaking.  -r means download recursively,
and the default maximum recursion depth is five levels, for both HTTP
and FTP.  HTML files are special-cased because they are the only
means of retrieving the links necessary for traversing the site.

To phrase it another way: if you use FTP, you would expect something
like:

wget -r ftp://server/dir/ -A.pdf

to recursively download all PDF's in the directory and the directories
below it.  The design decision was to have something like:

wget -r http://server/dir/ -A.pdf

behave the same.  Otherwise, how would you tell Wget to crawl the
whole site and download only the PDF's?
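
With that design, the whole-site case is just a matter of not capping
the recursion depth, e.g. (a sketch; -l inf lifts the default
five-level limit):

wget -r -l inf -A.pdf http://server/

The HTML pages are still fetched so that their links can be followed,
but only the PDF's are kept on disk.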