Re: wget checks first HTML-document against -A
Your answer fits only half, because I still have to choose -Ahtml,pdf and I still get *at least* the first HTML page on my disk. (Try a page like this and you will see that you get a lot of unwanted pages on your disk: http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258)

(I am always using the latest version)

Dennis

On Wed, 14 Sep 2005 16:40:22 +0200, Hrvoje Niksic [EMAIL PROTECTED] wrote:

> Dennis Heuer [EMAIL PROTECTED] writes:
>
>> For example: if I want to grab a series of PDFs from a list that is
>> part of an HTML document, I want to just set -Apdf. This does not
>> work, though, because the HTML document gets rejected. I have to set
>> -Ahtml,pdf.
>
> Really? Which version of Wget are you using? I use `wget -rl1
> http://.../something.html -A.extension' a lot and it works for me.
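A minimal sketch of the two invocations being compared here, with example.org/list.html standing in as a placeholder for the actual listing page:

  wget -rl1 -Apdf http://example.org/list.html
  # the intended call; the reported problem is that the listing page itself is rejected

  wget -rl1 -Ahtml,pdf http://example.org/list.html
  # the workaround, which then also leaves HTML pages on disk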
Re: wget checks first HTML-document against -A
Dennis Heuer [EMAIL PROTECTED] writes:

> Your answer fits only half, because I still have to choose -Ahtml,pdf
> and I still get *at least* the first HTML page on my disk

The first HTML page will only be saved temporarily. You still shouldn't need to use -Ahtml,pdf instead of just -Apdf.

> (Try a page like this and you will see that you get a lot of unwanted
> pages on your disk:
> http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258)

The first problem with this page is that the PDFs are off-site, so you need to use -H to have Wget retrieve them. To avoid creating spurious directories, I recommend -nd, and to avoid deep recursion, -l1 is needed. This amounts to:

wget -H -rl1 -nd -A.pdf 'http://web.worldbank.org/external/default/main?theSitePK=258644&menuPK=258666&region=119222&pagePK=51083064&piPK=51246258'

The other problem with this page is that it links to a lot of pages without a .html suffix in their URLs, such as http://www.worldbank.org/. -A bogusly doesn't reject these because it considers them to be directories rather than files. I'm not sure if that's exactly a bug, but it certainly doesn't look like a feature.
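If the off-site HTML pages are a problem in practice, one possible (untested) refinement of the command above is to limit host spanning to the domains that actually serve the PDFs via -D/--domains; here worldbank.org is an assumption about where those PDFs live, and the query string is shortened:

  wget -H -D worldbank.org -rl1 -nd -A.pdf 'http://web.worldbank.org/external/default/main?...'

-D restricts which hosts -H is allowed to span to, so unrelated sites linked from the page are not visited at all.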
with recursive wget status code does not reflect success/failure of operation
I'm not sure if this is a bug or a feature, but with recursive operation, if a get fails and retrieve_tree bails out, then no sensible error codes are returned to main.c (errors are only passed up if the user's quota was full, the URL was invalid, or there was a write error), so retrieve_tree always returns RETROK, causing the main function to return with an exit status of 0 even if there was an error. With single wgets you get a non-zero exit code if the operation failed.

This is kind of annoying if you are trying to determine whether a recursive operation completed successfully in a shell script. Is there a good reason why retrieve_tree doesn't just return the status of the last failed operation on failure? Even if this weren't the default behaviour, it would be useful, as at the moment there is no way to find out whether a recursive get failed or succeeded.

owen cliffe
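The kind of shell check that this breaks looks roughly like the following (a sketch, with a placeholder URL):

  wget -r -l1 http://example.org/docs/
  if [ $? -ne 0 ]; then
      echo "recursive download failed" >&2
      exit 1
  fi
  # With the behaviour described above, per-file failures during the
  # recursion never reach this branch: retrieve_tree reports RETROK
  # and wget exits with status 0 anyway.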
Re: with recursive wget status code does not reflect success/failure of operation
Owen Cliffe [EMAIL PROTECTED] writes:

> Is there a good reason why retrieve_tree doesn't just return the
> status of the last failed operation on failure?

The original reason (which I don't claim to be good) is that Wget doesn't stop on an error; it continues. Because of this, returning a non-zero error code felt wrong, because the download has, after all, finished successfully. The quota-exceeded case is an exception consistent with this logic because, when quota is exceeded, Wget really terminates the entire download.

But you present a convincing argument otherwise. Maybe Wget should use different error codes for different cases, like:

0 -- all files downloaded successfully (not counting errors in robots.txt and such)
1 -- some errors encountered
2 -- fatal errors encountered (such as the quota exceeded case), download aborted

What do others think about this?
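Under such a scheme, a calling script could distinguish the outcomes along these lines (a sketch only; the codes above are just a proposal at this point):

  wget -r -l1 http://example.org/docs/
  case $? in
    0) echo "all files downloaded successfully" ;;
    1) echo "finished, but some downloads failed" ;;
    2) echo "download aborted (fatal error, e.g. quota exceeded)" ;;
  esac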
Re: with recursive wget status code does not reflect success/failure of operation
At 18:58 on Wednesday, 14 September 2005, Hrvoje Niksic wrote:

> Owen Cliffe [EMAIL PROTECTED] writes:
>
>> Is there a good reason why retrieve_tree doesn't just return the
>> status of the last failed operation on failure?
>
> The original reason (which I don't claim to be good) is that Wget
> doesn't stop on an error; it continues. Because of this, returning a
> non-zero error code felt wrong, because the download has, after all,
> finished successfully. The quota-exceeded case is an exception
> consistent with this logic because, when quota is exceeded, Wget
> really terminates the entire download.
>
> But you present a convincing argument otherwise. Maybe Wget should
> use different error codes for different cases, like:
>
> 0 -- all files downloaded successfully (not counting errors in robots.txt and such)
> 1 -- some errors encountered
> 2 -- fatal errors encountered (such as the quota exceeded case), download aborted
>
> What do others think about this?

I think that's a good point.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                            http://www.tortonesi.com
University of Ferrara - Dept. of Eng.      http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool    http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux              http://www.deepspace6.net
Ferrara Linux User Group                   http://www.ferrara.linux.it
Re: wget checks first HTML-document against -A
Dennis Heuer [EMAIL PROTECTED] writes:

> I've checked that on a different site and it worked. However: my main
> point (why I called this a (design) bug) is still valid. When I target
> a page and say -Apdf, it is clear that only the PDF links are valid
> choices. The options -rl1 should not be necessary.

Well, it's not, strictly speaking. -r means download recursively, and the default maximum recursion depth is five levels, for both HTTP and FTP. HTML files are special-cased because they are the only means of retrieving the links necessary for traversing the site.

To phrase it another way: if you use FTP, you would expect something like:

wget -r ftp://server/dir/ -A.pdf

to recursively download all PDFs in the directory and the directories below it. The design decision was to have something like:

wget -r http://server/dir/ -A.pdf

behave the same. Otherwise, how would you tell Wget to crawl the whole site and download only the PDFs?
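For that "crawl the whole site" case, the invocation would presumably look something like the following sketch; -l inf lifts the default five-level recursion limit, and --no-parent (not mentioned in the message above, added here as an assumption) keeps the crawl below the starting directory:

  wget -r -l inf --no-parent -A.pdf http://server/dir/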