Re: wget alpha: -r --spider downloading all files
Am 20.07.2006 15:22, Mauro Tortonesi schrieb:
> Stefan Melbinger ha scritto:
>> By the way, FTP transfers shouldn't be downloaded in full in this
>> mode either.
>
> Well, the semantics of --spider for FTP are still not very clear to
> me. At the moment, I was considering whether to simply perform an FTP
> listing when --spider is given, or to disable --spider for FTP URLs.

I think I see what you mean. Here are my 2 cents:

Starting a link check with FTP support (-r --spider --follow-ftp) on an
HTML page, I would expect wget to follow FTP links only to check that
they exist (without downloading them completely), not to actively
search for other files on the FTP server. On the other hand, if I
started the same link check on an ftp:// URL, I _would_ expect wget to
actively search through the directories, of course.

Is that what you meant?

Greets
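A minimal sketch in C of the two behaviours Stefan describes (the names
are illustrative only, not actual wget internals):

  #include <stdbool.h>

  /* Sketch of the --spider semantics discussed above: how an FTP URL
     might be treated depending on how the crawl reached it.  */
  enum ftp_spider_action
  {
    FTP_CHECK_EXISTENCE,   /* verify the file exists, skip the body */
    FTP_LIST_AND_RECURSE   /* list directories and crawl into them  */
  };

  static enum ftp_spider_action
  choose_ftp_spider_action (bool url_is_start_url)
  {
    if (url_is_start_url)
      /* wget -r --spider ftp://host/dir/ : crawl the server.  */
      return FTP_LIST_AND_RECURSE;

    /* FTP link found on an HTML page: only check that the file
       exists, e.g. via SIZE or an immediately aborted RETR.  */
    return FTP_CHECK_EXISTENCE;
  }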
Re: wget alpha: -r --spider downloading all files
Stefan Melbinger ha scritto:
> By the way, FTP transfers shouldn't be downloaded in full in this
> mode either.

Well, the semantics of --spider for FTP are still not very clear to me.
At the moment, I was considering whether to simply perform an FTP
listing when --spider is given, or to disable --spider for FTP URLs.

What do you think?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                           http://www.tortonesi.com
University of Ferrara - Dept. of Eng.     http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool   http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux             http://www.deepspace6.net
Ferrara Linux User Group                  http://www.ferrara.linux.it
Re: wget alpha: -r --spider downloading all files
Am 20.07.2006 14:32, Mauro Tortonesi schrieb:
> Stefan Melbinger ha scritto:
>> As you might have noticed, I was trying to use wget as a tool to
>> check for dead links on big websites. The combination of -r and
>> --spider works in the new alpha version; however, wget is still
>> downloading ALL files (no matter whether they can be parsed for
>> further links or not), instead of just getting the status response
>> for files other than text/html or application/xhtml+xml. I don't
>> think this makes much sense; the files are deleted anyway, and
>> downloading a 300 MB video is not useful if you just want to check
>> links and see whether the video is there at all.
>
> You're absolutely right, Stefan. I've just started working on it.

It's really great how much time you invest in this, thank you!

By the way, FTP transfers shouldn't be downloaded in full in this mode
either.

Greets
Re: wget alpha: -r --spider downloading all files
Stefan Melbinger ha scritto:
> Hi,
>
> As you might have noticed, I was trying to use wget as a tool to
> check for dead links on big websites. The combination of -r and
> --spider works in the new alpha version; however, wget is still
> downloading ALL files (no matter whether they can be parsed for
> further links or not), instead of just getting the status response
> for files other than text/html or application/xhtml+xml. I don't
> think this makes much sense; the files are deleted anyway, and
> downloading a 300 MB video is not useful if you just want to check
> links and see whether the video is there at all.
>
> Could somebody suggest a quick hack to disable the downloading of
> non-parseable documents? I think it must be somewhere in the area of
> http.c, around gethttp() or maybe http_loop() - unfortunately, my
> knowledge of C and of this project wasn't enough to get any
> satisfying result.

You're absolutely right, Stefan. I've just started working on it.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                           http://www.tortonesi.com
University of Ferrara - Dept. of Eng.     http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool   http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux             http://www.deepspace6.net
Ferrara Linux User Group                  http://www.ferrara.linux.it
wget alpha: -r --spider, elapsed time
Hello there,

I think I've got the most recent version of wget running now (SVN
trunk), and there seems to be another problem: when the report about
the broken links is printed at the end, the duration of the operation
is always 0 seconds:

"FINISHED --11:56:10-- Downloaded: 62 files, 948K in 0s (6470 GB/s)"

But I guess you know that already ;)

Greets,
Stefan

PS: Does nobody have an idea how to prevent complete files from being
downloaded when they are not text/html? And am I the only one who
thinks this would be necessary?
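The bogus "6470 GB/s" suggests a division by a zero (or
below-resolution) elapsed time. A minimal sketch of the missing guard,
in C (illustrative only, not actual wget code):

  #include <stdio.h>

  /* Guard the summary line against a zero elapsed time instead of
     printing a nonsense rate like "948K in 0s (6470 GB/s)".  */
  static void
  print_summary (double kbytes, double secs)
  {
    if (secs > 0.0)
      printf ("Downloaded: %.0fK in %.1fs (%.1f KB/s)\n",
              kbytes, secs, kbytes / secs);
    else
      printf ("Downloaded: %.0fK (elapsed time too short to measure)\n",
              kbytes);
  }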
Re: wget alpha: -r --spider, number of broken links
Stefan Melbinger ha scritto:
> I don't think that non-existent robots.txt files should be reported
> as broken links (as long as they are not referenced by some page).
> Current output when spanning two hosts (e.g.,
> -D www.domain1.com,www.domain2.com):
>
> Found 2 broken links.
> http://www.domain1.com/robots.txt referred by: (null)
> http://www.domain2.com/robots.txt referred by: (null)
>
> What do you think?

Hi Stefan,

Of course you're right. But you are also late ;-) In fact, this bug is
already fixed in the current version of wget, which you can retrieve
from our source code repository:

http://www.gnu.org/software/wget/wgetdev.html#development

Thank you very much for your report anyway.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                           http://www.tortonesi.com
University of Ferrara - Dept. of Eng.     http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool   http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux             http://www.deepspace6.net
Ferrara Linux User Group                  http://www.ferrara.linux.it
wget alpha: -r --spider, number of broken links
Hi again,

I don't think that non-existent robots.txt files should be reported as
broken links (as long as they are not referenced by some page). Current
output when spanning two hosts (e.g., -D www.domain1.com,www.domain2.com):

Found 2 broken links.
http://www.domain1.com/robots.txt referred by: (null)
http://www.domain2.com/robots.txt referred by: (null)

What do you think?

Greets,
Stefan
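The "(null)" referrer hints at the fix: entries that no page actually
linked to could simply be skipped in the report. A sketch of that idea
in C (hypothetical names, not wget's actual data structures):

  #include <stdio.h>

  struct broken_link
  {
    const char *url;
    const char *referrer;  /* NULL for implicit fetches like robots.txt */
  };

  /* Report only links that some page actually referred to, so that
     implicitly requested robots.txt files do not show up as broken.  */
  static void
  report_broken_links (const struct broken_link *links, int n)
  {
    int i;
    for (i = 0; i < n; i++)
      {
        if (links[i].referrer == NULL)
          continue;
        printf ("%s\n  referred by: %s\n",
                links[i].url, links[i].referrer);
      }
  }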
wget alpha: -r --spider downloading all files
Hi,

As you might have noticed, I was trying to use wget as a tool to check
for dead links on big websites. The combination of -r and --spider
works in the new alpha version; however, wget is still downloading ALL
files (no matter whether they can be parsed for further links or not),
instead of just getting the status response for files other than
text/html or application/xhtml+xml. I don't think this makes much
sense; the files are deleted anyway, and downloading a 300 MB video is
not useful if you just want to check links and see whether the video is
there at all.

Could somebody suggest a quick hack to disable the downloading of
non-parseable documents? I think it must be somewhere in the area of
http.c, around gethttp() or maybe http_loop() - unfortunately, my
knowledge of C and of this project wasn't enough to get any satisfying
result.

Any help is appreciated.

Greets,
Stefan
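The "quick hack" Stefan asks about would amount to a check like the
following (a sketch only; spider_should_fetch_body is a hypothetical
helper, not a patch against wget's actual http.c):

  #include <string.h>
  #include <stdbool.h>

  /* In spider mode, decide from the Content-Type response header
     whether the body can be parsed for further links and is
     therefore worth downloading at all.  */
  static bool
  spider_should_fetch_body (const char *content_type)
  {
    if (content_type == NULL)
      return false;
    return strncmp (content_type, "text/html", 9) == 0
           || strncmp (content_type, "application/xhtml+xml", 21) == 0;
  }

The hook would presumably sit in gethttp(), after the response headers
have been parsed: if --spider is active and this returns false, record
the status code and close the connection instead of reading the body.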