Re: wget alpha: -r --spider downloading all files

2006-07-20 Thread Stefan Melbinger

On 20.07.2006 15:22, Mauro Tortonesi wrote:

Stefan Melbinger wrote:

By the way, FTP transfers shouldn't be downloaded in full in this mode 
either.


Well, the semantics of --spider for FTP are still not very clear to me.

At the moment, I am considering whether to simply perform an FTP listing 
when --spider is given, or to disable --spider for FTP URLs altogether.


I think I see what you mean ...

Here's my 2 cents:

Starting a link check with FTP support (-r --spider --follow-ftp) on an 
HTML page, I would expect wget to follow FTP links but only to check 
these links for existence (without downloading them completely), not to 
actively search for other files on the FTP server...


On the other hand, if I started the same link check on an ftp:// URL, I 
_would_ expect wget to actively search through the directories, of course...
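
Roughly like this, just to sketch the two cases (a standalone toy in C; 
the names are made up, this is not wget code):

#include <stdio.h>

enum ftp_spider_action
{
  FTP_CHECK_EXISTENCE,  /* just verify the file is there, e.g. via a listing lookup */
  FTP_RECURSE_LISTING   /* walk the directory tree, as -r normally does */
};

/* Hypothetical decision: was the ftp:// URL given by the user, or merely
   discovered as a link on an HTML page? */
static enum ftp_spider_action
choose_ftp_action (int is_start_url)
{
  return is_start_url ? FTP_RECURSE_LISTING : FTP_CHECK_EXISTENCE;
}

int
main (void)
{
  printf ("ftp:// link found on an HTML page    -> %s\n",
          choose_ftp_action (0) == FTP_CHECK_EXISTENCE
            ? "check existence only" : "recurse");
  printf ("ftp:// URL given on the command line -> %s\n",
          choose_ftp_action (1) == FTP_RECURSE_LISTING
            ? "recurse through directories" : "check existence only");
  return 0;
}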


Is that what you meant?

Greets


Re: wget alpha: -r --spider downloading all files

2006-07-20 Thread Mauro Tortonesi

Stefan Melbinger wrote:

By the way, FTP transfers shouldn't be downloaded in full in this mode 
either.


Well, the semantics of --spider for FTP are still not very clear to me.

At the moment, I am considering whether to simply perform an FTP listing 
when --spider is given, or to disable --spider for FTP URLs altogether.


What do you think?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: wget alpha: -r --spider downloading all files

2006-07-20 Thread Stefan Melbinger

On 20.07.2006 14:32, Mauro Tortonesi wrote:

Stefan Melbinger wrote:
As you might have noticed, I was trying to use wget as a tool to check 
for dead links on big websites. The combination of -r and --spider works 
in the new alpha version; however, wget is still downloading ALL files 
(whether or not they can be parsed for further links), instead of just 
getting the status response for files other than text/html or 
application/xhtml+xml.


I don't think this makes much sense; the files are deleted anyway, and 
downloading a 300 MB video is not useful if you just want to check links 
and see whether the video is there at all.



You're absolutely right, Stefan. I've just started working on it.


It's really great how much time you're investing in this, thank you!

By the way, FTP transfers shouldn't be downloaded in full in this mode 
either.


Greets


Re: wget alpha: -r --spider downloading all files

2006-07-20 Thread Mauro Tortonesi

Stefan Melbinger wrote:

Hi,

As you might have noticed, I was trying to use wget as a tool to check 
for dead links on big websites. The combination of -r and --spider works 
in the new alpha version; however, wget is still downloading ALL files 
(whether or not they can be parsed for further links), instead of just 
getting the status response for files other than text/html or 
application/xhtml+xml.


I don't think this makes much sense; the files are deleted anyway, and 
downloading a 300 MB video is not useful if you just want to check links 
and see whether the video is there at all.


Could somebody suggest a quick hack to disable the downloading of 
non-parseable documents? I think it must be somewhere in the area of 
http.c, around gethttp() or maybe http_loop(); unfortunately, my 
knowledge of C and of this project wasn't enough to get a satisfying 
result.


You're absolutely right, Stefan. I've just started working on it.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


wget alpha: -r --spider, elapsed time

2006-07-20 Thread Stefan Melbinger

Hello there,

I've got the most recent version of wget running now (SVN trunk), and I 
think there is another problem:


When the report about broken links is printed at the end, the duration 
of the operation is always 0 seconds:


"FINISHED --11:56:10--
Downloaded: 62 files, 948K in 0s (6470 GB/s)"

But I guess you know that already ;)
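
For what it's worth, the nonsense rate looks like an artifact of how the 
elapsed time is measured and then divided; a standalone sketch (not the 
actual wget code) of timing with sub-second resolution and guarding the 
division:

#include <stdio.h>
#include <sys/time.h>   /* gettimeofday (POSIX) */

static double
elapsed_seconds (const struct timeval *start, const struct timeval *end)
{
  return (double) (end->tv_sec - start->tv_sec)
         + (end->tv_usec - start->tv_usec) / 1e6;
}

int
main (void)
{
  struct timeval start, end;
  gettimeofday (&start, NULL);
  /* ... the downloads would happen here ... */
  gettimeofday (&end, NULL);

  double secs  = elapsed_seconds (&start, &end);
  double bytes = 948.0 * 1024;   /* the 948K from the report above */

  if (secs > 0.0)
    printf ("Downloaded %.0f bytes in %.2fs (%.1f KB/s)\n",
            bytes, secs, bytes / secs / 1024);
  else
    printf ("Downloaded %.0f bytes (elapsed time too short to measure)\n",
            bytes);
  return 0;
}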

Greets, Stefan

PS: Does nobody have an idea how to prevent the downloading of complete 
files when they are not text/html? And am I the only one who thinks this 
would be necessary?


Re: wget alpha: -r --spider, number of broken links

2006-07-20 Thread Mauro Tortonesi

Stefan Melbinger wrote:

I don't think that non-existent robots.txt files should be reported as 
broken links (as long as they are not referenced by some page).


Current output when spanning two hosts (e.g., -D 
www.domain1.com,www.domain2.com):


-
Found 2 broken links.

http://www.domain1.com/robots.txt referred by:
(null)
http://www.domain2.com/robots.txt referred by:
(null)
-

What do you think?


Hi Stefan,

Of course you're right. But you are also late ;-)

In fact, this bug is already fixed in the current version of wget, which 
you can retrieve from our source code repository:

http://www.gnu.org/software/wget/wgetdev.html#development

Thank you very much for your report anyway.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


wget alpha: -r --spider, number of broken links

2006-07-20 Thread Stefan Melbinger

Hi again,

I don't think that non-existent robots.txt files should be reported as 
broken links (as long as they are not referenced by some page).


Current output when spanning two hosts (e.g., -D 
www.domain1.com,www.domain2.com):


-
Found 2 broken links.

http://www.domain1.com/robots.txt referred by:
(null)
http://www.domain2.com/robots.txt referred by:
(null)
-
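
One way to handle it, I suppose, would be to skip entries that have no 
referrer when the report is printed, since those requests (like 
robots.txt) were generated by wget itself. A standalone toy sketch, not 
wget's actual data structures:

#include <stdio.h>

struct broken_link          /* hypothetical record, not wget's data type */
{
  const char *url;
  const char *referrer;     /* NULL when wget requested the URL on its own */
};

static void
print_broken_links (const struct broken_link *links, int n)
{
  int reported = 0;
  for (int i = 0; i < n; i++)
    if (links[i].referrer)  /* skip internally generated requests */
      reported++;

  printf ("Found %d broken link%s.\n\n", reported, reported == 1 ? "" : "s");
  for (int i = 0; i < n; i++)
    if (links[i].referrer)
      printf ("%s referred by:\n    %s\n", links[i].url, links[i].referrer);
}

int
main (void)
{
  struct broken_link links[] = {
    { "http://www.domain1.com/robots.txt", NULL },              /* skipped */
    { "http://www.domain1.com/old.html", "http://www.domain1.com/" },
  };
  print_broken_links (links, 2);
  return 0;
}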

What do you think?

Greets,
  Stefan



wget alpha: -r --spider downloading all files

2006-07-20 Thread Stefan Melbinger

Hi,

As you might have noticed, I was trying to use wget as a tool to check 
for dead links on big websites. The combination of -r and --spider works 
in the new alpha version; however, wget is still downloading ALL files 
(whether or not they can be parsed for further links), instead of just 
getting the status response for files other than text/html or 
application/xhtml+xml.


I don't think this makes much sense; the files are deleted anyway, and 
downloading a 300 MB video is not useful if you just want to check links 
and see whether the video is there at all.


Could somebody suggest a quick hack to disable the downloading of 
non-parseable documents? I think it must be somewhere in the area of 
http.c, around gethttp() or maybe http_loop(); unfortunately, my 
knowledge of C and of this project wasn't enough to get a satisfying 
result.
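
A rough standalone sketch of the check I have in mind (the helper name is 
made up; this is not the actual wget source): in --spider mode, look at 
the Content-Type response header and only fetch the body if the document 
can be parsed for further links.

#include <stdio.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* Hypothetical helper: in --spider mode, only documents that can be parsed
   for further links are worth fetching; everything else only needs its
   status checked. */
static int
spider_should_fetch_body (const char *content_type)
{
  if (!content_type)
    return 0;   /* no Content-Type header: just record the status */
  return strncasecmp (content_type, "text/html", 9) == 0
      || strncasecmp (content_type, "application/xhtml+xml", 21) == 0;
}

int
main (void)
{
  const char *samples[] = { "text/html; charset=UTF-8",
                            "video/mpeg",
                            "application/xhtml+xml" };
  for (size_t i = 0; i < sizeof samples / sizeof *samples; i++)
    printf ("%-28s -> %s\n", samples[i],
            spider_should_fetch_body (samples[i])
              ? "fetch and parse for links" : "status check only");
  return 0;
}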


Any help is appreciated.
Greets, Stefan