RE: Option to save unfollowed links
From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 01, 2003 9:20 PM

> Tony Lewis [EMAIL PROTECTED] writes:
>
>> Would something like the following be what you had in mind?
>>
>>     301 http://www.mysite.com/
>>     200 http://www.mysite.com/index.html
>>     200 http://www.mysite.com/followed.html
>>     401 http://www.mysite.com/needpw.html
>>     --- http://www.othersite.com/notfollowed.html
>
> Yes, with the possible extensions of file name where the link was
> saved, sensible status for non-HTTP (currently FTP) links, etc.

Also: the URL which contained the first encountered link to that object, all URLs pointing to that page, number of retries used, total time needed, mean download bandwidth... lots of interesting data could be logged that way. The collection of desired fields should definitely be configurable at runtime.

Heiko

-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax
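The runtime-configurable field selection Heiko asks for could be post-processed along these lines. This is only a sketch under stated assumptions: no such option exists in wget, and the field names (`status`, `url`, `retries`, `time`) are hypothetical.

```python
# Sketch of a runtime-selectable field list for per-URL log records.
# The field names below are hypothetical; wget defines no such option.
FIELDS = ["status", "url", "retries"]  # e.g. chosen via a command-line flag

def format_record(record, fields=FIELDS):
    """Emit one log line containing only the requested fields,
    using '---' for fields that do not apply to this URL."""
    return " ".join(str(record.get(f, "---")) for f in fields)

record = {"status": 200, "url": "http://www.mysite.com/index.html",
          "retries": 0, "time": 1.3}
print(format_record(record))
# -> 200 http://www.mysite.com/index.html 0
```

Fields not in the selected list (here, `time`) are simply omitted from the output line, which keeps the report grep-friendly.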
Re: Option to save unfollowed links
[ Added Cc to [EMAIL PROTECTED] ]

Tony Lewis [EMAIL PROTECTED] writes:

> The following patch adds a command line option to save any links that
> are not followed by wget. For example:
>
>     wget http://www.mysite.com --mirror --unfollowed-links=mysite.links
>
> will result in mysite.links containing all URLs that are references
> to other sites in links on mysite.com.

I'm curious: what is the use case for this? Why would you want to save the unfollowed links to an external file?
Re: Option to save unfollowed links
Hrvoje Niksic wrote:

> I'm curious: what is the use case for this? Why would you want to
> save the unfollowed links to an external file?

I use this to determine what other websites a given website refers to. For example:

    wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ --mirror -np --unfollowed-links=hayward.out

By looking at hayward.out, I have a list of all the websites that the directory refers to. When I use this file, I sort it and throw away the Google and DMOZ links. Everything else is supposed to be something interesting about Hayward.

Tony
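The post-processing step Tony describes (sort, then discard the Google and DMOZ links) could be sketched as below, assuming the proposed --unfollowed-links option writes one URL per line:

```python
# Sketch of Tony's post-processing: dedupe, sort, and drop Google/DMOZ
# links from the saved unfollowed-links file. Assumes one URL per line,
# the format the proposed --unfollowed-links patch would write.
def filter_links(lines, drop=("google.com", "dmoz.org")):
    urls = [u.strip() for u in lines if u.strip()]
    kept = [u for u in urls if not any(d in u for d in drop)]
    return sorted(set(kept))

sample = [
    "http://www.hayward-ca.gov/",
    "http://directory.google.com/Top/",
    "http://dmoz.org/Regional/",
    "http://www.hayward-ca.gov/",
]
print(filter_links(sample))
# -> ['http://www.hayward-ca.gov/']
```

Everything that survives the filter is, as Tony puts it, supposed to be something interesting about Hayward.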
Re: Option to save unfollowed links
Tony Lewis [EMAIL PROTECTED] writes:

> Hrvoje Niksic wrote:
>
>> I'm curious: what is the use case for this? Why would you want to
>> save the unfollowed links to an external file?
>
> I use this to determine what other websites a given website refers
> to. For example:
>
>     wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ --mirror -np --unfollowed-links=hayward.out
>
> By looking at hayward.out, I have a list of all the websites that the
> directory refers to. When I use this file, I sort it and throw away
> the Google and DMOZ links. Everything else is supposed to be
> something interesting about Hayward.

I see. Hmm... if you have to post-process the list anyway, wouldn't it be more useful to have a list of *all* encountered URLs? It might be nice to accompany this output with the exit statuses, so people can easily grep for 404's.

A comprehensive reporting facility has often been requested. Perhaps something should be done about it for the next release.
Re: Option to save unfollowed links
Tony Lewis [EMAIL PROTECTED] writes:

> Would something like the following be what you had in mind?
>
>     301 http://www.mysite.com/
>     200 http://www.mysite.com/index.html
>     200 http://www.mysite.com/followed.html
>     401 http://www.mysite.com/needpw.html
>     --- http://www.othersite.com/notfollowed.html

Yes, with the possible extensions of file name where the link was saved, sensible status for non-HTTP (currently FTP) links, etc.