RE: Option to save unfollowed links

2003-10-02 Thread Herold Heiko
 From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, October 01, 2003 9:20 PM
 
 Tony Lewis [EMAIL PROTECTED] writes:
 
  Would something like the following be what you had in mind?
 
  301 http://www.mysite.com/
  200 http://www.mysite.com/index.html
  200 http://www.mysite.com/followed.html
  401 http://www.mysite.com/needpw.html
  --- http://www.othersite.com/notfollowed.html
 
 Yes, with the possible extensions of file name where the link was
 saved, sensible status for non-HTTP (currently FTP) links, etc.
 

URL which contained the first encountered link to that object, all URLs
pointing to that page, number of retries used, total time needed, mean
download bandwidth...
Lots of interesting data could be logged that way. Collection of desired
fields should definitely be configurable at runtime.

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax


Re: Option to save unfollowed links

2003-10-01 Thread Hrvoje Niksic
[ Added Cc to [EMAIL PROTECTED] ]

Tony Lewis [EMAIL PROTECTED] writes:

 The following patch adds a command line option to save any links
 that are not followed by wget. For example:

 wget http://www.mysite.com --mirror --unfollowed-links=mysite.links

 will result in mysite.links containing all URLs that are references
 to other sites in links on mysite.com.

I'm curious: what is the use case for this?  Why would you want to
save the unfollowed links to an external file?


Re: Option to save unfollowed links

2003-10-01 Thread Tony Lewis
Hrvoje Niksic wrote:

 I'm curious: what is the use case for this?  Why would you want to
 save the unfollowed links to an external file?

I use this to determine what other websites a given website refers to.

For example:
wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ \
  --mirror -np --unfollowed-links=hayward.out

By looking at hayward.out, I have a list of all websites that the directory
refers to. When I use this file, I sort it and throw away the Google and
DMOZ links. Everything else is supposed to be something interesting about
Hayward.
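The post-processing step described above (sort, then drop the Google and
DMOZ links) can be sketched as a one-line pipeline. The file names here
follow the example command; the exact filter patterns are an assumption:

```shell
# Hypothetical post-processing of the --unfollowed-links output:
# hayward.out is assumed to hold one URL per line. Sort and
# de-duplicate, then discard Google and DMOZ links, keeping the rest.
sort -u hayward.out \
  | grep -viE 'google\.com|dmoz\.org' \
  > hayward-interesting.txt
```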

Tony



Re: Option to save unfollowed links

2003-10-01 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Hrvoje Niksic wrote:

 I'm curious: what is the use case for this?  Why would you want to
 save the unfollowed links to an external file?

 I use this to determine what other websites a given website refers to.

 For example:
 wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ \
   --mirror -np --unfollowed-links=hayward.out

 By looking at hayward.out, I have a list of all websites that the
 directory refers to. When I use this file, I sort it and throw away
 the Google and DMOZ links. Everything else is supposed to be
 something interesting about Hayward.

I see.  Hmm.. if you have to post-process the list anyway, wouldn't it
be more useful to have a list of *all* encountered URLs?  It might be
nice to accompany this output with the exit statuses, so people can
easily grep for 404's.
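Assuming a report in the "STATUS URL" line format Tony proposed earlier
in the thread (the file name wget-report.log is hypothetical), the
broken links fall out with a one-liner:

```shell
# Print the URLs of all entries whose status code is 404.
grep '^404 ' wget-report.log | cut -d' ' -f2
```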

A comprehensive reporting facility has often been requested.  Perhaps
something should be done about it for the next release.
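As a sketch of what such a reporting facility might enable, a summary by
status code over the same hypothetical "STATUS URL" log format is a few
lines of awk:

```shell
# Count how many URLs ended up in each status, assuming one
# "STATUS URL" pair per line in wget-report.log (hypothetical name).
awk '{count[$1]++} END {for (s in count) print s, count[s]}' \
  wget-report.log | sort
```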



Re: Option to save unfollowed links

2003-10-01 Thread Hrvoje Niksic
Tony Lewis [EMAIL PROTECTED] writes:

 Would something like the following be what you had in mind?

 301 http://www.mysite.com/
 200 http://www.mysite.com/index.html
 200 http://www.mysite.com/followed.html
 401 http://www.mysite.com/needpw.html
 --- http://www.othersite.com/notfollowed.html

Yes, with the possible extensions of file name where the link was
saved, sensible status for non-HTTP (currently FTP) links, etc.