Run the downloaded file through this command (all on one line):

cat index.html | sed 's/<a href="/\n<a href="/g' | sed -e '/^[^<a href]/d' | sed 
's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'


The first part prints the HTML file to the standard output.
The second takes that output and puts a line feed in front of each hyperlink.
The third deletes any line that doesn't start with <a href (i.e., isn't a 
hyperlink).  Strictly speaking, [^<a href] is a character class, so this 
deletes every line whose first character isn't one of <, a, h, r, e, f, or 
space.
The fourth gets rid of the initial line; it survives the third step because 
it begins with <, which is one of the characters in that class.
The fifth removes the initial part of the hyperlink (i.e., before the URL 
starts).
The sixth removes everything on each line after the " that closes the URL.
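As a quick sanity check, here is the pipeline run against a made-up two-link page (the file name /tmp/index.html and the foo.com URLs are just placeholders).  This assumes GNU sed, which interprets \n in the replacement text as a newline; BSD sed would need a literal newline there instead.

```shell
# Make a tiny sample page to run the pipeline against (contents made up).
printf '%s\n' '<html><head></head><body><a href="http://www.foo.com/a">a</a><a href="http://www.foo.com/b">b</a></body></html>' > /tmp/index.html

# The pipeline exactly as above, pointed at the sample file.
cat /tmp/index.html | sed 's/<a href="/\n<a href="/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'
# Prints a blank line first (the emptied, not deleted, <html> line), then:
# http://www.foo.com/a
# http://www.foo.com/b
```

Note the leading blank line: the fourth sed empties the initial line but does not delete it.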

wget can write the downloaded page to standard output with -O - (add -q to 
suppress the progress messages), so the download can be piped straight into 
sed; done as two separate steps it is:
First, wget http://www.foo.com
Then the above command, assuming index.html is the downloaded file.
It works for www.google.com; I haven't tested anything else.
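For comparison, the same extraction can be done without the line-splitting trick, assuming GNU grep (whose -o option prints only the matched part of each line, one match per line).  The sample file and URLs below are made up.

```shell
# Alternative sketch using GNU grep -o: each <a href="..."> match comes out
# on its own line, then one sed strips the surrounding markup.
printf '%s\n' '<html><body><a href="http://www.foo.com/a">a</a><a href="http://www.foo.com/b">b</a></body></html>' > /tmp/sample.html

grep -o '<a href="[^"]*"' /tmp/sample.html | sed 's/<a href="//;s/"$//'
# Prints:
# http://www.foo.com/a
# http://www.foo.com/b
```

This avoids the leftover blank line, since grep -o only ever emits the matches.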



________________________________
From: Mark - augustine.com <[email protected]>
To: [email protected]
Sent: Thursday, February 26, 2009 10:25:01 AM
Subject: [Bug-wget] Simple web page listing

Hello,
    I'm looking for a way to use wget to report a list of URLs of all the 
web pages on a given web site.  Not interested in the actual code, or 
content, just the names of the web pages.  Also I would prefer it in a very 
simple format, just the URLs separated by return characters.
e.g.
-------------
http://www.foo.com/a
http://www.foo.com/b
http://www.foo.com/c
http://www.foo.com/d
-------------
    Ideas?  Another program/service that offers this already?
    Please CC your reply to me.  Thank you!
Best Regards,
Mark Mahon


