Run the output file through this command (all on one line):

cat index.html | sed 's/<a href="/\n<a href="/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'
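To see what it does, here is the pipeline run on a toy page (hypothetical input; the \n trick in the second part needs GNU sed, which treats \n in the replacement as a newline):

  $ cat index.html
  <html><body><a href="http://www.foo.com/a">A</a> <a href="http://www.foo.com/b">B</a></body></html>
  $ cat index.html | sed 's/<a href="/\n<a href="/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'

  http://www.foo.com/a
  http://www.foo.com/b

The leading blank line in the output is what's left of the <html> line after the fourth part empties it rather than deleting it.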
The first part prints the html file to the standard output.
The second takes that output and puts a line feed in front of each hyperlink.
The third deletes any line that doesn't start with <a href (i.e., isn't a hyperlink). Strictly speaking, [^<a href] is a character class, so it only deletes lines whose first character is none of <, a, space, h, r, e, or f.
The fourth blanks the initial <html> line (the third doesn't catch it because that line starts with <, which is in the character class; adding this part was easier than troubleshooting).
The fifth removes the initial part of the hyperlink (i.e., the part before the URL starts).
The sixth removes everything on each line after the " that closes the URL.

It doesn't seem possible to get wget to output straight to sed, but it's still only two steps. First:

wget http://www.foo.com

Then run the above command, assuming index.html is the downloaded file. It works for www.google.com... haven't tested anything else.
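For what it's worth, wget's -O option accepts - as the output file and writes the page to standard output, so the whole thing should collapse into one pipe. A sketch, assuming GNU wget, untested:

wget -q -O - http://www.foo.com | sed 's/<a href="/\n<a href="/g' | sed -e '/^[^<a href]/d' | sed 's/<html>.*//' | sed 's/<a href="//' | sed 's/".*//'

And since the original question asked for every page on a site rather than a single page, recursive retrieval plus scraping wget's own log might get closer. Another sketch, with the caveats that the log format differs between wget versions and that --spider's behavior under -r also varies (swapping in --delete-after is an alternative):

wget -r --spider http://www.foo.com 2>&1 | grep -Eo 'http://[^ ]+' | sort -u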
________________________________
From: Mark - augustine.com <[email protected]>
To: [email protected]
Sent: Thursday, February 26, 2009 10:25:01 AM
Subject: [Bug-wget] Simple web page listing

Hello,

I'm looking for a way to use wget to report a list of URLs of all the web pages on a given web site. Not interested in the actual code, or content, just the names of the web pages. Also I would prefer it in a very simple format, just the URLs separated by return characters, e.g.

-------------
http://www.foo.com/a
http://www.foo.com/b
http://www.foo.com/c
http://www.foo.com/d
-------------

Ideas? Another program/service that offers this already?

Please CC your reply to me. Thank you!

Best Regards,
Mark Mahon