I'm currently trying to write a script that will fetch all the files a webpage needs to display correctly, followed by all the intra-site pages and so forth, in order to retrieve one of the many sites I've got jumbled up on my webspace. After I started writing it, someone introduced me to wget, but I'm continuing anyway because it seems like fun (and that statement is the first step on a slippery slope :P).
My script thus far reads:

    import re
    import urllib

    source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
    next = re.findall('src=".*html"', source.read())
    print next

This returns the following:

    ['src="nothing_left.html"', 'src="testindex.html"', 'src="nothing_right.html"']

This is a good start (and it took me long enough! :P), but ideally the re would strip out the 'src=' as well. Does anybody with more re-fu than me know how I could do that? Incidentally, feel free to use that page as an example.

In addition, I am aware that this will need to be adjusted and expanded later on, but it's a start.

Thanks in advance,

Dan

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
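[One way to drop the 'src=' is a capturing group: re.findall returns only the text inside the parentheses when the pattern contains a group. A non-greedy .*? also keeps the match from running past the first closing quote if two attributes ever share a line. A minimal sketch on a made-up snippet of the sort of HTML the script fetches:]

    import re

    # Hypothetical sample input, standing in for source.read().
    html = '<frame src="nothing_left.html"> <frame src="testindex.html">'

    # The parentheses form a capturing group, so findall returns only
    # the filename between the quotes -- the src=" and " are dropped.
    # .*? is non-greedy: it stops at the first closing quote.
    links = re.findall(r'src="(.*?)"', html)
    print(links)  # ['nothing_left.html', 'testindex.html']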