I'm currently trying to write a script that will fetch all the files a webpage needs to display correctly, followed by all the intra-site pages and so forth, in order to retrieve one of the many sites I've got jumbled up on my webspace. After I started writing it, someone introduced me to wget, but I'm continuing anyway because it seems like fun (and that statement is the first step on a slippery slope :P).
My script thus far reads:

    import re
    import urllib

    source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
    next = re.findall('src=".*html"', source.read())
    print next

This returns the following:

    ['src="nothing_left.html"', 'src="testindex.html"', 'src="nothing_right.html"']

This is a good start (and it took me long enough! :P), but ideally the re would strip out the 'src=' as well. Does anybody with more re-fu than me know how I could do that? Incidentally, feel free to use that page as an example.

In addition, I am aware that this will need to be adjusted and expanded later on, but it's a start.

Thanks in advance,

Dan

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
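[One way to drop the 'src=' is a capturing group: re.findall returns only the text inside the parentheses when the pattern contains a group. A non-greedy .*? also keeps the match from running past the first closing quote if two attributes ever share a line. A minimal sketch on a made-up snippet of the sort of HTML the script fetches:]

    import re

    # Hypothetical sample input, standing in for source.read().
    html = '<frame src="nothing_left.html"> <frame src="testindex.html">'

    # The parentheses form a capturing group, so findall returns only
    # the filename between the quotes -- the src=" and " are dropped.
    # .*? is non-greedy: it stops at the first closing quote.
    links = re.findall(r'src="(.*?)"', html)
    print(links)  # ['nothing_left.html', 'testindex.html']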