Daniel Watkins wrote:
>I'm currently trying to write a script that will get all the files
>necessary for a webpage to display correctly, followed by all the
>intra-site pages and so forth, in order to try and retrieve one of the
>many sites I have got jumbled up on my webspace. After starting the
>writing, someone introduced me to wget, but I'm continuing this because
>it seems like fun (and that statement is the first step on a slippery
>slope :P).
>
>My script thus far reads:
>"""
>import re
>import urllib
>
>source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
>next = re.findall('src=".*html"', source.read())
>print next
>"""
>
>This returns the following:
>"['src="nothing_left.html"', 'src="testindex.html"',
>'src="nothing_right.html"']"
>
>This is a good start (and it took me long enough! :P), but, ideally, the
>re would strip out the 'src=' as well. Does anybody with more re-fu than
>me know how I could do that?
>
>Incidentally, feel free to use that page as an example. In addition, I
>am aware that this will need to be adjusted and expanded later on, but
>it's a start.
>
>Thanks in advance,
>Dan
>
>_______________________________________________
>Tutor maillist - Tutor@python.org
>http://mail.python.org/mailman/listinfo/tutor

Well, you don't necessarily need re-fu to do that:
import re
import urllib

source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
next = [i[4:] for i in re.findall('src=".*html"', source.read())]
print next

gives you

['"nothing_left.html"', '"testindex.html"', '"nothing_right.html"']

And if you wanted it to specifically take out "src=" only, I'm sure you
could tailor it to do something with i[i.index(...):] instead.

--
Email: singingxduck AT gmail DOT com
AIM: singingxduck
Programming Python for the fun of it.
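That said, a capturing group can do the stripping inside the regex itself: when the pattern contains one group, re.findall returns only what the group matched, so the src=" prefix and the trailing quote never make it into the result. The non-greedy .*? also keeps a single match from running across two src attributes on the same line, which the greedy .* above could do. A minimal sketch, tried against a made-up HTML snippet rather than the real page:

```python
import re

# Hypothetical stand-in for source.read(); in the real script this
# string would come from urllib.urlopen(...).read() as above.
html = ('<frame src="nothing_left.html">'
        '<frame src="testindex.html">'
        '<frame src="nothing_right.html">')

# One capturing group -> findall returns only the group's contents,
# already stripped of src=" and the closing quote.
links = re.findall(r'src="(.*?\.html)"', html)
print(links)
```

which prints ['nothing_left.html', 'testindex.html', 'nothing_right.html'] — no slicing needed afterwards.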