On 8/24/05, Daniel Watkins <[EMAIL PROTECTED]> wrote:
> I'm currently trying to write a script that will get all the files
> necessary for a webpage to display correctly, followed by all the
> intra-site pages and such forth, in order to try and retrieve one of the
> many sites I have got jumbled up on my webspace. After starting the
> writing, someone introduced me to wget, but I'm continuing this because
> it seems like fun (and that statement is the first step on a slippery
> slope :P).
>
> My script thus far reads:
> """
> import re
> import urllib
>
> source = urllib.urlopen("http://www.aehof.org.uk/real_index2.html")
> next = re.findall('src=".*html"', source.read())
> print next
> """
>
> This returns the following:
> "['src="nothing_left.html"', 'src="testindex.html"',
> 'src="nothing_right.html"']"
>
> This is a good start (and it took me long enough! :P), but, ideally, the
> re would strip out the 'src=' as well. Does anybody with more re-fu than
> me know how I could do that?
>
> Incidentally, feel free to use that page as an example. In addition, I
> am aware that this will need to be adjusted and expanded later on, but
> it's a start.
>
> Thanks in advance,
> Dan
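[Editor's note, not part of the original thread: the usual answer to the question is a capturing group. `re.findall()` returns only the text captured by the parentheses, so wrapping the filename part of the pattern in `(...)` drops the `src="` prefix. The sketch below is written for modern Python 3 and uses a literal HTML snippet standing in for the fetched page; it also switches the greedy `.*` to the non-greedy `.*?` so each match stops at the first closing quote.]

```python
import re

# Hypothetical stand-in for source.read() on the page from the thread.
html = ('<frame src="nothing_left.html"> '
        '<frame src="testindex.html"> '
        '<frame src="nothing_right.html">')

# The capturing group (.*?\.html) is what findall() returns,
# so the surrounding src="..." is matched but not included.
links = re.findall(r'src="(.*?\.html)"', html)
print(links)  # ['nothing_left.html', 'testindex.html', 'nothing_right.html']
```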
You may wish to have a look at the Beautiful Soup module,
http://www.crummy.com/software/BeautifulSoup/

Luis.

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
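[Editor's note, not part of the original thread: the point of the Beautiful Soup suggestion is that an HTML parser is more robust than a regex for pulling out attributes. A stdlib-only sketch of the same idea, using `html.parser` from modern Python 3 (Beautiful Soup itself is a third-party install):]

```python
from html.parser import HTMLParser

# Collect every src attribute seen while parsing, whatever the tag.
class SrcCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        for name, value in attrs:
            if name == "src":
                self.sources.append(value)

parser = SrcCollector()
parser.feed('<frame src="nothing_left.html"><img src="logo.png">')
print(parser.sources)  # ['nothing_left.html', 'logo.png']
```

Unlike the regex, this keeps working when attributes are reordered, single-quoted, or split across lines.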