I've run into a bit of trouble with my spider script. So far it can retrieve all of the data on the website that is contained in standard HTML, downloading the jpg, gif, and bmp images linked from those pages (a restriction set only by my not having defined anything further). However, I have hit a problem with one link that simply points to a folder (www.aehof.org.uk/forum) which contains a phpBB forum.
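To illustrate what I mean by the restriction: the 'default_defs' pattern in the script only knows about html, jpg, gif, and bmp, and widening it is just a matter of adding another alternative. This is a small sketch of that (the png branch is my own addition, not something the script currently has):

```python
import re

# The pattern from the script matches the tail of href="..." / src="..."
# attributes; the '|c=".*png"' alternative at the end is an assumed
# extension of it, not part of the original.
defs = '[cf]=".*html"|c=".*jpg"|c=".*gif"|c=".*bmp"|c=".*png"'

sample = '<a href="about.html">About</a> <img src="logo.png">'
matches = re.findall(defs, sample)
# Strip the leading 'f="'/'c="' and the closing quote, as the script does.
links = [m[3:-1] for m in matches]
print(links)
```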
I've attempted to use 'dircache', but couldn't find a way to make it understand web addresses; then again, I may not have hit upon the right combination of syntax, so I may be mistaken. I also considered 'os', but it appears to require a particular operating system to be specified, which is a direction I'd prefer not to take unless I have to. In any case, the error messages I received from 'dircache' traced back into 'os', so it is unlikely it would have been suitable for the purpose anyway.

The variables at the top of the script will later be cleaned up so they are defined from command-line input; they're just there while I sort out the script itself. Succinctly: I'm looking for an addition to the following script (ideally under 'default_defs') which will let me copy a directly-linked folder and all of its contents. I hope I have expressed myself in a useful manner, and any help would be appreciated, although I would prefer no comments on the script in general, as I would quite like to develop it myself as a learning project (this being my first proper script).
Cheers muchly,
Dan

spider.py:

import re
import urllib
import distutils.dir_util

site = "http://www.aehof.org.uk"
localdir = "/home/daniel/test/"
default_defs = '[cf]=".*html"|c=".*jpg"|c=".*gif"|c=".*bmp"'

# Make sure the site address ends with a slash.
if not re.search(r'/\Z', site):
    site = site + "/"

def get_page_items(source, site=site, defs=default_defs):
    # Fetch one page and return it plus every link on it matching defs.
    next = []
    text = urllib.urlopen(site + source).read()
    for i in re.findall(defs, text):
        # Strip the leading 'f="'/'c="' and the trailing quote.
        next.append(i[3:-1])
    return [source] + next

def get_list(source="index.html"):
    # Walk outward from the start page, collecting each item once.
    items = []
    next = get_page_items(source)
    for i in next:
        if i not in items:
            items.append(i)
            next.extend(get_page_items(i))
    return items

def copy_items():
    items = get_list()
    for i in items:
        # Create any local directories needed, then save the file.
        distutils.dir_util.create_tree(localdir, items)
        original = urllib.urlopen(site + i)
        local = open(localdir + i, 'w')
        local.write(original.read())
        local.close()

copy_items()

_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor