I've run into a bit of trouble with my spider script. Thus far, it is
able to retrieve all of the data on the website that is contained
within standard HTML, downloading the jpg, gif and bmp images linked
from those pages (that restriction being set only by my not having
defined any further types). However, it stumbles on one link that
simply points to a folder (www.aehof.org.uk/forum) within which is
contained a phpBB forum.

I've attempted to use 'dircache' but couldn't find a way for it to
understand web addresses. However, I may not have hit upon the right
combination of syntax, so I may be mistaken. I also considered 'os',
but it appears to be tied to a particular operating system, which is a
direction I'd prefer not to take unless I have to. In addition, the
error messages I received from 'dircache' traced back into 'os', so it
seems unlikely that either would have been suitable for the purpose.
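
As far as I can tell, those modules only ever see the local
filesystem, so over HTTP the nearest equivalent seems to be fetching
the folder URL itself and scraping the links out of whatever index
page the server sends back, along these lines (the href pattern is
only illustrative):

    import re
    import urllib

    # Asking for the folder returns an HTML page (here the phpBB
    # index), not a directory listing, so links must be scraped out.
    page = urllib.urlopen("http://www.aehof.org.uk/forum/").read()
    print re.findall('href="([^"]*)"', page)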

The variables at the top of the script will later be cleaned up so
that they are taken from command-line input; they're only there in
that form while I sort out the actual script itself.
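
For what it's worth, I imagine the command-line version starting out
as simply as this (a sketch using sys.argv; getopt would be the
fuller route):

    import sys

    # e.g.  python spider.py http://www.aehof.org.uk /home/daniel/test/
    site = sys.argv[1]
    localdir = sys.argv[2]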

In short, I'm looking for an addition to the following script (ideally
under 'default_defs') which will allow me to copy a directly-linked
folder and all its contents; a rough, untested sketch of the sort of
thing I have in mind follows the script.

I hope I have expressed myself clearly, and any help would be
appreciated, although I would prefer no comments on the script in
general, as I would quite like to develop it myself as a learning
project (this being my first proper script).

Cheers muchly,
Dan

spider.py
"""
import re
import urllib
import distutils.dir_util

site = "http://www.aehof.org.uk"
localdir = "/home/daniel/test/"
# [^"]* keeps each match within a single attribute value; a greedy .*
# would run two matching links on one line together into one match.
default_defs = '[cf]="[^"]*html"|c="[^"]*jpg"|c="[^"]*gif"|c="[^"]*bmp"'

# Make sure the site address ends with a slash.
if not site.endswith("/"):
    site = site + "/"

def get_page_items (source, site = site, defs = default_defs):
    # Fetch one page and return it along with every link on it that
    # matches the definitions.
    found = []
    text = urllib.urlopen(site + source).read()

    for i in re.findall(defs, text):
        # Each match looks like 'f="foo.html"'; strip the leading
        # attribute fragment and the trailing quote.
        found.append(i[3:-1])

    return [source] + found
        
def get_list (source="index.html"):
    # Breadth-first crawl: the queue grows as new pages are fetched,
    # and the loop runs until it catches up with the end of the list.
    items = []
    queue = get_page_items (source)
    for i in queue:
        if i not in items:
            items.append (i)
            queue.extend (get_page_items (i))
    return items

def copy_items ():
    items = get_list()
    # Create the whole local directory tree once, up front, rather
    # than once per file.
    distutils.dir_util.create_tree(localdir, items)
    for i in items:
        original = urllib.urlopen(site + i)
        # 'wb' so the jpg/gif/bmp files are not mangled as text.
        local = open(localdir + i, 'wb')
        local.write(original.read())
        local.close()

copy_items()
"""
