On Tue, Apr 17, 2001 at 01:47:33AM -0700, David A. Desrosiers wrote:
>       How about implementing a STAYONDOMAIN/STAYOFFDOMAIN as well? In
> the literal terms, a "host" is the FQDN, including elements of the URI.
> The "domain" is simply the last two portions of the FQDN starting from the
> right, i.e. "http://www.wired.com/foo/bar.html" is a "host", while
> "wired.com" is a "domain". Subtle difference, but still important for our
> needs.
>       The STAYONDOMAIN/STAYOFFDOMAIN would basically let someone who
> gathers wired.com's site say "STAYONHOST STAYONDOMAIN" and gather the
> images from images.wired.com, while not going offsite to gather content
> there.

I can see how that would be useful. I've spent a couple of hours 
browsing a Python tutorial, and it seems like an interesting 
language, so I'll see what I can do.
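
As a rough starting point, the domain comparison could be sketched as 
below. It assumes, as described above, that the "domain" is simply the 
last two labels of the host name. The names domain_of and same_domain 
are made up for illustration and are not existing spider functions, 
and the sketch uses the standard library's urllib.parse.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Return the last two dot-separated labels of the URL's host name.

    Assumption: a "domain" is just the rightmost two labels, so
    www.wired.com and images.wired.com both map to wired.com.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def same_domain(url_a, url_b):
    """True if both URLs have a host and share the same two-label domain."""
    domain_a = domain_of(url_a)
    return domain_a != "" and domain_a == domain_of(url_b)


if __name__ == "__main__":
    start = "http://www.wired.com/foo/bar.html"
    for link in ("http://images.wired.com/logo.gif",
                 "http://www.example.org/index.html"):
        print(link, same_domain(start, link))
```

One caveat: the two-label rule is too coarse for hosts under 
second-level country domains such as www.example.com.au, where every 
.com.au site would be treated as the same domain, so a real option 
would probably need something smarter there.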

I'd also like to create an option that adds the current date to 
db_file and db_name. The file name and document title could then look 
something like this:


Ideally, the format of the date would be customisable (e.g., choose 
which of the year, month, day, hour and minute are included, and in 
what order, and specify separators). I've done a very quick and dirty 
job that uses just the month and day, which suits my purposes quite 
well, but it's hard-coded into Spider.py so that's hardly suitable as 
a long-term solution. What do you think? Is anyone working on or 
planning something similar?
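
To make the idea concrete, here is a minimal sketch of how a 
customisable date could be handled with a strftime format string, 
which covers the choice of fields, their order and the separators. 
The function name stamp_names, the date_format parameter and the way 
the date is joined onto the two values are all assumptions for 
illustration. This is not the hard-coded change I made in Spider.py.

```python
from datetime import datetime


def stamp_names(db_file, db_name, date_format="%m-%d", now=None):
    """Append a formatted date to the file name and the document title.

    date_format is any strftime pattern, e.g. "%m-%d" or "%Y.%m.%d %H:%M".
    The "-" and " " joiners are arbitrary choices for this sketch.
    """
    if now is None:
        now = datetime.now()
    stamp = now.strftime(date_format)
    return db_file + "-" + stamp, db_name + " " + stamp


if __name__ == "__main__":
    # Default format: month and day only.
    print(stamp_names("wired", "Wired News"))
    # Year, month and day with a different separator.
    print(stamp_names("wired", "Wired News", "%Y.%m.%d"))
```

A single format string seems simpler than separate options for each 
field, but it does mean users have to know the strftime codes.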


Alice Harris
Internet Services, CITEC
Brisbane, Australia
+61 7 322 22578

