On Tue, Apr 17, 2001 at 01:47:33AM -0700, David A. Desrosiers wrote:
> 
>       How about implementing a STAYONDOMAIN/STAYOFFDOMAIN as well. In
> the literal terms, a "host" is the FQDN, including elements of the URI.
> The "domain" is simply the last two portions of the FQDN starting from the
> right, i.e. "http://www.wired.com/foo/bar.html" is a "host", while
> "wired.com" is a "domain". Subtle difference, but still important for our
> needs.
> 
>       The STAYONDOMAIN/STAYOFFDOMAIN would basically let someone who
> gathers wired.com's site say "STAYONHOST STAYONDOMAIN" and gather the
> images from images.wired.com, while not going offsite to gather content
> there.

I can see how that would be useful. I've spent a couple of hours 
browsing a Python tutorial, and it seems like an interesting 
language, so I'll see what I can do.
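
As a starting point, here is a rough sketch of how the domain test 
might look, taking the "domain" to be the last two labels of the host 
name as described above. The function names domain_of and same_domain 
are made up for illustration and are not taken from Spider.py, and the 
two-label rule is a simplification that would get hosts like 
www.example.co.uk wrong.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the last two labels of the URL's host name, e.g. 'wired.com'.

    Hypothetical helper: assumes a 'domain' is always the final two labels.
    """
    host = urlparse(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])


def same_domain(url_a, url_b):
    """True if both URLs share the same (two-label) domain."""
    return domain_of(url_a) == domain_of(url_b)
```

With this, http://www.wired.com/foo/bar.html and a link to 
images.wired.com would count as the same domain, while a link to any 
other site would not.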


I'd also like to create an option that adds the current date to 
db_file and db_name. The file name and document title could then look 
something like this:

        dailynews-010418.pdb
        Daily_News_010418 

Ideally, the format of the date would be customisable (e.g., choose 
which of the year, month, day, hour and minute are included, and in 
what order, and specify separators). I've done a very quick and dirty 
job that uses just the month and day, which suits my purposes quite 
well, but it's hard-coded into Spider.py so that's hardly suitable as 
a long-term solution. What do you think? Is anyone working on or 
planning something similar?
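
To make the idea concrete, here is a minimal sketch of a customisable 
version, assuming the date format is given as a strftime pattern. The 
function add_date and its parameters are hypothetical, not existing 
Plucker options, and this is not the hard-coded change I made in 
Spider.py.

```python
import os
import time


def add_date(db_file, db_name, fmt="%y%m%d", file_sep="-", name_sep="_", when=None):
    """Append a formatted date to the file name (before its extension)
    and to the document title.

    fmt is any strftime pattern, so the fields and their order are
    chosen by the user; when defaults to the current local time.
    """
    if when is None:
        when = time.localtime()
    stamp = time.strftime(fmt, when)
    base, ext = os.path.splitext(db_file)
    return base + file_sep + stamp + ext, db_name + name_sep + stamp
```

Called with "dailynews.pdb" and "Daily_News" on 18 April 2001 and the 
default format, this would give the two names shown above. A pattern 
such as "%m%d" would reproduce the month-and-day form I'm using now.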

Alys

--
Alice Harris
Internet Services, CITEC
Brisbane, Australia
+61 7 322 22578
