If you can all stretch your minds back more than a month...

On Tue, Apr 17, 2001 at 06:37:54PM +1000, Alice Harris wrote:
>
> If you're interested... I've made a slight change to
> PyPlucker/Spider.py to implement a new "STAYOFFHOST" option. It's
> basically the opposite of STAYONHOST - when Plucker parses a web
> page, it ignores all links that point to the host of that page, and
> follows all those that point to other hosts. I find it very useful
> for Slashdot as I can now retrieve all the articles that are
> referenced from the Slashdot home page without downloading those
> enormous discussions.
 

On Tue, Apr 17, 2001 at 01:47:33AM -0700, David A. Desrosiers wrote:
> 
>       How about implementing a STAYONDOMAIN/STAYOFFDOMAIN as well. In
> the literal terms, a "host" is the FQDN, including elements of the URI.
> The "domain" is simply the last two portions of the FQDN starting from the
> right, i.e. "http://www.wired.com/foo/bar.html" is a "host", while
> "wired.com" is a "domain". Subtle difference, but still important for our
> needs.
> 
>       The STAYONDOMAIN/STAYOFFDOMAIN would basically let someone who
> gathers 'wired.com's site to say "STAYONHOST STAYONDOMAIN" and gather the
> images from images.wired.com, while not going offsite to gather content
> there.


I'm close to having the stayondomain / stayoffdomain options done. 
Finally... Sorry 'bout the delay.
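
For anyone following along, the host/domain distinction David
describes can be sketched roughly as below. This is only an
illustration in present-day Python, not the code in
PyPlucker/Spider.py. The helper names (host_of, domain_of, same_host,
same_domain) are made up for the example, and the "last two labels"
rule is taken straight from David's description, so it is naive: it
would treat every host under .com.au as the same domain.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL, e.g. 'www.wired.com'."""
    return (urlsplit(url).hostname or "").lower()


def domain_of(url):
    """Return the last two labels of the host, e.g. 'wired.com'.

    Assumption: 'domain' means the last two dot-separated labels.
    That is wrong for two-part suffixes such as 'com.au'.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def same_host(page_url, link_url):
    """True if both URLs name exactly the same host."""
    return host_of(page_url) == host_of(link_url)


def same_domain(page_url, link_url):
    """True if both URLs share the same (naive) two-label domain."""
    return domain_of(page_url) == domain_of(link_url)
```

With these, a link from www.wired.com to images.wired.com would be
off-host but on-domain, which is the case the new options are meant
to cover.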

Does anyone have suggestions for sites I could test it on? I'm 
looking for sites that have two related domain names, like 
www.wired.com and images.wired.com, as David used in his example, 
except NOT those two in particular because images.wired.com doesn't 
exist. :)

You may also want to give me ideas about how to handle combinations 
of stayondomain, stayoffdomain, stayonhost, and stayoffhost. At the 
moment, if a user specified more than one of them, the results would 
be determined solely by the order in which the options are processed 
in the code, which may appear to be without rhyme or reason to the 
user. Can anyone think of valid uses for combinations of them, or 
bizarre combinations that should be picked up as a user-error and 
automatically modified? For example, if both stayondomain and 
stayonhost are specified, how would you like them to be treated? Or 
should we just recommend that only one of those four options be used, 
and then rely upon the users having a clue so that we don't have to 
worry about error checking and the order in which they should be 
processed?
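
One possible way to take processing order out of the picture is to
treat each option as an independent test and follow a link only if
every selected test passes. The sketch below shows that idea in
present-day Python. It is an assumption about how this could be done,
not what Spider.py does: the names TESTS, CONTRADICTORY, check_options
and should_follow are invented for the example, and "domain" is
naively the last two labels of the host.

```python
from urllib.parse import urlsplit


def _host(url):
    """Lowercased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def _domain(url):
    """Naive domain: the last two dot-separated labels of the host."""
    return ".".join(_host(url).split(".")[-2:])


# Each option is a test on (page_url, link_url).
TESTS = {
    "stayonhost": lambda page, link: _host(page) == _host(link),
    "stayoffhost": lambda page, link: _host(page) != _host(link),
    "stayondomain": lambda page, link: _domain(page) == _domain(link),
    "stayoffdomain": lambda page, link: _domain(page) != _domain(link),
}

# Pairs that no link can satisfy together (same host implies same domain).
CONTRADICTORY = [
    {"stayonhost", "stayoffhost"},
    {"stayondomain", "stayoffdomain"},
    {"stayonhost", "stayoffdomain"},
]


def check_options(options):
    """Raise ValueError for unknown names or combinations that match nothing."""
    chosen = set(options)
    unknown = chosen - set(TESTS)
    if unknown:
        raise ValueError("unknown options: " + ", ".join(sorted(unknown)))
    for pair in CONTRADICTORY:
        if pair <= chosen:
            raise ValueError("options conflict: " + " + ".join(sorted(pair)))


def should_follow(page_url, link_url, options):
    """Follow the link only if every selected option allows it."""
    return all(TESTS[name](page_url, link_url) for name in options)
```

Under this policy stayoffhost + stayondomain has a real use (follow
only links to sibling hosts such as images.example.com), stayonhost +
stayondomain behaves the same as stayonhost alone, and stayoffhost +
stayoffdomain behaves the same as stayoffdomain alone.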

BTW, I'm not sure there'll be a use for the two stayoff... options. 
Ever since David told me that I could get a Palm-friendly version of 
Slashdot from www.custard.org, I've had no need for my stayoffhost 
hack, and I can't imagine using stayoffdomain at any time. 
However, they were both easy enough to implement. You might want to 
have the stayoff... options as undocumented or quietly-documented 
features to avoid unpleasant complexity in the instructions.

Comments anyone? (I'm subscribed to the list now so you don't need to 
CC me specifically.)

Alys

--
Alice Harris
Internet Services, CITEC, Brisbane, Australia
+61 7 322 22578