>My Python implementation of the StayOnDomain feature is available by
>following http://www.kittycentral.net/Plucker/Spider.htm
This has now been updated to print a warning that it is ignoring
--stayondomain if --stayonhost is also specified.
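For anyone who has not looked at the patch, here is a rough sketch of how the two options might interact. It is not the code from the patch: the function names are made up for illustration, and it assumes "domain" means the last two labels of the hostname, which is wrong for suffixes such as .co.uk.

```python
import warnings
from urllib.parse import urlsplit


def resolve_scope(stay_on_host, stay_on_domain):
    """Return 'host', 'domain', or None; the host option wins if both are set."""
    if stay_on_host and stay_on_domain:
        warnings.warn("ignoring --stayondomain because --stayonhost is also specified")
    if stay_on_host:
        return "host"
    if stay_on_domain:
        return "domain"
    return None


def base_domain(host):
    # Assumption: the "domain" is the last two dot-separated labels.
    # Illustration only; this is not how the patch necessarily does it.
    return ".".join(host.lower().split(".")[-2:])


def in_scope(start_url, link_url, scope):
    """Decide whether link_url may be followed when spidering from start_url."""
    start = urlsplit(start_url).hostname or ""
    link = urlsplit(link_url).hostname or ""
    if scope == "host":
        return start == link
    if scope == "domain":
        return base_domain(start) == base_domain(link)
    return True  # no restriction requested
```

With this sketch, a link from www.example.com to docs.example.com is followed under the domain rule but not under the host rule.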
Tony McNamara
__
>
>
>>Isn't there a good reference somewhere for a regex to parse
>>URLs, as there is for parsing email addresses?
>>
Try:
http://www.regexlib.com/
___
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mai
Tony,
Send it to me ([EMAIL PROTECTED]).
Bill
___
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
Sorry, can't resist this... And a bit off topic?
> The regex that handle this along with the rest is :
>$_ =~ s!(^.*?@)|(^.*?//)|([:/].*$)!!g;
That won't handle stuff like
http://domain.com/something?email=foo@bar
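To make the failure concrete, here is a small sketch that ports the quoted substitution to Python and compares it with the standard library's URL parser. The helper names are made up for illustration, and switching to urlsplit is my suggestion, not something from the original post.

```python
import re
from urllib.parse import urlsplit

# Python rendering of the quoted Perl substitution (s!...!!g).
PATTERN = re.compile(r"(^.*?@)|(^.*?//)|([:/].*$)")


def host_by_regex(url):
    # Delete every match, as the /g substitution does.
    return PATTERN.sub("", url)


def host_by_parser(url):
    # hostname is taken from the network-location part only, so an "@"
    # in the query string is not mistaken for a user@host separator.
    return urlsplit(url).hostname


url = "http://domain.com/something?email=foo@bar"
print(host_by_regex(url))   # bar
print(host_by_parser(url))  # domain.com
```

The first alternative, `^.*?@`, matches everything up to the "@" in the query string, so the regex is left with "bar". The parser splits off the path and query before it looks for userinfo.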
> Isn't there somewhere a good reference where one can find the
> regex
> -----Original Message-----
> From: David A. Desrosiers [SMTP:[EMAIL PROTECTED]]
> Date: Friday, 27 September 2002 13:13
> To: Plucker Development List
> Subject: Re: Stayondomain for testing
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
> It's pretty modular and innocuous, and is based on the current tip.
...and here's my attempt, using Perl. It works for everything I've
thrown at it so far. I'll shim this into my spider and get some samples:
# -
My Python implementation of the StayOnDomain feature is available by
following http://www.kittycentral.net/Plucker/Spider.htm
Please feel free to test it, bang on it, etc. I think it works... at least
it seems to work and doesn't seem to break anything... but you should
either be comfortable