> -----Original Message-----
> From:    David A. Desrosiers [SMTP:[EMAIL PROTECTED]]
> Date:    Friday, 27 September 2002 13:13
> To:      Plucker Development List
> Subject: Re: Stayondomain for testing
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> > It's pretty modular and innocuous, and is based on the current tip.
> 
>       ..and here's my attempt, using perl. It works for everything I've
> thrown at it so far. I'll shim this into my spider and get some samples:
> 
> # ------------------------------------------------------------------
> use strict;
> my %fqdn;
> 
> my @list = ('http://perlnewbies.com/history/searchit.html',
>             'http://excite.com/search.gw?c=web&lk=excite_home_us&s',
>             'http://search.msn.co.uk/spbasic.htm?MT=Presidents',
>             'http://search.msn.com/results.asp?RS=CHECKED&Armada',
>             'http://search.yahoo.com/bin/search?p=%22Perl',
>             'http://top.cswap.com:80/cat/?biblestudies',
>             'http://203.47.133.209/pipermail/plucker-dev/',
>       'http://sers:[EMAIL PROTECTED]/abstract/spe/medgen/index.htm');
> 
> foreach (@list) {
>         $_ =~ s!(^.*?//)|([:/].*$)!!g;
>         $fqdn{$_}++;
> }
> 
> foreach my $dom (sort keys(%fqdn)) {
>         print "$dom = $fqdn{$dom}\n"
> }
> # ------------------------------------------------------------------

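It doesn't work when the URL contains a login and password (HTTP basic auth), as in:
    http://login:[EMAIL PROTECTED]/abstract/spe/medgen/index.htm
...where the ':' has a second meaning: it separates the login from the password
rather than the host from the port, so the substitution cuts at that first ':'
and leaves the login instead of the host.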

A regex that handles this along with the rest is:
       $_ =~ s!(^.*?@)|(^.*?//)|([:/].*$)!!g;
(I don't know a better way to write "the longer of the strings terminated by
either @ or //".)
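
One case the alternation above still gets wrong is an '@' that appears later
in the URL, for example in the path: '^.*?@' then removes the host as well.
Below is a minimal sketch of a two-step variant, assuming the login part never
contains a '/'. The helper name host_of and the example.org URLs with their
credentials are made up for illustration, and cutting at '?' and '#' as well
as ':' and '/' is my own addition.

```perl
use strict;
use warnings;

# host_of() is a hypothetical helper name, not part of the spider.
sub host_of {
    my ($url) = @_;
    # Drop the scheme and, if present, a "login:pass@" prefix. The prefix
    # may not contain '/', so an '@' further along in the path is left alone.
    $url =~ s{^.*?//(?:[^/\@]*\@)?}{};
    # Drop everything from the first ':' (port), '/', '?' or '#' onward.
    $url =~ s{[:/?#].*}{}s;
    return $url;
}

print host_of($_), "\n" for (
    'http://top.cswap.com:80/cat/?biblestudies',
    'http://user:secret@example.org/abstract/index.htm',
    'http://example.org/mail/someone@elsewhere.net',
);
```

Run as is, this should print top.cswap.com and then example.org twice. It
does not try to handle bracketed IPv6 hosts.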

Is there a good reference somewhere for a regex that parses URLs, as there is
one for parsing email addresses?
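
One candidate is Appendix B of RFC 2396, which gives a regular expression for
splitting a URI reference into scheme, authority, path, query and fragment.
The sketch below uses that pattern and then peels the login and port off the
authority part. The helper names split_uri and host_from_authority and the
example.org URL are hypothetical; the second step is my own simplification,
not something the RFC spells out as a regex.

```perl
use strict;
use warnings;

# split_uri() is a hypothetical helper. The pattern is the one given in
# RFC 2396, Appendix B; groups 2, 4, 5, 7 and 9 carry the five components.
sub split_uri {
    my ($uri) = @_;
    my @m = $uri =~ m{^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?};
    return (
        scheme    => $m[1],
        authority => $m[3],
        path      => $m[4],
        query     => $m[6],
        fragment  => $m[8],
    );
}

# Assumed layout of the authority: [ login[:pass] "@" ] host [ ":" port ]
sub host_from_authority {
    my ($authority) = @_;
    return unless defined $authority;
    (my $host = $authority) =~ s{^.*\@}{};   # drop "login:pass@" if present
    $host =~ s{:\d*\z}{};                    # drop ":port" if present
    return $host;
}

my %u = split_uri('http://user:secret@example.org:8080/abstract/index.htm?x=1#top');
print host_from_authority($u{authority}), "\n";
```

Components that are absent from the URL come back as undef, so callers
running under warnings should test them with defined before use.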

NH
_______________________________________________
plucker-dev mailing list
[EMAIL PROTECTED]
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
