Andrew, 

Good point.  The comprehensive canonicalization code which is present in
the backend is missing from Whiplash.pm.  The next release will fix
this. 

cheers,
vipul

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Andrew
McNaughton
Sent: Monday, July 26, 2004 12:26 AM
To: [EMAIL PROTECTED]
Subject: Re: [Razor-users] Too many false positives


I've been poking around in the razor source code, and it appears that
the identification of domains has a serious bug.

http://www.greenpeace.org.nz/ is shortened to org.nz
http://www.scoop.co.nz/ is shortened to co.nz

Here's the problem code:

             # See if it's a non country domain.  If so,
             # we'll extract the hostname. (SPEC-REF: NORMALIZE)

             if ($host =~ m:\.([^\.]*\.[^\.]{2,4})$:) {
                 $normalized_host = $1;
             }  else {
                 $normalized_host = $host;
             }

I don't think this can be treated so simply.  Some domains will need to
be checked as 2nd level domains (eg spammer.com) and some as third level
domains (eg spammer.co.nz).  Some countries sell 2nd level domains (eg
spammer.nu).
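
To see the effect on each of these cases, here is a minimal sketch that
runs the same pattern over three hostnames.  The normalize() wrapper and
the test hostnames are made up for illustration; they are not part of
the Razor source.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern quoted above.
sub normalize {
    my ($host) = @_;
    # Keep the last two labels if the final label is 2-4 characters long.
    if ($host =~ m:\.([^\.]*\.[^\.]{2,4})$:) {
        return $1;
    }
    return $host;
}

# Print what each hostname is reduced to.
for my $host (qw(www.spammer.com www.spammer.nu www.spammer.co.nz)) {
    printf "%-20s -> %s\n", $host, normalize($host);
}
```

The first two come out as spammer.com and spammer.nu, which is fine, but
the third is reduced to co.nz, so every site under co.nz would be
treated as the same domain.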

I'm thinking this needs to be handled with a hash lookup on the top
level domain which returns the level at which the domain is to be
treated.  The following illustrates the general idea, although the hash
initialization should go at the top of the file, or perhaps be loaded
from a config file.

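        my $domain_level = {qw(
          com 2
          net 2
          org 2
          gov 2
          museum 2
          nz 3
          au 3
          uk 3
          nu 2
        )};

        my @host = split /\./, $host;
        # Assumed default: treat TLDs missing from the table as level 2.
        my $level = $domain_level->{$host[-1]} || 2;
        # Never ask for more labels than the hostname has.
        $level = @host if $level > @host;
        # Range slice, so we get the last $level labels.
        $normalized_host = join '.', @host[-$level .. -1];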
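
The fragment above can be wrapped into something runnable.  This is only
a sketch: the normalize_host() name, the fallback to level 2 for TLDs
that are not in the table, and the clamp for short hostnames are
assumptions of mine, not anything taken from Razor, and the table is
nowhere near complete.

```perl
use strict;
use warnings;

# Illustrative table only; a real one would need many more entries.
my %domain_level = (
    com => 2, net => 2, org => 2, gov => 2, museum => 2,
    nz  => 3, au  => 3, uk  => 3, nu  => 2,
);

# Hypothetical helper: reduce $host to the level its TLD is sold at.
sub normalize_host {
    my ($host) = @_;
    my @labels = split /\./, $host;
    return $host unless @labels;

    # Assumption: TLDs missing from the table are treated as level 2.
    my $level = $domain_level{ $labels[-1] } || 2;

    # Never ask for more labels than the hostname actually has.
    $level = @labels if $level > @labels;

    # Range slice: the last $level labels, in order.
    return join '.', @labels[ -$level .. -1 ];
}

print normalize_host($_), "\n"
    for qw(www.greenpeace.org.nz www.scoop.co.nz www.spammer.com spammer.nu);
```

With this table the four hostnames come out as greenpeace.org.nz,
scoop.co.nz, spammer.com and spammer.nu.  Note the range operator in the
slice: a comma there would pick only two elements (the one at -$level
and the last one) rather than the last $level labels.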

Does anyone know where to find an existing table or lookup system giving
the level at which various domains are made publicly available?  Is
there any sort of mechanism which could avoid having to maintain a list
like this?

Andrew McNaughton



--

No added Sugar.  Not tested on animals.  May contain traces of Nuts.  If
irritation occurs, discontinue use.

-------------------------------------------------------------------
Andrew McNaughton           Living in a shack in Tasmania
[EMAIL PROTECTED]          Between the bush and the sea

Mobile: +61 422 753 792     http://staff.scoop.co.nz/andrew/cv.doc
                             http://www.scoop.co.nz/



-------------------------------------------------------
This SF.Net email is sponsored by BEA Weblogic Workshop FREE Java
Enterprise J2EE developer tools!
Get your free copy of BEA WebLogic Workshop 8.1 today.
http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click
_______________________________________________
Razor-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/razor-users

