Andrew,

Good point. The comprehensive canonicalization code that is present in the backend is missing from Whiplash.pm. The next release will fix this.

cheers,
vipul

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrew McNaughton
Sent: Monday, July 26, 2004 12:26 AM
To: [EMAIL PROTECTED]
Subject: Re: [Razor-users] Too many false positives

I've been poking around in the razor source code, and it appears that the identification of domains has a serious bug.

    http://www.greenpeace.org.nz/ is shortened to org.nz
    http://www.scoop.co.nz/ is shortened to co.nz

Here's the problem code:

    # See if it's a non country domain.  If so,
    # we'll extract the hostname.  (SPEC-REF: NORMALIZE)
    if ($host =~ m:\.([^\.]*\.[^\.]{2,4})$:) {
        $normalized_host = $1;
    } else {
        $normalized_host = $host;
    }

I don't think this can be treated so simply. Some domains will need to be checked as 2nd level domains (eg spammer.com) and some as third level domains (eg spammer.co.nz). Some countries sell 2nd level domains (eg spammer.nu).

I'm thinking this needs to be handled with a hash lookup on the top level domain which returns the level at which the domain is to be treated. The following illustrates the general idea, albeit the hash initialization wants to be put up the top of the file, or perhaps loaded from a config file.

    # Number of labels to keep for each top level domain.
    my $domain_level = {qw(
        com 2  net 2  org 2  gov 2  museum 2
        nz 3   au 3   uk 3   nu 2
    )};

    my @host  = split /\./, $host;
    my $level = $domain_level->{$host[-1]};
    # Range slice: keep the last $level labels of the host name.
    $normalized_host = join '.', @host[-$level .. -1];

Does anyone know where to find an existing table or lookup system giving the level at which various domains are made publicly available? Is there any sort of mechanism which could avoid having to maintain a list like this?

Andrew McNaughton

--
No added Sugar. Not tested on animals. May contain traces of Nuts. If irritation occurs, discontinue use.
-------------------------------------------------------------------
Andrew McNaughton           Living in a shack in Tasmania
[EMAIL PROTECTED]           Between the bush and the sea
Mobile: +61 422 753 792     http://staff.scoop.co.nz/andrew/cv.doc
http://www.scoop.co.nz/

_______________________________________________
Razor-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/razor-users
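
For anyone following the thread, here is a minimal, self-contained sketch that contrasts the two-label pattern from the quoted Razor code with the table-driven lookup Andrew proposes. The sub names normalize_two_labels and normalize_by_table are hypothetical and not part of Razor. The table entries are only the illustrative ones from the message, and falling back to level 2 for unlisted top level domains is an assumption of this sketch, not something the thread settles.

```perl
use strict;
use warnings;

# Illustrative table only (taken from the message above): how many
# trailing labels to keep for each top level domain. Not authoritative.
my %DOMAIN_LEVEL = (
    com => 2, net => 2, org => 2, gov => 2, museum => 2, nu => 2,
    nz  => 3, au  => 3, uk  => 3,
);

# Same pattern as the quoted Razor code, written with / delimiters:
# keeps the last two labels when the final label is 2 to 4 characters.
sub normalize_two_labels {
    my ($host) = @_;
    if ($host =~ /\.([^.]*\.[^.]{2,4})$/) {
        return $1;
    }
    return $host;
}

# Hypothetical table-driven variant of the proposal.
sub normalize_by_table {
    my ($host) = @_;
    my @labels = split /\./, lc $host;
    # Assumption: unlisted top level domains are treated as level 2.
    my $level = $DOMAIN_LEVEL{ $labels[-1] } || 2;
    # Never ask for more labels than the host actually has.
    $level = @labels if $level > @labels;
    return join '.', @labels[ -$level .. -1 ];
}

for my $host (qw(www.greenpeace.org.nz www.scoop.co.nz www.spammer.com spammer.nu)) {
    printf "%-24s two-label: %-14s table: %s\n",
        $host, normalize_two_labels($host), normalize_by_table($host);
}
```

With this table, www.greenpeace.org.nz comes out as org.nz from the two-label pattern and as greenpeace.org.nz from the lookup, while www.spammer.com is spammer.com either way. The sketch does not answer the maintenance question raised above: the table still has to come from somewhere.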