-----Original Message-----
From: Peter Rabbitson [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 24, 2005 2:01 PM
To: beginners@perl.org
Subject: Re: standardising spellings

On Thu, Feb 24, 2005 at 06:01:50PM -0000, Dermot Paikkos wrote:
> Hi,
> 
> I have a list of about 650 names (a small sample is below) that I need 
> to import into a database. When you look at the list there are some 
> obvious duplicates that are spelt slightly differently. I can 
> rationalize some of the data with some simple substitutions but some 
> of the data looks almost impossible to parse programmatically.
> Here is what I have done so far - it's not much:
> 

I would use String::Approx's amatch and run the list in several rounds. The
first round would look for possible 1-step mismatches, then 2-step, then
4-step, then 6-step, and so on. Every time you interactively confirm a delete,
the name is deleted from some kind of global hash, so the next round will not
find the duplicate a second time. Or, if it is a one-time job, just write a
simple script that runs through the list amatching a fixed number of steps and
deletes everything you confirm, writing the result to a file. Then increase
the step and do it again and again until you get tired of it :)
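
A rough sketch of the rounds idea, in Python rather than Perl so that it needs
no extra modules. It is not String::Approx: amatch is swapped for a plain
whole-string Levenshtein distance, and the names edit_distance,
dedupe_in_rounds and confirm are made up for illustration. The confirm
callback stands in for the interactive yes/no prompt.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert/delete/substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def dedupe_in_rounds(names, confirm, steps=(1, 2, 4, 6)):
    """Drop confirmed near-duplicates, loosening the threshold each round.

    confirm(keep, candidate, step) is a hypothetical callback that returns
    True when the user agrees that candidate duplicates keep.
    """
    alive = dict.fromkeys(names)   # ordered set; plays the "global hash" role
    rejected = set()               # pairs the user already said no to
    for step in steps:
        for keep in list(alive):
            if keep not in alive:
                continue           # already deleted earlier in this pass
            for cand in list(alive):
                pair = frozenset((keep, cand))
                if cand == keep or cand not in alive or pair in rejected:
                    continue
                if edit_distance(keep.lower(), cand.lower()) > step:
                    continue
                if confirm(keep, cand, step):
                    del alive[cand]
                else:
                    rejected.add(pair)
    return list(alive)
```

For real use, confirm could wrap input() and the returned list could be
written to a file. Comparing every pair is quadratic, which should be fine
for about 650 names.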

#####

Neat!  Being one, I particularly enjoyed the "perldoc String::Approx" bit
where it compared McScot to MacScot.

I saw this kind of thing attempted in an old IBM VM CMS user-written utility
called SCANCMS.  I don't have the code any more, but it did things like
collapse "nn" into "n", turn "sky" into "ski", drop all h's and suchlike.
So Coffman, Kaufmann and Kauffman would all end up as Cofman or even Cfmn
for the purposes of the comparison.
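
Something along these lines, as a guess at the idea rather than a
reconstruction of SCANCMS (sketched in Python to stay self-contained). The
doubled-letter, sky/ski and h rules come from the description above; the
k-to-c and au-to-o rules are my own assumptions, added only so that the three
example names meet. comparison_key and group_by_key are made-up names.

```python
import re
from collections import defaultdict


def comparison_key(name, drop_vowels=False):
    """Reduce a name to a crude key; similar spellings should share a key."""
    s = re.sub(r"[^a-z]", "", name.lower())  # letters only
    s = s.replace("h", "")                   # drop all h's
    s = re.sub(r"sky$", "ski", s)            # trailing "sky" -> "ski"
    s = s.replace("k", "c")                  # assumption: treat k as c
    s = s.replace("au", "o")                 # assumption: treat "au" as "o"
    s = re.sub(r"(.)\1+", r"\1", s)          # collapse doubles: "nn" -> "n"
    if drop_vowels and s:
        # keep the first letter, strip vowels from the rest
        s = s[0] + re.sub(r"[aeiouy]", "", s[1:])
    return s


def group_by_key(names, drop_vowels=False):
    """Map each key to the original spellings that produced it."""
    groups = defaultdict(list)
    for name in names:
        groups[comparison_key(name, drop_vowels)].append(name)
    return dict(groups)
```

With these rules Coffman, Kaufmann and Kauffman all give "cofman", or "cfmn"
with drop_vowels=True. Groups holding more than one spelling are the
candidates to check by hand, and the rule list is easy to extend.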

I don't know if this approach would be any improvement on String::Approx;
probably not, though String::Approx does look like a binary black box, and
this approach might be more modifiable.

Rgds, GStC.


