-----Original Message-----
From: Peter Rabbitson [mailto:[EMAIL PROTECTED]
Sent: Thursday, February 24, 2005 2:01 PM
To: beginners@perl.org
Subject: Re: standardising spellings
On Thu, Feb 24, 2005 at 06:01:50PM -0000, Dermot Paikkos wrote:
> Hi,
>
> I have a list of about 650 names (a small sample is below) that I need
> to import into a database. When you look at the list there are some
> obvious duplicates that are spelt slightly differently. I can
> rationalize some of the data with some simple substitutions, but some
> of the data looks almost impossible to parse programmatically.
> Here's what I have done so far - it's not much:

I would use String::Approx's amatch and run the list in several rounds. The first round would look for possible 1-step mismatches, then 2-step, then 4-step, then 6-step, and so on. Every time you interactively confirm a delete, it is deleted from some kind of global hash, so the next round will not find the duplicate a second time. Or, if it is a one-time job, just write a simple script that runs through the list amatching with a fixed number of steps and deletes everything you confirm, writing the result to a file. Then increase the step count and do it again and again until you get tired of it :)

#####

Neat! Being one, I particularly enjoyed the "perldoc String::Approx" bit where it compared McScot to MacScot.

I saw this kind of thing attempted in an old IBM VM CMS user-written utility called SCANCMS. I don't have the code any more, but it did things like collapse "nn" into "n", turn "sky" into "ski", drop all h's, and such like. So Coffman, Kaufmann and Kauffman would all end up as Cofman, or even Cfmn, for the purposes of the comparison.

I don't know whether this approach would be any improvement on String::Approx; probably not, though Approx does look like a binary black box, and this approach might be easier to modify.

Rgds,
GStC.
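P.S. The round-by-round idea above can be sketched in plain Perl. String::Approx is a CPAN module, so as a self-contained stand-in this sketch computes a Levenshtein edit distance directly in place of amatch; the widening step counts (1, 2, 4, 6) and the "global hash" of confirmed deletes follow the description above. The names, the callback, and the keep-the-shorter-spelling rule are all illustrative, not anyone's production code:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min);

# Plain-Perl Levenshtein edit distance, standing in for
# String::Approx::amatch's "k-step" approximate matching.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Run the list in widening rounds: 1-step mismatches first, then 2, 4, 6.
# %kept is the "global hash" -- once a spelling is confirmed as a duplicate
# and deleted, later rounds cannot report it a second time.
sub dedup_rounds {
    my ($names, $confirm) = @_;    # $confirm: callback naming the spelling to delete
    my %kept = map { $_ => 1 } @$names;
    for my $steps (1, 2, 4, 6) {
        my @current = sort keys %kept;
        for my $i (0 .. $#current) {
            for my $j ($i + 1 .. $#current) {
                my ($p, $q) = @current[$i, $j];
                next unless $kept{$p} && $kept{$q};
                next unless edit_distance($p, $q) <= $steps;
                my $loser = $confirm->($p, $q, $steps);    # interactively, a prompt
                delete $kept{$loser} if defined $loser;
            }
        }
    }
    return sort keys %kept;
}

# Non-interactive demo of the confirm step: refuse pairs whose first
# letters differ (the human saying "no, keep both"), otherwise delete
# the longer spelling.
my @names  = ('Coffman', 'Cofman', 'Kauffman', 'Kaufmann', 'Smith');
my @unique = dedup_rounds(\@names, sub {
    my ($p, $q) = @_;
    return undef unless lc(substr $p, 0, 1) eq lc(substr $q, 0, 1);
    return length $p >= length $q ? $p : $q;
});
print "$_\n" for @unique;    # Cofman, Kaufmann, Smith survive
```

An interactive version would replace the demo callback with a prompt that shows both spellings and the current step count, exactly the confirm-and-delete loop described above.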
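P.P.S. The SCANCMS-style squashing is easy to try in a few lines of Perl. The exact rule set is guessed for illustration (SCANCMS's real rules are lost with the code): fold hard "c" into "k" so Coffman and Kaufmann can meet, drop h's, drop vowels (and "y", which also makes "sky"/"ski" collide), and collapse doubled letters. With those rules Coffman, Kaufmann and Kauffman all squash to "kfmn", much like the "Cfmn" mentioned above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Squash a name to a crude comparison key, SCANCMS-style.
# These particular rules are invented for illustration.
sub squash_name {
    my ($name) = @_;
    my $key = lc $name;
    $key =~ tr/c/k/;          # fold hard c into k: Coffman ~ Kaufmann
    $key =~ s/h//g;           # drop all h's
    $key =~ s/[aeiouy]//g;    # drop vowels (and y): "sky" and "ski" collide
    $key =~ s/(.)\1+/$1/g;    # collapse repeats: nn -> n, ff -> f
    return $key;
}

# Bucket raw names by squashed key; any bucket holding more than one
# spelling is a candidate set of duplicates to review by hand.
sub find_candidates {
    my @names = @_;
    my %bucket;
    push @{ $bucket{ squash_name($_) } }, $_ for @names;
    return grep { @$_ > 1 } values %bucket;
}

my @names = ('Coffman', 'Kaufmann', 'Kauffman', 'Smith', 'Kowalski');
my @sets  = find_candidates(@names);
for my $set (@sets) {
    # Coffman / Kaufmann / Kauffman land in one bucket; Smith and
    # Kowalski each get a bucket of their own and are not reported.
    print join(' / ', @$set), "\n";
}
```

Being a blunt normalisation rather than a distance measure, this is modifiable in exactly the way suggested: each rule is one substitution you can add, drop, or reorder, where String::Approx gives you only the step count to turn.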