On 26 May 2011 09:13, Tarlika Elisabeth Schmitz <postgres...@numerixtechnology.de> wrote: > On Wed, 25 May 2011 09:25:48 -0600 > Rob Sargent <robjsarg...@gmail.com> wrote: > >> >> >>On 05/24/2011 10:57 AM, Lew wrote: >>> Tarlika Elisabeth Schmitz wrote: >>> >>>> CREATE TABLE person >>>> ( >>>> id integer NOT NULL, >>>> "name" character varying(256) NOT NULL, >>>> "location" character varying(256), >>>> CONSTRAINT person_pkey PRIMARY KEY (id) >>>> ); >>>> >>>> this was just a TEMPORARY table I created for quick analysis >>>> of my CSV data (now renamed to temp_person). > > CREATE TABLE country > ( > id character varying(3) NOT NULL, -- alpha-3 code > "name" character varying(50) NOT NULL, > CONSTRAINT country_pkey PRIMARY KEY (id) > ); > > >>To minimize the ultimately quite necessary human adjudication, one >>might make good use of what is often termed "crowd sourcing": Keep >>all the distinct "hand entered" values and a map to the final human >>assessment. > > I was wondering how to do just that. I don't think it would be a good > idea to hard code this into the clean-up script. Take, for instance, > variations of COUNTRY.NAME spelling. Where would I store these? > > I could do with a concept for this problem, which applies to a lot of > string-type info.
I'd start w/ downloading a list as mentioned here: http://answers.google.com/answers/threadview?id=596822 And run it through a wee perl script using http://search.cpan.org/~maurice/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm to make phonetic matches ... Then I'd run your own data through DoubleMetaphone, and clean up matches if not too many false positives show up. Cheers, Andrej -- Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql