On Tue, 2003-05-27 at 19:44, Nik Butler wrote:
> Heres a problem for the perl ancients among you.....
>
> One of our customers ( I say our since like the Borg, ive joined a
> collective ) requires a regular deduplication of list information (
> mostly CSV ) against a existing database (SQL Server 2k) .
>
> now im fairly sure that this is exactly what Perl was designed for ...
> however when searching for tools and advice on utilising those tools I
> do tend to come up a little non plussed.

The trouble is that people are not very consistent in how they write their addresses, nor do they spell very accurately. You can use one or more of the fuzzy match algorithms and some clever sorting, together with agrep and friends, but that will only go so far. At the end of the day there is no substitute for human intervention and eyeball pattern matching: doing this properly takes fuzzy matching plus some intelligent human interaction.

Basically, Perl is your friend for doing the obvious, simple stuff, i.e. the addresses that are identical, and also for generating the list of 'possibles' that a human will then need to scan.

The snail mailing list specialists keep this sort of software close to their chests because it is what gives them their edge: it is the "clean" (deduped) lists that pay top dollar.

Best of luck...

Dirk

-- 
Please Note: Some Quantum Physics Theories Suggest That When the Consumer Is Not Directly Observing This Product, It May Cease to Exist or Will Exist Only in a Vague and Undetermined State.
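
P.S. To make the split between "identical", "possible" and "new" a bit more concrete, here is a rough sketch. It is in Python rather than Perl purely for brevity; the same steps map directly onto a Perl hash keyed on the normalised address. Everything in it is my own assumption, not a known tool: the names `normalize` and `dedupe`, the 0.85 threshold, and the use of the standard `difflib.SequenceMatcher` as a stand-in for whichever fuzzy match algorithm you end up preferring. The existing database is represented by a plain list; in practice you would pull those rows from SQL Server first.

```python
import difflib
import re


def normalize(address: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", address.lower())
    return " ".join(cleaned.split())


def dedupe(incoming, existing, threshold=0.85):
    """Sort incoming addresses into three piles.

    Returns (new, duplicates, possibles):
      new        - addresses with no close match
      duplicates - (incoming, original) pairs identical after normalising
      possibles  - (incoming, original, score) triples for a human to check
    The threshold is an arbitrary starting point, not a recommended value.
    """
    known = {normalize(a): a for a in existing}
    new, duplicates, possibles = [], [], []

    for addr in incoming:
        key = normalize(addr)

        # The obvious, simple stuff: identical once normalised.
        if key in known:
            duplicates.append((addr, known[key]))
            continue

        # Otherwise look for the closest known address.
        best_key, best_score = None, 0.0
        for candidate in known:
            score = difflib.SequenceMatcher(None, key, candidate).ratio()
            if score > best_score:
                best_key, best_score = candidate, score

        if best_score >= threshold:
            # Close but not identical: leave the decision to a person.
            possibles.append((addr, known[best_key], round(best_score, 2)))
        else:
            new.append(addr)
            known[key] = addr  # later repeats of this address count as dupes

    return new, duplicates, possibles
```

Calling `dedupe(csv_addresses, database_addresses)` gives you the rows that are safe to load, the rows that are safe to drop, and the 'possibles' list for the eyeball stage. Note that the inner loop compares every unmatched address against every known one, so for large lists you would want the clever sorting (for example, grouping by postcode) to cut down the candidates first.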