On Tue, 2003-05-27 at 19:44, Nik Butler wrote:
> Here's a problem for the Perl ancients among you...
> 
> One of our customers (I say "our" since, like the Borg, I've joined a
> collective) requires a regular deduplication of list information
> (mostly CSV) against an existing database (SQL Server 2k).
> 
> Now I'm fairly sure that this is exactly what Perl was designed for.
> However, when searching for tools and advice on using those tools, I
> tend to come up a little nonplussed.


The trouble is that people are not very consistent in how they write
their addresses, nor do they spell very accurately. You can use one or
more of the fuzzy matching algorithms and some clever sorting, together
with agrep and friends, but that will only go so far. At the end of the
day there is no substitute for human intervention and eyeball pattern
matching...
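
To give a feel for what "fuzzy match" means in practice, here is a rough
sketch. It is in Python rather than Perl purely for brevity, and it uses
the standard library's difflib.SequenceMatcher as a stand-in for agrep or
an edit-distance module. The function name possibles() and the 0.85
threshold are illustrative assumptions, not tuned values.

```python
from difflib import SequenceMatcher


def possibles(new_items, known_items, threshold=0.85):
    """Return (new, known, score) triples similar enough for a human to review.

    The threshold is an arbitrary starting point (an assumption); tune it
    against real data.
    """
    matches = []
    for new in new_items:
        for known in known_items:
            # ratio() is 1.0 for identical strings and falls towards 0.0
            # as the two strings share fewer characters in order.
            score = SequenceMatcher(None, new.lower(), known.lower()).ratio()
            if score >= threshold:
                matches.append((new, known, round(score, 2)))
    return matches


if __name__ == "__main__":
    known = ["12 High Street, Leeds", "3 Mill Lane, York"]
    for new, old, score in possibles(["12 High Stret, Leeds"], known):
        print(f"{new!r} looks like {old!r} (score {score})")
```

This compares every new entry with every known entry, so it gets slow on
big lists. Sorting or grouping first (by postcode, say) cuts the number of
comparisons, and whatever comes out still needs a human to look at it.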

Unfortunately, doing this properly requires fuzzy matching and some
intelligent human interaction. Basically, Perl is your friend for doing
the obvious, simple stuff, i.e. the addresses that are identical. It is
also good for generating the list of 'possibles' that you will then
need to scan by eye.
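
The "obvious, simple stuff" might look something like the sketch below:
boil each address down to a normalised key and use a set (a hash in Perl)
to spot repeats. It is written in Python only to keep it short. The
column name "address", the function names, and the idea that the existing
keys have already been pulled out of the database are all assumptions for
the sake of the example.

```python
import csv
import io
import re


def normalise(text):
    """Lower-case, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def split_new_rows(csv_text, existing_keys, column="address"):
    """Split CSV rows into (new_rows, duplicate_rows).

    existing_keys holds normalised keys already in the database
    (how they are fetched is left out here).
    """
    seen = set(existing_keys)
    new_rows, duplicates = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = normalise(row[column])
        if key in seen:
            duplicates.append(row)   # same as a database row or an earlier CSV row
        else:
            seen.add(key)
            new_rows.append(row)
    return new_rows, duplicates


if __name__ == "__main__":
    data = 'name,address\nA,"12 High street  Leeds"\nB,"3 Mill Lane, York"\n'
    new, dups = split_new_rows(data, {normalise("12 High Street, Leeds")})
    print("new:", [r["name"] for r in new])
    print("duplicates:", [r["name"] for r in dups])
```

This only catches entries that are identical once case, punctuation and
spacing are ignored. Misspellings and reordered address parts slip
straight through, which is where the 'possibles' come in.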

The snail-mail list specialists keep this sort of software close to
their chests because it is what gives them their edge: "clean"
(deduped) lists, which fetch top dollar.

Best of luck...

Dirk
-- 
Please Note: Some Quantum Physics Theories Suggest That When the
Consumer Is Not Directly Observing This Product, It May Cease to
Exist or Will Exist Only in a Vague and Undetermined State.


