Re: Someone must have solved this one ?

2003-05-30 Thread Jon Reades
Nik Butler wrote:

has anyone actually had experience ( positive or negative ) with such
tools in doing mail address deduplication of text files with SQL?


I doubt that there are many drop-in tools out there since the question 
of what constitutes a duplicate address/customer will depend on the 
nature of the business.

For instance, if you're BG, then the chances that you have two customers,
one named Jon Reades and the other named John Reades, at the same address
and postcode are very low. If you're a bookstore, then the chances are
rather higher.

I've heard of some interesting ways of tackling the problem. They
generally involve attempting to correct for common input errors by
hashing the name in one or more of the following ways before comparing
two records that you have reason to believe might be the same person
(a rough Perl sketch follows the list):

1. Removing all duplicate vowels (so 'aa' becomes 'a', and so on)
2. Removing all duplicate consonants (so 'll' becomes 'l', and so on)
3. Removing vowels altogether, whether single or multiple, and replacing
each run with a single placeholder (s|[aeiou]+|-|g)
4. Lower-casing everything
5. You might apply some similar rules (very carefully) to some types of
address/postcode information
6. You might also consider trying variations on commonly mis-heard
letters -- D, E, G, P, etc.
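
To make that concrete, here's a minimal sketch of rules 1-4 as a Perl
comparison key (the sub name and the example names are just illustrative):

  # Collapse a name into a rough comparison key (rules 1-4 above).
  sub name_key {
      my $name = lc shift;                        # 4. lower-case everything
      $name =~ s/([aeiou])\1+/$1/g;               # 1. 'aa' -> 'a'
      $name =~ s/([b-df-hj-np-tv-z])\1+/$1/g;     # 2. 'll' -> 'l'
      $name =~ s/[aeiou]+/-/g;                    # 3. vowel runs -> placeholder
      return $name;
  }

  # Colliding keys mean the pair is worth a second look:
  print "possible duplicate\n"
      if name_key('Joon Reades') eq name_key('Jon Reades');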

But basically, the rules are up to you: they set the tradeoff you're
willing to accept between verification and deduplication.

HTH,

jon
--
jon reades
fulcrum analytics
t: 0870.366.9338
m: 0797.698.7392
f: 0870.888.8880
lower ground floor
2 sheraton street
london w1f 8bh



Re: Someone must have solved this one ?

2003-05-30 Thread Toby Corkindale
On Tue, May 27, 2003 at 07:44:58PM +0100, Nik Butler wrote:
> Here's a problem for the Perl ancients among you.
> 
> One of our customers ( I say our since, like the Borg, I've joined a
> collective ) requires a regular deduplication of list information (
> mostly CSV ) against an existing database (SQL Server 2k).
> 
> Now I'm fairly sure that this is exactly what Perl was designed for ...
> however, when searching for tools and advice on utilising those tools I
> do tend to come up a little nonplussed.
> 
> Has anyone actually had experience ( positive or negative ) with such
> tools in doing mail address deduplication of text files with SQL?
> 

The laziest method would be to let the SQL server do the work, e.g.:
CREATE UNIQUE INDEX address_idx ON contacts (surname, house_number, postcode);

However, before you could use this you would have to normalise all your
information. For example, make sure the postcode is in uppercase with a
single space separator. Not hard to do, though; it's a one-off for the SQL
database, but you'd need to run it on every entry from the new text files.
Not a big loss though.
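
Something like this would do for the postcode clean-up (a sketch only;
UK-style postcodes assumed):

  # Upper-case and put exactly one space before the final three
  # characters (the inward code).
  sub clean_postcode {
      my $pc = uc shift;
      $pc =~ s/\s+//g;               # strip any existing whitespace
      $pc =~ s/(...)$/ $1/;          # one space before the inward code
      return $pc;
  }

  print clean_postcode('w1f8bh'), "\n";   # prints "W1F 8BH"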

I know this is barely using Perl as anything other than an intelligent DB
frontend, but you have to admit databases are quite good at doing indexing and
so forth - so why try to re-write the wheel?
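
If it helps, here's a rough sketch of that "Perl as DB frontend" approach,
using DBI and Text::CSV_XS against a contacts table like the one above
(connection details, file name and column order are invented for the example):

  use DBI;
  use Text::CSV_XS;

  my $dbh = DBI->connect('dbi:Pg:dbname=customers', 'user', 'pass',
                         { RaiseError => 0, PrintError => 0 });
  my $csv = Text::CSV_XS->new({ binary => 1 });
  my $ins = $dbh->prepare(
      'INSERT INTO contacts (surname, house_number, postcode) VALUES (?, ?, ?)');

  open my $fh, '<', 'new_list.csv' or die "new_list.csv: $!";
  my ($added, $dupes) = (0, 0);
  while (my $row = $csv->getline($fh)) {
      my ($surname, $house, $pc) = @$row;
      $pc = uc $pc;
      $pc =~ s/\s+//g;
      $pc =~ s/(...)$/ $1/;          # normalise before the index sees it
      # the unique index quietly rejects anything already on file
      $ins->execute(lc $surname, $house, $pc) ? $added++ : $dupes++;
  }
  print "$added inserted, $dupes duplicates skipped\n";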

tjc

Note: the example above uses PostgreSQL syntax, so it may not work verbatim on
inferior DBs ;)




Re: Someone must have solved this one ?

2003-05-29 Thread Dirk Koopman
Remind me to read my posts before pressing the send button just one more
time, would you?

On Thu, 2003-05-29 at 09:05, Dirk Koopman wrote:
> On Tue, 2003-05-27 at 19:44, Nik Butler wrote:
> > Here's a problem for the Perl ancients among you.
> > 
> > One of our customers ( I say our since, like the Borg, I've joined a
> > collective ) requires a regular deduplication of list information (
> > mostly CSV ) against an existing database (SQL Server 2k).
> > 
> > Now I'm fairly sure that this is exactly what Perl was designed for ...
> > however, when searching for tools and advice on utilising those tools I
> > do tend to come up a little nonplussed.
> 
> 
> The trouble is that people are not very consistent at writing their
> addresses, nor do they spell terribly accurately. You can use one or
> more of the fuzzy match algorithms, some clever sorting, together with
> agrep and friends, but it will only go so far. At the end of the day
> there is no substitute for human intervention and eyeball pattern
> matching...
> 
> Unfortunately, to do this properly requires fuzzy logic and some
> intelligent human interaction. Basically, Perl is your friend for doing
> the obvious, simple stuff - i.e. the addresses that are identical - and
> for generating the 'possibles' you will need to scan.
> 
> The snail mailing list specialists keep this sort of software close to
> their chests, because it is what gives them the edge: "clean" (deduped)
> lists are what pay top dollar.
> 
> Best of luck...
> 
> Dirk
-- 
Please Note: Some Quantum Physics Theories Suggest That When the
Consumer Is Not Directly Observing This Product, It May Cease to
Exist or Will Exist Only in a Vague and Undetermined State.





Re: Someone must have solved this one ?

2003-05-29 Thread Dirk Koopman
On Tue, 2003-05-27 at 19:44, Nik Butler wrote:
> Here's a problem for the Perl ancients among you.
> 
> One of our customers ( I say our since, like the Borg, I've joined a
> collective ) requires a regular deduplication of list information (
> mostly CSV ) against an existing database (SQL Server 2k).
> 
> Now I'm fairly sure that this is exactly what Perl was designed for ...
> however, when searching for tools and advice on utilising those tools I
> do tend to come up a little nonplussed.


The trouble is that people are not very consistent at writing their
addresses, nor do they spell terribly accurately. You can use one or
more of the fuzzy match algorithms, some clever sorting, together with
agrep and friends, but it will only go so far. At the end of the day
there is no substitute for human intervention and eyeball pattern
matching...

Unfortunately, to do this properly requires fuzzy logic and some
intelligent human interaction. Basically, Perl is your friend for doing
the obvious, simple stuff - i.e. the addresses that are identical - and
for generating the 'possibles' you will need to scan.
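
For the 'possibles', a crude phonetic bucketing is often enough to pull
out candidates for a human to eyeball. A minimal sketch using the core
Text::Soundex module (the CSV layout here is invented):

  use Text::Soundex;

  # Bucket records by surname-soundex plus postcode; anything sharing
  # a bucket becomes a 'possible' for manual review.
  my %bucket;
  while (my $line = <STDIN>) {
      chomp $line;
      my ($surname, $postcode) = (split /,/, $line)[0, 1];
      my $key = (soundex($surname) || 'Z000') . ':' . uc($postcode || '');
      push @{ $bucket{$key} }, $line;
  }

  for my $key (sort keys %bucket) {
      next if @{ $bucket{$key} } < 2;
      print "-- possibles ($key):\n";
      print "   $_\n" for @{ $bucket{$key} };
  }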

The snail mailing list specialists keep this sort of software close to
their chests, because it is what gives them the edge: "clean" (deduped)
lists are what pay top dollar.

Best of luck...

Dirk
-- 
Please Note: Some Quantum Physics Theories Suggest That When the
Consumer Is Not Directly Observing This Product, It May Cease to
Exist or Will Exist Only in a Vague and Undetermined State.





Re: Someone must have solved this one ?

2003-05-28 Thread Shevek
On Tue, 27 May 2003, Nik Butler wrote:

> Here's a problem for the Perl ancients among you.
> 
> One of our customers ( I say our since, like the Borg, I've joined a
> collective ) requires a regular deduplication of list information (
> mostly CSV ) against an existing database (SQL Server 2k).
> 
> 
> Now I'm fairly sure that this is exactly what Perl was designed for ...
> however, when searching for tools and advice on utilising those tools I
> do tend to come up a little nonplussed.
> 
> Has anyone actually had experience ( positive or negative ) with such
> tools in doing mail address deduplication of text files with SQL?

Sort the smaller list. For each entry in the big list, binary search the 
smaller one.

In Perl, however, you would probably turn the 'remove me' list into a huge
regular expression anchored at both ends, and feed each entry of the
master list into the regex. This would probably be faster.
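
A minimal sketch of that regex approach (file names invented; exact matching
only, so normalise the entries first):

  # Build one anchored alternation from the 'remove me' list, then
  # keep only the master entries that don't match it.
  open my $rm, '<', 'remove_me.txt' or die "remove_me.txt: $!";
  chomp(my @remove = <$rm>);
  my $pattern = join '|', map { quotemeta } @remove;
  my $re = qr/^(?:$pattern)$/i;

  open my $master, '<', 'master.txt' or die "master.txt: $!";
  while (my $entry = <$master>) {
      chomp $entry;
      print "$entry\n" unless $entry =~ $re;
  }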

S.

-- 
Shevek         http://www.anarres.org/
I am the Borg. http://design.anarres.org/



Someone must have solved this one ?

2003-05-28 Thread Nik Butler
Here's a problem for the Perl ancients among you.

One of our customers ( I say our since, like the Borg, I've joined a
collective ) requires a regular deduplication of list information (
mostly CSV ) against an existing database (SQL Server 2k).


Now I'm fairly sure that this is exactly what Perl was designed for ...
however, when searching for tools and advice on utilising those tools I
do tend to come up a little nonplussed.

Has anyone actually had experience ( positive or negative ) with such
tools in doing mail address deduplication of text files with SQL?


Cheers for your input.




-- 
[EMAIL PROTECTED]   http://www.wired4life.org/  Wired4Life, an Answer.


-BEGIN GEEK CODE BLOCK-
Version: 3.1
GB d+ s:- a C+++ U P++ L+++ E--- W++ N++ o K w--- 
O+ M+ V- PS+ PE- Y++ PGP 5+++ X R tv++ b+++ DI++ D--- 
G++ e* h--- r+++ z** 
--END GEEK CODE BLOCK--