Re: Someone must have solved this one ?
Nik Butler wrote: has anyone actually had experience ( postive or negative ) with such tools in doing mail address deduplication of text files with SQL I doubt that there are many drop-in tools out there since the question of what constitutes a duplicate address/customer will depend on the nature of the business. For instance, if you're BG then the chances that you have two customers, one named Jon Reades and the other named John Reades at the same address and postcode are very low. If you're a bookstore then the chances are rather higher. I've heard of some interesting ways of tackling the problem that generally involve attempting to correct for common input errors by hashing the name in one or more of the following ways for the purposes of comparing two records that you have reason to believe might be the same person: 1. Removing all duplicate vowels (so 'aa' becomes 'a', and so on) 2. Remove all duplicate consonants (so 'll' becomes 'l', and so on) 3. Removing vowels, whether single or multiple, altogether and replacing with a single placeholder (s|[aeiow]+|-|g) 4. Lower-casing everything 5. You might apply some similar rules (very carefully) to some types of address/postcode information 6. You might also consider trying variations on commonly mis-heard letters -- D, E, G, P, etc. But basically, the rules are up to you in terms of what you'll accept as the tradeoff between verification and deduplication. HTH, jon -- jon reades fulcrum analytics t: 0870.366.9338 m: 0797.698.7392 f: 0870.888.8880 lower ground floor 2 sheraton street london w1f 8bh
Re: Someone must have solved this one ?
On Tue, May 27, 2003 at 07:44:58PM +0100, Nik Butler wrote: > Heres a problem for the perl ancients among you. > > One of our customers ( I say our since like the Borg, ive joined a > collective ) requires a regular deduplication of list information ( > mostly CSV ) against a existing database (SQL Server 2k) . > > now im fairly sure that this is exactly what Perl was designed for ... > however when searching for tools and advice on utilising those tools I > do tend to come up a little non plussed. > > has anyone actually had experience ( postive or negative ) with such > tools in doing mail address deduplication of text files with SQL > The laziest method would be to let the SQL server do the work.. eg: CREATE UNIQUE INDEX address_idx ON contacts (sirname, house_number, postcode); However, before you could use this you would have to normalise all your information. For eg: Make sure the postcode was in uppercase with a space seperator. Not hard to do though; it's a once-off for the SQL database, but you'd need to run it on every entry from the new text files. Not a big loss though. I know this is barely using Perl as anything other than an intelligent DB frontend, but you have to admit databases are quite good at doing indexing and so forth - so why try to re-write the wheel? tjc Note: Example above is taken from PostgreSQL, so may not work verbatim on inferior DBs ;)
Re: Someone must have solved this one ?
Remind me to read my posts before pressing the send button just one more time would you? On Thu, 2003-05-29 at 09:05, Dirk Koopman wrote: > On Tue, 2003-05-27 at 19:44, Nik Butler wrote: > > Heres a problem for the perl ancients among you. > > > > One of our customers ( I say our since like the Borg, ive joined a > > collective ) requires a regular deduplication of list information ( > > mostly CSV ) against a existing database (SQL Server 2k) . > > > > now im fairly sure that this is exactly what Perl was designed for ... > > however when searching for tools and advice on utilising those tools I > > do tend to come up a little non plussed. > > > The trouble is that people are not very consistent at writing their > addresses, neither do they spell terribly exactly. You can use one or > more of the fuzzy match algorithms, some clever sorting, together with > agrep and friends, but it will only go so far. At the end of the day > there is no substitute for human intervention and eyeball pattern > matching... > > Unfortunately, to do this properly requires fuzzy logic and some > intelligent human interaction. Basically, perl is your friend for doing > the obvious, simple stuff - ie the addresses that are identical. Also > for generating the 'possibles' you will need to scan. > > The snail mailing list specialists keep this sort of software close to > their chests because it is that which gives them the edge, viz: "clean" > (deduped) lists, that pays top dollar. > > Best of luck... > > Dirk -- Please Note: Some Quantum Physics Theories Suggest That When the Consumer Is Not Directly Observing This Product, It May Cease to Exist or Will Exist Only in a Vague and Undetermined State.
Re: Someone must have solved this one ?
On Tue, 2003-05-27 at 19:44, Nik Butler wrote: > Heres a problem for the perl ancients among you. > > One of our customers ( I say our since like the Borg, ive joined a > collective ) requires a regular deduplication of list information ( > mostly CSV ) against a existing database (SQL Server 2k) . > > now im fairly sure that this is exactly what Perl was designed for ... > however when searching for tools and advice on utilising those tools I > do tend to come up a little non plussed. The trouble is that people are not very consistent at writing their addresses, neither do they spell terribly exactly. You can use one or more of the fuzzy match algorithms, some clever sorting, together with agrep and friends, but it will only go so far. At the end of the day there is no substitute for human intervention and eyeball pattern matching... Unfortunately, to do this properly requires fuzzy logic and some intelligent human interaction. Basically, perl is your friend for doing the obvious, simple stuff - ie the addresses that are identical. Also for generating the 'possibles' you will need to scan. The snail mailing list specialists keep this sort of software close to their chests because it is that which gives them the edge, viz: "clean" (deduped) lists, that pays top dollar. Best of luck... Dirk -- Please Note: Some Quantum Physics Theories Suggest That When the Consumer Is Not Directly Observing This Product, It May Cease to Exist or Will Exist Only in a Vague and Undetermined State.
Re: Someone must have solved this one ?
On Tue, 27 May 2003, Nik Butler wrote: > Heres a problem for the perl ancients among you. > > One of our customers ( I say our since like the Borg, ive joined a > collective ) requires a regular deduplication of list information ( > mostly CSV ) against a existing database (SQL Server 2k) . > > > now im fairly sure that this is exactly what Perl was designed for ... > however when searching for tools and advice on utilising those tools I > do tend to come up a little non plussed. > > has anyone actually had experience ( postive or negative ) with such > tools in doing mail address deduplication of text files with SQL Sort the smaller list. For each entry in the big list, binary search the smaller one. In perl, however, you would probably turn the 'remove me' list into a huge regular expression anchored at both ends, and feed each entry of the master list into the regex. This would probably be faster. S. -- Shevekhttp://www.anarres.org/ I am the Borg. http://design.anarres.org/
Someone must have solved this one ?
Heres a problem for the perl ancients among you. One of our customers ( I say our since like the Borg, ive joined a collective ) requires a regular deduplication of list information ( mostly CSV ) against a existing database (SQL Server 2k) . now im fairly sure that this is exactly what Perl was designed for ... however when searching for tools and advice on utilising those tools I do tend to come up a little non plussed. has anyone actually had experience ( postive or negative ) with such tools in doing mail address deduplication of text files with SQL Cheers for you input.. -- [EMAIL PROTECTED] http://www.wired4life.org/ Wired4Life, an Answer. -BEGIN GEEK CODE BLOCK- Version: 3.1 GB d+ s:- a C+++ U P++ L+++ E--- W++ N++ o K w--- O+ M+ V- PS+ PE- Y++ PGP 5+++ X R tv++ b+++ DI++ D--- G++ e* h--- r+++ z** --END GEEK CODE BLOCK--