Matt,

If you went with option #3, you could avoid the looping by using (tCount - 1) 
as the LIMIT in the delete statement instead of always using 1.
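For illustration, the combined form might look like the following (the specific OrgID/ContID values are made up; note that MySQL's LIMIT clause takes a literal, so the application would substitute the counted value, not the expression `tCount - 1`, into the DELETE):

```sql
-- Find each duplicated (OrgID, ContID) pair and how many copies exist
SELECT COUNT(*) AS tCount, OrgID, ContID
FROM OrgContLink
GROUP BY OrgID, ContID
HAVING tCount > 1;

-- For each row returned, delete all but one copy in a single pass.
-- Here tCount was 3 for the pair (42, 7), so the limit is 3 - 1 = 2:
DELETE FROM OrgContLink WHERE OrgID = 42 AND ContID = 7 LIMIT 2;
```

That removes the outer repeat-until-empty loop: one DELETE per duplicated pair.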


Regards,
Nathan

-----Original Message-----
From: Matt Neimeyer [mailto:m...@neimeyer.org] 
Sent: Tuesday, July 14, 2009 8:58 AM
To: mysql@lists.mysql.com
Subject: Removing Duplicate Records

In our database we have an Organizations table and a Contacts table,
and a linking table that associates Contacts with Organizations.
Occasionally we manually add to the linking table with information
gleaned from outside data sources. This is common enough to be
annoying, since it ends up with duplicate linkages, but it's FAR from
an everyday occurrence.

I have three options for dealing with the resulting duplicates and I
would appreciate some advice on which option might be best.

1. Attack the horrific spaghetti code that determines the Org and
Contact ids and then does the manual add, creating Orgs and Contacts
as needed.

Calling this code horrific is a kindness... we're talking evil...
We've pretty much ruled this out due to the horror... but I mention
that I considered it.

2. Do a create table and populate that new table with the results of a
select distinct from the old table then swap the tables.
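A rough sketch of that, assuming the whole row is the duplicate (table name from the thread; the old table is dropped in a separate step so it can be inspected first):

```sql
-- Build a duplicate-free copy with the same structure
CREATE TABLE OrgContLink_new LIKE OrgContLink;
INSERT INTO OrgContLink_new SELECT DISTINCT * FROM OrgContLink;

-- Swap the tables atomically, then discard the old one
RENAME TABLE OrgContLink TO OrgContLink_old,
             OrgContLink_new TO OrgContLink;
DROP TABLE OrgContLink_old;
```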

3. Do... "SELECT count(*) AS tCount,OrgID,ContID FROM OrgContLink
GROUP BY OrgID,ContID HAVING tCount > 1". Then for every record in the
result "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1".
Then repeat until no results are found.

I like option 2 in so far as it's more... atomic? One create...select,
one swap, and it's done. But even though it feels more "pure", I worry
about the overhead of completely creating and destroying entire
tables. IOW: Is rebuilding a whole table for a few (hundred at most)
offending duplicate records overkill?

I like option 3 in that it leaves everything as is but does require a
lot of looping and feels inefficient. However, since we'd be running
this only after we do our imports it's not like this looping
inefficient code would be running all the time.

I know I could probably also put a composite unique key on (orgid,
contid), but unless I'm reading the docs wrong, I can't add a key in
such a way that the duplicate key insertion would silently fail
without requiring the offending application to do "INSERT ... ON
DUPLICATE KEY...", which gets back to modifying the spaghetti code
from option 1.
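For concreteness, the key under consideration and the kind of statement change it would force might look like this (key name and values are illustrative only):

```sql
-- The composite unique key being considered
ALTER TABLE OrgContLink ADD UNIQUE KEY uq_org_cont (OrgID, ContID);

-- With the key in place, inserting an existing pair raises a
-- duplicate-key error unless the application is changed to
-- something like this no-op-on-duplicate form:
INSERT INTO OrgContLink (OrgID, ContID) VALUES (42, 7)
  ON DUPLICATE KEY UPDATE OrgID = OrgID;
```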

Thanks in advance for your advice.

Matt

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql?unsub=nsulli...@cappex.com

