Matt,

If you went with option #3, you could avoid the looping by using (tCount - 1) as the LIMIT in the DELETE statement instead of always using 1.
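The refinement above can be sketched as follows. This is a minimal illustration using SQLite in place of MySQL; the table and column names (OrgContLink, OrgID, ContID) come from the thread below, and since SQLite lacks MySQL's DELETE ... LIMIT, the extra rows are targeted by rowid instead.

```python
import sqlite3

# Sketch of the "delete tCount - 1 rows per group" approach, so no
# repeat-until-empty loop is needed. SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE OrgContLink (OrgID INTEGER, ContID INTEGER)")
cur.executemany("INSERT INTO OrgContLink VALUES (?, ?)",
                [(1, 10), (1, 10), (1, 10), (2, 20), (3, 30), (3, 30)])

# One pass: find each duplicated (OrgID, ContID) group and delete all
# but one of its rows, using the group's count rather than LIMIT 1.
groups = cur.execute(
    "SELECT COUNT(*) AS tCount, OrgID, ContID FROM OrgContLink "
    "GROUP BY OrgID, ContID HAVING tCount > 1").fetchall()
for t_count, org_id, cont_id in groups:
    cur.execute(
        "DELETE FROM OrgContLink WHERE rowid IN ("
        "  SELECT rowid FROM OrgContLink WHERE OrgID=? AND ContID=? LIMIT ?)",
        (org_id, cont_id, t_count - 1))

rows = cur.execute(
    "SELECT OrgID, ContID FROM OrgContLink ORDER BY OrgID").fetchall()
print(rows)
# → [(1, 10), (2, 20), (3, 30)]
```

In MySQL itself the inner loop body would simply be DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT (tCount - 1), run once per duplicate group.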
Regards,
Nathan

-----Original Message-----
From: Matt Neimeyer [mailto:m...@neimeyer.org]
Sent: Tuesday, July 14, 2009 8:58 AM
To: mysql@lists.mysql.com
Subject: Removing Duplicate Records

In our database we have an Organizations table, a Contacts table, and a linking table that associates Contacts with Organizations. Occasionally we manually add rows to the linking table using information gleaned from outside data sources. This is common enough to be annoying, since it ends up creating duplicate linkages, but it's far from an everyday occurrence.

I have three options for dealing with the resulting duplicates, and I would appreciate some advice on which option might be best.

1. Attack the horrific spaghetti code that determines the Org and Contact IDs and then does the manual add, creating Orgs and Contacts as needed. Calling this code horrific is a kindness... we're talking evil. We've pretty much ruled this out due to the horror, but I mention that I considered it.

2. Do a CREATE TABLE, populate the new table with the results of a SELECT DISTINCT from the old table, then swap the tables.

3. Do "SELECT COUNT(*) AS tCount, OrgID, ContID FROM OrgContLink GROUP BY OrgID, ContID HAVING tCount > 1". Then, for every record in the result, do "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1". Repeat until no results are found.

I like option 2 insofar as it's more... atomic? One CREATE ... SELECT, one swap, and it's done. But even though it feels more "pure", I worry that the overhead of completely creating and destroying entire tables is throwing the baby out with the bathwater. In other words: is rebuilding a whole table for a few (hundred at most) offending duplicate records overkill?

I like option 3 in that it leaves everything as is, but it does require a lot of looping and feels inefficient. However, since we'd be running this only after we do our imports, it's not as if this looping, inefficient code would be running all the time.
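Option 2 can be sketched as below. This is a minimal illustration using SQLite in place of MySQL; the table and column names come from the question above. Note that in MySQL the swap step would be the atomic RENAME TABLE OrgContLink TO OrgContLink_old, OrgContLink_new TO OrgContLink; SQLite has no multi-table rename, so this sketch drops and renames instead.

```python
import sqlite3

# Sketch of option 2: build a deduplicated copy of the linking table
# with SELECT DISTINCT, then swap it into place. SQLite stands in for
# MySQL here.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE OrgContLink (OrgID INTEGER, ContID INTEGER)")
cur.executemany("INSERT INTO OrgContLink VALUES (?, ?)",
                [(1, 10), (1, 10), (2, 20), (3, 30), (3, 30)])

# Populate a new table with the distinct rows of the old one.
cur.execute("CREATE TABLE OrgContLink_new AS "
            "SELECT DISTINCT OrgID, ContID FROM OrgContLink")

# Swap: in MySQL this would be a single atomic RENAME TABLE statement.
cur.execute("DROP TABLE OrgContLink")
cur.execute("ALTER TABLE OrgContLink_new RENAME TO OrgContLink")

rows = cur.execute(
    "SELECT OrgID, ContID FROM OrgContLink ORDER BY OrgID").fetchall()
print(rows)
# → [(1, 10), (2, 20), (3, 30)]
```

The atomic rename is what makes this approach feel "pure": readers never see a half-deduplicated table, only the old one or the new one.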
I know I could probably also put a unique key on (OrgID, ContID), but unless I'm reading the docs wrong, I can't add a key in such a way that a duplicate-key insertion would silently fail without requiring the offending application to do "INSERT ... ON DUPLICATE KEY ...", which gets back to modifying the spaghetti code from option 1.

Thanks in advance for your advice.

Matt

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=nsulli...@cappex.com