----- Original Message ----- From: "Scott Haneda" <[EMAIL PROTECTED]> To: "MySql" <mysql@lists.mysql.com> Sent: Thursday, May 05, 2005 3:39 AM Subject: The age old delete duplicates
> I have been researching on how to deal with duplicate data. While I have a > case where there can be duplicate data, and I want to get rid of it, the > general ideas I seem to find are as follows, plus my own, which I would like > opinions on since I have not seen it mentioned. > > I am not in a situation where I can test for duplicates on insert, so the on > duplicate key update deal does me no good here. > Okay, stop right there. If you design your tables with appropriate primary keys, there is no reason you should be getting duplicates in the first place. Furthermore, there is no reason that you have to test for those duplicates either at the command line or in your programs. For example, if I create an Employee table with one row per employee in my company, and if I give that table an appropriate primary key, say employee number or social security number or possibly the full name plus the date of birth, MySQL itself will prohibit me from inserting any rows that duplicate an existing row. For instance, if employee number is my primary key and I've already got an employee whose number is 123, there is no way that MySQL will allow me to insert a second record whose key is 123. It's that simple. > Most data points to selecting distinct records to a temp table, deleting the > source records, and selecting back into the original, three steps. I > imagine there is a bit of overhead to the create temp table, it afterall, is > writing a small file to disk. > > saying I want to favor newer records over old, and I have a pk of id, is > there any flaw in: > SELECT distinct(id) from table where my_param = 'foo' order by id; > that gives me the newest unique record id's > // little logic to convert result set inot a mysql list > DELETE from table where my_param = 'foo' and id NOT IN (the_result) > > So I have two steps, I save the temp table creation, AND I don't bother > doing anything at all in the case where there are only uniqe records to > begin with, ie: nothing to delete. > > Is there something wrong with this, is seems a wee bit more efficient that > the temp table deal for my case? > This sort of messing around should only be necessary if you neglect to put a primary key on the table in the first place. And I have yet to hear a convincing case for ever not having a primary key in a relational database, although I suppose it is always possible that someone is going to come up with one. It's not always obvious what the primary key should be though so if you're having trouble thinking of a good primary key for your table(s), describe the data and we'll try to help you choose the primary key(s) you need. Seriously, I think you have to decide to keep the barn door closed (create primary keys on all of your tables) before your horses have left rather than worry about the best way to close it once the horses are gone. Rhino -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.308 / Virus Database: 266.11.5 - Release Date: 04/05/2005 -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]