Re: The age old delete duplicates

Rhino Thu, 05 May 2005 05:31:28 -0700

----- Original Message ----- 
From: "Scott Haneda" <[EMAIL PROTECTED]>
To: "MySql" <mysql@lists.mysql.com>
Sent: Thursday, May 05, 2005 3:39 AM
Subject: The age old delete duplicates



> I have been researching on how to deal with duplicate data.  While I have
a
> case where there can be duplicate data, and I want to get rid of it, the
> general ideas I seem to find are as follows, plus my own, which I would
like
> opinions on since I have not seen it mentioned.
>
> I am not in a situation where I can test for duplicates on insert, so the
on
> duplicate key update deal does me no good here.
>
Okay, stop right there. If you design your tables with appropriate primary
keys, there is no reason you should be getting duplicates in the first
place. Furthermore, there is no reason that you have to test for those
duplicates either at the command line or in your programs.

For example, if I create an Employee table with one row per employee in my
company, and if I give that table an appropriate primary key, say employee
number or social security number or possibly the full name plus the date of
birth, MySQL itself will prohibit me from inserting any rows that duplicate
an existing row. For instance, if employee number is my primary key and I've
already got an employee whose number is 123, there is no way that MySQL will
allow me to insert a second record whose key is 123. It's that simple.


> Most data points to selecting distinct records to a temp table, deleting
the
> source records, and selecting back into the original, three steps.  I
> imagine there is a bit of overhead to the create temp table, it afterall,
is
> writing a small file to disk.
>
> saying I want to favor newer records over old, and I have a pk of id, is
> there any flaw in:
> SELECT distinct(id) from table where my_param = 'foo' order by id;
> that gives me the newest unique record id's
> // little logic to convert result set inot a mysql list
> DELETE from table where my_param = 'foo' and id NOT IN (the_result)
>
> So I have two steps, I save the temp table creation, AND I don't bother
> doing anything at all in the case where there are only uniqe records to
> begin with, ie: nothing to delete.
>
> Is there something wrong with this, is seems a wee bit more efficient that
> the temp table deal for my case?
>
This sort of messing around should only be necessary if you neglect to put a
primary key on the table in the first place. And I have yet to hear a
convincing case for ever not having a primary key in a relational database,
although I suppose it is always possible that someone is going to come up
with one.

It's not always obvious what the primary key should be though so if you're
having trouble thinking of a good primary key for your table(s), describe
the data and we'll try to help you choose the primary key(s) you need.

Seriously, I think you have to decide to keep the barn door closed (create
primary keys on all of your tables) before your horses have left rather than
worry about the best way to close it once the horses are gone.

Rhino



-- 
No virus found in this outgoing message.
Checked by AVG Anti-Virus.
Version: 7.0.308 / Virus Database: 266.11.5 - Release Date: 04/05/2005


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: The age old delete duplicates

Reply via email to