Hi Scott, all!
Scott Haneda wrote:
I have been researching how to deal with duplicate data. [[...]]
I am not in a situation where I can test for duplicates on insert, so the on duplicate key update deal does me no good here.
Most of what I have found points to selecting distinct records into a temp table, deleting the source records, and selecting back into the original: three steps. I imagine there is a bit of overhead to the temp table creation; it, after all, is writing a small file to disk.
Saying I want to favor newer records over old, and I have a PK of id, is there any flaw in:
SELECT DISTINCT(id) FROM table WHERE my_param = 'foo' ORDER BY id;
That gives me the newest unique record ids; then a little logic to convert the result set into a MySQL list, and:
DELETE FROM table WHERE my_param = 'foo' AND id NOT IN (the_result)
1) As 'id' is a PK in your example, all its values will be distinct anyway, so this approach cannot work.
2) Your mentioning of "favor newer records over old" and relating that to 'id' (your PK) is not justified in general, because there is no guarantee that 'id' values are assigned in ascending sequence - unless you use auto-increment, a timestamp, or something with a similar effect.
Assuming that "my_param" is the column that must not contain duplicates, what you want to do is something like
SELECT COUNT(id), my_param FROM table GROUP BY my_param HAVING COUNT(id) > 1
to find those values which should be unique but have duplicates.
(Note that this does not include a condition on "my_param", assuming the duplicates might occur for any value. You might be in a situation where you can add a restriction on "my_param").
If "id" values are really increasing, so you want to keep the maximum, you should follow the "groupwise maximum" approach described in the manual, it will yield those "id" values you want to keep.
As an alternative, search for the "groupwise minimum" of all groups with a "count" greater than one; these would be the values to delete.
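Such a minimum-based pass might look like this; note that one run removes only one duplicate per value, so it has to be repeated if a value can occur more than twice:

  -- delete the smallest id of every my_param value that occurs more than once
  DELETE FROM `table`
   WHERE id IN
         (SELECT del_id
            FROM (SELECT MIN(id) AS del_id
                    FROM `table`
                   GROUP BY my_param
                  HAVING COUNT(id) > 1) AS dupes);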
So I have two steps, I save the temp table creation, AND I don't bother doing anything at all in the case where there are only unique records to begin with, i.e. nothing to delete.
Is there something wrong with this? It seems a wee bit more efficient than the temp table deal for my case.
The way I understood your situation, this is going to be a recurrent problem, arising again and again. In that case, I would try to do without a table-copy approach: your data are likely to grow, and the older data will hopefully be "clean", so you would be shuffling larger and larger tables for an (about) constant number of erroneous records.
I would rather do it "in-place", because then the effort will not grow linearly with the table size. Sure, the time to find duplicate rows may rise as the table grows, but the time to fix them should remain about constant.
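One way to do that in place (again assuming the newer row is the one with the larger 'id') is a multi-table DELETE with a self-join, roughly:

  -- delete every row for which a newer row with the same my_param exists
  DELETE older
    FROM `table` AS older
    JOIN `table` AS newer
      ON newer.my_param = older.my_param
     AND newer.id > older.id;

This touches only the duplicate rows, so its cost depends on the number of duplicates (plus the join to find them), not on copying the whole table.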
HTH, Jörg