on 5/5/05 2:11 AM, Joerg Bruehe at [EMAIL PROTECTED] wrote:

> Hi Scott, all!
>
> Scott Haneda wrote:
>> I have been researching on how to deal with duplicate data. [[...]]
>>
>> I am not in a situation where I can test for duplicates on insert, so the on
>> duplicate key update deal does me no good here.
>>
>> Most data points to selecting distinct records to a temp table, deleting the
>> source records, and selecting back into the original, three steps. I
>> imagine there is a bit of overhead to the create temp table; it, after all,
>> is writing a small file to disk.
>>
>> Saying I want to favor newer records over old, and I have a pk of id, is
>> there any flaw in:
>> SELECT distinct(id) from table where my_param = 'foo' order by id;
>> that gives me the newest unique record ids
>> // little logic to convert result set into a mysql list
>> DELETE from table where my_param = 'foo' and id NOT IN (the_result)
>
> 1) As 'id' is a PK in your example, all its values will be distinct
> anyway, so this approach cannot work.
My mistake, sorry, was rushing.

> 2) Your mentioning of "favor newer records over old" and relating that
> with 'id' (your PK) is not justified in general, because there is no
> necessity that the 'id' value is formed in ascending sequence - unless
> you use auto-increment, timestamp, or something with a similar effect.

The id is an auto_inc indeed.

> Assuming that "my_param" is the column that must not contain duplicates,
> what you want to do is something like
>
> SELECT COUNT(id), my_param FROM table GROUP BY my_param HAVING COUNT(id) > 1

If you can bear with me, let me re-explain a little better:

id         - auto inc, unique key
user_id    - contains the user's id
product_id - this is where there can and will be dupes; I want the newest saved

So I
select distinct(product_id) from table where user_id = '123' order by id
delete from table where user_id = '123' AND product_id not in (above result list)

Ok?

> to find those values which should be unique but have duplicates.
> (Note that this does not include a condition on "my_param", assuming the
> duplicates might occur for any value. You might be in a situation where
> you can add a restriction on "my_param").
>
> If "id" values are really increasing, so you want to keep the maximum,
> you should follow the "groupwise maximum" approach described in the
> manual, it will yield those "id" values you want to keep.

I will need to look up groupwise maximum, never used that before.

> As an alternative, search for the "groupwise minimum" of all groups with
> a "count" greater than one, these would be values to delete.
>
>> So I have two steps, I save the temp table creation, AND I don't bother
>> doing anything at all in the case where there are only unique records to
>> begin with, i.e. nothing to delete.
>>
>> Is there something wrong with this? It seems a wee bit more efficient than
>> the temp table deal for my case.
>
> The way I understood your situation, this is going to be a recurrent
> problem, arising again and again. In that case, I would see to do
> without a table copy approach: Your data are likely to grow, and the
> older data will hopefully be "clean", so you will be shuffling larger
> and larger tables for an (about) constant number of erroneous records.

Basically, I have a shopping cart. This one is a little weird, for reasons
not worth explaining: you have a cart when a user is not logged in, and they
*may* have one they made at some point when logged in. There is a chance the
logged-in cart already has items that are also in the non-logged-in cart. I
don't want duplicate items in the cart, so I want to just get rid of the
oldest one.

> I would see to do it "in-place", because then the effort will not grow
> linearly with the table size. Sure, the time to find duplicate rows may
> rise as the table grows, but the time to fix should remain about constant.

Thanks for any additional pointers. I can select all and select distinct and
compare the two in a repeat loop in my code, but I am looking for a nice
all-in-MySQL way to pull it off.

--
-------------------------------------------------------------
Scott Haneda                                Tel: 415.898.2602
<http://www.newgeo.com>                     Novato, CA U.S.A.
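
A minimal sketch of the groupwise-maximum approach Joerg describes, applied
to the cart layout above (the table name "cart" and the literal user id are
assumptions, and the statements are untested):

  -- The groupwise maximum: the id of the newest row per product for this user.
  SELECT product_id, MAX(id) AS keep_id
    FROM cart
   WHERE user_id = '123'
   GROUP BY product_id;

  -- The same idea as a single in-place delete: remove every row for which a
  -- newer row with the same user_id and product_id exists, so only the newest
  -- row per product survives. A multi-table DELETE with a self-join avoids
  -- MySQL's restriction on subqueries that reference the table being deleted from.
  DELETE old_rows
    FROM cart AS old_rows
    JOIN cart AS newest
      ON newest.user_id    = old_rows.user_id
     AND newest.product_id = old_rows.product_id
     AND newest.id         > old_rows.id
   WHERE old_rows.user_id = '123';

The first statement only reports which rows would be kept; the second does the
cleanup in place, with no temp table, and if a user has no duplicates the join
matches nothing and nothing is deleted.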