on 5/5/05 2:11 AM, Joerg Bruehe at [EMAIL PROTECTED] wrote:

> Hi Scott, all!
>
> Scott Haneda wrote:
>> I have been researching on how to deal with duplicate data. [[...]]
>>
>> I am not in a situation where I can test for duplicates on insert, so the on
>> duplicate key update deal does me no good here.
>>
>> Most data points to selecting distinct records to a temp table, deleting the
>> source records, and selecting back into the original, three steps. I
>> imagine there is a bit of overhead to the create temp table; it, after all,
>> is writing a small file to disk.
>>
>> Saying I want to favor newer records over old, and I have a pk of id, is
>> there any flaw in:
>> SELECT distinct(id) from table where my_param = 'foo' order by id;
>> that gives me the newest unique record ids
>> // little logic to convert result set into a mysql list
>> DELETE from table where my_param = 'foo' and id NOT IN (the_result)
>
> 1) As 'id' is a PK in your example, all its values will be distinct
> anyway, so this approach cannot work.
My mistake, sorry, was rushing.

> 2) Your mentioning of "favor newer records over old" and relating that
> with 'id' (your PK) is not justified in general, because there is no
> necessity that the 'id' value is formed in ascending sequence - unless
> you use auto-increment, timestamp, or something with a similar effect.

The id is an auto_inc indeed.

> Assuming that "my_param" is the column that must not contain duplicates,
> what you want to do is something like
>
> SELECT COUNT(id), my_param FROM table GROUP BY my_param HAVING COUNT(id) > 1

If you can bear with me, let me re-explain a little better:

id         - auto inc, unique key
user_id    - contains the user's id
product_id - this is where there can and will be dupes; I want the newest saved

So I
select distinct(product_id) from table where user_id = '123' order by id
delete from table where user_id = '123' AND product_id not in (above result list)

Ok?

> to find those values which should be unique but have duplicates.
> (Note that this does not include a condition on "my_param", assuming the
> duplicates might occur for any value. You might be in a situation where
> you can add a restriction on "my_param").
>
> If "id" values are really increasing, so you want to keep the maximum,
> you should follow the "groupwise maximum" approach described in the
> manual, it will yield those "id" values you want to keep.

I will need to look up groupwise maximum, never used that before.

> As an alternative, search for the "groupwise minimum" of all groups with
> a "count" greater than one, these would be values to delete.
>
>> So I have two steps, I save the temp table creation, AND I don't bother
>> doing anything at all in the case where there are only unique records to
>> begin with, i.e. nothing to delete.
>>
>> Is there something wrong with this? It seems a wee bit more efficient than
>> the temp table deal for my case.
>
> The way I understood your situation, this is going to be a recurrent
> problem, arising again and again. In that case, I would see to do
> without a table copy approach: Your data are likely to grow, and the
> older data will hopefully be "clean", so you will be shuffling larger
> and larger tables for an (about) constant number of erroneous records.

Basically, I have a shopping cart. This one is a little weird, for reasons
not worth explaining: you have a cart when a user is not logged in, and they
*may* have one they made at some point when logged in. There is a chance the
logged-in cart already has items that are also in the non-logged-in cart. I
don't want duplicate items in the cart, so I want to just get rid of the
oldest one.

> I would see to do it "in-place", because then the effort will not grow
> linearly with the table size. Sure, the time to find duplicate rows may
> rise as the table grows, but the time to fix should remain about constant.

Thanks for any additional pointers. I can select all and select distinct and
compare the two in a repeat loop in my code, but I am looking for a nice
all-in-MySQL way to pull it off.

--
-------------------------------------------------------------
Scott Haneda                                Tel: 415.898.2602
<http://www.newgeo.com>                     Novato, CA U.S.A.
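
A minimal sketch of the groupwise-maximum approach Joerg describes, applied
to the cart layout above (the table name "cart" and the literal user id are
assumptions, and the statements are untested):

  -- The groupwise maximum: the id of the newest row per product for this user.
  SELECT product_id, MAX(id) AS keep_id
    FROM cart
   WHERE user_id = '123'
   GROUP BY product_id;

  -- The same idea as a single in-place delete: remove every row for which a
  -- newer row with the same user_id and product_id exists, so only the newest
  -- row per product survives. A multi-table DELETE with a self-join avoids
  -- MySQL's restriction on subqueries that reference the table being deleted from.
  DELETE old_rows
    FROM cart AS old_rows
    JOIN cart AS newest
      ON newest.user_id    = old_rows.user_id
     AND newest.product_id = old_rows.product_id
     AND newest.id         > old_rows.id
   WHERE old_rows.user_id = '123';

The first statement only reports which rows would be kept; the second does the
cleanup in place, with no temp table, and if a user has no duplicates the join
matches nothing and nothing is deleted.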