Re: Removing Duplicate Records
Ah... Yes. Good point. I like this because I was planning on keeping the output somewhere for a while. (In case we need an "accounting" at some point) So it will be easy enough to dump what's being deleted to the screen while we loop over our candidates. Thanks! On Tue, Jul 14, 2009 at 10:16 AM, Nathan Sullivan wrote: > If you went with option #3, you could avoid the looping by using (tCount - 1) > as the LIMIT in > the delete statement instead of always using 1. >> 3. Do... "SELECT count(*) AS tCount,OrgID,ContID FROM OrgContLink >> GROUP BY OrgID,ContID HAVING tCount > 1". Then for every record in the >> result "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1". >> Then repeat until no results are found. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Removing Duplicate Records
That's assuming that there is a unique identifier field, like an auto increment field. Although that could be added after the fact. Also, you need to run the query multiple times until it returns no affected records. So if there are 4 copies of a record, it would need to be run 3 times to get rid of all the dups. But I agree, that is the best way to remove duplicates in place provided the table is not too large. Brent Baisley On Tue, Jul 14, 2009 at 11:52 AM, Marcus Bointon wrote: > You can combine the two queries you have in option 3 (you'll need to change > field names, but you should get the idea), something like this: > DELETE table1 FROM table1, (SELECT MAX(id) AS dupid, COUNT(id) AS dupcnt > FROM table1 WHERE field1 IS NOT NULL GROUP BY link_id HAVING dupcnt>1) AS > dups WHERE table1.id=dups.dupid; > Marcus > -- > Marcus Bointon > Synchromedia Limited: Creators of http://www.smartmessages.net/ > UK resellers of i...@hand CRM solutions > mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/ > > > -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Re: Removing Duplicate Records
You can combine the two queries you have in option 3 (you'll need to change field names, but you should get the idea), something like this: DELETE table1 FROM table1, (SELECT MAX(id) AS dupid, COUNT(id) AS dupcnt FROM table1 WHERE field1 IS NOT NULL GROUP BY link_id HAVING dupcnt>1) AS dups WHERE table1.id=dups.dupid; Marcus -- Marcus Bointon Synchromedia Limited: Creators of http://www.smartmessages.net/ UK resellers of i...@hand CRM solutions mar...@synchromedia.co.uk | http://www.synchromedia.co.uk/ smime.p7s Description: S/MIME cryptographic signature
RE: Removing Duplicate Records
Matt, If you went with option #3, you could avoid the looping by using (tCount - 1) as the LIMIT in the delete statement instead of always using 1. Regards, Nathan -Original Message- From: Matt Neimeyer [mailto:m...@neimeyer.org] Sent: Tuesday, July 14, 2009 8:58 AM To: mysql@lists.mysql.com Subject: Removing Duplicate Records In our database we have an Organizations table and a Contacts table, and a linking table that associates Contacts with Organizations. Occassionally we manually add to the linking table with information gleaned from outside data sources. This is common enough to be annoying, since it ends up with duplicate linkages, but it's FAR from an everyday occurance. I have three options for dealing with the resulting duplicates and I would appreciate some advice on which option might be best. 1. Attack the horrific spaghetti code that determines the Org and Contact ids and then does the manual add. Creating Orgs and Contacts as needed. Calling this code horrific is a kindness... we're talking evil... We've pretty much ruled this out due to the horror... but I mention that I considered it. 2. Do a create table and populate that new table with the results of a select distinct from the old table then swap the tables. 3. Do... "SELECT count(*) AS tCount,OrgID,ContID FROM OrgContLink GROUP BY OrgID,ContID HAVING tCount > 1". Then for every record in the result "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1". Then repeat until no results are found. I like option 2 in so far as it's more... Atomic? One create...select, one swap and its done. But even though it feels more "pure" I worry that the overhead of completely creating and destroying entire tables seems like throwing the baby out with the bathwater. IOW: Is rebuilding a whole table for a few (hundred at most) offending duplicate records overkill. I like option 3 in that it leaves everything as is but does require a lot of looping and feels inefficient. However, since we'd be running this only after we do our imports it's not like this looping inefficient code would be running all the time. I know I could probably also put a unique key on both orgid and contid but unless I'm reading the docs wrong I can't add a key in such a way that the duplicate key insertion would silently fail without requiring the offending application to do "INSERT ... ON DUPLICATE KEY..." which gets back to modifying the spaghetti code from option 1. Thanks in advance for your advice. Matt -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=nsulli...@cappex.com -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
Removing Duplicate Records
In our database we have an Organizations table and a Contacts table, and a linking table that associates Contacts with Organizations. Occassionally we manually add to the linking table with information gleaned from outside data sources. This is common enough to be annoying, since it ends up with duplicate linkages, but it's FAR from an everyday occurance. I have three options for dealing with the resulting duplicates and I would appreciate some advice on which option might be best. 1. Attack the horrific spaghetti code that determines the Org and Contact ids and then does the manual add. Creating Orgs and Contacts as needed. Calling this code horrific is a kindness... we're talking evil... We've pretty much ruled this out due to the horror... but I mention that I considered it. 2. Do a create table and populate that new table with the results of a select distinct from the old table then swap the tables. 3. Do... "SELECT count(*) AS tCount,OrgID,ContID FROM OrgContLink GROUP BY OrgID,ContID HAVING tCount > 1". Then for every record in the result "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1". Then repeat until no results are found. I like option 2 in so far as it's more... Atomic? One create...select, one swap and its done. But even though it feels more "pure" I worry that the overhead of completely creating and destroying entire tables seems like throwing the baby out with the bathwater. IOW: Is rebuilding a whole table for a few (hundred at most) offending duplicate records overkill. I like option 3 in that it leaves everything as is but does require a lot of looping and feels inefficient. However, since we'd be running this only after we do our imports it's not like this looping inefficient code would be running all the time. I know I could probably also put a unique key on both orgid and contid but unless I'm reading the docs wrong I can't add a key in such a way that the duplicate key insertion would silently fail without requiring the offending application to do "INSERT ... ON DUPLICATE KEY..." which gets back to modifying the spaghetti code from option 1. Thanks in advance for your advice. Matt -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/mysql?unsub=arch...@jab.org
RE: removing duplicate records
If you use IGNORE in the insert IGNORE into new_table you will get the result you want. > -Original Message- > From: walt [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, July 30, 2002 3:11 PM > To: David Kramer; [EMAIL PROTECTED] > Subject: Re: removing duplicate records > > Thanks David! > > The entire row, not just one or two columns, is a duplicate which makes > life > fun.. > :-) > > I can key or unique index only a few columns once the data is cleaned up > to > prevent this problem. > > If I create an identical table and include either a key or unique index > (innodb seems to like the index better) on all the columns and do a > > insert into new_table using select * from old_table > > will mysql quit inserting once a duplicate is hit, or will it keep going > and > skip over the duplicates? > > walt > > On Tuesday 30 July 2002 03:57 pm, David Kramer wrote: > > You could always use an insert statement into a second table, when > > performing the insert use a GROUP BY clause to consalidate your records > > > > something along these lines but this isnt 100% accurate, I would need > the > > table DDL and business rules behind the DEDUP > > > > Insert into tableB > > ( > > column names, ... > > > > > > ) > > (select > > column a, > > column b, > > max(column c), --or you could use MIN > > from > > table A > > > > group by > > column a, > > column b); > > > > > > > > > > > > **JUST make sure your Identifing column, i.e. the column you use to tell > > which is a duplicate record or not is included in the group by. Also > what > > are the business rules behind the DEDUP(Deduplication)? Are the other > > values contained in the other columns necessary? If you tell me more > about > > what your trying to do and provide some Table DDL I can help you write > this > > query. Just let me know! > > > > Thanks, > > > > DK > > > > group by statement > > > > -Original Message- > > From: walt [mailto:[EMAIL PROTECTED]] > > Sent: Tuesday, July 30, 2002 12:43 PM > > To: [EMAIL PROTECTED] > > Subject: removing duplicate records > > > > > > Does anyone know a good way find and remove duplicate records from a > table? > > I can create an identical table structure and use a script to pull > records > > from the existing table and insert them into the new table if they are > not > > duplicates, but I'd rather not do it that way. Here is an example of an > sql > > script I use for Oracle databases > > > > delete from employee a > > whererowid < ( > > select max(rowid) > > fromemployee b > > where b.COL1 = a.COL1 > > and b.COL2 = a.COL2 > > and b.COL# = a.COL#); > > > > sql, query > > > > Thanks in advance! > > > > > - > Before posting, please check: >http://www.mysql.com/manual.php (the manual) >http://lists.mysql.com/ (the list archive) > > To request this thread, e-mail <[EMAIL PROTECTED]> > To unsubscribe, e-mail [EMAIL PROTECTED]> > Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
RE: removing duplicate records
Vivian - Im just curious as to why I shouldnt trust group by? Is there a bug or something within Mysql regarding the group by clause? Walt - you could also create a compound primary key that consists of all columns, then create a unique constraint on the Primary key. I've used this many times in the past, and it works quite well. The only caveat is the SQL will puke when it tries to load the duplicate records into the table, I personally would use the DEDUP process of Insert/select with a group by... But you might want to see what Vivian's response is on the group by issue. DK -Original Message- From: Vivian Wang [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 30, 2002 2:37 PM To: walt; David Kramer; [EMAIL PROTECTED] Subject: Re: removing duplicate records Walt, Don't trust group by. I am using mysqldump and sort -u on *nix for duplicate. Try both ways, let me know your result. Thanks At 04:11 PM 7/30/2002 -0400, walt wrote: >Thanks David! > >The entire row, not just one or two columns, is a duplicate which makes life >fun.. >:-) > >I can key or unique index only a few columns once the data is cleaned up to >prevent this problem. > >If I create an identical table and include either a key or unique index >(innodb seems to like the index better) on all the columns and do a > >insert into new_table using select * from old_table > >will mysql quit inserting once a duplicate is hit, or will it keep going and >skip over the duplicates? > >walt > >On Tuesday 30 July 2002 03:57 pm, David Kramer wrote: > > You could always use an insert statement into a second table, when > > performing the insert use a GROUP BY clause to consalidate your records > > > > something along these lines but this isnt 100% accurate, I would need the > > table DDL and business rules behind the DEDUP > > > > Insert into tableB > > ( > > column names, ... > > > > > > ) > > (select > > column a, > > column b, > > max(column c), --or you could use MIN > > from > > table A > > > > group by > > column a, > > column b); > > > > > > > > > > > > **JUST make sure your Identifing column, i.e. the column you use to tell > > which is a duplicate record or not is included in the group by. Also what > > are the business rules behind the DEDUP(Deduplication)? Are the other > > values contained in the other columns necessary? If you tell me more about > > what your trying to do and provide some Table DDL I can help you write this > > query. Just let me know! > > > > Thanks, > > > > DK > > > > group by statement > > > > -Original Message- > > From: walt [mailto:[EMAIL PROTECTED]] > > Sent: Tuesday, July 30, 2002 12:43 PM > > To: [EMAIL PROTECTED] > > Subject: removing duplicate records > > > > > > Does anyone know a good way find and remove duplicate records from a table? > > I can create an identical table structure and use a script to pull records > > from the existing table and insert them into the new table if they are not > > duplicates, but I'd rather not do it that way. Here is an example of an sql > > script I use for Oracle databases > > > > delete from employee a > > whererowid < ( > > select max(rowid) > > fromemployee b > > where b.COL1 = a.COL1 > > and b.COL2 = a.COL2 > > and b.COL# = a.COL#); > > > > sql, query > > > > Thanks in advance! > > > > >- >Before posting, please check: >http://www.mysql.com/manual.php (the manual) >http://lists.mysql.com/ (the list archive) > >To request this thread, e-mail <[EMAIL PROTECTED]> >To unsubscribe, e-mail ><[EMAIL PROTECTED]> >Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
Re: removing duplicate records
Walt, Don't trust group by. I am using mysqldump and sort -u on *nix for duplicate. Try both ways, let me know your result. Thanks At 04:11 PM 7/30/2002 -0400, walt wrote: >Thanks David! > >The entire row, not just one or two columns, is a duplicate which makes life >fun.. >:-) > >I can key or unique index only a few columns once the data is cleaned up to >prevent this problem. > >If I create an identical table and include either a key or unique index >(innodb seems to like the index better) on all the columns and do a > >insert into new_table using select * from old_table > >will mysql quit inserting once a duplicate is hit, or will it keep going and >skip over the duplicates? > >walt > >On Tuesday 30 July 2002 03:57 pm, David Kramer wrote: > > You could always use an insert statement into a second table, when > > performing the insert use a GROUP BY clause to consalidate your records > > > > something along these lines but this isnt 100% accurate, I would need the > > table DDL and business rules behind the DEDUP > > > > Insert into tableB > > ( > > column names, ... > > > > > > ) > > (select > > column a, > > column b, > > max(column c), --or you could use MIN > > from > > table A > > > > group by > > column a, > > column b); > > > > > > > > > > > > **JUST make sure your Identifing column, i.e. the column you use to tell > > which is a duplicate record or not is included in the group by. Also what > > are the business rules behind the DEDUP(Deduplication)? Are the other > > values contained in the other columns necessary? If you tell me more about > > what your trying to do and provide some Table DDL I can help you write this > > query. Just let me know! > > > > Thanks, > > > > DK > > > > group by statement > > > > -Original Message- > > From: walt [mailto:[EMAIL PROTECTED]] > > Sent: Tuesday, July 30, 2002 12:43 PM > > To: [EMAIL PROTECTED] > > Subject: removing duplicate records > > > > > > Does anyone know a good way find and remove duplicate records from a table? > > I can create an identical table structure and use a script to pull records > > from the existing table and insert them into the new table if they are not > > duplicates, but I'd rather not do it that way. Here is an example of an sql > > script I use for Oracle databases > > > > delete from employee a > > whererowid < ( > > select max(rowid) > > fromemployee b > > where b.COL1 = a.COL1 > > and b.COL2 = a.COL2 > > and b.COL# = a.COL#); > > > > sql, query > > > > Thanks in advance! > > > > >- >Before posting, please check: >http://www.mysql.com/manual.php (the manual) >http://lists.mysql.com/ (the list archive) > >To request this thread, e-mail <[EMAIL PROTECTED]> >To unsubscribe, e-mail ><[EMAIL PROTECTED]> >Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
Re: removing duplicate records
Thanks David! The entire row, not just one or two columns, is a duplicate which makes life fun.. :-) I can key or unique index only a few columns once the data is cleaned up to prevent this problem. If I create an identical table and include either a key or unique index (innodb seems to like the index better) on all the columns and do a insert into new_table using select * from old_table will mysql quit inserting once a duplicate is hit, or will it keep going and skip over the duplicates? walt On Tuesday 30 July 2002 03:57 pm, David Kramer wrote: > You could always use an insert statement into a second table, when > performing the insert use a GROUP BY clause to consalidate your records > > something along these lines but this isnt 100% accurate, I would need the > table DDL and business rules behind the DEDUP > > Insert into tableB > ( > column names, ... > > > ) > (select > column a, > column b, > max(column c), --or you could use MIN > from > table A > > group by > column a, > column b); > > > > > > **JUST make sure your Identifing column, i.e. the column you use to tell > which is a duplicate record or not is included in the group by. Also what > are the business rules behind the DEDUP(Deduplication)? Are the other > values contained in the other columns necessary? If you tell me more about > what your trying to do and provide some Table DDL I can help you write this > query. Just let me know! > > Thanks, > > DK > > group by statement > > -Original Message- > From: walt [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, July 30, 2002 12:43 PM > To: [EMAIL PROTECTED] > Subject: removing duplicate records > > > Does anyone know a good way find and remove duplicate records from a table? > I can create an identical table structure and use a script to pull records > from the existing table and insert them into the new table if they are not > duplicates, but I'd rather not do it that way. Here is an example of an sql > script I use for Oracle databases > > delete from employee a > whererowid < ( > select max(rowid) > fromemployee b > where b.COL1 = a.COL1 > and b.COL2 = a.COL2 > and b.COL# = a.COL#); > > sql, query > > Thanks in advance! - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
RE: removing duplicate records
You could always use an insert statement into a second table, when performing the insert use a GROUP BY clause to consalidate your records something along these lines but this isnt 100% accurate, I would need the table DDL and business rules behind the DEDUP Insert into tableB ( column names, ... ) (select column a, column b, max(column c), --or you could use MIN from table A group by column a, column b); **JUST make sure your Identifing column, i.e. the column you use to tell which is a duplicate record or not is included in the group by. Also what are the business rules behind the DEDUP(Deduplication)? Are the other values contained in the other columns necessary? If you tell me more about what your trying to do and provide some Table DDL I can help you write this query. Just let me know! Thanks, DK group by statement -Original Message- From: walt [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 30, 2002 12:43 PM To: [EMAIL PROTECTED] Subject: removing duplicate records Does anyone know a good way find and remove duplicate records from a table? I can create an identical table structure and use a script to pull records from the existing table and insert them into the new table if they are not duplicates, but I'd rather not do it that way. Here is an example of an sql script I use for Oracle databases delete from employee a whererowid < ( select max(rowid) fromemployee b where b.COL1 = a.COL1 and b.COL2 = a.COL2 and b.COL# = a.COL#); sql, query Thanks in advance! -- Walter Anthony System Administrator National Electronic Attachment Atlanta, Georgia 1-800-782-5150 ext. 1608 "If it's not broketweak it" - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
Re: removing duplicate records
If the columns are defined as not null, just add a unique index on all the fields. alter table ignore add unique(col1.col2.col#); walt wrote: >Does anyone know a good way find and remove duplicate records from a table? >I can create an identical table structure and use a script to pull records >from the existing table and insert them into the new table if they are not >duplicates, but I'd rather not do it that way. Here is an example of an sql >script I use for Oracle databases > >delete from employee a >whererowid < ( >select max(rowid) >fromemployee b >where b.COL1 = a.COL1 >and b.COL2 = a.COL2 >and b.COL# = a.COL#); > >sql, query > >Thanks in advance! > - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php
removing duplicate records
Does anyone know a good way find and remove duplicate records from a table? I can create an identical table structure and use a script to pull records from the existing table and insert them into the new table if they are not duplicates, but I'd rather not do it that way. Here is an example of an sql script I use for Oracle databases delete from employee a whererowid < ( select max(rowid) fromemployee b where b.COL1 = a.COL1 and b.COL2 = a.COL2 and b.COL# = a.COL#); sql, query Thanks in advance! -- Walter Anthony System Administrator National Electronic Attachment Atlanta, Georgia 1-800-782-5150 ext. 1608 "If it's not broketweak it" - Before posting, please check: http://www.mysql.com/manual.php (the manual) http://lists.mysql.com/ (the list archive) To request this thread, e-mail <[EMAIL PROTECTED]> To unsubscribe, e-mail <[EMAIL PROTECTED]> Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php