On 6/14/2012 5:57 PM, Gary Aitken wrote:
Hi all,

I've looked high and low for what I hope is a trivial answer.

I was trying to load a table using LOAD DATA INFILE.  Unfortunately, it craps 
out because there are some duplicate primary keys.  Not surprising as the 
source did not enforce uniqueness.  My problem is the load data simply dies 
without indicating which line of the input file was in error; the error message 
refers to line 3, which is not even the SQL statement for the LOAD DATA INTO 
statement:

I can get the table loaded by specifying REPLACE INTO TABLE, but that still 
leaves me with not knowing where the duplicate records are.

So...  I wanted to read the data line at a time and use a plain INSERT 
statement.  That way I could check for duplicate keys and discover where the 
duplicate records are.  However, I can't find a way to read input from the 
console or a file.  What am I missing?  I know I could write a java or C++ 
program to do this, but it seems like overkill for what should be a trivial 
task.

Thanks for any pointers,

Gary


The trivial thing I do to solve this problem is to create a copy of the destination table without any PRIMARY KEY or UNIQUE constraints on it. This gives you an empty space to which you can bulk import your raw data. I am not sure if there is an official term for this but I call it a 'landing table'. This is the first step of the import process.

Once you can get your data off of disk and onto the landing table (it's where the raw import lands inside the database) you can check it for duplicates very easily.

1) create a normal index for the PRIMARY KEY column
2) create another table that has a list of duplicateslike this

CREATE TABLE dup_list ENGINE=MYISAM SELECT pkcol, count(*) freq FROM landing GROUP BY pkcol HAVING freq >1;

notes:
* use a MyISAM table for this preparation work even if the destination table is using the InnoDB storage engine, you really do not need to protect this data with a transaction yet.
* MyISAM indexes are also very fast for count(*) queries.

You have clearly identified all duplicate rows in the incoming data. You can also compare those rows with your existing rows to see if any of them duplicate each other (hint: INNER JOIN is your friend) or if any exist in one table but not the other (hint: LEFT JOIN).

From here you should be able to cleanse the incoming data (remove duplicates, adjust any weird fields) and merge it with your existing data to maintain the relational and logical integrity of your tables.

Regards,
--
Shawn Green
MySQL Principal Technical Support Engineer
Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
Office: Blountville, TN



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql

Reply via email to