I read CSV files into a table using 

mysqlimport  --fields-terminated-by=',' --replace .....

That works fine and, with an appropriate unique index, it copes with
the fact that some of the data appears in more than one CSV file.
However, the date format is not the same in all of the files, so the
same record can arrive as, say, 2004-01-15 in one file and 15/01/2004
in another.  The result is duplicate records in the resulting table
that do not violate the unique key.

I devised a way of adding an additional column and converting the
dates into a "proper" date format, thereby making the duplicates
detectable.  There are several ways I can get rid of them (rough
sketches of what I mean follow the list):

a) Write out the table using SELECT DISTINCT ... INTO OUTFILE with
the appropriate GROUP BY.  Then truncate the table and read it back
from the OUTFILE using LOAD DATA INFILE.

b) Create a new table with the same structure but with a slightly
different multiple-column unique key.  Use REPLACE INTO the new table
so that the new key removes the duplicates.  Then truncate the
original table and copy the records back from the interim table.
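
To make that concrete, here is roughly what I have in mind in SQL.
All of the names and paths (mytable, mytable_dedup, date_txt,
obs_date, site, value, old_unique_key, /tmp/mytable_dedup.csv) are
invented for illustration, I'm assuming a MySQL recent enough to have
STR_TO_DATE() and CREATE TABLE ... LIKE, and a) and b) are
alternatives shown one after the other -- so treat this as a sketch
rather than working code:

  -- 1) Additional column holding the dates in one proper DATE
  --    format; one UPDATE per incoming format (the formats shown
  --    here are only examples).
  ALTER TABLE mytable ADD COLUMN obs_date DATE;
  UPDATE mytable SET obs_date = STR_TO_DATE(date_txt, '%d/%m/%Y')
   WHERE date_txt LIKE '__/__/____';
  UPDATE mytable SET obs_date = STR_TO_DATE(date_txt, '%Y-%m-%d')
   WHERE date_txt LIKE '____-__-__';

  -- 2) Method a): dump one row per (site, obs_date, value) group,
  --    truncate, reload.  The OUTFILE must not already exist.
  SELECT site, MIN(date_txt), obs_date, value
    INTO OUTFILE '/tmp/mytable_dedup.csv'
         FIELDS TERMINATED BY ','
    FROM mytable
   GROUP BY site, obs_date, value;
  TRUNCATE TABLE mytable;
  LOAD DATA INFILE '/tmp/mytable_dedup.csv'
    INTO TABLE mytable
    FIELDS TERMINATED BY ','
    (site, date_txt, obs_date, value);

  -- 3) Method b): interim table whose unique key is on the
  --    converted date, so REPLACE collapses the duplicates.
  CREATE TABLE mytable_dedup LIKE mytable;
  ALTER TABLE mytable_dedup
    DROP INDEX old_unique_key,
    ADD UNIQUE KEY uniq_site_date (site, obs_date);
  REPLACE INTO mytable_dedup SELECT * FROM mytable;
  TRUNCATE TABLE mytable;
  INSERT INTO mytable SELECT * FROM mytable_dedup;
  DROP TABLE mytable_dedup;

The point of the interim table in b) is only that REPLACE needs a
unique key over the *converted* date before it can see the duplicates.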


There's a major hassle with method a): the OUTFILE has to be removed
before the same file name can be used again, and that requires access
privileges to the database directory.  I'm not entirely happy with
method b) either, since it is neither particularly scalable nor
generic.

I'd have thought this would be an FAQ, but it appears not, judging
from my searching.  I normally work with a language called S (the R
dialect), which handles matrices in a multitude of ways.  Removing
duplicates is straightforward in that language, but from what I know
of SQL so far, it is rather complicated in MySQL.


What do other people do with duplicates?

TIA

-- 
   ___     Patrick Connolly      
 {~._.~}   
 _( Y )_          Good judgment comes from experience 
(:_~*~_:)         Experience comes from bad judgment    
 (_)-(_)            

