I've been importing data from an old system, part of which contains a 
big table full of text. The table contains around 1.5 million entries, 
but so many of them are duplicates that the unique entries total fewer 
than 10,000.

It's pretty easy to create a list of these duplicate entries using a 
GROUP BY query:

INSERT INTO DUP_LIST(DUP_COUNT, NEW_ID, TEXT)
   SELECT COUNT(1), MIN(TEXT_ID), TEXT
     FROM TEXT_TAB
     GROUP BY TEXT
     HAVING COUNT(1) > 1;

And that takes about 15 mins to run.

However, I need a mapping between the old duplicate TEXT_IDs and the 
selected unique one, so I can update the entries in the tables that 
reference TEXT_TAB (I'm using MIN(TEXT_ID) in the query above to pick 
the surviving ID):

INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
   SELECT a.TEXT_ID, b.NEW_ID
     FROM TEXT_TAB a
     JOIN DUP_LIST b
       ON a.TEXT_ID > b.NEW_ID
       AND a.TEXT = b.TEXT;

That query takes over 3 hours.

I'm pretty sure I'm doing something stupid here; that second query is 
obviously very bad. The first query already touches all the information 
I need (the duplicate IDs), but I can't see how to extract it in a 
useful way. Is it possible?
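For what it's worth, here's a small self-contained sketch of the kind of 
single-pass approach I'm hoping for, using an analytic/window function 
(MIN(TEXT_ID) OVER (PARTITION BY TEXT)) so every row learns its group's 
keeper in one scan, with no self-join on the TEXT column. It uses SQLite 
via Python only as a stand-in for the real database (and assumes 
window-function support, SQLite 3.25+); the table and column names are 
the ones from my schema above, with tiny made-up sample data:

```python
# Sketch: build DUP_MAP in a single pass with a window function,
# avoiding the self-join on the long TEXT column.
# SQLite stands in for the real database (needs window functions, 3.25+).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TEXT_TAB (TEXT_ID INTEGER PRIMARY KEY, TEXT TEXT);
    CREATE TABLE DUP_MAP (OLD_TEXT_ID INTEGER, NEW_TEXT_ID INTEGER);
    -- tiny made-up sample data: 'alpha' appears 3 times, 'beta' twice
    INSERT INTO TEXT_TAB VALUES
        (1, 'alpha'), (2, 'beta'), (3, 'alpha'), (4, 'alpha'), (5, 'beta');
""")

# One scan: pair each row with the smallest TEXT_ID sharing its TEXT;
# rows where the two differ are the duplicates that need remapping.
conn.execute("""
    INSERT INTO DUP_MAP (OLD_TEXT_ID, NEW_TEXT_ID)
    SELECT TEXT_ID, KEEP_ID
      FROM (SELECT TEXT_ID,
                   MIN(TEXT_ID) OVER (PARTITION BY TEXT) AS KEEP_ID
              FROM TEXT_TAB)
     WHERE TEXT_ID <> KEEP_ID
""")

print(conn.execute(
    "SELECT OLD_TEXT_ID, NEW_TEXT_ID FROM DUP_MAP ORDER BY OLD_TEXT_ID"
).fetchall())
# → [(3, 1), (4, 1), (5, 2)]
```

If the real database supports this, it should replace both the DUP_LIST 
build and the 3-hour join with a single pass over TEXT_TAB.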

Scott
