I've been importing data from an old system, part of which contains a big table full of text. The table contains around 1.5 million entries, but so many of them are duplicates that the unique entries total fewer than 10,000.
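For reference, the table looks roughly like this (I'm paraphrasing the definition, so the column types are only approximate):

CREATE TABLE TEXT_TAB (
  TEXT_ID  INTEGER PRIMARY KEY,  -- surrogate key carried over from the old system
  TEXT     VARCHAR(4000)         -- the heavily duplicated text value
);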
It's pretty easy to create a list of these duplicate entries using a GROUP BY query:

INSERT INTO DUP_LIST (DUP_COUNT, NEW_ID, TEXT)
SELECT COUNT(1), MIN(TEXT_ID), TEXT
  FROM TEXT_TAB
 GROUP BY TEXT
HAVING COUNT(1) > 1;

That takes about 15 minutes to run. However, I also need a mapping between the old duplicate TEXT_IDs and the selected unique one, so I can change entries in the tables that reference TEXT_TAB (the MIN(TEXT_ID) in the query above picks the surviving row):

INSERT INTO DUP_MAP (OLD_TEXT_ID, NEW_TEXT_ID)
SELECT a.TEXT_ID, b.NEW_ID
  FROM TEXT_TAB a
  JOIN DUP_LIST b
    ON a.TEXT_ID > b.NEW_ID
   AND a.TEXT = b.TEXT;

That query takes over 3 hours. I'm pretty sure I'm doing something stupid here; the second query is obviously very bad. The first query touches all the information I need (the duplicate IDs), but I can't see how to extract it in a useful way. Is it possible?

Scott
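P.S. In case it matters, once DUP_MAP is populated the plan is an update along these lines against each referencing table (CHILD_TAB here is just a stand-in for the real table names, which I've left out):

-- Repoint a referencing table's TEXT_ID from a duplicate to the surviving id
UPDATE CHILD_TAB c
   SET c.TEXT_ID = (SELECT m.NEW_TEXT_ID
                      FROM DUP_MAP m
                     WHERE m.OLD_TEXT_ID = c.TEXT_ID)
 WHERE c.TEXT_ID IN (SELECT OLD_TEXT_ID FROM DUP_MAP);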