>I've been importing data from an old system, part of which contains a big 
>table full of text. The table 
>contains around 1.5 million entries, but a lot of them are duplicate to the 
>the extent that unique entries 
>total less than 10,000.
>
>It's pretty easy to create a list of these duplicate entries using a GROUP BY 
>query:
>
>INSERT INTO DUP_LIST(DUP_COUNT, NEW_ID, TEXT)
>   SELECT COUNT(1), MIN(TEXT_ID), TEXT
>     FROM TEXT_TAB
>     GROUP BY TEXT
>     HAVING COUNT(1) > 1;
>
>And that takes about 15 mins to run.
>
>However I need a mapping between the old duplicate TEXT_IDs and the selected 
>unique one so I can change 
>entries in the tables that reference the TEXT_TAB (I'm using the MIN(TEXT_ID) 
>to select that in the above 
>query):
>
>INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
>   SELECT a.TEXT_ID, b.NEW_ID
>     FROM TEXT_TAB a
>     JOIN DUP_LIST b
>       ON a.TEXT_ID > b.NEW_ID
>       AND a.TEXT = b.TEXT;
>
>That query takes over 3 hours.
>
>I'm pretty sure I'm doing something stupid here, that second query is 
>obviously very bad. The first query 
>touches on all the info I need (duplicate IDs) but I can't see how to extract 
>that information in a useful 
>way. Is it possible?

Hi Scott!

I don't see anything stupid, though of course, you would not want an index to 
be used for a.TEXT_ID (since that will only eliminate the first out of an 
average of 150 duplicates. Depending on your plan, a slight change might make 
the query a bit quicker:

INSERT INTO DUP_MAP(OLD_TEXT_ID, NEW_TEXT_ID)
   SELECT a.TEXT_ID, b.NEW_ID
     FROM TEXT_TAB a
     JOIN DUP_LIST b
       ON a.TEXT_ID+0 > b.NEW_ID
       AND a.TEXT = b.TEXT;

Another, more important thing is whether you have an index on DUP_LIST.TEXT. If 
it is too long for an index, then create an index covering part of the field 
and add to your query as well if necessary (I hardly ever need expression 
indexes or calculated indexed fields, so I'm uncertain what you have to add).

I agree that inserting less than 150 records per second sounds like it could be 
improved.

HTH,
Set

Reply via email to