> I'm new to SQL, and SQLite, and I find myself needing to
> identify duplicate records in a SQLite table (there are
> about 2K duplicates in a 36K row table).  Any suggestions
> or magic SQL queries appreciated :-)

Here are a couple of strategies I've used successfully for finding and
potentially removing duplicate records.  It assumes a primary key field that
is *separate* from the fields being checked for duplication.  This example
also assumes that I only want to check for duplication on two of the fields
in the table: field1 and field2.

Assume a table such as follows, that might included duplicate records.

    create table checkedtable
        (
            priKey    integer primary key,
            field1    char    (10),
            field2    char    (10),
            <other possible fields>
        );

and a temporary table to hold the duplicates

    create temp table dups
        (
            field1,
            field2,
            reccount,
            keyvalue
        );

The first method is probably the simplest:

    insert into dups (field1, field2, keyvalue)
        select field1, field2, max(priKey)
        from   checkedtable
        group by field1, field2


I don't know if this one works on SQLite, as I've never had reason to try
it.  (I've learned a fair bit about DB design in the last couple of years
before starting with SQLite, so duplication that needs correction is less of
a problem now :)  It does work on SQL Server 2000, though.

The second method is one that I used before I understood that GROUP BY could
be used for more than counting. :-)

Get all the records that have duplicate information and put them into the
temp table:

    insert into dups (field1, field2, reccount)
        select field1, field2, count(priKey)
        from checkedtable
        group by field1, field2
        having count(priKey) > 1

Now, get the key for one of them:

    update  dups
    set     keyvalue = checkedtable.priKey
    from    checkedtable
    where   checkedtable.field1 = checkedtable.field1
    and     checkedtable.field2 = checkedtable.field2



Using either of these methods, the temporary table now contains key values
that point to one of each set of duplicated records in the original table.
It can be used to delete them, archive them, change them, whatever, by
joining back to the original.

Note that the code presented here is off the top of my head, and not tested,
so there may be syntactic errors that I didn't catch while writing it.  With
any luck, there aren't any logic errors.  Hopefully, it will give you a
starting point.

Brad

Reply via email to