[SQL] identifying duplicates in table with redundancies

Tarlika Elisabeth Schmitz Thu, 23 Sep 2010 15:35:45 -0700

I loaded data from a spread into a interim table so I can analyze the
quality of the data.


The table contains an entry for every student (250K records) and his
trainer. Eventually, I want to extract a unique list of trainers from
it. But first of all I want to check for duplicates:

1) multiples trainer names for same trainer id
2) multiple trainer ids for same trainer name

I cobbled together the SQL and it does the job but it seems rather
convoluted. I would like to know how I can improve it. 

CREATE TABLE student (
id INTEGER NOT NULL,
name VARCHAR(256) NOT NULL,
trainer_id INTEGER,
trainer_name VARCHAR(256),
);

====
EXAMPLE DATA

22 John 1 Macdonald
23 Jane 1 MacDonald
24 Paul 1 MacDonald
25 Dick 2 Smith
26 Bill 3 Smith
27 Kate 3 Smith
====

-- outputs trainer ids which appear under different names
select trainer_id, trainer_name from 
(
    -- different id/name combinations
    select distinct on (trainer_name) trainer_id, trainer_name
    from student
    where trainer_id in
    (
        -- trainer ids with appearing with different names
        select distinct on (id) id
        from 
        (
            -- distinct trainer id-name
            select distinct on (trainer_id,trainer_name) 
            trainer_id as id, 
            trainer_name as name from student
        ) as trainer
        group by trainer.id
        having count (trainer.name) > 1 
    ) 
) as y
order by trainer_id

-- 

Best Regards,
Tarlika Elisabeth Schmitz

-- 
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql

[SQL] identifying duplicates in table with redundancies

Reply via email to