On 09/28/2010 10:36 PM, Tarlika Elisabeth Schmitz wrote:
On Tue, 28 Sep 2010 11:34:31 +0100
"Oliveiros d'Azevedo Cristina"<oliveiros.crist...@marktest.pt> wrote:
----- Original Message -----
From: "Tarlika Elisabeth Schmitz"<postgre...@numerixtechnology.de>
To:<pgsql-sql@postgresql.org>
Sent: Monday, September 27, 2010 5:54 PM
Subject: Re: [SQL] identifying duplicates in table with redundancies
On Fri, 24 Sep 2010 18:12:18 +0100
Oliver d'Azevedo Christina<oliveiros.crist...@gmail.com> wrote:
SELECT DISTINCT trainer_id, trainer_name
FROM (
    SELECT trainer_name  -- the field you want to test for duplicates
    FROM (
        SELECT DISTINCT trainer_id, trainer_name
        FROM student
    ) x
    GROUP BY trainer_name  -- the field you want to test for duplicates
    HAVING COUNT(*) > 1
) z
NATURAL JOIN student y
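For reference, a sketch of a flatter formulation that should return the same rows (assuming, as above, a `student` table with `trainer_id` and `trainer_name` columns): each trainer_name associated with more than one distinct trainer_id, together with those ids.

```sql
-- Sketch: trainer_names that map to more than one trainer_id,
-- listed with each of those ids (equivalent to the nested query above).
SELECT trainer_id, trainer_name
FROM student
WHERE trainer_name IN (
    SELECT trainer_name
    FROM student
    GROUP BY trainer_name
    HAVING COUNT(DISTINCT trainer_id) > 1
)
GROUP BY trainer_id, trainer_name;
```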
What indices would you recommend for this operation?
But for this query in particular I would recommend an index on
trainer_name, as this field is used in the join and in the GROUP
BY. For the other query, the one you get by substituting trainer_id
for trainer_name, place an index on trainer_id.
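A sketch of those two index definitions (the index names are my own invention; the table and columns are as discussed in the thread):

```sql
-- Hypothetical index names; one plain b-tree index per candidate column.
CREATE INDEX idx_student_trainer_name ON student (trainer_name);
CREATE INDEX idx_student_trainer_id   ON student (trainer_id);
```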
Also, these indexes may help speed up an ORDER BY clause, if you use
one. If you have a table with lots of data, you can try each variant and
see how performance varies (and don't forget there's also EXPLAIN
ANALYZE).
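To see whether the planner actually uses an index, the query can be prefixed with EXPLAIN ANALYZE (note that this runs the query), for example on the duplicate-detection subquery from above:

```sql
-- Shows the actual plan, row counts, and timings; the query is executed.
EXPLAIN ANALYZE
SELECT trainer_name
FROM student
GROUP BY trainer_name
HAVING COUNT(*) > 1;
```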
Strangely, these indices did not make any difference: without them, the
query took about 8500ms; with the indexes, the same.
The table has 250000 records. 11000 have trainer_name = NULL, and there
are only 13000 unique trainer_names.
It is not hugely important, as these queries are not time-critical.
This is only a helper table, which I use to analyze the data prior to
populating the destination tables with the data.
Regards,
Tarlika
I guess EXPLAIN ANALYZE shows a seq scan. Try avoiding DISTINCT; use
GROUP BY instead.
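A sketch of that substitution for the innermost subquery (same assumed `student` table): GROUP BY over both columns returns the same distinct pairs and can give the planner more options.

```sql
-- Instead of: SELECT DISTINCT trainer_id, trainer_name FROM student
SELECT trainer_id, trainer_name
FROM student
GROUP BY trainer_id, trainer_name;
```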
regards
Andreas
--
Sent via pgsql-sql mailing list (pgsql-sql@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql