Re: [sqlite] Speed up count(distinct col)

BareFeetWare Fri, 11 Feb 2011 16:48:46 -0800

On 11/02/2011, at 11:51 PM, Yuzem wrote:

>> you want the number of each unique (capacity, person). No problem. We'll 
>> just add a People column to the statistics, and change the uniqueness 
>> constraint.


Hmm, seems I didn't quite change the uniqueness constraint. I meant to allow 
multiple capacities and multiple people, but keep each combination of 
(capacity, person) unique. So we need a Capacity_ID column and to use it 
instead of ID when inserting. The schema should therefore be:

begin immediate
;
drop table if exists "Capacity People Statistics"
;
create table "Capacity People Statistics"
(       ID integer primary key not null
,       Capacity_ID integer not null references "Capacity" (ID) on delete 
cascade
,       People_ID integer not null references "People" (ID) on delete cascade
,       Count integer not null
,       unique (Capacity_ID, People_ID)
)
;
insert into "Capacity People Statistics" (Capacity_ID, People_ID, Count)
select Capacity_ID, People_ID, count(*) from "Movie People"
group by Capacity_ID, People_ID having Count > 0
;
drop trigger if exists "Movie People insert"
;
create trigger "Movie People insert"
on "Movie People"
after insert
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
        new.Capacity_ID
,       new.People_ID
,       (       select coalesce(Count, 0) + 1 from "Capacity People Statistics"
                where Capacity_ID = new.Capacity_ID and People_ID = 
new.People_ID
        )
;
end
;
drop trigger if exists "Movie People delete"
;
create trigger "Movie People delete"
on "Movie People"
after delete
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
        old.Capacity_ID
,       old.People_ID
,       (       select coalesce(Count, 0) - 1 from "Capacity People Statistics"
                where Capacity_ID = old.Capacity_ID and People_ID = 
old.People_ID
        )
;
end
;
drop trigger if exists "Movie People update"
;
create trigger "Movie People update"
on "Movie People"
after update of Capacity_ID, People_ID
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
        old.Capacity_ID
,       old.People_ID
,       (       select coalesce(Count, 0) - 1 from "Capacity People Statistics"
                where Capacity_ID = old.Capacity_ID and People_ID = 
old.People_ID
        )
;
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
        new.Capacity_ID
,       new.People_ID
,       (       select coalesce(Count, 0) + 1 from "Capacity People Statistics"
                where Capacity_ID = new.Capacity_ID and People_ID = 
new.People_ID
        )
;
end
;
commit
;

What is your source for the data? If I have that, I can test my schema 
properly, rather than sitting here entering a pile of dummy data. I looked at 
IMDB, but they only seem to have a paid license download.

> That should work, if I understand it correctly now I can count directors with:
> SELECT count(*) FROM "Capacity People Statistics" WHERE ID = 'directors';

No. Sorry, I should have included that. You count directors like this:

select sum(Count) from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
;

or count how many times a particular person is involved in movies:

select sum(Count) from "Capacity People Statistics"
where People_ID = (select ID from People where Name = 'Clint Eastwood')
;

or count the number of times a particular person is a director (notice no sum 
needed):

select Count from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
        and People_ID = (select ID from People where Name = 'Clint Eastwood')
;

> It is similar to using different tables with only the IDs.
> Instead of using a table for directors another for writers, etc... this is
> specified in a column, it is slower but it uses only one table.

I don't think you'll find a noticeable speed difference. It's searching just 
indexed primary key integer columns, which is very fast. It is optimized (with 
auto indexes) for searching by Capacity_ID, People_ID in that order, and 
searching by Capacity_ID alone. If you also do a lot of searches by People_ID 
alone, then add an index:

create index "Capacity People Statistics - People"
on "Capacity People Statistics" (People_ID)
;

> If I use different tables the result is instant, I don't know if it will be 
> much faster to count from "Capacity People Statistics" than counting from
> "Movie People".

I expect the above to be about the same speed or faster (since part of the 
counting is already done) than separate tables, but far more flexible (eg no 
need to add a table to accommodate a new capacity), and better normalized.

> Another thing: I don't understand the purpose of the Count column in table
> "Capacity People Statistics"

It hopefully now makes sense with my correction and example selects above.

Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Speed up count(distinct col)

Reply via email to