Re: [sqlite] Speed up count(distinct col)

2011-02-17 Thread Yuzem


BareFeetWare-2 wrote:
> 
> Oh, I see, so the "complication" is that you would have to change the
> columns used in your bash script. That shouldn't be a big problem, but
> I'll leave the bash script to you.
> 
It isn't a big problem, the "complication" was to adapt all the tables and
inserts and column declarations only for testing.


BareFeetWare-2 wrote:
> 
> If for some reason rewriting the insert command in the bash script is
> insurmountable, you can just create a view in SQL to match the
> expectations of the bash script. That view can funnel each insert to the
> underlying SQL schema table columns using an "instead of insert" trigger.
> Let me know if you need more info on this.
> 
That's not necessary, I will adapt your code to my database schema.


BareFeetWare-2 wrote:
> 
> or you can change the delete trigger to remove statistics that drop to a
> zero count:
> 
> begin immediate
> ;
> drop trigger if exists "Movie People delete"
> ;
> create trigger "Movie People delete"
> on "Movie People"
> after delete
> begin
> insert or replace into "Capacity People Statistics" (Capacity_ID,
> People_ID, Count)
> select
>   old.Capacity_ID
> , old.People_ID
> , (   select coalesce(Count, 0) - 1 from "Capacity People Statistics"
>   where Capacity_ID = old.Capacity_ID and People_ID = 
> old.People_ID
>   )
> ;
> delete from "Capacity People Statistics"
> where Count = 0
> ;
> end
> ;
> commit
> ;
> 
Oh, I see, that makes a lot of sense, that was one of the problem I had,
this way I can delete directors without using a distinct, great!


BareFeetWare-2 wrote:
> 
> Great, I'm glad we finally got there :-)
> 
Thank you very much for all your help, I will try to adapt the code to my
schema, I will let you know if I meet any problem.
Thanks again!
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30954516.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-15 Thread BareFeetWare
> IMDb Pro offers additional information and services, all the information I
> am grabbing is already available to the public and I am not publishing this
> information, there are a lot of programs that do this.

OK, good to know, thanks.

>>> The script is written in bash and I can give you the code but I think it
>>> would be very complicated to adapt it to your schema.
>> 
>> Complicated? Why? In what format is the grabbed data (ie what tables,
>> columns, rows)? It usually just takes an insert select to move data from
>> one table's schema to another.
>> 
> Well, you would have to replace the insert commands, lets say that I have
> table movies with columns id,title,year,plot
> In the script I have to set the columns as variables:
> id=tt12345
> title="movie title"
> year=2011
> plot="this is the plot"
> 
> Ant then I call the command insert:
> insert movies
> 
> The command insert knows the columns for each table, it goes column by
> column, if the column is declared as a variable it is inserted so you would
> have to replace all those commands with something like:
> echo "INSERT INTO movies" etc...

Oh, I see, so the "complication" is that you would have to change the columns 
used in your bash script. That shouldn't be a big problem, but I'll leave the 
bash script to you.

If for some reason rewriting the insert command in the bash script is 
insurmountable, you can just create a view in SQL to match the expectations of 
the bash script. That view can funnel each insert to the underlying SQL schema 
table columns using an "instead of insert" trigger. Let me know if you need 
more info on this.

> Let me know if you are interested in the script, the script is included in 
> https://launchpad.net/figuritas/+download the application , but I can post
> and explain the relevant functions if you want.

No, that's fine, I just couldn't understand why changing the insert statement 
in the bash script is a problem.

>> What you want, I think, however, is how many people there are who are
>> directors (counting each person just once, even if they direct multiple
>> movies), given by this:
>> 
>> select count(*) from "Capacity People Statistics"
>> where Capacity_ID = (select ID from Capacity where Name = 'director')
>> ;
>> 
> Ok then, yes I wanted to count directors counting each person just once, I 
> think that that code will not count much faster.

I expect it will be *much* faster than count(distinct), almost instant.

>> Hopefully now that I've given you the query you actually needed, it now
>> makes sense ;-)
>> 
> Yes it does, but then I don't need the Count column since I want to speed up 
> counting distinct directors but counting each person just once.

Well, you will actually need the Count column if you plan on facilitating 
deletion of directors, since you will need to track when the count of a 
person's director roles drops to zero (so they're no longer counted). In that 
case, you will also need a slightly modified select:

select count(*) from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
and Count > 0
;

or you can change the delete trigger to remove statistics that drop to a zero 
count:

begin immediate
;
drop trigger if exists "Movie People delete"
;
create trigger "Movie People delete"
on "Movie People"
after delete
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
old.Capacity_ID
,   old.People_ID
,   (   select coalesce(Count, 0) - 1 from "Capacity People Statistics"
where Capacity_ID = old.Capacity_ID and People_ID = 
old.People_ID
)
;
delete from "Capacity People Statistics"
where Count = 0
;
end
;
commit
;

>> No, I don't think you need more tables to achieve the above.
>> 
> Yes you can do it with the above but I want to make it very fast, with those
> two additional tables I think it will be much faster.

Great, I'm glad we finally got there :-)

Tom
BareFeetWare

--
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-15 Thread Yuzem


BareFeetWare-2 wrote:
> 
> Does IMDB allow use of their data this way? After my brief reading of
> their site, I thought they charge a $15k minimum per year for data.
> 
IMDb Pro offers additional information and services, all the information I
am grabbing is already available to the public and I am not publishing this
information, there are a lot of programs that do this.


BareFeetWare-2 wrote:
> 
>> The script is written in bash and I can give you the code but I think it
>> would be very complicated to adapt it to your schema.
> 
> Complicated? Why? In what format is the grabbed data (ie what tables,
> columns, rows)? It usually just takes an insert select to move data from
> one table's schema to another.
> 
Well, you would have to replace the insert commands, lets say that I have
table movies with columns id,title,year,plot
In the script I have to set the columns as variables:
id=tt12345
title="movie title"
year=2011
plot="this is the plot"

Ant then I call the command insert:
insert movies

The command insert knows the columns for each table, it goes column by
column, if the column is declared as a variable it is inserted so you would
have to replace all those commands with something like:
echo "INSERT INTO movies" etc...

Let me know if you are interested in the script, the script is included in 
https://launchpad.net/figuritas/+download the application , but I can post
and explain the relevant functions if you want.


BareFeetWare-2 wrote:
> 
> Well, yes, but it depends of the definition of "how many directors there
> are". The above counts how many directors there are in total, counting the
> same person for each movie they direct.
> 
> What you want, I think, however, is how many people there are who are
> directors (counting each person just once, even if they direct multiple
> movies), given by this:
> 
> select count(*) from "Capacity People Statistics"
> where Capacity_ID = (select ID from Capacity where Name = 'director')
> ;
> 
Ok then, yes I wanted to count directors counting each person just once, I
think that that code will not count much faster.


BareFeetWare-2 wrote:
> 
> Hopefully now that I've given you the query you actually needed, it now
> makes sense ;-)
> 
Yes it does, but then I don't need the Count column since I want to speed up
counting distinct directors but counting each person just once.


BareFeetWare-2 wrote:
> 
> No, I don't think you need more tables to achieve the above.
> 
Yes you can do it with the above but I want to make it very fast, with those
two additional tables I think it will be much faster.
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30935871.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-13 Thread Random Coder
On Feb 13, 2011, at 3:38 PM, BareFeetWare  wrote:

> On 13/02/2011, at 1:04 AM, Yuzem wrote:
> 
>> I am grabbing the data from the each movie imdb webpage.
> 
> Does IMDB allow use of their data this way? After my brief reading of their 
> site, I thought they charge a $15k minimum per year for data.

They do allow some limited free use of their data, though I doubt they'll be 
happy with users scrapping their site, I'd imagine they prefer devs download 
the database files they provide:

http://www.imdb.com/interfaces

> 
>>> 
>>> 
>> 
>> 
>> 
>>> 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-13 Thread BareFeetWare
On 13/02/2011, at 1:04 AM, Yuzem wrote:

> I am grabbing the data from the each movie imdb webpage.

Does IMDB allow use of their data this way? After my brief reading of their 
site, I thought they charge a $15k minimum per year for data.

> The script is written in bash and I can give you the code but I think it 
> would be very complicated to adapt it to your schema.

Complicated? Why? In what format is the grabbed data (ie what tables, columns, 
rows)? It usually just takes an insert select to move data from one table's 
schema to another.

>> You count directors like this:
>> 
>> select sum(Count) from "Capacity People Statistics"
>> where Capacity_ID = (select ID from Capacity where Name = 'director')
>> ;
>> 
> Are you sure that this count how many directors there are?

Well, yes, but it depends of the definition of "how many directors there are". 
The above counts how many directors there are in total, counting the same 
person for each movie they direct.

What you want, I think, however, is how many people there are who are directors 
(counting each person just once, even if they direct multiple movies), given by 
this:

select count(*) from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
;

> If I understand it correctly (probably I don't) you have for example table
> "Movie People":
> 1|director|1
> 2|director|1
> 3|director|2
> 
> There are 2 directors and then in "Capacity People Statistics" you should
> have:
> director|1|2
> director|2|1
> 
> If you use the previous code:
> select sum(Count) from "Capacity People Statistics"
> where Capacity_ID = (select ID from Capacity where Name = 'director')
> ;
> 
> I think it will  return 3 but there are only 2 directors.

Correct. That query will tell you that there are three (3) directors in the 
database. But two (2) of those directors are the same person.

To instead get what you want, the number of people who are directors, do this 
(repeat of above SQL):

select count(*) from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
;

which gives your desired answer of 2.

> Sorry, but I still don't understands it because I don't understand that you 
> can count directors that way.

Hopefully now that I've given you the query you actually needed, it now makes 
sense ;-)

>> I expect the above to be about the same speed or faster... than separate 
>> tables, but far more flexible (eg no need to add a table to accommodate a 
>> new capacity), and better normalized.
>> 

> But using different tables provides an instant result, you can try it with
> any table:
> SELECT count(*) from table;

Let me know if you see any noticeable difference in speed.

> In the other hand you are right, it is less flexible.

> Oh wait... I think I got it. I need two more tables:

No, I don't think you need more tables to achieve the above.

Thanks,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-12 Thread Yuzem


BareFeetWare-2 wrote:
> 
> What is your source for the data? If I have that, I can test my schema
> properly, rather than sitting here entering a pile of dummy data. I looked
> at IMDB, but they only seem to have a paid license download.
> 
I am grabbing the data from the each movie imdb webpage. The script is
written in bash and I can give you the code but I think it would be very
complicated to adapt it to your schema.


BareFeetWare-2 wrote:
> 
> No. Sorry, I should have included that. You count directors like this:
> 
> select sum(Count) from "Capacity People Statistics"
> where Capacity_ID = (select ID from Capacity where Name = 'director')
> ;
> 
Are you sure that this count how many directors there are?
If I understand it correctly (probably I don't) you have for example table
"Movie People":
1|director|1
2|director|1
3|director|2

There are 2 directors and then in "Capacity People Statistics" you should
have:
director|1|2
director|2|1

If you use the previous code:
select sum(Count) from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
;

I think it will  return 3 but there are only 2 directors.


BareFeetWare-2 wrote:
> 
>> Another thing: I don't understand the purpose of the Count column in
>> table
>> "Capacity People Statistics"
> 
> It hopefully now makes sense with my correction and example selects above.
> 
Sorry, but I still don't understands it because I don't understand that you
can count directors that way.


BareFeetWare-2 wrote:
> 
> I expect the above to be about the same speed or faster (since part of the
> counting is already done) than separate tables, but far more flexible (eg
> no need to add a table to accommodate a new capacity), and better
> normalized.
> 
But using different tables provides an instant result, you can try it with
any table:
SELECT count(*) from table;

In the other hand you are right, it is less flexible.
A more flexible and faster approach would be a table with total directors,
writers, etc... but that is very complicated because I would have to count
distinct directors, writers, etc... in every insert.

Oh wait... I think I got it. I need two more tables:
Tables:
"Movie People" (movie_id capacity people_id)
"Capacity People" (capacity_id people_id)
"Capacity Count" (capacity count)

On every insert into "Movie People" I trigger an insert into "Capacity
People" and on every insert/delete from "Capacity People" I add/substract 1
from the corresponding capacity in "Capacity Count"

Then I can count directors with:
SELECT count FROM "Capacity Count" WHERE capacity = 'directors';

What do you think? (I didn't test it)


BareFeetWare-2 wrote:
> 
>> Another thing: I don't understand the purpose of the Count column in
>> table
>> "Capacity People Statistics"
> 
> It hopefully now makes sense with my correction and example selects above.
> 
Sorry, but I still don't understands it, of-course, this is because I don't
understand that you can count directors that way.
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30908962.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-11 Thread BareFeetWare
On 11/02/2011, at 11:51 PM, Yuzem wrote:

>> you want the number of each unique (capacity, person). No problem. We'll 
>> just add a People column to the statistics, and change the uniqueness 
>> constraint.

Hmm, seems I didn't quite change the uniqueness constraint. I meant to allow 
multiple capacities and multiple people, but keep each combination of 
(capacity, person) unique. So we need a Capacity_ID column and to use it 
instead of ID when inserting. The schema should therefore be:

begin immediate
;
drop table if exists "Capacity People Statistics"
;
create table "Capacity People Statistics"
(   ID integer primary key not null
,   Capacity_ID integer not null references "Capacity" (ID) on delete 
cascade
,   People_ID integer not null references "People" (ID) on delete cascade
,   Count integer not null
,   unique (Capacity_ID, People_ID)
)
;
insert into "Capacity People Statistics" (Capacity_ID, People_ID, Count)
select Capacity_ID, People_ID, count(*) from "Movie People"
group by Capacity_ID, People_ID having Count > 0
;
drop trigger if exists "Movie People insert"
;
create trigger "Movie People insert"
on "Movie People"
after insert
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
new.Capacity_ID
,   new.People_ID
,   (   select coalesce(Count, 0) + 1 from "Capacity People Statistics"
where Capacity_ID = new.Capacity_ID and People_ID = 
new.People_ID
)
;
end
;
drop trigger if exists "Movie People delete"
;
create trigger "Movie People delete"
on "Movie People"
after delete
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
old.Capacity_ID
,   old.People_ID
,   (   select coalesce(Count, 0) - 1 from "Capacity People Statistics"
where Capacity_ID = old.Capacity_ID and People_ID = 
old.People_ID
)
;
end
;
drop trigger if exists "Movie People update"
;
create trigger "Movie People update"
on "Movie People"
after update of Capacity_ID, People_ID
begin
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
old.Capacity_ID
,   old.People_ID
,   (   select coalesce(Count, 0) - 1 from "Capacity People Statistics"
where Capacity_ID = old.Capacity_ID and People_ID = 
old.People_ID
)
;
insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, 
Count)
select
new.Capacity_ID
,   new.People_ID
,   (   select coalesce(Count, 0) + 1 from "Capacity People Statistics"
where Capacity_ID = new.Capacity_ID and People_ID = 
new.People_ID
)
;
end
;
commit
;

What is your source for the data? If I have that, I can test my schema 
properly, rather than sitting here entering a pile of dummy data. I looked at 
IMDB, but they only seem to have a paid license download.

> That should work, if I understand it correctly now I can count directors with:
> SELECT count(*) FROM "Capacity People Statistics" WHERE ID = 'directors';

No. Sorry, I should have included that. You count directors like this:

select sum(Count) from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
;

or count how many times a particular person is involved in movies:

select sum(Count) from "Capacity People Statistics"
where People_ID = (select ID from People where Name = 'Clint Eastwood')
;

or count the number of times a particular person is a director (notice no sum 
needed):

select Count from "Capacity People Statistics"
where Capacity_ID = (select ID from Capacity where Name = 'director')
and People_ID = (select ID from People where Name = 'Clint Eastwood')
;

> It is similar to using different tables with only the IDs.
> Instead of using a table for directors another for writers, etc... this is
> specified in a column, it is slower but it uses only one table.

I don't think you'll find a noticeable speed difference. It's searching just 
indexed primary key integer columns, which is very fast. It is optimized (with 
auto indexes) for searching by Capacity_ID, People_ID in that order, and 
searching by Capacity_ID alone. If you also do a lot of searches by People_ID 
alone, then add an index:

create index "Capacity People Statistics - People"
on "Capacity People Statistics" (People_ID)
;

> If I use different tables the result is instant, I don't know if it will be 
> much faster to count from "Capacity People Statistics" than counting from
> "Movie People".

I expect the above to be about the same speed or faster (since part of the 
counting is already done) than separate tables, but far more flexible (eg no 
need to add a table to accommodate a new capacity), and better normalized.

> Another thing: I don't understand the purpose of the Count column in table
> "Capacity People Statistics"

It hopefully now makes sense with my correction and example se

Re: [sqlite] Speed up count(distinct col)

2011-02-11 Thread Yuzem


BareFeetWare-2 wrote:
> 
> Oh, I see. I was counting the total number of each unique capacity
> (including director, writers), but you want the number of each unique
> (capacity, person). No problem. We'll just add a People column to the
> statistics, and change the uniqueness constraint.
> 
> Replace my earlier "Capacity Statistics" and triggers with this:
> 

That should work, if I understand it correctly now I can count directors
with:
SELECT count(*) FROM "Capacity People Statistics" WHERE ID = 'directors';
Instead of:
SELECT count(distinct People_ID) FROM "Movie People" WHERE Capacity_ID =
'directors';

It is similar to using different tables with only the IDs.
Instead of using a table for directors another for writers, etc... this is
specified in a column, it is slower but it uses only one table.
If I use different tables the result is instant, I don't know if it will be
much faster to count from "Capacity People Statistics" than counting from
"Movie People".

Another thing: I don't understand the purpose of the Count column in table
"Capacity People Statistics"


Simon Slavin-3 wrote:
> 
> So if I deleted one record and created another you wouldn't spot it ?
> 
Yes, any change to the database will result in a complete refresh of the
cache.


Simon Slavin-3 wrote:
> 
> Your process is trustworthy only when you are certain that the database
> file is not currently open.  If there's a chance that some application may
> be modifying the file when you check these things then the results you get
> may not be up-to-date. 
> 
Yes but what can I do about it, the same happens if I open a text document
that it is already open in another text editor.
I could check if there is a journal file but I don't know if it is necessary
since I am not keeping any connection open, I try to make the connections as
short as possible and if I get some data that it isn't up to date I don't
see much problem.

Something that I just realized, comparing INTEGER columns doesn't seems
faster than comparing TEXT columns, this:
SELECT count(distinct ROWID) FROM keywords;
Result: 83513
Is slower than this:
SELECT count(distinct keywords) FROM keywords;
Result: 17321
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30900999.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Simon Slavin

On 11 Feb 2011, at 2:37am, Yuzem wrote:

> Simon Slavin-3 wrote:
>> 
> 
>> By looking at the file on disk ?  Are you taking into account the journal
>> file ? 
>> 
> 
> Yes, I do all the counts

So if I deleted one record and created another you wouldn't spot it ?

> and save the data to a file and then if the file is
> newer then the database I use the file else I count again.

Your process is trustworthy only when you are certain that the database file is 
not currently open.  If there's a chance that some application may be modifying 
the file when you check these things then the results you get may not be 
up-to-date.

The only way to correctly tell if the data in the database has changed is to 
use SQLite calls.  Also, your operating system does not update the file 
attributes every time the contents of the file changes.  It will often wait for 
a change of sector or a buffer to be flushed.

> No, I am not taking the journal file into account, I don't know what you
> mean by that.

SQLite uses a journal file to temporarily store change to the database.  You 
will find this journal file appearing in the same directory as the database 
file if you have the database open and have made changes.  These changes will 
eventually be saved in the database file -- definitely when you close all the 
SQLite handles to that file -- but until then you have no way of knowing what 
data SQLite considers is actually in the database apart from using SQLite calls.

Read here, especially sections 2.1 to 2.3, about the temporary files SQLite 
makes:

http://www.sqlite.org/tempfiles.html

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
On 11/02/2011, at 1:37 PM, Yuzem wrote:

> For example lets say I have the following data:
> 1|director|1
> 2|director|2
> 3|director|1
> 
> In this example the total count for directors is 2, I have two distinct 
> directors.
> In the table "Capacity Statistics" I will have:
> director|2
> 
> The triggers you made add/subtract 1 from "Capacity Statistics" on
> insert/delete on "movies people"
> What happens if I add the following to "movies people"?
> 4|director|2
> 
> The trigger should add 1 to "Capacity Statistics":
> director|3
> 
> But there are still 2 directors:
> 1|director|1
> 2|director|2
> 3|director|1
> 4|director|2

Oh, I see. I was counting the total number of each unique capacity (including 
director, writers), but you want the number of each unique (capacity, person). 
No problem. We'll just add a People column to the statistics, and change the 
uniqueness constraint.

Replace my earlier "Capacity Statistics" and triggers with this:

begin immediate
;
create table "Capacity People Statistics"
(   ID integer primary key references "Capacity" (ID) on delete cascade
,   People_ID integer not null references "People" (ID) on delete cascade
,   Count integer not null
,   unique (ID, People_ID)
)
;
insert into "Capacity People Statistics" (ID, People_ID, Count)
select Capacity_ID, People_ID, count(*) from "Movie People"
group by Capacity_ID, People_ID having Count > 0
;
create trigger "Movie People insert"
on "Movie People"
after insert
begin
insert or replace into "Capacity People Statistics" (ID, People_ID, Count)
select
new.Capacity_ID
,   new.People_ID
,   (   select coalesce(Count, 0) + 1 from "Capacity People Statistics"
where ID = new.Capacity_ID and People_ID = new.People_ID
)
;
end
;
create trigger "Movie People delete"
on "Movie People"
after delete
begin
insert or replace into "Capacity People Statistics" (ID, People_ID, Count)
select
old.Capacity_ID
,   old.People_ID
,   (   select coalesce(Count, 0) - 1 from "Capacity People Statistics"
where ID = old.Capacity_ID and People_ID = old.People_ID
)
;
end
;
create trigger "Movie People update"
on "Movie People"
after update of Capacity_ID, People_ID
begin
insert or replace into "Capacity People Statistics" (ID, People_ID, Count)
select
old.Capacity_ID
,   old.People_ID
,   (   select coalesce(Count, 0) - 1 from "Capacity People Statistics"
where ID = old.Capacity_ID and People_ID = old.People_ID
)
;
insert or replace into "Capacity People Statistics" (ID, People_ID, Count)
select
new.Capacity_ID
,   new.People_ID
,   (   select coalesce(Count, 0) + 1 from "Capacity People Statistics"
where ID = new.Capacity_ID and People_ID = new.People_ID
)
;
end
;
commit
;


Tom
BareFeetWare

--
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Yuzem


Simon Slavin-3 wrote:
> 
> By looking at the file on disk ?  Are you taking into account the journal
> file ? 
> 
Yes, I do all the counts and save the data to a file and then if the file is
newer then the database I use the file else I count again.
No, I am not taking the journal file into account, I don't know what you
mean by that.


BareFeetWare-2 wrote:
> 
> Does tis answer your needs? If not, please explain further, but it will
> probably only require modifying a constraint in the schema I proposed,
> rather than denormalizing or partitioning. 
> 
I will try to explain it.
I have the table "movies people" with columns "movie_ID, capacity_ID,
people_ID"
I understand that all rows are unique but people_ID isn't unique.
For example lets say I have the following data:
1|director|1
2|director|2
3|director|1

In this example the total count for directors is 2, I have two distinct
directors.
In the table "Capacity Statistics" I will have:
director|2

The triggers you made add/subtract 1 from "Capacity Statistics" on
insert/delete on "movies people"
What happens if I add the following to "movies people"?
4|director|2

The trigger should add 1 to "Capacity Statistics":
director|3

But there are still 2 directors:
1|director|1
2|director|2
3|director|1
4|director|2


BareFeetWare-2 wrote:
> 
> Erm ... there are also movies which have more than one person directing. 
> You need to copy across the rowid from the MoviePeople table, and delete
> just based on that particular record. 
> 
Yes, of-course.
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30898156.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
On 11/02/2011, at 11:30 AM, Simon Slavin wrote:

> Come to think of it, what's really happening here is a FOREIGN KEY situation.

Yes, that's why I have the foreign keys (ie "references") in the schema. So, 
for instance, if you delete a movie, all of the actors, directors etc 
associated with that movie are automatically deleted, but the same people are 
kept for other movies.

SQL takes care of all the nitty gritty stuff, so we don't have to reinvent the 
wheel in application code.

Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Simon Slavin

On 11 Feb 2011, at 12:26am, BareFeetWare wrote:

> On 11/02/2011, at 11:11 AM, Simon Slavin wrote:
> 
>> Erm ... there are also movies which have more than one person directing.  
>> You need to copy across the rowid from the MoviePeople table, and delete 
>> just based on that particular record.
> 
> Adding to what Simon said:
> 
> The schema I posted allows for multiple directors for each movie, the same 
> person having multiple capacities in the same movie (eg writer, director and 
> actor) etc. Any changes to the model and what's allowed (ie constraints) 
> should be done there and not in your application layer, IMNSHO.
> 
> If you want to delete or insert, based on text (eg person's name), I suggest 
> NOT getting your application to get the matching rowid, then reinjected it 
> into a second SQL call. Do it all in one SQL transaction

Come to think of it, what's really happening here is a FOREIGN KEY situation.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
On 11/02/2011, at 11:11 AM, Simon Slavin wrote:

> Erm ... there are also movies which have more than one person directing.  You 
> need to copy across the rowid from the MoviePeople table, and delete just 
> based on that particular record.

Adding to what Simon said:

The schema I posted allows for multiple directors for each movie, the same 
person having multiple capacities in the same movie (eg writer, director and 
actor) etc. Any changes to the model and what's allowed (ie constraints) should 
be done there and not in your application layer, IMNSHO.

If you want to delete or insert, based on text (eg person's name), I suggest 
NOT getting your application to get the matching rowid, then reinjected it into 
a second SQL call. Do it all in one SQL transaction, such as:

delete from "Movie People"
where Movie_ID = (select Movie_ID from Movies where Title = 'Back to the 
Future')
and People_ID = (select ID from People where Name = 'Eric Stoltz')
and Capacity_ID = (select ID from Capacity where Name = 'actor')

or:

update "Movie People"
set People_ID = (select ID from People where Name = 'Michael J Fox')
where Movie_ID = (select Movie_ID from Movies where Title = 'Back to the 
Future')
and People_ID = (select ID from People where Name = 'Eric Stoltz')
and Capacity_ID = (select ID from Capacity where Name = 'actor')

By the way, how are you getting the raw data? What URL or RSS feed or whatever?

Thanks,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Simon Slavin

On 10 Feb 2011, at 11:55pm, Yuzem wrote:

> That's the beauty of using a additional table. I make the column unique in
> the the extra table and then on any insert in "movies people" I insert in
> for example the directors table and any duplicate will be automatically
> rejected.
> 
> The same problem exist on delete, the deleted entry may not be unique.
> 
> For example:
> movie1|director|people1
> movie2|director|people1

Erm ... there are also movies which have more than one person directing.  You 
need to copy across the rowid from the MoviePeople table, and delete just based 
on that particular record.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
> BareFeetWare-2 wrote:
>> 
>> In that case, you should cache the counts in a separate table or two. That
>> has a negligible overhead when you add a movie (which is infrequent), and
>> basically no overhead when viewing (which is frequent).
> I am doing that but in the application level, the down side is that I am
> updating the cache on any change, I check the modified time of the database.

I strongly suggest doing as much as you can at the SQL level, especially where 
it concerns the integrity of your data. You should find it faster, more 
reliable, self contained and logical.

> BareFeetWare-2 wrote:
>> 
>> Do that with triggers (which are a good thing in this context), to change
>> the relevant count when an insert, delete or update occurs in the "Movie
>> People" table. Something like:
>> 
> I think, not sure, that there will be a problem with that code.
> I am not an expert so I had to stare that code some time to understand it.
> If I am correct you are adding and subtracting 1 on every insert or delete.

Yes, and doing a delete/insert combo for an update (ie if a person's capacity 
in a movie changes).

> The problem is that an insert may not be unique so before augmenting the
> counter you have to check if it is unique and therefore you would have to
> run a count(distinct col) on every insert which would be overkill.

If I understand you correctly, the non-unique criteria is already handled by 
the schema design. The "Movie People" table allows the same person to be listed 
as the director for multiple movies. The constraints on the "Movie People" only 
require no nulls and unique combinations of Movie, Capacity, People (ie the 
same person can't be listed as director for the same movie twice, but can be 
listed as director for two movies). Because it uses the "unique" constraint for 
this, SQLite automatically builds an index "Movie People"(Movie_ID, 
Capacity_ID, People_ID), so that whenever you try to insert a new row, it 
quickly checks if it already exists. You can throw new non-unique rows at it 
with "insert or ignore" if you just want it to ignore duplicates, which then 
won't trigger the count increase because nothing was inserted. Or use plain old 
"insert" if you want to be alerted to any attempted unique violations.

Does tis answer your needs? If not, please explain further, but it will 
probably only require modifying a constraint in the schema I proposed, rather 
than denormalizing or partitioning.

Thanks,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Simon Slavin

On 10 Feb 2011, at 11:55pm, Yuzem wrote:

> I check the modified time of the database.

By looking at the file on disk ?  Are you taking into account the journal file ?

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Yuzem


BareFeetWare-2 wrote:
> 
> In that case, you should cache the counts in a separate table or two. That
> has a negligible overhead when you add a movie (which is infrequent), and
> basically no overhead when viewing (which is frequent).
I am doing that but in the application level, the down side is that I am
updating the cache on any change, I check the modified time of the database.


BareFeetWare-2 wrote:
> 
> Do that with triggers (which are a good thing in this context), to change
> the relevant count when an insert, delete or update occurs in the "Movie
> People" table. Something like:
> 
I think, not sure, that there will be a problem with that code.
I am not an expert so I had to stare that code some time to understand it.
If I am correct you are adding and subtracting 1 on every insert or delete.

The problem is that an insert may not be unique so before augmenting the
counter you have to check if it is unique and therefore you would have to
run a count(distinct col) on every insert which would be overkill.

That's the beauty of using a additional table. I make the column unique in
the the extra table and then on any insert in "movies people" I insert in
for example the directors table and any duplicate will be automatically
rejected.

The same problem exist on delete, the deleted entry may not be unique.

For example:
movie1|director|people1
movie2|director|people1

-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30897526.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
Oops, I should have said old instead of new in a couple of places:

> begin immediate
> ;
> create table "Capacity Statistics"
> ( ID integer primary key unique references "Capacity" (ID) on delete 
> cascade
> , Count integer not null
> )
> ;
> insert into "Capacity Statistics" (ID, Count) select Capacity_ID, count(*) 
> from "Movie People" group by Capacity_ID having Count > 0
> ;
> create trigger "Movie People insert"
> on "Movie People"
> after insert
> begin
> insert or replace into "Capacity Statistics" (ID, Count)
> select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity 
> Statistics" where ID = new.Capacity_ID)
> ;
> end
> ;
> create trigger "Movie People delete"
> on "Movie People"
> after insert
> begin
> insert or replace into "Capacity Statistics" (ID, Count)
> select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity 
> Statistics" where ID = old.Capacity_ID)

Should be:

select old.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity 
Statistics" where ID = old.Capacity_ID)

> ;
> end
> ;
> create trigger "Movie People update"
> on "Movie People"
> after update of Capacity_ID
> begin
> insert or replace into "Capacity Statistics" (ID, Count)
> select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity 
> Statistics" where ID = new.Capacity_ID)
> ;
> insert or replace into "Capacity Statistics" (ID, Count)
> select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity 
> Statistics" where ID = old.Capacity_ID)

Should be:

select old.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity 
Statistics" where ID = old.Capacity_ID)

> ;
> end
> ;
> commit
> ;


Thanks,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
On 11/02/2011, at 9:40 AM, Yuzem wrote:

> Yes, in my application I have in the sidebar all those sections (movies,
> years, tags, keywords, actors, directors, writers, etc...) and I count each
> one, how many movies, how many years, etc...
> It isn't very slow if I update only one item but the problem is when I have
> to update the entire list, this uses a lot of cpu and takes some time and I
> have to do it every time a movie is added.
> http://3.bp.blogspot.com/_EbNPUgfUDXs/TIpif4U1III/A2A/YFSGOAdpOGA/s1600/nested-brosers.jpg
> Here there is an old screenshot from the sidebar.

In that case, you should cache the counts in a separate table or two. That has 
a negligible overhead when you add a movie (which is infrequent), and basically 
no overhead when viewing (which is frequent).

Do that with triggers (which are a good thing in this context), to change the 
relevant count when an insert, delete or update occurs in the "Movie People" 
table. Something like:

begin immediate
;
create table "Capacity Statistics"
(   ID integer primary key unique references "Capacity" (ID) on delete 
cascade
,   Count integer not null
)
;
insert into "Capacity Statistics" (ID, Count) select Capacity_ID, count(*) from 
"Movie People" group by Capacity_ID having Count > 0
;
create trigger "Movie People insert"
on "Movie People"
after insert
begin
insert or replace into "Capacity Statistics" (ID, Count)
select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity 
Statistics" where ID = new.Capacity_ID)
;
end
;
create trigger "Movie People delete"
on "Movie People"
after insert
begin
insert or replace into "Capacity Statistics" (ID, Count)
select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity 
Statistics" where ID = old.Capacity_ID)
;
end
;
create trigger "Movie People update"
on "Movie People"
after update of Capacity_ID
begin
insert or replace into "Capacity Statistics" (ID, Count)
select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity 
Statistics" where ID = new.Capacity_ID)
;
insert or replace into "Capacity Statistics" (ID, Count)
select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity 
Statistics" where ID = old.Capacity_ID)
;
end
;
commit
;

By contrast, I suggest it would be a bad idea to denormalize and partition your 
data (ie separate directors and writers tables) just for the sake of tracking 
the count. The solution above maintains a normalized database, gives you the 
flexibility to add or remove Capacities in data rather than having to change 
the schema, and better reflects the real data model.


Thanks,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Yuzem


Simon Slavin-3 wrote:
> 
> How much slower.  Did you make an index SQLite could use for that query ?
> 
Using distinct isn't slow, it is what I would expect but count(*) is
incredibly fast, it is instantaneous no matter how large is the table.
Yes, I tried with an index.


BareFeetWare-2 wrote:
> 
> Or you can try this:
> 
> select count(*) from People
> where ID in
> ( select People_ID from "Movie People" where Capacity_ID =
> (select ID from Capacity where Name = 'director')
> );
> 
> But I expect you'll get the same performance. 
> 
Yes, almost the same performance.


BareFeetWare-2 wrote:
> 
> You don't want to denormalize and have separate director and writer tables
> etc. That will get ugly. 
> 
That's what I thought.


Petite Abeille-2 wrote:
> 
> Right... looking at something like the casting information in IMdb (actor,
> director, writer, etc), there are about 25M movie + role + person
> combinations. Which, while not huge, starts to get taxing when queried on
> a lowly laptop.
> 
> For example, out of these 25M rows, about 1.2M represent directors (~217K)
> in movies (~1M). 
> 

Yes, in my application I have in the sidebar all those sections (movies,
years, tags, keywords, actors, directors, writers, etc...) and I count each
one, how many movies, how many years, etc...
It isn't very slow if I update only one item but the problem is when I have
to update the entire list, this uses a lot of cpu and takes some time and I
have to do it every time a movie is added.
http://3.bp.blogspot.com/_EbNPUgfUDXs/TIpif4U1III/A2A/YFSGOAdpOGA/s1600/nested-brosers.jpg
Here there is an old screenshot from the sidebar. 


Petite Abeille-2 wrote:
> 
> Assuming a movie_cast table [1] and a cast dimension [2], you could record
> the distinct count for each cast in the dimension once, and store it. That
> way you don't have to recompute it over and over.
> 
Yes, I am caching all counts in the application level but I have to update
it every time the database changes so I have to update it at least every
time I add a movie.


Petite Abeille-2 wrote:
> 
> Alternatively, you could indeed partition that bulky movie_cast table by
> its cast type, which will in effect reduce the search space to at most ~9M
> (movies by actors, ~1.3M movies, ~1.3M actors). 
> 
> That said, even partitioning will not help you much here as you still have
> to search through ~9M records to figure out how many distinct actors you
> have.
> 
Actually, I already have different tables for directors, writers, etc...


Petite Abeille-2 wrote:
> 
> So... perhaps best to cheat :)
> 
> One way to cheat is to precompute the answer by, for example, adding a
> is_* flag on your person table:
> 
> update person set is_actor = ( select count( distinct person_id ) from
> movie_actor where movie_actor.person_id = person.id );
> 
> The person table is much smaller (~3M) and easier to query in bulk:
> 
> explain query plan
> select count( * ) from person where is_actor = 1;
> 0|0|0|SEARCH TABLE person USING COVERING INDEX person_is_actor
> (is_actor=?) (~1596808 rows)
> 
> 1310119
> CPU Time: user 0.256286 sys 0.000257
> 
> There you go :)
> 
That's a good idea, I wonder if it isn't better to use a different table to
store only the ids for the directors, etc, as I said before.
I would end up with these tables:
movies 
people
movies_directors (movieId, peopleId)
movies_writers (movieId, peopleId)
directors (peopleId)
writers (peopleId)

Then I can count directors with:
SELECT count(*) FROM directors;


Petite Abeille-2 wrote:
> 
> This is where I wish SQLite could have bitmap indices:
> http://en.wikipedia.org/wiki/Bitmap_index
> 
That seems what I need.
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30897078.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Petite Abeille

On Feb 10, 2011, at 5:00 PM, Yuzem wrote:

> The only thing I can think of is to have additional tables for the ids of
> all directors, all writers, etc...
> Tables "movies", "people", "capacity" and then tables "directors",
> "writers", etc.. with only the IDs so I can count using count(*) which is
> super fast.

Right... looking at something like the casting information in IMdb (actor, 
director, writer, etc), there are about 25M movie + role + person combinations. 
Which, while not huge, starts to get taxing when queried on a lowly laptop.

For example, out of these 25M rows, about 1.2M represent directors (~217K) in 
movies (~1M). 

Assuming a movie_cast table [1] and a cast dimension [2], you could record the 
distinct count for each cast in the dimension once, and store it. That way you 
don't have to recompute it over and over.

Alternatively, you could indeed partition that bulky movie_cast table by its 
cast type, which will in effect reduce the search space to at most ~9M (movies 
by actors, ~1.3M movies, ~1.3M actors). 

That said, even partitioning will not help you much here as you still have to 
search through ~9M records to figure out how many distinct actors you have.

So... perhaps best to cheat :)

One way to cheat is to precompute the answer by, for example, adding a is_* 
flag on your person table:

update person set is_actor = ( select count( distinct person_id ) from 
movie_actor where movie_actor.person_id = person.id );

The person table is much smaller (~3M) and easier to query in bulk:

explain query plan
select count( * ) from person where is_actor = 1;
0|0|0|SEARCH TABLE person USING COVERING INDEX person_is_actor (is_actor=?) 
(~1596808 rows)

1310119
CPU Time: user 0.256286 sys 0.000257

There you go :)

This is where I wish SQLite could have bitmap indices:

http://en.wikipedia.org/wiki/Bitmap_index

Oh, well...

[1] http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L161
[2] http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L146




___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
> Thanks you but what I want to do is to count without using count(distinct 
> col) because it is much slower than count(*).

I think you'll find the schema I posted very fast, since it's running 
everything, including distinct, on primary key columns.

Or you can try this:

select count(*) from People
where ID in
(   select People_ID from "Movie People" where Capacity_ID =
(select ID from Capacity where Name = 'director')
);

But I expect you'll get the same performance.

You don't want to denormalize and have separate director and writer tables etc. 
That will get ugly.

Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Simon Slavin

On 10 Feb 2011, at 4:00pm, Yuzem wrote:

> Thanks you but what I want to do is to count without using count(distinct
> col) because it is much slower than count(*).

How much slower.  Did you make an index SQLite could use for that query ?  A 
good index for that might be

People_ID,Capacity_ID

but it might be faster with those two the other way around.  Try them both out 
and see which gives you faster results.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread Yuzem


BareFeetWare-2 wrote:
> 
> Then you can count the directors like this:
> 
> select count(distinct People_ID) from "Movie People" join Capacity on
> "Movie People".Capacity_ID = Capacity.ID where Capacity.Name = 'director';
> 
> or:
> 
> select count(distinct People_ID) from "Movie People" where Capacity_ID =
> (select ID from Capacity where Name = 'director');
> 
> or you can create a view [...]
> 

Thanks you but what I want to do is to count without using count(distinct
col) because it is much slower than count(*).
In the previous examples about tags I can do it, I have "movies" and "tags"
related by "movies_tags"
and I can do:
SELECT count(*) FROM tags;
I want to know if there is any standard approach to do that.
The only thing I can think of is to have additional tables for the ids of
all directors, all writers, etc...
Tables "movies", "people", "capacity" and then tables "directors",
"writers", etc.. with only the IDs so I can count using count(*) which is
super fast.
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30893712.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-10 Thread BareFeetWare
On 08/02/2011, at 10:19 AM, Yuzem wrote:

> I have the tables "movies" and "people"
> Those two tables are related by tables "directors", "writers", etc...
> movies: id, title
> people: id, name
> 
> directors: movieId, peopleId
> writers: movieId, peopleId
> etc...
> 
> How can I normalize that so I can count directors and writers using
> count(*)?
> Should I create another table for "directors" and another for "writers",
> etc... with only the ids?
> I will end up having:
> movies > movies_directors < (directors people)

After implementing the schema in my previous post, add this:

begin immediate;
create table People
(   ID integer primary key not null
,   Name text collate no case not null unique
);
create table Capacity
(   ID integer primary key not null
,   Name text collate no case not null unique
);
create table "Movie People"
(   ID integer primary key not null
,   Movie_ID integer not null references Movies (Movie_ID) on delete cascade
,   Capacity_ID integer not null references Capacity (ID) on delete cascade
,   People_ID integer not null references People (ID) on delete cascade
,   unique (Movie_ID, Capacity_ID, People_ID)
);
commit;

Then you can count the directors like this:

select count(distinct People_ID) from "Movie People" join Capacity on "Movie 
People".Capacity_ID = Capacity.ID where Capacity.Name = 'director';

or:

select count(distinct People_ID) from "Movie People" where Capacity_ID = 
(select ID from Capacity where Name = 'director');

or you can create a view to alphabetically list each director with a CSV list 
of their movies, like this:

create view "Directors"
as
select People.Name, group_concat (Movies.Title, ', ')
from "Movie People"
join Movies on "Movie People".Movie_ID = Movies.Movie_ID
join Capacity on "Movie People".Capacity_ID = Capacity.ID
join People on "Movie People".People_ID = People.ID
where Capacity.Name = 'director';
group by People_ID
order by People.Name
;

Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-08 Thread BareFeetWare
You have:

CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));
CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag));

You can solve your problem, using pure SQL. No need to resort to the 
application layer. Just execute the SQL transaction below. It takes care of 
everything, including normalizing, assigning integer IDs to unique tags, 
filtering our duplicates and auto indexes. I haven't tested on your tables 
specifically, but I do similar all the time. Any errors should be simple name 
errors.

begin immediate;
create temp table "Movies Old" as select * from Movies;
create temp table "Tags Old" from Tags;
drop table Movies;
drop table Tags;
create table Movies
(   Movie_ID integer primary key not null
,   Title text collate no case not null
);
create table Tags
(   Tag_ID integer primary key not null
,   Name text not null unique collate nocase
);
create table Movie_Tags
(   ID integer primary key not null
,   Movie_ID integer not null
references Movies (Movie_ID) on delete cascade
,   Tag_ID integer not null
references Tags (Tag_ID) on delete cascade
,   unique (Movie_ID, Tag_ID)
);
insert into Movies (Movie_ID, Title) select Movie_ID, Title from "Movies Old";

insert or ignore into Tags (Name) select distinct Tag from "Tags Old";

insert into Movie_Tags (Movie_ID, Tag_ID)
select distinct Movie_ID, (select Tag_ID from Tags where Name = "Tags Old".Tag)
from "Tags Old";

drop table "Movies Old";
drop table "Tags Old";

commit or rollback;

Note that the last line "commit or rollback" is not an actual SQLite command. 
You will need to manually decide at that point whether to issue a commit or 
rollback. If there are any errors in previous lines, use rollback. 
Unfortunately, SQLite does not automatically rollback all commands in a 
transaction if one of the commands fails (specifically create and drop 
commands).

Then you should be able to count tags very fast by:

select count(*) from Tags;

In order to make foreign keys work, you need to have SQLite version 3.6.19 or 
later, and use this pragma when you open a connection to SQLite (ie before any 
commands that require use of foreign keys):

pragma foreign_keys = yes;

A normalized database using integer keys is fast, small and elegant. Viewing 
integers by users is ugly. So for viewing the Movie-Tag combinations (and data 
entry), you might want to create a view to make it user friendly. It depends on 
your user interface. Reply here if you need that. Just something like this:

create view "Movie Tags View"
as select
Movie_Tags.ID as ID
,   Movies.Title as "Movie"
,   Tags.Name as "Tag"
from Movie_Tags
join Movies on Movie_Tags.Movie_ID = Movies.Movie_ID
join Tags on Movie_Tags.Tag_ID = Tags.Tag_ID
;

Hope this helps,
Tom
BareFeetWare

 --
Comparison of SQLite GUI tools:
http://www.barefeetware.com/sqlite/compare/?ml
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-07 Thread Petite Abeille

On Feb 7, 2011, at 9:32 PM, Yuzem wrote:

>> Hey... sounds like IMdb :P
>> 
> Yes, I'm coding a  http://yuzem.blogspot.com/p/figuritas-screenshots.html
> movie manager  that grabs the info from imdb.

A fine hobby :)

> I thought sqlite didn't handle foreign keys correctly,

I thought wrong :)

http://www.sqlite.org/foreignkeys.html

> I am using triggers
> to automatically delete tags when a movie gets deleted.

Don't bother.

> If I use a foreign key will it automatically remove the tag if the movie
> gets deleted?

See above.

> Anyway, to use integers in the "tags" table is very complicated

Nah... it's called normalization... it's a good thing :)

http://en.wikipedia.org/wiki/Database_normalization

> because I will have to assign the corresponding number to each tag that I 
> insert

Yep. It's a good thing.


> and I
> have to insert lots of keywords for every movie.

IMdb contains ~3.5M movie-keyword pairs (~405K distinct movies, 107K distinct 
keywords). Not much to write home about.


> Does sqlite has any function to convert a text string into an unique number?

You do it the other way around: create a unique entry in your tag table, then 
use the tag id in the movie_tag table that join movies to tags. Think of it as 
compression: instead of storing, say, the literal "accidental-cannibalism" 
multiple times in your movie_tag, you store the literal only once in the tag 
table, than use the literal id in the join table. And they lived happily ever 
after third normal form :))


 
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-07 Thread Josh Marell
I would create a tagList table (integer tagID, string tagName,
unique(tagName))

Before performing your batch of inserts, query the tagList table (integer
tagID, string tagName), and generate a map (key on tagName, value of tagID).
For each tag you want to insert, see if it exists in the map.  If it
doesn't, insert first into tagList table and get the tagID of that new entry
and update your map.  Perform your insert with the tagID instead now.  If it
does exist, use the value of the tagName key in your map.

I chose a map (or some similar implementation), because the lookup should be
quick, and you won't need to query the DB for each new tag during your batch
of inserts.

On Mon, Feb 7, 2011 at 2:32 PM, Yuzem  wrote:

>
>
> Petite Abeille-2 wrote:
> >
> > Hey... sounds like IMdb :P
> >
> Yes, I'm coding a  http://yuzem.blogspot.com/p/figuritas-screenshots.html
> movie manager  that grabs the info from imdb.
>
>
> Petite Abeille-2 wrote:
> >
> > In any case, as mentioned by Michael Black, you might benefit greatly by
> > normalizing your table and indexing the foreign key:
> >
> > http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401
> >
>
> I thought sqlite didn't handle foreign keys correctly, I am using triggers
> to automatically delete tags when a movie gets deleted.
> If I use a foreign key will it automatically remove the tag if the movie
> gets deleted?
> Anyway, to use integers in the "tags" table is very complicated because I
> will have to assign the corresponding number to each tag that I insert and
> I
> have to insert lots of keywords for every movie.
> Does sqlite has any function to convert a text string into an unique
> number?
>
> --
> View this message in context:
> http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867411.html
> Sent from the SQLite mailing list archive at Nabble.com.
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-07 Thread Eric Smith
Yuzem wrote: 

> I thought sqlite didn't handle foreign keys correctly, I am using 
> triggers to automatically delete tags when a movie gets deleted.  

There's a new thing, 'PRAGMA foreign_keys=1', that causes SQLite to 
enforce them.  Check out the docs on that.  

> Anyway, to use integers in the "tags" table is very complicated because 
> I will have to assign the corresponding number to each tag that I insert 
> and I have to insert lots of keywords for every movie.  

Well, technically you can have SQLite do that assignment for you.  But
yes, you'll have to deal with that mapping somewhere.  You can abstract
that away at a very low level in your infrastructure though.

It's not so terrible: again it's something like (in one high level 
language)

set tagId [db one {SELECT tagId FROM tagNames WHERE tagName=$tagName}]
db eval {INSERT INTO tags VALUES($movieId, $tagId)}

That one SELECT lookup will probably save you tons of string comparisons
later on.  Depends of course on the nature of your data set and query
stats.

> Does sqlite has any function to convert a text string into an unique 
> number?  

If you run (again correcting for your preferred language)

db eval {INSERT INTO tagNames(tagName) VALUES($tagName)}

then SQLite will generate the tagId column value for you if the column
is declared INTEGER PRIMARY KEY.  Again, you can get it back out at some
point later using SELECT as above, or if it's needed immediately (which
it probably is) you can use last_insert_rowid. 

So, again in Tcl, your whole insertion path will probably look something
like:

db eval {INSERT OR IGNORE INTO tagNames(tagName) VALUES($tagName)}
if {[db changes]} {
  set tagId [db last_insert_rowid]
} else {
  set tagId [db one {SELECT tagId FROM tagNames WHERE tagName=$tagName}]
}
assert {$tagId!=""}
db eval {INSERT OR IGNORE INTO tags VALUES($movieId, $tagId)}

That could possibly be optimized further.  But you could give it a go as
a first pass and see how far it takes you.  It's not so complicated, and
in 7 lines (plus 1 sanity-check line) we've abstracted the ID<->text 
mapping.

Eric

-- 
Eric A. Smith

Carperpetuation (kar' pur pet u a shun), n.:
The act, when vacuuming, of running over a string at least a
dozen times, reaching over and picking it up, examining it, then
putting it back down to give the vacuum one more chance.
-- Rich Hall, "Sniglets"
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-07 Thread Yuzem


Petite Abeille-2 wrote:
> 
> Hey... sounds like IMdb :P
> 
Yes, I'm coding a  http://yuzem.blogspot.com/p/figuritas-screenshots.html
movie manager  that grabs the info from imdb.


Petite Abeille-2 wrote:
> 
> In any case, as mentioned by Michael Black, you might benefit greatly by
> normalizing your table and indexing the foreign key: 
> 
> http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401
> 

I thought sqlite didn't handle foreign keys correctly, I am using triggers
to automatically delete tags when a movie gets deleted.
If I use a foreign key will it automatically remove the tag if the movie
gets deleted?
Anyway, to use integers in the "tags" table is very complicated because I
will have to assign the corresponding number to each tag that I insert and I
have to insert lots of keywords for every movie.
Does sqlite has any function to convert a text string into an unique number?

-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867411.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Speed up count(distinct col)

2011-02-07 Thread Petite Abeille

On Feb 7, 2011, at 4:26 PM, Yuzem wrote:

> 
> I have many movies by tag and many tables like "tags" (keywords, countries,
> languages, genres, etc..)

Hey... sounds like IMdb :P

> This solution implies one additional table and two triggers by each table.

To paraphrase:

Some people, when confronted with a problem, think “I know, I'll use triggers.” 
  Now they have two problems.

In any case, as mentioned by Michael Black, you might benefit greatly by 
normalizing your table and indexing the foreign key: 

http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401

Don't bother with the triggers.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Speed up count(distinct col)

2011-02-07 Thread Yuzem

Hi.
I have this tables:
CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id));
CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag));

I have many movies by tag and many tables like "tags" (keywords, countries,
languages, genres, etc..)
I can count different movies very fast with:
SELECT count(*) FROM movies;

But the other tables are much slower depending on the size of the table:
SELECT count(distinct tag) FROM tags;

My solution is to create an additional table "count_tags" and then every
time a tag is added to table "tags" a trigger adds the tag to "count_tags",
I need also another trigger to remove the tag, then I can do:
SELECT count(*) FROM count_tags;

This solution implies one additional table and two triggers by each table.

I wanted to know if there is a simpler solution, maybe by using indexes.
Thanks in advance.
-- 
View this message in context: 
http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30864622.html
Sent from the SQLite mailing list archive at Nabble.com.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users