Re: [sqlite] Speed up count(distinct col)
BareFeetWare-2 wrote: > > Oh, I see, so the "complication" is that you would have to change the > columns used in your bash script. That shouldn't be a big problem, but > I'll leave the bash script to you. > It isn't a big problem, the "complication" was to adapt all the tables and inserts and column declarations only for testing. BareFeetWare-2 wrote: > > If for some reason rewriting the insert command in the bash script is > insurmountable, you can just create a view in SQL to match the > expectations of the bash script. That view can funnel each insert to the > underlying SQL schema table columns using an "instead of insert" trigger. > Let me know if you need more info on this. > That's not necessary, I will adapt your code to my database schema. BareFeetWare-2 wrote: > > or you can change the delete trigger to remove statistics that drop to a > zero count: > > begin immediate > ; > drop trigger if exists "Movie People delete" > ; > create trigger "Movie People delete" > on "Movie People" > after delete > begin > insert or replace into "Capacity People Statistics" (Capacity_ID, > People_ID, Count) > select > old.Capacity_ID > , old.People_ID > , ( select coalesce(Count, 0) - 1 from "Capacity People Statistics" > where Capacity_ID = old.Capacity_ID and People_ID = > old.People_ID > ) > ; > delete from "Capacity People Statistics" > where Count = 0 > ; > end > ; > commit > ; > Oh, I see, that makes a lot of sense, that was one of the problem I had, this way I can delete directors without using a distinct, great! BareFeetWare-2 wrote: > > Great, I'm glad we finally got there :-) > Thank you very much for all your help, I will try to adapt the code to my schema, I will let you know if I meet any problem. Thanks again! -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30954516.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
> IMDb Pro offers additional information and services, all the information I > am grabbing is already available to the public and I am not publishing this > information, there are a lot of programs that do this. OK, good to know, thanks. >>> The script is written in bash and I can give you the code but I think it >>> would be very complicated to adapt it to your schema. >> >> Complicated? Why? In what format is the grabbed data (ie what tables, >> columns, rows)? It usually just takes an insert select to move data from >> one table's schema to another. >> > Well, you would have to replace the insert commands, lets say that I have > table movies with columns id,title,year,plot > In the script I have to set the columns as variables: > id=tt12345 > title="movie title" > year=2011 > plot="this is the plot" > > Ant then I call the command insert: > insert movies > > The command insert knows the columns for each table, it goes column by > column, if the column is declared as a variable it is inserted so you would > have to replace all those commands with something like: > echo "INSERT INTO movies" etc... Oh, I see, so the "complication" is that you would have to change the columns used in your bash script. That shouldn't be a big problem, but I'll leave the bash script to you. If for some reason rewriting the insert command in the bash script is insurmountable, you can just create a view in SQL to match the expectations of the bash script. That view can funnel each insert to the underlying SQL schema table columns using an "instead of insert" trigger. Let me know if you need more info on this. > Let me know if you are interested in the script, the script is included in > https://launchpad.net/figuritas/+download the application , but I can post > and explain the relevant functions if you want. No, that's fine, I just couldn't understand why changing the insert statement in the bash script is a problem. >> What you want, I think, however, is how many people there are who are >> directors (counting each person just once, even if they direct multiple >> movies), given by this: >> >> select count(*) from "Capacity People Statistics" >> where Capacity_ID = (select ID from Capacity where Name = 'director') >> ; >> > Ok then, yes I wanted to count directors counting each person just once, I > think that that code will not count much faster. I expect it will be *much* faster than count(distinct), almost instant. >> Hopefully now that I've given you the query you actually needed, it now >> makes sense ;-) >> > Yes it does, but then I don't need the Count column since I want to speed up > counting distinct directors but counting each person just once. Well, you will actually need the Count column if you plan on facilitating deletion of directors, since you will need to track when the count of a person's director roles drops to zero (so they're no longer counted). In that case, you will also need a slightly modified select: select count(*) from "Capacity People Statistics" where Capacity_ID = (select ID from Capacity where Name = 'director') and Count > 0 ; or you can change the delete trigger to remove statistics that drop to a zero count: begin immediate ; drop trigger if exists "Movie People delete" ; create trigger "Movie People delete" on "Movie People" after delete begin insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, Count) select old.Capacity_ID , old.People_ID , ( select coalesce(Count, 0) - 1 from "Capacity People Statistics" where Capacity_ID = old.Capacity_ID and People_ID = old.People_ID ) ; delete from "Capacity People Statistics" where Count = 0 ; end ; commit ; >> No, I don't think you need more tables to achieve the above. >> > Yes you can do it with the above but I want to make it very fast, with those > two additional tables I think it will be much faster. Great, I'm glad we finally got there :-) Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
BareFeetWare-2 wrote: > > Does IMDB allow use of their data this way? After my brief reading of > their site, I thought they charge a $15k minimum per year for data. > IMDb Pro offers additional information and services, all the information I am grabbing is already available to the public and I am not publishing this information, there are a lot of programs that do this. BareFeetWare-2 wrote: > >> The script is written in bash and I can give you the code but I think it >> would be very complicated to adapt it to your schema. > > Complicated? Why? In what format is the grabbed data (ie what tables, > columns, rows)? It usually just takes an insert select to move data from > one table's schema to another. > Well, you would have to replace the insert commands, lets say that I have table movies with columns id,title,year,plot In the script I have to set the columns as variables: id=tt12345 title="movie title" year=2011 plot="this is the plot" Ant then I call the command insert: insert movies The command insert knows the columns for each table, it goes column by column, if the column is declared as a variable it is inserted so you would have to replace all those commands with something like: echo "INSERT INTO movies" etc... Let me know if you are interested in the script, the script is included in https://launchpad.net/figuritas/+download the application , but I can post and explain the relevant functions if you want. BareFeetWare-2 wrote: > > Well, yes, but it depends of the definition of "how many directors there > are". The above counts how many directors there are in total, counting the > same person for each movie they direct. > > What you want, I think, however, is how many people there are who are > directors (counting each person just once, even if they direct multiple > movies), given by this: > > select count(*) from "Capacity People Statistics" > where Capacity_ID = (select ID from Capacity where Name = 'director') > ; > Ok then, yes I wanted to count directors counting each person just once, I think that that code will not count much faster. BareFeetWare-2 wrote: > > Hopefully now that I've given you the query you actually needed, it now > makes sense ;-) > Yes it does, but then I don't need the Count column since I want to speed up counting distinct directors but counting each person just once. BareFeetWare-2 wrote: > > No, I don't think you need more tables to achieve the above. > Yes you can do it with the above but I want to make it very fast, with those two additional tables I think it will be much faster. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30935871.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On Feb 13, 2011, at 3:38 PM, BareFeetWare wrote: > On 13/02/2011, at 1:04 AM, Yuzem wrote: > >> I am grabbing the data from the each movie imdb webpage. > > Does IMDB allow use of their data this way? After my brief reading of their > site, I thought they charge a $15k minimum per year for data. They do allow some limited free use of their data, though I doubt they'll be happy with users scrapping their site, I'd imagine they prefer devs download the database files they provide: http://www.imdb.com/interfaces > >>> >>> >> >> >> >>> ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 13/02/2011, at 1:04 AM, Yuzem wrote: > I am grabbing the data from the each movie imdb webpage. Does IMDB allow use of their data this way? After my brief reading of their site, I thought they charge a $15k minimum per year for data. > The script is written in bash and I can give you the code but I think it > would be very complicated to adapt it to your schema. Complicated? Why? In what format is the grabbed data (ie what tables, columns, rows)? It usually just takes an insert select to move data from one table's schema to another. >> You count directors like this: >> >> select sum(Count) from "Capacity People Statistics" >> where Capacity_ID = (select ID from Capacity where Name = 'director') >> ; >> > Are you sure that this count how many directors there are? Well, yes, but it depends of the definition of "how many directors there are". The above counts how many directors there are in total, counting the same person for each movie they direct. What you want, I think, however, is how many people there are who are directors (counting each person just once, even if they direct multiple movies), given by this: select count(*) from "Capacity People Statistics" where Capacity_ID = (select ID from Capacity where Name = 'director') ; > If I understand it correctly (probably I don't) you have for example table > "Movie People": > 1|director|1 > 2|director|1 > 3|director|2 > > There are 2 directors and then in "Capacity People Statistics" you should > have: > director|1|2 > director|2|1 > > If you use the previous code: > select sum(Count) from "Capacity People Statistics" > where Capacity_ID = (select ID from Capacity where Name = 'director') > ; > > I think it will return 3 but there are only 2 directors. Correct. That query will tell you that there are three (3) directors in the database. But two (2) of those directors are the same person. To instead get what you want, the number of people who are directors, do this (repeat of above SQL): select count(*) from "Capacity People Statistics" where Capacity_ID = (select ID from Capacity where Name = 'director') ; which gives your desired answer of 2. > Sorry, but I still don't understands it because I don't understand that you > can count directors that way. Hopefully now that I've given you the query you actually needed, it now makes sense ;-) >> I expect the above to be about the same speed or faster... than separate >> tables, but far more flexible (eg no need to add a table to accommodate a >> new capacity), and better normalized. >> > But using different tables provides an instant result, you can try it with > any table: > SELECT count(*) from table; Let me know if you see any noticeable difference in speed. > In the other hand you are right, it is less flexible. > Oh wait... I think I got it. I need two more tables: No, I don't think you need more tables to achieve the above. Thanks, Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
BareFeetWare-2 wrote: > > What is your source for the data? If I have that, I can test my schema > properly, rather than sitting here entering a pile of dummy data. I looked > at IMDB, but they only seem to have a paid license download. > I am grabbing the data from the each movie imdb webpage. The script is written in bash and I can give you the code but I think it would be very complicated to adapt it to your schema. BareFeetWare-2 wrote: > > No. Sorry, I should have included that. You count directors like this: > > select sum(Count) from "Capacity People Statistics" > where Capacity_ID = (select ID from Capacity where Name = 'director') > ; > Are you sure that this count how many directors there are? If I understand it correctly (probably I don't) you have for example table "Movie People": 1|director|1 2|director|1 3|director|2 There are 2 directors and then in "Capacity People Statistics" you should have: director|1|2 director|2|1 If you use the previous code: select sum(Count) from "Capacity People Statistics" where Capacity_ID = (select ID from Capacity where Name = 'director') ; I think it will return 3 but there are only 2 directors. BareFeetWare-2 wrote: > >> Another thing: I don't understand the purpose of the Count column in >> table >> "Capacity People Statistics" > > It hopefully now makes sense with my correction and example selects above. > Sorry, but I still don't understands it because I don't understand that you can count directors that way. BareFeetWare-2 wrote: > > I expect the above to be about the same speed or faster (since part of the > counting is already done) than separate tables, but far more flexible (eg > no need to add a table to accommodate a new capacity), and better > normalized. > But using different tables provides an instant result, you can try it with any table: SELECT count(*) from table; In the other hand you are right, it is less flexible. A more flexible and faster approach would be a table with total directors, writers, etc... but that is very complicated because I would have to count distinct directors, writers, etc... in every insert. Oh wait... I think I got it. I need two more tables: Tables: "Movie People" (movie_id capacity people_id) "Capacity People" (capacity_id people_id) "Capacity Count" (capacity count) On every insert into "Movie People" I trigger an insert into "Capacity People" and on every insert/delete from "Capacity People" I add/substract 1 from the corresponding capacity in "Capacity Count" Then I can count directors with: SELECT count FROM "Capacity Count" WHERE capacity = 'directors'; What do you think? (I didn't test it) BareFeetWare-2 wrote: > >> Another thing: I don't understand the purpose of the Count column in >> table >> "Capacity People Statistics" > > It hopefully now makes sense with my correction and example selects above. > Sorry, but I still don't understands it, of-course, this is because I don't understand that you can count directors that way. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30908962.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11/02/2011, at 11:51 PM, Yuzem wrote: >> you want the number of each unique (capacity, person). No problem. We'll >> just add a People column to the statistics, and change the uniqueness >> constraint. Hmm, seems I didn't quite change the uniqueness constraint. I meant to allow multiple capacities and multiple people, but keep each combination of (capacity, person) unique. So we need a Capacity_ID column and to use it instead of ID when inserting. The schema should therefore be: begin immediate ; drop table if exists "Capacity People Statistics" ; create table "Capacity People Statistics" ( ID integer primary key not null , Capacity_ID integer not null references "Capacity" (ID) on delete cascade , People_ID integer not null references "People" (ID) on delete cascade , Count integer not null , unique (Capacity_ID, People_ID) ) ; insert into "Capacity People Statistics" (Capacity_ID, People_ID, Count) select Capacity_ID, People_ID, count(*) from "Movie People" group by Capacity_ID, People_ID having Count > 0 ; drop trigger if exists "Movie People insert" ; create trigger "Movie People insert" on "Movie People" after insert begin insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, Count) select new.Capacity_ID , new.People_ID , ( select coalesce(Count, 0) + 1 from "Capacity People Statistics" where Capacity_ID = new.Capacity_ID and People_ID = new.People_ID ) ; end ; drop trigger if exists "Movie People delete" ; create trigger "Movie People delete" on "Movie People" after delete begin insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, Count) select old.Capacity_ID , old.People_ID , ( select coalesce(Count, 0) - 1 from "Capacity People Statistics" where Capacity_ID = old.Capacity_ID and People_ID = old.People_ID ) ; end ; drop trigger if exists "Movie People update" ; create trigger "Movie People update" on "Movie People" after update of Capacity_ID, People_ID begin insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, Count) select old.Capacity_ID , old.People_ID , ( select coalesce(Count, 0) - 1 from "Capacity People Statistics" where Capacity_ID = old.Capacity_ID and People_ID = old.People_ID ) ; insert or replace into "Capacity People Statistics" (Capacity_ID, People_ID, Count) select new.Capacity_ID , new.People_ID , ( select coalesce(Count, 0) + 1 from "Capacity People Statistics" where Capacity_ID = new.Capacity_ID and People_ID = new.People_ID ) ; end ; commit ; What is your source for the data? If I have that, I can test my schema properly, rather than sitting here entering a pile of dummy data. I looked at IMDB, but they only seem to have a paid license download. > That should work, if I understand it correctly now I can count directors with: > SELECT count(*) FROM "Capacity People Statistics" WHERE ID = 'directors'; No. Sorry, I should have included that. You count directors like this: select sum(Count) from "Capacity People Statistics" where Capacity_ID = (select ID from Capacity where Name = 'director') ; or count how many times a particular person is involved in movies: select sum(Count) from "Capacity People Statistics" where People_ID = (select ID from People where Name = 'Clint Eastwood') ; or count the number of times a particular person is a director (notice no sum needed): select Count from "Capacity People Statistics" where Capacity_ID = (select ID from Capacity where Name = 'director') and People_ID = (select ID from People where Name = 'Clint Eastwood') ; > It is similar to using different tables with only the IDs. > Instead of using a table for directors another for writers, etc... this is > specified in a column, it is slower but it uses only one table. I don't think you'll find a noticeable speed difference. It's searching just indexed primary key integer columns, which is very fast. It is optimized (with auto indexes) for searching by Capacity_ID, People_ID in that order, and searching by Capacity_ID alone. If you also do a lot of searches by People_ID alone, then add an index: create index "Capacity People Statistics - People" on "Capacity People Statistics" (People_ID) ; > If I use different tables the result is instant, I don't know if it will be > much faster to count from "Capacity People Statistics" than counting from > "Movie People". I expect the above to be about the same speed or faster (since part of the counting is already done) than separate tables, but far more flexible (eg no need to add a table to accommodate a new capacity), and better normalized. > Another thing: I don't understand the purpose of the Count column in table > "Capacity People Statistics" It hopefully now makes sense with my correction and example se
Re: [sqlite] Speed up count(distinct col)
BareFeetWare-2 wrote: > > Oh, I see. I was counting the total number of each unique capacity > (including director, writers), but you want the number of each unique > (capacity, person). No problem. We'll just add a People column to the > statistics, and change the uniqueness constraint. > > Replace my earlier "Capacity Statistics" and triggers with this: > That should work, if I understand it correctly now I can count directors with: SELECT count(*) FROM "Capacity People Statistics" WHERE ID = 'directors'; Instead of: SELECT count(distinct People_ID) FROM "Movie People" WHERE Capacity_ID = 'directors'; It is similar to using different tables with only the IDs. Instead of using a table for directors another for writers, etc... this is specified in a column, it is slower but it uses only one table. If I use different tables the result is instant, I don't know if it will be much faster to count from "Capacity People Statistics" than counting from "Movie People". Another thing: I don't understand the purpose of the Count column in table "Capacity People Statistics" Simon Slavin-3 wrote: > > So if I deleted one record and created another you wouldn't spot it ? > Yes, any change to the database will result in a complete refresh of the cache. Simon Slavin-3 wrote: > > Your process is trustworthy only when you are certain that the database > file is not currently open. If there's a chance that some application may > be modifying the file when you check these things then the results you get > may not be up-to-date. > Yes but what can I do about it, the same happens if I open a text document that it is already open in another text editor. I could check if there is a journal file but I don't know if it is necessary since I am not keeping any connection open, I try to make the connections as short as possible and if I get some data that it isn't up to date I don't see much problem. Something that I just realized, comparing INTEGER columns doesn't seems faster than comparing TEXT columns, this: SELECT count(distinct ROWID) FROM keywords; Result: 83513 Is slower than this: SELECT count(distinct keywords) FROM keywords; Result: 17321 -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30900999.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11 Feb 2011, at 2:37am, Yuzem wrote: > Simon Slavin-3 wrote: >> > >> By looking at the file on disk ? Are you taking into account the journal >> file ? >> > > Yes, I do all the counts So if I deleted one record and created another you wouldn't spot it ? > and save the data to a file and then if the file is > newer then the database I use the file else I count again. Your process is trustworthy only when you are certain that the database file is not currently open. If there's a chance that some application may be modifying the file when you check these things then the results you get may not be up-to-date. The only way to correctly tell if the data in the database has changed is to use SQLite calls. Also, your operating system does not update the file attributes every time the contents of the file changes. It will often wait for a change of sector or a buffer to be flushed. > No, I am not taking the journal file into account, I don't know what you > mean by that. SQLite uses a journal file to temporarily store change to the database. You will find this journal file appearing in the same directory as the database file if you have the database open and have made changes. These changes will eventually be saved in the database file -- definitely when you close all the SQLite handles to that file -- but until then you have no way of knowing what data SQLite considers is actually in the database apart from using SQLite calls. Read here, especially sections 2.1 to 2.3, about the temporary files SQLite makes: http://www.sqlite.org/tempfiles.html Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11/02/2011, at 1:37 PM, Yuzem wrote: > For example lets say I have the following data: > 1|director|1 > 2|director|2 > 3|director|1 > > In this example the total count for directors is 2, I have two distinct > directors. > In the table "Capacity Statistics" I will have: > director|2 > > The triggers you made add/subtract 1 from "Capacity Statistics" on > insert/delete on "movies people" > What happens if I add the following to "movies people"? > 4|director|2 > > The trigger should add 1 to "Capacity Statistics": > director|3 > > But there are still 2 directors: > 1|director|1 > 2|director|2 > 3|director|1 > 4|director|2 Oh, I see. I was counting the total number of each unique capacity (including director, writers), but you want the number of each unique (capacity, person). No problem. We'll just add a People column to the statistics, and change the uniqueness constraint. Replace my earlier "Capacity Statistics" and triggers with this: begin immediate ; create table "Capacity People Statistics" ( ID integer primary key references "Capacity" (ID) on delete cascade , People_ID integer not null references "People" (ID) on delete cascade , Count integer not null , unique (ID, People_ID) ) ; insert into "Capacity People Statistics" (ID, People_ID, Count) select Capacity_ID, People_ID, count(*) from "Movie People" group by Capacity_ID, People_ID having Count > 0 ; create trigger "Movie People insert" on "Movie People" after insert begin insert or replace into "Capacity People Statistics" (ID, People_ID, Count) select new.Capacity_ID , new.People_ID , ( select coalesce(Count, 0) + 1 from "Capacity People Statistics" where ID = new.Capacity_ID and People_ID = new.People_ID ) ; end ; create trigger "Movie People delete" on "Movie People" after delete begin insert or replace into "Capacity People Statistics" (ID, People_ID, Count) select old.Capacity_ID , old.People_ID , ( select coalesce(Count, 0) - 1 from "Capacity People Statistics" where ID = old.Capacity_ID and People_ID = old.People_ID ) ; end ; create trigger "Movie People update" on "Movie People" after update of Capacity_ID, People_ID begin insert or replace into "Capacity People Statistics" (ID, People_ID, Count) select old.Capacity_ID , old.People_ID , ( select coalesce(Count, 0) - 1 from "Capacity People Statistics" where ID = old.Capacity_ID and People_ID = old.People_ID ) ; insert or replace into "Capacity People Statistics" (ID, People_ID, Count) select new.Capacity_ID , new.People_ID , ( select coalesce(Count, 0) + 1 from "Capacity People Statistics" where ID = new.Capacity_ID and People_ID = new.People_ID ) ; end ; commit ; Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
Simon Slavin-3 wrote: > > By looking at the file on disk ? Are you taking into account the journal > file ? > Yes, I do all the counts and save the data to a file and then if the file is newer then the database I use the file else I count again. No, I am not taking the journal file into account, I don't know what you mean by that. BareFeetWare-2 wrote: > > Does tis answer your needs? If not, please explain further, but it will > probably only require modifying a constraint in the schema I proposed, > rather than denormalizing or partitioning. > I will try to explain it. I have the table "movies people" with columns "movie_ID, capacity_ID, people_ID" I understand that all rows are unique but people_ID isn't unique. For example lets say I have the following data: 1|director|1 2|director|2 3|director|1 In this example the total count for directors is 2, I have two distinct directors. In the table "Capacity Statistics" I will have: director|2 The triggers you made add/subtract 1 from "Capacity Statistics" on insert/delete on "movies people" What happens if I add the following to "movies people"? 4|director|2 The trigger should add 1 to "Capacity Statistics": director|3 But there are still 2 directors: 1|director|1 2|director|2 3|director|1 4|director|2 BareFeetWare-2 wrote: > > Erm ... there are also movies which have more than one person directing. > You need to copy across the rowid from the MoviePeople table, and delete > just based on that particular record. > Yes, of-course. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30898156.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11/02/2011, at 11:30 AM, Simon Slavin wrote: > Come to think of it, what's really happening here is a FOREIGN KEY situation. Yes, that's why I have the foreign keys (ie "references") in the schema. So, for instance, if you delete a movie, all of the actors, directors etc associated with that movie are automatically deleted, but the same people are kept for other movies. SQL takes care of all the nitty gritty stuff, so we don't have to reinvent the wheel in application code. Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11 Feb 2011, at 12:26am, BareFeetWare wrote: > On 11/02/2011, at 11:11 AM, Simon Slavin wrote: > >> Erm ... there are also movies which have more than one person directing. >> You need to copy across the rowid from the MoviePeople table, and delete >> just based on that particular record. > > Adding to what Simon said: > > The schema I posted allows for multiple directors for each movie, the same > person having multiple capacities in the same movie (eg writer, director and > actor) etc. Any changes to the model and what's allowed (ie constraints) > should be done there and not in your application layer, IMNSHO. > > If you want to delete or insert, based on text (eg person's name), I suggest > NOT getting your application to get the matching rowid, then reinjected it > into a second SQL call. Do it all in one SQL transaction Come to think of it, what's really happening here is a FOREIGN KEY situation. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11/02/2011, at 11:11 AM, Simon Slavin wrote: > Erm ... there are also movies which have more than one person directing. You > need to copy across the rowid from the MoviePeople table, and delete just > based on that particular record. Adding to what Simon said: The schema I posted allows for multiple directors for each movie, the same person having multiple capacities in the same movie (eg writer, director and actor) etc. Any changes to the model and what's allowed (ie constraints) should be done there and not in your application layer, IMNSHO. If you want to delete or insert, based on text (eg person's name), I suggest NOT getting your application to get the matching rowid, then reinjected it into a second SQL call. Do it all in one SQL transaction, such as: delete from "Movie People" where Movie_ID = (select Movie_ID from Movies where Title = 'Back to the Future') and People_ID = (select ID from People where Name = 'Eric Stoltz') and Capacity_ID = (select ID from Capacity where Name = 'actor') or: update "Movie People" set People_ID = (select ID from People where Name = 'Michael J Fox') where Movie_ID = (select Movie_ID from Movies where Title = 'Back to the Future') and People_ID = (select ID from People where Name = 'Eric Stoltz') and Capacity_ID = (select ID from Capacity where Name = 'actor') By the way, how are you getting the raw data? What URL or RSS feed or whatever? Thanks, Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 10 Feb 2011, at 11:55pm, Yuzem wrote: > That's the beauty of using a additional table. I make the column unique in > the the extra table and then on any insert in "movies people" I insert in > for example the directors table and any duplicate will be automatically > rejected. > > The same problem exist on delete, the deleted entry may not be unique. > > For example: > movie1|director|people1 > movie2|director|people1 Erm ... there are also movies which have more than one person directing. You need to copy across the rowid from the MoviePeople table, and delete just based on that particular record. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
> BareFeetWare-2 wrote: >> >> In that case, you should cache the counts in a separate table or two. That >> has a negligible overhead when you add a movie (which is infrequent), and >> basically no overhead when viewing (which is frequent). > I am doing that but in the application level, the down side is that I am > updating the cache on any change, I check the modified time of the database. I strongly suggest doing as much as you can at the SQL level, especially where it concerns the integrity of your data. You should find it faster, more reliable, self contained and logical. > BareFeetWare-2 wrote: >> >> Do that with triggers (which are a good thing in this context), to change >> the relevant count when an insert, delete or update occurs in the "Movie >> People" table. Something like: >> > I think, not sure, that there will be a problem with that code. > I am not an expert so I had to stare that code some time to understand it. > If I am correct you are adding and subtracting 1 on every insert or delete. Yes, and doing a delete/insert combo for an update (ie if a person's capacity in a movie changes). > The problem is that an insert may not be unique so before augmenting the > counter you have to check if it is unique and therefore you would have to > run a count(distinct col) on every insert which would be overkill. If I understand you correctly, the non-unique criteria is already handled by the schema design. The "Movie People" table allows the same person to be listed as the director for multiple movies. The constraints on the "Movie People" only require no nulls and unique combinations of Movie, Capacity, People (ie the same person can't be listed as director for the same movie twice, but can be listed as director for two movies). Because it uses the "unique" constraint for this, SQLite automatically builds an index "Movie People"(Movie_ID, Capacity_ID, People_ID), so that whenever you try to insert a new row, it quickly checks if it already exists. You can throw new non-unique rows at it with "insert or ignore" if you just want it to ignore duplicates, which then won't trigger the count increase because nothing was inserted. Or use plain old "insert" if you want to be alerted to any attempted unique violations. Does tis answer your needs? If not, please explain further, but it will probably only require modifying a constraint in the schema I proposed, rather than denormalizing or partitioning. Thanks, Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 10 Feb 2011, at 11:55pm, Yuzem wrote: > I check the modified time of the database. By looking at the file on disk ? Are you taking into account the journal file ? Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
BareFeetWare-2 wrote: > > In that case, you should cache the counts in a separate table or two. That > has a negligible overhead when you add a movie (which is infrequent), and > basically no overhead when viewing (which is frequent). I am doing that but in the application level, the down side is that I am updating the cache on any change, I check the modified time of the database. BareFeetWare-2 wrote: > > Do that with triggers (which are a good thing in this context), to change > the relevant count when an insert, delete or update occurs in the "Movie > People" table. Something like: > I think, not sure, that there will be a problem with that code. I am not an expert so I had to stare that code some time to understand it. If I am correct you are adding and subtracting 1 on every insert or delete. The problem is that an insert may not be unique so before augmenting the counter you have to check if it is unique and therefore you would have to run a count(distinct col) on every insert which would be overkill. That's the beauty of using a additional table. I make the column unique in the the extra table and then on any insert in "movies people" I insert in for example the directors table and any duplicate will be automatically rejected. The same problem exist on delete, the deleted entry may not be unique. For example: movie1|director|people1 movie2|director|people1 -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30897526.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
Oops, I should have said old instead of new in a couple of places: > begin immediate > ; > create table "Capacity Statistics" > ( ID integer primary key unique references "Capacity" (ID) on delete > cascade > , Count integer not null > ) > ; > insert into "Capacity Statistics" (ID, Count) select Capacity_ID, count(*) > from "Movie People" group by Capacity_ID having Count > 0 > ; > create trigger "Movie People insert" > on "Movie People" > after insert > begin > insert or replace into "Capacity Statistics" (ID, Count) > select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity > Statistics" where ID = new.Capacity_ID) > ; > end > ; > create trigger "Movie People delete" > on "Movie People" > after insert > begin > insert or replace into "Capacity Statistics" (ID, Count) > select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity > Statistics" where ID = old.Capacity_ID) Should be: select old.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity Statistics" where ID = old.Capacity_ID) > ; > end > ; > create trigger "Movie People update" > on "Movie People" > after update of Capacity_ID > begin > insert or replace into "Capacity Statistics" (ID, Count) > select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity > Statistics" where ID = new.Capacity_ID) > ; > insert or replace into "Capacity Statistics" (ID, Count) > select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity > Statistics" where ID = old.Capacity_ID) Should be: select old.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity Statistics" where ID = old.Capacity_ID) > ; > end > ; > commit > ; Thanks, Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 11/02/2011, at 9:40 AM, Yuzem wrote: > Yes, in my application I have in the sidebar all those sections (movies, > years, tags, keywords, actors, directors, writers, etc...) and I count each > one, how many movies, how many years, etc... > It isn't very slow if I update only one item but the problem is when I have > to update the entire list, this uses a lot of cpu and takes some time and I > have to do it every time a movie is added. > http://3.bp.blogspot.com/_EbNPUgfUDXs/TIpif4U1III/A2A/YFSGOAdpOGA/s1600/nested-brosers.jpg > Here there is an old screenshot from the sidebar. In that case, you should cache the counts in a separate table or two. That has a negligible overhead when you add a movie (which is infrequent), and basically no overhead when viewing (which is frequent). Do that with triggers (which are a good thing in this context), to change the relevant count when an insert, delete or update occurs in the "Movie People" table. Something like: begin immediate ; create table "Capacity Statistics" ( ID integer primary key unique references "Capacity" (ID) on delete cascade , Count integer not null ) ; insert into "Capacity Statistics" (ID, Count) select Capacity_ID, count(*) from "Movie People" group by Capacity_ID having Count > 0 ; create trigger "Movie People insert" on "Movie People" after insert begin insert or replace into "Capacity Statistics" (ID, Count) select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity Statistics" where ID = new.Capacity_ID) ; end ; create trigger "Movie People delete" on "Movie People" after insert begin insert or replace into "Capacity Statistics" (ID, Count) select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity Statistics" where ID = old.Capacity_ID) ; end ; create trigger "Movie People update" on "Movie People" after update of Capacity_ID begin insert or replace into "Capacity Statistics" (ID, Count) select new.Capacity_ID, (select coalesce(Count, 0) + 1 from "Capacity Statistics" where ID = new.Capacity_ID) ; insert or replace into "Capacity Statistics" (ID, Count) select new.Capacity_ID, (select coalesce(Count, 0) - 1 from "Capacity Statistics" where ID = old.Capacity_ID) ; end ; commit ; By contrast, I suggest it would be a bad idea to denormalize and partition your data (ie separate directors and writers tables) just for the sake of tracking the count. The solution above maintains a normalized database, gives you the flexibility to add or remove Capacities in data rather than having to change the schema, and better reflects the real data model. Thanks, Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
Simon Slavin-3 wrote: > > How much slower. Did you make an index SQLite could use for that query ? > Using distinct isn't slow, it is what I would expect but count(*) is incredibly fast, it is instantaneous no matter how large is the table. Yes, I tried with an index. BareFeetWare-2 wrote: > > Or you can try this: > > select count(*) from People > where ID in > ( select People_ID from "Movie People" where Capacity_ID = > (select ID from Capacity where Name = 'director') > ); > > But I expect you'll get the same performance. > Yes, almost the same performance. BareFeetWare-2 wrote: > > You don't want to denormalize and have separate director and writer tables > etc. That will get ugly. > That's what I thought. Petite Abeille-2 wrote: > > Right... looking at something like the casting information in IMdb (actor, > director, writer, etc), there are about 25M movie + role + person > combinations. Which, while not huge, starts to get taxing when queried on > a lowly laptop. > > For example, out of these 25M rows, about 1.2M represent directors (~217K) > in movies (~1M). > Yes, in my application I have in the sidebar all those sections (movies, years, tags, keywords, actors, directors, writers, etc...) and I count each one, how many movies, how many years, etc... It isn't very slow if I update only one item but the problem is when I have to update the entire list, this uses a lot of cpu and takes some time and I have to do it every time a movie is added. http://3.bp.blogspot.com/_EbNPUgfUDXs/TIpif4U1III/A2A/YFSGOAdpOGA/s1600/nested-brosers.jpg Here there is an old screenshot from the sidebar. Petite Abeille-2 wrote: > > Assuming a movie_cast table [1] and a cast dimension [2], you could record > the distinct count for each cast in the dimension once, and store it. That > way you don't have to recompute it over and over. > Yes, I am caching all counts in the application level but I have to update it every time the database changes so I have to update it at least every time I add a movie. Petite Abeille-2 wrote: > > Alternatively, you could indeed partition that bulky movie_cast table by > its cast type, which will in effect reduce the search space to at most ~9M > (movies by actors, ~1.3M movies, ~1.3M actors). > > That said, even partitioning will not help you much here as you still have > to search through ~9M records to figure out how many distinct actors you > have. > Actually, I already have different tables for directors, writers, etc... Petite Abeille-2 wrote: > > So... perhaps best to cheat :) > > One way to cheat is to precompute the answer by, for example, adding a > is_* flag on your person table: > > update person set is_actor = ( select count( distinct person_id ) from > movie_actor where movie_actor.person_id = person.id ); > > The person table is much smaller (~3M) and easier to query in bulk: > > explain query plan > select count( * ) from person where is_actor = 1; > 0|0|0|SEARCH TABLE person USING COVERING INDEX person_is_actor > (is_actor=?) (~1596808 rows) > > 1310119 > CPU Time: user 0.256286 sys 0.000257 > > There you go :) > That's a good idea, I wonder if it isn't better to use a different table to store only the ids for the directors, etc, as I said before. I would end up with these tables: movies people movies_directors (movieId, peopleId) movies_writers (movieId, peopleId) directors (peopleId) writers (peopleId) Then I can count directors with: SELECT count(*) FROM directors; Petite Abeille-2 wrote: > > This is where I wish SQLite could have bitmap indices: > http://en.wikipedia.org/wiki/Bitmap_index > That seems what I need. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30897078.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On Feb 10, 2011, at 5:00 PM, Yuzem wrote: > The only thing I can think of is to have additional tables for the ids of > all directors, all writers, etc... > Tables "movies", "people", "capacity" and then tables "directors", > "writers", etc.. with only the IDs so I can count using count(*) which is > super fast. Right... looking at something like the casting information in IMdb (actor, director, writer, etc), there are about 25M movie + role + person combinations. Which, while not huge, starts to get taxing when queried on a lowly laptop. For example, out of these 25M rows, about 1.2M represent directors (~217K) in movies (~1M). Assuming a movie_cast table [1] and a cast dimension [2], you could record the distinct count for each cast in the dimension once, and store it. That way you don't have to recompute it over and over. Alternatively, you could indeed partition that bulky movie_cast table by its cast type, which will in effect reduce the search space to at most ~9M (movies by actors, ~1.3M movies, ~1.3M actors). That said, even partitioning will not help you much here as you still have to search through ~9M records to figure out how many distinct actors you have. So... perhaps best to cheat :) One way to cheat is to precompute the answer by, for example, adding a is_* flag on your person table: update person set is_actor = ( select count( distinct person_id ) from movie_actor where movie_actor.person_id = person.id ); The person table is much smaller (~3M) and easier to query in bulk: explain query plan select count( * ) from person where is_actor = 1; 0|0|0|SEARCH TABLE person USING COVERING INDEX person_is_actor (is_actor=?) (~1596808 rows) 1310119 CPU Time: user 0.256286 sys 0.000257 There you go :) This is where I wish SQLite could have bitmap indices: http://en.wikipedia.org/wiki/Bitmap_index Oh, well... [1] http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L161 [2] http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L146 ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
> Thanks you but what I want to do is to count without using count(distinct > col) because it is much slower than count(*). I think you'll find the schema I posted very fast, since it's running everything, including distinct, on primary key columns. Or you can try this: select count(*) from People where ID in ( select People_ID from "Movie People" where Capacity_ID = (select ID from Capacity where Name = 'director') ); But I expect you'll get the same performance. You don't want to denormalize and have separate director and writer tables etc. That will get ugly. Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 10 Feb 2011, at 4:00pm, Yuzem wrote: > Thanks you but what I want to do is to count without using count(distinct > col) because it is much slower than count(*). How much slower. Did you make an index SQLite could use for that query ? A good index for that might be People_ID,Capacity_ID but it might be faster with those two the other way around. Try them both out and see which gives you faster results. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
BareFeetWare-2 wrote: > > Then you can count the directors like this: > > select count(distinct People_ID) from "Movie People" join Capacity on > "Movie People".Capacity_ID = Capacity.ID where Capacity.Name = 'director'; > > or: > > select count(distinct People_ID) from "Movie People" where Capacity_ID = > (select ID from Capacity where Name = 'director'); > > or you can create a view [...] > Thanks you but what I want to do is to count without using count(distinct col) because it is much slower than count(*). In the previous examples about tags I can do it, I have "movies" and "tags" related by "movies_tags" and I can do: SELECT count(*) FROM tags; I want to know if there is any standard approach to do that. The only thing I can think of is to have additional tables for the ids of all directors, all writers, etc... Tables "movies", "people", "capacity" and then tables "directors", "writers", etc.. with only the IDs so I can count using count(*) which is super fast. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30893712.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On 08/02/2011, at 10:19 AM, Yuzem wrote: > I have the tables "movies" and "people" > Those two tables are related by tables "directors", "writers", etc... > movies: id, title > people: id, name > > directors: movieId, peopleId > writers: movieId, peopleId > etc... > > How can I normalize that so I can count directors and writers using > count(*)? > Should I create another table for "directors" and another for "writers", > etc... with only the ids? > I will end up having: > movies > movies_directors < (directors people) After implementing the schema in my previous post, add this: begin immediate; create table People ( ID integer primary key not null , Name text collate no case not null unique ); create table Capacity ( ID integer primary key not null , Name text collate no case not null unique ); create table "Movie People" ( ID integer primary key not null , Movie_ID integer not null references Movies (Movie_ID) on delete cascade , Capacity_ID integer not null references Capacity (ID) on delete cascade , People_ID integer not null references People (ID) on delete cascade , unique (Movie_ID, Capacity_ID, People_ID) ); commit; Then you can count the directors like this: select count(distinct People_ID) from "Movie People" join Capacity on "Movie People".Capacity_ID = Capacity.ID where Capacity.Name = 'director'; or: select count(distinct People_ID) from "Movie People" where Capacity_ID = (select ID from Capacity where Name = 'director'); or you can create a view to alphabetically list each director with a CSV list of their movies, like this: create view "Directors" as select People.Name, group_concat (Movies.Title, ', ') from "Movie People" join Movies on "Movie People".Movie_ID = Movies.Movie_ID join Capacity on "Movie People".Capacity_ID = Capacity.ID join People on "Movie People".People_ID = People.ID where Capacity.Name = 'director'; group by People_ID order by People.Name ; Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
You have: CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id)); CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag)); You can solve your problem, using pure SQL. No need to resort to the application layer. Just execute the SQL transaction below. It takes care of everything, including normalizing, assigning integer IDs to unique tags, filtering our duplicates and auto indexes. I haven't tested on your tables specifically, but I do similar all the time. Any errors should be simple name errors. begin immediate; create temp table "Movies Old" as select * from Movies; create temp table "Tags Old" from Tags; drop table Movies; drop table Tags; create table Movies ( Movie_ID integer primary key not null , Title text collate no case not null ); create table Tags ( Tag_ID integer primary key not null , Name text not null unique collate nocase ); create table Movie_Tags ( ID integer primary key not null , Movie_ID integer not null references Movies (Movie_ID) on delete cascade , Tag_ID integer not null references Tags (Tag_ID) on delete cascade , unique (Movie_ID, Tag_ID) ); insert into Movies (Movie_ID, Title) select Movie_ID, Title from "Movies Old"; insert or ignore into Tags (Name) select distinct Tag from "Tags Old"; insert into Movie_Tags (Movie_ID, Tag_ID) select distinct Movie_ID, (select Tag_ID from Tags where Name = "Tags Old".Tag) from "Tags Old"; drop table "Movies Old"; drop table "Tags Old"; commit or rollback; Note that the last line "commit or rollback" is not an actual SQLite command. You will need to manually decide at that point whether to issue a commit or rollback. If there are any errors in previous lines, use rollback. Unfortunately, SQLite does not automatically rollback all commands in a transaction if one of the commands fails (specifically create and drop commands). Then you should be able to count tags very fast by: select count(*) from Tags; In order to make foreign keys work, you need to have SQLite version 3.6.19 or later, and use this pragma when you open a connection to SQLite (ie before any commands that require use of foreign keys): pragma foreign_keys = yes; A normalized database using integer keys is fast, small and elegant. Viewing integers by users is ugly. So for viewing the Movie-Tag combinations (and data entry), you might want to create a view to make it user friendly. It depends on your user interface. Reply here if you need that. Just something like this: create view "Movie Tags View" as select Movie_Tags.ID as ID , Movies.Title as "Movie" , Tags.Name as "Tag" from Movie_Tags join Movies on Movie_Tags.Movie_ID = Movies.Movie_ID join Tags on Movie_Tags.Tag_ID = Tags.Tag_ID ; Hope this helps, Tom BareFeetWare -- Comparison of SQLite GUI tools: http://www.barefeetware.com/sqlite/compare/?ml ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On Feb 7, 2011, at 9:32 PM, Yuzem wrote: >> Hey... sounds like IMdb :P >> > Yes, I'm coding a http://yuzem.blogspot.com/p/figuritas-screenshots.html > movie manager that grabs the info from imdb. A fine hobby :) > I thought sqlite didn't handle foreign keys correctly, I thought wrong :) http://www.sqlite.org/foreignkeys.html > I am using triggers > to automatically delete tags when a movie gets deleted. Don't bother. > If I use a foreign key will it automatically remove the tag if the movie > gets deleted? See above. > Anyway, to use integers in the "tags" table is very complicated Nah... it's called normalization... it's a good thing :) http://en.wikipedia.org/wiki/Database_normalization > because I will have to assign the corresponding number to each tag that I > insert Yep. It's a good thing. > and I > have to insert lots of keywords for every movie. IMdb contains ~3.5M movie-keyword pairs (~405K distinct movies, 107K distinct keywords). Not much to write home about. > Does sqlite has any function to convert a text string into an unique number? You do it the other way around: create a unique entry in your tag table, then use the tag id in the movie_tag table that join movies to tags. Think of it as compression: instead of storing, say, the literal "accidental-cannibalism" multiple times in your movie_tag, you store the literal only once in the tag table, than use the literal id in the join table. And they lived happily ever after third normal form :)) ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
I would create a tagList table (integer tagID, string tagName, unique(tagName)) Before performing your batch of inserts, query the tagList table (integer tagID, string tagName), and generate a map (key on tagName, value of tagID). For each tag you want to insert, see if it exists in the map. If it doesn't, insert first into tagList table and get the tagID of that new entry and update your map. Perform your insert with the tagID instead now. If it does exist, use the value of the tagName key in your map. I chose a map (or some similar implementation), because the lookup should be quick, and you won't need to query the DB for each new tag during your batch of inserts. On Mon, Feb 7, 2011 at 2:32 PM, Yuzem wrote: > > > Petite Abeille-2 wrote: > > > > Hey... sounds like IMdb :P > > > Yes, I'm coding a http://yuzem.blogspot.com/p/figuritas-screenshots.html > movie manager that grabs the info from imdb. > > > Petite Abeille-2 wrote: > > > > In any case, as mentioned by Michael Black, you might benefit greatly by > > normalizing your table and indexing the foreign key: > > > > http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401 > > > > I thought sqlite didn't handle foreign keys correctly, I am using triggers > to automatically delete tags when a movie gets deleted. > If I use a foreign key will it automatically remove the tag if the movie > gets deleted? > Anyway, to use integers in the "tags" table is very complicated because I > will have to assign the corresponding number to each tag that I insert and > I > have to insert lots of keywords for every movie. > Does sqlite has any function to convert a text string into an unique > number? > > -- > View this message in context: > http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867411.html > Sent from the SQLite mailing list archive at Nabble.com. > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
Yuzem wrote: > I thought sqlite didn't handle foreign keys correctly, I am using > triggers to automatically delete tags when a movie gets deleted. There's a new thing, 'PRAGMA foreign_keys=1', that causes SQLite to enforce them. Check out the docs on that. > Anyway, to use integers in the "tags" table is very complicated because > I will have to assign the corresponding number to each tag that I insert > and I have to insert lots of keywords for every movie. Well, technically you can have SQLite do that assignment for you. But yes, you'll have to deal with that mapping somewhere. You can abstract that away at a very low level in your infrastructure though. It's not so terrible: again it's something like (in one high level language) set tagId [db one {SELECT tagId FROM tagNames WHERE tagName=$tagName}] db eval {INSERT INTO tags VALUES($movieId, $tagId)} That one SELECT lookup will probably save you tons of string comparisons later on. Depends of course on the nature of your data set and query stats. > Does sqlite has any function to convert a text string into an unique > number? If you run (again correcting for your preferred language) db eval {INSERT INTO tagNames(tagName) VALUES($tagName)} then SQLite will generate the tagId column value for you if the column is declared INTEGER PRIMARY KEY. Again, you can get it back out at some point later using SELECT as above, or if it's needed immediately (which it probably is) you can use last_insert_rowid. So, again in Tcl, your whole insertion path will probably look something like: db eval {INSERT OR IGNORE INTO tagNames(tagName) VALUES($tagName)} if {[db changes]} { set tagId [db last_insert_rowid] } else { set tagId [db one {SELECT tagId FROM tagNames WHERE tagName=$tagName}] } assert {$tagId!=""} db eval {INSERT OR IGNORE INTO tags VALUES($movieId, $tagId)} That could possibly be optimized further. But you could give it a go as a first pass and see how far it takes you. It's not so complicated, and in 7 lines (plus 1 sanity-check line) we've abstracted the ID<->text mapping. Eric -- Eric A. Smith Carperpetuation (kar' pur pet u a shun), n.: The act, when vacuuming, of running over a string at least a dozen times, reaching over and picking it up, examining it, then putting it back down to give the vacuum one more chance. -- Rich Hall, "Sniglets" ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
Petite Abeille-2 wrote: > > Hey... sounds like IMdb :P > Yes, I'm coding a http://yuzem.blogspot.com/p/figuritas-screenshots.html movie manager that grabs the info from imdb. Petite Abeille-2 wrote: > > In any case, as mentioned by Michael Black, you might benefit greatly by > normalizing your table and indexing the foreign key: > > http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401 > I thought sqlite didn't handle foreign keys correctly, I am using triggers to automatically delete tags when a movie gets deleted. If I use a foreign key will it automatically remove the tag if the movie gets deleted? Anyway, to use integers in the "tags" table is very complicated because I will have to assign the corresponding number to each tag that I insert and I have to insert lots of keywords for every movie. Does sqlite has any function to convert a text string into an unique number? -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30867411.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Speed up count(distinct col)
On Feb 7, 2011, at 4:26 PM, Yuzem wrote: > > I have many movies by tag and many tables like "tags" (keywords, countries, > languages, genres, etc..) Hey... sounds like IMdb :P > This solution implies one additional table and two triggers by each table. To paraphrase: Some people, when confronted with a problem, think “I know, I'll use triggers.” Now they have two problems. In any case, as mentioned by Michael Black, you might benefit greatly by normalizing your table and indexing the foreign key: http://dev.alt.textdrive.com/browser/IMDB/IMDB.ddl#L401 Don't bother with the triggers. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Speed up count(distinct col)
Hi. I have this tables: CREATE TABLE movies(movie_id INTEGER, title TEXT, unique(movie_id)); CREATE TABLE tags(movie_id INTEGER, tag TEXT, unique(movie_id,tag)); I have many movies by tag and many tables like "tags" (keywords, countries, languages, genres, etc..) I can count different movies very fast with: SELECT count(*) FROM movies; But the other tables are much slower depending on the size of the table: SELECT count(distinct tag) FROM tags; My solution is to create an additional table "count_tags" and then every time a tag is added to table "tags" a trigger adds the tag to "count_tags", I need also another trigger to remove the tag, then I can do: SELECT count(*) FROM count_tags; This solution implies one additional table and two triggers by each table. I wanted to know if there is a simpler solution, maybe by using indexes. Thanks in advance. -- View this message in context: http://old.nabble.com/Speed-up-count%28distinct-col%29-tp30864622p30864622.html Sent from the SQLite mailing list archive at Nabble.com. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users