Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
Hi Navin,

Excuse me if some of the points below repeat things you already know.

1. Dr. Hipp's advice not to create redundant indexes was *not* intended to give you very quick row counts -- Simon Slavin et al. had already given advice to speed up row counts -- and just now Stefan Keller even provided trigger SQL statements for you. The trigger solution will give almost instantaneous row counts (at the expense of slightly slower inserts and deletes).

2. Alternatively, since you write that you only do deletes every 5 days, then if by chance the time for running VACUUM just after these periodic batch deletes is tolerable, SELECT max(ROWID) FROM hp_table1; should give you instant satisfaction without using the triggers. (VACUUM can renumber the rowids of a table like yours contiguously, so max(ROWID) again matches the row count. If you choose this solution you would NOT specify WITHOUT ROWID, of course.)

3. We're all assuming that your application truly NEEDS to know the row count (and that you're not just using "select count(*) from hp_table1" as some arbitrary test SQL for sqlite).

4. The fact that select count(*) from very_large_table; can be slow is *not* evidence that sqlite can't handle tables as large as yours. Sqlite can update its b-trees till the cows come home. It was a design trade-off: sqlite could have been designed to always keep a current row count, but then inserts and deletes would be slightly slower for all tables for all users, and in a great many applications it's not needed. On the other hand, applications that *do* need quick access to the current row count can have it using triggers. Other databases might have been designed differently. (There *are*, however, many applications ill-suited to sqlite -- often because of high concurrency needs or the need for fancy features -- but row counting is not a veto item. https://www.sqlite.org/whentouse.html)

5. IMPORTANTLY -- I see you tried different page sizes, but did you also set a large CACHE size? E.g., "PRAGMA cache_size = -1000000" would set the cache to about 1 GByte (a negative value is interpreted in kibibytes).
Did I see that your postgres test was using 4 GB? You'd want to compare using about the same cache, I would think. (A large cache alone does NOT substitute for one of the "fast row count" solutions above, but it might make a big difference in your other operations.)

https://www.sqlite.org/pragma.html#pragma_cache_size

Regards,
Donald Griggs

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
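[Editor's note] The cache pragma above can be sketched in a few lines of Python with the stdlib sqlite3 module. This is a minimal illustration, not Donald's setup: the in-memory database stands in for the real hp.db file, and the key point is that a negative value is interpreted in kibibytes, so about -1000000 is needed for roughly 1 GB.

```python
import sqlite3

# In-memory database as a stand-in for the real 16 GB file.
conn = sqlite3.connect(":memory:")

# Negative values are in KiB: -1000000 requests roughly 1 GB of page cache.
# (A bare -100 would request only about 100 KiB.)
conn.execute("PRAGMA cache_size = -1000000")

# SQLite reports the negative KiB setting back unchanged.
cache = conn.execute("PRAGMA cache_size").fetchone()[0]
print(cache)  # -1000000
```

The setting is per-connection and not persistent, so a long-running application would issue it once right after opening the database.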
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
Hi Navin,

I've compared with PostgreSQL. It's twice as "fast" as SQLite with 100 million records on my old laptop -- but still too slow using count(). So, as Eduardo suggested, you have to solve this problem with a separate table and triggers, as shown below.

Yours, S.

-- Create test table
CREATE TABLE test (id INTEGER PRIMARY KEY, r REAL, t TEXT);

-- Fill test table with test data (raise the LIMIT for millions of rows)
INSERT INTO test (r, t)
WITH RECURSIVE cte(r, t) AS (
  SELECT random(), hex(random()*1000)
  UNION ALL
  SELECT random(), hex(random()*1000) FROM cte LIMIT 1000
)
SELECT * FROM cte;

-- Create auxiliary table and seed it (without this row, the UPDATEs
-- below would silently affect nothing)
CREATE TABLE count_statistics (
  table_name TEXT NOT NULL,
  row_count INTEGER
);
INSERT INTO count_statistics (table_name, row_count) VALUES ('test', 0);

CREATE TRIGGER count_statistics_delete_test BEFORE DELETE ON test
FOR EACH ROW BEGIN
  UPDATE count_statistics SET row_count = row_count - 1 WHERE table_name = 'test';
END;

CREATE TRIGGER count_statistics_insert_test BEFORE INSERT ON test
FOR EACH ROW BEGIN
  UPDATE count_statistics SET row_count = row_count + 1 WHERE table_name = 'test';
END;

-- initialize/update/reset count_statistics
UPDATE count_statistics
SET row_count = (SELECT count(*) FROM test)
WHERE table_name = 'test';

.timer on
SELECT max(id) FROM test;
SELECT count(*) FROM test;  -- slow!
-- This is the count replacement:
SELECT row_count FROM count_statistics WHERE table_name = 'test';  -- fast!
-- Done.

2015-01-25 20:05 GMT+01:00 Navin S Parakkal :
>
> On Saturday 24 January 2015 03:15 PM, Stefan Keller wrote:
>>
>> I think it's time for a serious simple benchmark with sqlite and say
>> PostgreSQL. PostgreSQL also had performance problems some time ago but
>> this has been resolved.
>> Can you describe the hp_table1 schema (CREATE TABLE statement...) and
>> some data (INSERTs)?
>>
>>
> Yes, though an avid fan of sqlite, we have to reconsider it for
> scalability issues. Everything about the table is present in the archives
> and thread.
>
> On Sunday 25 January 2015 06:17 AM, Eduardo Morras wrote:
>>
>> You can use PostgreSQL, using part of the Oracle licence cost for better
>> hardware and a dinner for the team.
>>
> Yes, that is an option we have to consider seriously. I thought we were
> doing something wrong like creating the index or maybe tuning the page
> size.
>
> I'm still stuck.
>
> I may be completely wrong, but I'm guessing your B-tree index is static --
> something in it doesn't support dynamic updates, or some bug is causing it
> to have linear update cost.
>
> http://pastebin.com/davqKcF8
>
> Reposting below in case you missed it.
>
> My process.csv is around 27G. I've gzipped it and put it at
> ftp://navinps:sqlit...@h2.usa.hp.com as process.csv.gz
>
> There is only 1 file there.
> md5sum process.csv.gz
> e77a322744a26d4c8a1ad4d61a84ee72 process.csv.gz
>
> [root@centosnavin sqlite-autoconf-3080801]# cat sqlite3commands.txt
> CREATE TABLE [hp_table1] ( InstanceId INTEGER, ... [schema snipped --
> identical to the statement in the message below] ... PRIMARY KEY
> (InstanceId, CollectionTime)) WITHOUT ROWID ;
>
> Also pasted with details at:
> http://pastebin.com/davqKcF8
>
> Thanks,
> Navin
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
On Saturday 24 January 2015 03:15 PM, Stefan Keller wrote:
> I think it's time for a serious simple benchmark with sqlite and say
> PostgreSQL. PostgreSQL also had performance problems some time ago but
> this has been resolved.
> Can you describe the hp_table1 schema (CREATE TABLE statement...) and
> some data (INSERTs)?

Yes, though an avid fan of sqlite, we have to reconsider it for scalability
issues. Everything about the table is present in the archives and thread.

On Sunday 25 January 2015 06:17 AM, Eduardo Morras wrote:
> You can use PostgreSQL, using part of the Oracle licence cost for better
> hardware and a dinner for the team.

Yes, that is an option we have to consider seriously. I thought we were
doing something wrong like creating the index or maybe tuning the page size.

I'm still stuck.

I may be completely wrong, but I'm guessing your B-tree index is static --
something in it doesn't support dynamic updates, or some bug is causing it
to have linear update cost.

http://pastebin.com/davqKcF8

Reposting below in case you missed it.

My process.csv is around 27G. I've gzipped it and put it at
ftp://navinps:sqlit...@h2.usa.hp.com as process.csv.gz

There is only 1 file there.
md5sum process.csv.gz e77a322744a26d4c8a1ad4d61a84ee72 process.csv.gz [root@centosnavin sqlite-autoconf-3080801]# cat sqlite3commands.txt CREATE TABLE [hp_table1] ( InstanceId INTEGER, LogTime INTEGER, ArrivalTime INTEGER, CollectionTime INTEGER, [dml_PROC_TIME] TIME, [dml_PROC_INTERVAL] INTEGER, [dml_PROC_INTEREST] TEXT, [dml_PROC_TOP_CPU_INDEX] INTEGER, [dml_PROC_TOP_DISK_INDEX] INTEGER, [dml_PROC_STATE_FLAG] INTEGER, [dml_PROC_RUN_TIME] REAL, [dml_PROC_STOP_REASON_FLAG] INTEGER, [dml_PROC_INTERVAL_ALIVE] INTEGER, [dml_PROC_STOP_REASON] TEXT, [dml_PROC_STATE] TEXT, [dml_PROC_PRI] INTEGER, [dml_PROC_NICE_PRI] INTEGER, [dml_PROC_CPU_LAST_USED] INTEGER, [dml_PROC_CPU_SWITCHES] INTEGER, [dml_PROC_IO_BYTE] REAL, [dml_PROC_VOLUNTARY_CSWITCH] INTEGER, [dml_PROC_FORCED_CSWITCH] INTEGER, [dml_PROC_IO_BYTE_RATE] REAL, [dml_PROC_CPU_TOTAL_UTIL] REAL, [dml_PROC_CPU_TOTAL_TIME] REAL, [dml_PROC_CPU_SYS_MODE_UTIL] REAL, [dml_PROC_CPU_SYS_MODE_TIME] REAL, [dml_PROC_CPU_USER_MODE_UTIL] REAL, [dml_PROC_CPU_USER_MODE_TIME] REAL, [dml_PROC_THREAD_COUNT] INTEGER, [dml_PROC_CPU_ALIVE_TOTAL_UTIL] REAL , [dml_PROC_CPU_ALIVE_USER_MODE_UTIL] REAL, [dml_PROC_CPU_ALIVE_SYS_MODE_UTIL] REAL, [dml_PROC_CPU_SYSCALL_UTIL] REAL, [dml_PROC_CPU_SYSCALL_TIME] REAL, [dml_PROC_CHILD_CPU_USER_MODE_UTIL] REAL, [dml_PROC_CHILD_CPU_SYS_MODE_UTIL] REAL, [dml_PROC_CHILD_CPU_TOTAL_UTIL] REAL, [dml_PROC_DISK_PHYS_READ] INTEGER, [dml_PROC_DISK_PHYS_READ_RATE] REAL, [dml_PROC_DISK_PHYS_WRITE] INTEGER, [dml_PROC_DISK_PHYS_WRITE_RATE] REAL, [dml_PROC_DISK_PHYS_IO_RATE] REAL, [dml_PROC_MEM_RES] REAL, [dml_PROC_MEM_SHARED_RES] REAL, [dml_PROC_MEM_VIRT] REAL, [dml_PROC_MEM_DATA_VIRT] REAL, [dml_PROC_MEM_STACK_VIRT] REAL, [dml_PROC_PAGEFAULT] INTEGER, [dml_PROC_PAGEFAULT_RATE] REAL, [dml_PROC_MINOR_FAULT] INTEGER, [dml_PROC_MAJOR_FAULT] INTEGER, [dml_PROC_MEM_LOCKED] REAL, [dml_PROC_DISK_SUBSYSTEM_WAIT_PCT] REAL, [dml_PROC_DISK_SUBSYSTEM_WAIT_TIME] REAL, [dml_PROC_PRI_WAIT_PCT] REAL, [dml_PROC_PRI_WAIT_TIME] REAL, 
PRIMARY KEY (InstanceId, CollectionTime)) WITHOUT ROWID ;

Also pasted with details at:
http://pastebin.com/davqKcF8

Thanks,
Navin
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
On Tue, 20 Jan 2015 12:12:00 + "Parakkal, Navin S (Software Engineer)" wrote:

> Hello,
>
> I've a few questions about sqlite3 and the database it creates. Actually
> I'm finding a lot of differences in performance.
>
> My story:
> I have this sqlite3 database called hp.db which has 100+ million records
> for table1. The size of hp.db on Linux x64 (CentOS 7) is about 16 GB.
> When I do a select count(*) on hp_table1 it takes more than 5 mins, which
> is quite a huge time. This file is static for now, i.e. we copied it from
> a production server for analysis. Now I create an index by create index
> nvnhpindex on hp_table1 (column1,column2). The primary key of the table
> hp_table1 is (column1,column2). It takes around some time (maybe 20
> minutes or less; I went for lunch and came back, really didn't note the
> time). Now I do select count(*) on hp_table1 and it takes around 15 secs.
> This is what we want -- our objective. We want it to be fast. The create
> index has increased the size of hp.db to 18 GB. This is OK with us and
> the customers.
>
> The problem is this is not a static database. We keep inserting data
> (insert rows every 10 secs or at least every minute) and occasionally
> delete rows (like once in 5 days). This is a 24x7 system.
>
> So to identify the problem, I created an empty_database similar to hp.db
> with no rows. I created an index on column1,column2 on empty_table1
> inside empty_database.
>
> Now I inserted the rows from csv (this CSV was created by .mode csv,
> .output myhp.csv, select * from hp_table1).
>
> The size of the database is around 18 GB (empty_database) with rows. Now
> I do a select count(*) on empty_table1 (actually it contains lots of
> rows, like 100M+ records) and it takes more than 5 mins. 5 mins is too
> much time for us to bear. The customer wants the information within a
> minute.
>
> Can you please help in resolving this problem? We are planning to deploy
> this across 1+ nodes on Linux x64 at one customer, and many other
> customers are going in the same direction.
>
> How do we go about resolving this, i.e. what should we do to create a
> table with sub-minute access for 100-500 million rows? How do we create
> the indexes? Any other performance tips?

Use a trigger on insert and a trigger on delete that modify a value in
another table holding the current count(*) number. The table can be
temporary if you want and stay in memory, but then you should do a count(*)
on application startup.

> Some say we should buy/use Oracle but I just am holding onto sqlite3
> assuming it would help me solve our problem.

You can use PostgreSQL, using part of the Oracle licence cost for better
hardware and a dinner for the team.

> Regards,
> Navin

---
Eduardo Morras
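[Editor's note] Eduardo's trigger-maintained counter can be sketched end-to-end with the stdlib sqlite3 module. The table and trigger names below are invented for the example; the counter is seeded with a real count(*) once (the "count(*) on application startup" step) and the triggers keep it in sync afterwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, n INTEGER);

CREATE TRIGGER events_ins AFTER INSERT ON events BEGIN
  UPDATE row_counts SET n = n + 1 WHERE table_name = 'events';
END;
CREATE TRIGGER events_del AFTER DELETE ON events BEGIN
  UPDATE row_counts SET n = n - 1 WHERE table_name = 'events';
END;
""")

# Seed the counter from a real count(*) once -- the startup step.
conn.execute(
    "INSERT INTO row_counts VALUES ('events', (SELECT count(*) FROM events))")

conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM events WHERE payload = 'a'")

# The O(1) counter lookup agrees with the full table scan.
fast = conn.execute(
    "SELECT n FROM row_counts WHERE table_name = 'events'").fetchone()[0]
slow = conn.execute("SELECT count(*) FROM events").fetchone()[0]
print(fast, slow)  # 2 2
```

The trade-off is exactly as described in the thread: every insert and delete now pays for one extra UPDATE, in exchange for a constant-time row count.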
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
Hi,

Relying on the sequence will not work (and is a wrong hack) since the use
case includes deleting rows explicitly.

I think it's time for a serious simple benchmark with sqlite and, say,
PostgreSQL. PostgreSQL also had performance problems some time ago but this
has been resolved.

Can you describe the hp_table1 schema (CREATE TABLE statement...) and some
data (INSERTs)?

Yours, S.

2015-01-24 10:33 GMT+01:00 Clemens Ladisch :
> Jim Wilcoxson wrote:
>> If you have a table where rows are inserted but never deleted, and you
>> have a rowid column, you can use this:
>>
>> select seq from sqlite_sequence where name = 'tablename'
>
> This works only for an AUTOINCREMENT column.
>
>> This will return instantly, without scanning any rows or indexes, and
>> is much faster than max(rowid) for huge tables.
>
> Max(rowid) has a special optimization and looks only at the last entry
> in the index. It is what SQLite uses internally for tables without
> AUTOINCREMENT, and is actually faster than looking up the sequence value
> in a separate table.
>
> Regards,
> Clemens
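[Editor's note] Stefan's objection is easy to reproduce with the stdlib sqlite3 module (the table here is hypothetical): after a delete, sqlite_sequence still reports the high-water mark, not the row count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute("DELETE FROM t WHERE v = 'a'")

# The sequence never decreases, so it over-counts once rows are deleted.
seq = conn.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = 't'").fetchone()[0]
n = conn.execute("SELECT count(*) FROM t").fetchone()[0]
print(seq, n)  # 3 2
```

This is why the sequence (and max(rowid)) shortcuts only apply to insert-only tables, as noted earlier in the thread.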
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
Jim Wilcoxson wrote:
> If you have a table where rows are inserted but never deleted, and you
> have a rowid column, you can use this:
>
> select seq from sqlite_sequence where name = 'tablename'

This works only for an AUTOINCREMENT column.

> This will return instantly, without scanning any rows or indexes, and
> is much faster than max(rowid) for huge tables.

Max(rowid) has a special optimization and looks only at the last entry
in the index. It is what SQLite uses internally for tables without
AUTOINCREMENT, and is actually faster than looking up the sequence value
in a separate table.

Regards,
Clemens
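[Editor's note] Both behaviours Clemens describes can be checked with the stdlib sqlite3 module. The tables t and s are hypothetical; s uses AUTOINCREMENT purely so that the sqlite_sequence system table exists for the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 't' has an ordinary rowid primary key; 's' uses AUTOINCREMENT.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("CREATE TABLE s (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [("a",), ("b",), ("c",)])
conn.executemany("INSERT INTO s (v) VALUES (?)", [("a",), ("b",), ("c",)])

# max(rowid) is answered from the last entry of the b-tree, with or
# without AUTOINCREMENT.
m = conn.execute("SELECT max(rowid) FROM t").fetchone()[0]

# sqlite_sequence only tracks AUTOINCREMENT tables: 's' has an entry,
# 't' does not.
s_seq = conn.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = 's'").fetchone()[0]
t_row = conn.execute(
    "SELECT seq FROM sqlite_sequence WHERE name = 't'").fetchone()
print(m, s_seq, t_row)  # 3 3 None
```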
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
If you have a table where rows are inserted but never deleted, and you have
a rowid column, you can use this:

select seq from sqlite_sequence where name = 'tablename'

This will return instantly, without scanning any rows or indexes, and is
much faster than max(rowid) for huge tables. If no rows have ever been
inserted, the query returns no row at all. If rows have been inserted, you
will get back the last rowid inserted.

Jim

Simon wrote:
> If this is a table for which rows are inserted but never deleted, then
> you will find that SELECT max(rowid) FROM hp_table1 returns the same
> value almost immediately. Perhaps value-1, but whatever it is it'll be
> consistent.

--
HashBackup: easy onsite and offsite Unix backup
http://www.hashbackup.com
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
On 20 Jan 2015, at 12:12pm, Parakkal, Navin S (Software Engineer) wrote:

> When I do a select count(*) on hp_table1 it takes more than 5 mins, which
> is quite a huge time.

If this is a table for which rows are inserted but never deleted, then you
will find that SELECT max(rowid) FROM hp_table1 returns the same value
almost immediately. Perhaps value-1, but whatever it is it'll be consistent.

> Now I create an index [snip]
> Now I do select count(*) on hp_table1, it takes around 15 secs.
>
> [...]
>
> The size of the database is around 18 GB (empty_database) with rows. Now
> I do a select count(*) on empty_table1 (actually it contains lots of rows,
> like 100M+ records) and it takes more than 5 mins. 5 mins is too much time
> for us to bear. The customer wants the information within a minute.

What information? The number of rows in a table? That's rarely important
information for a user. It looks more like the kind of information a
database admin would want.

The problem is that an unusual aspect of how SQLite works means that the
total number of rows in a table is not stored anywhere. To calculate
count(*) it has to look at every row in the table. It can do this by
scanning every row in the table and counting the entries, which is what it
did originally to take 5 minutes. But a full index on the table has the
same number of entries while involving less data, and SQLite knows that
counting the entries in the index you created would be faster. So once you
had created the index, to find count(*) it went through all the entries in
that index instead, which took it just 15 seconds.

> How do we go about resolving this, ie what should we do to create a table
> with sub-minute access for 100-500 million .

Access is not the problem here. SQLite can access any row in a table that
big in a few milliseconds. The problem here is the specific function of
counting every row, which is something SQLite does not do quickly.

Take a look at the 'max(rowid)' trick I showed above. If that's no good for
you, use TRIGGERs which add 1 for each INSERT and subtract 1 for each
DELETE to keep track of the number of rows. You would create another table
to keep the total counts in. This would decrease the time taken to return
the counts at the cost of increasing the time taken to insert and delete
rows.

Simon.
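[Editor's note] Simon's explanation of why the index sped up count(*) is visible in the query plan (stdlib sqlite3; the table and index names are invented for the example). With an index present, SQLite answers the count by scanning the smaller covering index rather than the table itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
conn.execute("CREATE INDEX t_ab ON t (a, b)")

# The plan shows the count being answered from the covering index t_ab.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT count(*) FROM t").fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The exact wording of the plan varies between SQLite versions, but it names the index (here t_ab) as a covering index, confirming the table's rows are never touched.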
Re: [sqlite] sqlite3 performance on select count very slow for 16 GB file
On 1/20/15, Parakkal, Navin S (Software Engineer) wrote:
> Hello,
>
> I've a few questions about sqlite3 and the database it creates. Actually
> I'm finding a lot of differences in performance.
>
> My story:
> I have this sqlite3 database called hp.db which has 100+ million records
> for table1. The size of hp.db on Linux x64 (CentOS 7) is about 16 GB.
> When I do a select count(*) on hp_table1 it takes more than 5 mins, which
> is quite a huge time.
> This file is static for now, i.e. we copied it from a production server
> for analysis.
> Now I create an index by create index nvnhpindex on hp_table1
> (column1,column2). The primary key of the table hp_table1 is
> (column1,column2).

Are you saying that you have a table like this:

CREATE TABLE hp_table1(
  column1 TEXT,
  column2 TEXT,
  other_columns MISC,
  PRIMARY KEY(column1,column2)
);

And then you are doing:

CREATE INDEX idx1 ON hp_table1(column1,column2);

Don't do that!!! The index is redundant. You already have a primary key on
those two columns. The primary key is sufficient. The index just makes your
database bigger and slower.

For a multi-column primary key, your best performance will (probably)
result if you say:

CREATE TABLE hp_table1(
  column1 TEXT,
  column2 TEXT,
  other_columns MISC,
  PRIMARY KEY(column1,column2)
) WITHOUT ROWID;

Note the use of WITHOUT ROWID at the end of the table declaration. Note
also the absence of any indexes. This will likely make a big size and
performance difference for you.

I say "probably" because there are cases where it might be better to omit
the WITHOUT ROWID -- specifically if "other_columns" contain very large
strings and/or blobs, larger than about 1/5th of your page size. You can
run experiments using and omitting WITHOUT ROWID to see which form works
best on your system.

> It takes around some time (maybe 20 minutes or less; I went for lunch
> and came back, really didn't note the time).
>
> Now I do select count(*) on hp_table1, it takes around 15 secs. This is
> what we want -- our objective. We want it to be fast. The create index
> has increased the size of hp.db to 18 GB. This is OK with us and the
> customers.
>
> The problem is this is not a static database. We keep inserting data
> (insert rows every 10 secs or at least every minute) and occasionally
> delete rows (like once in 5 days). This is a 24x7 system.
>
> So to identify the problem, I created an empty_database similar to hp.db
> with no rows.
> I created an index on column1,column2 on empty_table1 inside
> empty_database.
>
> Now I inserted the rows from csv (this CSV was created by .mode csv,
> .output myhp.csv, select * from hp_table1).
>
> The size of the database is around 18 GB (empty_database) with rows. Now
> I do a select count(*) on empty_table1 (actually it contains lots of
> rows, like 100M+ records) and it takes more than 5 mins. 5 mins is too
> much time for us to bear. The customer wants the information within a
> minute.
>
> Can you please help in resolving this problem? We are planning to deploy
> this across 1+ nodes on Linux x64 at one customer, and many other
> customers are going in the same direction.
>
> How do we go about resolving this, i.e. what should we do to create a
> table with sub-minute access for 100-500 million rows? How do we create
> the indexes? Any other performance tips?
>
> Some say we should buy/use Oracle but I just am holding onto sqlite3
> assuming it would help me solve our problem.
>
> Regards,
> Navin

--
D. Richard Hipp
d...@sqlite.org
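[Editor's note] The difference Dr. Hipp describes is visible in sqlite_master (stdlib sqlite3; the column names are shortened for this sketch): a rowid table needs a hidden sqlite_autoindex to enforce a composite primary key, while the WITHOUT ROWID form stores the rows in the key's b-tree itself and needs no index at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Rowid table: the composite PRIMARY KEY is enforced by a hidden index.
conn.execute(
    "CREATE TABLE hp_rowid (c1 INTEGER, c2 INTEGER, o TEXT,"
    " PRIMARY KEY (c1, c2))")

# WITHOUT ROWID table: the rows live in the primary-key b-tree itself.
conn.execute(
    "CREATE TABLE hp_wr (c1 INTEGER, c2 INTEGER, o TEXT,"
    " PRIMARY KEY (c1, c2)) WITHOUT ROWID")

def indexes(table):
    # List the index entries sqlite_master records for the given table.
    return [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='index' AND tbl_name=?",
        (table,))]

idx_rowid = indexes("hp_rowid")  # one hidden sqlite_autoindex entry
idx_wr = indexes("hp_wr")        # no index entries at all
print(idx_rowid, idx_wr)
```

That hidden index is what doubles the key storage on the rowid form, which is why adding yet another explicit index on (column1, column2), as in the original question, only makes the database bigger and slower.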