Re: [sqlite] Threading makes SQLite 3x slower??
So far I've sped it up by 2x-3x by increasing the cache size by 10x, removing some extraneous SQL, and moving a bit of processing in-app. But using the full dataset with two threads instead of one still results in a 3x-5x time increase with either THREADSAFE=1 or 2. The fastest is single-threaded with THREADSAFE=0. My prepared queries now look like this: SELECT class FROM data_r JOIN data USING (rowNum) WHERE ?1 col0min AND col0max ?2 AND ?3 col1min AND col1max ?4 AND ?5 col2min AND col2max ?6 AND ?7 col3min AND col3max ?8 AND ?9 col4min AND col4max ?10 AND ?11 col5 AND col5 ?12; Am I configuring locking right? I have 17 million queries like the above. Since they are all read-only, I shouldn't be taking such a hit. Every thread has it's own DB connection and prepared statements, so THREADSAFE=2 should work fine (as I understand it). Instead it destroys performance. Thanks, Seth On Aug 4, 2011, at 12:36 PM, Seth Price wrote: Removing COUNT(*) AS count and GROUP BY class and doing it in-program shaved ~10% off of the time. I'll keep it. :) ~Seth On Aug 4, 2011, at 11:30 AM, Eduardo Morras wrote: Oks, another let's try another thing/think. Try the select without the COUNT(*): SELECT class FROM data_r JOIN data USING (rowNum) WHERE 57 col0min AND col0max 61 AND 52 col1min AND col1max 56 AND 66 col2min AND col2max 70 AND 88 col3min AND col3max 92 AND 133 col4min AND col4max 137 AND 57 col0 AND col0 61 AND 52 col1 AND col1 56 AND 66 col2 AND col2 70 AND 88 col3 AND col3 92 AND 133 col4 AND col4 137 AND 81 col5 AND col5 85 GROUP BY class; In some rdbms (don' know in sqlite), count, avg, sum, etc... implies a table scan, making the select very slow. HTH ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
You didn't show your timing results or say what kind of machine you're running on. I'm running on a 2.26 GHz Mac Pro with 8 physical cores and 16 GB of RAM. SQL error (635): near ORDER: syntax error You're probably seeing that error because you need to recompile with -DSQLITE_ENABLE_UPDATE_DELETE_LIMIT. You would need to rebuild the parser, but I don't think it affects the results that we're looking at. So ignore the error. :) What I would do is one query and split the results of that query between threads. As is, the result from each query is ~1-3 rows that don't require much processing. A profiler says I spend about 99% of my time in SQLite. ~Seth On Aug 4, 2011, at 6:33 AM, Black, Michael (IS) wrote: You didn't show your timing results or say what kind of machine you're running on. I'm also seeing Calculating Subset Sample... SQL error (635): near ORDER: syntax error I'm running Linux, sqlite 3.7.5, E5520 2.27Ghz Intel with 16 threads possible. Here's my timings just on the first few rows of your timing output shows 2 threads is a touch better. 1 thread ( 0) Rows Proc: 5000 Avg: 0.00144 Elapsed: 7.20 ( 0) Rows Proc:1 Avg: 0.00104 Elapsed: 10.37 ( 0) Rows Proc:15000 Avg: 0.00098 Elapsed: 14.77 ( 0) Rows Proc:2 Avg: 0.00092 Elapsed: 18.36 ( 0) Rows Proc:25000 Avg: 0.00084 Elapsed: 21.08 2 threads ( 1) Rows Proc: 5000 Avg: 0.00103 Elapsed: 5.20 ( 1) Rows Proc:1 Avg: 0.00104 Elapsed: 10.45 ( 1) Rows Proc:15000 Avg: 0.00103 Elapsed: 15.53 ( 0) Rows Proc: 5000 Avg: 0.00419 Elapsed: 20.96 ( 1) Rows Proc:2 Avg: 0.00106 Elapsed: 21.23 3 threads ( 1) Rows Proc: 5000 Avg: 0.00536 Elapsed: 26.91 ( 2) Rows Proc: 5000 Avg: 0.00598 Elapsed: 30.05 ( 1) Rows Proc:1 Avg: 0.00535 Elapsed: 53.67 ( 2) Rows Proc:1 Avg: 0.00612 Elapsed: 61.36 ( 1) Rows Proc:15000 Avg: 0.00537 Elapsed: 80.75 What I would do is one query and split the results of that query between threads. You can use OpenMP to do that. Michael D. Black Senior Scientist NG Information Systems Advanced Analytics Directorate From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on behalf of Seth Price [s...@pricepages.org] Sent: Wednesday, August 03, 2011 9:07 PM To: sqlite-users@sqlite.org Subject: EXT :[sqlite] Threading makes SQLite 3x slower?? Hey all, I have a small DB bound application that I'm working on. I expect that it will take 10-20 days to run when I'm done with it. So I'm trying to make it multithreaded. But after spending all afternoon getting threading going, it runs on the order of 3x slower per query when I go from one to two threads. Is this expected? The full DB has around 17 million rows in it, and for each row I am trying to count all rows with similar characteristics, divided into different classifications. I was already able to improve speed 10x by using the R*tree extension to narrow my search. My queries look like this: SELECT class, COUNT(*) AS count FROM data_r JOIN data USING (rowNum) WHERE 57 col0min AND col0max 61 AND 52 col1min AND col1max 56 AND 66 col2min AND col2max 70 AND 88 col3min AND col3max 92 AND 133 col4min AND col4max 137 AND 57 col0 AND col0 61 AND 52 col1 AND col1 56 AND 66 col2 AND col2 70 AND 88 col3 AND col3 92 AND 133 col4 AND col4 137 AND 81 col5 AND col5 85 GROUP BY class; They take around 0.04 seconds each (times 17 million). I thought that I should be able to access the DB from each thread without penalty because they are read-only queries. I also tried making copies of the DB file to access a different file with each thread (also slower). Oddly, there are no frequent disk access while I run the program, so there must be caching somewhere. How can I make threading work? Download the example set and code here: http://seth.bluezone.usu.edu/sqlite/ There is info on how I compile and run the program in the header. Use NUM_THREADS to change the number. The stats on which thread is taking how much time is printed under Pruning Conflicting Examples... and the threading code is under pruneEx(ExPtr ex). I'm on a Mac Pro running 10.6.8. Thoughts? Comments? Ideas? Thanks, Seth ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
On Aug 4, 2011, at 1:54 AM, Eduardo Morras wrote: They block each other not trying to get a lock, but trying to get access to disk and cache. Disk access time and cache is shared between all threads and if all threads needs access to different parts of the db they will figth like barbarians, you should convert them to roman cohorts instead. You can make your page cache size bigger using pragmas, check maillist and documentation. Using a 10x larger temp cache and default cache at compile time increases my runtime of that section from 166 to 167. I'm using two threads still. You don't need to update your table, you can use a temp memory table for that with candidates. First select candidates where 57col0min, then you delete candidates from temp table where col0min62 and so on. You don't need to copy the original table schema for temp table, use pk only. The select of the first candidates is important, use the one that minimizes the number of initial data. I was hoping that SQLite was doing that with the JOIN statement. It's narrowing down the results with the query from the R*table (col*min col*max). A large query will have ~14k rows at this point. Then JOINs with the original table (data) and it's narrowing results further with the remaining part of the WHERE (col*). The large query will have ~11k rows. After this it groups by the 'class' column in 'data' and counts how many rows per class. I'm really skeptical that I can chop this up into many queries to make it run faster. I've always assumed that if I can get final data back from a query, then it's exactly the query I'm looking for. I would think that running three queries would use three times the resources. Thanks, Seth ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
With SQLITE_THREADSAFE=2 with SQLITE_THREADSAFE=1 With 2 threads and THREADSAFE=2 I get 167 seconds. With 2 threads and THREADSAFE=1, I get 177 seconds. With 1 thread and THREADSAFE=1, I get 53 seconds. With 1 thread and THREADSAFE=2, I get 52 seconds. One thing I'm worried about is I have two differing definitions of SQLITE_THREADSAFE when I compile via the command line with -DSQLITE_THREADSAFE=2, so I'm worried that one is clobbering the other. Here is a clip from the compile log, you can see *both* THREADSAFE=2 and THREADSAFE=1: gcc -DPACKAGE_NAME=\sqlite\ -DPACKAGE_TARNAME=\sqlite\ -DPACKAGE_VERSION=\3.7.7.1\ -DPACKAGE_STRING=\sqlite 3.7.7.1\ -DPACKAGE_BUGREPORT=\http://www.sqlite.org\; -DPACKAGE_URL=\\ -DPACKAGE=\sqlite\ -DVERSION=\3.7.7.1\ -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DHAVE_FDATASYNC=1 -DHAVE_USLEEP=1 -DHAVE_LOCALTIME_R=1 -DHAVE_GMTIME_R=1 -DHAVE_DECL_STRERROR_R=1 -DHAVE_STRERROR_R=1 -DHAVE_READLINE=1 -I.@am__isrc@ -D_REENTRANT=1 -DSQLITE_THREADSAFE=1 -DSQLITE_ENABLE_FTS3 -DSQLITE_ENABLE_RTREE -O2 -DSQLITE_ENABLE_UPDATE_DELETE_LIMIT -DSQLITE_ENABLE_RTREE -DSQLITE_THREADSAFE=2 -DSQLITE_OPEN_NOMUTEX -DSQLITE_ENABLE_LOCKING_STYLE -DSQLITE_DEFAULT_TEMP_CACHE_SIZE=5000 -DSQLITE_DEFAULT_CACHE_SIZE=2 -MT sqlite3.lo -MD -MP -MF .deps/sqlite3.Tpo -c sqlite3.c -fno-common -DPIC -o .libs/sqlite3.o command-line: warning: SQLITE_THREADSAFE redefined command-line: warning: this is the location of the previous definition The way you're splitting your work is bad. You can see that your row counts for 1 thread are 3 and for 2 threads 45,000 so you're not gaining a lot. 3 threads is 90,000 so you've tripled your workload totally defeating multi-threading. When I run with 1 thread, it queries all 99k examples. When I use two threads, each thread query 45k of the examples. Three threads query 33k of the examples each. The total should always be 99k examples, which is what I'm seeing on my machine. Thanks, Seth On Aug 4, 2011, at 6:36 AM, Black, Michael (IS) wrote: Also.. . Your elapsed time is using clock() which tells you processor time. With SQLITE_THREADSAFE=2 59.855 - 1 thread 49.535 - 2 threads 92.789 - 3 threads with SQLITE_THREADSAFE=1 61.146 - 1 thread 49.568 - 2 threads 64.932 - 3 threads The way you're splitting your work is bad. You can see that your row counts for 1 thread are 3 and for 2 threads 45,000 so you're not gaining a lot. 3 threads is 90,000 so you've tripled your workload totally defeating multi-threading. Michael D. Black Senior Scientist NG Information Systems Advanced Analytics Directorate From: sqlite-users-boun...@sqlite.org [sqlite-users-boun...@sqlite.org] on behalf of Seth Price [s...@pricepages.org] Sent: Wednesday, August 03, 2011 9:07 PM To: sqlite-users@sqlite.org Subject: EXT :[sqlite] Threading makes SQLite 3x slower?? Hey all, I have a small DB bound application that I'm working on. I expect that it will take 10-20 days to run when I'm done with it. So I'm trying to make it multithreaded. But after spending all afternoon getting threading going, it runs on the order of 3x slower per query when I go from one to two threads. Is this expected? The full DB has around 17 million rows in it, and for each row I am trying to count all rows with similar characteristics, divided into different classifications. I was already able to improve speed 10x by using the R*tree extension to narrow my search. My queries look like this: SELECT class, COUNT(*) AS count FROM data_r JOIN data USING (rowNum) WHERE 57 col0min AND col0max 61 AND 52 col1min AND col1max 56 AND 66 col2min AND col2max 70 AND 88 col3min AND col3max 92 AND 133 col4min AND col4max 137 AND 57 col0 AND col0 61 AND 52 col1 AND col1 56 AND 66 col2 AND col2 70 AND 88 col3 AND col3 92 AND 133 col4 AND col4 137 AND 81 col5 AND col5 85 GROUP BY class; They take around 0.04 seconds each (times 17 million). I thought that I should be able to access the DB from each thread without penalty because they are read-only queries. I also tried making copies of the DB file to access a different file with each thread (also slower). Oddly, there are no frequent disk access while I run the program, so there must be caching somewhere. How can I make threading work? Download the example set and code here: http://seth.bluezone.usu.edu/sqlite/ There is info on how I compile and run the program in the header. Use NUM_THREADS to change the number. The stats on which thread is taking how much time is printed under Pruning Conflicting Examples... and the threading code is under pruneEx(ExPtr ex). I'm on a Mac Pro running 10.6.8
Re: [sqlite] Threading makes SQLite 3x slower??
Any threads or processes would still be in contention for the same resource: access to that file on disk. Midway through the run, `iostat` is showing me exactly 0 disk accesses, so I'm assuming it's a locking problem and not a physical I/O bottleneck. I'm hoping that you folks could give me an idea of what I'm doing wrong. (It could also be a problem with the Mac OS locking.) The best indexes to create will depend on whether this is a one-time job or whether you're going to have to do it repeatedly with similar or completely different data. For each row, I'm making a query for similar rows. The full DB is 17 million rows. I've tried adding all the indexes I can think of, but the only one that helps this query is the R*tree index. They were all simple indices on one or more columns, so if you have an idea on a more complex index, I'd apply it and test it out. ~Seth On Aug 4, 2011, at 9:54 AM, Simon Slavin wrote: On 4 Aug 2011, at 4:44pm, Seth Price wrote: I was hoping that SQLite was doing that with the JOIN statement. It's narrowing down the results with the query from the R*table (col*min col*max). A large query will have ~14k rows at this point. Then JOINs with the original table (data) and it's narrowing results further with the remaining part of the WHERE (col*). The large query will have ~11k rows. After this it groups by the 'class' column in 'data' and counts how many rows per class. I'm really skeptical that I can chop this up into many queries to make it run faster. I've always assumed that if I can get final data back from a query, then it's exactly the query I'm looking for. I would think that running three queries would use three times the resources. Threading is a red herring. Any threads or processes would still be in contention for the same resource: access to that file on disk. This is what caused the observation in your 'Subject' header. All the threads are doing is waiting for another thread to get out the way. Speed increases can come from adding one or more indexes. The best indexes to create will depend on whether this is a one-time job or whether you're going to have to do it repeatedly with similar or completely different data. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
I tried one index on all columns originally, but it didn't help nearly as much as the R*tree. ~Seth On Aug 4, 2011, at 10:26 AM, Simon Slavin wrote: On 4 Aug 2011, at 5:13pm, Seth Price wrote: They were all simple indices on one or more columns, so if you have an idea on a more complex index, I'd apply it and test it out. That's not going to be much use, is it ? A single SELECT which tests all those different columns can only use one index, and if it chooses any of those indexes it will help only with one column. Seems like you need one index on all the columns mentioned in your SELECT. But save yourself some time and effort and try one with just five of the columns first. See if that speeds things up. Delete all the indexes on just one column. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
Removing COUNT(*) AS count and GROUP BY class and doing it in-program shaved ~10% off of the time. I'll keep it. :) ~Seth On Aug 4, 2011, at 11:30 AM, Eduardo Morras wrote: Oks, another let's try another thing/think. Try the select without the COUNT(*): SELECT class FROM data_r JOIN data USING (rowNum) WHERE 57 col0min AND col0max 61 AND 52 col1min AND col1max 56 AND 66 col2min AND col2max 70 AND 88 col3min AND col3max 92 AND 133 col4min AND col4max 137 AND 57 col0 AND col0 61 AND 52 col1 AND col1 56 AND 66 col2 AND col2 70 AND 88 col3 AND col3 92 AND 133 col4 AND col4 137 AND 81 col5 AND col5 85 GROUP BY class; In some rdbms (don' know in sqlite), count, avg, sum, etc... implies a table scan, making the select very slow. HTH ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
Those fields span two tables. ~Seth via iPhone On Aug 4, 2011, at 2:45 PM, Simon Slavin slav...@bigfraud.org wrote: On 4 Aug 2011, at 8:40pm, Seth Price wrote: SELECT class FROM data_r JOIN data USING (rowNum) WHERE ?1 col0min AND col0max ?2 AND ?3 col1min AND col1max ?4 AND ?5 col2min AND col2max ?6 AND ?7 col3min AND col3max ?8 AND ?9 col4min AND col4max ?10 AND ?11 col5 AND col5 ?12; Create one index on all these fields: (col0max,col1min,col1max,col2min,col2max,col3min,col3max,col4min,col4max,col5) Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
The table with the first five columns is the R*tree table, and thus already has an index. I just tried adding an index to col5 also, but no change in performance. ~Seth On Aug 4, 2011, at 3:27 PM, Simon Slavin wrote: On 4 Aug 2011, at 9:59pm, Seth Price wrote: Those fields span two tables. Then create one index for each table. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Threading makes SQLite 3x slower??
Hey all, I have a small DB bound application that I'm working on. I expect that it will take 10-20 days to run when I'm done with it. So I'm trying to make it multithreaded. But after spending all afternoon getting threading going, it runs on the order of 3x slower per query when I go from one to two threads. Is this expected? The full DB has around 17 million rows in it, and for each row I am trying to count all rows with similar characteristics, divided into different classifications. I was already able to improve speed 10x by using the R*tree extension to narrow my search. My queries look like this: SELECT class, COUNT(*) AS count FROM data_r JOIN data USING (rowNum) WHERE 57 col0min AND col0max 61 AND 52 col1min AND col1max 56 AND 66 col2min AND col2max 70 AND 88 col3min AND col3max 92 AND 133 col4min AND col4max 137 AND 57 col0 AND col0 61 AND 52 col1 AND col1 56 AND 66 col2 AND col2 70 AND 88 col3 AND col3 92 AND 133 col4 AND col4 137 AND 81 col5 AND col5 85 GROUP BY class; They take around 0.04 seconds each (times 17 million). I thought that I should be able to access the DB from each thread without penalty because they are read-only queries. I also tried making copies of the DB file to access a different file with each thread (also slower). Oddly, there are no frequent disk access while I run the program, so there must be caching somewhere. How can I make threading work? Download the example set and code here: http://seth.bluezone.usu.edu/sqlite/ There is info on how I compile and run the program in the header. Use NUM_THREADS to change the number. The stats on which thread is taking how much time is printed under Pruning Conflicting Examples... and the threading code is under pruneEx(ExPtr ex). I'm on a Mac Pro running 10.6.8. Thoughts? Comments? Ideas? Thanks, Seth ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Threading makes SQLite 3x slower??
I was hoping they wouldn't block each other because it's a read lock. I tried making an index on all the columns, but R*tree table + JOIN that I'm using runs about 10x faster. I might have done something wrong, so I'm open to suggestions on a better index, though. I don't think that UPDATEing 99.99% of 17 million columns is going to run faster than the pure select statement that I have. Each select statement is fairly random, a small selection out of an N-dimensional space. Am I understanding you? Any other ideas? ~Seth On Aug 3, 2011, at 8:43 PM, Simon Slavin wrote: On 4 Aug 2011, at 3:07am, Seth Price wrote: The full DB has around 17 million rows in it, and for each row I am trying to count all rows with similar characteristics, divided into different classifications. I was already able to improve speed 10x by using the R*tree extension to narrow my search. My queries look like this: SELECT class, COUNT(*) AS count FROM data_r JOIN data USING (rowNum) WHERE 57 col0min AND col0max 61 AND 52 col1min AND col1max 56 AND 66 col2min AND col2max 70 AND 88 col3min AND col3max 92 AND 133 col4min AND col4max 137 AND 57 col0 AND col0 61 AND 52 col1 AND col1 56 AND 66 col2 AND col2 70 AND 88 col3 AND col3 92 AND 133 col4 AND col4 137 AND 81 col5 AND col5 85 GROUP BY class; Forget multithreading. You have a bottleneck of accessing the database file and the threads will just block each other. There're two ways to make that SELECT run quickly. One is to create one index on all those columns. The other is to do the SELECT in parts, by disqualifying records you don't want. Create another column called, perhaps, notThisOne. Then do a succession of disqualifying checks: UPDATE data SET notThisOne=1 WHERE col0min=57 OR col0min=61; etc. Then all the rows you didn't disqualify are the ones you want. Obviously, separate indices on each of those columns will speed up the disqualifying check. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users