Re: [sqlite] Improving performance of GROUP BY
Thanks. There are no indexes of any kind because I would need many and they take very long to create. I am experimenting with various GROUP-BY scenarios like the following where all the work is done in TEMP. When TEMP is on disk - it finishes but is slower - when it's in memory - it's faster but I sometime get an "out of memory" error. I didn't completely understand the ANALYZE output. Is there a way to get it to estimate how many MB/GB of TEMP a SQL statement would need to successfully complete? CREATE TABLE X AS SELECT COL1, COL2, COUNT (*) CNT FROM 300K_ROW_TABLE X, 100M_ROW_TABLE Y WHERE Y.MM BETWEEN X.MM_START AND X.MM_END GROUP BY COL1, COL2 HAVING COUNT (*) = 24 / On Tue, May 29, 2012 at 3:04 PM, Nico Williams wrote: > On Fri, May 25, 2012 at 1:38 PM, Udi Karni wrote: > > I am running the following query - > > > > CREATE TABLE XYZ AS > > SELECT ID, MIN (DATE) > > FROM SOURCE-TABLE > > WHERE CRITERIA > > GROUP BY ID ; > > > > SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and > > get grouped to 100,000 distinct IDs. There are no indexes as there is too > > much variety in the WHERE clauses and index creation on this table takes > > many hours. Temp_store=memory. When executed - the source table is read > > very slowly, and runtime is 1/2 hour. > > > > I split this SQL into 2 steps - the first just extracting - the second > just > > grouping. The extract read the table extremely fast and finished in 3 > > minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which > is > > pretty incredible speed. > > Meaning? Something like this: > > CREATE TABLE FOO AS SELECT ID, DATE FROM SOURCE_TABLE; > CREATE TABLE BAR AS SELECT ID, MIN(DATE) FROM FOO GROUP BY ID; > > ? > > Are you sure that there are no indexes? Maybe there are indexes > implied by UNIQUE constraints and maybe SQLite3 is incorrectly > deciding to scan the index instead of the table? > > My guess is that you have a UNIQUE constraint on ID in SOURCE_TABLE > and SQLite3 is scanning the index and then fetching the DATE from the > table. But this makes no sense because then it makes no sense to > select the min() of DATE as there would be a single row per-ID. My > guess has to be wrong. Maybe there's a UNIQUE constraint on several > columns including ID? That would not be incompatible with the > semantics of your query. > > Can you post EXPLAIN QUERY PLAN output for your statements? And/or your > schema. > > Nico > -- > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Improving performance of GROUP BY
On 29 May 2012, at 11:04pm, Nico Williams wrote: > Can you post EXPLAIN QUERY PLAN output for your statements? And/or your > schema. Also perform the SQL command 'ANALYZE' on that database, then try your timings again. See if the timings improve (or perhaps even get worse). Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Improving performance of GROUP BY
On Fri, May 25, 2012 at 1:38 PM, Udi Karni wrote: > I am running the following query - > > CREATE TABLE XYZ AS > SELECT ID, MIN (DATE) > FROM SOURCE-TABLE > WHERE CRITERIA > GROUP BY ID ; > > SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and > get grouped to 100,000 distinct IDs. There are no indexes as there is too > much variety in the WHERE clauses and index creation on this table takes > many hours. Temp_store=memory. When executed - the source table is read > very slowly, and runtime is 1/2 hour. > > I split this SQL into 2 steps - the first just extracting - the second just > grouping. The extract read the table extremely fast and finished in 3 > minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which is > pretty incredible speed. Meaning? Something like this: CREATE TABLE FOO AS SELECT ID, DATE FROM SOURCE_TABLE; CREATE TABLE BAR AS SELECT ID, MIN(DATE) FROM FOO GROUP BY ID; ? Are you sure that there are no indexes? Maybe there are indexes implied by UNIQUE constraints and maybe SQLite3 is incorrectly deciding to scan the index instead of the table? My guess is that you have a UNIQUE constraint on ID in SOURCE_TABLE and SQLite3 is scanning the index and then fetching the DATE from the table. But this makes no sense because then it makes no sense to select the min() of DATE as there would be a single row per-ID. My guess has to be wrong. Maybe there's a UNIQUE constraint on several columns including ID? That would not be incompatible with the semantics of your query. Can you post EXPLAIN QUERY PLAN output for your statements? And/or your schema. Nico -- ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Improving performance of GROUP BY
On 26 May 2012, at 6:25pm, Udi Karni wrote: > In any event - breaking the SQL up in 2 is still at least twice as better. I do see huge SQL commands on this list sometimes: great big commands of hundreds of characters which could be done simpler and faster if split into two small SQL commands. > Where SQlite seems to run superfast all the time - is :memory: tables - if > you have the RAM. Maybe that's the real direction with large DBs. A giant > RAMDisk ? Solid State Drives. Those things of Flash memory you can use instead of a rotating hard disk drive. They're getting very cheap. A lot of the time taken for hard disk access is spent waiting for the disk to rotate to the right place so the read/write heads can get at the sector you want. SSDs never have to wait. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Improving performance of GROUP BY
It certainly seems to improve things. I ran multiple tests because over time seems like portions of the data get cached and there is a difference between an initial run and subsequent runs. I almost wish simple table-scans would always go exclusively to disk - I am not sure the caching always helps. In any event - breaking the SQL up in 2 is still at least twice as better. Where SQlite seems to run superfast all the time - is :memory: tables - if you have the RAM. Maybe that's the real direction with large DBs. A giant RAMDisk ? On Fri, May 25, 2012 at 12:10 PM, Simon Slavin wrote: > > On 25 May 2012, at 7:38pm, Udi Karni wrote: > > > I am running the following query - > > > > CREATE TABLE XYZ AS > > SELECT ID, MIN (DATE) > > FROM SOURCE-TABLE > > WHERE CRITERIA > > GROUP BY ID ; > > > > SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and > > get grouped to 100,000 distinct IDs. There are no indexes as there is too > > much variety in the WHERE clauses and index creation on this table takes > > many hours. Temp_store=memory. When executed - the source table is read > > very slowly, and runtime is 1/2 hour. > > > > I split this SQL into 2 steps - the first just extracting - the second > just > > grouping. The extract read the table extremely fast and finished in 3 > > minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which > is > > pretty incredible speed. > > Just for testing purposes, change MIN(DATE) to DATE and check the timings. > > Simon. > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] Improving performance of GROUP BY
On 25 May 2012, at 7:38pm, Udi Karni wrote: > I am running the following query - > > CREATE TABLE XYZ AS > SELECT ID, MIN (DATE) > FROM SOURCE-TABLE > WHERE CRITERIA > GROUP BY ID ; > > SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and > get grouped to 100,000 distinct IDs. There are no indexes as there is too > much variety in the WHERE clauses and index creation on this table takes > many hours. Temp_store=memory. When executed - the source table is read > very slowly, and runtime is 1/2 hour. > > I split this SQL into 2 steps - the first just extracting - the second just > grouping. The extract read the table extremely fast and finished in 3 > minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which is > pretty incredible speed. Just for testing purposes, change MIN(DATE) to DATE and check the timings. Simon. ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] Improving performance of GROUP BY
I am running the following query - CREATE TABLE XYZ AS SELECT ID, MIN (DATE) FROM SOURCE-TABLE WHERE CRITERIA GROUP BY ID ; SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and get grouped to 100,000 distinct IDs. There are no indexes as there is too much variety in the WHERE clauses and index creation on this table takes many hours. Temp_store=memory. When executed - the source table is read very slowly, and runtime is 1/2 hour. I split this SQL into 2 steps - the first just extracting - the second just grouping. The extract read the table extremely fast and finished in 3 minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which is pretty incredible speed. Why such a large disparity? Why wouldn't the single-step SQL read fast - write the accepted values to TEMP - then sort and group - same as the 2-step? Any way to influence the SQlite SQL processor to chose a more optimal single-step approach? Thanks ! ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users