Re: [sqlite] Improving performance of GROUP BY

2012-06-17 Thread Udi Karni
Thanks. There are no indexes of any kind because I would need many and they
take very long to create.

I am experimenting with various GROUP-BY scenarios like the following where
all the work is done in TEMP.

When TEMP is on disk - it finishes but is slower - when it's in memory -
it's faster but I sometime get an "out of memory" error.

I didn't completely understand the ANALYZE output. Is there a way to get it
to estimate how many MB/GB of TEMP a SQL statement would need to
successfully complete?

  CREATE TABLE X
  AS
  SELECT
 COL1,
 COL2,
 COUNT (*) CNT
FROM
 300K_ROW_TABLE X,
 100M_ROW_TABLE Y
   WHERE
 Y.MM BETWEEN X.MM_START  AND  X.MM_END
GROUP BY
 COL1,
 COL2
  HAVING
 COUNT (*) = 24
/


On Tue, May 29, 2012 at 3:04 PM, Nico Williams wrote:

> On Fri, May 25, 2012 at 1:38 PM, Udi Karni  wrote:
> > I am running the following query -
> >
> > CREATE TABLE XYZ AS
> > SELECT ID, MIN (DATE)
> > FROM SOURCE-TABLE
> > WHERE CRITERIA
> > GROUP BY ID ;
> >
> > SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and
> > get grouped to 100,000 distinct IDs. There are no indexes as there is too
> > much variety in the WHERE clauses and index creation on this table takes
> > many hours. Temp_store=memory. When executed - the source table is read
> > very slowly, and runtime is 1/2 hour.
> >
> > I split this SQL into 2 steps - the first just extracting - the second
> just
> > grouping. The extract read the table extremely fast and finished in 3
> > minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which
> is
> > pretty incredible speed.
>
> Meaning?  Something like this:
>
> CREATE TABLE FOO AS SELECT ID, DATE FROM SOURCE_TABLE;
> CREATE TABLE BAR AS SELECT ID, MIN(DATE) FROM FOO GROUP BY ID;
>
> ?
>
> Are you sure that there are no indexes?  Maybe there are indexes
> implied by UNIQUE constraints and maybe SQLite3 is incorrectly
> deciding to scan the index instead of the table?
>
> My guess is that you have a UNIQUE constraint on ID in SOURCE_TABLE
> and SQLite3 is scanning the index and then fetching the DATE from the
> table.  But this makes no sense because then it makes no sense to
> select the min() of DATE as there would be a single row per-ID.  My
> guess has to be wrong.  Maybe there's a UNIQUE constraint on several
> columns including ID?  That would not be incompatible with the
> semantics of your query.
>
> Can you post EXPLAIN QUERY PLAN output for your statements?  And/or your
> schema.
>
> Nico
> --
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Improving performance of GROUP BY

2012-05-29 Thread Simon Slavin

On 29 May 2012, at 11:04pm, Nico Williams  wrote:

> Can you post EXPLAIN QUERY PLAN output for your statements?  And/or your 
> schema.

Also perform the SQL command 'ANALYZE' on that database, then try your timings 
again.  See if the timings improve (or perhaps even get worse).

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Improving performance of GROUP BY

2012-05-29 Thread Nico Williams
On Fri, May 25, 2012 at 1:38 PM, Udi Karni  wrote:
> I am running the following query -
>
> CREATE TABLE XYZ AS
> SELECT ID, MIN (DATE)
> FROM SOURCE-TABLE
> WHERE CRITERIA
> GROUP BY ID ;
>
> SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and
> get grouped to 100,000 distinct IDs. There are no indexes as there is too
> much variety in the WHERE clauses and index creation on this table takes
> many hours. Temp_store=memory. When executed - the source table is read
> very slowly, and runtime is 1/2 hour.
>
> I split this SQL into 2 steps - the first just extracting - the second just
> grouping. The extract read the table extremely fast and finished in 3
> minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which is
> pretty incredible speed.

Meaning?  Something like this:

CREATE TABLE FOO AS SELECT ID, DATE FROM SOURCE_TABLE;
CREATE TABLE BAR AS SELECT ID, MIN(DATE) FROM FOO GROUP BY ID;

?

Are you sure that there are no indexes?  Maybe there are indexes
implied by UNIQUE constraints and maybe SQLite3 is incorrectly
deciding to scan the index instead of the table?

My guess is that you have a UNIQUE constraint on ID in SOURCE_TABLE
and SQLite3 is scanning the index and then fetching the DATE from the
table.  But this makes no sense because then it makes no sense to
select the min() of DATE as there would be a single row per-ID.  My
guess has to be wrong.  Maybe there's a UNIQUE constraint on several
columns including ID?  That would not be incompatible with the
semantics of your query.

Can you post EXPLAIN QUERY PLAN output for your statements?  And/or your schema.

Nico
--
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Improving performance of GROUP BY

2012-05-26 Thread Simon Slavin

On 26 May 2012, at 6:25pm, Udi Karni  wrote:

> In any event - breaking the SQL up in 2 is still at least twice as better.

I do see huge SQL commands on this list sometimes: great big commands of 
hundreds of characters which could be done simpler and faster if split into two 
small SQL commands.

> Where SQlite seems to run superfast all the time - is :memory: tables - if
> you have the RAM. Maybe that's the real direction with large DBs. A giant
> RAMDisk ?

Solid State Drives.  Those things of Flash memory you can use instead of a 
rotating hard disk drive.  They're getting very cheap.

A lot of the time taken for hard disk access is spent waiting for the disk to 
rotate to the right place so the read/write heads can get at the sector you 
want.  SSDs never have to wait.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Improving performance of GROUP BY

2012-05-26 Thread Udi Karni
It certainly seems to improve things. I ran multiple tests because over
time seems like portions of the data get cached and there is a difference
between an initial run and subsequent runs. I almost wish simple
table-scans would always go exclusively to disk - I am not sure the caching
always helps.

In any event - breaking the SQL up in 2 is still at least twice as better.

Where SQlite seems to run superfast all the time - is :memory: tables - if
you have the RAM. Maybe that's the real direction with large DBs. A giant
RAMDisk ?
On Fri, May 25, 2012 at 12:10 PM, Simon Slavin  wrote:

>
> On 25 May 2012, at 7:38pm, Udi Karni  wrote:
>
> > I am running the following query -
> >
> > CREATE TABLE XYZ AS
> > SELECT ID, MIN (DATE)
> > FROM SOURCE-TABLE
> > WHERE CRITERIA
> > GROUP BY ID ;
> >
> > SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and
> > get grouped to 100,000 distinct IDs. There are no indexes as there is too
> > much variety in the WHERE clauses and index creation on this table takes
> > many hours. Temp_store=memory. When executed - the source table is read
> > very slowly, and runtime is 1/2 hour.
> >
> > I split this SQL into 2 steps - the first just extracting - the second
> just
> > grouping. The extract read the table extremely fast and finished in 3
> > minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which
> is
> > pretty incredible speed.
>
> Just for testing purposes, change MIN(DATE) to DATE and check the timings.
>
> Simon.
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Improving performance of GROUP BY

2012-05-25 Thread Simon Slavin

On 25 May 2012, at 7:38pm, Udi Karni  wrote:

> I am running the following query -
> 
> CREATE TABLE XYZ AS
> SELECT ID, MIN (DATE)
> FROM SOURCE-TABLE
> WHERE CRITERIA
> GROUP BY ID ;
> 
> SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and
> get grouped to 100,000 distinct IDs. There are no indexes as there is too
> much variety in the WHERE clauses and index creation on this table takes
> many hours. Temp_store=memory. When executed - the source table is read
> very slowly, and runtime is 1/2 hour.
> 
> I split this SQL into 2 steps - the first just extracting - the second just
> grouping. The extract read the table extremely fast and finished in 3
> minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which is
> pretty incredible speed.

Just for testing purposes, change MIN(DATE) to DATE and check the timings.

Simon.
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Improving performance of GROUP BY

2012-05-25 Thread Udi Karni
I am running the following query -

CREATE TABLE XYZ AS
SELECT ID, MIN (DATE)
FROM SOURCE-TABLE
WHERE CRITERIA
GROUP BY ID ;

SOURCE-TABLE has 600 million rows, 2 million meet the WHERE criteria, and
get grouped to 100,000 distinct IDs. There are no indexes as there is too
much variety in the WHERE clauses and index creation on this table takes
many hours. Temp_store=memory. When executed - the source table is read
very slowly, and runtime is 1/2 hour.

I split this SQL into 2 steps - the first just extracting - the second just
grouping. The extract read the table extremely fast and finished in 3
minutes. The GROUP BY ran in a few seconds. So a 10X improvement - which is
pretty incredible speed.

Why such a large disparity? Why wouldn't the single-step SQL read fast -
write the accepted values to TEMP - then sort and group - same as the
2-step?

Any way to influence the SQlite SQL processor to chose a more optimal
single-step approach?

Thanks !
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users