Re: [sqlite] PATCH: WHERE clause OR to UNION optimization

2007-12-22 Thread drh
Joe Wilson [EMAIL PROTECTED] wrote:
 The attached patch implements the WHERE clause OR to UNION 
 optimization as described in this post:
 

I just went thumbing through the firesafe and I do not think
I have a copyright release on file for you, Joe.  Please go
print out a copy of one of

   http://www.sqlite.org/copyright-release.html
   http://www.sqlite.org/copyright-release.pdf

Sign it, get your employer to sign it, then send it to me
by post to the address given at the bottom of

   http://www.sqlite.org/copyright.html

Thanks.

--
D. Richard Hipp [EMAIL PROTECTED]


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] PATCH: WHERE clause OR to UNION optimization

2007-12-22 Thread Joe Wilson
This new attached patch corrects a slight inefficiency in my 
previous patch. It now uses UNION ALL (TK_ALL) instead of UNION
(TK_UNION) for the subselects, as was my initial intention. 
OR queries on dis-similar columns with large intermediate result
sets are now even faster.

Given the same sample database below and running this query:

 select count(*) from stuff 
 where b=1 or b=5 or c=2 or d=4 or b=9;

will yield these timings:

 sqlite 3.5.4 unpatched: CPU Time: user 3.560223 sys 0.156010
 3.5.4 + 20071221 patch: CPU Time: user 1.252079 sys 0.072004
 3.5.4 + 20071222 patch: CPU Time: user 0.776048 sys 0.116007

I also changed the OR cost estimator to reduce the cost for each 
additional OR expression pair, as it better reflects the actual
cost.

Here's the meta-diff between yesterday's patch and the new one:

-+return (costL+costR);
++/* Reduce the cost for each OR expression pair by a few percent due
++** to the fact that if the transform was performed then there would
++** be fewer expressions to evaluate in each compound subselect's
++** WHERE clause.
++*/
++return (costL+costR)*31/32;
@@ -323 +328 @@
-+  p-op = TK_UNION;
++  p-op = TK_ALL;
@@ -329 +334 @@
-+p-op = TK_UNION;
++p-op = TK_ALL;
@@ -339 +344 @@
-+p-op = TK_UNION;
++p-op = TK_ALL;


--- Joe Wilson [EMAIL PROTECTED] wrote:
 The attached patch implements the WHERE clause OR to UNION 
 optimization as described in this post:
 
  http://www.mail-archive.com/sqlite-users@sqlite.org/msg09004.html
 
 If the computed cost of the rewritten WHERE clause is lower than 
 the original query when indexes are taken into account, then it 
 will perform the optimization. If the cost is estimated to be 
 higher then the query will not be rewritten.
 
 Given the database formed by running these statements:
 
   create table stuff(a,b,c,d);
   insert into stuff values(1,2,3,4);
   create temp view v1 as select random()%100,
 random()%100, random()%1000, random()%1
  from stuff x, stuff y;
   insert into stuff select * from v1;
   insert into stuff select * from v1;
   insert into stuff select * from v1;
   insert into stuff select * from v1;
   insert into stuff select * from v1;
   create index stuff_b on stuff(b);
   create index stuff_c on stuff(c);
   create index stuff_d on stuff(d);
   analyze;
 
 The patched version of sqlite 3.5.4 will run the following query 
 many times faster than an unpatched sqlite 3.5.4:
 
   select b, count(*) from stuff 
   where c=2 or b=23 or c=17 or c=493 or d=7 or c=111 and a=14 
   group by 1 order by 2 DESC, 1 limit 10;
 
 On my machine, the patched version produces these query timings:
 
   CPU Time: user 0.724045 sys 0.092005
 
 with the EXPLAIN QUERY PLAN:
 
   0|0|TABLE stuff USING PRIMARY KEY
   0|0|TABLE stuff WITH INDEX stuff_c
   0|0|TABLE stuff WITH INDEX stuff_d
   0|0|TABLE stuff WITH INDEX stuff_b
   0|0|TABLE stuff WITH INDEX stuff_c
 
 For the same query the unpatched sqlite 3.5.4 produces:
 
   CPU Time: user 20.869304 sys 8.912557
 
   0|0|TABLE stuff WITH INDEX stuff_b ORDER BY
 
 Only single table queries are supported by this OR optimization. 
 For this optimization to be considered, the WHERE clause may only 
 consist of column equality comparisons to constants, ORs and ANDs.
 
 The optimization only looks at the top-level WHERE clause ORs. It 
 will not work with IN expressions. Nor will it will not expand 
 expressions like a=1 AND (b=2 or c=3) into a=1 AND b=2 OR a=1 
 AND c=3 - although if manually expanded, the latter form could 
 potentially be optimized.
 
 It passes make test without regressions, but more testing is needed.




  

Never miss a thing.  Make Yahoo your home page. 
http://www.yahoo.com/r/hsIndex: src/select.c
===
RCS file: /sqlite/sqlite/src/select.c,v
retrieving revision 1.372
diff -u -3 -p -r1.372 select.c
--- src/select.c14 Dec 2007 17:24:40 -  1.372
+++ src/select.c22 Dec 2007 17:37:13 -
@@ -12,7 +12,7 @@
 ** This file contains C code routines that are called by the parser
 ** to handle SELECT statements in SQLite.
 **
-** $Id: select.c,v 1.372 2007/12/14 17:24:40 drh Exp $
+** $Id: select.c,v 1.370 2007/12/13 21:54:11 drh Exp $
 */
 #include sqliteInt.h
 
@@ -2961,6 +2961,445 @@ static void updateAccumulator(Parse *pPa
   pAggInfo-directMode = 0;
 }
 
+#ifndef SQLITE_OMIT_OR_UNION_TRANSFORM
+
+/* The function prefix o2u stands for OR to UNION TRANSFORM */
+static double o2uAndCost(Expr *p, int iTable, Bitmask *bm);
+static double o2uEqCost(Expr *p, int iTable, Bitmask *bm);
+
+/*
+** Count the number of bits in Bitmask. Each bit represents the existance
+** of a column in an expression. The zeroth bit represents the use of the
+** rowid column in the WHERE clause, which is different 

[sqlite] PATCH: WHERE clause OR to UNION optimization

2007-12-21 Thread Joe Wilson
The attached patch implements the WHERE clause OR to UNION 
optimization as described in this post:

 http://www.mail-archive.com/sqlite-users@sqlite.org/msg09004.html

If the computed cost of the rewritten WHERE clause is lower than 
the original query when indexes are taken into account, then it 
will perform the optimization. If the cost is estimated to be 
higher then the query will not be rewritten.

Given the database formed by running these statements:

  create table stuff(a,b,c,d);
  insert into stuff values(1,2,3,4);
  create temp view v1 as select random()%100,
random()%100, random()%1000, random()%1
 from stuff x, stuff y;
  insert into stuff select * from v1;
  insert into stuff select * from v1;
  insert into stuff select * from v1;
  insert into stuff select * from v1;
  insert into stuff select * from v1;
  create index stuff_b on stuff(b);
  create index stuff_c on stuff(c);
  create index stuff_d on stuff(d);
  analyze;

The patched version of sqlite 3.5.4 will run the following query 
many times faster than an unpatched sqlite 3.5.4:

  select b, count(*) from stuff 
  where c=2 or b=23 or c=17 or c=493 or d=7 or c=111 and a=14 
  group by 1 order by 2 DESC, 1 limit 10;

On my machine, the patched version produces these query timings:

  CPU Time: user 0.724045 sys 0.092005

with the EXPLAIN QUERY PLAN:

  0|0|TABLE stuff USING PRIMARY KEY
  0|0|TABLE stuff WITH INDEX stuff_c
  0|0|TABLE stuff WITH INDEX stuff_d
  0|0|TABLE stuff WITH INDEX stuff_b
  0|0|TABLE stuff WITH INDEX stuff_c

For the same query the unpatched sqlite 3.5.4 produces:

  CPU Time: user 20.869304 sys 8.912557

  0|0|TABLE stuff WITH INDEX stuff_b ORDER BY

Only single table queries are supported by this OR optimization. 
For this optimization to be considered, the WHERE clause may only 
consist of column equality comparisons to constants, ORs and ANDs.

The optimization only looks at the top-level WHERE clause ORs. It 
will not work with IN expressions. Nor will it will not expand 
expressions like a=1 AND (b=2 or c=3) into a=1 AND b=2 OR a=1 
AND c=3 - although if manually expanded, the latter form could 
potentially be optimized.

It passes make test without regressions, but more testing is needed.




  

Never miss a thing.  Make Yahoo your home page. 
http://www.yahoo.com/r/hsIndex: src/select.c
===
RCS file: /sqlite/sqlite/src/select.c,v
retrieving revision 1.372
diff -u -3 -p -r1.372 select.c
--- src/select.c14 Dec 2007 17:24:40 -  1.372
+++ src/select.c22 Dec 2007 02:39:00 -
@@ -12,7 +12,7 @@
 ** This file contains C code routines that are called by the parser
 ** to handle SELECT statements in SQLite.
 **
-** $Id: select.c,v 1.372 2007/12/14 17:24:40 drh Exp $
+** $Id: select.c,v 1.370 2007/12/13 21:54:11 drh Exp $
 */
 #include sqliteInt.h
 
@@ -2961,6 +2961,440 @@ static void updateAccumulator(Parse *pPa
   pAggInfo-directMode = 0;
 }
 
+#ifndef SQLITE_OMIT_OR_UNION_TRANSFORM
+
+/* The function prefix o2u stands for OR to UNION TRANSFORM */
+static double o2uAndCost(Expr *p, int iTable, Bitmask *bm);
+static double o2uEqCost(Expr *p, int iTable, Bitmask *bm);
+
+/*
+** Count the number of bits in Bitmask. Each bit represents the existance
+** of a column in an expression. The zeroth bit represents the use of the
+** rowid column in the WHERE clause, which is different from non-o2u
+** uses of Bitmask in the code.
+*/
+static int o2uBitCount(Bitmask x){
+  int n = 0;
+  while( x ){
+x = x-1;
+++n;
+  }
+  return n;
+}
+
+/*
+** Full table scan cost is simply the average number rows for each index
+** for the specified table.
+*/
+static double o2uFullTableScanCost(Table* pTab){
+  if( pTab ){
+double sum = 0;
+int n = 0;
+Index *pIndex;
+for( pIndex = pTab-pIndex; pIndex; pIndex = pIndex-pNext, n++ ){
+  if( pIndex-nColumn0 ){
+sum += pIndex-aiRowEst[0];
+  }
+}
+if( n  sumn ){
+  return sum/n;
+}
+  }
+  return 100;
+}
+
+/*
+** Estimate a WHERE clause column's cost taking indexes into account.
+** Record any columns encountered in Bitmask.
+*/
+static double o2uColumnCost(Expr *p, int iTable, Bitmask *bm){
+  if( p  p-op==TK_COLUMN  p-pSelect==0  p-iTable==iTable  p-pTab ){
+int iColumn = p-iColumn;
+Index* pIndex;
+if( iColumn=-1  iColumn=(int)(sizeof(Bitmask)*8-2) ){
+  *bm |= 1(iColumn+1); /* rowid column -1 is the 0th bit */
+}else{
+  *bm |= 0x3; /* iTable beyond range: disqualify single column OR opt */
+}
+if( iColumn==-1 ){
+  return 10;  /* Match: rowid */
+}
+for( pIndex = p-pTab-pIndex; pIndex; pIndex = pIndex-pNext ){
+  if( pIndex-nColumn0  pIndex-aiColumn[0]==iColumn ){
+return pIndex-aiRowEst[1];   /* Match: first column of index */
+  }
+}
+