Re: [sqlite] PATCH: WHERE clause OR to UNION optimization
Joe Wilson [EMAIL PROTECTED] wrote: The attached patch implements the WHERE clause OR to UNION optimization as described in this post: I just went thumbing through the firesafe and I do not think I have a copyright release on file for you, Joe. Please go print out a copy of one of http://www.sqlite.org/copyright-release.html http://www.sqlite.org/copyright-release.pdf Sign it, get your employer to sign it, then send it to me by post to the address given at the bottom of http://www.sqlite.org/copyright.html Thanks. -- D. Richard Hipp [EMAIL PROTECTED] - To unsubscribe, send email to [EMAIL PROTECTED] -
Re: [sqlite] PATCH: WHERE clause OR to UNION optimization
This new attached patch corrects a slight inefficiency in my previous patch. It now uses UNION ALL (TK_ALL) instead of UNION (TK_UNION) for the subselects, as was my initial intention. OR queries on dis-similar columns with large intermediate result sets are now even faster. Given the same sample database below and running this query: select count(*) from stuff where b=1 or b=5 or c=2 or d=4 or b=9; will yield these timings: sqlite 3.5.4 unpatched: CPU Time: user 3.560223 sys 0.156010 3.5.4 + 20071221 patch: CPU Time: user 1.252079 sys 0.072004 3.5.4 + 20071222 patch: CPU Time: user 0.776048 sys 0.116007 I also changed the OR cost estimator to reduce the cost for each additional OR expression pair, as it better reflects the actual cost. Here's the meta-diff between yesterday's patch and the new one: -+return (costL+costR); ++/* Reduce the cost for each OR expression pair by a few percent due ++** to the fact that if the transform was performed then there would ++** be fewer expressions to evaluate in each compound subselect's ++** WHERE clause. ++*/ ++return (costL+costR)*31/32; @@ -323 +328 @@ -+ p-op = TK_UNION; ++ p-op = TK_ALL; @@ -329 +334 @@ -+p-op = TK_UNION; ++p-op = TK_ALL; @@ -339 +344 @@ -+p-op = TK_UNION; ++p-op = TK_ALL; --- Joe Wilson [EMAIL PROTECTED] wrote: The attached patch implements the WHERE clause OR to UNION optimization as described in this post: http://www.mail-archive.com/sqlite-users@sqlite.org/msg09004.html If the computed cost of the rewritten WHERE clause is lower than the original query when indexes are taken into account, then it will perform the optimization. If the cost is estimated to be higher then the query will not be rewritten. Given the database formed by running these statements: create table stuff(a,b,c,d); insert into stuff values(1,2,3,4); create temp view v1 as select random()%100, random()%100, random()%1000, random()%1 from stuff x, stuff y; insert into stuff select * from v1; insert into stuff select * from v1; insert into stuff select * from v1; insert into stuff select * from v1; insert into stuff select * from v1; create index stuff_b on stuff(b); create index stuff_c on stuff(c); create index stuff_d on stuff(d); analyze; The patched version of sqlite 3.5.4 will run the following query many times faster than an unpatched sqlite 3.5.4: select b, count(*) from stuff where c=2 or b=23 or c=17 or c=493 or d=7 or c=111 and a=14 group by 1 order by 2 DESC, 1 limit 10; On my machine, the patched version produces these query timings: CPU Time: user 0.724045 sys 0.092005 with the EXPLAIN QUERY PLAN: 0|0|TABLE stuff USING PRIMARY KEY 0|0|TABLE stuff WITH INDEX stuff_c 0|0|TABLE stuff WITH INDEX stuff_d 0|0|TABLE stuff WITH INDEX stuff_b 0|0|TABLE stuff WITH INDEX stuff_c For the same query the unpatched sqlite 3.5.4 produces: CPU Time: user 20.869304 sys 8.912557 0|0|TABLE stuff WITH INDEX stuff_b ORDER BY Only single table queries are supported by this OR optimization. For this optimization to be considered, the WHERE clause may only consist of column equality comparisons to constants, ORs and ANDs. The optimization only looks at the top-level WHERE clause ORs. It will not work with IN expressions. Nor will it will not expand expressions like a=1 AND (b=2 or c=3) into a=1 AND b=2 OR a=1 AND c=3 - although if manually expanded, the latter form could potentially be optimized. It passes make test without regressions, but more testing is needed. Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hsIndex: src/select.c === RCS file: /sqlite/sqlite/src/select.c,v retrieving revision 1.372 diff -u -3 -p -r1.372 select.c --- src/select.c14 Dec 2007 17:24:40 - 1.372 +++ src/select.c22 Dec 2007 17:37:13 - @@ -12,7 +12,7 @@ ** This file contains C code routines that are called by the parser ** to handle SELECT statements in SQLite. ** -** $Id: select.c,v 1.372 2007/12/14 17:24:40 drh Exp $ +** $Id: select.c,v 1.370 2007/12/13 21:54:11 drh Exp $ */ #include sqliteInt.h @@ -2961,6 +2961,445 @@ static void updateAccumulator(Parse *pPa pAggInfo-directMode = 0; } +#ifndef SQLITE_OMIT_OR_UNION_TRANSFORM + +/* The function prefix o2u stands for OR to UNION TRANSFORM */ +static double o2uAndCost(Expr *p, int iTable, Bitmask *bm); +static double o2uEqCost(Expr *p, int iTable, Bitmask *bm); + +/* +** Count the number of bits in Bitmask. Each bit represents the existance +** of a column in an expression. The zeroth bit represents the use of the +** rowid column in the WHERE clause, which is different
[sqlite] PATCH: WHERE clause OR to UNION optimization
The attached patch implements the WHERE clause OR to UNION optimization as described in this post: http://www.mail-archive.com/sqlite-users@sqlite.org/msg09004.html If the computed cost of the rewritten WHERE clause is lower than the original query when indexes are taken into account, then it will perform the optimization. If the cost is estimated to be higher then the query will not be rewritten. Given the database formed by running these statements: create table stuff(a,b,c,d); insert into stuff values(1,2,3,4); create temp view v1 as select random()%100, random()%100, random()%1000, random()%1 from stuff x, stuff y; insert into stuff select * from v1; insert into stuff select * from v1; insert into stuff select * from v1; insert into stuff select * from v1; insert into stuff select * from v1; create index stuff_b on stuff(b); create index stuff_c on stuff(c); create index stuff_d on stuff(d); analyze; The patched version of sqlite 3.5.4 will run the following query many times faster than an unpatched sqlite 3.5.4: select b, count(*) from stuff where c=2 or b=23 or c=17 or c=493 or d=7 or c=111 and a=14 group by 1 order by 2 DESC, 1 limit 10; On my machine, the patched version produces these query timings: CPU Time: user 0.724045 sys 0.092005 with the EXPLAIN QUERY PLAN: 0|0|TABLE stuff USING PRIMARY KEY 0|0|TABLE stuff WITH INDEX stuff_c 0|0|TABLE stuff WITH INDEX stuff_d 0|0|TABLE stuff WITH INDEX stuff_b 0|0|TABLE stuff WITH INDEX stuff_c For the same query the unpatched sqlite 3.5.4 produces: CPU Time: user 20.869304 sys 8.912557 0|0|TABLE stuff WITH INDEX stuff_b ORDER BY Only single table queries are supported by this OR optimization. For this optimization to be considered, the WHERE clause may only consist of column equality comparisons to constants, ORs and ANDs. The optimization only looks at the top-level WHERE clause ORs. It will not work with IN expressions. Nor will it will not expand expressions like a=1 AND (b=2 or c=3) into a=1 AND b=2 OR a=1 AND c=3 - although if manually expanded, the latter form could potentially be optimized. It passes make test without regressions, but more testing is needed. Never miss a thing. Make Yahoo your home page. http://www.yahoo.com/r/hsIndex: src/select.c === RCS file: /sqlite/sqlite/src/select.c,v retrieving revision 1.372 diff -u -3 -p -r1.372 select.c --- src/select.c14 Dec 2007 17:24:40 - 1.372 +++ src/select.c22 Dec 2007 02:39:00 - @@ -12,7 +12,7 @@ ** This file contains C code routines that are called by the parser ** to handle SELECT statements in SQLite. ** -** $Id: select.c,v 1.372 2007/12/14 17:24:40 drh Exp $ +** $Id: select.c,v 1.370 2007/12/13 21:54:11 drh Exp $ */ #include sqliteInt.h @@ -2961,6 +2961,440 @@ static void updateAccumulator(Parse *pPa pAggInfo-directMode = 0; } +#ifndef SQLITE_OMIT_OR_UNION_TRANSFORM + +/* The function prefix o2u stands for OR to UNION TRANSFORM */ +static double o2uAndCost(Expr *p, int iTable, Bitmask *bm); +static double o2uEqCost(Expr *p, int iTable, Bitmask *bm); + +/* +** Count the number of bits in Bitmask. Each bit represents the existance +** of a column in an expression. The zeroth bit represents the use of the +** rowid column in the WHERE clause, which is different from non-o2u +** uses of Bitmask in the code. +*/ +static int o2uBitCount(Bitmask x){ + int n = 0; + while( x ){ +x = x-1; +++n; + } + return n; +} + +/* +** Full table scan cost is simply the average number rows for each index +** for the specified table. +*/ +static double o2uFullTableScanCost(Table* pTab){ + if( pTab ){ +double sum = 0; +int n = 0; +Index *pIndex; +for( pIndex = pTab-pIndex; pIndex; pIndex = pIndex-pNext, n++ ){ + if( pIndex-nColumn0 ){ +sum += pIndex-aiRowEst[0]; + } +} +if( n sumn ){ + return sum/n; +} + } + return 100; +} + +/* +** Estimate a WHERE clause column's cost taking indexes into account. +** Record any columns encountered in Bitmask. +*/ +static double o2uColumnCost(Expr *p, int iTable, Bitmask *bm){ + if( p p-op==TK_COLUMN p-pSelect==0 p-iTable==iTable p-pTab ){ +int iColumn = p-iColumn; +Index* pIndex; +if( iColumn=-1 iColumn=(int)(sizeof(Bitmask)*8-2) ){ + *bm |= 1(iColumn+1); /* rowid column -1 is the 0th bit */ +}else{ + *bm |= 0x3; /* iTable beyond range: disqualify single column OR opt */ +} +if( iColumn==-1 ){ + return 10; /* Match: rowid */ +} +for( pIndex = p-pTab-pIndex; pIndex; pIndex = pIndex-pNext ){ + if( pIndex-nColumn0 pIndex-aiColumn[0]==iColumn ){ +return pIndex-aiRowEst[1]; /* Match: first column of index */ + } +} +