Re: [sqlite] select intersecting intervals
>The Minimal-Perfect-Hash-INTERSECTION-OF-VECTORS approach might benefit >queries against tables having several million rows. What I'm wondering >(and >lack the C skills to find out for myself) is whether SQLite's underlying >algorithms for INTERSECT could be optimized with a minimal perfect hash >approach. The algorithms would decide which vector of ids is the better >candidate to be used for the MPH and which is the better candidate to be >iterated, and then iterate over the one vector of ids, testing for the >existence of each id in the MPH using the optimized Exists() function >supplied by the mph library for the particular type of mph being used. > >The question, in scenarios where the vectors contain many items, is >whether >the overhead of creating the MPH and testing for existence is >significantly >less than the overhead of doing whatever INTERSECT is doing now when it >intersects vectors of ids. You have a ready-made acronym to advertise the >speed if it turns out to be faster: MPH. ;-) Ooops, for some reason the end of my post went deleted. I was adding that the simplest naive approch using only one index can be magnitude faster, depending on how selective the conditions are, for a given set content. I believe the key is "know your data" for the same reason you'll craft a regexp differently depending on what you know about the target text. Histogram can help this. Good MPH pun! ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] select intersecting intervals
On Thu, May 13, 2010 at 10:49:48AM -0400, Pavel Ivanov scratched on the wall: > > ?You have three basic conditions, and they're all AND'ed together. > > ?Just build an index that each condition can walk through. > > > > ?Or am I missing something? ?I know there are some odd rules about how > > ?SQLite will use (or won't use) indexes for greater-than/less-than > > ?conditions, but I don't remember the specifics. > > This specifics is the same for any DBMS: if you have greater/less > condition on column that is "in the middle of index" then indexing on > any consecutive columns is useless. Not nearly as useless as it would be at the beginning of the index. > I.e. in this case with index on > (i_name, i_from, i_to) and condition i_from < something condition on > i_to will be checked for each row satisfying conditions on i_name and > i_from. So performance will be almost the same. "Almost" because when > i_to is in the index optimization can be made and row from table not > loaded unless condition on i_to is true. Right... you still have to do an index scan, but you can do that all from the index, rather than the double-lookup required to reference back to the table. Normally GT/LT type constraints are not applied to an index because it is assumed the index won't be targeted enough to justify the overhead. Hence structures like R*Trees that can deal with multiple constraints much more efficiently. But in this case every step is a narrowing step (especially if i_name is reasonably unique) and the whole set of constraints can be processed from a single index. > But I don't know if such optimization exists in SQLite or not. Yeah, I don't know. I also don't know if SQLite *always* ignores the indexes. If you have extended stats enabled, the ANALYZE command will generate a histogram. It seems the only use of the histogram is to make a choice on using (or not using) an index for GT/LT type lookups. -j -- Jay A. Kreibich < J A Y @ K R E I B I.C H > "Our opponent is an alien starship packed with atomic bombs. We have a protractor." "I'll go home and see if I can scrounge up a ruler and a piece of string." --from Anathem by Neal Stephenson ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] select intersecting intervals
> You have three basic conditions, and they're all AND'ed together. > Just build an index that each condition can walk through. > > Or am I missing something? I know there are some odd rules about how > SQLite will use (or won't use) indexes for greater-than/less-than > conditions, but I don't remember the specifics. This specifics is the same for any DBMS: if you have greater/less condition on column that is "in the middle of index" then indexing on any consecutive columns is useless. I.e. in this case with index on (i_name, i_from, i_to) and condition i_from < something condition on i_to will be checked for each row satisfying conditions on i_name and i_from. So performance will be almost the same. "Almost" because when i_to is in the index optimization can be made and row from table not loaded unless condition on i_to is true. But I don't know if such optimization exists in SQLite or not. Also this optimization can be minimal in SQLite because AFAIK it stores part of the row from table in the index too. Pavel On Thu, May 13, 2010 at 10:16 AM, Jay A. Kreibich wrote: > On Wed, May 12, 2010 at 06:00:29PM +0200, Jan Asselman scratched on the wall: >> Hi, >> >> Given the following table with large row count 'row_count': >> >> CREATE TABLE table >> ( >> i_name TEXT, >> i_from INTEGER, >> i_to INTEGER, >> i_data BLOB >> ) >> >> I am wondering what would be the fastest way to get all rows with a >> given name 'myname' that intersect with a given interval [a, b]? >> >> CREATE INDEX idx_from ON table (i_name, i_from); >> CREATE INDEX idx_to ON table (i_name, i_to); > > The query is only going to be able to use one of these. > >> I know this is exactly what a one dimensional R-tree index is used for, >> but my project requires 64 bit integer minimum- and maximum-value >> pairs... > > True, although R-trees become much more useful when you get past one > dimension. > >> All suggestions or corrections are appreciated. > > Assuming i_name is somewhat unique, why wouldn't you just create an > index over (i_name, i_from, i_to)? If i_name isn't very unique, mix > up the order a bit. > > You have three basic conditions, and they're all AND'ed together. > Just build an index that each condition can walk through. > > Or am I missing something? I know there are some odd rules about how > SQLite will use (or won't use) indexes for greater-than/less-than > conditions, but I don't remember the specifics. > > -j > > -- > Jay A. Kreibich < J A Y @ K R E I B I.C H > > > "Our opponent is an alien starship packed with atomic bombs. We have > a protractor." "I'll go home and see if I can scrounge up a ruler > and a piece of string." --from Anathem by Neal Stephenson > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] select intersecting intervals
On Wed, May 12, 2010 at 06:00:29PM +0200, Jan Asselman scratched on the wall: > Hi, > > Given the following table with large row count 'row_count': > > CREATE TABLE table > ( > i_name TEXT, > i_from INTEGER, > i_toINTEGER, > i_data BLOB > ) > > I am wondering what would be the fastest way to get all rows with a > given name 'myname' that intersect with a given interval [a, b]? > > CREATE INDEX idx_from ON table (i_name, i_from); > CREATE INDEX idx_to ON table (i_name, i_to); The query is only going to be able to use one of these. > I know this is exactly what a one dimensional R-tree index is used for, > but my project requires 64 bit integer minimum- and maximum-value > pairs... True, although R-trees become much more useful when you get past one dimension. > All suggestions or corrections are appreciated. Assuming i_name is somewhat unique, why wouldn't you just create an index over (i_name, i_from, i_to)? If i_name isn't very unique, mix up the order a bit. You have three basic conditions, and they're all AND'ed together. Just build an index that each condition can walk through. Or am I missing something? I know there are some odd rules about how SQLite will use (or won't use) indexes for greater-than/less-than conditions, but I don't remember the specifics. -j -- Jay A. Kreibich < J A Y @ K R E I B I.C H > "Our opponent is an alien starship packed with atomic bombs. We have a protractor." "I'll go home and see if I can scrounge up a ruler and a piece of string." --from Anathem by Neal Stephenson ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] select intersecting intervals
Right, Jean-Christophe, about "moving too much data around". You'd want each of the inner selects to produce a vector of integer ids. "select *" is clearly not going to be very efficient. 200K rows is a very small table for this sort of test, in my view. Were the faster intereections any faster than: select * from T where lo ... {low condition} and hi ..{ hi condition} and name = 'x' where only one index is chosen? The answer would depend on the cardinality. If [name] is highly differentiated then name='x' would return very few rows and the query is fast using standard approach. But if [name] were, say, FirstName, and there were hundreds of thousands of records containing "Joe" in the OP's table, then the name index won't help us very much. That's why I am asking about Minimal Perfect Hash and intersection of vectors. The Minimal-Perfect-Hash-INTERSECTION-OF-VECTORS approach might benefit queries against tables having several million rows. What I'm wondering (and lack the C skills to find out for myself) is whether SQLite's underlying algorithms for INTERSECT could be optimized with a minimal perfect hash approach. The algorithms would decide which vector of ids is the better candidate to be used for the MPH and which is the better candidate to be iterated, and then iterate over the one vector of ids, testing for the existence of each id in the MPH using the optimized Exists() function supplied by the mph library for the particular type of mph being used. The question, in scenarios where the vectors contain many items, is whether the overhead of creating the MPH and testing for existence is significantly less than the overhead of doing whatever INTERSECT is doing now when it intersects vectors of ids. You have a ready-made acronym to advertise the speed if it turns out to be faster: MPH. ;-) Regards Tim Romano Swarthmore PA On Wed, May 12, 2010 at 8:52 PM, Jean-Christophe Deschamps wrote: > > > > >I would first create an INTEGER primary key and then place an index on > >name, > >another on i_from, and another on i_to, and then see if the approach below > >has any benefit. > > > >When I tried this with a geo-queryit was actually slower than the standard > >select, and I'm curious if that's always going to be the case. It will > >come > >down to how efficient the INTERSECT of the vectors of integers is. Each > >vector will have been the result of an index-scan. If INTERSECT were > >optimized (perhaps with a minimal perfect hash function > >http://cmph.sourceforge.net/index.html) this approach might be useful. > > > All three following queries use only simple indexes (PK, name, lo, hi). > > Query#1: > select * from tst where lo < 345678 > intersect > select * from tst where hi > 123456 > intersect > select * from tst where name = 'aaelj'; > > Query#2 > select * from tst > join ( > select rowid from tst where lo < 345678 > intersect > select rowid from tst where hi > 123456 > ) as interval > on tst.rowid = interval.rowid and name = 'aaelj'; > > Query#3 > select * from tst > join ( > select rowid from tst where lo < 345678 > intersect > select rowid from tst where hi > 123456 > intersect > select rowid from tst where name = 'aaelj' > ) as interval > on tst.rowid = interval.rowid; > > On a 200K-row test table with random data, queries #2 and #3 were > essentially identical while #1 was twice slower (moving too much data > around, uselessly). > > > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] select intersecting intervals
> >I would first create an INTEGER primary key and then place an index on >name, >another on i_from, and another on i_to, and then see if the approach below >has any benefit. > >When I tried this with a geo-queryit was actually slower than the standard >select, and I'm curious if that's always going to be the case. It will >come >down to how efficient the INTERSECT of the vectors of integers is. Each >vector will have been the result of an index-scan. If INTERSECT were >optimized (perhaps with a minimal perfect hash function >http://cmph.sourceforge.net/index.html) this approach might be useful. All three following queries use only simple indexes (PK, name, lo, hi). Query#1: select * from tst where lo < 345678 intersect select * from tst where hi > 123456 intersect select * from tst where name = 'aaelj'; Query#2 select * from tst join ( select rowid from tst where lo < 345678 intersect select rowid from tst where hi > 123456 ) as interval on tst.rowid = interval.rowid and name = 'aaelj'; Query#3 select * from tst join ( select rowid from tst where lo < 345678 intersect select rowid from tst where hi > 123456 intersect select rowid from tst where name = 'aaelj' ) as interval on tst.rowid = interval.rowid; On a 200K-row test table with random data, queries #2 and #3 were essentially identical while #1 was twice slower (moving too much data around, uselessly). ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
Re: [sqlite] select intersecting intervals
I would first create an INTEGER primary key and then place an index on name, another on i_from, and another on i_to, and then see if the approach below has any benefit. When I tried this with a geo-queryit was actually slower than the standard select, and I'm curious if that's always going to be the case. It will come down to how efficient the INTERSECT of the vectors of integers is. Each vector will have been the result of an index-scan. If INTERSECT were optimized (perhaps with a minimal perfect hash function http://cmph.sourceforge.net/index.html) this approach might be useful. select * from T JOIN ( select pk_col from T where i_from > ? intersect select pk_col from T where i_to < ? ) as DESIREDINTERVAL ON T.pk_col = DESIREDINTERVAL.pk_col and T.name = ? Regards Tim Romano Swarthmore PA On Wed, May 12, 2010 at 12:00 PM, Jan Asselman wrote: > Hi, > > Given the following table with large row count 'row_count': > > CREATE TABLE table > ( >i_name TEXT, >i_from INTEGER, >i_toINTEGER, >i_data BLOB > ) > > I am wondering what would be the fastest way to get all rows with a > given name 'myname' that intersect with a given interval [a, b]? > > > CREATE INDEX idx_from ON table (i_name, i_from); > CREATE INDEX idx_to ON table (i_name, i_to); > > SELECT data FROM table WHERE name = 'myname' AND i_from < b AND i_to > a > >-> index idx_from will be used >-> in worst case (a is larger than all i_to) all 'myname' rows > will be traversed before concluding result set is empty > > SELECT data FROM table WHERE name = 'myname' AND i_to > a AND i_from < b > >-> index idx_to will be used >-> in worst case (b is smaller than all i_from) all 'myname' > rows > will be traversed before concluding result set is empty > > > > I know this is exactly what a one dimensional R-tree index is used for, > but my project requires 64 bit integer minimum- and maximum-value > pairs... > > All suggestions or corrections are appreciated. > > > Thanks in advance, > > Jan Asselman > ___ > sqlite-users mailing list > sqlite-users@sqlite.org > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
[sqlite] select intersecting intervals
Hi, Given the following table with large row count 'row_count': CREATE TABLE table ( i_name TEXT, i_from INTEGER, i_toINTEGER, i_data BLOB ) I am wondering what would be the fastest way to get all rows with a given name 'myname' that intersect with a given interval [a, b]? CREATE INDEX idx_from ON table (i_name, i_from); CREATE INDEX idx_to ON table (i_name, i_to); SELECT data FROM table WHERE name = 'myname' AND i_from < b AND i_to > a -> index idx_from will be used -> in worst case (a is larger than all i_to) all 'myname' rows will be traversed before concluding result set is empty SELECT data FROM table WHERE name = 'myname' AND i_to > a AND i_from < b -> index idx_to will be used -> in worst case (b is smaller than all i_from) all 'myname' rows will be traversed before concluding result set is empty I know this is exactly what a one dimensional R-tree index is used for, but my project requires 64 bit integer minimum- and maximum-value pairs... All suggestions or corrections are appreciated. Thanks in advance, Jan Asselman ___ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users