Re: [sqlite] select intersecting intervals

2010-05-13 Thread Jean-Christophe Deschamps

>The Minimal-Perfect-Hash-INTERSECTION-OF-VECTORS approach might benefit
>queries against tables having several million rows. What I'm wondering 
>(and
>lack the C skills to find out for myself) is whether SQLite's underlying
>algorithms for INTERSECT could be optimized with a minimal perfect hash
>approach. The algorithms would decide which vector of ids is the better
>candidate to be used for the MPH and which is the better candidate to be
>iterated, and then iterate over the one vector of ids, testing for the
>existence of each id in the MPH using the optimized Exists() function
>supplied by the mph library for the particular type of mph being used.
>
>The question, in scenarios where the vectors contain many items, is 
>whether
>the overhead of creating the MPH and testing for existence is 
>significantly
>less than the overhead of doing whatever INTERSECT is doing now when it
>intersects vectors of ids.  You have a ready-made acronym to advertise the
>speed if it turns out to be faster: MPH.  ;-)

Ooops, for some reason the end of my post went deleted.  I was adding 
that the simplest naive approch using only one index can be magnitude 
faster, depending on how selective the conditions are, for a given set 
content.  I believe the key is "know your data" for the same reason 
you'll craft a regexp differently depending on what you know about the 
target text.  Histogram can help this.

Good MPH pun!


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] select intersecting intervals

2010-05-13 Thread Jay A. Kreibich
On Thu, May 13, 2010 at 10:49:48AM -0400, Pavel Ivanov scratched on the wall:
> > ?You have three basic conditions, and they're all AND'ed together.
> > ?Just build an index that each condition can walk through.
> >
> > ?Or am I missing something? ?I know there are some odd rules about how
> > ?SQLite will use (or won't use) indexes for greater-than/less-than
> > ?conditions, but I don't remember the specifics.
> 
> This specifics is the same for any DBMS: if you have greater/less
> condition on column that is "in the middle of index" then indexing on
> any consecutive columns is useless.

  Not nearly as useless as it would be at the beginning of the index.

> I.e. in this case with index on
> (i_name, i_from, i_to) and condition i_from < something condition on
> i_to will be checked for each row satisfying conditions on i_name and
> i_from. So performance will be almost the same. "Almost" because when
> i_to is in the index optimization can be made and row from table not
> loaded unless condition on i_to is true.

  Right... you still have to do an index scan, but you can do that all
  from the index, rather than the double-lookup required to reference
  back to the table.

  Normally GT/LT type constraints are not applied to an index because
  it is assumed the index won't be targeted enough to justify the
  overhead.  Hence structures like R*Trees that can deal with
  multiple constraints much more efficiently.
  
  But in this case every step is a narrowing step (especially if
  i_name is reasonably unique) and the whole set of constraints can
  be processed from a single index.

> But I don't know if such optimization exists in SQLite or not. 

  Yeah, I don't know.  I also don't know if SQLite *always* ignores the
  indexes.  If you have extended stats enabled, the ANALYZE command
  will generate a histogram.  It seems the only use of the histogram is
  to make a choice on using (or not using) an index for GT/LT type
  lookups.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs.  We have
 a protractor."   "I'll go home and see if I can scrounge up a ruler
 and a piece of string."  --from Anathem by Neal Stephenson
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] select intersecting intervals

2010-05-13 Thread Pavel Ivanov
>  You have three basic conditions, and they're all AND'ed together.
>  Just build an index that each condition can walk through.
>
>  Or am I missing something?  I know there are some odd rules about how
>  SQLite will use (or won't use) indexes for greater-than/less-than
>  conditions, but I don't remember the specifics.

This specifics is the same for any DBMS: if you have greater/less
condition on column that is "in the middle of index" then indexing on
any consecutive columns is useless. I.e. in this case with index on
(i_name, i_from, i_to) and condition i_from < something condition on
i_to will be checked for each row satisfying conditions on i_name and
i_from. So performance will be almost the same. "Almost" because when
i_to is in the index optimization can be made and row from table not
loaded unless condition on i_to is true. But I don't know if such
optimization exists in SQLite or not. Also this optimization can be
minimal in SQLite because AFAIK it stores part of the row from table
in the index too.


Pavel

On Thu, May 13, 2010 at 10:16 AM, Jay A. Kreibich  wrote:
> On Wed, May 12, 2010 at 06:00:29PM +0200, Jan Asselman scratched on the wall:
>> Hi,
>>
>> Given the following table with large row count 'row_count':
>>
>> CREATE TABLE table
>> (
>>       i_name  TEXT,
>>       i_from  INTEGER,
>>       i_to            INTEGER,
>>       i_data  BLOB
>> )
>>
>> I am wondering what would be the fastest way to get all rows with a
>> given name 'myname' that intersect with a given interval [a, b]?
>>
>> CREATE INDEX idx_from ON table (i_name, i_from);
>> CREATE INDEX idx_to ON table (i_name, i_to);
>
>  The query is only going to be able to use one of these.
>
>> I know this is exactly what a one dimensional R-tree index is used for,
>> but my project requires 64 bit integer minimum- and maximum-value
>> pairs...
>
>  True, although R-trees become much more useful when you get past one
>  dimension.
>
>> All suggestions or corrections are appreciated.
>
>  Assuming i_name is somewhat unique, why wouldn't you just create an
>  index over (i_name, i_from, i_to)?  If i_name isn't very unique, mix
>  up the  order a bit.
>
>  You have three basic conditions, and they're all AND'ed together.
>  Just build an index that each condition can walk through.
>
>  Or am I missing something?  I know there are some odd rules about how
>  SQLite will use (or won't use) indexes for greater-than/less-than
>  conditions, but I don't remember the specifics.
>
>   -j
>
> --
> Jay A. Kreibich < J A Y  @  K R E I B I.C H >
>
> "Our opponent is an alien starship packed with atomic bombs.  We have
>  a protractor."   "I'll go home and see if I can scrounge up a ruler
>  and a piece of string."  --from Anathem by Neal Stephenson
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] select intersecting intervals

2010-05-13 Thread Jay A. Kreibich
On Wed, May 12, 2010 at 06:00:29PM +0200, Jan Asselman scratched on the wall:
> Hi,
> 
> Given the following table with large row count 'row_count':
> 
> CREATE TABLE table
> (
>   i_name  TEXT,   
>   i_from  INTEGER,
>   i_toINTEGER,
>   i_data  BLOB
> )
> 
> I am wondering what would be the fastest way to get all rows with a
> given name 'myname' that intersect with a given interval [a, b]?
> 
> CREATE INDEX idx_from ON table (i_name, i_from);
> CREATE INDEX idx_to ON table (i_name, i_to);

  The query is only going to be able to use one of these.

> I know this is exactly what a one dimensional R-tree index is used for,
> but my project requires 64 bit integer minimum- and maximum-value
> pairs...

  True, although R-trees become much more useful when you get past one
  dimension.

> All suggestions or corrections are appreciated.

  Assuming i_name is somewhat unique, why wouldn't you just create an
  index over (i_name, i_from, i_to)?  If i_name isn't very unique, mix
  up the  order a bit.
  
  You have three basic conditions, and they're all AND'ed together.
  Just build an index that each condition can walk through.

  Or am I missing something?  I know there are some odd rules about how
  SQLite will use (or won't use) indexes for greater-than/less-than
  conditions, but I don't remember the specifics.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs.  We have
 a protractor."   "I'll go home and see if I can scrounge up a ruler
 and a piece of string."  --from Anathem by Neal Stephenson
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] select intersecting intervals

2010-05-13 Thread Tim Romano
Right, Jean-Christophe, about "moving too much data around". You'd want each
of the inner selects to produce a vector of integer ids. "select *" is
clearly not going to be very efficient.  200K rows is a very small table for
this sort of test, in my view.  Were the faster intereections any faster
than:

select * from T
where lo ... {low condition}
and hi ..{ hi condition}
and name = 'x'

where only one index is chosen?  The answer would depend on the cardinality.
  If [name] is highly differentiated then name='x' would return very few
rows and the query is fast using standard approach.  But if [name] were,
say, FirstName, and there were hundreds of thousands of records containing
"Joe" in the OP's table, then the name index won't help us very much.
 That's why I am asking about Minimal Perfect Hash and intersection of
vectors.

The Minimal-Perfect-Hash-INTERSECTION-OF-VECTORS approach might benefit
queries against tables having several million rows. What I'm wondering (and
lack the C skills to find out for myself) is whether SQLite's underlying
algorithms for INTERSECT could be optimized with a minimal perfect hash
approach. The algorithms would decide which vector of ids is the better
candidate to be used for the MPH and which is the better candidate to be
iterated, and then iterate over the one vector of ids, testing for the
existence of each id in the MPH using the optimized Exists() function
supplied by the mph library for the particular type of mph being used.

The question, in scenarios where the vectors contain many items, is whether
the overhead of creating the MPH and testing for existence is significantly
less than the overhead of doing whatever INTERSECT is doing now when it
intersects vectors of ids.  You have a ready-made acronym to advertise the
speed if it turns out to be faster: MPH.  ;-)

Regards
Tim Romano
Swarthmore PA

On Wed, May 12, 2010 at 8:52 PM, Jean-Christophe Deschamps 
wrote:

>
> >
> >I would first create an INTEGER primary key and then place an index on
> >name,
> >another on i_from, and another on i_to, and then see if the approach below
> >has any benefit.
> >
> >When I tried this with a geo-queryit was actually slower than the standard
> >select, and I'm curious if that's always going to be the case. It will
> >come
> >down to how efficient the INTERSECT of the vectors of integers is. Each
> >vector will have been the result of an index-scan.  If INTERSECT were
> >optimized (perhaps with a minimal perfect hash function
> >http://cmph.sourceforge.net/index.html) this approach might be useful.
>
>
> All three following queries use only simple indexes (PK, name, lo, hi).
>
> Query#1:
> select * from tst where lo < 345678
> intersect
> select * from tst where hi > 123456
> intersect
> select * from tst where name = 'aaelj';
>
> Query#2
> select * from tst
> join (
> select rowid from tst where lo < 345678
> intersect
> select rowid from tst where hi > 123456
> ) as interval
> on tst.rowid = interval.rowid and name = 'aaelj';
>
> Query#3
> select * from tst
> join (
> select rowid from tst where lo < 345678
> intersect
> select rowid from tst where hi > 123456
> intersect
> select rowid from tst where name = 'aaelj'
> ) as interval
> on tst.rowid = interval.rowid;
>
> On a 200K-row test table with random data, queries #2 and #3 were
> essentially identical while #1 was twice slower (moving too much data
> around, uselessly).
>
>
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] select intersecting intervals

2010-05-12 Thread Jean-Christophe Deschamps

>
>I would first create an INTEGER primary key and then place an index on 
>name,
>another on i_from, and another on i_to, and then see if the approach below
>has any benefit.
>
>When I tried this with a geo-queryit was actually slower than the standard
>select, and I'm curious if that's always going to be the case. It will 
>come
>down to how efficient the INTERSECT of the vectors of integers is. Each
>vector will have been the result of an index-scan.  If INTERSECT were
>optimized (perhaps with a minimal perfect hash function
>http://cmph.sourceforge.net/index.html) this approach might be useful.


All three following queries use only simple indexes (PK, name, lo, hi).

Query#1:
select * from tst where lo < 345678
 intersect
select * from tst where hi > 123456
 intersect
select * from tst where name = 'aaelj';

Query#2
select * from tst
 join (
 select rowid from tst where lo < 345678
 intersect
 select rowid from tst where hi > 123456
 ) as interval
 on tst.rowid = interval.rowid and name = 'aaelj';

Query#3
select * from tst
 join (
 select rowid from tst where lo < 345678
 intersect
 select rowid from tst where hi > 123456
 intersect
 select rowid from tst where name = 'aaelj'
 ) as interval
 on tst.rowid = interval.rowid;

On a 200K-row test table with random data, queries #2 and #3 were 
essentially identical while #1 was twice slower (moving too much data 
around, uselessly).


___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] select intersecting intervals

2010-05-12 Thread Tim Romano
I would first create an INTEGER primary key and then place an index on name,
another on i_from, and another on i_to, and then see if the approach below
has any benefit.

When I tried this with a geo-queryit was actually slower than the standard
select, and I'm curious if that's always going to be the case. It will come
down to how efficient the INTERSECT of the vectors of integers is. Each
vector will have been the result of an index-scan.  If INTERSECT were
optimized (perhaps with a minimal perfect hash function
http://cmph.sourceforge.net/index.html) this approach might be useful.


select * from T
JOIN
(
select pk_col from T where i_from > ?
intersect
select pk_col from T where i_to < ?
) as DESIREDINTERVAL
ON T.pk_col = DESIREDINTERVAL.pk_col
and T.name = ?


Regards
Tim Romano
Swarthmore PA

On Wed, May 12, 2010 at 12:00 PM, Jan Asselman  wrote:

> Hi,
>
> Given the following table with large row count 'row_count':
>
> CREATE TABLE table
> (
>i_name  TEXT,
>i_from  INTEGER,
>i_toINTEGER,
>i_data  BLOB
> )
>
> I am wondering what would be the fastest way to get all rows with a
> given name 'myname' that intersect with a given interval [a, b]?
>
>
> CREATE INDEX idx_from ON table (i_name, i_from);
> CREATE INDEX idx_to ON table (i_name, i_to);
>
> SELECT data FROM table WHERE name = 'myname' AND i_from < b AND i_to > a
>
>-> index idx_from will be used
>-> in worst case (a is larger than all i_to) all 'myname' rows
>   will be traversed before concluding result set is empty
>
> SELECT data FROM table WHERE name = 'myname' AND i_to > a AND i_from < b
>
>-> index idx_to will be used
>-> in worst case (b is smaller than all i_from) all 'myname'
> rows
>   will be traversed before concluding result set is empty
>
>
>
> I know this is exactly what a one dimensional R-tree index is used for,
> but my project requires 64 bit integer minimum- and maximum-value
> pairs...
>
> All suggestions or corrections are appreciated.
>
>
> Thanks in advance,
>
> Jan Asselman
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] select intersecting intervals

2010-05-12 Thread Jan Asselman
Hi,

Given the following table with large row count 'row_count':

CREATE TABLE table
(
i_name  TEXT,   
i_from  INTEGER,
i_toINTEGER,
i_data  BLOB
)

I am wondering what would be the fastest way to get all rows with a
given name 'myname' that intersect with a given interval [a, b]?


CREATE INDEX idx_from ON table (i_name, i_from);
CREATE INDEX idx_to ON table (i_name, i_to);

SELECT data FROM table WHERE name = 'myname' AND i_from < b AND i_to > a

-> index idx_from will be used
-> in worst case (a is larger than all i_to) all 'myname' rows
   will be traversed before concluding result set is empty

SELECT data FROM table WHERE name = 'myname' AND i_to > a AND i_from < b

-> index idx_to will be used
-> in worst case (b is smaller than all i_from) all 'myname'
rows
   will be traversed before concluding result set is empty



I know this is exactly what a one dimensional R-tree index is used for,
but my project requires 64 bit integer minimum- and maximum-value
pairs...

All suggestions or corrections are appreciated.


Thanks in advance,

Jan Asselman
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users