Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-22 Thread Alexander Korotkov
 On Tue, Jun 22, 2010 at 1:58 AM, Robert Haas robertmh...@gmail.com wrote:

 It doesn't?   I didn't think it was making any assumptions about the
 ordering data type beyond the fact that it had a default btree
 opclass.

Actually, the return type of the consistent method was replaced with float8.
Negative values indicate an inconsistent state; non-negative values indicate a
consistent state and are used for ordering.


Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-21 Thread Thom Brown
On 6 June 2010 21:04, Alexander Korotkov aekorot...@gmail.com wrote:
 Hello hackers,
 I would like to share some of my thoughts about the use of multidimensional
 indexes for queries that deal with ordinal unidimensional data types. I
 think that gist indexes (especially with knngist) can produce a great benefit
 for complex multi-criterion queries.
 Let's consider an example. I use postgresql-9.0beta1 with the knngist patch.
 I have also created a simple patch that allows knngist to be used for ordinal
 sorting in the cube extension (patch is attached). The * operator was
 introduced in my patch. The first operand is the cube and the second operand
 is the number n. If n = 2*k, then ascending ordering by the k-th dimension
 occurs; if n = 2*k + 1, then descending ordering by the k-th dimension occurs.
 Currently this operator has a limitation and works only with non-negative
 coordinate values.
 Let's create a table with 3 floating-point columns and fill it with 10M rows:
 create table test (id serial primary key, v1 double precision, v2 double
 precision, v3 double precision);
 insert into test (v1,v2,v3) (select random()*1000, random()*1000,
 random()*1000 from generate_series(1,10000000,1));
 Now, let's create 3 separate btree indexes and one gist cube index.
 create index test_v1_idx on test(v1);
 create index test_v2_idx on test(v2);
 create index test_v3_idx on test(v3);
 create index test_cube_idx on test using gist(cube(ARRAY[v1,v2,v3]));
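 For reference, here is a small sketch of how the second operand of the *
 operator described above maps to a sort order, assuming the patched cube
 extension from this thread (with cube(array[v1,v2,v3]), k = 2 corresponds to
 v3, as in the rewritten query further down in this message):
 -- n = 2*k asks for ascending order by dimension k; n = 2*k + 1 for descending
 select id, v3 from test order by cube(array[v1,v2,v3]) * 4 limit 10;  -- v3 ascending
 select id, v3 from test order by cube(array[v1,v2,v3]) * 5 limit 10;  -- v3 descending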
 Let's consider a complex query with filtering, ordering and a limit.
 test=# select * from test where v1 between 480 and 500 and v2 between 480
 and 500 order by v3 limit 10;
     id    |        v1        |        v2        |        v3
 ----------+------------------+------------------+-------------------
  12283631 | 485.982828773558 | 496.795611456037 | 0.213871244341135
   4936086 | 497.239370364696 | 491.878624074161 |  1.26481195911765
   8963067 | 484.963194001466 | 497.094289399683 |  1.30057940259576
  12435440 | 498.670902103186 | 498.667187988758 |  1.33110675960779
  11667415 | 494.398592971265 | 497.440234292299 |  1.44533207640052
   8530558 |  482.85893118009 | 496.267838869244 |  1.48530444130301
   4004942 | 483.679085504264 | 489.547223784029 |  1.57393841072917
  14897796 |  491.37338064611 | 487.47524273 |  1.81775307282805
   4105759 | 489.506138022989 |  486.91446846351 |  1.94038823246956
  12895656 | 499.508572742343 | 487.065799534321 |  2.34963605180383
 (10 rows)
 test=# explain analyze select * from test where v1 between 480 and 500 and
 v2 between 480 and 500 order by v3 limit 10;

                                                           QUERY PLAN
 ------------------------------------------------------------------------------------------------------------------------
  Limit  (cost=22786.73..22786.75 rows=10 width=28) (actual
 time=3242.135..3242.162 rows=10 loops=1)
    ->  Sort  (cost=22786.73..22797.59 rows=4345 width=28) (actual
 time=3242.131..3242.144 rows=10 loops=1)
           Sort Key: v3
           Sort Method:  top-N heapsort  Memory: 25kB
           ->  Bitmap Heap Scan on test  (cost=8755.91..22692.83 rows=4345
 width=28) (actual time=1281.030..3234.934 rows=4027 loops=1)
                 Recheck Cond: ((v1 >= 480::double precision) AND (v1 <=
 500::double precision) AND (v2 >= 480::double precision) AND (v2 <=
 500::double precision))
                 ->  BitmapAnd  (cost=8755.91..8755.91 rows=4345 width=0)
 (actual time=1280.783..1280.783 rows=0 loops=1)
                       ->  Bitmap Index Scan on test_v1_idx
  (cost=0.00..4243.12 rows=202177 width=0) (actual time=644.702..644.702
 rows=200715 loops=1)
                             Index Cond: ((v1 >= 480::double precision) AND
 (v1 <= 500::double precision))
                       ->  Bitmap Index Scan on test_v2_idx
  (cost=0.00..4510.37 rows=214902 width=0) (actual time=630.085..630.085
 rows=200200 loops=1)
                             Index Cond: ((v2 >= 480::double precision) AND
 (v2 <= 500::double precision))
  Total runtime: 3242.253 ms
 (12 rows)
 This query can be rewritten to let the planner use the gist cube index.
 test=# select * from test where cube(array[v1,v2,v3]) <@
 cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float])
 order by cube(array[v1,v2,v3]) * 4 limit 10;
     id    |        v1        |        v2        |        v3
 ----------+------------------+------------------+-------------------
  12283631 | 485.982828773558 | 496.795611456037 | 0.213871244341135
   4936086 | 497.239370364696 | 491.878624074161 |  1.26481195911765
   8963067 | 484.963194001466 | 497.094289399683 |  1.30057940259576
  12435440 | 498.670902103186 | 498.667187988758 |  1.33110675960779
  11667415 | 494.398592971265 | 497.440234292299 |  1.44533207640052
   8530558 |  482.85893118009 | 496.267838869244 |  1.48530444130301
   4004942 | 483.679085504264 | 489.547223784029 |  1.57393841072917
  14897796 |  491.37338064611 | 487.47524273 |  1.81775307282805
   4105759 | 489.506138022989 |  486.91446846351 |  

Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-21 Thread Robert Haas
On Mon, Jun 21, 2010 at 5:19 AM, Thom Brown thombr...@gmail.com wrote:
 I can't answer this, but is anyone else able to provide Alexander some 
 feedback?

It seems like you can get more or less the same benefit from a
multicolumn btree index.  On my system, with the individual btree
indices, the query ran in 7625 ms; with an additional index on (v1,
v2, v3), it ran in 94 ms.  I didn't get the same plans as Alexander
did, though, so it may not really be apples to apples.  See attached
session trace.

Having said that, I'm very interested in hearing what other ideas
people have for using indices to speed up ORDER BY operations.
Currently, we support only ORDER BY <indexed-value>.  KNNGIST will
allow ORDER BY <indexed-value> <op> <constant>, but why stop there?
In theory, an index might know how to order the data by any arbitrary
expression the user might happen to enter.  If the user asks for
ORDER BY (<indexed-value> <op1> <constant1>) <op2> <constant2>, who
is to say that we can't use an index scan to get that ordering
quickly?  (As a trivial case, suppose both ops are +, but there could
easily be more interesting ones.)  Or what about ORDER BY
somefunc(<indexed-value>)?  The trouble is that it's hard to think of
a way of teaching the planner about these cases without hard-coding
lots and lots of special-case kludges into the planner.  Still, if
someone has a clever idea...
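
For concreteness, a couple of hypothetical queries of the shapes sketched
above (illustrative only; the expressions here are just examples, and today
they get an explicit sort rather than an index-provided ordering):

-- ORDER BY (<indexed-value> <op1> <constant1>) <op2> <constant2>
select * from test order by (v3 + 10) * 2 limit 10;
-- ORDER BY somefunc(<indexed-value>)
select * from test order by abs(v3 - 500) limit 10;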

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
rhaas=# create table test (id serial primary key, v1 double precision, v2 
double precision, v3 double precision);
NOTICE:  CREATE TABLE will create implicit sequence "test_id_seq" for serial 
column "test.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for 
table "test"
CREATE TABLE
rhaas=# insert into test (v1,v2,v3) (select random()*1000, random()*1000, 
random()*1000 from generate_series(1,10000000,1));
INSERT 0 10000000
rhaas=# 
rhaas=# select * from test where v1 between 480 and 500 and v2 between 480 and 
500 order by v3 limit 10;
   id    |        v1        |        v2        |         v3
---------+------------------+------------------+---------------------
 7707160 | 499.468160793185 | 497.105565853417 | 0.0262199901044369
 9836934 | 488.509620074183 | 489.591513760388 | 0.0412175431847572
 8299674 | 488.991918507963 | 494.098918512464 |  0.524408183991909
 4962322 | 484.455766621977 | 496.633755043149 |  0.686612911522388
 5253466 | 493.753412738442 | 481.965465471148 |  0.731946900486946
 3642389 |  496.36858003214 | 483.764411881566 |  0.750890467315912
 3332916 | 486.504513770342 | 492.682197596878 |  0.930964481085539
 3063189 | 483.532963320613 | 481.065005529672 |1.0284804739058
 6368341 | 493.383017368615 | 497.755419462919 |   1.20219821110368
 1587774 |  496.25625833869 | 484.364923555404 |   1.58307375386357
(10 rows)

rhaas=# explain analyze select * from test where v1 between 480 and 500 and v2 
between 480 and 500 order by v3 limit 10;
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=273617.93..273617.96 rows=10 width=28) (actual 
time=7625.039..7625.042 rows=10 loops=1)
   ->  Sort  (cost=273617.93..273627.92 rows=3995 width=28) (actual 
time=7625.037..7625.039 rows=10 loops=1)
         Sort Key: v3
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Seq Scan on test  (cost=0.00..273531.60 rows=3995 width=28) 
(actual time=5.934..7620.583 rows=3995 loops=1)
               Filter: ((v1 >= 480::double precision) AND (v1 <= 500::double 
precision) AND (v2 >= 480::double precision) AND (v2 <= 500::double precision))
 Total runtime: 7625.101 ms
(7 rows)

rhaas=# create index x on test(v1,v2,v3);
CREATE INDEX
rhaas=# explain analyze select * from test where v1 between 480 and 500 and v2 
between 480 and 500 order by v3 limit 10;
  
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19892.64..19892.66 rows=10 width=28) (actual time=73.735..73.738 
rows=10 loops=1)
   ->  Sort  (cost=19892.64..19902.62 rows=3995 width=28) (actual 
time=73.733..73.735 rows=10 loops=1)
         Sort Key: v3
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on test  (cost=6850.60..19806.31 rows=3995 
width=28) (actual time=34.682..72.238 rows=3995 loops=1)
               Recheck Cond: ((v1 >= 480::double precision) AND (v1 <= 
500::double precision) AND (v2 >= 480::double precision) AND (v2 <= 500::double 
precision))
               ->  Bitmap Index Scan on x

Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-21 Thread Alexander Korotkov
On Mon, Jun 21, 2010 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote:

 It seems like you can get more or less the same benefit from a
 multicolumn btree index.  On my system, with the individual btree
 indices, the query ran in 7625 ms; with an additional index on (v1,
 v2, v3), it ran in 94 ms.  I didn't get the same plans as Alexander
 did, though, so it may not really be apples to apples.  See attached
 session trace.

The benefit of the multicolumn btree index was more or less the same as the
cube benefit because of very bad picksplit behavior in this case. I have
attached a patch which significantly improves cube index search performance:

test=# explain (analyze, buffers) select * from test where
cube(ARRAY[v1,v2,v3]) <@ cube(ARRAY[480,480,'-inf'::float8],
ARRAY[500,500,'+inf'::float8]) order by cube(ARRAY[v1,v2,v3]) * 4 LIMIT
10;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..38.07 rows=10 width=28) (actual time=0.495..0.570
rows=10 loops=1)
   Buffers: shared hit=21
   ->  Index Scan using test_cube_idx on test  (cost=0.00..38064.52
rows=1 width=28) (actual time=0.489..0.537 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500,
500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) * 4)
         Buffers: shared hit=21
 Total runtime: 0.659 ms
(7 rows)

Currently this patch greatly increases tree construction time, but I believe
that a picksplit implementation that is good enough for both tree search and
tree construction can be found.
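
As a rough way to see both sides of that tradeoff, one can simply rebuild the
index under \timing and re-run the query above (a sketch, using the same table
and query as in this thread):

\timing on
drop index if exists test_cube_idx;
-- construction time is reported by \timing, search time by explain analyze
create index test_cube_idx on test using gist (cube(array[v1,v2,v3]));
explain (analyze, buffers) select * from test
  where cube(array[v1,v2,v3]) <@ cube(array[480,480,'-inf'::float8],
                                      array[500,500,'+inf'::float8])
  order by cube(array[v1,v2,v3]) * 4 limit 10;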


 The trouble is that it's hard to think of
 a way of teaching the planner about these cases without hard-coding
 lots and lots of special-case kludges into the planner.  Still, if
 someone has a clever idea...


I think that two things can be done to improve the situation:
1) Make knngist deal with negative values. I think this will make it easier
to use knngist just for sorting, not only for k-neighbor searching.
2) Let gist interface methods take care of multicolumn indexes. I think
that if the cube index from the example above were constructed on the separate
columns v1, v2, v3, it would be easier for the planner to use the cube index for
queries with filters on these columns. I don't know exactly how to do that.
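
To illustrate point 2, here is a sketch contrasting the expression index used
above with the kind of declaration that idea seems to imply (the second form is
hypothetical and is not supported by the current patch; it assumes per-column
handling that would let the planner match plain filters on v1, v2, v3 against a
single gist index):

-- what the example above actually builds: one cube expression over three columns
create index test_cube_idx on test using gist (cube(array[v1,v2,v3]));
-- hypothetical multicolumn form implied by point 2 (commented out; not supported):
-- create index test_cube_multicol_idx on test using gist (v1, v2, v3);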


cube.gz
Description: GNU Zip compressed data



Re: [HACKERS] Using multidimensional indexes in ordinal queries

2010-06-21 Thread Robert Haas
On Mon, Jun 21, 2010 at 5:20 PM, Alexander Korotkov
aekorot...@gmail.com wrote:
 1) Make knngist deal with negative values. I think this will make it easier
 to use knngist just for sorting, not only for k-neighbor searching.

It doesn't?   I didn't think it was making any assumptions about the
ordering data type beyond the fact that it had a default btree
opclass.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
