Re: [HACKERS] Using multidimensional indexes in ordinal queries
On Tue, Jun 22, 2010 at 1:58 AM, Robert Haas robertmh...@gmail.com wrote: It doesn't? I didn't think it was making any assumptions about the ordering data type beyond the fact that it had a default btree opclass.

Actually, the return type of the consistent method was replaced by float8: negative values indicate an inconsistent state, while non-negative values indicate consistency and provide the ordering.
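The convention described here can be sketched in Python (a hypothetical illustration only; the real GiST consistent method is a C-level index support routine, and the names below are invented for the sketch):

```python
# Hypothetical sketch of the float8 "consistent" convention described
# above: a negative return marks an entry as inconsistent (pruned),
# while a non-negative return marks it consistent and doubles as its
# ordering key for the KNN-style scan.
def consistent(entry_value, query_min, query_max):
    if entry_value < query_min or entry_value > query_max:
        return -1.0               # inconsistent: prune this entry
    return entry_value            # consistent: non-negative ordering key

# Entries surviving the filter, ordered by the same return value:
matches = [v for v in (5.0, 15.0, 7.5) if consistent(v, 0.0, 10.0) >= 0.0]
matches.sort(key=lambda v: consistent(v, 0.0, 10.0))
```

Note that returning the value itself as the ordering key is exactly why negative values are problematic under this convention: a negative but in-range value would be indistinguishable from "inconsistent".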
Re: [HACKERS] Using multidimensional indexes in ordinal queries
On 6 June 2010 21:04, Alexander Korotkov aekorot...@gmail.com wrote:

Hello hackers,

I would like to share some of my thoughts about the use of multidimensional indexes for queries that deal with ordinal unidimensional data types. I think that gist indexes (especially with knngist) can produce great benefit for complex multi-criterion queries.

Let's consider an example. I use postgresql-9.0beta1 with the knngist patch. I have also created a simple patch that allows knngist to be used for ordinal sorting in the cube extension (patch is attached). The * operator was introduced in my patch. The first operand is the cube and the second operand is a number n. If n = 2*k, ascending ordering by dimension k occurs; if n = 2*k + 1, descending ordering by dimension k occurs. For now this operator has a limitation and works only with non-negative coordinate values.

Let's create a table with 3 floating-point columns and fill it with 10M rows:

create table test (id serial primary key, v1 double precision, v2 double precision, v3 double precision);
insert into test (v1,v2,v3) (select random()*1000, random()*1000, random()*1000 from generate_series(1,10000000,1));

Now, let's create 3 separate btree indexes and one gist cube index:

create index test_v1_idx on test(v1);
create index test_v2_idx on test(v2);
create index test_v3_idx on test(v3);
create index test_cube_idx on test using gist(cube(ARRAY[v1,v2,v3]));

Let's consider a complex query with filtering, ordering and limit.
test=# select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
    id    |        v1        |        v2        |        v3
----------+------------------+------------------+-------------------
 12283631 | 485.982828773558 | 496.795611456037 | 0.213871244341135
  4936086 | 497.239370364696 | 491.878624074161 |  1.26481195911765
  8963067 | 484.963194001466 | 497.094289399683 |  1.30057940259576
 12435440 | 498.670902103186 | 498.667187988758 |  1.33110675960779
 11667415 | 494.398592971265 | 497.440234292299 |  1.44533207640052
  8530558 |  482.85893118009 | 496.267838869244 |  1.48530444130301
  4004942 | 483.679085504264 | 489.547223784029 |  1.57393841072917
 14897796 |  491.37338064611 |     487.47524273 |  1.81775307282805
  4105759 | 489.506138022989 |  486.91446846351 |  1.94038823246956
 12895656 | 499.508572742343 | 487.065799534321 |  2.34963605180383
(10 rows)

test=# explain analyze select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=22786.73..22786.75 rows=10 width=28) (actual time=3242.135..3242.162 rows=10 loops=1)
   ->  Sort  (cost=22786.73..22797.59 rows=4345 width=28) (actual time=3242.131..3242.144 rows=10 loops=1)
         Sort Key: v3
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on test  (cost=8755.91..22692.83 rows=4345 width=28) (actual time=1281.030..3234.934 rows=4027 loops=1)
               Recheck Cond: ((v1 >= 480::double precision) AND (v1 <= 500::double precision) AND (v2 >= 480::double precision) AND (v2 <= 500::double precision))
               ->  BitmapAnd  (cost=8755.91..8755.91 rows=4345 width=0) (actual time=1280.783..1280.783 rows=0 loops=1)
                     ->  Bitmap Index Scan on test_v1_idx  (cost=0.00..4243.12 rows=202177 width=0) (actual time=644.702..644.702 rows=200715 loops=1)
                           Index Cond: ((v1 >= 480::double precision) AND (v1 <= 500::double precision))
                     ->  Bitmap Index Scan on test_v2_idx  (cost=0.00..4510.37 rows=214902 width=0) (actual time=630.085..630.085 rows=200200 loops=1)
                           Index Cond: ((v2 >= 480::double precision) AND (v2 <= 500::double precision))
 Total runtime: 3242.253 ms
(12 rows)

This query can be rewritten in order to let the planner use the gist cube index:

test=# select * from test where cube(array[v1,v2,v3]) <@ cube(array[480,480,'-Infinity'::float],array[500,500,'+Infinity'::float]) order by cube(array[v1,v2,v3]) * 4 limit 10;
    id    |        v1        |        v2        |        v3
----------+------------------+------------------+-------------------
 12283631 | 485.982828773558 | 496.795611456037 | 0.213871244341135
  4936086 | 497.239370364696 | 491.878624074161 |  1.26481195911765
  8963067 | 484.963194001466 | 497.094289399683 |  1.30057940259576
 12435440 | 498.670902103186 | 498.667187988758 |  1.33110675960779
 11667415 | 494.398592971265 | 497.440234292299 |  1.44533207640052
  8530558 |  482.85893118009 | 496.267838869244 |  1.48530444130301
  4004942 | 483.679085504264 | 489.547223784029 |  1.57393841072917
 14897796 |  491.37338064611 |     487.47524273 |  1.81775307282805
  4105759 | 489.506138022989 |  486.91446846351 |  1.94038823246956
 12895656 | 499.508572742343 | 487.065799534321 |  2.34963605180383
(10 rows)
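The semantics of the * ordering operator used above can be sketched in Python (the names below are illustrative; the real operator is implemented in C inside the cube extension):

```python
# Illustrative sketch (not the extension's C code) of the ordering that
# the proposed cube "*" operator encodes: the right operand n selects
# dimension k = n // 2; even n means ascending, odd n means descending.
def ordering_key(point, n):
    k, descending = divmod(n, 2)
    coord = point[k]
    # Descending order is emulated here by negating the coordinate; a
    # KNN-GiST distance must be non-negative, which is a plausible reason
    # the patch currently supports only non-negative coordinate values.
    return -coord if descending else coord

rows = [(485.98, 496.80, 0.21), (497.24, 491.88, 1.26), (498.67, 498.67, 1.33)]
# "cube * 4": n = 4 selects dimension 4 // 2 = 2 ascending, i.e. ORDER BY v3.
by_v3 = sorted(rows, key=lambda p: ordering_key(p, 4))
# "cube * 5": n = 5 selects dimension 2 descending.
by_v3_desc = sorted(rows, key=lambda p: ordering_key(p, 5))
```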
Re: [HACKERS] Using multidimensional indexes in ordinal queries
On Mon, Jun 21, 2010 at 5:19 AM, Thom Brown thombr...@gmail.com wrote: I can't answer this, but is anyone else able to provide Alexander some feedback?

It seems like you can get more or less the same benefit from a multicolumn btree index. On my system, with the individual btree indices, the query ran in 7625 ms; with an additional index on (v1, v2, v3), it ran in 94 ms. I didn't get the same plans as Alexander did, though, so it may not really be apples to apples. See attached session trace.

Having said that, I'm very interested in hearing what other ideas people have for using indices to speed up ORDER BY operations. Currently, we support only ORDER BY indexed-value. KNNGIST will allow ORDER BY indexed-value op constant, but why stop there? In theory, an index might know how to order the data by any arbitrary expression the user might happen to enter. If the user asks for ORDER BY (indexed-value op1 constant1) op2 constant2, who is to say that we can't use an index scan to get that ordering quickly? (As a trivial case, suppose both ops are +, but there could easily be more interesting ones.) Or what about ORDER BY somefunc(indexed-value)? The trouble is that it's hard to think of a way of teaching the planner about these cases without hard-coding lots and lots of special-case kludges into the planner. Still, if someone has a clever idea...
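The idea of reusing an index scan for an ORDER BY over an expression of the indexed value hinges on monotonicity, which a toy sketch makes concrete:

```python
# A btree index yields rows in order of the indexed value; any strictly
# monotonic expression over it (e.g. indexed_value + constant) induces
# the same ordering, so in principle the same index scan could satisfy
# that ORDER BY too.
indexed = sorted([3.0, 1.0, 2.0])            # order an index scan would return
shifted = [v + 10.0 for v in indexed]        # ORDER BY indexed_value + 10

# A non-monotonic somefunc() breaks this, which is one reason teaching
# the planner about arbitrary expressions is hard:
squared = [(v - 2.0) ** 2 for v in indexed]  # ORDER BY (indexed_value - 2)^2
```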
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company

rhaas=# create table test (id serial primary key, v1 double precision, v2 double precision, v3 double precision);
NOTICE:  CREATE TABLE will create implicit sequence "test_id_seq" for serial column "test.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "test_pkey" for table "test"
CREATE TABLE
rhaas=# insert into test (v1,v2,v3) (select random()*1000, random()*1000, random()*1000 from generate_series(1,10000000,1));
INSERT 0 10000000
rhaas=# select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
   id    |        v1        |        v2        |         v3
---------+------------------+------------------+--------------------
 7707160 | 499.468160793185 | 497.105565853417 | 0.0262199901044369
 9836934 | 488.509620074183 | 489.591513760388 | 0.0412175431847572
 8299674 | 488.991918507963 | 494.098918512464 |  0.524408183991909
 4962322 | 484.455766621977 | 496.633755043149 |  0.686612911522388
 5253466 | 493.753412738442 | 481.965465471148 |  0.731946900486946
 3642389 |  496.36858003214 | 483.764411881566 |  0.750890467315912
 3332916 | 486.504513770342 | 492.682197596878 |  0.930964481085539
 3063189 | 483.532963320613 | 481.065005529672 |    1.0284804739058
 6368341 | 493.383017368615 | 497.755419462919 |   1.20219821110368
 1587774 |  496.25625833869 | 484.364923555404 |   1.58307375386357
(10 rows)
rhaas=# explain analyze select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=273617.93..273617.96 rows=10 width=28) (actual time=7625.039..7625.042 rows=10 loops=1)
   ->  Sort  (cost=273617.93..273627.92 rows=3995 width=28) (actual time=7625.037..7625.039 rows=10 loops=1)
         Sort Key: v3
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Seq Scan on test  (cost=0.00..273531.60 rows=3995 width=28) (actual time=5.934..7620.583 rows=3995 loops=1)
               Filter: ((v1 >= 480::double precision) AND (v1 <= 500::double precision) AND (v2 >= 480::double precision) AND (v2 <= 500::double precision))
 Total runtime: 7625.101 ms
(7 rows)
rhaas=# create index x on test(v1,v2,v3);
CREATE INDEX
rhaas=# explain analyze select * from test where v1 between 480 and 500 and v2 between 480 and 500 order by v3 limit 10;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19892.64..19892.66 rows=10 width=28) (actual time=73.735..73.738 rows=10 loops=1)
   ->  Sort  (cost=19892.64..19902.62 rows=3995 width=28) (actual time=73.733..73.735 rows=10 loops=1)
         Sort Key: v3
         Sort Method:  top-N heapsort  Memory: 25kB
         ->  Bitmap Heap Scan on test  (cost=6850.60..19806.31 rows=3995 width=28) (actual time=34.682..72.238 rows=3995 loops=1)
               Recheck Cond: ((v1 >= 480::double precision) AND (v1 <= 500::double precision) AND (v2 >= 480::double precision) AND (v2 <= 500::double precision))
               ->  Bitmap Index Scan on x
Re: [HACKERS] Using multidimensional indexes in ordinal queries
On Mon, Jun 21, 2010 at 5:42 PM, Robert Haas robertmh...@gmail.com wrote: It seems like you can get more or less the same benefit from a multicolumn btree index. On my system, with the individual btree indices, the query ran in 7625 ms; with an additional index on (v1, v2, v3), it ran in 94 ms. I didn't get the same plans as Alexander did, though, so it may not really be apples to apples. See attached session trace.

The benefit of the multicolumn btree index was more or less the same as the cube benefit because of very bad picksplit behavior in this case. I have attached a patch which significantly improves cube index search performance:

test=# explain (analyze, buffers) select * from test where cube(ARRAY[v1,v2,v3]) <@ cube(ARRAY[480,480,'-inf'::float8], ARRAY[500,500,'+inf'::float8]) order by cube(ARRAY[v1,v2,v3]) * 4 LIMIT 10;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..38.07 rows=10 width=28) (actual time=0.495..0.570 rows=10 loops=1)
   Buffers: shared hit=21
   ->  Index Scan using test_cube_idx on test  (cost=0.00..38064.52 rows=1 width=28) (actual time=0.489..0.537 rows=10 loops=1)
         Index Cond: (cube(ARRAY[v1, v2, v3]) <@ '(480, 480, -inf),(500, 500, inf)'::cube)
         Sort Cond: (cube(ARRAY[v1, v2, v3]) * 4)
         Buffers: shared hit=21
 Total runtime: 0.659 ms
(7 rows)

For now this patch greatly increases tree construction time, but I believe that a picksplit implementation that is good enough for both tree search and tree construction can be found.

The trouble is that it's hard to think of a way of teaching the planner about these cases without hard-coding lots and lots of special-case kludges into the planner. Still, if someone has a clever idea...

I think that two things can be done to improve the situation:
1) Make knngist deal with negative values. I think this will make it easier to use knngist just for sorting, not only for k-neighbor searching.
2) Let gist interface methods take care of multicolumn indexes.
I think that if the cube index from the example above were constructed on the separate columns v1, v2, v3, it would be easier for the planner to use the cube index for queries with filters on these columns. I don't know exactly how to do that.

[Attachment: cube.gz (GNU Zip compressed data)]

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
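The picksplit trade-off discussed in this message can be illustrated with a toy split routine (an assumption-laden sketch, not the patch's actual algorithm): sorting entries along the dimension of greatest spread and splitting at the median tends to give compact, low-overlap pages for point data, at the cost of extra sorting work during tree construction.

```python
# Toy picksplit sketch: distribute index entries (points) into two pages
# by sorting along the dimension with the greatest spread and splitting
# at the median. Real GiST picksplit works on bounding boxes in C; the
# names and structure here are illustrative only.
def picksplit(entries):
    dims = len(entries[0])
    spread = [max(e[d] for e in entries) - min(e[d] for e in entries)
              for d in range(dims)]
    d = spread.index(max(spread))            # dimension of maximal spread
    ordered = sorted(entries, key=lambda e: e[d])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]      # left page, right page

# With equal spread in both dimensions, ties go to dimension 0, so the
# split separates small-x entries from large-x entries:
left, right = picksplit([(0, 9), (8, 1), (1, 8), (9, 0)])
```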
Re: [HACKERS] Using multidimensional indexes in ordinal queries
On Mon, Jun 21, 2010 at 5:20 PM, Alexander Korotkov aekorot...@gmail.com wrote: 1) Make knngist deal with negative values. I think this will make it easier to use knngist just for sorting, not only for k-neighbor searching.

It doesn't? I didn't think it was making any assumptions about the ordering data type beyond the fact that it had a default btree opclass.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company