On Tue, Nov 16, 2010 at 3:07 AM, Robert Haas <robertmh...@gmail.com> wrote:
> But on a broader note, I'm not very certain the sorting algorithm is
> sensible.  For example, suppose you have 10 segments that are exactly
> '0' and 20 segments that are exactly '1'.  Maybe I'm misunderstanding,
> but it seems like this will result in a 15/15 split when we almost
> certainly want a 10/20 split.  I think there will be problems in more
> complex cases as well.  The documentation says about the less-than and
> greater-than operators that "These operators do not make a lot of
> sense for any practical purpose but sorting."

In order to illustrate a real problem, we should think about GiST behavior
with a large enough amount of data. For example, I tried to extrapolate this
case to 100000 segs, where 40% are (0,1) segs and 60% are (1,2) segs, and it
doesn't seem to be a problem to me. Here are the details.

test=# insert into seg_test (select case when random() < 0.4 then '0..1'::seg
else '1..2'::seg end from generate_series(1,100000));
INSERT 0 100000
Time: 7543,941 ms

test=# create index seg_test_idx on seg_test using gist(a);
CREATE INDEX
Time: 17839,450 ms

test=# explain (analyze, buffers) select count(*) from seg_test where a @> '1.5'::seg;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=344.84..344.85 rows=1 width=0) (actual time=186.192..186.193 rows=1 loops=1)
   Buffers: shared hit=887
   ->  Bitmap Heap Scan on seg_test  (cost=5.05..344.59 rows=100 width=0) (actual time=16.066..102.586 rows=60206 loops=1)
         Recheck Cond: (a @> '1.5'::seg)
         Buffers: shared hit=887
         ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..5.03 rows=100 width=0) (actual time=15.948..15.948 rows=60206 loops=1)
               Index Cond: (a @> '1.5'::seg)
               Buffers: shared hit=396
 Total runtime: 186.306 ms
(9 rows)

test=# explain (analyze, buffers) select count(*) from seg_test where a @> '0.5'::seg;
                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=344.84..344.85 rows=1 width=0) (actual time=144.293..144.295 rows=1 loops=1)
   Buffers: shared hit=754
   ->  Bitmap Heap Scan on seg_test  (cost=5.05..344.59 rows=100 width=0) (actual time=28.926..87.625 rows=39794 loops=1)
         Recheck Cond: (a @> '0.5'::seg)
         Buffers: shared hit=754
         ->  Bitmap Index Scan on seg_test_idx  (cost=0.00..5.03 rows=100 width=0) (actual time=26.969..26.969 rows=39794 loops=1)
               Index Cond: (a @> '0.5'::seg)
               Buffers: shared hit=263
 Total runtime: 144.374 ms
(9 rows)

test=# select pg_relpages('seg_test_idx');
 pg_relpages
-------------
         656
(1 row)

The total number of pages in the index is 656, while the number of pages
touched by the two scans is 396 + 263 = 659: only 3 pages of overhead. We can
see the distribution of segs in the index using gevel.

test=# select a, count(1) from gist_print('seg_test_idx') as t(level int, valid bool, a seg) group by a;
   a    | count
--------+-------
 0 .. 1 | 40054
 0 .. 2 |     2
 1 .. 2 | 60599
(3 rows)

test=# select level, a, count(1) from gist_print('seg_test_idx') as t(level int, valid bool, a seg) group by level, a order by level, a;
 level |   a    | count
-------+--------+-------
     1 | 0 .. 1 |     1
     1 | 0 .. 2 |     1
     1 | 1 .. 2 |     1
     2 | 0 .. 1 |   259
     2 | 0 .. 2 |     1
     2 | 1 .. 2 |   392
     3 | 0 .. 1 | 39794
     3 | 1 .. 2 | 60206
(8 rows)

We have only two '0..2' segs (one on the first level and one on the second),
and all other entries are exactly '0..1' or '1..2'. I think there will be
more problems when we have many distinct values, each with a count of about
half an index page.

----
With best regards,
Alexander Korotkov.
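P.S. The 10-vs-20 scenario from the quoted text can be sketched outside
PostgreSQL. The following is a minimal Python simulation of a naive
sort-and-bisect split as Robert describes it; it is an assumption about that
behavior for illustration, not the actual seg picksplit code:

```python
# Assumed naive split: sort the 30 entries by the seg ordering ('0' < '1'),
# then cut the sorted list in half regardless of where values change.
segs = ['0'] * 10 + ['1'] * 20   # 10 segments '0', 20 segments '1'
segs.sort()
mid = len(segs) // 2             # bisect at the middle: 15/15
left, right = segs[:mid], segs[mid:]

print(len(left), len(right))                   # 15 15
print(sorted(set(left)), sorted(set(right)))   # ['0', '1'] ['1']
```

The left page ends up holding both values, so its union key must cover the
whole 0..1 range, whereas a split at the value boundary (10/20) would keep
each page's key tight.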