Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 One of the reasons our client wants GIN for the integer[] column so bad is 
 because recreating the GiST integer[] index is quite painful. Before I duped 
 the table, I was just dropping and recreating the index on the original 
 table. It was great to create the GIN index; for 400K rows, it took 1300 ms. 
 After my initial round of testing, I dropped it and created the GiST index. 
 That ran for…well, *hours*. Four at least, maybe six or seven (I forgot 
 \timing and was letting it run on screen while I did some iOS hacking). I 
 think something might be really wrong with GiST index creation for integer 
 arrays, because the difference was just appalling. 

I poked into this a bit, although it's really off-topic for what I was
doing in GIN, and I agree that there is something fundamentally rotten
in the gist__int_ops logic.  oprofile shows that the code is spending
all its time in g_int_compress and g_int_union during index build:

samples  %app name symbol name
5274740.2486  _int.so  g_int_compress
2709220.6726  libc-2.12.2.so   _wordcopy_fwd_aligned
1893814.4506  _int.so  compASC
1825613.9302  postgres pg_qsort
5741  4.3807  no-vmlinux   /no-vmlinux
3297  2.5158  postgres swapfunc
864   0.6593  _int.so  g_int_decompress
826   0.6303  _int.so  _int_unique
700   0.5341  _int.so  inner_int_union
617   0.4708  postgres med3
588   0.4487  libc-2.12.2.so   memset
371   0.2831  _int.so  g_int_same
143   0.1091  libc-2.12.2.so   memcpy
123   0.0939  postgres ArrayGetNItems
670.0511  _int.so  inner_int_inter
480.0366  postgres AllocSetAlloc
470.0359  libc-2.12.2.so   memmove
400.0305  postgres MemoryContextAllocZero

The time in g_int_compress is all on this loop:

while (len  MAXNUMRANGE * 2)
{
min = 0x7fff;
for (i = 2; i  len; i += 2)
if (min  (dr[i] - dr[i - 1]))
{
min = (dr[i] - dr[i - 1]);
cand = i;
}
memmove((void *) dr[cand - 1], (void *) dr[cand + 1], (len - cand 
- 1) * sizeof(int32));
len -= 2;
}

which is not too surprising since that's obviously O(N^2).  The other
high-percentage functions are qsort and subsidiaries, and those calls
are coming from g_int_union.  That part could probably be sped up, since
most of the calls are unioning two inputs, which could be done a lot
cheaper than this (the inputs are always presorted no?).  But there is
something really fundamentally wrong in the logic, I think, because
while poking at it with gdb I saw it union-ing fifty-thousand-element
arrays.  Surely it should get lossy before it gets to that.  I'm also
wondering how it tells the difference between lossy and non-lossy keys
--- although it's hard to tell what such spectacularly uncommented code
is supposed to be doing, it looks like the lossy case is supposed to be
pairs of values representing ranges, in which case g_int_compress is
surely doing the Wrong Thing with already-lossy inputs, and union'ing
lossy inputs is an entirely insane operation as well.

At the moment my opinion is that gist__int_ops is too broken to be
usable, and it's also too uncommented to be fixable by anyone other
than the original author.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 These numbers are a bit crazy-making, but the upshot is that Gist is
 slow out of the gate, but with data cached, it's pretty speedy. With
 indexscan and bitmapscan disabled, these queries all took 300-400
 ms. So GIN was never better performing than a table scan.

I could not replicate that here at all --- GIN indexscans were
consistently better than seqscans for me, eg

regression=# set enable_bitmapscan TO 1;
SET
Time: 0.673 ms
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(136879913688001369043)'::query_int
AND deleted_at IS NULL AND status = 1;
  QUERY PLAN
  
--
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.964..23.964 
rows=1 loops=1)
   -  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) 
(actual time=23.014..23.876 rows=772 loops=1)
 Recheck Cond: ((features @@ '1368799  1368800  1369043'::query_int) 
AND (deleted_at IS NULL) AND (status = 1))
 -  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 
width=0) (actual time=22.913..22.913 rows=772 loops=1)
   Index Cond: (features @@ '1368799  1368800  
1369043'::query_int)
 Total runtime: 24.040 ms
(6 rows)

Time: 24.968 ms
regression=# set enable_bitmapscan TO 0;
SET
Time: 0.458 ms
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(136879913688001369043)'::query_int
AND deleted_at IS NULL AND status = 1;
 QUERY PLAN 
  

 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual 
time=145.121..145.121 rows=1 loops=1)
   -  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual 
time=0.025..144.982 rows=772 loops=1)
 Filter: ((deleted_at IS NULL) AND (features @@ '1368799  1368800  
1369043'::query_int) AND (status = 1))
 Total runtime: 145.177 ms
(4 rows)

Time: 146.228 ms

I'm noticing also that I get different rowcounts than you do, although
possibly that has something to do with the partial-index conditions,
which I'm not trying to duplicate here (all rows in my table pass those
two tests).

 * Why does it take 3-4x longer to create the GIN than the GiST index
 on tsvector?

Perhaps more maintenance_work_mem would help with that; although the
fine manual says specifically that GIN text search indexes take about
three times longer to build than equivalent GiST indexes, so maybe that
behavior is as designed.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:

 At the moment my opinion is that gist__int_ops is too broken to be
 usable, and it's also too uncommented to be fixable by anyone other
 than the original author.

That seems to jibe with your comments from last year:

  http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Best,

David

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 18, 2011, at 1:39 PM, Tom Lane wrote:
 At the moment my opinion is that gist__int_ops is too broken to be
 usable, and it's also too uncommented to be fixable by anyone other
 than the original author.

 That seems to jibe with your comments from last year:
   http://archives.postgresql.org/pgsql-performance/2009-03/msg00265.php

Well, based on what I saw just now, I believe there are one or more
actual bugs in there, not just that it's straining the design capacity
of the opclass.  I think the lossy aspect isn't working at all.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 1:58 PM, Tom Lane wrote:

 I'm noticing also that I get different rowcounts than you do, although
 possibly that has something to do with the partial-index conditions,
 which I'm not trying to duplicate here (all rows in my table pass those
 two tests).

Shall I send you data with the other two columns?:

 * Why does it take 3-4x longer to create the GIN than the GiST index
 on tsvector?
 
 Perhaps more maintenance_work_mem would help with that; although the
 fine manual says specifically that GIN text search indexes take about
 three times longer to build than equivalent GiST indexes, so maybe that
 behavior is as designed.

Okay then, thanks.

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 18, 2011, at 1:58 PM, Tom Lane wrote:
 I'm noticing also that I get different rowcounts than you do, although
 possibly that has something to do with the partial-index conditions,
 which I'm not trying to duplicate here (all rows in my table pass those
 two tests).

 Shall I send you data with the other two columns?:

No, I see no reason to think that has much to do with it.  I'm wondering
if your table is itself a bit bloated ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 3:08 PM, Tom Lane wrote:

 Shall I send you data with the other two columns?:
 
 No, I see no reason to think that has much to do with it.  I'm wondering
 if your table is itself a bit bloated ...

Nope. Just ran the bloat query from check_postgres.pl. Bloat is 0. Not 
surprising since the table was created by inserting from another table (the 
original with the gist indexes) and has not been updated since.

Any other ideas?

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread Tom Lane
I wrote:
 No, I see no reason to think that has much to do with it.  I'm wondering
 if your table is itself a bit bloated ...

Actually ... I notice you did not show EXPLAIN ANALYZE output for your
tests.  Now I'm wondering whether you tested the right thing at all.
I got burnt that way too.  Observe:

regression=# create index idx_gin_features on listings using gin(features) 
WHERE deleted_at IS NULL AND status = 1;
CREATE INDEX
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(136879913688001369043)'::query_int
AND deleted_at IS NULL AND status = 1;
 QUERY PLAN 


 Aggregate  (cost=9158.24..9158.25 rows=1 width=0) (actual 
time=153.633..153.634 rows=1 loops=1)
   -  Seq Scan on listings  (cost=0.00..9157.22 rows=406 width=0) (actual 
time=0.048..153.493 rows=772 loops=1)
 Filter: ((deleted_at IS NULL) AND (features @@ '1368799  1368800  
1369043'::query_int) AND (status = 1))
 Total runtime: 153.713 ms
(4 rows)

regression=# set enable_seqscan TO 0;
SET
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(136879913688001369043)'::query_int
AND deleted_at IS NULL AND status = 1;
   QUERY PLAN   


 Aggregate  (cost=13253.42..13253.43 rows=1 width=0) (actual 
time=331.990..331.990 rows=1 loops=1)
   -  Bitmap Heap Scan on listings  (cost=4095.18..13252.40 rows=406 width=0) 
(actual time=164.785..331.858 rows=772 loops=1)
 Recheck Cond: ((deleted_at IS NULL) AND (status = 1))
 Filter: (features @@ '1368799  1368800  1369043'::query_int)
 -  Bitmap Index Scan on idx_gin_features  (cost=0.00..4095.07 
rows=406215 width=0) (actual time=164.045..164.045 rows=406215 loops=1)
 Total runtime: 332.169 ms
(6 rows)

The above is using the index, but only as a guide to where the rows
satisfying the partial-index predicate are --- note the lack of any
index condition in the indexscan node.  That's because the query_int
query is not in fact compatible with the core-provided index opclass.
We get much better results using intarray's gin__int_ops opclass:

regression=# drop index idx_gin_features;
DROP INDEX
regression=# create index idx_gin_features on listings using gin(features 
gin__int_ops) WHERE deleted_at IS NULL AND status = 1;
CREATE INDEX
regression=# explain analyze SELECT count(*) FROM listings
  WHERE features @@ '(136879913688001369043)'::query_int
AND deleted_at IS NULL AND status = 1;
  QUERY PLAN
  
--
 Aggregate  (cost=1159.20..1159.21 rows=1 width=0) (actual time=23.896..23.896 
rows=1 loops=1)
   -  Bitmap Heap Scan on listings  (cost=31.15..1158.18 rows=406 width=0) 
(actual time=22.912..23.813 rows=772 loops=1)
 Recheck Cond: ((features @@ '1368799  1368800  1369043'::query_int) 
AND (deleted_at IS NULL) AND (status = 1))
 -  Bitmap Index Scan on idx_gin_features  (cost=0.00..31.05 rows=406 
width=0) (actual time=22.811..22.811 rows=772 loops=1)
   Index Cond: (features @@ '1368799  1368800  
1369043'::query_int)
 Total runtime: 24.036 ms
(6 rows)


regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-18 Thread David E. Wheeler
On Jan 18, 2011, at 3:46 PM, Tom Lane wrote:

 The above is using the index, but only as a guide to where the rows
 satisfying the partial-index predicate are --- note the lack of any
 index condition in the indexscan node.  That's because the query_int
 query is not in fact compatible with the core-provided index opclass.
 We get much better results using intarray's gin__int_ops opclass:

Ah-hah! Confirmed, thank you. I now get 23.2 ms and an index condition. Much, 
much better.

Thank you! And I think we can use the improved GIN indexes for this app, once 
9.1 drops.

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-16 Thread David E. Wheeler
On Jan 14, 2011, at 11:37 PM, David E. Wheeler wrote:

 Hard to comment on any of this without a concrete example (including
 data) to look at.  Given the bugs we've recently found in the picksplit
 algorithms for other contrib modules, I wouldn't be too surprised if the
 sucky GiST performance traced to a similar bug in intarray.  But I'm not
 excited about devising my own test case.

FWIW, it looks like we're able to fix the GiST performance by using 
gist__intbig_ops. Relevant thread:

  http://archives.postgresql.org/pgsql-performance/2009-03/msg00254.php

Perhaps time to revisit using gist__int_ops as the default?

 I could give you access to the box in question if you'd like to poke at it. 
 Send me a public key.
 
 One other point here is that GIN index build time is quite sensitive to
 maintenance_work_mem --- what did you have that set to?
 
 64MB

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-14 Thread David E . Wheeler
On Jan 8, 2011, at 9:41 PM, Tom Lane wrote:

 David E. Wheeler da...@kineticode.com writes:
 On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
 Well, actually, I just committed it.  If you want to test, feel free.
 Note that right now only the anyarray  @ @ operators are genuinely
 fixed ... I plan to hack on tsearch and contrib pretty soon though.
 
 Hrm, the queries I wrote for this sort of thing use intarray:
WHERE blah @@ '(12|14)'::query_int
 That's not done yet though, right?
 
 intarray is done now, feel free to test ...

Tom,

Well, I regret to say that what I found is…all over the place.

So I have a rather wide table with two GIST indexes, one on an integer[] column 
and one on a tsvector column. I duped the table and replaced those indexes with 
GIN indexes. And the results of my testing don't make much sense.

Well, first the good news: I got no NULL-related errors at all. There are a lot 
of rows with an empty array in the integer[] column. And I got the same results 
for my queries against the table with the GIN indexes as the one with the GiST 
indexes. So all that's good.

One of the reasons our client wants GIN for the integer[] column so bad is 
because recreating the GiST integer[] index is quite painful. Before I duped 
the table, I was just dropping and recreating the index on the original table. 
It was great to create the GIN index; for 400K rows, it took 1300 ms. After my 
initial round of testing, I dropped it and created the GiST index. That ran 
for…well, *hours*. Four at least, maybe six or seven (I forgot \timing and was 
letting it run on screen while I did some iOS hacking). I think something might 
be really wrong with GiST index creation for integer arrays, because the 
difference was just appalling. 

As a sanity check, I did the same thing today with the same table(s) on their 
ts_vector columns. Creating the GiST index took just under 7 seconds; the GIN 
index took 23.4 seconds. On a second attempt, GiST took 16371 ms and GIN 30452. 
I had expected GIN to be faster here, but both are within the realms of the 
acceptable, compared to the time it took to create the GiST index on the 
integer[] column.

As for the queries, here too I was surprised. As I said, the integer[] column 
had a lot of empty arrays. And the indexes look like so:

  idx_gist_features gist (features) WHERE deleted_at IS NULL AND status = 1
  idx_gist_textsearch gist (ts_index_col) WHERE deleted_at IS NULL AND status 
= 1

Just s/gist/gin/ for the gin indexes. For the integer[] column, I ran a bunch 
of queries like so (again, just s/gist/gin/ for the gin versions):

explain analyze SELECT count(*) FROM gist_listings
  WHERE features @@ '(1369043)'::query_int
AND deleted_at IS NULL AND mls_status_id = 1;

This integer had pretty high selectvity, 86 out of 406K rows.
Gist: 117.444 on the first run, around 3.2 ms thereafter
GIN:  Around 325 ms on all runs

explain analyze SELECT count(*) FROM gist_listings
  WHERE features @@ '(1368798|1369043)'::query_int
AND deleted_at IS NULL AND mls_status_id = 1;

Rows selected: 91528.
Gist: 4030.282 ms on the first run, around 210 ms thereafter.
GIN:  4309.259 ms on the first run, around 400 ms thereafter

explain analyze SELECT count(*) FROM gist_listings
  WHERE features @@ '(136879913688001369043)'::query_int
AND deleted_at IS NULL AND mls_status_id = 1;

Rows selected: 91528
Gist: 1738.568 ms on the first run, around  24 ms thereafter.
GIN:  4427.517 ms on the first run, around 340 ms thereafter

These numbers are a bit crazy-making, but the upshot is that Gist is slow out 
of the gate, but with data cached, it's pretty speedy. With indexscan and 
bitmapscan disabled, these queries all took 300-400 ms. So GIN was never better 
performing than a table scan. So while GIN is a big win for re-indexing (this 
database gets its intarray Gist indexes updated quite frequently, as they get 
quited bloated in a hurry), it's not a win at all for querying.

So, thinking that there might be something funky with intarray GIN support, we 
wanted to test performance of GIST vs. GIN on the tsquery column. Here GIN was 
a much bigger win. With a query like this:

SELECT l.* FROM gist_listings
 WHERE ts_index_col @@ to_tsquery(regexp_replace(
 plainto_tsquery('english', '1 Infinite Loop')::text,
 '''(?=[[:space:]]|$)', ''':B', 'g'
 )) and deleted_at IS NULL AND status = 1;

With zero rows returned, GIN consistently executed in 20 ms. Gist took 838.274 
for the first run and 25-30 ms on subsequent runs. So GIN is the clear winner 
here, except for index creation as noted above.

And here's one that selects a single row:

SELECT l.* FROM gist_listings
 WHERE ts_index_col @@ to_tsquery(regexp_replace(
 plainto_tsquery('english', 'volkswagon')::text,
 '''(?=[[:space:]]|$)', ''':B', 'g'
 )) and deleted_at IS NULL AND status = 1;

GiST: 495.867 first run, 380 thereafter
GIN:  83.980 

Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-14 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 So some questions:

 * Is something seriously wrong with GiST index creation on integer[] columns?

 * Why does GIN performance appear to be no better than table scans on 
 integer[] columns?

 * Why does it take 3-4x longer to create the GIN than the GiST index on 
 tsvector? I thought that GIN was supposed to be faster to update

Hard to comment on any of this without a concrete example (including
data) to look at.  Given the bugs we've recently found in the picksplit
algorithms for other contrib modules, I wouldn't be too surprised if the
sucky GiST performance traced to a similar bug in intarray.  But I'm not
excited about devising my own test case.

One other point here is that GIN index build time is quite sensitive to
maintenance_work_mem --- what did you have that set to?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-14 Thread David E. Wheeler
On Jan 14, 2011, at 5:54 PM, Tom Lane wrote:

 David E. Wheeler da...@kineticode.com writes:
 So some questions:
 
 * Is something seriously wrong with GiST index creation on integer[] columns?
 
 * Why does GIN performance appear to be no better than table scans on 
 integer[] columns?
 
 * Why does it take 3-4x longer to create the GIN than the GiST index on 
 tsvector? I thought that GIN was supposed to be faster to update
 
 Hard to comment on any of this without a concrete example (including
 data) to look at.  Given the bugs we've recently found in the picksplit
 algorithms for other contrib modules, I wouldn't be too surprised if the
 sucky GiST performance traced to a similar bug in intarray.  But I'm not
 excited about devising my own test case.

I could give you access to the box in question if you'd like to poke at it. 
Send me a public key.

 One other point here is that GIN index build time is quite sensitive to
 maintenance_work_mem --- what did you have that set to?

64MB

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-13 Thread Dimitri Fontaine
Tom Lane t...@sss.pgh.pa.us writes:
 David E. Wheeler da...@kineticode.com writes:
 On Jan 12, 2011, at 4:35 PM, Tom Lane wrote:
 No, what we need is a decent extension package manager ;-)

 Yeah. Maybe you can do that this weekend? Or, I dunno, while you “sleep” 
 tonight?

 Supposedly it's in the queue for the upcoming CF :-)

Hehe, and some provision have been made to support upgrading from 9.0 to
9.1 too:

  http://pgsql.tapoueh.org/extensions/doc/html/extend-extension.html#AEN50748

But that won't solve the dump-from-9.0 and restore-into-9.1 by itself,
the only way for us to solve that problem that I can think of would be
to backpatch a new feature.  Do it the old-way until you upgrade from
9.1 to later might be our answer here.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-13 Thread Bruce Momjian
Robert Haas wrote:
 On Tue, Jan 4, 2011 at 4:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  * Existing GIN indexes are upwards compatible so far as on-disk storage
  goes, but they will of course be missing entries for empty, null, or
  null-containing items. ?Users who want to do searches that should find
  such items will need to reindex after updating to 9.1.
 
  This is the only part of this proposal that bothers me a little bit.
  It would be nice if the system could determine whether a GIN index is
  upgraded from 9.0 or earlier and thus doesn't contain these entries
  - and avoid trying to use the index for these sorts of queries in
  cases where it might return wrong answers.
 
  I don't think it's really worth the trouble. ?The GIN code has been
  broken for these types of queries since day one, and yet we've had only
  maybe half a dozen complaints about it. ?Moreover there's no practical
  way to avoid trying to use the index, since in many cases the fact
  that a query requires a full-index scan isn't determinable at plan time.
 
  The best we could really do is throw an error at indexscan start, and
  that doesn't seem all that helpful. ?But it probably wouldn't take much
  code either, if you're satisfied with that answer. ?(I'm envisioning
  adding a version ID to the GIN metapage and then checking that before
  proceeding with a full-index scan.)
 
 I'd be satisfied with that answer.  It at least makes it a lot more
 clear when you've got a problem.  If this were a more common scenario,
 I'd probably advocate for a better solution, but the one you propose
 seems adequate given the frequency of the problem as you describe it.

What does pg_upgrade need to do about this for 9.1?  Just tell people
they might get an GIN error someday?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-13 Thread Tom Lane
Bruce Momjian br...@momjian.us writes:
 What does pg_upgrade need to do about this for 9.1?

Nothing.  An existing GIN index can still do all the same queries it
could do before.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-12 Thread David E. Wheeler
Tom,

On Jan 8, 2011, at 9:41 PM, Tom Lane wrote:

 David E. Wheeler da...@kineticode.com writes:
 On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
 Well, actually, I just committed it.  If you want to test, feel free.
 Note that right now only the anyarray  @ @ operators are genuinely
 fixed ... I plan to hack on tsearch and contrib pretty soon though.
 
 Hrm, the queries I wrote for this sort of thing use intarray:
WHERE blah @@ '(12|14)'::query_int
 That's not done yet though, right?
 
 intarray is done now, feel free to test ...

Thanks, working on it now. I'm restoring a dump from 8.4, but got these erors:

pg_restore: [archiver (db)] Error from TOC entry 3227; 2616 46485 OPERATOR 
CLASS gin__int_ops postgres
pg_restore: [archiver (db)] could not execute query: ERROR:  function 
ginarrayextract(anyarray, internal) does not exist
Command was: CREATE OPERATOR CLASS gin__int_ops
FOR TYPE integer[] USING gin AS
STORAGE integer ,
OPERATOR 3 (integer[],int...
pg_restore: [archiver (db)] could not execute query: ERROR:  operator class 
contrib.gin__int_ops does not exist for access method gin
Command was: ALTER OPERATOR CLASS contrib.gin__int_ops USING gin OWNER TO 
postgres;
pg_restore: [archiver (db)] Error from TOC entry 3254; 3600 16245434 TEXT 
SEARCH DICTIONARY en_mls_20101103 postgres

Did a signature change or something? Is there something that needs a 
compatibility interface of some kind?

Thanks,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-12 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 Thanks, working on it now. I'm restoring a dump from 8.4, but got these erors:

 pg_restore: [archiver (db)] Error from TOC entry 3227; 2616 46485 OPERATOR 
 CLASS gin__int_ops postgres
 pg_restore: [archiver (db)] could not execute query: ERROR:  function 
 ginarrayextract(anyarray, internal) does not exist

 Did a signature change or something?

Yeah.  I think if you just load up the current contrib/intarray first,
you'll be fine (ignore all the object-already-exists errors).

 Is there something that needs a compatibility interface of some kind?

No, what we need is a decent extension package manager ;-)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-12 Thread David E. Wheeler
On Jan 12, 2011, at 4:35 PM, Tom Lane wrote:

 Did a signature change or something?
 
 Yeah.  I think if you just load up the current contrib/intarray first,
 you'll be fine (ignore all the object-already-exists errors).

Oh, from 9.1devel? yeah, okay. Will do that tomorrow (finishing the current 
load to make sure that there are no other errors I can't ignore, but won't see 
among all the intarray stuff tomorrow).

 Is there something that needs a compatibility interface of some kind?
 
 No, what we need is a decent extension package manager ;-)

Yeah. Maybe you can do that this weekend? Or, I dunno, while you “sleep” 
tonight?

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-12 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 12, 2011, at 4:35 PM, Tom Lane wrote:
 No, what we need is a decent extension package manager ;-)

 Yeah. Maybe you can do that this weekend? Or, I dunno, while you “sleep” 
 tonight?

Supposedly it's in the queue for the upcoming CF :-)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


contrib/intarray (was Re: [HACKERS] Fixing GIN for empty/null/full-scan cases)

2011-01-08 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
 Well, actually, I just committed it.  If you want to test, feel free.
 Note that right now only the anyarray  @ @ operators are genuinely
 fixed ... I plan to hack on tsearch and contrib pretty soon though.

 Hrm, the queries I wrote for this sort of thing use intarray:

I'm going to work on contrib/intarray first (before tsearch etc)
so that you can do whatever testing you want sooner.

One of the things that first got me annoyed about the whole GIN
situation is that intarray's definitions of the @ and @ operators were
inconsistent with the core operators of the same names.  I believe that
the inconsistency has to go away.  Really the only reason that intarray
has its own versions of these operators at all is that it can be faster
than the generic anyarray versions in core.  There seem to be three ways
in which intarray is simpler/faster than the generic operators:

* restricted to integer arrays
* restricted to 1-D arrays
* doesn't allow nulls in the arrays

The first of these is pretty important from a speed perspective, and
it's also basically free because of the type system: the parser won't
attempt to apply intarray's operators to anything that's not an integer
array.  The second one seems a little more dubious.  AFAICS the code
isn't actually exploiting 1-D-ness anywhere; it always uses
ArrayGetNItems() to compute the array size, for example.  I propose that
we just drop that restriction and let it accept arrays that are
multidimensional, implicitly linearizing the elements in storage order.
(Any created arrays will still be 1-D though.)

The third restriction is a bit harder to decide what to do about.
If we keep it then intarray's @ etc will throw errors in some cases
where core would not have.  However, dealing with nulls will make the
code significantly uglier and probably slower than it is now; and that's
not work that I'm excited about doing right now anyway.  So what I
propose for the moment is that intarray still have that restriction.
Maybe someone else will feel like fixing it later.

I will however fix the issue described here:
http://archives.postgresql.org/pgsql-bugs/2010-12/msg00032.php
that intarray sometimes throws nulls not allowed errors on
arrays that once contained nulls but now don't.  That can be
done with a relatively localized patch --- we just need to look
a little harder when the ARR_HASNULL flag is set.

Comments, objections?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: contrib/intarray (was Re: [HACKERS] Fixing GIN for empty/null/full-scan cases)

2011-01-08 Thread David E. Wheeler
On Jan 8, 2011, at 1:59 PM, Tom Lane wrote:

 Hrm, the queries I wrote for this sort of thing use intarray:
 
 I'm going to work on contrib/intarray first (before tsearch etc)
 so that you can do whatever testing you want sooner.

No, of course not.

 One of the things that first got me annoyed about the whole GIN
 situation is that intarray's definitions of the @ and @ operators were
 inconsistent with the core operators of the same names.  I believe that
 the inconsistency has to go away.  Really the only reason that intarray
 has its own versions of these operators at all is that it can be faster
 than the generic anyarray versions in core.  There seem to be three ways
 in which intarray is simpler/faster than the generic operators:
 
   * restricted to integer arrays
   * restricted to 1-D arrays
   * doesn't allow nulls in the arrays

My understanding is that they also perform much better if the values in an 
integer array are ordered. Does that matter?

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: contrib/intarray (was Re: [HACKERS] Fixing GIN for empty/null/full-scan cases)

2011-01-08 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 8, 2011, at 1:59 PM, Tom Lane wrote:
 There seem to be three ways
 in which intarray is simpler/faster than the generic operators:
 
 * restricted to integer arrays
 * restricted to 1-D arrays
 * doesn't allow nulls in the arrays

 My understanding is that they also perform much better if the values in an 
 integer array are ordered. Does that matter?

Some of the operations sort the array contents as an initial step.  I'm
not sure how much faster they'll be if the array is already ordered,
but in any case they don't *require* presorted input.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-08 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
 Well, actually, I just committed it.  If you want to test, feel free.
 Note that right now only the anyarray  @ @ operators are genuinely
 fixed ... I plan to hack on tsearch and contrib pretty soon though.

 Hrm, the queries I wrote for this sort of thing use intarray:
 WHERE blah @@ '(12|14)'::query_int
 That's not done yet though, right?

intarray is done now, feel free to test ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-07 Thread Tom Lane
I wrote:
 2. Add another output bool parameter to extractQuery that it must set
 true (from a default false state) if the query could match with no check
 values set.  This would prompt the GIN code to search for EMPTY_ITEM
 placeholders, but they'd not be part of the check[] array.

On further reflection: if we're going to go this route, we really ought
to take one more step and allow the opclass to demand a full-index scan.
The reason for this is cases like tsvector's NOT operator:

SELECT ... WHERE tsvectorcol @@ '! unwanted'::tsquery

Right now, this will do what it says on the tin if implemented as a
seqscan.  It will fail (silently, I think) if implemented as a GIN index
search.  We didn't use to have any way of making it behave sanely as
an indexsearch, but the mechanisms I'm building now would support doing
this right.

So, instead of just a bool, I'm now proposing adding an int return
argument specified like this:

searchMode is an output argument that allows extractQuery to specify
details about how the search will be done. If *searchMode is set to
GIN_SEARCH_MODE_DEFAULT (which is the value it is initialized to
before call), only items that match at least one of the returned
keys are considered candidate matches. If *searchMode is set to
GIN_SEARCH_MODE_INCLUDE_EMPTY, then in addition to items containing
at least one matching key, items that contain no keys at all are
considered candidate matches. (This mode is useful for implementing
is-subset-of operators, for example.) If *searchMode is set to
GIN_SEARCH_MODE_ALL, then all non-null items in the index are
considered candidate matches, whether they match any of the returned
keys or not. (This mode is much slower than the other two choices,
since it requires scanning essentially the entire index, but it may
be necessary to implement corner cases correctly. An operator that
needs this mode in most cases is probably not a good candidate for a
GIN operator class.) The symbols to use for setting this mode are
defined in access/gin.h.

The default mode is equivalent to what used to happen implicitly, so
this is still backwards-compatible with existing opclasses.

Don't have code to back up this spec yet, but I believe I see how to do
it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-07 Thread David E . Wheeler
On Jan 4, 2011, at 3:18 PM, Josh Berkus wrote:

 Actually, there's been a *lot* of complaining about the GIN issues.
 It's just that most of that complaining doesn't reach -hackers.
 
 The common pattern I've seen in our practice and on IRC is:
 
 1) user has GiST indexes
 2) user tries converting them to GIN
 3) user gets full index scan errors
 4) user switches back and gives up

We (PGX) actually have a client who could use this. Tom, if you have patches as 
you work on this (or, better, a branch in a Git repo), I could do some testing 
on your client's code with it. It would involve converting from a GiST to a GIN 
index and then seeing how well the queries fare. Would that be helpful to you?

Best,

David


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-07 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 We (PGX) actually have a client who could use this. Tom, if you have patches 
 as you work on this (or, better, a branch in a Git repo), I could do some 
 testing on your client's code with it. It would involve converting from a 
 GiST to a GIN index and then seeing how well the queries fare. Would that be 
 helpful to you?

Well, actually, I just committed it.  If you want to test, feel free.
Note that right now only the anyarray  @ @ operators are genuinely
fixed ... I plan to hack on tsearch and contrib pretty soon though.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-07 Thread David E. Wheeler
On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:

 Well, actually, I just committed it.  If you want to test, feel free.
 Note that right now only the anyarray  @ @ operators are genuinely
 fixed ... I plan to hack on tsearch and contrib pretty soon though.

Hrm, the queries I wrote for this sort of thing use intarray:

WHERE blah @@ '(12|14)'::query_int

That's not done yet though, right?

Best,

David



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-07 Thread Tom Lane
David E. Wheeler da...@kineticode.com writes:
 On Jan 7, 2011, at 4:19 PM, Tom Lane wrote:
 Well, actually, I just committed it.  If you want to test, feel free.
 Note that right now only the anyarray  @ @ operators are genuinely
 fixed ... I plan to hack on tsearch and contrib pretty soon though.

 Hrm, the queries I wrote for this sort of thing use intarray:

 WHERE blah @@ '(12|14)'::query_int

 That's not done yet though, right?

Right.  Maybe sometime this weekend.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-06 Thread Euler Taveira de Oliveira

Em 06-01-2011 21:31, Tom Lane escreveu:

I think I like option #2 better.  Comments?


+1.


--
  Euler Taveira de Oliveira
  http://www.timbira.com/

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-04 Thread Tom Lane
I've been thinking about how to fix GIN's assorted corner-case problems,
as has been discussed several times, most recently here:
http://archives.postgresql.org/pgsql-hackers/2010-10/msg00521.php
See also
http://wiki.postgresql.org/wiki/Todo#GIN

There are basically three related issues:

1. GIN doesn't store anything in the index for a NULL item.
2. GIN doesn't store anything in the index for an empty (zero-key) item.
3. GIN can't deal with NULL key values.

(An item is a composite value to be indexed, such as a tsvector or
array.  A key is an individual indexable value, such as a lexeme or
array element.)

Because of #1 and #2, GIN can't handle full-index scans.  This is not
because the code can't scan all of the index, but because doing so
wouldn't necessarily return a TID for every existing heap row.

We have to fix #1 and #2, and then get rid of the prohibition on
full-index scans, in order to deal with the complaints that appear in our
TODO list.  The problem with NULL key values is somewhat less pressing,
because most GIN-indexable operators are strict enough to not care about
null keys, so we could perhaps just ignore nulls.  But I'm inclined to
think that it'd be best to fix that now while we're whacking the code and
opclass APIs around, rather than probably having to go back for yet
another round later.  A concrete example of a hit we'll take if we don't
index nulls is that array-is-contained-in would have to be treated as a
lossy rather than lossless index search, since there'd be no way to tell
from the index whether the array item contains any nulls (rendering it not
contained in anything).


As far as storage in the index goes, NULL key values seem perfectly simple
to deal with: just allow a null to get stored for the key value of a GIN
index entry.  What we discussed doing to fix #1 and #2 was to store a
dummy index entry, rather than no entries at all.  The least complicated
way to do that would be to store a NULL key.  That would mean that,
for example with integer arrays, all three of these item values would have
identical index entries:
NULL::int[]
'{}'::int[]
'{NULL}'::int[]
So any searches that need to yield different answers for these cases
(ie find some but not all of them) would have to be treated as lossy.
That's not necessarily a show-stopper, but I'm inclined to think that
it is worth working a bit harder so that we can distinguish them.

It's already true that GIN requires special-case code to construct its
index entries (look at GinFormTuple).  What I'm thinking we could do
without too much additional ugliness is store either the key value (for
the normal, non-null key case) or an int16 representing a category:
1 = null key value
2 = placeholder for zero-key item
3 = placeholder for null item
The index entry's IndexTupleHasNulls flag would be sufficient to
distinguish whether a key or a category flag is present, since there
are no other potentially-null values in a GIN index entry.  There is
room in this scheme for more categories if we ever need any, though
I can't think of what they'd be.


The other sticky problem is how to extend the GIN opclass support function
API definitions for all this.  I propose the following:

compare(): doesn't really need any changes.  Just as in btree, we only
need to call the key comparison function for non-null keys.  The various
categories of nulls can have hard-wired comparison behavior.

extractValue(): needs an extension so it can return null key values.
I propose adding a third argument:
Datum *extractValue(Datum inputValue, int32 *nkeys,
bool **nullFlags)
If the function wants to return any nulls, it has to palloc an array
of nkeys bools and store a pointer to it at *nullFlags.  We can initialize
*nullFlags to NULL (implying no null keys) for backwards compatibility
with existing functions that aren't aware of the third argument.  In the
case of a null item value, we needn't call the function at all, we can
just generate the dummy entry directly.  Zero-key items can be handled
compatibly with the current behavior: if nkeys is returned as zero,
we'll generate a dummy entry instead of generating nothing at all.

extractQuery(): likewise, needs to be able to return null query element
values.  I propose adding a sixth argument:
Datum *extractQuery(Datum query, int32 *nkeys, StrategyNumber n,
bool **pmatch, Pointer **extra_data,
bool **nullFlags)
As above, we can initialize *nullFlags to NULL for backwards
compatibility.  We will continue to assume that a null query value means
an unsatisfiable query, so we don't need to be able to call extractQuery
with a null input query.  We'll keep the current convention that returning
nkeys = -1 means an unsatisfiable query while returning zero requests a
full index scan.

consistent(): needs to be able to deal with possible nulls in the
extracted query 

Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-04 Thread Robert Haas
On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * Existing GIN indexes are upwards compatible so far as on-disk storage
 goes, but they will of course be missing entries for empty, null, or
 null-containing items.  Users who want to do searches that should find
 such items will need to reindex after updating to 9.1.

This is the only part of this proposal that bothers me a little bit.
It would be nice if the system could determine whether a GIN index is
upgraded from 9.0 or earlier and thus doesn't contain these entries
- and avoid trying to use the index for these sorts of queries in
cases where it might return wrong answers.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-04 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * Existing GIN indexes are upwards compatible so far as on-disk storage
 goes, but they will of course be missing entries for empty, null, or
 null-containing items.  Users who want to do searches that should find
 such items will need to reindex after updating to 9.1.

 This is the only part of this proposal that bothers me a little bit.
 It would be nice if the system could determine whether a GIN index is
 upgraded from 9.0 or earlier and thus doesn't contain these entries
 - and avoid trying to use the index for these sorts of queries in
 cases where it might return wrong answers.

I don't think it's really worth the trouble.  The GIN code has been
broken for these types of queries since day one, and yet we've had only
maybe half a dozen complaints about it.  Moreover there's no practical
way to avoid trying to use the index, since in many cases the fact
that a query requires a full-index scan isn't determinable at plan time.

The best we could really do is throw an error at indexscan start, and
that doesn't seem all that helpful.  But it probably wouldn't take much
code either, if you're satisfied with that answer.  (I'm envisioning
adding a version ID to the GIN metapage and then checking that before
proceeding with a full-index scan.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-04 Thread Robert Haas
On Tue, Jan 4, 2011 at 4:49 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jan 4, 2011 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 * Existing GIN indexes are upwards compatible so far as on-disk storage
 goes, but they will of course be missing entries for empty, null, or
 null-containing items.  Users who want to do searches that should find
 such items will need to reindex after updating to 9.1.

 This is the only part of this proposal that bothers me a little bit.
 It would be nice if the system could determine whether a GIN index is
 upgraded from 9.0 or earlier and thus doesn't contain these entries
 - and avoid trying to use the index for these sorts of queries in
 cases where it might return wrong answers.

 I don't think it's really worth the trouble.  The GIN code has been
 broken for these types of queries since day one, and yet we've had only
 maybe half a dozen complaints about it.  Moreover there's no practical
 way to avoid trying to use the index, since in many cases the fact
 that a query requires a full-index scan isn't determinable at plan time.

 The best we could really do is throw an error at indexscan start, and
 that doesn't seem all that helpful.  But it probably wouldn't take much
 code either, if you're satisfied with that answer.  (I'm envisioning
 adding a version ID to the GIN metapage and then checking that before
 proceeding with a full-index scan.)

I'd be satisfied with that answer.  It at least makes it a lot more
clear when you've got a problem.  If this were a more common scenario,
I'd probably advocate for a better solution, but the one you propose
seems adequate given the frequency of the problem as you describe it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Fixing GIN for empty/null/full-scan cases

2011-01-04 Thread Josh Berkus
On 1/4/11 1:49 PM, Tom Lane wrote:
 I don't think it's really worth the trouble.  The GIN code has been
 broken for these types of queries since day one, and yet we've had only
 maybe half a dozen complaints about it.  Moreover there's no practical
 way to avoid trying to use the index, since in many cases the fact
 that a query requires a full-index scan isn't determinable at plan time.

Actually, there's been a *lot* of complaining about the GIN issues.
It's just that most of that complaining doesn't reach -hackers.

The common pattern I've seen in our practice and on IRC is:

1) user has GiST indexes
2) user tries converting them to GIN
3) user gets full index scan errors
4) user switches back and gives up

I agree that backwards compatibility should not be a priority; it is
sufficient to tell users to reindex.  For one thing, anyone who *is*
using GIN presently will have written their application code to avoid
full index scans.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers