On Mon, Jan 6, 2014 at 12:35 PM, Amit Langote <amitlangot...@gmail.com> wrote:
> On Sat, Dec 21, 2013 at 4:36 AM, Heikki Linnakangas
> <hlinnakan...@vmware.com> wrote:
> >
> > Yet another version. The encoding/decoding code is now quite isolated in
> > ginpostinglist.c, so it's easy to experiment with different encodings. This
> > patch uses varbyte encoding again.
> >
> > I got a bit carried away, experimented with a bunch of different encodings.
> > I tried rice encoding, rice encoding with block and offset number delta
> > stored separately, the simple9 variant, and varbyte encoding.
> >
> > The compressed size obviously depends a lot on the distribution of the
> > items, but in the test set I used, the differences between different
> > encodings were quite small.
> >
> > One fatal problem with many encodings is VACUUM. If a page is completely
> > full and you remove one item, the result must still fit. In other words,
> > removing an item must never enlarge the space needed. Otherwise we have to
> > be able to split on vacuum, which adds a lot of code, and also makes it
> > possible for VACUUM to fail if there is no disk space left. That's
> > unpleasant if you're trying to run VACUUM to release disk space. (gin fast
> > updates already has that problem BTW, but let's not make it worse)
> >
> > I believe that eliminates all encodings in the Simple family, as well as
> > PForDelta, and surprisingly also Rice encoding. For example, if you have
> > three items in consecutive offsets, the differences between them are encoded
> > as 11 in rice encoding. If you remove the middle item, the encoding for the
> > next item becomes 010, which takes more space than the original.
> >
> > AFAICS varbyte encoding is safe from that. (a formal proof would be nice
> > though).
> >
> > So, I'm happy to go with varbyte encoding now, indeed I don't think we have
> > much choice unless someone can come up with an alternative that's
> > VACUUM-safe.
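For illustration, the removal-safety argument above can be sketched in a few
lines of Python. This is a simplified sketch, not the code in
ginpostinglist.c: it assumes items are plain sorted integers rather than item
pointers, every function name is made up for the example, and an Elias-gamma
code is used only as a stand-in because it reproduces the bit strings quoted
above (1 for a delta of 1, 010 for a delta of 2).

```python
def varbyte_encode(delta: int) -> bytes:
    """Encode one non-negative delta, 7 payload bits per byte (hypothetical helper)."""
    out = bytearray()
    while delta >= 0x80:
        out.append((delta & 0x7F) | 0x80)  # high bit set: more bytes follow
        delta >>= 7
    out.append(delta)                      # final byte has the high bit clear
    return bytes(out)


def encode_list(items):
    """Encode a sorted list of non-negative ints as varbyte deltas."""
    out = bytearray()
    prev = 0
    for item in items:
        out += varbyte_encode(item - prev)
        prev = item
    return bytes(out)


def decode_list(data):
    """Inverse of encode_list."""
    items, prev, value, shift = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                     # continuation: keep accumulating
        else:
            prev += value                  # delta complete: rebuild the item
            items.append(prev)
            value, shift = 0, 0
    return items


def removal_never_grows(items):
    """True if deleting any single item never enlarges the varbyte encoding."""
    full = len(encode_list(items))
    return all(
        len(encode_list(items[:i] + items[i + 1:])) <= full
        for i in range(len(items))
    )


def gamma_bits(delta: int) -> str:
    """Elias-gamma bit string for delta >= 1 (stand-in for the quoted example)."""
    binary = bin(delta)[2:]
    return "0" * (len(binary) - 1) + binary
```

With items 1, 2, 3 the two gamma-coded deltas after the first item are
gamma_bits(1) + gamma_bits(1), two bits; after removing the middle item they
become gamma_bits(2), three bits. The varbyte list for the same items shrinks
from three bytes to two. Informally, merging two deltas a and b into a + b
needs at most one more bit than the larger of the two, hence at most one more
byte, while dropping an item frees at least one byte. That is an argument
sketch, not the formal proof asked for above.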
> > I have to put this patch aside for a while now, I spent a lot
> > more time on these encoding experiments than I intended. If you could take a
> > look at this latest version, spend some time reviewing it and cleaning up
> > any obsolete comments, and re-run the performance tests you did earlier,
> > that would be great. One thing I'm slightly worried about is the overhead of
> > merging the compressed and uncompressed posting lists in a scan. This patch
> > will be in good shape for the final commitfest, or even before that.
> >
>
> I just tried out the patch "gin-packed-postinglists-varbyte2.patch"
> (which looks like the latest one in this thread) as follows:
>
> 1) Applied patch to the HEAD (on commit
> 94b899b829657332bda856ac3f06153d09077bd1)
> 2) Created a test table and index
>
> create table test (a text);
> copy test from '/usr/share/dict/words';
> create index test_trgm_idx on test using gin (a gin_trgm_ops);
>
> 3) Got the following error on a wildcard query:
>
> postgres=# explain (buffers, analyze) select count(*) from test where
> a like '%tio%';
> ERROR: lock 9447 is not held
> STATEMENT: explain (buffers, analyze) select count(*) from test where
> a like '%tio%';
> ERROR: lock 9447 is not held
>

Thanks for reporting. Fixed version is attached.

------
With best regards,
Alexander Korotkov.
gin-packed-postinglists-varbyte3.patch.gz
Description: GNU Zip compressed data
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers