Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Thom Brown Thu, 28 Jan 2016 09:04:26 -0800

On 28 January 2016 at 16:12, Anastasia Lubennikova <
a.lubennik...@postgrespro.ru> wrote:


>
> 28.01.2016 18:12, Thom Brown:
>
> On 28 January 2016 at 14:06, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>>
>> 31.08.2015 10:41, Anastasia Lubennikova:
>>
>> Hi, hackers!
>> I'm going to begin work on effective storage of duplicate keys in B-tree
>> index.
>> The main idea is to implement posting lists and posting trees for B-tree
>> index pages as it's already done for GIN.
>>
>> In a nutshell, effective storing of duplicates in GIN is organised as
>> follows.
>> Index stores single index tuple for each unique key. That index tuple
>> points to posting list which contains pointers to heap tuples (TIDs). If
>> too many rows having the same key, multiple pages are allocated for the
>> TIDs and these constitute so called posting tree.
>> You can find wonderful detailed descriptions in gin readme
>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>> It also makes possible to apply compression algorithm to posting
>> list/tree and significantly decrease index size. Read more in presentation
>> (part 1)
>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>
>> Now new B-tree index tuple must be inserted for each table row that we
>> index.
>> It can possibly cause page split. Because of MVCC even unique index could
>> contain duplicates.
>> Storing duplicates in posting list/tree helps to avoid superfluous splits.
>>
>>
>> I'd like to share the progress of my work. So here is a WIP patch.
>> It provides effective duplicate handling using posting lists the same way
>> as GIN does it.
>>
>> Layout of the tuples on the page is changed in the following way:
>> before:
>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID
>> (ip_blkid, ip_posid) + key
>> with patch:
>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>
>> It seems that backward compatibility works well without any changes. But
>> I haven't tested it properly yet.
>>
>> Here are some test results. They are obtained by test functions
>> test_btbuild and test_ginbuild, which you can find in attached sql file.
>> i - number of distinct values in the index. So i=1 means that all rows
>> have the same key, and i=10000000 means that all keys are different.
>> The other columns contain the index size (MB).
>>
>> i B-tree Old B-tree New GIN
>> 1 214,234375 87,7109375 10,2109375
>> 10 214,234375 87,7109375 10,71875
>> 100 214,234375 87,4375 15,640625
>> 1000 214,234375 86,2578125 31,296875
>> 10000 214,234375 78,421875 104,3046875
>> 100000 214,234375 65,359375 49,078125
>> 1000000 214,234375 90,140625 106,8203125
>> 10000000 214,234375 214,234375 534,0625
>> You can note that the last row contains the same index sizes for B-tree,
>> which is quite logical - there is no compression if all the keys are
>> distinct.
>> Other cases looks really nice to me.
>> Next thing to say is that I haven't implemented posting list compression
>> yet. So there is still potential to decrease size of compressed btree.
>>
>> I'm almost sure, there are still some tiny bugs and missed functions, but
>> on the whole, the patch is ready for testing.
>> I'd like to get a feedback about the patch testing on some real datasets.
>> Any bug reports and suggestions are welcome.
>>
>> Here is a couple of useful queries to inspect the data inside the index
>> pages:
>> create extension pageinspect;
>> select * from bt_metap('idx');
>> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx',
>> n) as bt;
>> select n, bt.* from generate_series(1,1) as n, lateral
>> bt_page_items('idx', n) as bt;
>>
>> And at last, the list of items I'm going to complete in the near future:
>> 1. Add storage_parameter 'enable_compression' for btree access method
>> which specifies whether the index handles duplicates. default is 'off'
>> 2. Bring back microvacuum functionality for compressed indexes.
>> 3. Improve insertion speed. Insertions became significantly slower with
>> compressed btree, which is obviously not what we do want.
>> 4. Clean the code and comments, add related documentation.
>>
>
> This doesn't apply cleanly against current git head.  Have you caught up
> past commit 65c5fcd35?
>
>
> Thank you for the notice. New patch is attached.
>

Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point.  Your
results show a sweet spot at around 1000 / 10000000, with it getting
slightly worse beyond that.  I kind of expected a lot of efficiency where
all the values are the same, but perhaps that's due to my lack of
understanding regarding the way they're being stored.

Thom

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Reply via email to