28.01.2016 20:03, Thom Brown:
On 28 January 2016 at 16:12, Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:


    28.01.2016 18:12, Thom Brown:

    On 28 January 2016 at 14:06, Anastasia Lubennikova
    <a.lubennik...@postgrespro.ru> wrote:


        31.08.2015 10:41, Anastasia Lubennikova:
        Hi, hackers!
        I'm going to begin work on effective storage of duplicate
        keys in B-tree index.
        The main idea is to implement posting lists and posting
        trees for B-tree index pages as it's already done for GIN.

        In a nutshell, effective storage of duplicates in GIN is
        organised as follows.
        The index stores a single index tuple for each unique key.
        That index tuple points to a posting list which contains
        pointers to heap tuples (TIDs). If too many rows have the same
        key, multiple pages are allocated for the TIDs, and these
        constitute a so-called posting tree.
        You can find a wonderfully detailed description in the GIN readme
<https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
        and in articles <http://www.cybertec.at/gin-just-an-index-type/>.
        It also makes it possible to apply a compression algorithm to
        the posting list/tree and significantly decrease the index size.
        Read more in the presentation (part 1)
        <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.

        Currently a new B-tree index tuple must be inserted for each
        table row that we index.
        Each such insertion can cause a page split. Because of MVCC,
        even a unique index can contain duplicates.
        Storing duplicates in a posting list/tree helps to avoid
        superfluous splits.

        I'd like to share the progress of my work, so here is a WIP
        patch.
        It provides effective duplicate handling using posting lists,
        the same way GIN does.

        Layout of the tuples on the page is changed in the following way:
        before:
        TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) +
        key, TID (ip_blkid, ip_posid) + key
        with patch:
        TID (N item pointers, posting list offset) + key, TID
        (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID
        (ip_blkid, ip_posid)

        It seems that backward compatibility works well without any
        changes. But I haven't tested it properly yet.

        Here are some test results. They were obtained with the test
        functions test_btbuild and test_ginbuild, which you can find
        in the attached SQL file.
        i is the number of distinct values in the index. So i=1 means
        that all rows have the same key, and i=10000000 means that all
        keys are different.
        The other columns contain the index size (MB).

        i           B-tree Old     B-tree New     GIN
        1           214.234375      87.7109375     10.2109375
        10          214.234375      87.7109375     10.71875
        100         214.234375      87.4375        15.640625
        1000        214.234375      86.2578125     31.296875
        10000       214.234375      78.421875     104.3046875
        100000      214.234375      65.359375      49.078125
        1000000     214.234375      90.140625     106.8203125
        10000000    214.234375     214.234375     534.0625


        Note that the last row contains the same index size for both
        B-tree columns, which is quite logical: there is no compression
        if all the keys are distinct.
        The other cases look really good to me.
        The next thing to say is that I haven't implemented posting
        list compression yet, so there is still potential to decrease
        the size of the compressed B-tree further.

        I'm almost sure there are still some small bugs and missing
        functions, but on the whole the patch is ready for testing.
        I'd like to get feedback from testing the patch on some real
        datasets. Any bug reports and suggestions are welcome.

        Here are a couple of useful queries to inspect the data inside
        the index pages:
        create extension pageinspect;
        select * from bt_metap('idx');
        select bt.* from generate_series(1,1) as n, lateral
        bt_page_stats('idx', n) as bt;
        select n, bt.* from generate_series(1,1) as n, lateral
        bt_page_items('idx', n) as bt;

        And finally, the list of items I'm going to complete in the
        near future:
        1. Add a storage parameter 'enable_compression' for the btree
        access method, which specifies whether the index handles
        duplicates. The default is 'off'.
        2. Bring back microvacuum functionality for compressed indexes.
        3. Improve insertion speed. Insertions became significantly
        slower with the compressed btree, which is obviously not what
        we want.
        4. Clean up the code and comments, add related documentation.


    This doesn't apply cleanly against current git head.  Have you
    caught up past commit 65c5fcd35?

    Thank you for the notice. New patch is attached.


Thanks for the quick rebase.

Okay, a quick check with pgbench:

CREATE INDEX ON pgbench_accounts(bid);

Timing
Scale: master / patch
100: 10657ms / 13555ms (rechecked and got 9745ms)
500: 56909ms / 56985ms

Size
Scale: master / patch
100: 214MB / 87MB (40.7%)
500: 1071MB / 437MB (40.8%)

No performance issues from what I can tell.

I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored.

Thank you for the prompt reply. I see what you're confused about. I'll try to clarify it.

First of all, what is implemented in the patch is not actually compression. It's more a change to the index page layout that compacts ItemPointers (TIDs). Instead of TID+key, TID+key, we now store META+key+list_of_TIDs (also known as a posting list).

before:
TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key
with patch:
TID (N item pointers, posting list offset) + key, TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)

TID (N item pointers, posting list offset) is the meta information, so we have to store it in addition to the useful data.
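
To make the picture more concrete, here is a minimal sketch of how the leading "meta" item pointer of a compressed tuple can be read. The helper names and the exact field packing below (ip_blkid holding the number of TIDs, ip_posid holding the byte offset of the posting list inside the tuple) are only an illustration of the before/after picture above, not code from the patch:

#include "postgres.h"
#include "access/itup.h"
#include "storage/itemptr.h"

/* Sketch only: number of heap TIDs packed into this tuple's posting list */
static inline int
posting_tuple_nitems(IndexTuple itup)
{
    return (int) ItemPointerGetBlockNumber(&itup->t_tid);
}

/* Sketch only: pointer to the array of heap TIDs stored after the key */
static inline ItemPointer
posting_tuple_list(IndexTuple itup)
{
    return (ItemPointer) ((char *) itup +
                          ItemPointerGetOffsetNumber(&itup->t_tid));
}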

The next point is the requirement to fit at least three tuples on a page: we need at least two tuples to point to the children, plus the high key.
This requirement leads to a limit on the maximum index tuple size.

/*
 * Maximum size of a btree index entry, including its tuple header.
 *
 * We actually need to be able to fit three items on every page,
 * so restrict any one item to 1/3 the per-page available space.
 */
#define BTMaxItemSize(page) \
        MAXALIGN_DOWN((PageGetPageSize(page) - \
                       MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
                       MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
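
For the default 8 kB block size that limit works out to about 2.7 kB per tuple. A back-of-the-envelope evaluation, assuming 64-bit MAXALIGN of 8 and the current header struct sizes:

/*
 * BTMaxItemSize for BLCKSZ = 8192, MAXALIGN = 8:
 *
 *   SizeOfPageHeaderData                    = 24
 *   3 * sizeof(ItemIdData)                  = 12
 *   MAXALIGN(24 + 12)                       = 40
 *   MAXALIGN(sizeof(BTPageOpaqueData) = 16) = 16
 *
 *   MAXALIGN_DOWN((8192 - 40 - 16) / 3) = MAXALIGN_DOWN(2712) = 2712 bytes
 */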

Although, it occurred to me just now that this limit could be increased for compressed tuples, at least for leaf pages.

That's the reason why we have to store more meta information than meets the eye.

For example, suppose we have 100000 duplicates of the same key. It seems that the compression should be really significant: something like 1 meta + 1 key instead of 100000 keys, i.e. 6 bytes (the size of the meta TID) + keysize instead of 600000. But we have to split one huge posting list into smaller ones that fit into an index page.

It depends on the key size, of course. As I can see from pageinspect, an index on a single integer key has to split the tuples into pieces of 2704 bytes, containing 447 TIDs in one posting list. So we have 1 meta + 1 key per 447 keys. As you can see, that is rather less impressive than expected.
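
One plausible way the 2704-byte figure adds up (this is just back-of-the-envelope accounting, assuming the posting list starts at a MAXALIGN boundary after the int4 key; the exact cut-off is of course determined by the patch):

/*
 * Rough accounting for one posting tuple over an int4 key:
 *
 *   IndexTupleData header                    =    8 bytes
 *   int4 key, padded to the next MAXALIGN    =    8 bytes
 *   447 heap TIDs * 6 bytes                  = 2682 bytes
 *   -----------------------------------------------------
 *   2698 bytes, MAXALIGNed                   = 2704 bytes
 *
 * which is just under the ~2712-byte BTMaxItemSize limit above, so the
 * key is repeated roughly once per 447 heap rows instead of once per
 * 100000.
 */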

GIN has the idea of posting trees: the key is stored just once, and a posting list which doesn't fit on a page becomes a tree. You can find an excellent article about it here: http://www.cybertec.at/2013/03/gin-just-an-index-type/ But I think that it's not the best way for the btree AM, because a posting tree is not designed to handle concurrent insertions.

As I mentioned before, I'm going to implement prefix compression of posting lists, which should be efficient and quite simple, since it's already implemented in GIN. You can find a presentation about it here: https://www.pgcon.org/2014/schedule/events/698.en.html
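
For reference, what GIN does is delta-plus-varbyte encoding of the TIDs: each item pointer is mapped to an increasing integer, and only the difference from its predecessor is stored, 7 bits per byte. A minimal standalone sketch of that idea (the integer mapping and the function encode_posting_list() below are simplified illustrations, not the GIN code):

#include <stdint.h>
#include <stddef.h>

/*
 * Simplified illustration of GIN-style posting list compression.
 * Each TID (block, offset) is mapped to a single integer and the
 * deltas between consecutive TIDs are stored varbyte-encoded:
 * 7 data bits per byte, high bit set on continuation bytes.
 * Input TIDs must be sorted in ascending order.
 */
static size_t
encode_posting_list(const uint32_t *blk, const uint16_t *off,
                    int ntids, unsigned char *out)
{
    uint64_t prev = 0;
    size_t   len = 0;

    for (int i = 0; i < ntids; i++)
    {
        uint64_t val = ((uint64_t) blk[i] << 16) | off[i];
        uint64_t delta = val - prev;

        prev = val;

        while (delta > 0x7F)
        {
            out[len++] = (unsigned char) (0x80 | (delta & 0x7F));
            delta >>= 7;
        }
        out[len++] = (unsigned char) delta;  /* last byte: high bit clear */
    }
    return len;                              /* number of bytes written */
}

With duplicates clustered together in the heap, most deltas fit in one or two bytes, which is where the extra size reduction beyond the current patch would come from.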

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
