On 28 January 2016 at 16:12, Anastasia Lubennikova < a.lubennik...@postgrespro.ru> wrote:
> > 28.01.2016 18:12, Thom Brown: > > On 28 January 2016 at 14:06, Anastasia Lubennikova < > a.lubennik...@postgrespro.ru> wrote: > >> >> 31.08.2015 10:41, Anastasia Lubennikova: >> >> Hi, hackers! >> I'm going to begin work on effective storage of duplicate keys in B-tree >> index. >> The main idea is to implement posting lists and posting trees for B-tree >> index pages as it's already done for GIN. >> >> In a nutshell, effective storing of duplicates in GIN is organised as >> follows. >> Index stores single index tuple for each unique key. That index tuple >> points to posting list which contains pointers to heap tuples (TIDs). If >> too many rows having the same key, multiple pages are allocated for the >> TIDs and these constitute so called posting tree. >> You can find wonderful detailed descriptions in gin readme >> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README> >> and articles <http://www.cybertec.at/gin-just-an-index-type/>. >> It also makes possible to apply compression algorithm to posting >> list/tree and significantly decrease index size. Read more in presentation >> (part 1) >> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>. >> >> Now new B-tree index tuple must be inserted for each table row that we >> index. >> It can possibly cause page split. Because of MVCC even unique index could >> contain duplicates. >> Storing duplicates in posting list/tree helps to avoid superfluous splits. >> >> >> I'd like to share the progress of my work. So here is a WIP patch. >> It provides effective duplicate handling using posting lists the same way >> as GIN does it. >> >> Layout of the tuples on the page is changed in the following way: >> before: >> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key, TID >> (ip_blkid, ip_posid) + key >> with patch: >> TID (N item pointers, posting list offset) + key, TID (ip_blkid, >> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid) >> >> It seems that backward compatibility works well without any changes. But >> I haven't tested it properly yet. >> >> Here are some test results. They are obtained by test functions >> test_btbuild and test_ginbuild, which you can find in attached sql file. >> i - number of distinct values in the index. So i=1 means that all rows >> have the same key, and i=10000000 means that all keys are different. >> The other columns contain the index size (MB). >> >> i B-tree Old B-tree New GIN >> 1 214,234375 87,7109375 10,2109375 >> 10 214,234375 87,7109375 10,71875 >> 100 214,234375 87,4375 15,640625 >> 1000 214,234375 86,2578125 31,296875 >> 10000 214,234375 78,421875 104,3046875 >> 100000 214,234375 65,359375 49,078125 >> 1000000 214,234375 90,140625 106,8203125 >> 10000000 214,234375 214,234375 534,0625 >> You can note that the last row contains the same index sizes for B-tree, >> which is quite logical - there is no compression if all the keys are >> distinct. >> Other cases looks really nice to me. >> Next thing to say is that I haven't implemented posting list compression >> yet. So there is still potential to decrease size of compressed btree. >> >> I'm almost sure, there are still some tiny bugs and missed functions, but >> on the whole, the patch is ready for testing. >> I'd like to get a feedback about the patch testing on some real datasets. >> Any bug reports and suggestions are welcome. >> >> Here is a couple of useful queries to inspect the data inside the index >> pages: >> create extension pageinspect; >> select * from bt_metap('idx'); >> select bt.* from generate_series(1,1) as n, lateral bt_page_stats('idx', >> n) as bt; >> select n, bt.* from generate_series(1,1) as n, lateral >> bt_page_items('idx', n) as bt; >> >> And at last, the list of items I'm going to complete in the near future: >> 1. Add storage_parameter 'enable_compression' for btree access method >> which specifies whether the index handles duplicates. default is 'off' >> 2. Bring back microvacuum functionality for compressed indexes. >> 3. Improve insertion speed. Insertions became significantly slower with >> compressed btree, which is obviously not what we do want. >> 4. Clean the code and comments, add related documentation. >> > > This doesn't apply cleanly against current git head. Have you caught up > past commit 65c5fcd35? > > > Thank you for the notice. New patch is attached. > Thanks for the quick rebase. Okay, a quick check with pgbench: CREATE INDEX ON pgbench_accounts(bid); Timing Scale: master / patch 100: 10657ms / 13555ms (rechecked and got 9745ms) 500: 56909ms / 56985ms Size Scale: master / patch 100: 214MB / 87MB (40.7%) 500: 1071MB / 437MB (40.8%) No performance issues from what I can tell. I'm surprised that efficiencies can't be realised beyond this point. Your results show a sweet spot at around 1000 / 10000000, with it getting slightly worse beyond that. I kind of expected a lot of efficiency where all the values are the same, but perhaps that's due to my lack of understanding regarding the way they're being stored. Thom