On 28 January 2016 at 17:03, Thom Brown <t...@linux.com> wrote:
>
> On 28 January 2016 at 16:12, Anastasia Lubennikova <
> a.lubennik...@postgrespro.ru> wrote:
>
>> 28.01.2016 18:12, Thom Brown:
>>
>> On 28 January 2016 at 14:06, Anastasia Lubennikova <
>> a.lubennik...@postgrespro.ru> wrote:
>>
>>> 31.08.2015 10:41, Anastasia Lubennikova:
>>>
>>> Hi, hackers!
>>> I'm going to begin work on effective storage of duplicate keys in
>>> B-tree indexes.
>>> The main idea is to implement posting lists and posting trees for
>>> B-tree index pages, as is already done for GIN.
>>>
>>> In a nutshell, effective storage of duplicates in GIN is organised as
>>> follows. The index stores a single index tuple for each unique key.
>>> That index tuple points to a posting list, which contains pointers to
>>> heap tuples (TIDs). If too many rows have the same key, multiple pages
>>> are allocated for the TIDs, and these constitute a so-called posting
>>> tree.
>>> You can find wonderfully detailed descriptions in the GIN readme
>>> <https://github.com/postgres/postgres/blob/master/src/backend/access/gin/README>
>>> and articles <http://www.cybertec.at/gin-just-an-index-type/>.
>>> This also makes it possible to apply a compression algorithm to the
>>> posting list/tree and significantly decrease the index size. Read more
>>> in the presentation (part 1)
>>> <http://www.pgcon.org/2014/schedule/attachments/329_PGCon2014-GIN.pdf>.
>>>
>>> Currently, a new B-tree index tuple must be inserted for each table
>>> row that we index, which can cause page splits. Because of MVCC, even
>>> a unique index can contain duplicates.
>>> Storing duplicates in a posting list/tree helps to avoid superfluous
>>> splits.
>>>
>>> I'd like to share the progress of my work. So here is a WIP patch.
>>> It provides effective duplicate handling using posting lists, the same
>>> way GIN does it.
>>>
>>> The layout of the tuples on the page is changed in the following way:
>>>
>>> before:
>>> TID (ip_blkid, ip_posid) + key, TID (ip_blkid, ip_posid) + key,
>>> TID (ip_blkid, ip_posid) + key
>>>
>>> with patch:
>>> TID (N item pointers, posting list offset) + key, TID (ip_blkid,
>>> ip_posid), TID (ip_blkid, ip_posid), TID (ip_blkid, ip_posid)
>>>
>>> It seems that backward compatibility works well without any changes,
>>> but I haven't tested it properly yet.
>>>
>>> Here are some test results. They were obtained with the test functions
>>> test_btbuild and test_ginbuild, which you can find in the attached SQL
>>> file.
>>> i is the number of distinct values in the index, so i=1 means that all
>>> rows have the same key, and i=10000000 means that all keys are
>>> different. The other columns contain the index size in MB.
>>>
>>> i          B-tree Old   B-tree New    GIN
>>> 1          214.234375    87.7109375   10.2109375
>>> 10         214.234375    87.7109375   10.71875
>>> 100        214.234375    87.4375      15.640625
>>> 1000       214.234375    86.2578125   31.296875
>>> 10000      214.234375    78.421875   104.3046875
>>> 100000     214.234375    65.359375    49.078125
>>> 1000000    214.234375    90.140625   106.8203125
>>> 10000000   214.234375   214.234375   534.0625
>>>
>>> Note that the last row contains the same index sizes for the old and
>>> new B-tree, which is quite logical - there is no compression if all
>>> the keys are distinct. The other cases look really nice to me.
>>> The next thing to say is that I haven't implemented posting list
>>> compression yet, so there is still potential to decrease the size of
>>> the compressed B-tree.
>>>
>>> I'm almost sure there are still some tiny bugs and missing functions,
>>> but on the whole, the patch is ready for testing.
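
[Below is a minimal sketch of the kind of size test summarised in the
table above, assuming a hypothetical single-column table; the actual
test_btbuild/test_ginbuild functions were in the attached SQL file and
are not reproduced here. With e.g. i = 1000, random() yields roughly
1000 distinct keys:

create table test (v int4);
insert into test
  select (random() * 1000)::int4          -- ~i distinct values
  from generate_series(1, 10000000);      -- 10,000,000 rows total
create index idx on test (v);
select pg_size_pretty(pg_relation_size('idx'));
]
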
>>> I'd like to get feedback about testing the patch on some real
>>> datasets. Any bug reports and suggestions are welcome.
>>>
>>> Here are a couple of useful queries to inspect the data inside the
>>> index pages:
>>>
>>> create extension pageinspect;
>>> select * from bt_metap('idx');
>>> select bt.* from generate_series(1,1) as n,
>>>   lateral bt_page_stats('idx', n) as bt;
>>> select n, bt.* from generate_series(1,1) as n,
>>>   lateral bt_page_items('idx', n) as bt;
>>>
>>> And finally, the list of items I'm going to complete in the near
>>> future:
>>> 1. Add a storage parameter, 'enable_compression', for the btree
>>>    access method, which specifies whether the index handles
>>>    duplicates. The default is 'off'.
>>> 2. Bring back microvacuum functionality for compressed indexes.
>>> 3. Improve insertion speed. Insertions became significantly slower
>>>    with the compressed btree, which is obviously not what we want.
>>> 4. Clean up the code and comments, and add related documentation.
>>
>> This doesn't apply cleanly against current git head. Have you caught
>> up past commit 65c5fcd35?
>>
>> Thank you for the notice. New patch is attached.
>
> Thanks for the quick rebase.
>
> Okay, a quick check with pgbench:
>
> CREATE INDEX ON pgbench_accounts(bid);
>
> Timing
> Scale: master / patch
> 100: 10657ms / 13555ms (rechecked and got 9745ms)
> 500: 56909ms / 56985ms
>
> Size
> Scale: master / patch
> 100: 214MB / 87MB (40.7%)
> 500: 1071MB / 437MB (40.8%)
>
> No performance issues from what I can tell.
>
> I'm surprised that efficiencies can't be realised beyond this point.
> Your results show a sweet spot at around 1000 / 10000000, with it
> getting slightly worse beyond that. I kind of expected a lot of
> efficiency where all the values are the same, but perhaps that's due
> to my lack of understanding regarding the way they're being stored.
>
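[For reference, a minimal sketch of reproducing the size check above,
assuming a database initialised with "pgbench -i -s 100" and the index
name PostgreSQL generates by default:

create index on pgbench_accounts (bid);
select pg_size_pretty(pg_relation_size('pgbench_accounts_bid_idx'));
]
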
Okay, now for some badness. I've restored a database containing 2
tables, one 318MB, another 24kB. The 318MB table contains 5 million
rows with a sequential id column. I get a problem if I try to delete
many rows from it:

# delete from contacts where id % 3 != 0 ;
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory
WARNING:  out of shared memory

The query completes, but I get this message a lot before it does.

This happens even if I drop the primary key and foreign key
constraints, so somehow the memory usage has massively increased with
this patch.

Thom
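
[A minimal sketch of reproducing the warnings above, assuming a
hypothetical contacts table matching the description (5 million rows
with a sequential id column); the original tables came from a restored
dump:

create table contacts (id int primary key);
insert into contacts select g from generate_series(1, 5000000) g;
-- with the patch applied, this reportedly emits many
-- "out of shared memory" warnings before completing:
delete from contacts where id % 3 != 0;
]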