01.09.2015 21:23, Peter Geoghegan:
On Mon, Aug 31, 2015 at 12:41 AM, Anastasia Lubennikova
<a.lubennik...@postgrespro.ru> wrote:
Currently, a new B-tree index tuple must be inserted for each table row that we
index.
This can cause a page split. Because of MVCC, even a unique index can
contain duplicates.
Storing duplicates in a posting list/tree helps to avoid superfluous splits.
I'm glad someone is thinking about this, because it is certainly
needed. I thought about working on it myself, but there is always
something else to do. I should be able to assist with review, though.
Thank you)
So it seems to be a very useful improvement. Of course, it requires a lot of
changes in the B-tree implementation, so I'd like approval from the community.

1. Compatibility.
It's important to preserve compatibility with older index versions.
I'm going to change BTREE_VERSION to 3,
and use the new (posting) features for v3, keeping the old implementation for v2.
Any objections?
It might be better to just have a flag bit for pages that are
compressed -- there are IIRC 8 free bits in the B-Tree page special
area flags variable. But no real opinion on this from me, yet. You
have plenty of bitspace to work with to mark B-Tree pages, in any
case.

Hmm... if we are talking about storing duplicates in posting lists (and trees) as in GIN, I don't see a way to apply it to some pages while not applying it to others. See some notes below.

2. There are several tricks to handle non-unique keys in B-tree.
More info in btree readme (chapter - Differences to the Lehman & Yao
algorithm).
In the new version they'll become useless. Am I right?
I think that the L&Y algorithm makes assumptions for the sake of
simplicity, rather than because they really believed that there were
real problems. For example, they say that deletion can occur offline
or something along those lines, even though that's clearly
impractical. They say that because they didn't want to write a paper
about deletion within B-Trees, I suppose.

See also, my opinion of how they claim to not need read locks [1].
Also, note that despite the fact that the GIN README mentions "Lehman
& Yao style right links", it doesn't actually do the L&Y trick of
avoiding lock coupling -- the whole point of L&Y -- so that remark is
misleading. This must be why B-Tree has much better concurrency than
GIN in practice.

Yes, thanks for the extensive explanation.
I mean tricks such as moving right in _bt_findinsertloc(), for example.

/*----------
     * If we will need to split the page to put the item on this page,
     * check whether we can put the tuple somewhere to the right,
     * instead.  Keep scanning right until we
     *        (a) find a page with enough free space,
     *        (b) reach the last page where the tuple can legally go, or
     *        (c) get tired of searching.
     * (c) is not flippant; it is important because if there are many
     * pages' worth of equal keys, it's better to split one of the early
     * pages than to scan all the way to the end of the run of equal keys
     * on every insert.  We implement "get tired" as a random choice,
     * since stopping after scanning a fixed number of pages wouldn't work
     * well (we'd never reach the right-hand side of previously split
     * pages).  Currently the probability of moving right is set at 0.99,
     * which may seem too high to change the behavior much, but it does an
     * excellent job of preventing O(N^2) behavior with many equal keys.
     *----------
     */

If there are no multiple tuples with the same key, we needn't care about this at all. It would be possible to skip these steps in the "effective B-tree implementation". That's why I want to change BTREE_VERSION.

  So I'm really talking about a slightly
different thing -- prefix compression, rather than handling
duplicates. Whether or not you should do prefix compression instead of
deduplication is certainly not clear to me, but it should be
considered. Also, I always imagined that prefix compression would use
the highkey as the thing that is offset for each "real" IndexTuple,
because it's there anyway, and that's simple. However, I suppose that
that means that duplicate handling can't really work in a way that
makes duplicates have a fixed cost, which may be a particularly
important property to you.

You're right, those are two different techniques.
1. Effective storage of duplicates, which I propose, works with equal keys and allows us to avoid repeating them.
Index tuples are stored like this:

IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key) | IndexTupleData + Attrs (key)

If all Attrs are equal, it seems reasonable not to repeat them. So we can store them in the following structure:

MetaData + Attrs (key) | IndexTupleData | IndexTupleData | IndexTupleData

This is a posting list. It doesn't require significant changes to the index page layout, because we can use an ordinary IndexTupleData for the meta information. Each IndexTupleData has a fixed size, so it's easy to handle the posting list as an array.

2. Prefix compression handles different keys and somehow compresses them.
I think that it will require non-trivial changes to the btree index tuple representation. Furthermore, any compression leads to extra computation. Right now, I don't have a clear idea of how to implement this technique.

* Currently, B-Tree must be able to store at least 3 items on each
page, for the benefit of the L&Y algorithm. You need room for 1
"highkey", plus 2 downlink IndexTuples. Obviously an internal B-Tree
page is redundant if you cannot get to any child page based on the
scanKey value differing one way or the other (so 2 downlinks are a
sensible minimum), plus a highkey is usually needed (just not on the
rightmost page). As you probably know, we enforce this by making sure
every IndexTuple is no more than 1/3 of the size that will fit.
That is the point where a too-big posting list would turn into a posting tree. But I think that in the first patch I'll do it another way: just by splitting a long posting list into 2 lists of appropriate length.

* Since everything is aligned within B-Tree, it's probably worth
considering the alignment boundaries when doing prefix compression, if
you want to go that way. We can probably imagine a world where
alignment is not required for B-Tree, which would work on x86
machines, but I can't see it happening soon. It isn't worth
compressing unless it compresses enough to cross an "alignment
boundary", where we're not actually obliged to store as much data on
disk. This point may be obvious, not sure.

That is another reason why I have doubts about prefix compression, whereas effective duplicate storage doesn't have this problem.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
