On 11/1/21 21:06, Robert Haas wrote:
On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
If I get what you propose, you want to have a "top" tree for (job, nlp,
year), which "splits" the data set into subsets of ~5000-7000 rows. And
then for each subset you want a separate "small" trees on each of the
other columns, so in this case three trees.

Well, the problem with this is pretty obvious - each of the small trees
requires separate copies of the leaf pages. And remember, in a btree the
internal pages are usually less than 1% of the index, so this pretty
much triples the size of the index. And if you insert a row into the
index, it has to insert the item pointer into each of the small trees,
likely requiring a separate I/O for each.

So I'd bet this is not any different from just having three separate
indexes - it doesn't save space, doesn't save I/O, nothing.

I agree. In a lot of cases, it's actually useful to define the index
on fewer columns, like (job, nlp, year) or even just (job, nlp) or
even just (job) because it makes the index so much smaller and that's
pretty important. If you have enough duplicate entries in a (job, nlp,
year) index to justify create a (job, nlp, year, sequence) index, the
number of distinct (job, nlp, year) tuples has to be small compared to
the number of (job, nlp, year, sequence) tuples - and that means that
you wouldn't actually save much by combining your (job, nlp, year,
sequence) index with a (job, nlp, year, other-stuff) index. As you
say, the internal pages aren't the problem.

I don't intend to say that there's no possible use for this kind of
technology. Peter G. says that people are writing research papers
about that and they probably wouldn't be doing that unless they'd
found some case where it's a big win. But this example seems extremely
unconvincing.


I actually looked at the use case mentioned by Peter G, i.e. merged indexes with master-detail clustering (see e.g. [1]), but that seems like a rather different thing. The master-detail refers to storing rows from multiple tables, interleaved in a way that allows faster joins. So it's essentially a denormalization tool.

Perhaps there's something we could learn about efficient storage of the small trees, but I haven't found any papers describing that (I haven't spent much time on the search, though).

[1] Algorithms for merged indexes, Goetz Graefe
    https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.7709


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Reply via email to