Obviously, database benchmarking is basically a silly idea, because every 
workload, every hardware configuration, every schema are different, with very 
different consequences.

Still, I was left with wondering when one should choose a BTree vs GIST or GIN 
(I didn’t even try to look at SP-GIST: any thoughts?).

The documentation on heuristics for when to choose which doesn’t say exactly 
but suggests that one should just choose a BTree.

But what made me curious about the alternatives is this: you have to use a 
compound BTree index left-to-right (you can use the first column, or the first 
and second, or first, second and third…), whereas GIST and GIN both support 
using arbitrary fields. That right there is mighty compelling for any kind of 
ad-hoc query scenario. So I thought I’d do some testing about insert and query 
speed.

I made a table with six integer columns but no indexes, and made copies with a 
regular BT primary key, and then an index on all the columns, in separate 
tables using either a BTree, a GIST or a GIN index. The original table I 
generated 10 million rows of random values in the range 1..1 million.

This was on 9.4RC1 on a recent-model iMac. I used an external 2.5” hard drive 
so the SSD wasn’t mucking things up. I restarted Postgres (but not the 
computer) between tests. I tested a different range of values, of the same 
size, when comparing searches, in order to mostly avoid having the MacOS disk 
cache helping things out. I also tried running multiple versions of the search 
tests on different ranges in different orders, just to see if the comparison of 
search times was roughly consistent, which it was.

The results were… surprising, and don’t seem to accord with what the 
documentation says.

I was interested in insert speed. The docs suggest that GIN would be 
appallingly slow, but don’t really talk about GIST vs BTree. When I configured 
reasonable caches, working mem and such, I got the following times for insert… 
select * (values in ms; columns are BTree-GIST-GIN):

1,233,570
2,323,639
1,700,752

GIST was slower than GIN (which flat contradicts the documentation), but both 
are in shouting distance of BTree. When I tuned down the caches, to simulate a 
machine under load, I got:

20,252,299
3,583,346
10,495,424

So wow, BTree is much slower. This is basically a default setup, with tiny 
cache and working memory and whatnot.

Searching on the first column is just as interesting. With decent caches, find 
100 values (so about a thousand rows):

936
508
2,215

So this result is interesting. GIN is *slowest*, and in particular is slower 
than GIST by a good margin, whereas the docs suggest GIN should always be 
faster. GIST is significantly faster than BTree.

Now finding 1,000 values (so 10,000 rows, give or take):

4,148
1,302
3,078

GIST is fastest again, but now GIN is faster than BTree.

Again, but 10,000 values/100,000 rows:

34,767
8,104
8,995

So in a memory-rich environment, GIST appears to be the clear winner. When I 
turn down the memory values, though, I get a very different result.

100 values/1000 rows:

7,931
7,209
1,034

1000 values/10,000 rows:

81,675
88,191
1,598

I tried that one several times, because I found it hard to believe. The results 
were consistent. With 10,000 values/100,000 rows, the results are even starker.

Index sizes in pages:

258,578
309,636
581,795

GIN is certainly not the “three times” size suggested in the docs, but perhaps 
that just hasn’t been updated for the 9.4 improvements. Certainly, there isn’t 
sufficient difference here to make the BTree advantage compelling in most 
applications.

Given the futility of database benchmarking in general, I didn’t want to go any 
further with this. What I was interested in was whether it might be worth 
switching from BTree to GIST/GIN indexes with regular sorts of data. It appears 
to be the case that GIST and GIN are often better than BTree in general, and 
given their much greater flexibility in satisfying queries on different 
columns, it might even be the case that one should recommend a single GIST or 
GIN index on the frequently-searched columns of a table in most cases?

I would absolutely *love* to hear what this community has to say about this 
question: should one consider GIST or GIN indexes on regular old numeric/text 
columns? When, theoretically? When, in practice (ie does anyone have comparable 
benchmarks on 9.4?). Other thoughts?

Reply via email to