Re: BRIN index worse than sequential scan for large search set

Mickael van der Beek Fri, 24 Feb 2023 09:51:32 -0800

Hello Justin,

Thanks for the quick response!


The table may be dense, but the tuples aren't.  You're asking to return
> 1/1000th of the tuples, across the entire table.  Suppose there are ~100
> tuples per page, and you need to read about every 10th page.  It makes
> sense that it's slow to read a large amount of data nonsequentially.


Ah, of course, you're right!
I forgot that the BRIN indexes store ranges that are not fully covered by
the row values and that PostgreSQL has to double-check (bitmap heap scan)
...
Would you thus advise to only use BRIN indexes for columns who's values are
(1) monotonically increasing but also (2) close to each other?

I guess that in my use case, something like a roaring bitmap would have
been perfect but I do not believe that those exist in PostgreSQL by default.
Btrees work well performance wise but are simply too large (> 400MB for 20M
rows) even when the index fill factor is 100% (+/- 380 MB) for my use case
as I need to index around 6B rows partitioned in roughly 3K buckets which
would result in ~120GB of  Btree indexes which seems a bit much for simple
existence checks.

Because it's necessary to check if the tuple is visible to the current
> transaction.  It might be from an uncommited/aborted transaction.


Interesting, that's something I did not consider indeed.
I'm guessing that this is a cost brought by MVCC that you can't get around
no matter the isolation level?

Mickael


On Fri, Feb 24, 2023 at 6:19 PM Justin Pryzby <pry...@telsasoft.com> wrote:

> On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote:
> > Hello everyone,
> >
> > I'm playing around with BRIN indexes so as to get a feel for the feature.
> > During my tests, I was unable to make BRIN indexes perform better than a
> > sequential scan for queries searching for large value sets (20K values in
> > the example down below).
>
> > And now let's query 20K random rows from our 20M total rows:
>
> I didn't try your test, but I think *random* is the problem/explanation.
>
> > By default, this query will not use the BRIN index and simply run a 1.5s
> > long sequential scan (hitting 700 MB) and a 2.47s hash join for a total
> > 8.7s query time:
> > https://explain.dalibo.com/plan/46c3191g8a6c1bc7
>
> > If we force the use of the BRIN index using (`SET LOCAL enable_seqscan =
> > OFF;`) the same query will now take 50s with 2.5s spent on the bitmap
> index
> > scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan
> > (hitting 20 GB of data!):
> > https://explain.dalibo.com/plan/7f73bg9172a8b226
>
> That means the planner's cost model correctly preferred a seq scan.
>
> > So I had the following two questions:
> >
> >    1. Why is the BRIN index systematically worse than a sequential scan,
> >    even when the table is x1000 larger than the search set, physically
> >    pre-sorted, dense (fillfactor at 100%) and the search rows are
> themselves
> >    sorted?
>
> The table may be dense, but the tuples aren't.  You're asking to return
> 1/1000th of the tuples, across the entire table.  Suppose there are ~100
> tuples per page, and you need to read about every 10th page.  It makes
> sense that it's slow to read a large amount of data nonsequentially.
> That's why random_page_cost is several times higher than seq_page_cost.
>
> I would expect brin to win if the pages to be accessed were dense rather
> than distributed across the whole table.
>
> >    2. Since we only select the "idx" column, why does the BRIN index not
> >    simply return the searched value if included in one of it's ranges?
> >    Hitting the actual row data stored in the table seems to be unnessary
> no?
>
> Because it's necessary to check if the tuple is visible to the current
> transaction.  It might be from an uncommited/aborted transaction.
>
> --
> Justin
>


-- 
Mickael van der BeekWeb developer & Security analyst

mickael.van.der.b...@gmail.com

Re: BRIN index worse than sequential scan for large search set

Reply via email to