Hello Justin, Thanks for the quick response!
The table may be dense, but the tuples aren't. You're asking to return > 1/1000th of the tuples, across the entire table. Suppose there are ~100 > tuples per page, and you need to read about every 10th page. It makes > sense that it's slow to read a large amount of data nonsequentially. Ah, of course, you're right! I forgot that the BRIN indexes store ranges that are not fully covered by the row values and that PostgreSQL has to double-check (bitmap heap scan) ... Would you thus advise to only use BRIN indexes for columns who's values are (1) monotonically increasing but also (2) close to each other? I guess that in my use case, something like a roaring bitmap would have been perfect but I do not believe that those exist in PostgreSQL by default. Btrees work well performance wise but are simply too large (> 400MB for 20M rows) even when the index fill factor is 100% (+/- 380 MB) for my use case as I need to index around 6B rows partitioned in roughly 3K buckets which would result in ~120GB of Btree indexes which seems a bit much for simple existence checks. Because it's necessary to check if the tuple is visible to the current > transaction. It might be from an uncommited/aborted transaction. Interesting, that's something I did not consider indeed. I'm guessing that this is a cost brought by MVCC that you can't get around no matter the isolation level? Mickael On Fri, Feb 24, 2023 at 6:19 PM Justin Pryzby <pry...@telsasoft.com> wrote: > On Fri, Feb 24, 2023 at 05:40:55PM +0100, Mickael van der Beek wrote: > > Hello everyone, > > > > I'm playing around with BRIN indexes so as to get a feel for the feature. > > During my tests, I was unable to make BRIN indexes perform better than a > > sequential scan for queries searching for large value sets (20K values in > > the example down below). > > > And now let's query 20K random rows from our 20M total rows: > > I didn't try your test, but I think *random* is the problem/explanation. > > > By default, this query will not use the BRIN index and simply run a 1.5s > > long sequential scan (hitting 700 MB) and a 2.47s hash join for a total > > 8.7s query time: > > https://explain.dalibo.com/plan/46c3191g8a6c1bc7 > > > If we force the use of the BRIN index using (`SET LOCAL enable_seqscan = > > OFF;`) the same query will now take 50s with 2.5s spent on the bitmap > index > > scan (hitting 470 MB of data) and a whopping 42s on the bitmap heap scan > > (hitting 20 GB of data!): > > https://explain.dalibo.com/plan/7f73bg9172a8b226 > > That means the planner's cost model correctly preferred a seq scan. > > > So I had the following two questions: > > > > 1. Why is the BRIN index systematically worse than a sequential scan, > > even when the table is x1000 larger than the search set, physically > > pre-sorted, dense (fillfactor at 100%) and the search rows are > themselves > > sorted? > > The table may be dense, but the tuples aren't. You're asking to return > 1/1000th of the tuples, across the entire table. Suppose there are ~100 > tuples per page, and you need to read about every 10th page. It makes > sense that it's slow to read a large amount of data nonsequentially. > That's why random_page_cost is several times higher than seq_page_cost. > > I would expect brin to win if the pages to be accessed were dense rather > than distributed across the whole table. > > > 2. Since we only select the "idx" column, why does the BRIN index not > > simply return the searched value if included in one of it's ranges? > > Hitting the actual row data stored in the table seems to be unnessary > no? > > Because it's necessary to check if the tuple is visible to the current > transaction. It might be from an uncommited/aborted transaction. > > -- > Justin > -- Mickael van der BeekWeb developer & Security analyst mickael.van.der.b...@gmail.com