On Sat, Mar 2, 2024 at 5:51 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote: > > On 3/2/24 23:11, Melanie Plageman wrote: > > On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman > > <melanieplage...@gmail.com> wrote: > >> > >> ... > >> > >> Hold the phone on this one. I realized why I moved > >> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in > >> the first place -- master calls BitmapAdjustPrefetchIterator after the > >> tbm_iterate() for the current block -- otherwise with eic = 1, it > >> considers the prefetch iterator behind the current block iterator. I'm > >> going to go through and figure out what order this must be done in and > >> fix it. > > > > So, I investigated this further, and, as far as I can tell, for > > parallel bitmapheapscan the timing around when workers decrement > > prefetch_pages causes the performance differences with patch 0010 > > applied. It makes very little sense to me, but some of the queries I > > borrowed from your regression examples are up to 30% slower when this > > code from BitmapAdjustPrefetchIterator() is after > > table_scan_bitmap_next_block() instead of before it. > > > > SpinLockAcquire(&pstate->mutex); > > if (pstate->prefetch_pages > 0) > > pstate->prefetch_pages--; > > SpinLockRelease(&pstate->mutex); > > > > I did some stracing and did see much more time spent in futex/wait > > with this code after the call to table_scan_bitmap_next_block() vs > > before it. (table_scan_bitmap_next_block()) calls ReadBuffer()). > > > > In my branch, I've now moved only the parallel prefetch_pages-- code > > to before table_scan_bitmap_next_block(). > > https://github.com/melanieplageman/postgres/tree/bhs_pgsr > > I'd be interested to know if you see the regressions go away with 0010 > > applied (commit message "Make table_scan_bitmap_next_block() async > > friendly" and sha bfdcbfee7be8e2c461). > > > > I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1 > complete. But it seems really bizarre that simply moving this code a > little bit would cause such a regression ...
Yes, it is bizarre. It also might not be a reproducible performance difference on the cases besides the one I was testing (cyclic dataset, uncached, eic=8, matches 16+, distinct=100, rows=100000000, 4 parallel workers). But even if it only affects that one case, it still had a major, reproducible performance impact to move those 5 lines before and after table_scan_bitmap_next_block(). The same number of reads and fadvises are being issued overall. However, I did notice that the pread calls are skewed when the those lines of code are after table_scan_bitmap_next_block() -- fewer of the workers are doing more of the reads. Perhaps this explains what is taking longer. Why those workers would end up doing more of the reads, I don't quite know. - Melanie