Re: BitmapHeapScan streaming read user and prelim refactoring

Melanie Plageman Sat, 02 Mar 2024 15:40:02 -0800

On Sat, Mar 2, 2024 at 5:51 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> On 3/2/24 23:11, Melanie Plageman wrote:
> > On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> >>
> >> ...
> >>
> >> Hold the phone on this one. I realized why I moved
> >> BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in
> >> the first place -- master calls BitmapAdjustPrefetchIterator after the
> >> tbm_iterate() for the current block -- otherwise with eic = 1, it
> >> considers the prefetch iterator behind the current block iterator. I'm
> >> going to go through and figure out what order this must be done in and
> >> fix it.
> >
> > So, I investigated this further, and, as far as I can tell, for
> > parallel bitmapheapscan the timing around when workers decrement
> > prefetch_pages causes the performance differences with patch 0010
> > applied. It makes very little sense to me, but some of the queries I
> > borrowed from your regression examples are up to 30% slower when this
> > code from BitmapAdjustPrefetchIterator() is after
> > table_scan_bitmap_next_block() instead of before it.
> >
> >         SpinLockAcquire(&pstate->mutex);
> >         if (pstate->prefetch_pages > 0)
> >             pstate->prefetch_pages--;
> >         SpinLockRelease(&pstate->mutex);
> >
> > I did some stracing and did see much more time spent in futex/wait
> > with this code after the call to table_scan_bitmap_next_block() vs
> > before it. (table_scan_bitmap_next_block()) calls ReadBuffer()).
> >
> > In my branch, I've now moved only the parallel prefetch_pages-- code
> > to before table_scan_bitmap_next_block().
> > https://github.com/melanieplageman/postgres/tree/bhs_pgsr
> > I'd be interested to know if you see the regressions go away with 0010
> > applied (commit message "Make table_scan_bitmap_next_block() async
> > friendly" and sha bfdcbfee7be8e2c461).
> >
>
> I'll give this a try once the runs with MAX_BUFFERS_PER_TRANSFER=1
> complete. But it seems really bizarre that simply moving this code a
> little bit would cause such a regression ...


Yes, it is bizarre. It also might not be a reproducible performance
difference on the cases besides the one I was testing (cyclic dataset,
uncached, eic=8, matches 16+, distinct=100, rows=100000000, 4 parallel
workers). But even if it only affects that one case, it still had a
major, reproducible performance impact to move those 5 lines before
and after table_scan_bitmap_next_block().

The same number of reads and fadvises are being issued overall.
However, I did notice that the pread calls are skewed when the those
lines of code are after table_scan_bitmap_next_block()  -- fewer of
the workers are doing more of the reads. Perhaps this explains what is
taking longer. Why those workers would end up doing more of the reads,
I don't quite know.

- Melanie

Re: BitmapHeapScan streaming read user and prelim refactoring

Reply via email to