On 3/17/24 17:38, Andres Freund wrote:
> Hi,
> 
> On 2024-03-16 21:25:18 +0100, Tomas Vondra wrote:
>> On 3/16/24 20:12, Andres Freund wrote:
> That would address some of the worst behaviour, but it doesn't really seem to
> address the underlying problem of the two iterators being modified
>>> independently. ISTM the proper fix would be to protect the state of the
>>> iterators with a single lock, rather than pushing down the locking into the
>>> bitmap code. OTOH, we'll only need one lock going forward, so being economic
>>> in the effort of fixing this is also important.
>>>
>>
>> Can you share some details about how you identified the problem, counted
>> the prefetches that happen too late, etc? I'd like to try to reproduce
>> this to understand the issue better.
> 
> There's two aspects. Originally I couldn't reliably reproduce the regression
> with Melanie's repro on my laptop. I finally was able to do so after I
> a) changed the block device's read_ahead_kb to 0
> b) used effective_io_concurrency=1
> 
> That made the difference between the BitmapAdjustPrefetchIterator() locations
> very significant, something like 2.3s vs 12s.
> 

Interesting. I hadn't thought about read_ahead_kb, but in hindsight it
makes sense that it affects these cases. OTOH I did not set it to 0 on
either machine (the 6xSATA RAID0 has it at 12288, for example), and yet
that's how we found the regressions.
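(For anyone trying to reproduce this: readahead can be disabled per block
device, e.g. for a hypothetical /dev/sda, adjust to your device:)

```shell
# sysfs knob, value in kB
echo 0 | sudo tee /sys/block/sda/queue/read_ahead_kb

# equivalent via util-linux (value in 512-byte sectors)
sudo blockdev --setra 0 /dev/sda
```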

For eic it makes perfect sense that setting it to 1 is particularly
vulnerable to this issue - it only takes a small "desynchronization" of
the two iterators for the prefetch to "fall behind" and frequently
prefetch blocks we already read.

> Besides a lot of other things, I finally added debugging fprintfs printing the
> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
> large amount of output that generates shows that two iterators were out of
> sync.
> 

Thanks. I did experiment with fprintf, but it's quite cumbersome, so I was
hoping you'd come up with some smart way to trace this kind of stuff. For
example, I was wondering whether eBPF would be a more convenient way.
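FWIW the sort of thing I had in mind was a small bpftrace script attached to
the fadvise/pread syscalls of the postgres backends, which would give the same
(pid, prefetch/read, offset) stream without recompiling. A sketch (untested,
needs root; field names per the kernel's syscall tracepoints):

```
bpftrace -e '
  tracepoint:syscalls:sys_enter_fadvise64 /comm == "postgres"/
    { printf("%d fadvise fd=%d off=%lld\n", pid, args->fd, (int64)args->offset); }
  tracepoint:syscalls:sys_enter_pread64 /comm == "postgres"/
    { printf("%d pread   fd=%d off=%lld\n", pid, args->fd, (int64)args->pos); }
'
```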

> 
>> If I understand correctly, what may happen is that a worker reads blocks
>> from the "prefetch" iterator, but before it manages to issue the
>> posix_fadvise, some other worker already did pread. Or can the iterators
>> get "out of sync" in a more fundamental way?
> 
> I agree that the current scheme of two shared iterators being used has some
> fairly fundamental raciness. But I suspect there's more than that going on
> right now.
> 
> Moving BitmapAdjustPrefetchIterator() to later drastically increases the
> raciness because it means table_scan_bitmap_next_block() happens between
> increasing the "real" and the "prefetch" iterators.
> 
> An example scenario that, I think, leads to the iterators being out of sync,
> without there being races between iterator advancement and completing
> prefetching:
> 
> start:
>   real -> block 0
>   prefetch -> block 0
>   prefetch_pages = 0
>   prefetch_target = 1
> 
> W1: tbm_shared_iterate(real) -> block 0
> W2: tbm_shared_iterate(real) -> block 1
> W1: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 0
> W2: BitmapAdjustPrefetchIterator() -> tbm_shared_iterate(prefetch) -> 1
> W1: read block 0
> W2: read block 1
> W1: BitmapPrefetch() -> prefetch_pages++ -> 1, tbm_shared_iterate(prefetch) -> 2, prefetch block 2
> W2: BitmapPrefetch() -> nothing, as prefetch_pages == prefetch_target
> 
> W1: tbm_shared_iterate(real) -> block 2
> W2: tbm_shared_iterate(real) -> block 3
> 
> W2: BitmapAdjustPrefetchIterator() -> prefetch_pages--
> W2: read block 3
> W2: BitmapPrefetch() -> prefetch_pages++, tbm_shared_iterate(prefetch) -> 3, prefetch block 3
> 
> So afaict here we end up prefetching a block that the *same process* just had
> read.
> 

Uh, that's very weird. I'd understand it if there were some cross-process
issue, but if this happens within a single process ... strange.

> ISTM that the idea of somehow "catching up" in BitmapAdjustPrefetchIterator(),
> separately from advancing the "real" iterator, is pretty ugly for non-parallel
> BHS and just straight up broken in the parallel case.
> 

Yeah, I agree it feels like an ugly fix. It definitely seems more like
treating the symptoms than fixing the actual problem.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

