On 1/15/24 21:42, Konstantin Knizhnik wrote:
> 
> On 15/01/2024 5:08 pm, Tomas Vondra wrote:
>>
>> My patch does not care about prefetching internal index pages. Yes, it's
>> a limitation, but my assumption is the internal pages are maybe 0.1% of
>> the index, and typically very hot / cached. Yes, if the index is not
>> used very often, this may be untrue. But I consider it a possible future
>> improvement, for some other patch. FWIW there's a prefetching patch for
>> inserts into indexes (which only prefetches just the index leaf pages).
> 
> We have to prefetch pages at the height-1 level (parents of leaf pages) for
> IOS, because otherwise the prefetch pipeline is broken at each transition to
> the next leaf page.
> When we start with a new leaf page we have to fill the prefetch ring from
> scratch, which certainly has a negative impact on performance.
> 

By "broken" you mean that you prefetch items only from a single leaf
page, so immediately after reading the next one nothing is prefetched.
Correct? Yeah, I had this problem initially too, when I did the
prefetching in the index AM code. One of the reasons why it got moved to
the executor.

> 
>> Not sure I understand what this is about. The patch simply calls the
>> index AM function index_getnext_tid() enough times to fill the prefetch
>> queue. It does not prefetch the next index leaf page, it however does
>> prefetch the heap pages. It does not "stall" at the boundary of the
>> index leaf page, or something.
> 
> Ok, now I fully understand your approach. Looks really elegant and works
> for all indexes.
> There is still an issue with IOS and seqscan.
> 

Not sure. For seqscan, I think this has nothing to do with it. Postgres
relies on OS read-ahead to do the work - of course, if that doesn't work
(e.g. for async/direct I/O that'd be the case), an improvement will be
needed. But it's unrelated to this patch, and I'm certainly not saying
this patch does that. I think Thomas/Andres did some work on that.

For IOS, I think the limitation that this does not prefetch any index
pages (especially the leafs) is there, and it'd be nice to do something
about it. But I see it as a separate thing, which I think does need to
happen in the index AM layer (not in the executor).

> 
> 
>>
>>> Another challenge - how far should we prefetch (as far as I
>>> understand, both your and our approaches use a dynamically extended
>>> prefetch window)
>>>
>> By dynamic extension of prefetch window you mean the incremental growth
>> of the prefetch distance from 0 to effective_io_concurrency?
> 
> Yes
> 
>> I don't
>> think there's a better solution.
> 
> I tried one more solution: propagate information about the expected number
> of fetched rows to the AM. Based on this information it is possible to
> choose a proper prefetch distance.
> Certainly it is not quite precise: we may scan a large number of rows but
> filter out most of them. This is why this approach was not committed in
> Neon.
> But I still think that using statistics for determining the prefetch window
> is not such a bad idea. Maybe it just needs more thought.
> 

I don't think we should rely on this information too much. It's far too
unreliable - especially the planner estimates. The run-time data may be
more accurate, but I'm worried it may be quite variable (e.g. for
different runs of the scan).

My position is to keep this as simple as possible, and prefer to be more
conservative when possible - that is, shorter prefetch distances. In my
experience the benefit of prefetching is subject to diminishing returns,
i.e. going from 0 => 16 is way bigger difference than 16 => 32. So
better to stick with a lower value instead of wasting resources.

> 
>>
>> There might be additional information that we could consider (e.g.
>> expected number of rows for the plan, earlier executions of the scan,
>> ...) but each of these has a failure mode.
> 
> I wrote the reply above before reading the next fragment :)
> So I have already tried it.
> 
>> I haven't tried with pgvector, but I don't see why my patch would not
>> work for all index AMs that can return TIDs.
> 
> 
> Yes, I agree. But it will be efficient only if getting the next TID is
> cheap - i.e. it is located on the same leaf page.
> 

Maybe. I haven't tried/thought about it, but yes - if it requires doing
a lot of work in between the prefetches, the benefits of prefetching
will diminish naturally. Might be worth doing some experiments.

> 
>>
>>> I have also tried to implement an alternative approach to prefetch, based
>>> on access statistics.
>>> It comes from the use case of a seqscan of a table with large toasted
>>> records, where for each record we have to extract its TOAST data.
>>> That is done using a standard index scan, but unfortunately index prefetch
>>> doesn't help much here: there is usually just one TOAST segment, so the
>>> prefetch has no chance to do anything useful. But as long as the heap
>>> records are accessed sequentially, there is a good chance that the toast
>>> table will also be accessed mostly sequentially. So we can simply count
>>> the number of sequential requests to each relation, and if the ratio of
>>> seq/rand accesses is above some threshold we can prefetch the next pages
>>> of that relation. This is a really universal approach but ... it works
>>> mostly for the TOAST table.
>>>
>> Are you talking about what works / doesn't work in neon, or about
>> postgres in general?
>>
>> I'm not sure what you mean by "one TOAST segment" and I'd also guess
>> that if both tables are accessed mostly sequentially, the read-ahead
>> will do most of the work (in postgres).
> 
> Yes, I agree: in the case of vanilla Postgres the OS will do read-ahead.
> But not in Neon.
> By one TOAST segment I mean one TOAST record - 2kB.
> 

Ah, you mean "TOAST chunk". Yes, if a record fits into a single TOAST
chunk, my prefetch won't work. Not sure what to do for neon ...

> 
>> It's probably true that as we do a separate index scan for each TOAST-ed
>> value, that can't really ramp-up the prefetch distance fast enough.
>> Maybe we could have a mode where we start with the full distance?
> 
> Sorry, I do not understand. Especially in this case a large prefetch
> window is undesirable.
> Most records fit in 2kB, so we need to fetch only one head (TOAST)
> record per TOAST index search.
> 

Yeah, I was confused what you mean by "segment". My point was that if a
value is TOAST-ed into multiple chunks, maybe we should allow more
aggressive prefetching instead of the slow ramp-up ...

But yeah, if there's just one TOAST chunk, that does not help.

> 
>>> This is exactly the difference. In Neon such approach doesn't work.
>>> Each backend maintains its own prefetch ring. And if a prefetched page
>>> was not actually received, then the whole pipeline is lost.
>>> I.e. the backend prefetched pages 1,5,10. Then it needs to read page 2.
>>> So it has to consume the responses for 1,5,10 and issue another request
>>> for page 2.
>>> Instead of improving speed we are just doing extra work.
>>> So each backend should prefetch only those pages which it is actually
>>> going to read.
>>> This is why the prefetch approach used in Postgres, for example for the
>>> parallel bitmap heap scan, doesn't work for Neon.
>>> If you do `posix_fadvise` then the prefetched page is placed in the OS cache and
>>> can be used by any parallel worker.
>>> But in Neon each parallel worker should be given its own range of pages
>>> to scan, and prefetch only those pages.
>>>
>> I still don't quite see/understand the difference. I mean, even in
>> postgres each backend does its own prefetches, using its own prefetch
>> ring. But I'm not entirely sure about the neon architecture differences.
>>
> I am not speaking about your approach. It will work with Neon as well.
> I am describing why the prefetch implementation for the bitmap heap scan
> doesn't work for Neon:
> it issues prefetch requests for pages which are never accessed by that
> parallel worker.
> 
>> Does this mean neon can do prefetching from the executor in principle?
>>
>> Could you perhaps describe a situation where the bitmap scan prefetching
>> (as implemented in Postgres) does not work for neon?
>>
> 
> I am speaking about the prefetch implementation in nodeBitmapHeapScan.
> The prefetch iterator is not synced with the normal iterator, i.e. they
> can return different pages.
> 

Ah, now I think I understand. The workers don't share memory, so the
pages prefetched by one worker are wasted if some other worker ends up
processing them.

>>
>> Well, maybe you could try rewriting it now, so that you can give
>> some feedback to the patch. I'd appreciate that.
> 
> I will try.
> 

Thanks!

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

