On Thu, Apr 4, 2024 at 8:02 PM David Rowley <dgrowle...@gmail.com> wrote:
> 3a4a3537a
> latency average = 34.497 ms
> latency average = 34.538 ms
>
> 3a4a3537a + read_stream_for_seqscans.patch
> latency average = 40.923 ms
> latency average = 41.415 ms
>
> i.e. no meaningful change from the refactor, but a regression from a
> cached workload that changes the page often without doing much work in
> between with the read stream patch.
I ran Heikki's test, except I ran the "insert" 4 times to get a table of
4376MB according to \d+.  On my random cloud ARM server (SB=8GB, huge
pages, parallelism disabled), I see a speedup 1290ms -> 1046ms when the
data is in the Linux page cache and PG is not prewarmed, roughly as he
reported.  Good.  If I pg_prewarm first, I see that slowdown: 260ms ->
294ms.

Trying things out to see what works, I got that down to 243ms (ie beat
master) by inserting a memory prefetch:

--- a/src/backend/storage/aio/read_stream.c
+++ b/src/backend/storage/aio/read_stream.c
@@ -757,6 +757,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)

        /* Prepare for the next call. */
        read_stream_look_ahead(stream, false);

+       __builtin_prefetch(BufferGetPage(stream->buffers[stream->oldest_buffer_index]));

Maybe that's a solution to a different problem that just happens to
more than make up the difference in this case, and it may be
questionable whether that cache line will survive long enough to help
you, but this one-tuple-per-page test likes it...

Hmm, to get a more realistic table than the one-tuple-per-page one, I
tried doubling a tenk1 table until it reached 2759MB, and then I got a
post-prewarm regression 702ms -> 721ms, and again I can beat master by
memory prefetching: 689ms.  Annoyingly, with the same table I see no
difference in the actual pg_prewarm('giga') time: around 155ms for
both.  pg_prewarm is able to use the 'fast path' I made pretty much
just to be able to minimise the regression in that (probably stupid)
all-cached test that doesn't even look at the page contents.
Unfortunately seq scan can't use it, because seq scan has per-buffer
data, which is one of the things the fast path can't handle (because of
a space management issue).  Maybe I should try to find a way to fix
that.

> I'm happy to run further benchmarks, but for the remainder of the
> committer responsibility for the next patches, I'm going to leave
> that to Thomas.

Thanks!
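P.S. For anyone following along who hasn't used the builtin: here is a
standalone sketch of the prefetch idea discussed above.  This is not
PostgreSQL code and all names in it are made up; it just shows the
pattern of hinting the next chunk of memory into cache while the
current one is being processed.  __builtin_prefetch (GCC/Clang) is
purely a hint, so correctness is unaffected either way.

```c
#include <stddef.h>

#define PAGE_SIZE 8192          /* made-up page size for illustration */
#define NPAGES    4

/*
 * Sum the bytes of an array of pages.  While page i is being summed,
 * hint the hardware prefetcher to start pulling page i+1 into cache,
 * mirroring the "prefetch the next buffer's page" idea above.
 */
static long
sum_pages(const unsigned char pages[NPAGES][PAGE_SIZE])
{
    long        total = 0;

    for (int i = 0; i < NPAGES; i++)
    {
        if (i + 1 < NPAGES)
            __builtin_prefetch(pages[i + 1]);   /* read hint, default locality */

        for (size_t j = 0; j < PAGE_SIZE; j++)
            total += pages[i][j];
    }
    return total;
}
```

Whether the prefetched line survives until it is actually used depends
on how much work happens in between, which is exactly the caveat above.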