On Thu, Aug 27, 2020 at 04:28:54PM -0400, Stephen Frost wrote:
> Greetings,

> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Thu, Aug 27, 2020 at 2:51 PM Stephen Frost <sfr...@snowman.net> wrote:
> > > > Hm? At least earlier versions didn't do prefetching for records with
> > > > an fpw, and only for subsequent records affecting the same page, or if
> > > > it's not in s_b anymore.
> > >
> > > We don't actually read the page when we're replaying an FPW though..?
> > > If we don't read it, and we entirely write the page from the FPW, how is
> > > pre-fetching helping..?

> > Suppose there is a checkpoint. Then we replay a record with an FPW,
> > pre-fetching nothing. Then the buffer gets evicted from
> > shared_buffers, and maybe the OS cache too. Then, before the next
> > checkpoint, we again replay a record for the same page. At this point,
> > pre-fetching should be helpful.
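
(To make that concrete, the decision being discussed is roughly the toy
sketch below - not the actual patch code, the struct and block numbers are
made up. A block that arrives as a full-page image is simply written out
during replay, so only later records touching it can benefit from a
prefetch hint.)

#include <stdbool.h>
#include <stdio.h>

typedef struct BlockRef
{
    unsigned    blockno;        /* block touched by the WAL record */
    bool        has_fpi;        /* does the record carry a full-page image? */
} BlockRef;

static void
maybe_prefetch(const BlockRef *ref)
{
    if (ref->has_fpi)
    {
        /* replay overwrites the whole page, nothing needs to be read */
        printf("skip     blk %u (full-page image)\n", ref->blockno);
        return;
    }

    /* page will be read and modified in place, so hint it ahead of time */
    printf("prefetch blk %u\n", ref->blockno);
}

int
main(void)
{
    BlockRef    refs[] = {
        {42, true},             /* first touch after a checkpoint: FPI */
        {42, false},            /* later record for the same block */
    };

    for (int i = 0; i < 2; i++)
        maybe_prefetch(&refs[i]);
    return 0;
}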

> Sure- but if we're talking about 25GB of WAL, on a server that's got
> 32GB, then why would those pages end up getting evicted from memory
> entirely?  Particularly, enough of them to end up with such a huge
> difference in replay time..
>
> I do agree that if we've got more outstanding WAL between checkpoints
> than the system's got memory then that certainly changes things, but
> that wasn't what I understood the case to be here.


I don't think it's very clear how much WAL there actually was in each
case - the message only said there was more than 25GB, but who knows how
many checkpoints that covers? In the cases with FPW=on this may easily
be much less than one checkpoint (because with scale 45GB an update to
every page will log 45GB of full-page images). It'd be interesting to
see some stats from pg_waldump etc.
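
For example, something along these lines (the segment names are just
placeholders) would show the per-record-type counts and how much of the
volume was full-page images vs. ordinary record data:

pg_waldump --stats=record -p $PGDATA/pg_wal \
    000000010000000000000001 000000010000000000000020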

> > Admittedly, I don't quite understand whether that is what is happening
> > in this test case, or why SSD vs. HDD should make any difference. But
> > there doesn't seem to be any reason why it doesn't make sense in
> > theory.

> I agree that this could be a reason, but it doesn't seem to quite fit in
> this particular case given the amount of memory and WAL.  I'm suspecting
> that it's something else and I'd very much like to know if it's a
> general "this applies to all (most?  a lot of?) SSDs because the
> hardware has a larger than 8KB page size and therefore the kernel has to
> read it", or if it's something odd about this particular system and
> doesn't apply generally.


Not sure. I doubt it has anything to do with the hardware page size,
that's mostly transparent to the kernel anyway. But it might be that the
prefetching on a particular SSD has more overhead than what it saves.
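
FWIW the prefetching here ultimately comes down to issuing one hint per
block that's about to be needed, roughly like the standalone sketch below
(assuming the hint is posix_fadvise(POSIX_FADV_WILLNEED), which is what
the existing prefetching code uses where it's available). So it costs at
least a syscall per block, and if the SSD can complete an 8kB read nearly
as quickly as the hint gets processed, the overhead may simply outweigh
the benefit:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ 8192             /* PostgreSQL block size */

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s FILE BLOCKNO\n", argv[0]);
        return 1;
    }

    int         fd = open(argv[1], O_RDONLY);

    if (fd < 0)
    {
        perror(argv[1]);
        return 1;
    }

    off_t       off = (off_t) strtol(argv[2], NULL, 10) * BLCKSZ;

    /* ask the kernel to start reading this 8kB block in the background */
    int         rc = posix_fadvise(fd, off, BLCKSZ, POSIX_FADV_WILLNEED);

    if (rc != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(rc));
    return 0;
}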

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

