Hi Andres

> It's implied, but to make it more explicit: One big efficiency advantage of
> writes by checkpointer is that they are sorted and can often be combined into
> larger writes. That's often a lot more efficient: For network attached storage
> it saves you iops, for local SSDs it's much friendlier to wear leveling.
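(To check my understanding, here is a toy sketch -- plain Python, hypothetical block numbers, not PostgreSQL code -- of the coalescing you describe: once the dirty buffers are sorted, runs of adjacent block numbers collapse into a few large writes.)

```python
def coalesce(dirty_blocks):
    """Merge sorted runs of consecutive block numbers into (start, length) ranges."""
    runs = []
    for blk in sorted(dirty_blocks):
        if runs and blk == runs[-1][0] + runs[-1][1]:
            # Adjacent to the current run: extend it instead of issuing a new write.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((blk, 1))
    return runs

# Written in arrival order this would be 6 separate iops;
# sorted and coalesced it becomes 2 larger writes.
dirty = [12, 10, 11, 42, 40, 41]
print(coalesce(dirty))  # [(10, 3), (40, 3)]
```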
Thank you for the explanation. I think bgwriter can also merge I/O: it writes
asynchronously into the file system cache, and the OS schedules the actual
writes.

> Another aspect is that checkpointer's writes are much easier to pace over time
> than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> signal. Eventually we'll want to combine writes by bgwriter too, but that's
> always going to be more expensive than doing it in a large batched fashion
> like checkpointer does.

> I think we could improve checkpointer's pacing further, fwiw, by taking into
> account that the WAL volume at the start of a spread-out checkpoint typically
> is bigger than at the end.

I'm also very keen to improve checkpoints. Whenever I run a stress test,
bgwriter writes no dirty pages as long as the data set is smaller than
shared_buffers. Before the first checkpoint, the benchmark tps is stable and
is the highest of the entire run. Other databases flush dirty pages at a
configurable frequency, interval, and dirty-page watermark, so their
checkpoints have a much smaller impact on performance.

Thanks

Andres Freund <and...@anarazel.de> wrote on Fri, Oct 4, 2024 at 03:40:

> Hi,
>
> On 2024-10-02 18:36:44 +0200, Tomas Vondra wrote:
> > On 10/2/24 17:02, Tony Wayne wrote:
> > >
> > >
> > > On Wed, Oct 2, 2024 at 8:14 PM Laurenz Albe <laurenz.a...@cybertec.at
> > > <mailto:laurenz.a...@cybertec.at>> wrote:
> > >
> > >     On Wed, 2024-10-02 at 16:48 +0800, wenhui qiu wrote:
> > >     > Whenever I check the checkpoint information in a log, most dirty
> > >     pages are written by the checkpoint process
> > >
> > >     That's exactly how it should be!
> > >
> > > is it because if bgwriter frequently flushes, the disk io will be more?🤔
> >
> > Yes, pretty much. But it's also about where the writes happen.
> >
> > Checkpoint flushes dirty buffers only once per checkpoint interval,
> > which is the lowest amount of write I/O that needs to happen.
> >
> > Every other way of flushing buffers is less efficient, and is mostly a
> > sign of memory pressure (shared buffers not large enough for active part
> > of the data).
>
> It's implied, but to make it more explicit: One big efficiency advantage of
> writes by checkpointer is that they are sorted and can often be combined into
> larger writes. That's often a lot more efficient: For network attached storage
> it saves you iops, for local SSDs it's much friendlier to wear leveling.
>
> > But it's also happens about where the writes happen. Checkpoint does
> > that in the background, not as part of regular query execution. What we
> > don't want is for the user backends to flush buffers, because it's
> > expensive and can cause result in much higher latency.
> >
> > The bgwriter is somewhere in between - it's happens in the background,
> > but may not be as efficient as doing it in the checkpointer. Still much
> > better than having to do this in regular backends.
>
> Another aspect is that checkpointer's writes are much easier to pace over time
> than e.g. bgwriters, because bgwriter is triggered by a fairly short term
> signal. Eventually we'll want to combine writes by bgwriter too, but that's
> always going to be more expensive than doing it in a large batched fashion
> like checkpointer does.
>
> I think we could improve checkpointer's pacing further, fwiw, by taking into
> account that the WAL volume at the start of a spread-out checkpoint typically
> is bigger than at the end.
>
> Greetings,
>
> Andres Freund
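As a footnote for readers following the pacing point: a rough sketch in plain Python, illustrative only -- this is not the actual IsCheckpointOnSchedule() logic, and the parameter names are mine -- of how a spread-out checkpoint can throttle itself by comparing write progress against an elapsed-time target. The WAL-volume idea above would replace the linear time fraction with one weighted toward the start of the interval.

```python
def on_schedule(buffers_written, buffers_total, elapsed, interval,
                completion_target=0.9):
    """Return True if checkpoint writes are at or ahead of the paced schedule.

    completion_target mirrors the idea of checkpoint_completion_target:
    aim to finish writing after that fraction of the checkpoint interval.
    """
    progress = buffers_written / buffers_total
    # Fraction of the spread-out checkpoint window that has elapsed.
    time_fraction = elapsed / (interval * completion_target)
    return progress >= time_fraction

# Halfway through the 270 s window (0.9 * 300 s), half the buffers written:
# exactly on schedule, so the checkpointer would not need to hurry.
print(on_schedule(500, 1000, 135, 300))  # True
print(on_schedule(400, 1000, 135, 300))  # False: behind, write faster
```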