On 7/26/23 21:53, Matthias van de Meent wrote:
> On Wed, 26 Jul 2023 at 20:58, Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
>>
>> On 7/26/23 15:16, Matthias van de Meent wrote:
>>> On Wed, 26 Jul 2023 at 14:41, Alvaro Herrera <alvhe...@alvh.no-ip.org>
>>> wrote:
>>>>
>>>> Hello
>>>>
>>>> On 2023-Jul-26, Thomas wen wrote:
>>>>
>>>>> Hi Hackers: I found this page:
>>>>> https://pgsql-hackers.postgresql.narkive.com/cMxBwq65/incremental-checkopints,PostgreSQL
>>>>> no incremental checkpoints have been implemented so far. When a
>>>>> checkpoint is triggered, the performance jitter of PostgreSQL is very
>>>>> noticeable. I think incremental checkpoints should be implemented as
>>>>> soon as possible
>>>>
>>>> I think my first question is why do you think that is necessary; there
>>>> are probably other tools to achieve better performance. For example,
>>>> you may want to try making checkpoint_completion_target closer to 1, and
>>>> the checkpoint interval longer (both checkpoint_timeout and
>>>> max_wal_size). Also, changing shared_buffers may improve things. You
>>>> can try adding more RAM to the machine.
>>>
>>> Even with all those tuning options, a significant portion of a
>>> checkpoint's IO (up to 50%) originates from FPIs in the WAL, which (in
>>> general) will most often appear at the start of each checkpoint due to
>>> each first update to a page after a checkpoint needing an FPI.
>>
>> Yeah, FPIs are certainly expensive and can represent a huge part of the
>> WAL produced. But how would incremental checkpoints make that step
>> unnecessary?
>>
>>> If instead we WAL-logged only the pages we are about to write to disk
>>> (like MySQL's double-write buffer, but in WAL instead of a separate
>>> cyclical buffer file), then a checkpoint_completion_target close to 1
>>> would probably solve the issue, but with "WAL-logged torn page
>>> protection at first update after checkpoint" we'll probably always
>>> have higher-than-average FPI load just after a new checkpoint.
>>>
>>
>> So essentially instead of WAL-logging the FPI on the first change, we'd
>> only do that later when actually writing-out the page (either during a
>> checkpoint or because of memory pressure)? How would you make sure
>> there's enough WAL space until the next checkpoint? I mean, FPIs are a
>> huge write amplification source ...
>
> You don't make sure that there's enough space for the modifications,
> but does it matter from a durability point of view? As long as the
> page isn't written to disk before the FPI, we can replay non-FPI (but
> fsynced) WAL on top of the old version of the page that you read from
> disk, instead of only trusting FPIs from WAL.
>

It does not matter from a durability point of view, I think. But I was
thinking more about how this affects scheduling of checkpoints - how
would you know when the next checkpoint is likely to happen, when you
don't know how many FPIs you're going to write?

>> Imagine the system has max_wal_size set to 1GB, and does 1M updates
>> before writing 512MB of WAL and thus triggering a checkpoint. Now it
>> needs to write FPIs for 1M updates - easily 8GB of WAL, maybe more with
>> indexes. What then?
>
> Then you ignore the max_wal_size GUC as PostgreSQL so often already
> does. At least, it doesn't do what I expect it to do at face value -
> limit the size of the WAL directory to the given size.
>

I agree the soft-limit nature of max_wal_size (i.e. best effort, not a
strict limit) is not great. But just ignoring the limit altogether seems
like a step in the wrong direction - we should try not to exceed it.

I wonder if we'd actually need / want to write the FPIs into WAL. AFAICS
we only need the FPI until the page is written and flushed - from that
moment on it shouldn't be possible to tear the page. So a small cyclic
buffer separate from WAL would be better ...

> But more reasonably, you'd keep track of the count of modified pages
> that are yet to be fully WAL-logged, and take that into account as a
> debt that you have to the current WAL insert pointer when considering
> checkpoint distances and max_wal_size.
>

Yeah, that might work. It'd likely be just estimates, but probably good
enough for pacing the writes.

> ---
>
> The main issue that I see with "WAL-logging the FPI only when you
> write the dirty page to disk" is that dirty page flushing also happens
> with buffer eviction in ReadBuffer(). This change in behaviour would
> add a WAL insertion penalty to this write, and make it a very common
> occurrence that we'd have to write WAL + fsync the WAL when we have to
> write the dirty page.
> It would thus add significant latency to the
> dirty write mechanism, which is probably an unpopular change.

Yeah, it certainly moves the latencies from one place to another.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
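
To put rough numbers on the example discussed in this message (1M updates, a checkpoint triggered after 512MB of WAL) and on the "debt" idea, here is a minimal Python sketch. It is only an illustration, not something PostgreSQL does today: the function names are hypothetical, it assumes every update dirties a distinct page of the default 8kB block size, and it ignores WAL record headers and any FPI compression, so the debt is a worst-case estimate.

```python
# Hypothetical sketch of "deferred FPI debt" accounting; none of these
# names exist in PostgreSQL. Assumes the default 8kB block size.
BLCKSZ = 8192
MB = 1024 * 1024


def deferred_fpi_debt(pending_pages: int, block_size: int = BLCKSZ) -> int:
    """Worst-case WAL bytes still owed for dirty pages without a logged FPI."""
    return pending_pages * block_size


def effective_wal_distance(wal_bytes_written: int, pending_pages: int,
                           block_size: int = BLCKSZ) -> int:
    """WAL already written plus the estimated debt of not-yet-logged FPIs."""
    return wal_bytes_written + deferred_fpi_debt(pending_pages, block_size)


def checkpoint_due(wal_bytes_written: int, pending_pages: int,
                   trigger_bytes: int, block_size: int = BLCKSZ) -> bool:
    """True once written WAL plus FPI debt reaches the trigger distance."""
    distance = effective_wal_distance(wal_bytes_written, pending_pages, block_size)
    return distance >= trigger_bytes


if __name__ == "__main__":
    # The example from this message: 1M updated pages, 512MB trigger.
    debt = deferred_fpi_debt(1_000_000)  # 8_192_000_000 bytes, roughly 7.6 GiB
    print(f"FPI debt: {debt / 1024**3:.1f} GiB")
    # With debt counted, the checkpoint would be due long before 512MB
    # of regular WAL has been written.
    print(checkpoint_due(100 * MB, 1_000_000, 512 * MB))
```

Counting the debt up front would make the checkpoint start much earlier than the plain WAL position suggests, which is the pacing effect discussed above. How good the estimate is in practice depends on how many updates hit the same page.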