On Tue, Jun 2, 2015 at 6:45 PM, Fabien COELHO <coe...@cri.ensmp.fr> wrote: > > > Hello Amit, > >> [...] >>> >>> The objective is to help avoid PG stalling when fsyncing on checkpoints, >>> and in general to get better latency-bound performance. >> >> >> Won't this lead to more-unsorted writes (random I/O) as the >> FlushBuffer requests (by checkpointer or bgwriter) are not sorted as >> per files or order of blocks on disk? > > > Yep, probably. Under "moderate load" this is not an issue. The io-scheduler and other hd firmware will probably reorder writes anyway. Also, if several data are updated together, probably they are likely to be already neighbours in memory as well as on disk. > >> I remember sometime back there was some discusion regarding >> sorting writes during checkpoint, one idea could be try to >> check this idea along with that patch. I just saw that Andres has >> also given same suggestion which indicates that it is important >> to see both the things together. > > > I would rather separate them, unless this is a blocker. This version seems already quite effective and very light. ISTM that adding a sort phase would mean reworking significantly how the checkpointer processes pages. >
I agree with you that if we have to add a sort phase, there is additional work and that work could be significant depending on the design we choose, however without that, this patch can have impact on many kind of workloads, even in your mail in one of the tests ("pgbench -M prepared -N -T 100 -j 2 -c 4 -P 1" over 32 runs (4 clients)) it has shown 20% degradation which is quite significant and test also seems to be representative of the workload which many users in real-world will use. Now one can say that for such workloads turn the new knob to off, but in reality it could be difficult to predict if the load is always moderate. I think users might be able to predict that at table level, but inspite of that I don't think having any such knob can give us ticket to flush the buffers in random order. >> Also here another related point is that I think currently even fsync >> requests are not in order of the files as they are stored on disk so >> that also might cause random I/O? > > > I think that currently the fsync is on the file handler, so what happens depends on how fsync is implemented by the system. > That can also lead to random I/O if the fsync for different files is not in order as they are actually stored on disk. >> Yet another idea could be to allow BGWriter to also fsync the dirty >> buffers, > > > ISTM That it is done with this patch with "bgwriter_flush_to_disk=on". > I think patch just issues an async operation not the actual flush. Why I have suggested so is that in your tests when the checkpoint_timeout is small it seems there is a good gain in performance that means if keep on flushing dirty buffers at regular intervals, the system's performance is good and BGWriter is the process where that can be done conveniently apart from checkpoint, one might think that if same can be achieved by using shorter checkpoint_timeout interval, then why to do this incremental flushes by bgwriter, but in reality I think checkpoint is responsible for other things as well other than dirty buffers, so we can't leave everything till checkpoint happens. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com