On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost the same as a small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be to a single 1GB relation chunk. The odds are better that multiple writes will combine, and that the I/O will involve a lower-than-average amount of random seeking, whereas shrinking the size of the write cache always results in more random seeking.

> The essential improvement is not the dirty page size in fsync() but the
> scheduling of the fsync phase.
> I can't understand why postgres does not consider scheduling of the
> fsync phase.

Because it cannot deliver the sort of latency improvements I think people want. I proved that to myself during the last 9.2 CF, when I submitted several fsync scheduling changes.

By the time you get to the checkpoint's sync phase, on a system that's always writing heavily there is far too much backlog to cope with. There just isn't enough time left before the checkpoint should end to write everything out. You have to force writes to actual disk to start happening earlier to keep a predictable schedule. Basically, the longer you go without issuing an fsync, the more uncertainty there is around how long it might take to complete. My proposal lets someone keep all I/O from ever reaching the point where the uncertainty is that high.

In the simplest case to explain, imagine that a checkpoint includes a 1GB relation segment that is completely dirty in shared_buffers. When the checkpoint hits this segment, it will have 1GB of I/O to push out.

If you have waited this long to fsync the segment, the problem is too big to fix by checkpoint time. Even if the 1GB of writes are themselves nicely ordered and grouped on disk, concurrent background activity is going to chop the combination up into more random I/O than the ideal.

Regular consumer disks have a worst-case random I/O throughput of less than 2MB/s. My observed progress rates for such systems show you're lucky to get 10MB/s of writes out of them. So how long will the dirty 1GB in the segment take to write? 1GB @ 10MB/s = 102.4 *seconds*. And that's exactly what I saw whenever I tried to play with checkpoint sync scheduling. No matter what you do there, periodically you'll hit a segment that has over a minute of dirty data accumulated, and >60 second latency pauses result. By the time you've reached the checkpoint, you're already dead when you call fsync on that relation. You *must* hit that segment with an fsync more often than once per checkpoint to achieve reasonable latency.

With this "linear slider" idea, I might tune such that no segment will ever get more than 256MB of writes before hitting a fsync instead. I can't guarantee that will work usefully, but the shape of the idea seems to match the problem.

> Taken together with my checkpoint proposal method:
> * write phase
>    - Almost the same, but considering the fsync phase schedule.
>    - Considering the case of background writes in the OS, sort buffers
>      before starting the checkpoint write.

This cannot work, for the reasons I've outlined here. I guarantee you I will easily find a test workload where it performs worse than what's happening right now. If you want to play with this to learn more about the trade-offs involved, that's fine, but expect me to vote against accepting any change of this form. I would prefer you not submit such changes, because it will waste a large amount of reviewer time to reach that conclusion yet again. And I'm not going to be that reviewer.

> * fsync phase
>    - Considering the checkpoint schedule and write-phase schedule.
>    - Executing separated sync_file_range() and sleep in the final fsync().

If you can figure out how to use sync_file_range() to fine-tune how much fsync is happening at any given time, that would be useful on all the platforms that support it. I haven't tried it only because it looked to me like a large job, refactoring the entire fsync absorb mechanism, and I've never had enough funding to take that on. That approach has a lot of good properties, if it could be made to work without a lot of code changes.
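For anyone who wants to experiment on Linux, the rough shape is something like the sketch below. This is not the existing PostgreSQL code, and the chunk size and pause are made-up knobs; it just shows how sync_file_range() lets you push a segment's dirty pages out in bounded pieces before the final fsync():

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define SYNC_CHUNK_BYTES  (8 * 1024 * 1024)    /* hypothetical chunk size */
#define SYNC_PAUSE_USEC   10000                /* hypothetical pause between chunks */

static int
incremental_segment_sync(int fd)
{
    struct stat st;
    off_t       offset;

    if (fstat(fd, &st) < 0)
        return -1;

    for (offset = 0; offset < st.st_size; offset += SYNC_CHUNK_BYTES)
    {
        /* Start writeback for this chunk and wait for it to finish. */
        if (sync_file_range(fd, offset, SYNC_CHUNK_BYTES,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0)
            return -1;

        /* Let other I/O get a turn between chunks. */
        usleep(SYNC_PAUSE_USEC);
    }

    /* sync_file_range() does not flush file metadata; still finish with fsync. */
    return fsync(fd);
}

The hard part isn't this loop, it's wiring something like it into the fsync absorb machinery without breaking the existing bookkeeping.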

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

