Greg Stark wrote:
Using sync_file_range you can specify the set of blocks to sync and
then block on them only after some time has passed. But there's no
documentation on how this relates to the I/O scheduler so it's not
clear it would have any effect on the problem.

I believe this is the exact spot we're stalled at with regard to getting this improved on the Linux side, at least as I understand it. *The* answer for this class of problem on Linux is to use sync_file_range, and I don't think we'll ever get any sympathy from the kernel developers until we do. But that's a Linux-specific call, so using it means adding a fork in the write path, with platform-specific code, to the database. If I thought sync_file_range was a silver bullet guaranteed to make this better, maybe I'd go for that. I think there's some relatively low-hanging fruit on the database side that would do better before going to that extreme, though; thus the patch.
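
For anyone who hasn't looked at the call, here is a rough sketch of the two-phase pattern being discussed: start writeback on a byte range without blocking, then come back later and wait on it. This is only an illustration, not code from the patch. The helper names (start_range_writeback, wait_then_fsync) are made up for this example, and how long to wait between the two phases is left to the caller. The sync_file_range man page says the call does not flush file metadata and gives no durability guarantee by itself, so the sketch still finishes with fsync.

```c
#define _GNU_SOURCE             /* sync_file_range is Linux-specific */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Phase 1 (hypothetical helper): ask the kernel to start writing out the
 * dirty pages in [offset, offset + nbytes) and return without waiting.
 */
static int
start_range_writeback(int fd, off_t offset, off_t nbytes)
{
    assert(offset >= 0 && nbytes >= 0);
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/*
 * Phase 2 (hypothetical helper): some time later, block until the range
 * has been written, then fsync so that metadata is flushed as well.
 */
static int
wait_then_fsync(int fd, off_t offset, off_t nbytes)
{
    unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                         SYNC_FILE_RANGE_WRITE |
                         SYNC_FILE_RANGE_WAIT_AFTER;

    assert(offset >= 0 && nbytes >= 0);
    if (sync_file_range(fd, offset, nbytes, flags) != 0)
        return -1;              /* errno is set by sync_file_range */
    return fsync(fd);
}
```

Whether spreading the work out this way actually helps depends on how the I/O scheduler treats the early writeback requests, which is exactly the undocumented part.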

We might still have to delay the beginning of the sync to allow the
dirty blocks to be synced naturally, and then when we issue it we
still end up catching a lot of other I/O as well.

Whether it's "lots" or not is really workload-dependent. I work from the assumption that the blocks being written out by the checkpoint are the most popular ones in the database, the ones that accumulate a high usage count and stay there. If that's true, my guess is that the writes being done while the checkpoint is executing are a bit less likely to be touching the same files. You raise a valid concern; I just haven't seen that actually happen in practice yet.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us


