Greg Stark wrote:
Using sync_file_range you can specify the set of blocks to sync and
then block on them only after some time has passed. But there's no
documentation on how this relates to the I/O scheduler so it's not
clear it would have any effect on the problem.
As I understand it, this is exactly where we're stalled in getting this
improved on the Linux side. *The*
answer for this class of problem on Linux is to use sync_file_range, and
I don't think we'll ever get any sympathy from those kernel developers
until we do. But that's a Linux-specific call, so doing that is going
to add a write path fork with platform-specific code into the database.
If I thought sync_file_range was a silver bullet guaranteed to make this
better, maybe I'd go for that. I think there's some relatively
low-hanging fruit on the database side that would do better before going
to that extreme though, thus the patch.
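To make the "write path fork" concrete, here is a rough sketch of what the platform-specific piece might look like. It is not taken from the patch or from the PostgreSQL source; the helper names start_range_writeback and finish_range_writeback are made up for illustration. It assumes glibc on Linux, where sync_file_range is exposed under _GNU_SOURCE, and falls back to doing nothing early plus a plain fsync everywhere else. Since sync_file_range does not flush file metadata and makes no durability promise by itself, the sketch still ends with fsync.

```c
#define _GNU_SOURCE             /* glibc only declares sync_file_range() with this */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start writing out a byte range
 * without waiting for it to finish.  On platforms without sync_file_range
 * this is a no-op and the later fsync does all of the work.
 */
int start_range_writeback(int fd, off_t offset, off_t nbytes)
{
#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#else
    (void) fd;
    (void) offset;
    (void) nbytes;
    return 0;
#endif
}

/*
 * Hypothetical helper: some time later, block until the range has been
 * written, then fsync the file so metadata is flushed as well.
 */
int finish_range_writeback(int fd, off_t offset, off_t nbytes)
{
#if defined(__linux__) && defined(SYNC_FILE_RANGE_WRITE)
    if (sync_file_range(fd, offset, nbytes,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER) != 0)
        return -1;
#else
    (void) offset;
    (void) nbytes;
#endif
    return fsync(fd);
}
```

A caller would invoke start_range_writeback right after writing the blocks, go do other work, and call finish_range_writeback once it actually needs the data on disk. How the first call interacts with the I/O scheduler is exactly the undocumented part, so this shows the shape of the code rather than any promised benefit.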
We might still have to delay the beginning of the sync to allow the dirty
blocks to be synced naturally, and then when we issue it we could still end
up catching a lot of other I/O as well.
Whether it's "lots" or not is really workload dependent. I work from
the assumption that the blocks being written out by the checkpoint are
the most popular ones in the database, the ones that accumulate a high
usage count and stay there. If that's true, my guess is that the writes
being done while the checkpoint is executing are a bit less likely to be
touching the same files. You raise a valid concern, I just haven't seen
that actually happen in practice yet.
--
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)