On Wed, Nov 11, 2015 at 1:08 PM, Andres Freund <and...@anarazel.de> wrote: > > On 2015-09-10 17:15:26 +0200, Fabien COELHO wrote: > > Here is a v13, which is just a rebase after 1aba62ec. > > > 3) I found that latency wasn't improved much for workloads that are > significantly bigger than shared buffers. The problem here is that > neither bgwriter nor the backends have, so far, done > sync_file_range() calls. That meant that the old problem of having > gigabytes of dirty data that periodically get flushed out, still > exists. Having these do flushes mostly attacks that problem. > > > Benchmarking revealed that for workloads where the hot data set mostly > fits into shared buffers flushing and sorting is anywhere from a small > to a massive improvement, both in throughput and latency. Even without > the patch from 2), although fixing that improves things furhter. > > > > What I did not expect, and what confounded me for a long while, is that > for workloads where the hot data set does *NOT* fit into shared buffers, > sorting often led to be a noticeable reduction in throughput. Up to > 30%. The performance was still much more regular than before, i.e. no > more multi-second periods without any transactions happening. > > By now I think I know what's going on: Before the sorting portion of the > patch the write-loop in BufferSync() starts at the current clock hand, > by using StrategySyncStart(). But after the sorting that obviously > doesn't happen anymore - buffers are accessed in their sort order. By > starting at the current clock hand and moving on from there the > checkpointer basically makes it more less likely that victim buffers > need to be written either by the backends themselves or by > bgwriter. That means that the sorted checkpoint writes can, indirectly, > increase the number of unsorted writes by other processes :( >
That sounds to be a tricky problem. I think the way to improve the current situation is to change buffer allocation algorithm such that instead of backend issuing the write for dirty buffer, it will just continue to find next free buffer when it finds that selected buffer is dirty and if it could not find the non-dirty buffer for certain number of attempts, it will signal bgwriter to write-out some buffers. Now the writing algorithm of bgwriter has to be such that it picks the buffers in chunks from checkpoint-list, sort them and then write them. Checkpoint also uses the same checkpoint-list to flush the dirty buffers. This will ensure that the writes will always be sorted-writes irrespective of which process does the writes. There could be multiple ways to form this checkpoint-list and one of the way could be MarkBufferDirty() adds it to such a list. I think following such a mechanism could solve the problem of unsorted writes in the system, but it arises a question, what kind of latency such a mechanism could introduce for a backend which signals bgwriter after not finding a non-dirty buffer for certain number of attempts, I think if we sense this could be a problematic case, then we can make both bgwriter and checkpoint to always start from next victim buffer and then traverse the checkpoint-list. > > My benchmarking suggest that that effect is the larger, the shorter the > checkpoint timeout is. That seems to intuitively make sense, give the > above explanation attempt. If the checkpoint takes longer the clock hand > will almost certainly soon overtake checkpoints 'implicit' hand. > > I'm not sure if we can really do anything about this problem. While I'm > pretty jet lagged, I still spent a fair amount of time thinking about > it. Seems to suggest that we need to bring back the setting to > enable/disable sorting :( > > > What I think needs to happen next with the patch is: > 1) Hoist up the FileFlushContext stuff into the smgr layer. Carefully > handling the issue of smgr invalidations. > 2) Replace the boolean checkpoint_flush_to_disk GUC with a list guc that > later can contain multiple elements like checkpoint, bgwriter, > backends, ddl, bulk-writes. That seems better than adding GUCs for > these separately. Then make the flush locations in the patch > configurable using that. > 3) I think we should remove the sort timing from the checkpoint logging > before commit. It'll always be pretty short. > It seems for now you have left out the windows specific implementation in pg_flush_data(). With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com