It's been known for years that commit_delay isn't very good at giving us group commit behavior. I did some experiments with this simple test case: "BEGIN; INSERT INTO test VALUES (1); COMMIT;", with different numbers of concurrent clients and with and without commit_delay.

Summary for the impatient:
1. Current behavior sucks.
2. commit_delay doesn't help with # of clients < ~10. It does help with higher numbers, but it still sucks.
3. I'm working on a patch.


I added logging to show how many commit records are flushed on each fsync. The output with otherwise unpatched PG head looks like this, with 5 clients:

LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 4 out of 5 commits
LOG:  Flushed 1 out of 5 commits
LOG:  Flushed 3 out of 5 commits
LOG:  Flushed 2 out of 5 commits
LOG:  Flushed 3 out of 5 commits
LOG:  Flushed 2 out of 5 commits
LOG:  Flushed 3 out of 5 commits
LOG:  Flushed 2 out of 5 commits
LOG:  Flushed 3 out of 5 commits
...

Here's what's happening:

1. Client 1 issues fsync (A)
2. Clients 2-5 write their commit records and try to fsync, but they have to wait for fsync (A) to finish.
3. fsync (A) finishes, freeing client 1.
4. One of clients 2-5 starts the next fsync (B), which will flush the commits of clients 2-5 to disk.
5. Client 1 begins a new transaction, inserts its commit record and tries to fsync. It needs to wait for the previous fsync (B) to finish.
6. fsync (B) finishes, freeing clients 2-5.
7. Client 1 issues fsync (C)
8. ...
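The steps above can be sketched as a toy model: one client is permanently out of phase with the other four, so each fsync flushes either 4 or 1 commit records, never all 5. This is only an illustration of the resonance, not code from PostgreSQL; the function name and parameters are made up.

```python
def resonance(nclients=5, ngroup1=1, nfsyncs=6):
    """Toy model of the fsync resonance: 'ngroup1' clients are in one
    phase, the remaining clients in the other, and each fsync flushes
    whichever group was waiting for it."""
    flushed = []
    for i in range(nfsyncs):
        # Even-numbered fsyncs flush the big group, odd ones the small group.
        flushed.append(nclients - ngroup1 if i % 2 == 0 else ngroup1)
    return flushed

print(resonance())  # alternating 4, 1, ...
```

With ngroup1=2 the same model produces the alternating 3-2 pattern from the later part of the log.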

The 2-3-2-3 pattern can be explained by a similar unfortunate "resonance", but with two clients playing the role of client 1 above, possibly because they run on separate cores (the test was run on a dual-core laptop).

I also drew a diagram illustrating the above; see attached.

I wrote a quick & dirty patch for this that I'm going to refine further, but I wanted to get the results out for others to look at first. I'm not posting the patch yet, but it basically adds some synchronization to the WAL flushes. It introduces a counter of inserted but not yet flushed commit records. Instead of sleeping for commit_delay, the counter is checked: if it's smaller than NBackends, the process waits until the count reaches NBackends, or until a timeout expires.

There are two significant differences to commit_delay here:

1. Instead of waiting for commit_delay to expire, processes are woken up and the fsync is started immediately as soon as we know there are no more commit records coming that we should wait for. Even though commit_delay is given in microseconds, the real granularity of the wait can be as coarse as 10 ms, which is in the same ballpark as the fsync itself.

2. commit_delay is not used when there are fewer than commit_siblings non-idle backends in the system. With very short transactions it's worthwhile to wait even in that case, because a client can begin and finish a transaction in much less time than it takes to fsync. This is what makes commit_delay not work at all in my test case with 2 clients.
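The counter-plus-timeout idea can be sketched roughly like this. This is not the patch itself (which lives in the C backend); the class and method names are made up for illustration, using Python's threading primitives in place of the backend's synchronization:

```python
import threading

class GroupCommitGate:
    """Sketch of the idea: each backend bumps a counter of inserted but
    not yet flushed commit records, then waits until the counter reaches
    nbackends or a timeout expires, whichever comes first."""

    def __init__(self, nbackends, timeout=0.01):
        self.nbackends = nbackends
        self.timeout = timeout      # safety net, like the patch's timeout
        self.unflushed = 0          # commit records inserted, not flushed
        self.cond = threading.Condition()

    def commit_record_inserted(self):
        """Called after a backend inserts its commit record; returns when
        it is time to fsync."""
        with self.cond:
            self.unflushed += 1
            if self.unflushed >= self.nbackends:
                # No more commit records coming: wake everyone, fsync now.
                self.cond.notify_all()
            else:
                self.cond.wait_for(
                    lambda: self.unflushed >= self.nbackends,
                    timeout=self.timeout)

    def flushed(self):
        """Called by whichever process performed the fsync."""
        with self.cond:
            self.unflushed = 0
```

The key property is in the `notify_all` branch: the last committer wakes the group immediately, instead of everyone sleeping out a fixed commit_delay.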

Here's a spreadsheet with the results of the tests I ran:
http://community.enterprisedb.com/groupcommit-comparison.ods

It contains a graph showing that the patch works very well for this test case. As it is, though, it's not very good for real life. An obvious flaw is that with a longer-running transaction, benefit 1 above goes away. Instead of waiting for NBackends commit records, we should try to guess the number of transactions that are likely to finish within a reasonably short time. I'm thinking of keeping a running average of commits per second, or of the number of transactions that finish while an fsync is taking place.
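One way the running average might look, sketched as an exponential moving average of how many commits each fsync has been absorbing. This is purely a hypothetical illustration, not anything in the patch; the function name and the smoothing factor `alpha` are assumptions:

```python
def update_estimate(prev_estimate, commits_this_fsync, alpha=0.2):
    """Blend the latest observation into the running estimate of how
    many commit records to wait for before the next fsync."""
    return (1 - alpha) * prev_estimate + alpha * commits_this_fsync

estimate = 0.0
for flushed in [4, 1, 4, 1, 5, 5, 5]:  # commits observed per fsync
    estimate = update_estimate(estimate, flushed)
```

The estimate tracks recent behavior, so a long-running transaction drops out of the target group size instead of stalling every fsync until the timeout.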

Any thoughts?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

