On Thu, Apr 4, 2013 at 3:41 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > I think the original vision of the clock sweep algorithm included the > idea that different backends could be running the sweep over different > parts of the buffer ring concurrently. If we could get rid of the need > to serialize that activity, it'd pretty much eliminate the bottleneck > I should think. The problem is how to manage it to ensure that (a) > backends aren't actually contending on the same buffers as they do this, > and (b) there's a reasonably fair rate of usage_count decrements for > each buffer, rather than possibly everybody ganging up on the same area > sometimes. Thoughts?
Fairness and avoiding each others implies that some coordination is required. One wild idea I had is to partition buffers to N partitions and have one clock sweep each, protected by a lwlock. To reduce contention, the clocksweep runners use something like sched_getcpu() to determine which partition to use to find their buffer. Using the CPU to determine the partition makes it necessary for the process to be scheduled out in the critical section for some other backend to contend on the lock. And if some backend does contend on it, it is likely to reside on the same CPU and by sleeping will make room for the lockholder to run. To ensure fairness for buffers, every time one of the clocksweeps wraps around a global offset counter is incremented. This re-assigns all cpus/backends to the next partition, sort of like the mad hatters tea party. The scenario that I'm most worried about here is what happens when a process holding the clocksweep lock is migrated to another CPU and then scheduled out. The processes on the original CPU will sooner or later block behind the lock while the processes on the CPU where the lock holder now waits keep hogging the CPU. This will create an imbalance that the scheduler might try to correct, possibly creating a nasty feedback loop. It could be that in practice the scenario works out to be too far fetched to matter, but who knows. I don't have a slightest idea yet how the background writer would function in this environment. But if redesigning the bgwriter mechanism was on the table I thought I would throw the idea out here. Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers