In December, Metin (a coworker of mine) discussed an inability to scale a
simple task (parallel scans of many independent tables) to many cores (it’s
here). As a ramp-up project at Citus, I was asked to figure out what the heck
was going on.

I have a pretty extensive writeup here (whose length is more a result of my 
inexperience with the workings of PostgreSQL than anything else) and was 
looking for some feedback.

In short, my conclusion is that a working set larger than memory results in
backends piling up on BufFreelistLock. As far as I could, I ruled out anything
else that could be blamed for this:

Hyper-Threading was disabled
zone reclaim mode was disabled
numactl was used to ensure interleaved allocation
kernel.sched_migration_cost was set high to discourage migration
kernel.sched_autogroup_enabled was disabled
transparent hugepage support was disabled

For a way forward, I was thinking the buffer allocation sections could use some 
of the atomics Andres added here. Rather than workers grabbing BufFreelistLock 
to iterate the clock hand until they find a victim, the algorithm could be 
rewritten in a lock-free style, allowing workers to move the clock hand in 
tandem.
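
To make the lock-free idea concrete, here is a minimal sketch of what I have
in mind. It is not PostgreSQL code: it uses plain C11 <stdatomic.h> rather
than the project's atomics API, and every name in it (SketchBuffer,
sketch_get_victim, NBUFFERS, the pinned flag) is hypothetical. Each worker
advances the shared clock hand with a fetch-and-add, so no two workers
inspect the same tick, and usage counts are decremented with compare-and-swap
instead of under a lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NBUFFERS 8                  /* hypothetical pool size for the sketch */

typedef struct {
    atomic_uint usage_count;        /* decremented by the clock sweep */
    atomic_bool pinned;             /* true while a worker owns the buffer */
} SketchBuffer;

static SketchBuffer buffers[NBUFFERS];
static atomic_uint_fast64_t clock_hand;   /* shared by all workers */

/*
 * Claim a victim buffer without taking any lock.
 * Returns the buffer index, or -1 if nothing could be claimed within
 * max_passes full sweeps of the pool.
 */
static int sketch_get_victim(int max_passes)
{
    assert(max_passes > 0);
    long tries = (long) max_passes * NBUFFERS;

    while (tries-- > 0) {
        /* Each worker gets its own tick; the hand advances in tandem. */
        uint_fast64_t tick = atomic_fetch_add(&clock_hand, 1);
        int id = (int) (tick % NBUFFERS);
        SketchBuffer *buf = &buffers[id];

        if (atomic_load(&buf->pinned))
            continue;               /* in use, skip it */

        /* Decrement the usage count by CAS, never going below zero. */
        unsigned usage = atomic_load(&buf->usage_count);
        while (usage > 0) {
            if (atomic_compare_exchange_weak(&buf->usage_count,
                                             &usage, usage - 1))
                break;
        }
        if (usage > 0)
            continue;               /* recently used: second chance */

        /* Usage count is zero: try to claim the buffer for ourselves. */
        bool expected = false;
        if (atomic_compare_exchange_strong(&buf->pinned, &expected, true))
            return id;
    }
    return -1;
}
```

A real version would have to fold the pin state and usage count into the
existing buffer header and handle dirty buffers, which this sketch ignores
entirely.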

Alternatively, the clock iteration could be moved off to a background process, 
similar to what Amit Kapila proposed here.
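
As a rough illustration of the background-process variant (my own sketch, not
a description of Amit's patch), the idea is that only one process runs the
clock sweep and keeps a small freelist topped up, while backends just pop an
entry under a very short critical section. All names here (bg_refill,
bg_pop_victim, BG_HIGH_WATER) are hypothetical, and a pthread mutex stands in
for whatever lock the real code would use.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BG_NBUFFERS   8             /* hypothetical pool size */
#define BG_HIGH_WATER 4             /* stop refilling at this many entries */

static atomic_uint bg_usage[BG_NBUFFERS]; /* bumped by backends on access */
static bool bg_listed[BG_NBUFFERS];       /* protected by bg_lock */
static int  bg_freelist[BG_NBUFFERS];     /* stack, protected by bg_lock */
static int  bg_nfree;                     /* protected by bg_lock */
static unsigned bg_hand;                  /* touched only by the sweeper */
static pthread_mutex_t bg_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * One round of the background sweeper: advance the clock hand until the
 * freelist reaches the high-water mark or two full passes have been made.
 * Returns the number of buffers added. The sweep itself runs outside the
 * lock; the lock is held only to read or update the freelist.
 */
static int bg_refill(void)
{
    int added = 0;

    for (int steps = 0; steps < 2 * BG_NBUFFERS; steps++) {
        pthread_mutex_lock(&bg_lock);
        bool full = bg_nfree >= BG_HIGH_WATER;
        pthread_mutex_unlock(&bg_lock);
        if (full)
            break;

        int id = (int) (bg_hand++ % BG_NBUFFERS);

        /* Only the sweeper decrements, so this cannot underflow. */
        if (atomic_load(&bg_usage[id]) > 0) {
            atomic_fetch_sub(&bg_usage[id], 1);
            continue;
        }

        pthread_mutex_lock(&bg_lock);
        if (!bg_listed[id]) {
            bg_listed[id] = true;
            bg_freelist[bg_nfree++] = id;
            added++;
        }
        pthread_mutex_unlock(&bg_lock);
    }
    return added;
}

/* Backend side: O(1) pop; returns -1 if the freelist is empty. */
static int bg_pop_victim(void)
{
    int id = -1;

    pthread_mutex_lock(&bg_lock);
    if (bg_nfree > 0) {
        id = bg_freelist[--bg_nfree];
        bg_listed[id] = false;
    }
    pthread_mutex_unlock(&bg_lock);
    return id;
}
```

The lock still exists in this version, but backends hold it only for a pop
rather than for an entire clock sweep. A backend that gets -1 would need some
fallback, such as sweeping on its own, which is left out here.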

Is this assessment accurate? I know 9.4 changes a lot about lock organization, 
but last I looked I didn’t see anything that could alleviate this contention: 
are there any plans to address this?

—Jason
