On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> I ran into a situation where a machine with 4 NUMA memory nodes and
> 40 cores had performance problems due to NUMA. The problems were
> worst right after they rebooted the OS and warmed the cache by
> running a script of queries to read all tables. These were all run
> on a single connection. As it turned out, the size of the database
> was just over one-quarter of the size of RAM, and with default NUMA
> policies both the OS cache for the database and the PostgreSQL
> shared memory allocation were placed on a single NUMA segment, so
> access to the CPU package managing that segment became a
> bottleneck. On top of that, processes which happened to run on the
> CPU package which had all the cached data had to allocate memory
> for local use on more distant memory because there was none left in
> the more local memory.
>
> Through normal operations, things eventually tended to shift around
> and get better (after several hours of heavy use with substandard
> performance). I ran some benchmarks and found that even in
> long-running tests, spreading these allocations among the memory
> segments showed about a 2% benefit in a read-only load. The
> biggest difference I saw in a long-running read-write load was
> about a 20% hit for unbalanced allocations, but I only saw that
> once. I talked to someone at PGCon who managed to engineer much
> worse performance hits for an unbalanced load, although the
> circumstances were fairly artificial. Still, fixing this seems
> like something worth doing if further benchmarks confirm benefits
> at this level.
>
> By default, the OS cache and buffers are allocated in the memory
> node with the shortest "distance" from the CPU a process is running
> on. This is determined by the "cpuset" associated with the
> process which reads or writes the disk page. Typically a NUMA
> machine starts with a single cpuset with a policy specifying this
> behavior. Fixing this aspect of things seems like an issue for
> packagers, although we should probably document it for those
> running from their own source builds.
>
> To set an alternate policy for PostgreSQL, you first need to find
> or create the location for cpuset specification, which uses a
> filesystem in a way similar to the /proc directory. On a machine
> with more than one memory node, the appropriate filesystem is
> probably already mounted, although different distributions use
> different filesystem names and mount locations. I will illustrate
> the process on my Ubuntu machine. Even though it has only one
> memory node (and so, this makes no difference), I have it handy at
> the moment to confirm the commands as I put them into the email.
>
> # Sysadmin must create the root cpuset if not already done. (On a
> # system with NUMA memory, this will probably already be mounted.)
> # Location and options can vary by distro.
>
> sudo mkdir /dev/cpuset
> sudo mount -t cpuset none /dev/cpuset
>
> # Sysadmin must create a cpuset for postgres and configure
> # resources. This will normally be all cores and all RAM. This is
> # where we specify that this cpuset will spread pages among its
> # memory nodes.
>
> sudo mkdir /dev/cpuset/postgres
> sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
> sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
> sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"
>
> # Sysadmin must grant permissions to the desired setting(s).
> # This could be by user or group.
>
> sudo chown postgres /dev/cpuset/postgres/tasks
>
> # The pid of postmaster or an ancestor process must be written to
> # the tasks "file" of the cpuset. This can be a shell from which
> # pg_ctl is run, at least for bash shells. It could also be
> # written by the postmaster itself, essentially as an extra pid
> # file. Possible snippet from a service script:
>
> echo $$ >/dev/cpuset/postgres/tasks
> pg_ctl start ...
>
> Where the OS cache is larger than shared_buffers, the above is
> probably more important than the attached patch, which causes the
> main shared memory segment to be spread among all available memory
> nodes. This patch only compiles in the relevant code if configure
> is run using the --with-libnuma option, in which case a dependency
> on the numa library is created. It is v3 to avoid confusion with
> earlier versions I have shared with a few people off-list. (The
> only difference from v2 is fixing bitrot.)
>
> I'll add it to the next CF.
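(Side note for anyone else trying the quoted recipe: a quick way to confirm
that the cpuset and page spreading actually took effect should be something
along the lines below. Treat it as an untested sketch: the paths and cpuset
name come from the example above, it assumes PGDATA points at the data
directory, and numastat ships separately as part of the numactl package.)

    # the postmaster pid is the first line of postmaster.pid
    PGPID=$(head -1 "$PGDATA/postmaster.pid")

    # which cpuset did the postmaster land in?  expect "/postgres"
    cat /proc/$PGPID/cpuset

    # is page spreading enabled for that cpuset?  expect "1"
    cat /dev/cpuset/postgres/memory_spread_page

    # rough per-NUMA-node memory usage for the postmaster
    numastat -p $PGPID
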
Hm, your patch seems to boil down to interleave_memory(start, size,
numa_all_nodes_ptr) inside PGSharedMemoryCreate(). I've read your email a
couple of times and am a little hazy on a couple of points, in particular
"the above is probably more important than the attached patch". So I have a
couple of questions:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
instructing operators to disable zone_reclaim. Will your changes invalidate
any of that advice?

*) Is there any downside to enabling --with-libnuma if you have support? Do
you expect packagers will enable it generally? Why not just always build it
in (if configure allows it) and rely on a GUC if there is some kind of
tradeoff? And if there is a tradeoff, what kinds of things are you looking
for to manage it?

*) The bash script above: what problem does the 'alternate policy' solve?

*) What kinds of improvements (even if only in very general terms) should we
expect from better NUMA management? Are there further optimizations possible?

merlin
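
P.S. For comparison while thinking about the questions above: the blunt
operational alternative I'm aware of is to interleave everything at startup
via numactl rather than from inside the server, roughly like this (a hedged
sketch, assuming a PGDATA environment variable and the numactl package):

    # interleave all allocations (shared memory and backend-local alike)
    # across every NUMA node for the whole cluster
    numactl --interleave=all pg_ctl start -D "$PGDATA"

Unlike the patch, that can't treat the main shared memory segment differently
from backend-local allocations, which I assume is part of the motivation here.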