On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <kgri...@ymail.com> wrote:
> I ran into a situation where a machine with 4 NUMA memory nodes and
> 40 cores had performance problems due to NUMA.  The problems were
> worst right after they rebooted the OS and warmed the cache by
> running a script of queries to read all tables.  These were all run
> on a single connection.  As it turned out, the size of the database
> was just over one-quarter of the size of RAM, and with default NUMA
> policies both the OS cache for the database and the PostgreSQL
> shared memory allocation were placed on a single NUMA segment, so
> access to the CPU package managing that segment became a
> bottleneck.  On top of that, processes which happened to run on the
> CPU package holding all the cached data had to satisfy their own
> allocations from more distant memory, because none was left free on
> the local node.
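
For anyone wanting to see this kind of imbalance on their own hardware,
the tools in the numactl package make it visible.  Option support and
output format vary somewhat by version, so take this as a sketch rather
than a recipe:

numactl --hardware    # list memory nodes, with total and free memory per node
numastat              # per-node numa_hit / numa_miss / other_node counters
numastat -m           # meminfo-style per-node breakdown, where supported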
>
> Through normal operations, things eventually tended to shift around
> and get better (after several hours of heavy use with substandard
> performance).  I ran some benchmarks and found that even in
> long-running tests, spreading these allocations among the memory
> segments showed about a 2% benefit in a read-only load.  The
> biggest difference I saw in a long-running read-write load was
> about a 20% hit for unbalanced allocations, but I only saw that
> once.  I talked to someone at PGCon who managed to engineer much
> worse performance hits for an unbalanced load, although the
> circumstances were fairly artificial.  Still, fixing this seems
> like something worth doing if further benchmarks confirm benefits
> at this level.
>
> By default, the OS cache and buffers are allocated in the memory
> node with the shortest "distance" from the CPU a process is running
> on.  This is determined by the "cpuset" associated with the
> process which reads or writes the disk page.  Typically a NUMA
> machine starts with a single cpuset with a policy specifying this
> behavior.  Fixing this aspect of things seems like an issue for
> packagers, although we should probably document it for those
> running from their own source builds.
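
A quick way to check what policy a given shell (and anything it launches)
starts out with, assuming the numactl package is installed and the kernel
has cpuset support:

numactl --show          # current CPU/memory binding and allocation policy
cat /proc/self/cpuset   # which cpuset this shell belongs to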
>
> To set an alternate policy for PostgreSQL, you first need to find
> or create the location for cpuset specification, which uses a
> filesystem in a way similar to the /proc directory.  On a machine
> with more than one memory node, the appropriate filesystem is
> probably already mounted, although different distributions use
> different filesystem names and mount locations.  I will illustrate
> the process on my Ubuntu machine.  Even though it has only one
> memory node (so none of this makes any practical difference there),
> I have it handy at the moment to confirm the commands as I put them
> into this email.
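
To check whether a cpuset filesystem is already mounted, and where, the
mount table is the place to look; note that some distributions expose
cpusets through the cgroup hierarchy instead (typically mounted under
/sys/fs/cgroup/cpuset):

grep cpuset /proc/mounts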
>
> # Sysadmin must create the root cpuset if not already done.  (On a
> # system with NUMA memory, this will probably already be mounted.)
> # Location and options can vary by distro.
>
> sudo mkdir /dev/cpuset
> sudo mount -t cpuset none /dev/cpuset
>
> # Sysadmin must create a cpuset for postgres and configure
> # resources.  This will normally be all cores and all RAM.  This is
> # where we specify that this cpuset will spread pages among its
> # memory nodes.
>
> sudo mkdir /dev/cpuset/postgres
> sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
> sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
> sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"
>
> # Sysadmin must grant permissions to the desired setting(s).
> # This could be by user or group.
>
> sudo chown postgres /dev/cpuset/postgres/tasks
>
> # The pid of postmaster or an ancestor process must be written to
> # the tasks "file" of the cpuset.  This can be a shell from which
> # pg_ctl is run, at least for bash shells.  It could also be
> # written by the postmaster itself, essentially as an extra pid
> # file.  Possible snippet from a service script:
>
> echo $$ >/dev/cpuset/postgres/tasks
> pg_ctl start ...
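
Once the postmaster is up, it is worth confirming that it and its children
actually landed in the new cpuset; something along these lines should do
it (assuming the usual $PGDATA/postmaster.pid location for the pid file):

head -1 $PGDATA/postmaster.pid                        # postmaster pid
cat /proc/$(head -1 $PGDATA/postmaster.pid)/cpuset    # should print /postgres
cat /dev/cpuset/postgres/tasks                        # should list postmaster and backends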
>
> Where the OS cache is larger than shared_buffers, the above is
> probably more important than the attached patch, which causes the
> main shared memory segment to be spread among all available memory
> nodes.  This patch only compiles in the relevant code if configure
> is run using the --with-libnuma option, in which case a dependency
> on the numa library is created.  It is v3 to avoid confusion with
> earlier versions I have shared with a few people off-list.  (The
> only difference from v2 is fixing bitrot.)
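
For reference, with the patch applied the build would presumably be the
usual dance plus the new switch, something like:

./configure --with-libnuma ...
make && make install

and whether the main shared memory segment really ended up interleaved can
be checked from another session with the per-process mode of numastat
(also from the numactl package, and version-dependent):

numastat -p <postmaster pid>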
>
> I'll add it to the next CF.

Hm, your patch seems to boil down to interleave_memory(start, size,
numa_all_nodes_ptr) inside PGSharedMemoryCreate().  I've read your
email a couple of times and am still a little hazy on a few points,
in particular: "the above is probably more important than the
attached patch".  So I have some questions:

*) There is a lot of advice floating around (for example here:
http://frosty-postgres.blogspot.com/2012/08/postgresql-numa-and-zone-reclaim-mode.html)
instructing operators to disable zone_reclaim.  Will your changes
invalidate any of that advice?

*) is there any downside to enabling --with-libnuma if you have
support?  Do you expect packagers will enable it generally?  Why not
just always build it in (if configure allows it) and rely on a GUC if
there is some kind of tradeoff (and if there is one, what kinds of
things are you looking for to manage it)?

*) For the bash commands above, what problem does the 'alternate policy' solve?

*) What kinds of improvements (even if in very general terms) will we
see from better numa management?  Are further optimizations possible?

merlin

