I've recently been taking another look at the test case shown here: http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php This is the infamous "context swap storm" problem that we've hacking at for so long. After the 8.1 buffer manager redesign, the problem of contention for the BufMgrLock is pretty much gone, and so the test case no longer exhibits a lot of semop()s. However ... performance of CVS tip on a multiprocessor still pretty much sucks :-(
I started looking at this because of a discussion with a Red Hat customer, who was saying that the "futex spinlock" patch shown here http://archives.postgresql.org/pgsql-performance/2004-10/msg00185.php had helped him materially with a context-swap-storm problem on 8.0.3. That didn't make a lot of sense to me, because we had determined that the futex patch didn't help much for this problem (which is one reason why it never got applied). So I started digging, and I've found a few interesting things that we still need to fix. First off, the RH customer had found that the futex patch improved his query performance even though the context swap rate actually went *up* materially. So I realized that we shouldn't simply be looking at the CS rate and saying "ugh, we have to get that down". It's important to find out how much work is really getting done, too. Accordingly, I generated a modified version of the test case with a shorter query that is actually meant to complete in a reasonable amount of time, and I looked at how long it took to complete multiple instances of this query running in parallel. Ideally, if you have N cpus you should be able to run up to N copies of the query in the same amount of time it takes for one copy, and then, say, 2*N copies ought to take about twice as long as N copies. We do actually seem to scale pretty linearly beyond N ... but below that, not. The files for the updated test case are attached if anyone else wants to try it. They are: test_setup.sql Run this to create the test tables test_run.sql A version of the test query that will run pretty nearly forever (until you kill it) test_run_small.sql A one-minute-or-so test query startn.sh A shell script to run N copies of test_run_small.sql in parallel. You need this one-time setup: createdb test psql -f test_setup.sql test To perform testing, do "time ./startn.sh N" for N=1,2,... up to however many CPUs you have, and maybe a run with more queries than that to make sure performance doesn't drop off above that. Note the total time to complete the script for each N. You can also watch the CPU utilization and context swap rates using "vmstat 1" or equivalent, and/or strace some of the backends to see what they're doing. I did my testing on a 4-way Opteron box at Red Hat, using kernel 2.6.12-1.1398_FC4smp. 8.0.3 completely sucks in this test: N, runtime: 1 45s 2 119s 4 507s (didn't try 8) With 4 processes running, it shows about 40K context swaps/sec and barely 15% CPU utilization! strace shows plenty of semop()s as expected, but also a fair number of delaying select() calls from s_lock(). I saw about one select for every hundred semops, but that was still enough to have 30 or so selects per second (per process). Since each select sleeps 10msec, that's about a third of the potential CPU usage being discarded in spinlock delays. (Even more interesting, this test case behaves much worse on Opteron than on other SMP boxes. I had previously done similar tests on a 4-way Xeon with similar-vintage kernel, and saw context swap rates over 140K/sec with about 70% CPU load. I don't have test_run_small results from that session, but clearly the spinlock code is behaving worse on the Opterons.) CVS tip eliminates essentially all the semops, as we hoped, but scalability is still not great: N, runtime: 1 36s 2 61s 4 105s 8 198s At 4 processes, CPU use is only about 30%; the rate of select() delays has gone up to 50/sec or more, so even though we fixed the bufmgr itself, we are still losing a lot of potential CPU cycles to spinlock waits. (Note: I later realized that the CVS-tip code was accidentally built with -O1, not -O2 as the 8.0.3 code had been, so these two sets of numbers are not completely comparable: with -O2 the single-query time drops to 31s in CVS tip. But the main point is all the time that's disappearing into spinlock delays.) I have some proposals for fixing this stuff, which I'll put into a separate message since this one is pretty long already. What I wanted to accomplish here is to document the test case and the problem, and that's done. Anyone have SMP boxes of other types that they can try this on? regards, tom lane
drop table test_data; create table test_data(f1 int); insert into test_data values (random() * 100); insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; insert into test_data select random() * 100 from test_data; create index test_index on test_data(f1); vacuum verbose analyze test_data; checkpoint;
-- force nestloop indexscan plan set enable_seqscan to 0; set enable_bitmapscan to 0; set enable_mergejoin to 0; set enable_hashjoin to 0; explain select count(*) from test_data a, test_data b, test_data c where a.f1 = b.f1 and b.f1 = c.f1; select count(*) from test_data a, test_data b, test_data c where a.f1 = b.f1 and b.f1 = c.f1;
-- force nestloop indexscan plan set enable_seqscan to 0; set enable_bitmapscan to 0; set enable_mergejoin to 0; set enable_hashjoin to 0; select count(*) from test_data a, test_data b where a.f1 = b.f1;
#!/bin/sh N=0 while [ $N -lt $1 ] do psql -f test_run_small.sql test >/dev/null & N=`expr $N + 1` done wait
---------------------------(end of broadcast)--------------------------- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq