Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>> cause serious difficulty leading to an OOM kill -- and this is brand
>>> new.  It worked fine yesterday.  I tracked it down to all of the make
>>> processes working on the "built_programs.list" (in src/Makefile.am)
>>> rule:
>>>
>>>   built_programs.list:
>>>           @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>             | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>
>>> Which made me realize we were running that submake over 400 times,
>>> once per test script (including skipped ones).  That's well worth
>>> avoiding, even if it means a new temporary file.
>>>
>>> I don't know the root cause of the OOM kill (preceded by interminable
>>> minutes of a seemingly hung and barely responsive system) or why it
>>> started happening today (afaics, none of the programs involved was
>>> updated), but this does fix it...
>>
>> FYI,
>> I've tracked this down a little further.
>> The horrid performance (hung system and eventual OOM kill)
>> is related to the use of sort above.  This is the definition:
>>
>>   ASSORT = LC_ALL=C sort
>>
>> If I revert my earlier patch and instead simply
>> insist that sort not do anything in parallel,
>>
>>   ASSORT = LC_ALL=C sort --parallel=1
>>
>> then there is no hang, and things finish in relatively good time.
>>
>> I don't have a good stand-alone reproducer yet
>> and am out of time for today.
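The submake-avoidance fix mentioned above can be modeled in a few lines of shell. This is a hypothetical stand-alone sketch, not the actual src/Makefile.am change: the program names and the $list_file variable are invented here. The idea is that the sorted list is computed and cached once, so the ~400 test scripts can simply read it instead of each re-running the pipeline in a sub-make:

```shell
# Hypothetical model of the fix: compute the sorted program list once,
# cache it in a file, and let every test read that file instead of
# re-running the tr|sed|sort pipeline in its own sub-make.
bin_programs='sort ls cp mv sort'         # invented stand-in for $(bin_PROGRAMS)
list_file=$(mktemp)
printf '%s\n' $bin_programs \
  | LC_ALL=C sort -u | tr '\n' ' ' > "$list_file"
cat "$list_file"                          # each test now just reads this file
```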
After updating to a new F14 kernel, I never managed to reproduce that,
but maybe that's just a coincidence...

> Well, right after writing that, I realized the key to the misbehavior:
> sort was reading from a *pipe*:
>
>   # This finishes right away, reading from input file "k":
>   seq 99 > k && for i in $(seq 33); do
>     LC_ALL=C timeout 1 sort k > /dev/null &
>   done
>
>   # When reading from a pipe, it's a very different story:
>   # Without the "timeout 7" prefix, the following would render an
>   # N-core system (5<N) unusable for many minutes.  As it is, be
>   # prepared: my system goes unresponsive after 1 second, and doesn't
>   # return until the timeout expires.
>   for i in $(seq 33); do
>     seq 88 | LC_ALL=C timeout 7 sort --parallel=5 > /dev/null &
>   done
>
> Occasionally, the above jobs all complete quickly.
>
> My first question was why *any* processes were being spawned to handle
> such a small input file.  The first answer is in the first hunk:
>
> diff --git a/src/sort.c b/src/sort.c
> index 13954cb..b9316e7 100644
> --- a/src/sort.c
> +++ b/src/sort.c
> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>  /* Heuristic value for the number of lines for which it is worth
>     creating a subthread, during an internal merge sort, on a machine
>     that has processors galore.  Currently this number is just a guess.
> -   This value must be at least 4.  We don't know of any machine where
> -   this number has any practical effect. */
> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
> +   This value must be at least 4. */
> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>
>  /* The number of threads after which there are
>     diminishing performance gains. */
>
> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
> of 5 or more lines would be split in two and sorted via two (or more)
> separate threads.  Thus, with just 40 lines, you could get the
> maximum of 8 threads working.
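The arithmetic behind that thread explosion can be sketched with a rough shell model. This is a simplification I am assuming for illustration: it only does repeated halving against the heuristic, while the real sort also honors --parallel and the processor count when capping threads:

```shell
# Rough model: count how many sorting threads repeated halving can
# create before every piece of the input drops below the heuristic.
# Simplified: real sort also caps threads at --parallel / nproc limits.
threads_for() (  # usage: threads_for LINES HEURISTIC
  lines=$1 heuristic=$2 threads=1
  while [ "$lines" -ge "$heuristic" ] && [ "$threads" -lt 8 ]; do
    lines=$(( (lines + 1) / 2 ))   # each split halves the group
    threads=$(( threads * 2 ))     # ...and doubles the worker count
  done
  echo "$threads"
)
threads_for 40 4          # → 8: old heuristic, 40 lines max out 8 threads
threads_for 40 131072     # → 1: new heuristic keeps tiny inputs serial
```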
> That is obviously not efficient, unless lines are so pathologically
> long that the cost of comparing two of them approaches the cost of
> creating a new thread.
>
> With the above, sort would use a more reasonable number.
> Tests on high-end hardware and using very short lines
> suggest that a value like 200,000 would still be conservative.

Here's a complete patch for that:

>From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyer...@redhat.com>
Date: Thu, 10 Feb 2011 08:48:27 +0100
Subject: [PATCH] sort: spawn fewer threads for small inputs

* src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
appropriate for a 5-year-old dual-core laptop, but it is too low for
some common combinations of short lines and/or newer systems.
* NEWS (Bug fixes): Mention it.
---
 NEWS       |    9 ++++++---
 src/sort.c |   16 ++++++++++------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/NEWS b/NEWS
index 3157ef2..5770410 100644
--- a/NEWS
+++ b/NEWS
@@ -4,13 +4,16 @@ GNU coreutils NEWS                               -*- outline -*-

 ** Bug fixes

-  du would infloop when given --files0-from=DIR
-  [bug introduced in coreutils-7.1]
-
   cut could segfault when invoked with a user-specified output
   delimiter and an unbounded range like "-f1234567890-".
   [bug introduced in coreutils-5.3.0]

+  du would infloop when given --files0-from=DIR
+  [bug introduced in coreutils-7.1]
+
+  sort no longer spawns 7 worker threads to sort 16 lines
+  [bug introduced in coreutils-8.6]
+
   wc would dereference a NULL pointer upon an early out-of-memory error
   [bug introduced in coreutils-7.1]

diff --git a/src/sort.c b/src/sort.c
index 13954cb..9b8666a 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -109,12 +109,16 @@ struct rlimit { size_t rlim_cur; };
    and is responsible for merging TOTAL lines. */
 #define MAX_MERGE(total, level) (((total) >> (2 * ((level) + 1))) + 1)

-/* Heuristic value for the number of lines for which it is worth
-   creating a subthread, during an internal merge sort, on a machine
-   that has processors galore.  Currently this number is just a guess.
-   This value must be at least 4.  We don't know of any machine where
-   this number has any practical effect. */
-enum { SUBTHREAD_LINES_HEURISTIC = 4 };
+/* Heuristic value for the number of lines for which it is worth creating
+   a subthread, during an internal merge sort.  I.e., it is a small number
+   of "average" lines for which sorting via two threads is faster than
+   sorting via one on an "average" system.  On a dual-core 2.0 GHz i686
+   system with 3GB of RAM and 2MB of L2 cache, a file containing 128K
+   lines of gensort -a output is sorted slightly faster with --parallel=2
+   than with --parallel=1.  By contrast, using --parallel=1 is about 10%
+   faster than using --parallel=2 with a 64K-line input. */
+enum { SUBTHREAD_LINES_HEURISTIC = 128 * 1024 };
+verify (4 <= SUBTHREAD_LINES_HEURISTIC);

 /* The number of threads after which there are
    diminishing performance gains. */
--
1.7.4.1.299.ga459d
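As a closing sanity check, a sketch like the following exercises inputs on both sides of the new 128K-line threshold and verifies that the thread count never changes the result. It assumes a GNU sort new enough to accept --parallel (coreutils >= 8.6) plus shuf, and uses shuffled seq output as a cheap stand-in for gensort -a:

```shell
# Sort inputs below and above the new SUBTHREAD_LINES_HEURISTIC (128K
# lines) at two --parallel levels; the outputs must be identical.
tmp=$(mktemp -d)
seq 88 > "$tmp/small"                # far below the threshold: one thread
seq 200000 | shuf > "$tmp/big"       # above it: subthreads may be used
for n in 1 2; do
  LC_ALL=C sort --parallel=$n "$tmp/small" > "$tmp/small.$n"
  LC_ALL=C sort --parallel=$n "$tmp/big"   > "$tmp/big.$n"
done
cmp "$tmp/small.1" "$tmp/small.2" && cmp "$tmp/big.1" "$tmp/big.2" \
  && echo consistent
```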