Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>> cause serious difficulty leading to an OOM kill -- and this is brand
>>> new.  It worked fine yesterday.  I tracked it down to all of the make
>>> processes working on the "built_programs.list" (in src/Makefile.am)
>>> rule:
>>>
>>>   built_programs.list:
>>>           @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>             | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>
>>> Which made me realize we were running that submake over 400 times,
>>> once per test script (including skipped ones).  That's well worth
>>> avoiding, even if it means a new temporary file.
>>>
>>> I don't know the root cause of the OOM kill (preceded by interminable
>>> minutes of a seemingly hung and barely responsive system) or why it
>>> started happening today (afaics, none of the programs involved was
>>> updated), but this does fix it...
>>
>> FYI,
>> I've tracked this down a little further.
>> The horrid performance (hung system and eventual OOM kill)
>> is related to the use of sort above.  This is the definition:
>>
>>   ASSORT = LC_ALL=C sort
>>
>> If I revert my earlier patch and instead simply
>> insist that sort not do anything in parallel,
>>
>>   ASSORT = LC_ALL=C sort --parallel=1
>>
>> then there is no hang, and things finish in relatively good time.
>>
>> I don't have a good stand-alone reproducer yet
>> and am out of time for today.
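The submake-avoidance fix mentioned above can be modeled in a few lines of shell. This is a hypothetical stand-alone sketch, not the actual src/Makefile.am change: the program names and the $list_file variable are invented here. The idea is that the sorted list is computed and cached once, so the ~400 test scripts can simply read it instead of each re-running the pipeline in a sub-make:

```shell
# Hypothetical model of the fix: compute the sorted program list once,
# cache it in a file, and let every test read that file instead of
# re-running the tr|sed|sort pipeline in its own sub-make.
bin_programs='sort ls cp mv sort'         # invented stand-in for $(bin_PROGRAMS)
list_file=$(mktemp)
printf '%s\n' $bin_programs \
  | LC_ALL=C sort -u | tr '\n' ' ' > "$list_file"
cat "$list_file"                          # each test now just reads this file
```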
After updating to a new F14 kernel, I never managed to reproduce that,
but maybe that's just a coincidence...

> Well, right after writing that, I realized the key to the misbehavior:
> sort was reading from a *pipe*:
>
>   # This finishes right away, reading from input file "k":
>   seq 99 > k && for i in $(seq 33); do
>     LC_ALL=C timeout 1 sort k > /dev/null &
>   done
>
>   # When reading from a pipe, it's a very different story:
>   # Without the "timeout 7" prefix, the following would render an
>   # N-core system (5<N) unusable for many minutes.  As it is, be
>   # prepared: my system goes unresponsive after 1 second, and doesn't
>   # return until the timeout expires.
>   for i in $(seq 33); do
>     seq 88 | LC_ALL=C timeout 7 sort --parallel=5 > /dev/null &
>   done
>
> Occasionally, the above jobs all complete quickly.
>
> My first question was why *any* processes were being spawned to handle
> such a small input file.  The first answer is in the first hunk:
>
> diff --git a/src/sort.c b/src/sort.c
> index 13954cb..b9316e7 100644
> --- a/src/sort.c
> +++ b/src/sort.c
> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>  /* Heuristic value for the number of lines for which it is worth
>     creating a subthread, during an internal merge sort, on a machine
>     that has processors galore.  Currently this number is just a guess.
> -   This value must be at least 4.  We don't know of any machine where
> -   this number has any practical effect. */
> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
> +   This value must be at least 4. */
> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>
>  /* The number of threads after which there are
>     diminishing performance gains. */
>
> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
> of 5 or more lines would be split in two and sorted via two (or more)
> separate threads.  Thus, with just 40 lines, you could get the
> maximum of 8 threads working.
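The arithmetic behind that thread explosion can be sketched with a rough shell model. This is a simplification I am assuming for illustration: it only does repeated halving against the heuristic, while the real sort also honors --parallel and the processor count when capping threads:

```shell
# Rough model: count how many sorting threads repeated halving can
# create before every piece of the input drops below the heuristic.
# Simplified: real sort also caps threads at --parallel / nproc limits.
threads_for() (  # usage: threads_for LINES HEURISTIC
  lines=$1 heuristic=$2 threads=1
  while [ "$lines" -ge "$heuristic" ] && [ "$threads" -lt 8 ]; do
    lines=$(( (lines + 1) / 2 ))   # each split halves the group
    threads=$(( threads * 2 ))     # ...and doubles the worker count
  done
  echo "$threads"
)
threads_for 40 4          # → 8: old heuristic, 40 lines max out 8 threads
threads_for 40 131072     # → 1: new heuristic keeps tiny inputs serial
```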
> That is obviously not efficient, unless lines are so pathologically
> long that the cost of comparing two of them approaches the cost of
> creating a new thread.
>
> With the above, sort would use a more reasonable number.
> Tests on high-end hardware and using very short lines
> suggest that a value like 200,000 would still be conservative.

Here's a complete patch for that:

>From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyer...@redhat.com>
Date: Thu, 10 Feb 2011 08:48:27 +0100
Subject: [PATCH] sort: spawn fewer threads for small inputs

* src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
appropriate for a 5-year-old dual-core laptop, but it is too low for
some common combinations of short lines and/or newer systems.
* NEWS (Bug fixes): Mention it.
---
 NEWS       |    9 ++++++---
 src/sort.c |   16 ++++++++++------
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/NEWS b/NEWS
index 3157ef2..5770410 100644
--- a/NEWS
+++ b/NEWS
@@ -4,13 +4,16 @@ GNU coreutils NEWS                               -*- outline -*-

 ** Bug fixes

-  du would infloop when given --files0-from=DIR
-  [bug introduced in coreutils-7.1]
-
   cut could segfault when invoked with a user-specified output
   delimiter and an unbounded range like "-f1234567890-".
   [bug introduced in coreutils-5.3.0]

+  du would infloop when given --files0-from=DIR
+  [bug introduced in coreutils-7.1]
+
+  sort no longer spawns 7 worker threads to sort 16 lines
+  [bug introduced in coreutils-8.6]
+
   wc would dereference a NULL pointer upon an early out-of-memory error
   [bug introduced in coreutils-7.1]

diff --git a/src/sort.c b/src/sort.c
index 13954cb..9b8666a 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -109,12 +109,16 @@ struct rlimit { size_t rlim_cur; };
    and is responsible for merging TOTAL lines. */
 #define MAX_MERGE(total, level) (((total) >> (2 * ((level) + 1))) + 1)

-/* Heuristic value for the number of lines for which it is worth
-   creating a subthread, during an internal merge sort, on a machine
-   that has processors galore.  Currently this number is just a guess.
-   This value must be at least 4.  We don't know of any machine where
-   this number has any practical effect. */
-enum { SUBTHREAD_LINES_HEURISTIC = 4 };
+/* Heuristic value for the number of lines for which it is worth creating
+   a subthread, during an internal merge sort.  I.e., it is a small number
+   of "average" lines for which sorting via two threads is faster than
+   sorting via one on an "average" system.  On a dual-core 2.0 GHz i686
+   system with 3GB of RAM and 2MB of L2 cache, a file containing 128K
+   lines of gensort -a output is sorted slightly faster with --parallel=2
+   than with --parallel=1.  By contrast, using --parallel=1 is about 10%
+   faster than using --parallel=2 with a 64K-line input. */
+enum { SUBTHREAD_LINES_HEURISTIC = 128 * 1024 };
+verify (4 <= SUBTHREAD_LINES_HEURISTIC);

 /* The number of threads after which there are
    diminishing performance gains. */
--
1.7.4.1.299.ga459d
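As a closing sanity check, a sketch like the following exercises inputs on both sides of the new 128K-line threshold and verifies that the thread count never changes the result. It assumes a GNU sort new enough to accept --parallel (coreutils >= 8.6) plus shuf, and uses shuffled seq output as a cheap stand-in for gensort -a:

```shell
# Sort inputs below and above the new SUBTHREAD_LINES_HEURISTIC (128K
# lines) at two --parallel levels; the outputs must be identical.
tmp=$(mktemp -d)
seq 88 > "$tmp/small"                # far below the threshold: one thread
seq 200000 | shuf > "$tmp/big"       # above it: subthreads may be used
for n in 1 2; do
  LC_ALL=C sort --parallel=$n "$tmp/small" > "$tmp/small.$n"
  LC_ALL=C sort --parallel=$n "$tmp/big"   > "$tmp/big.$n"
done
cmp "$tmp/small.1" "$tmp/small.2" && cmp "$tmp/big.1" "$tmp/big.2" \
  && echo consistent
```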