Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Pádraig Brady wrote:
> On 16/03/11 15:32, Jim Meyering wrote:
>> Pádraig Brady wrote:
>>>
>>> With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get
>>> no threads as nlines never gets above 12787 (there looks to be
>>> around 80 bytes overhead per line?).
>>> Only when -S >= 12M do we get nlines high enough to create threads.
>>
>> Thanks for pursuing this.
>> Here's a proposed patch to address the other problem.
>> It doesn't have much of an effect (any?) on your
>> issue when using very little memory, but when a sort user
>> specifies -S1M, I think they probably want to avoid the
>> expense (memory) of going multi-threaded.
>>
>> What do you think?
>>
>> -#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
>> +#define INPUT_FILE_SIZE_GUESS (128 * 1024)
>
> This does seem a bit like whack-a-mole
> but at least we're lining them up better.
>
> The above gives reasonable threading by default,
> while reducing the large upfront malloc.
>
> $ for len in 1 79; do
>     for i in $(seq 22); do
>       lines=$((2<<$i))
>       yes "$(printf %${len}s)" | head -n$lines > t.sort
>       strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
>         join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>         sed -n "s/\([0-9]*\) clone/$lines\t\1/p"
>     done
>   done
>
> #lines    threads (2 byte lines)
> --
> 131072    1
> 262144    3
> 524288    7
> 1048576   15
> 2097152   15
> 4194304   15
> 8388608   15
>
> #lines    threads (80 byte lines)
> --
> 131072    1
> 262144    3
> 524288    7
> 1048576   15
> 2097152   15
> 4194304   22
> 8388608   60

Thanks for testing and review.  Pushed.
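[Editorial note] The clone counts in the tables above follow a simple doubling pattern. A hedged back-of-envelope model (my reading of the numbers, not the actual sort internals): each doubling of the line count past SUBTHREAD_LINES_HEURISTIC adds one more level of two-way splitting, capped by the --parallel budget.

```shell
# Hypothetical model of the clone counts above.  T and P are assumptions:
# T = SUBTHREAD_LINES_HEURISTIC (128K lines) and P = the splitting budget
# implied by --parallel=16; this is a sketch, not the real sort code.
T=$((128 * 1024))
P=8
for n in 131072 262144 524288 1048576 2097152; do
  s=$((n / T))
  if [ "$s" -gt "$P" ]; then s=$P; fi
  echo "$n $((2 * s - 1))"      # predicted clone() count
done
```

The predictions (1, 3, 7, 15, 15) match the measured 2-byte-line column; the 80-byte column diverges at the top end where multiple buffer fills come into play.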
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
On 16/03/11 15:32, Jim Meyering wrote:
> Pádraig Brady wrote:
>>
>> With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get
>> no threads as nlines never gets above 12787 (there looks to be
>> around 80 bytes overhead per line?).
>> Only when -S >= 12M do we get nlines high enough to create threads.
>
> Thanks for pursuing this.
> Here's a proposed patch to address the other problem.
> It doesn't have much of an effect (any?) on your
> issue when using very little memory, but when a sort user
> specifies -S1M, I think they probably want to avoid the
> expense (memory) of going multi-threaded.
>
> What do you think?
>
> -#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
> +#define INPUT_FILE_SIZE_GUESS (128 * 1024)

This does seem a bit like whack-a-mole
but at least we're lining them up better.

The above gives reasonable threading by default,
while reducing the large upfront malloc.

$ for len in 1 79; do
    for i in $(seq 22); do
      lines=$((2<<$i))
      yes "$(printf %${len}s)" | head -n$lines > t.sort
      strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
        join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
        sed -n "s/\([0-9]*\) clone/$lines\t\1/p"
    done
  done

#lines    threads (2 byte lines)
--
131072    1
262144    3
524288    7
1048576   15
2097152   15
4194304   15
8388608   15

#lines    threads (80 byte lines)
--
131072    1
262144    3
524288    7
1048576   15
2097152   15
4194304   22
8388608   60

cheers,
Pádraig.
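[Editorial note] The "around 80 bytes overhead per line" estimate can be sanity-checked with trivial arithmetic from the figures in this message: a -S1M buffer divided by the observed nlines ceiling of 12787.

```shell
# -S1M gives sort roughly a 1 MiB buffer; nlines tops out at 12787 in the
# measurement above, so the per-line cost (data plus struct-line
# bookkeeping) works out to about:
echo $(( (1024 * 1024) / 12787 ))   # => 82 (bytes per short line)
```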
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Pádraig Brady wrote:
> # SUBTHREAD_LINES_HEURISTIC = 4
> $ for i in $(seq 22); do
>     j=$((2<<$i))
>     yes | head -n$j > t.sort
>     strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
>       join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>       sed -n "s/\([0-9]*\) clone/$j\t\1/p"
>   done
> 4        1
> 8        3
> 16       7
> 32       15
> 64       15
> 128      15
> 256      15
> 512      15
> 1024     15
> 2048     15
> 4096     15
> 8192     15
> 16384    15
> 32768    15
> 65536    15
> 131072   15
> 262144   15
> 524288   15
> 1048576  15
> 2097152  15
> 4194304  30
> 8388608  45
>
> # As above, but add -S1M option to sort
>
> 4        1
> 8        3
> 16       7
> 32       15
> 64       15
> 128      15
> 256      15
> 512      15
> 1024     15
> 2048     15
> 4096     15
> 8192     15
> 16384    30
> 32768    45
> 65536    90
> 131072   165
> 262144   315
> 524288   622
> 1048576  1245
> 2097152  2475
> 4194304  4935
> 8388608  9855
>
> With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get
> no threads as nlines never gets above 12787 (there looks to be around
> 80 bytes overhead per line?).
> Only when -S >= 12M do we get nlines high enough to create threads.

Thanks for pursuing this.
Here's a proposed patch to address the other problem.
It doesn't have much of an effect (any?) on your
issue when using very little memory, but when a sort user
specifies -S1M, I think they probably want to avoid the
expense (memory) of going multi-threaded.

What do you think?

From 4f591fdd0bb78f621d2b72021de883fc4df1e179 Mon Sep 17 00:00:00 2001
From: Jim Meyering
Date: Wed, 16 Mar 2011 16:09:31 +0100
Subject: [PATCH] sort: avoid memory pressure of 130MB/thread when reading
 from a pipe

* src/sort.c (INPUT_FILE_SIZE_GUESS): Decrease the initial allocation
factor used to size the buffer used when reading a non-regular file.
For motivation, see discussion here:
http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887
---
 src/sort.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9b8666a..07d6765 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -319,8 +319,12 @@ static size_t merge_buffer_size = MAX (MIN_MERGE_BUFFER_SIZE, 256 * 1024);
    specified by the user.  Zero if the user has not specified a size.  */
 static size_t sort_size;

-/* The guessed size for non-regular files.  */
-#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+/* The initial allocation factor for non-regular files.
+   This is used, e.g., when reading from a pipe.
+   Don't make it too big, since it is multiplied by ~130 to
+   obtain the size of the actual buffer sort will allocate.
+   Also, there may be 8 threads all doing this at the same time.  */
+#define INPUT_FILE_SIZE_GUESS (128 * 1024)

 /* Array of directory names in which any temporary files are to be created.  */
 static char const **temp_dirs;
--
1.7.4.1.430.g5aa4d
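[Editorial note] The "~130" multiplier in the new comment puts the change in perspective; treating that factor as the thread's approximation rather than an exact constant, the up-front allocation per pipe reader drops roughly eightfold:

```shell
# Per-reader buffer sizes implied by the ~130x factor mentioned in the
# patch comment (the factor itself is an approximation from this thread):
factor=130
old_guess=$((1024 * 1024))
new_guess=$((128 * 1024))
echo "old: $((old_guess * factor / 1024 / 1024)) MiB"   # => old: 130 MiB
echo "new: $((new_guess * factor / 1024 / 1024)) MiB"   # => new: 16 MiB
```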
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
On 16/03/11 12:07, Jim Meyering wrote:
> Pádraig Brady wrote:
>> I've not fully analyzed this yet, and I'm not saying it's wrong,
>> but the above change seems to have a large effect on thread
>> creation when smaller buffers are used (you hinted previously
>> that being less aggressive with the amount of mem used by default
>> might be appropriate, and I agree).
>>
>> Anyway with the above I seem to need a buffer size of more
>> than 10M to have any threads created at all.
>>
>> Testing the original 4 lines heuristic with the following shows:
>> (note I only get > 4 threads after 4M of input, not 7 for 16 lines
>> as indicated in NEWS).
>>
>> $ for i in $(seq 30); do
>>> j=$((2<<$i))
>>> yes | head -n$j > t.sort
>>> strace -c -e clone sort --parallel=16 t.sort -o /dev/null 2>&1 |
>>>    join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>>>    sed -n "s/\([0-9]*\) clone/$j\t\1/p"
>>> done
>> 4        1
>> 8        2
>> 16       3
>> 32       4
>> 64       4
>> 128      4
> ...
>> 1048576  4
>> 2097152  4
>> 4194304  8
>> 8388608  16
>>
>> When I restrict the buffer size with '-S 1M', many more threads
>> are created (a max of 16 in parallel with the above command)
>> 4        1
>> 8        2
>> 16       3
>> 32       4
>> 64       4
>> 128      4
>> 256      4
>> 512      4
>> 1024     4
>> 2048     4
>> 4096     4
>> 8192     4
>> 16384    8
>> 32768    12
>> 65536    24
>> 131072   44
>> 262144   84
>> 524288   167
>> 1048576  332
>> 2097152  660
>> 4194304  1316
>> 8388608  2628
>>
>> After increasing the heuristic to 128K, I get _no_ threads until -S > 10M,
>> and this seems to be independent of line length.
>
> Thanks for investigating that.
> Could strace -c -e clone be doing something unexpected?
> When I run this (without my patch), it would use 8 threads:
>
>   seq 16 > in; strace -ff -o k ./sort --parallel=16 in -o /dev/null
>
> since it created eight k.PID files:
>
>   $ ls -1 k.*|wc -l
>   8
>
> Now, for such a small file, it does not call clone at all.

Oops, yep, I forgot to add -f to strace.
So NEWS is correct.
# SUBTHREAD_LINES_HEURISTIC = 4
$ for i in $(seq 22); do
    j=$((2<<$i))
    yes | head -n$j > t.sort
    strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
      join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
      sed -n "s/\([0-9]*\) clone/$j\t\1/p"
  done
4        1
8        3
16       7
32       15
64       15
128      15
256      15
512      15
1024     15
2048     15
4096     15
8192     15
16384    15
32768    15
65536    15
131072   15
262144   15
524288   15
1048576  15
2097152  15
4194304  30
8388608  45

# As above, but add -S1M option to sort

4        1
8        3
16       7
32       15
64       15
128      15
256      15
512      15
1024     15
2048     15
4096     15
8192     15
16384    30
32768    45
65536    90
131072   165
262144   315
524288   622
1048576  1245
2097152  2475
4194304  4935
8388608  9855

With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get no
threads as nlines never gets above 12787 (there looks to be around 80
bytes overhead per line?).
Only when -S >= 12M do we get nlines high enough to create threads.

cheers,
Pádraig.
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Pádraig Brady wrote:
> I've not fully analyzed this yet, and I'm not saying it's wrong,
> but the above change seems to have a large effect on thread
> creation when smaller buffers are used (you hinted previously
> that being less aggressive with the amount of mem used by default
> might be appropriate, and I agree).
>
> Anyway with the above I seem to need a buffer size of more
> than 10M to have any threads created at all.
>
> Testing the original 4 lines heuristic with the following shows:
> (note I only get > 4 threads after 4M of input, not 7 for 16 lines
> as indicated in NEWS).
>
> $ for i in $(seq 30); do
>> j=$((2<<$i))
>> yes | head -n$j > t.sort
>> strace -c -e clone sort --parallel=16 t.sort -o /dev/null 2>&1 |
>>    join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>>    sed -n "s/\([0-9]*\) clone/$j\t\1/p"
>> done
> 4        1
> 8        2
> 16       3
> 32       4
> 64       4
> 128      4
...
> 1048576  4
> 2097152  4
> 4194304  8
> 8388608  16
>
> When I restrict the buffer size with '-S 1M', many more threads
> are created (a max of 16 in parallel with the above command)
> 4        1
> 8        2
> 16       3
> 32       4
> 64       4
> 128      4
> 256      4
> 512      4
> 1024     4
> 2048     4
> 4096     4
> 8192     4
> 16384    8
> 32768    12
> 65536    24
> 131072   44
> 262144   84
> 524288   167
> 1048576  332
> 2097152  660
> 4194304  1316
> 8388608  2628
>
> After increasing the heuristic to 128K, I get _no_ threads until -S > 10M,
> and this seems to be independent of line length.

Thanks for investigating that.
Could strace -c -e clone be doing something unexpected?
When I run this (without my patch), it would use 8 threads:

  seq 16 > in; strace -ff -o k ./sort --parallel=16 in -o /dev/null

since it created eight k.PID files:

  $ ls -1 k.*|wc -l
  8

Now, for such a small file, it does not call clone at all.
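[Editorial note] The join/sed pipeline used throughout this thread is one way to pull the call count out of an `strace -c` summary; a hypothetical awk helper does the same job and can be sanity-checked against a canned summary line (the standard `strace -c` column layout, with calls in the fourth column, is assumed):

```shell
# count_clones: print the clone() call count from an `strace -c` summary
# read on stdin, or 0 if no clone line is present.  Assumes the usual
# columns: %time, seconds, usecs/call, calls, syscall.
count_clones() {
  awk '$NF == "clone" { print $4; found = 1 }
       END { if (!found) print 0 }'
}

# Checked against a canned strace -c summary line:
printf '%s\n' '  0.00    0.000000           0        15           clone' |
  count_clones                  # => 15
```

In real use it would sit where the join/sed pair does, e.g. `strace -f -c -e clone ./sort ... 2>&1 | count_clones`.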
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Pádraig Brady wrote:
...
> I still get bad performance for the above with SUBTHREAD_LINES_HEURISTIC=128K

Sorry I haven't had time for this today.
I'll investigate tomorrow.

> So as you suggested, the large mem allocation when reading from a pipe
> is a problem, and in fact seems to be the main problem.  Now given the
> memory isn't actually used it shouldn't be such an issue, but if one
> has MALLOC_PERTURB_ set, then it is used, and it has a huge impact.
> Compare:
>
> $ for i in $(seq 33); do seq 88| MALLOC_PERTURB_= timeout 2 sort
>     --para=1 >/dev/null & done
> $ for i in $(seq 33); do seq 88| MALLOC_PERTURB_=1 timeout 2 sort
>     --para=1 >/dev/null & done

Good point!

> So we should be more conservative in memory allocation in sort,
> and be more aligned with CPU cache sizes than RAM sizes I suspect.
> This will be an increasing problem as we tend to run more in ||.
> It would be interesting I think to sort first by L1 cache size,
> then by L2, etc, but as a first pass, a more sensible default
> of 8MB or so seems appropriate.
>
> As a general note, MALLOC_PERTURB_ should be unset when benchmarking
> anything to do with `sort`.
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
On 14/03/11 15:31, Pádraig Brady wrote:
> On 12/03/11 15:34, Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
>>>> Jim Meyering wrote:
>>>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>>>> It worked fine yesterday.  I tracked it down to all of the make
>>>>> processes working on the "built_programs.list" (in src/Makefile.am) rule
>>>>>
>>>>> built_programs.list:
>>>>>         @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>>>           | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>>>
>>>>> Which made me realize we were running that submake over 400 times,
>>>>> once per test script (including skipped ones).  That's well worth
>>>>> avoiding, even if it means a new temporary file.
>>>>>
>>>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>>>> minutes of a seemingly hung and barely responsive system) or why it
>>>>> started happening today (afaics, none of the programs involved was
>>>>> updated), but this does fix it...
>>>>
>>>> FYI,
>>>> I've tracked this down a little further.
>>>> The horrid performance (hung system and eventual OOM-kill)
>>>> is related to the use of sort above.  This is the definition:
>>>>
>>>>   ASSORT = LC_ALL=C sort
>>>>
>>>> If I revert my earlier patch and instead simply
>>>> insist that sort not do anything in parallel,
>>>>
>>>>   ASSORT = LC_ALL=C sort --parallel=1
>>>>
>>>> then there is no hang, and things finish in relatively good time.
>>>>
>>>> I don't have a good stand-alone reproducer yet
>>>> and am out of time for today.
>>
>> After updating to a new F14 kernel,
>> I never managed to reproduce that, but maybe
>> that's just a coincidence...
>
>>> Well, right after writing that, I realized the key to the misbehavior:
>>> sort was reading from a *pipe*:
>>>
>>> # This finishes right away, reading from input file "k":
>>> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>>>
>>> # When reading from a pipe, it's a very different story:
>>> # Without the "timeout 7" prefix, the following would render an N-core
>>> # system (5
>>> # my system goes unresponsive after 1 second, and doesn't return until timeout.
>>> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done

I still get bad performance for the above with SUBTHREAD_LINES_HEURISTIC=128K

So as you suggested, the large mem allocation when reading from a pipe
is a problem, and in fact seems to be the main problem.  Now given the
memory isn't actually used it shouldn't be such an issue, but if one
has MALLOC_PERTURB_ set, then it is used, and it has a huge impact.
Compare:

$ for i in $(seq 33); do seq 88| MALLOC_PERTURB_= timeout 2 sort --para=1 >/dev/null & done
$ for i in $(seq 33); do seq 88| MALLOC_PERTURB_=1 timeout 2 sort --para=1 >/dev/null & done

So we should be more conservative in memory allocation in sort,
and be more aligned with CPU cache sizes than RAM sizes I suspect.
This will be an increasing problem as we tend to run more in ||.
It would be interesting I think to sort first by L1 cache size,
then by L2, etc, but as a first pass, a more sensible default
of 8MB or so seems appropriate.

As a general note, MALLOC_PERTURB_ should be unset when benchmarking
anything to do with `sort`.

cheers,
Pádraig.
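[Editorial note] Pádraig's MALLOC_PERTURB_ point is easy to quantify with the thread's own numbers. With the variable set, glibc writes over every allocated and freed block, so the normally-untouched pipe buffers become resident memory; both figures below are this thread's approximations:

```shell
# 33 background sorts, each with a ~130 MiB pipe-input buffer that
# MALLOC_PERTURB_ forces glibc to actually write (figures approximate,
# taken from earlier in this thread):
jobs=33
per_job_mib=130
echo "$((jobs * per_job_mib)) MiB"   # => 4290 MiB really touched
```

That is roughly 4.2 GiB of dirtied pages on a box that may have less RAM than that, which matches the observed hangs and OOM kills.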
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
On 12/03/11 15:34, Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
>>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>>> It worked fine yesterday.  I tracked it down to all of the make
>>>> processes working on the "built_programs.list" (in src/Makefile.am) rule
>>>>
>>>> built_programs.list:
>>>>         @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>>           | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>>
>>>> Which made me realize we were running that submake over 400 times,
>>>> once per test script (including skipped ones).  That's well worth
>>>> avoiding, even if it means a new temporary file.
>>>>
>>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>>> minutes of a seemingly hung and barely responsive system) or why it
>>>> started happening today (afaics, none of the programs involved was
>>>> updated), but this does fix it...
>>>
>>> FYI,
>>> I've tracked this down a little further.
>>> The horrid performance (hung system and eventual OOM-kill)
>>> is related to the use of sort above.  This is the definition:
>>>
>>>   ASSORT = LC_ALL=C sort
>>>
>>> If I revert my earlier patch and instead simply
>>> insist that sort not do anything in parallel,
>>>
>>>   ASSORT = LC_ALL=C sort --parallel=1
>>>
>>> then there is no hang, and things finish in relatively good time.
>>>
>>> I don't have a good stand-alone reproducer yet
>>> and am out of time for today.
>
> After updating to a new F14 kernel,
> I never managed to reproduce that, but maybe
> that's just a coincidence...
>
>> Well, right after writing that, I realized the key to the misbehavior:
>> sort was reading from a *pipe*:
>>
>> # This finishes right away, reading from input file "k":
>> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>>
>> # When reading from a pipe, it's a very different story:
>> # Without the "timeout 7" prefix, the following would render an N-core
>> # system (5
>> # my system goes unresponsive after 1 second, and doesn't return until timeout.
>> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done
>>
>> Occasionally, the above jobs all complete quickly.
>>
>> My first question was why were *any* processes being spawned to handle
>> such a small input file.  The first answer is in the first hunk:
>>
>> diff --git a/src/sort.c b/src/sort.c
>> index 13954cb..b9316e7 100644
>> --- a/src/sort.c
>> +++ b/src/sort.c
>> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>>  /* Heuristic value for the number of lines for which it is worth
>>     creating a subthread, during an internal merge sort, on a machine
>>     that has processors galore.  Currently this number is just a guess.
>> -   This value must be at least 4.  We don't know of any machine where
>> -   this number has any practical effect.  */
>> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
>> +   This value must be at least 4.  */
>> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>>
>>  /* The number of threads after which there are
>>     diminishing performance gains.  */
>>
>> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
>> of 5 or more lines would be split in two and sorted via two (or more)
>> separate threads.  Thus, with just 40 lines, you could get the maximum
>> of 8 threads working.  That is obviously not efficient, unless lines are
>> so pathologically long that the cost of comparing two of them approaches
>> the cost of creating a new process.
>>
>> With the above, sort would use a more reasonable number.
>> Tests on high-end hardware and using very short lines
>> suggest that a value like 200,000 would still be conservative.
>
> Here's a complete patch for that:
>
> From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
> From: Jim Meyering
> Date: Thu, 10 Feb 2011 08:48:27 +0100
> Subject: [PATCH] sort: spawn fewer threads for small inputs
>
> * src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
> for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
> appropriate for a 5-year-old dual-core laptop, but it is too low for
> some common combinations of short lines and/or newer systems.
> * NEWS (Bug fixes): Mention it.
> ---
>  NEWS       |    9 ++---
>  src/sort.c |   16 ++--
>  2 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/NEWS b/NEWS
> index 3157ef2..5770410 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,13 +4,16 @@ GNU coreutils NEWS                                -*- outline -*-
>
>  ** Bug fixes
>
> -  du would infloop when given --files0-from=DIR
> -  [bug introduced in coreutils-7.1]
> -
>    cut could segfault when invoked with a user-specified output
>    delimiter and an unbounded range like "-f1234567890-".
>    [bug introduced in coreutils-5.3.0]
>
> +  du would infloop when given --files0-from=DIR
> +  [bug introduced in coreutils-7.1]
> +
> +  sort no longer spawns 7 worker threads to sort 16 lines
> +  [bug introduced in coreutils-8.6]
> +
>    wc would dereference a NULL pointer up
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>> It worked fine yesterday.  I tracked it down to all of the make
>>> processes working on the "built_programs.list" (in src/Makefile.am) rule
>>>
>>> built_programs.list:
>>>         @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>           | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>
>>> Which made me realize we were running that submake over 400 times,
>>> once per test script (including skipped ones).  That's well worth
>>> avoiding, even if it means a new temporary file.
>>>
>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>> minutes of a seemingly hung and barely responsive system) or why it
>>> started happening today (afaics, none of the programs involved was
>>> updated), but this does fix it...
>>
>> FYI,
>> I've tracked this down a little further.
>> The horrid performance (hung system and eventual OOM-kill)
>> is related to the use of sort above.  This is the definition:
>>
>>   ASSORT = LC_ALL=C sort
>>
>> If I revert my earlier patch and instead simply
>> insist that sort not do anything in parallel,
>>
>>   ASSORT = LC_ALL=C sort --parallel=1
>>
>> then there is no hang, and things finish in relatively good time.
>>
>> I don't have a good stand-alone reproducer yet
>> and am out of time for today.

After updating to a new F14 kernel,
I never managed to reproduce that, but maybe
that's just a coincidence...

> Well, right after writing that, I realized the key to the misbehavior:
> sort was reading from a *pipe*:
>
> # This finishes right away, reading from input file "k":
> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>
> # When reading from a pipe, it's a very different story:
> # Without the "timeout 7" prefix, the following would render an N-core
> # system (5
> # my system goes unresponsive after 1 second, and doesn't return until timeout.
> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done
>
> Occasionally, the above jobs all complete quickly.
>
> My first question was why were *any* processes being spawned to handle
> such a small input file.  The first answer is in the first hunk:
>
> diff --git a/src/sort.c b/src/sort.c
> index 13954cb..b9316e7 100644
> --- a/src/sort.c
> +++ b/src/sort.c
> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>  /* Heuristic value for the number of lines for which it is worth
>     creating a subthread, during an internal merge sort, on a machine
>     that has processors galore.  Currently this number is just a guess.
> -   This value must be at least 4.  We don't know of any machine where
> -   this number has any practical effect.  */
> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
> +   This value must be at least 4.  */
> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>
>  /* The number of threads after which there are
>     diminishing performance gains.  */
>
> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
> of 5 or more lines would be split in two and sorted via two (or more)
> separate threads.  Thus, with just 40 lines, you could get the maximum
> of 8 threads working.  That is obviously not efficient, unless lines are
> so pathologically long that the cost of comparing two of them approaches
> the cost of creating a new process.
>
> With the above, sort would use a more reasonable number.
> Tests on high-end hardware and using very short lines
> suggest that a value like 200,000 would still be conservative.

Here's a complete patch for that:

From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
From: Jim Meyering
Date: Thu, 10 Feb 2011 08:48:27 +0100
Subject: [PATCH] sort: spawn fewer threads for small inputs

* src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
appropriate for a 5-year-old dual-core laptop, but it is too low for
some common combinations of short lines and/or newer systems.
* NEWS (Bug fixes): Mention it.
---
 NEWS       |    9 ++---
 src/sort.c |   16 ++--
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/NEWS b/NEWS
index 3157ef2..5770410 100644
--- a/NEWS
+++ b/NEWS
@@ -4,13 +4,16 @@ GNU coreutils NEWS                                -*- outline -*-

 ** Bug fixes

-  du would infloop when given --files0-from=DIR
-  [bug introduced in coreutils-7.1]
-
   cut could segfault when invoked with a user-specified output
   delimiter and an unbounded range like "-f1234567890-".
   [bug introduced in coreutils-5.3.0]

+  du would infloop when given --files0-from=DIR
+  [bug introduced in coreutils-7.1]
+
+  sort no longer spawns 7 worker threads to sort 16 lines
+  [bug introduced in coreutils-8.6]
+
   wc would dereference a NULL pointer up
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Jim Meyering wrote:
> Jim Meyering wrote:
>> Running "make -j25 check" on a nominal-12-core F14 system would
>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>> It worked fine yesterday.  I tracked it down to all of the make
>> processes working on the "built_programs.list" (in src/Makefile.am) rule
>>
>> built_programs.list:
>>         @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>           | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>
>> Which made me realize we were running that submake over 400 times,
>> once per test script (including skipped ones).  That's well worth
>> avoiding, even if it means a new temporary file.
>>
>> I don't know the root cause of the OOM-kill (preceded by interminable
>> minutes of a seemingly hung and barely responsive system) or why it
>> started happening today (afaics, none of the programs involved was
>> updated), but this does fix it...
>
> FYI,
> I've tracked this down a little further.
> The horrid performance (hung system and eventual OOM-kill)
> is related to the use of sort above.  This is the definition:
>
>   ASSORT = LC_ALL=C sort
>
> If I revert my earlier patch and instead simply
> insist that sort not do anything in parallel,
>
>   ASSORT = LC_ALL=C sort --parallel=1
>
> then there is no hang, and things finish in relatively good time.
>
> I don't have a good stand-alone reproducer yet
> and am out of time for today.

Well, right after writing that, I realized the key to the misbehavior:
sort was reading from a *pipe*:

  # This finishes right away, reading from input file "k":
  seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done

  # When reading from a pipe, it's a very different story:
  # Without the "timeout 7" prefix, the following would render an N-core
  # system (5
  # my system goes unresponsive after 1 second, and doesn't return until timeout.
  for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done

Occasionally, the above jobs all complete quickly.

My first question was why were *any* processes being spawned to handle
such a small input file.  The first answer is in the first hunk:

diff --git a/src/sort.c b/src/sort.c
index 13954cb..b9316e7 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
 /* Heuristic value for the number of lines for which it is worth
    creating a subthread, during an internal merge sort, on a machine
    that has processors galore.  Currently this number is just a guess.
-   This value must be at least 4.  We don't know of any machine where
-   this number has any practical effect.  */
-enum { SUBTHREAD_LINES_HEURISTIC = 4 };
+   This value must be at least 4.  */
+enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };

 /* The number of threads after which there are
    diminishing performance gains.  */

The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
of 5 or more lines would be split in two and sorted via two (or more)
separate threads.  Thus, with just 40 lines, you could get the maximum
of 8 threads working.  That is obviously not efficient, unless lines are
so pathologically long that the cost of comparing two of them approaches
the cost of creating a new process.

With the above, sort would use a more reasonable number.
Tests on high-end hardware and using very short lines
suggest that a value like 200,000 would still be conservative.

But even with that change, the above for-loop would sometimes hang.
Pursuing it further, I discovered that sort was allocating 130MB
(INPUT_FILE_SIZE_GUESS * 129, with the former being 1MiB) of space
for its input buffer, simply because it was reading from a pipe.

diff --git a/src/sort.c b/src/sort.c
index 13954cb..92c8d4e 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -316,7 +315,7 @@ static size_t merge_buffer_size = MAX (MIN_MERGE_BUFFER_SIZE, 256 * 1024);

 static size_t sort_size;

 /* The guessed size for non-regular files.  */
-#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+#define INPUT_FILE_SIZE_GUESS (64 * 1024)

 /* Array of directory names in which any temporary files are to be created.  */
 static char const **temp_dirs;

With that, the for-loop does not hang any more.
Note again that this is on a multi-core system.
I'll investigate more tomorrow.
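[Editorial note] Jim's "40 lines could get the maximum of 8 threads working" observation, and the NEWS line about 7 worker threads for 16 lines, can both be reproduced with a toy model of the old splitting rule. This is a sketch of my reading of the heuristic, not the real sort code: each level halves the lines and hands one half to a new thread, until a half drops below the threshold or the thread budget runs out.

```shell
# Toy model (assumptions: old threshold of 4 lines, budget of 8 concurrent
# threads; `split` is a hypothetical name, not a sort function).
spawned=0
split() {                        # $1 = nlines, $2 = thread budget
  local n=$1 t=$2
  if [ "$n" -ge 4 ] && [ "$t" -gt 1 ]; then
    spawned=$((spawned + 1))     # one new subthread for one half
    split $((n / 2)) $((t / 2))
    split $((n - n / 2)) $((t - t / 2))
  fi
}
split 16 8
echo "$spawned"                  # => 7 subthreads for a 16-line input
```

Running `split 40 8` also yields 7 spawned subthreads, i.e. 8 threads working once the main thread is counted, matching the message above.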
Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Jim Meyering wrote, On 02/09/2011 12:29 PM:
> Jim Meyering wrote:
>> Running "make -j25 check" on a nominal-12-core F14 system would
>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>
> If I revert my earlier patch and instead simply
> insist that sort not do anything in parallel,
>
>   ASSORT = LC_ALL=C sort --parallel=1
>
> then there is no hang, and things finish in relatively good time.

I previously noticed the memory issue as a surprising (for me, at least)
side-effect of parallel sort
(http://lists.gnu.org/archive/html/coreutils/2010-12/msg00079.html and
http://lists.gnu.org/archive/html/coreutils/2010-12/msg00084.html).

But I'm noticing another side-effect of the new default behavior:
on a shared Linux system (both on a server with many cores, and on an
SGE cluster), where users are using 'sort' as part of their scripts,
memory usage and average load sometimes peak beyond what is available,
because each sort process now uses up to 8 cores by default.

So while users implicitly assume they use 1 core (at least with SGE,
where you can specify how many threads your job will require), in
practice they use many more.

Globally setting OMP_NUM_THREADS=1 restores the old behavior, and only
explicitly specifying "--parallel=X" lets sort use more than one thread,
but I'm wondering if it's not better to default back to 1 core, and
require an explicit "--parallel" for a multi-threaded sort.

-gordon
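[Editorial note] Both knobs Gordon mentions can be shown side by side. A minimal sketch for a shared-host script (GNU coreutils sort assumed; OMP_NUM_THREADS works because sort's default thread count comes from an overridable processor-count lookup, as the message above describes):

```shell
# Pin sort to a single thread, two ways:
export OMP_NUM_THREADS=1             # global: every sort launched below is single-threaded
printf 'b\na\nc\n' | sort | tr '\n' ' '; echo
printf 'b\na\nc\n' | sort --parallel=1 | tr '\n' ' '; echo   # explicit per-invocation knob
```

In an SGE job script the exported variable is the less invasive choice, since it also covers sorts buried inside other tools' pipelines.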
parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...
Jim Meyering wrote:
> Running "make -j25 check" on a nominal-12-core F14 system would
> cause serious difficulty leading to an OOM kill -- and this is brand new.
> It worked fine yesterday.  I tracked it down to all of the make
> processes working on the "built_programs.list" (in src/Makefile.am) rule
>
> built_programs.list:
>         @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>           | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>
> Which made me realize we were running that submake over 400 times,
> once per test script (including skipped ones).  That's well worth
> avoiding, even if it means a new temporary file.
>
> I don't know the root cause of the OOM-kill (preceded by interminable
> minutes of a seemingly hung and barely responsive system) or why it
> started happening today (afaics, none of the programs involved was
> updated), but this does fix it...

FYI,
I've tracked this down a little further.
The horrid performance (hung system and eventual OOM-kill)
is related to the use of sort above.  This is the definition:

  ASSORT = LC_ALL=C sort

If I revert my earlier patch and instead simply
insist that sort not do anything in parallel,

  ASSORT = LC_ALL=C sort --parallel=1

then there is no hang, and things finish in relatively good time.

I don't have a good stand-alone reproducer yet
and am out of time for today.