Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-17 Thread Jim Meyering
Pádraig Brady wrote:
> On 16/03/11 15:32, Jim Meyering wrote:
>> Pádraig Brady wrote:
>>>
>>> With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get
>>> no threads, as nlines never gets above 12787 (there looks to be
>>> around 80 bytes overhead per line?).
>>> Only when -S >= 12M do we get nlines high enough to create threads.
>>
>> Thanks for pursuing this.
>> Here's a proposed patch to address the other problem.
>> It doesn't have much of an effect (any?) on your
>> issue when using very little memory, but when a sort user
>> specifies -S1M, I think they probably want to avoid the
>> expense (memory) of going multi-threaded.
>>
>> What do you think?
>>
>> -#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
>> +#define INPUT_FILE_SIZE_GUESS (128 * 1024)
>
> This does seem a bit like whack-a-mole
> but at least we're lining them up better.
>
> The above gives reasonable threading by default,
> while reducing the large upfront malloc.
>
> $ for len in 1 79; do
> for i in $(seq 22); do
>   lines=$((2<<$i))
>   yes "$(printf %${len}s)"| head -n$lines > t.sort
>   strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
>   join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>   sed -n "s/\([0-9]*\) clone/$lines\t\1/p"
> done
>   done
>
> #lines  threads (2 byte lines)
> --
> 131072  1
> 262144  3
> 524288  7
> 1048576 15
> 2097152 15
> 4194304 15
> 8388608 15
>
> #lines  threads (80 byte lines)
> --
> 131072  1
> 262144  3
> 524288  7
> 1048576 15
> 2097152 15
> 4194304 22
> 8388608 60

Thanks for testing and review.
Pushed.



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-16 Thread Pádraig Brady
On 16/03/11 15:32, Jim Meyering wrote:
> Pádraig Brady wrote:
>>
>> With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get no
>> threads, as nlines never gets above 12787 (there looks to be around 80
>> bytes overhead per line?).
>> Only when -S >= 12M do we get nlines high enough to create threads.
> 
> Thanks for pursuing this.
> Here's a proposed patch to address the other problem.
> It doesn't have much of an effect (any?) on your
> issue when using very little memory, but when a sort user
> specifies -S1M, I think they probably want to avoid the
> expense (memory) of going multi-threaded.
> 
> What do you think?
> 
> -#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
> +#define INPUT_FILE_SIZE_GUESS (128 * 1024)

This does seem a bit like whack-a-mole
but at least we're lining them up better.

The above gives reasonable threading by default,
while reducing the large upfront malloc.

$ for len in 1 79; do
for i in $(seq 22); do
  lines=$((2<<$i))
  yes "$(printf %${len}s)"| head -n$lines > t.sort
  strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
  join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
  sed -n "s/\([0-9]*\) clone/$lines\t\1/p"
done
  done

#lines  threads (2 byte lines)
--
131072  1
262144  3
524288  7
1048576 15
2097152 15
4194304 15
8388608 15

#lines  threads (80 byte lines)
--
131072  1
262144  3
524288  7
1048576 15
2097152 15
4194304 22
8388608 60

cheers,
Pádraig.



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-16 Thread Jim Meyering
Pádraig Brady wrote:
> # SUBTHREAD_LINES_HEURISTIC = 4
> $ for i in $(seq 22); do
> j=$((2<<$i))
> yes | head -n$j > t.sort
> strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
> join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
> sed -n "s/\([0-9]*\) clone/$j\t\1/p"
>   done
> 4   1
> 8   3
> 16  7
> 32  15
> 64  15
> 128 15
> 256 15
> 512 15
> 1024    15
> 2048    15
> 4096    15
> 8192    15
> 16384   15
> 32768   15
> 65536   15
> 131072  15
> 262144  15
> 524288  15
> 1048576 15
> 2097152 15
> 4194304 30
> 8388608 45
>
> # As above, but add -S1M option to sort
>
> 4   1
> 8   3
> 16  7
> 32  15
> 64  15
> 128 15
> 256 15
> 512 15
> 1024    15
> 2048    15
> 4096    15
> 8192    15
> 16384   30
> 32768   45
> 65536   90
> 131072  165
> 262144  315
> 524288  622
> 1048576 1245
> 2097152 2475
> 4194304 4935
> 8388608 9855
>
> With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get no
> threads, as nlines never gets above 12787 (there looks to be around 80
> bytes overhead per line?).
> Only when -S >= 12M do we get nlines high enough to create threads.

Thanks for pursuing this.
Here's a proposed patch to address the other problem.
It doesn't have much of an effect (any?) on your
issue when using very little memory, but when a sort user
specifies -S1M, I think they probably want to avoid the
expense (memory) of going multi-threaded.

What do you think?

>From 4f591fdd0bb78f621d2b72021de883fc4df1e179 Mon Sep 17 00:00:00 2001
From: Jim Meyering 
Date: Wed, 16 Mar 2011 16:09:31 +0100
Subject: [PATCH] sort: avoid memory pressure of 130MB/thread when reading
 from pipe

* src/sort.c (INPUT_FILE_SIZE_GUESS): Decrease initial allocation
factor used to size buffer used when reading a non-regular file.
For motivation, see discussion here:
http://thread.gmane.org/gmane.comp.gnu.coreutils.general/878/focus=887
---
 src/sort.c |8 ++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 9b8666a..07d6765 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -319,8 +319,12 @@ static size_t merge_buffer_size = MAX (MIN_MERGE_BUFFER_SIZE, 256 * 1024);
specified by the user.  Zero if the user has not specified a size.  */
 static size_t sort_size;

-/* The guessed size for non-regular files.  */
-#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+/* The initial allocation factor for non-regular files.
+   This is used, e.g., when reading from a pipe.
+   Don't make it too big, since it is multiplied by ~130 to
+   obtain the size of the actual buffer sort will allocate.
+   Also, there may be 8 threads all doing this at the same time.  */
+#define INPUT_FILE_SIZE_GUESS (128 * 1024)

 /* Array of directory names in which any temporary files are to be created. */
 static char const **temp_dirs;
--
1.7.4.1.430.g5aa4d



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-16 Thread Pádraig Brady
On 16/03/11 12:07, Jim Meyering wrote:
> Pádraig Brady wrote:
>> I've not fully analyzed this yet, and I'm not saying it's wrong,
>> but the above change seems to have a large effect on thread
>> creation when smaller buffers are used (you hinted previously
>> that being less aggressive with the amount of mem used by default
>> might be appropriate, and I agree).
>>
>> Anyway with the above I seem to need a buffer size more
>> than 10M to have any threads created at all.
>>
>> Testing the original 4 lines heuristic with the following, shows:
>> (note I only get > 4 threads after 4M of input, not 7 for 16 lines
>> as indicated in NEWS).
>>
>> $ for i in $(seq 30); do
>>>   j=$((2<<$i))
>>>   yes | head -n$j > t.sort
>>>   strace -c -e clone sort --parallel=16 t.sort -o /dev/null 2>&1 |
>>>join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>>>sed -n "s/\([0-9]*\) clone/$j\t\1/p"
>>> done
>> 4   1
>> 8   2
>> 16  3
>> 32  4
>> 64  4
>> 128 4
> ...
>> 1048576 4
>> 2097152 4
>> 4194304 8
>> 8388608 16
>>
>> When I restrict the buffer size with '-S 1M', many more threads
>> are created (a max of 16 in parallel with the above command)
>> 4   1
>> 8   2
>> 16  3
>> 32  4
>> 64  4
>> 128 4
>> 256 4
>> 512 4
>> 1024    4
>> 2048    4
>> 4096    4
>> 8192    4
>> 16384   8
>> 32768   12
>> 65536   24
>> 131072  44
>> 262144  84
>> 524288  167
>> 1048576 332
>> 2097152 660
>> 4194304 1316
>> 8388608 2628
>>
>> After increasing the heuristic to 128K, I get _no_ threads until -S > 10M
>> and this seems to be independent of line length.
> 
> Thanks for investigating that.
> Could strace -c -e clone be doing something unexpected?
> When I run this (without my patch), it would use 8 threads:
> 
> seq 16 > in; strace -ff -o k ./sort --parallel=16 in -o /dev/null
> 
> since it created eight k.PID files:
> 
> $ ls -1 k.*|wc -l
> 8
> 
> Now, for such a small file, it does not call clone at all.
> 

Oops, yep I forgot to add -f to strace.
So NEWS is correct.

# SUBTHREAD_LINES_HEURISTIC = 4
$ for i in $(seq 22); do
j=$((2<<$i))
yes | head -n$j > t.sort
strace -f -c -e clone ./sort --parallel=16 t.sort -o /dev/null 2>&1 |
join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
sed -n "s/\([0-9]*\) clone/$j\t\1/p"
  done
4   1
8   3
16  7
32  15
64  15
128 15
256 15
512 15
1024    15
2048    15
4096    15
8192    15
16384   15
32768   15
65536   15
131072  15
262144  15
524288  15
1048576 15
2097152 15
4194304 30
8388608 45

# As above, but add -S1M option to sort

4   1
8   3
16  7
32  15
64  15
128 15
256 15
512 15
1024    15
2048    15
4096    15
8192    15
16384   30
32768   45
65536   90
131072  165
262144  315
524288  622
1048576 1245
2097152 2475
4194304 4935
8388608 9855

With SUBTHREAD_LINES_HEURISTIC=128k and -S1M option to sort we get no threads,
as nlines never gets above 12787 (there looks to be around 80 bytes overhead
per line?).
Only when -S >= 12M do we get nlines high enough to create threads.
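
A quick back-of-the-envelope check of that overhead figure; this is just the
observed nlines divided into the 1MiB buffer, not anything read out of sort.c:

#include <stdio.h>

int
main (void)
{
  double buf = 1024 * 1024;        /* -S1M */
  double nlines = 12787;           /* observed ceiling for nlines */
  double per_line = buf / nlines;  /* ~82 bytes accounted per stored line */
  double payload = 2;              /* "yes" output is just "y\n" */
  printf ("per line: %.1f bytes, overhead: %.1f bytes\n",
          per_line, per_line - payload);
  /* ~80 bytes of bookkeeping per line.  At that rate a 128K-line
     heuristic needs roughly 128K * 82 bytes ~= 10.5MB of buffer,
     which matches only seeing threads once -S reaches ~12M.  */
  return 0;
}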

cheers,
Pádraig.



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-16 Thread Jim Meyering
Pádraig Brady wrote:
> I've not fully analyzed this yet, and I'm not saying it's wrong,
> but the above change seems to have a large effect on thread
> creation when smaller buffers are used (you hinted previously
> that being less aggressive with the amount of mem used by default
> might be appropriate, and I agree).
>
> Anyway with the above I seem to need a buffer size more
> than 10M to have any threads created at all.
>
> Testing the original 4 lines heuristic with the following, shows:
> (note I only get > 4 threads after 4M of input, not 7 for 16 lines
> as indicated in NEWS).
>
> $ for i in $(seq 30); do
>>   j=$((2<<$i))
>>   yes | head -n$j > t.sort
>>   strace -c -e clone sort --parallel=16 t.sort -o /dev/null 2>&1 |
>>join --nocheck-order -a1 -o1.4,1.5 - /dev/null |
>>sed -n "s/\([0-9]*\) clone/$j\t\1/p"
>> done
> 4   1
> 8   2
> 16  3
> 32  4
> 64  4
> 128 4
...
> 1048576 4
> 2097152 4
> 4194304 8
> 8388608 16
>
> When I restrict the buffer size with '-S 1M', many more threads
> are created (a max of 16 in parallel with the above command)
> 4   1
> 8   2
> 16  3
> 32  4
> 64  4
> 128 4
> 256 4
> 512 4
> 1024    4
> 2048    4
> 4096    4
> 8192    4
> 16384   8
> 32768   12
> 65536   24
> 131072  44
> 262144  84
> 524288  167
> 1048576 332
> 2097152 660
> 4194304 1316
> 8388608 2628
>
> After increasing the heuristic to 128K, I get _no_ threads until -S > 10M
> and this seems to be independent of line length.

Thanks for investigating that.
Could strace -c -e clone be doing something unexpected?
When I run this (without my patch), it would use 8 threads:

seq 16 > in; strace -ff -o k ./sort --parallel=16 in -o /dev/null

since it created eight k.PID files:

$ ls -1 k.*|wc -l
8

Now, for such a small file, it does not call clone at all.



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-14 Thread Jim Meyering
Pádraig Brady wrote:
...
> I still get bad performance for the above with SUBTHREAD_LINES_HEURISTIC=128K

Sorry I haven't had time for this today.
I'll investigate tomorrow.

> So as you suggested, the large mem allocation when reading from a pipe
> is a problem, and in fact seems to be the main problem.  Now given the
> memory isn't actually used it shouldn't be such an issue, but if one
> has MALLOC_PERTURB_ set, then it is used, and it has a huge impact. Compare:
>
> $ for i in $(seq 33); do seq 88| MALLOC_PERTURB_= timeout 2 sort --para=1 >/dev/null & done
> $ for i in $(seq 33); do seq 88| MALLOC_PERTURB_=1 timeout 2 sort --para=1 >/dev/null & done

Good point!

> So we should be more conservative in memory allocation in sort,
> and be more aligned with CPU cache sizes than RAM sizes I suspect.
> This will be an increasing problem as we tend to run more in ||.
> It would be interesting I think to sort first by L1 cache size,
> then by L2, etc, but as a first pass, a more sensible default
> of 8MB or so seems appropriate.
>
> As a general note, MALLOC_PERTURB_ should be unset when benchmarking
> anything to do with `sort`



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-14 Thread Pádraig Brady
On 14/03/11 15:31, Pádraig Brady wrote:
> On 12/03/11 15:34, Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
 Jim Meyering wrote:
> Running "make -j25 check" on a nominal-12-core F14 system would
> cause serious difficulty leading to an OOM kill -- and this is brand new.
> It worked fine yesterday.  I tracked it down to all of the make processes
> working on the "built_programs.list" (in src/Makefile.am) rule
>
> built_programs.list:
>   @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
> | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>
> Which made me realize we were running that submake over 400 times,
> once per test script (including skipped ones).  That's well worth
> avoiding, even if it means a new temporary file.
>
> I don't know the root cause of the OOM-kill (preceded by interminable
> minutes of a seemingly hung and barely responsive system) or why it started
> happening today (afaics, none of the programs involved was updated),
> but this does fix it...

 FYI,
 I've tracked this down a little further.
 The horrid performance (hung system and eventual OOM-kill)
 is related to the use of sort above.  This is the definition:

 ASSORT = LC_ALL=C sort

 If I revert my earlier patch and instead simply
 insist that sort not do anything in parallel,

 ASSORT = LC_ALL=C sort --parallel=1

 then there is no hang, and things finish in relatively good time.

 I don't have a good stand-alone reproducer yet
 and am out of time for today.
>>
>> After updating to a new F14 kernel,
>> I never managed to reproduce that, but maybe
>> that's just a coincidence...
>>
>>> Well, right after writing that, I realized the key to the misbehavior:
>>> sort was reading from a *pipe*:
>>>
>>> # This finishes right away, reading from input file "k":
>>> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>>>
>>> # When reading from a pipe, it's a very different story:
>>> # Without the "timeout 7" prefix, the following would render an N-core
>>> # system (5 < N) unusable:
>>> # my system goes unresponsive after 1 second, and doesn't return until timeout.
>>> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done

I still get bad performance for the above with SUBTHREAD_LINES_HEURISTIC=128K

So as you suggested, the large mem allocation when reading from a pipe is a
problem, and in fact seems to be the main problem.  Now given the memory isn't
actually used it shouldn't be such an issue, but if one has MALLOC_PERTURB_
set, then it is used, and it has a huge impact. Compare:

$ for i in $(seq 33); do seq 88| MALLOC_PERTURB_=  timeout 2 sort --para=1 >/dev/null & done
$ for i in $(seq 33); do seq 88| MALLOC_PERTURB_=1 timeout 2 sort --para=1 >/dev/null & done
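
For anyone who hasn't run into it: with glibc, MALLOC_PERTURB_ makes malloc
write a byte pattern over every block it hands out (and over freed blocks),
so an allocation that the program never touches still gets every page
committed.  A minimal stand-alone illustration of just that effect, sized to
match the ~130MB pipe buffer discussed here (nothing below is sort code):

#include <stdio.h>
#include <stdlib.h>

int
main (void)
{
  size_t guess = 1024 * 1024;     /* old INPUT_FILE_SIZE_GUESS */
  size_t bufsize = guess * 129;   /* ~129 MiB, per the figures in this thread */
  char *buf = malloc (bufsize);
  if (!buf)
    return 1;
  /* With MALLOC_PERTURB_ set to a nonzero value, glibc has already
     memset the whole block, so all ~129 MiB are resident even though
     this program never stores into it.  */
  printf ("allocated %zu bytes; check RSS now\n", bufsize);
  getchar ();                     /* pause so RSS can be inspected */
  free (buf);
  return 0;
}

Run it twice, once with MALLOC_PERTURB_ unset and once with MALLOC_PERTURB_=1,
and compare the resident set sizes.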

So we should be more conservative in memory allocation in sort,
and be more aligned with CPU cache sizes than RAM sizes I suspect.
This will be an increasing problem as we tend to run more in ||.
It would be interesting I think to sort first by L1 cache size,
then by L2, etc, but as a first pass, a more sensible default
of 8MB or so seems appropriate.
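
On the cache-size idea, glibc already exposes those numbers through
sysconf(); a rough sketch of where such a default could come from (the
_SC_LEVEL* names are a glibc extension and may report 0 or -1, so a real
implementation would need a fallback such as the 8MB suggested above):

#include <stdio.h>
#include <unistd.h>

int
main (void)
{
  long l1d = sysconf (_SC_LEVEL1_DCACHE_SIZE);
  long l2  = sysconf (_SC_LEVEL2_CACHE_SIZE);
  long l3  = sysconf (_SC_LEVEL3_CACHE_SIZE);
  printf ("L1d: %ld  L2: %ld  L3: %ld (bytes)\n", l1d, l2, l3);
  /* e.g. pick MAX (8MB, a small multiple of the largest reported cache)
     as the default in-memory sort buffer.  */
  return 0;
}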

As a general note, MALLOC_PERTURB_ should be unset when benchmarking anything 
to do with `sort`

cheers,
Pádraig.



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-14 Thread Pádraig Brady
On 12/03/11 15:34, Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
 Running "make -j25 check" on a nominal-12-core F14 system would
 cause serious difficulty leading to an OOM kill -- and this is brand new.
 It worked fine yesterday.  I tracked it down to all of the make processes
 working on the "built_programs.list" (in src/Makefile.am) rule

 built_programs.list:
@echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
  | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '

 Which made me realize we were running that submake over 400 times,
 once per test script (including skipped ones).  That's well worth
 avoiding, even if it means a new temporary file.

 I don't know the root cause of the OOM-kill (preceded by interminable
 minutes of a seemingly hung and barely responsive system) or why it started
 happening today (afaics, none of the programs involved was updated),
 but this does fix it...
>>>
>>> FYI,
>>> I've tracked this down a little further.
>>> The horrid performance (hung system and eventual OOM-kill)
>>> is related to the use of sort above.  This is the definition:
>>>
>>> ASSORT = LC_ALL=C sort
>>>
>>> If I revert my earlier patch and instead simply
>>> insist that sort not do anything in parallel,
>>>
>>> ASSORT = LC_ALL=C sort --parallel=1
>>>
>>> then there is no hang, and things finish in relatively good time.
>>>
>>> I don't have a good stand-alone reproducer yet
>>> and am out of time for today.
> 
> After updating to a new F14 kernel,
> I never managed to reproduce that, but maybe
> that's just a coincidence...
> 
>> Well, right after writing that, I realized the key to the misbehavior:
>> sort was reading from a *pipe*:
>>
>> # This finishes right away, reading from input file "k":
>> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>>
>> # When reading from a pipe, it's a very different story:
>> # Without the "timeout 7" prefix, the following would render an N-core
>> # system (5 < N) unusable:
>> # my system goes unresponsive after 1 second, and doesn't return until timeout.
>> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done
>>
>> Occasionally, the above jobs all complete quickly.
>>
>> My first question was why were *any* processes being spawned to handle
>> such a small input file.  The first answer is in the first hunk:
>>
>> diff --git a/src/sort.c b/src/sort.c
>> index 13954cb..b9316e7 100644
>> --- a/src/sort.c
>> +++ b/src/sort.c
>> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>>  /* Heuristic value for the number of lines for which it is worth
>> creating a subthread, during an internal merge sort, on a machine
>> that has processors galore.  Currently this number is just a guess.
>> -   This value must be at least 4.  We don't know of any machine where
>> -   this number has any practical effect.  */
>> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
>> +   This value must be at least 4.  */
>> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>>
>>  /* The number of threads after which there are
>> diminishing performance gains.  */
>>
>> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
>> of 5 or more lines would be split in two and sorted via two (or more)
>> separate threads.  Thus, with just 40 lines, you could get the maximum
>> of 8 threads working.  That is obviously not efficient, unless lines are
>> so pathologically long that the cost of comparing two of them approaches
>> the cost of creating a new process.
>>
>> With the above, sort would use a more reasonable number.
>> Tests on high-end hardware and using very short lines
>> suggest that a value like 200,000 would still be conservative.
> 
> Here's a complete patch for that:
> 
>>From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
> From: Jim Meyering 
> Date: Thu, 10 Feb 2011 08:48:27 +0100
> Subject: [PATCH] sort: spawn fewer threads for small inputs
> 
> * src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
> for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
> appropriate for a 5-year-old dual-core laptop, but it is too low for
> some common combinations of short lines and/or newer systems.
> * NEWS (Bug fixes): Mention it.
> ---
>  NEWS   |9 ++---
>  src/sort.c |   16 ++--
>  2 files changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/NEWS b/NEWS
> index 3157ef2..5770410 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -4,13 +4,16 @@ GNU coreutils NEWS        -*- outline -*-
> 
>  ** Bug fixes
> 
> -  du would infloop when given --files0-from=DIR
> -  [bug introduced in coreutils-7.1]
> -
>    cut could segfault when invoked with a user-specified output
>    delimiter and an unbounded range like "-f1234567890-".
>    [bug introduced in coreutils-5.3.0]
> 
> +  du would infloop when given --files0-from=DIR

Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-03-12 Thread Jim Meyering
Jim Meyering wrote:
> Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>> It worked fine yesterday.  I tracked it down to all of the make processes
>>> working on the "built_programs.list" (in src/Makefile.am) rule
>>>
>>> built_programs.list:
>>> @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>   | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>
>>> Which made me realize we were running that submake over 400 times,
>>> once per test script (including skipped ones).  That's well worth
>>> avoiding, even if it means a new temporary file.
>>>
>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>> minutes of a seemingly hung and barely responsive system) or why it started
>>> happening today (afaics, none of the programs involved was updated),
>>> but this does fix it...
>>
>> FYI,
>> I've tracked this down a little further.
>> The horrid performance (hung system and eventual OOM-kill)
>> is related to the use of sort above.  This is the definition:
>>
>> ASSORT = LC_ALL=C sort
>>
>> If I revert my earlier patch and instead simply
>> insist that sort not do anything in parallel,
>>
>> ASSORT = LC_ALL=C sort --parallel=1
>>
>> then there is no hang, and things finish in relatively good time.
>>
>> I don't have a good stand-alone reproducer yet
>> and am out of time for today.

After updating to a new F14 kernel,
I never managed to reproduce that, but maybe
that's just a coincidence...

> Well, right after writing that, I realized the key to the misbehavior:
> sort was reading from a *pipe*:
>
> # This finishes right away, reading from input file "k":
> seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done
>
> # When reading from a pipe, it's a very different story:
> # Without the "timeout 7" prefix, the following would render an N-core
> # system (5 < N) unusable:
> # my system goes unresponsive after 1 second, and doesn't return until timeout.
> for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done
>
> Occasionally, the above jobs all complete quickly.
>
> My first question was why were *any* processes being spawned to handle
> such a small input file.  The first answer is in the first hunk:
>
> diff --git a/src/sort.c b/src/sort.c
> index 13954cb..b9316e7 100644
> --- a/src/sort.c
> +++ b/src/sort.c
> @@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
>  /* Heuristic value for the number of lines for which it is worth
> creating a subthread, during an internal merge sort, on a machine
> that has processors galore.  Currently this number is just a guess.
> -   This value must be at least 4.  We don't know of any machine where
> -   this number has any practical effect.  */
> -enum { SUBTHREAD_LINES_HEURISTIC = 4 };
> +   This value must be at least 4.  */
> +enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };
>
>  /* The number of threads after which there are
> diminishing performance gains.  */
>
> The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
> of 5 or more lines would be split in two and sorted via two (or more)
> separate threads.  Thus, with just 40 lines, you could get the maximum
> of 8 threads working.  That is obviously not efficient, unless lines are
> so pathologically long that the cost of comparing two of them approaches
> the cost of creating a new process.
>
> With the above, sort would use a more reasonable number.
> Tests on high-end hardware and using very short lines
> suggest that a value like 200,000 would still be conservative.

Here's a complete patch for that:

>From b2db5675bfeb3fe7e87bcc12934f34057ee26704 Mon Sep 17 00:00:00 2001
From: Jim Meyering 
Date: Thu, 10 Feb 2011 08:48:27 +0100
Subject: [PATCH] sort: spawn fewer threads for small inputs

* src/sort.c (SUBTHREAD_LINES_HEURISTIC): Do not spawn a new thread
for every 4 lines.  Increase this from 4 to 128K.  128K lines seems
appropriate for a 5-year-old dual-core laptop, but it is too low for
some common combinations of short lines and/or newer systems.
* NEWS (Bug fixes): Mention it.
---
 NEWS   |9 ++---
 src/sort.c |   16 ++--
 2 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/NEWS b/NEWS
index 3157ef2..5770410 100644
--- a/NEWS
+++ b/NEWS
@@ -4,13 +4,16 @@ GNU coreutils NEWS        -*- outline -*-

 ** Bug fixes

-  du would infloop when given --files0-from=DIR
-  [bug introduced in coreutils-7.1]
-
   cut could segfault when invoked with a user-specified output
   delimiter and an unbounded range like "-f1234567890-".
   [bug introduced in coreutils-5.3.0]

+  du would infloop when given --files0-from=DIR
+  [bug introduced in coreutils-7.1]
+
+  sort no longer spawns 7 worker threads to sort 16 lines
+  [bug introduced in coreutils-8.6]
+
   wc would dereference a NULL pointer up

Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-02-10 Thread Jim Meyering
Jim Meyering wrote:
> Jim Meyering wrote:
>> Running "make -j25 check" on a nominal-12-core F14 system would
>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>> It worked fine yesterday.  I tracked it down to all of the make processes
>> working on the "built_programs.list" (in src/Makefile.am) rule
>>
>> built_programs.list:
>>  @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>| sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>
>> Which made me realize we were running that submake over 400 times,
>> once per test script (including skipped ones).  That's well worth
>> avoiding, even if it means a new temporary file.
>>
>> I don't know the root cause of the OOM-kill (preceded by interminable
>> minutes of a seemingly hung and barely responsive system) or why it started
>> happening today (afaics, none of the programs involved was updated),
>> but this does fix it...
>
> FYI,
> I've tracked this down a little further.
> The horrid performance (hung system and eventual OOM-kill)
> is related to the use of sort above.  This is the definition:
>
> ASSORT = LC_ALL=C sort
>
> If I revert my earlier patch and instead simply
> insist that sort not do anything in parallel,
>
> ASSORT = LC_ALL=C sort --parallel=1
>
> then there is no hang, and things finish in relatively good time.
>
> I don't have a good stand-alone reproducer yet
> and am out of time for today.

Well, right after writing that, I realized the key to the misbehavior:
sort was reading from a *pipe*:

# This finishes right away, reading from input file "k":
seq 99 >k && for i in $(seq 33); do LC_ALL=C timeout 1 sort k > /dev/null & done

# When reading from a pipe, it's a very different story:
# Without the "timeout 7" prefix, the following would render an N-core
# system (5 < N) unusable:
# my system goes unresponsive after 1 second, and doesn't return until timeout.
for i in $(seq 33); do seq 88|LC_ALL=C timeout 7 sort --para=5 >/dev/null & done

Occasionally, the above jobs all complete quickly.

My first question was why were *any* processes being spawned to handle
such a small input file.  The first answer is in the first hunk:

diff --git a/src/sort.c b/src/sort.c
index 13954cb..b9316e7 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -112,9 +112,8 @@ struct rlimit { size_t rlim_cur; };
 /* Heuristic value for the number of lines for which it is worth
creating a subthread, during an internal merge sort, on a machine
that has processors galore.  Currently this number is just a guess.
-   This value must be at least 4.  We don't know of any machine where
-   this number has any practical effect.  */
-enum { SUBTHREAD_LINES_HEURISTIC = 4 };
+   This value must be at least 4.  */
+enum { SUBTHREAD_LINES_HEURISTIC = 32 * 1024 };

 /* The number of threads after which there are
diminishing performance gains.  */

The old definition of SUBTHREAD_LINES_HEURISTIC meant that any group
of 5 or more lines would be split in two and sorted via two (or more)
separate threads.  Thus, with just 40 lines, you could get the maximum
of 8 threads working.  That is obviously not efficient, unless lines are
so pathologically long that the cost of comparing two of them approaches
the cost of creating a new process.

With the above, sort would use a more reasonable number.
Tests on high-end hardware and using very short lines
suggest that a value like 200,000 would still be conservative.
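
To make that concrete, here is a toy model of the splitting (my own sketch
of the recursion, not the actual sortlines() code, and it ignores buffer
refills, which is why the real clone counts reported elsewhere in this
thread grow again for very large inputs): each level of the internal merge
sort halves both the line count and the thread budget, and hands one half
to a new thread while both stay above their limits.

#include <stdio.h>

static unsigned long
subthreads (unsigned long nlines, unsigned long nthreads, unsigned long heuristic)
{
  if (nthreads <= 1 || nlines < heuristic)
    return 0;
  /* One clone for this split; each half may split again with half the budget.  */
  return 1 + 2 * subthreads (nlines / 2, nthreads / 2, heuristic);
}

int
main (void)
{
  for (int i = 1; i <= 12; i++)
    {
      unsigned long n = 2UL << i;
      printf ("%lu\t%lu\t%lu\n", n,
              subthreads (n, 16, 4),           /* old heuristic of 4 */
              subthreads (n, 16, 32 * 1024));  /* 32K, as in the hunk above */
    }
  return 0;
}

With the heuristic at 4 this reproduces the 1, 3, 7, 15, 15, ... pattern of
the strace clone counts shown elsewhere in this thread.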

But even with that change, the above for-loop would sometimes hang.
Pursuing it further, I discovered that sort was allocating 130MB
(INPUT_FILE_SIZE_GUESS * 129, with the former being 1MiB) of space
for its input buffer, simply because it was reading from a pipe.
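
The arithmetic behind that figure, and behind the smaller guess tried in the
hunk below (the numbers only restate what is written above; the 33-process
multiplier is just the reproducer's for-loop):

#include <stdio.h>

int
main (void)
{
  unsigned long long old_guess = 1024 * 1024;   /* old INPUT_FILE_SIZE_GUESS */
  unsigned long long new_guess = 64 * 1024;     /* value tried below */
  unsigned long long factor = 129;              /* per-buffer expansion */
  printf ("old: %llu MiB per sort reading a pipe\n", old_guess * factor >> 20);
  printf ("new: %llu MiB per sort reading a pipe\n", new_guess * factor >> 20);
  printf ("old, 33 concurrent sorts: ~%llu GiB\n", 33 * old_guess * factor >> 30);
  return 0;
}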

diff --git a/src/sort.c b/src/sort.c
index 13954cb..92c8d4e 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -316,7 +315,7 @@ static size_t merge_buffer_size = MAX (MIN_MERGE_BUFFER_SIZE, 256 * 1024);
 static size_t sort_size;

 /* The guessed size for non-regular files.  */
-#define INPUT_FILE_SIZE_GUESS (1024 * 1024)
+#define INPUT_FILE_SIZE_GUESS (64 * 1024)

 /* Array of directory names in which any temporary files are to be created. */
 static char const **temp_dirs;

With that, the while-loop does not hang any more.
Note again that this is on a multi-core system.

I'll investigate more tomorrow.



Re: parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-02-09 Thread Assaf Gordon
Jim Meyering wrote, On 02/09/2011 12:29 PM:
> Jim Meyering wrote:
>> Running "make -j25 check" on a nominal-12-core F14 system would
>> cause serious difficulty leading to an OOM kill -- and this is brand new.
> 
> If I revert my earlier patch and instead simply
> insist that sort not do anything in parallel,
> 
> ASSORT = LC_ALL=C sort --parallel=1
> 
> then there is no hang, and things finish in relatively good time.

I previously noticed the memory issue as a surprising (for me, at least) 
side-effect of parallel sort
(http://lists.gnu.org/archive/html/coreutils/2010-12/msg00079.html and 
http://lists.gnu.org/archive/html/coreutils/2010-12/msg00084.html)

But I'm noticing another side-effect of the new default behavior:
On a shared Linux system (both on a server with many cores, and on an SGE
cluster), where users are using 'sort' as part of their scripts, memory usage
and avg. load sometimes peak beyond what is available, because each sort
process now uses up to 8 cores by default.

So while users implicitly assume they use 1 core (at least with SGE, where you
can specify how many threads your job will require), in practice they use many
more.

Globally setting OMP_NUM_THREADS=1 restores the old behavior, and only
explicitly specifying "--parallel=X" lets sort use more than one thread,
but I'm wondering whether it would be better to default back to 1 core, and
require an explicit "--parallel" to do a multi-threaded sort.

-gordon



parallel sort at fault? [Re: [PATCH] tests: avoid gross inefficiency...

2011-02-09 Thread Jim Meyering
Jim Meyering wrote:
> Running "make -j25 check" on a nominal-12-core F14 system would
> cause serious difficulty leading to an OOM kill -- and this is brand new.
> It worked fine yesterday.  I tracked it down to all of the make processes
> working on the "built_programs.list" (in src/Makefile.am) rule
>
> built_programs.list:
>   @echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
> | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>
> Which made me realize we were running that submake over 400 times,
> once per test script (including skipped ones).  That's well worth
> avoiding, even if it means a new temporary file.
>
> I don't know the root cause of the OOM-kill (preceded by interminable
> minutes of a seemingly hung and barely responsive system) or why it started
> happening today (afaics, none of the programs involved was updated),
> but this does fix it...

FYI,
I've tracked this down a little further.
The horrid performance (hung system and eventual OOM-kill)
is related to the use of sort above.  This is the definition:

ASSORT = LC_ALL=C sort

If I revert my earlier patch and instead simply
insist that sort not do anything in parallel,

ASSORT = LC_ALL=C sort --parallel=1

then there is no hang, and things finish in relatively good time.

I don't have a good stand-alone reproducer yet
and am out of time for today.