On 14/03/11 15:31, Pádraig Brady wrote:
> On 12/03/11 15:34, Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
>>>> Jim Meyering wrote:
>>>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>>>> It worked fine yesterday.  I tracked it down to all of the make processes
>>>>> working on the "built_programs.list" (in src/Makefile.am) rule
>>>>>
>>>>> built_programs.list:
>>>>> 	@echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>>> 	  | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>>>
>>>>> Which made me realize we were running that submake over 400 times,
>>>>> once per test script (including skipped ones).  That's well worth
>>>>> avoiding, even if it means a new temporary file.
>>>>>
>>>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>>>> minutes of a seemingly hung and barely responsive system) or why it
>>>>> started happening today (afaics, none of the programs involved was
>>>>> updated), but this does fix it...
>>>>
>>>> FYI,
>>>> I've tracked this down a little further.
>>>> The horrid performance (hung system and eventual OOM-kill)
>>>> is related to the use of sort above.  This is the definition:
>>>>
>>>>   ASSORT = LC_ALL=C sort
>>>>
>>>> If I revert my earlier patch and instead simply
>>>> insist that sort not do anything in parallel,
>>>>
>>>>   ASSORT = LC_ALL=C sort --parallel=1
>>>>
>>>> then there is no hang, and things finish in relatively good time.
>>>>
>>>> I don't have a good stand-alone reproducer yet
>>>> and am out of time for today.
>>
>> After updating to a new F14 kernel,
>> I never managed to reproduce that, but maybe
>> that's just a coincidence...
>>
>>> Well, right after writing that, I realized the key to the misbehavior:
>>> sort was reading from a *pipe*:
>>>
>>> # This finishes right away, reading from input file "k":
>>> seq 99 >k && for i in $(seq 33); do
>>>   LC_ALL=C timeout 1 sort k >/dev/null &
>>> done
>>>
>>> # When reading from a pipe, it's a very different story:
>>> # Without the "timeout 7" prefix, the following would render an N-core
>>> # system (5<N) unusable for many minutes.  As it is, be prepared:
>>> # my system goes unresponsive after 1 second, and doesn't return until
>>> # the timeout.
>>> for i in $(seq 33); do
>>>   seq 88 | LC_ALL=C timeout 7 sort --para=5 >/dev/null &
>>> done
I still get bad performance for the above with SUBTHREAD_LINES_HEURISTIC=128K.

So, as you suggested, the large memory allocation when reading from a pipe
is a problem, and in fact seems to be the main problem.  Given that the
memory isn't actually used, it shouldn't be such an issue, but if one has
MALLOC_PERTURB_ set, then it is used, and it has a huge impact.  Compare:

$ for i in $(seq 33); do seq 88 | MALLOC_PERTURB_=  timeout 2 sort --para=1 >/dev/null & done
$ for i in $(seq 33); do seq 88 | MALLOC_PERTURB_=1 timeout 2 sort --para=1 >/dev/null & done

So we should be more conservative with memory allocation in sort, aligning
it more with CPU cache sizes than with RAM sizes, I suspect.  This will be
an increasing problem as we tend to run more in parallel.  It would be
interesting, I think, to sort first by L1 cache size, then by L2, etc.,
but as a first pass, a more sensible default of 8MB or so seems appropriate.

As a general note, MALLOC_PERTURB_ should be unset when benchmarking
anything to do with `sort`.

cheers,
Pádraig.
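
Until a more conservative default lands, a minimal sketch of a workaround is
to cap the buffer explicitly with sort's documented -S/--buffer-size option
(the 8M figure here just mirrors the default suggested above; it is not a
tuned value):

```shell
# Cap sort's buffer at 8 MiB so reading from a pipe no longer triggers
# the large default allocation; --parallel=1 additionally avoids the
# multi-thread startup cost on tiny inputs.
seq 88 | LC_ALL=C sort -S 8M --parallel=1 -n | tail -n 1
```

With the buffer capped this way, the MALLOC_PERTURB_ comparison above should
show a much smaller difference, since far less untouched memory is requested
per process.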
