On 14/03/11 15:31, Pádraig Brady wrote:
> On 12/03/11 15:34, Jim Meyering wrote:
>> Jim Meyering wrote:
>>> Jim Meyering wrote:
>>>> Jim Meyering wrote:
>>>>> Running "make -j25 check" on a nominal-12-core F14 system would
>>>>> cause serious difficulty leading to an OOM kill -- and this is brand new.
>>>>> It worked fine yesterday.  I tracked it down to all of the make processes
>>>>> working on the "built_programs.list" (in src/Makefile.am) rule
>>>>>
>>>>> built_programs.list:
>>>>> 	@echo $(bin_PROGRAMS) $(bin_SCRIPTS) | tr ' ' '\n' \
>>>>> 	  | sed -e 's,$(EXEEXT)$$,,' | $(ASSORT) -u | tr '\n' ' '
>>>>>
>>>>> Which made me realize we were running that submake over 400 times,
>>>>> once per test script (including skipped ones).  That's well worth
>>>>> avoiding, even if it means a new temporary file.
>>>>>
>>>>> I don't know the root cause of the OOM-kill (preceded by interminable
>>>>> minutes of a seemingly hung and barely responsive system) or why it
>>>>> started happening today (afaics, none of the programs involved was
>>>>> updated), but this does fix it...
>>>>
>>>> FYI,
>>>> I've tracked this down a little further.
>>>> The horrid performance (hung system and eventual OOM-kill)
>>>> is related to the use of sort above.  This is the definition:
>>>>
>>>>   ASSORT = LC_ALL=C sort
>>>>
>>>> If I revert my earlier patch and instead simply
>>>> insist that sort not do anything in parallel,
>>>>
>>>>   ASSORT = LC_ALL=C sort --parallel=1
>>>>
>>>> then there is no hang, and things finish in relatively good time.
>>>>
>>>> I don't have a good stand-alone reproducer yet
>>>> and am out of time for today.
>>
>> After updating to a new F14 kernel,
>> I never managed to reproduce that, but maybe
>> that's just a coincidence...
>>
>>> Well, right after writing that, I realized the key to the misbehavior:
>>> sort was reading from a *pipe*:
>>>
>>> # This finishes right away, reading from input file "k":
>>> seq 99 >k && for i in $(seq 33); do
>>>   LC_ALL=C timeout 1 sort k >/dev/null &
>>> done
>>>
>>> # When reading from a pipe, it's a very different story:
>>> # Without the "timeout 7" prefix, the following would render an N-core
>>> # system (5<N) unusable for many minutes.  As it is, be prepared:
>>> # my system goes unresponsive after 1 second, and doesn't return until
>>> # the timeout.
>>> for i in $(seq 33); do
>>>   seq 88 | LC_ALL=C timeout 7 sort --para=5 >/dev/null &
>>> done
I still get bad performance for the above with SUBTHREAD_LINES_HEURISTIC=128K.

So, as you suggested, the large memory allocation when reading from a pipe
is a problem, and in fact seems to be the main problem.  Given that the
memory isn't actually used, it shouldn't be such an issue, but if one has
MALLOC_PERTURB_ set, then it is used, and it has a huge impact.  Compare:

$ for i in $(seq 33); do seq 88 | MALLOC_PERTURB_=  timeout 2 sort --para=1 >/dev/null & done
$ for i in $(seq 33); do seq 88 | MALLOC_PERTURB_=1 timeout 2 sort --para=1 >/dev/null & done

So we should be more conservative with memory allocation in sort, aligning
it more with CPU cache sizes than with RAM sizes, I suspect.  This will be
an increasing problem as we tend to run more in parallel.  It would be
interesting, I think, to sort first by L1 cache size, then by L2, etc.,
but as a first pass, a more sensible default of 8MB or so seems appropriate.

As a general note, MALLOC_PERTURB_ should be unset when benchmarking
anything to do with `sort`.

cheers,
Pádraig.
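
Until a more conservative default lands, a minimal sketch of a workaround is
to cap the buffer explicitly with sort's documented -S/--buffer-size option
(the 8M figure here just mirrors the default suggested above; it is not a
tuned value):

```shell
# Cap sort's buffer at 8 MiB so reading from a pipe no longer triggers
# the large default allocation; --parallel=1 additionally avoids the
# multi-thread startup cost on tiny inputs.
seq 88 | LC_ALL=C sort -S 8M --parallel=1 -n | tail -n 1
```

With the buffer capped this way, the MALLOC_PERTURB_ comparison above should
show a much smaller difference, since far less untouched memory is requested
per process.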
