Pádraig Brady wrote:
> On 07/28/2011 01:16 PM, Jim Meyering wrote:
>> For some reason, sort is now very slow here (Fedora 15, x86_64, with
>> little free RAM), even for tiny inputs, but only some of the time,
>> and not when I set OMP_NUM_THREADS=1.
>> (I noticed because ./bootstrap was taking *forever*, traced it,
>> which showed gnulib-tool taking all the time, then traced gnulib-tool,
>> which showed slow sort -- it runs thousands of small ones.)
>>
>> time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 3.18s system 98% cpu 3.218 total
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.01s user 39.90s system 99% cpu 40.077 total
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 5.65s system 99% cpu 5.679 total
>> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>> sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 34.16s system 13% cpu 4:12.40 total
>>
>> Then I can run the very same command 40 times in a row, with it
>> terminating in less than .3s total.
>>
>> Then once more and it takes 5 seconds.
>
> Testing here (2 core, 4 threads) does show a large jump in time,
> but only the first time I specified a large number of threads.
>
> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
>
> real    0m0.031s
> user    0m0.015s
> sys     0m0.012s
>
> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort --parallel=100 | uniq -d"
>
> real    0m2.095s
> user    0m0.042s
> sys     0m2.042s
>
> $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort --parallel=100 | uniq -d"
>
> real    0m0.207s
> user    0m0.050s
> sys     0m0.157s
>
> My first thought was memory usage, which seems more likely as
> more memory is used for the multithreaded case.  Also, when
> doing any perf testing with sort I now disable MALLOC_PERTURB_,
> considering what large buffers it allocates; otherwise that
> would disable any lazy allocation that could be done.
> I was half considering unsetting MALLOC_PERTURB_ for the
> large buffer allocation internally in sort.
>
> A wild guess on your 5s delay is, maybe that run caused
> something to be swapped out.  Maybe your swap is on a
> mechanical disk which needed to be spun up?
The disk I was using was an SSD (including swap).
The delays I saw were usually smaller than those listed above --
more like .5 to 2 seconds per run.

> It's unlikely related, but I've had particular issues with
> kswapd0 spinning on F15, and I'm running a patched kernel
> to fix the issue.
>
> Could you repeat tests with a small -S or try the bootstrap
> with MALLOC_PERTURB_ unset?

Good call.
In a few small-input tests, I see that unsetting MALLOC_PERTURB_
makes maxresident the same regardless of whether I use --parallel=1
or --parallel=12 (a reproduction sketch is at the end of this message).

That makes sense, too.
Without MALLOC_PERTURB_, an over-allocation like the one we see there
causes virtually no penalty, but *with* MALLOC_PERTURB_=N, for
0 < N < 256, that buffer is both written (to initialize) and read
(to check at free time), so it has an impact on the virtual memory
footprint, and hence potentially on swapping.

Now I think this was at least partly self-inflicted, due to my using
MALLOC_PERTURB_=, so I expect to discard the proposed patch.
And buy more RAM ;-)

Thanks.
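
For anyone who wants to reproduce the small-input maxresident
comparison, something along these lines should do.  This is only a
sketch: the file name "in" and the perturb value 123 are arbitrary,
and it assumes GNU time (whose %M format prints the maximum resident
set size in KB) and a coreutils sort with --parallel support.

$ printf 'a\nb\n' > in

# With no perturbation, pages of an over-allocated buffer that are
# never touched stay lazily mapped and don't count against RSS.
$ unset MALLOC_PERTURB_
$ /usr/bin/time -f 'maxresident: %M KB' sort --parallel=1  in > /dev/null
$ /usr/bin/time -f 'maxresident: %M KB' sort --parallel=12 in > /dev/null

# With a non-zero perturb byte, glibc fills allocated memory and
# overwrites it again at free time, so the whole buffer is actually
# touched and shows up in maxresident.
$ MALLOC_PERTURB_=123 /usr/bin/time -f 'maxresident: %M KB' sort --parallel=1  in > /dev/null
$ MALLOC_PERTURB_=123 /usr/bin/time -f 'maxresident: %M KB' sort --parallel=12 in > /dev/null

If the over-allocation really is the culprit, the large maxresident
should appear only in the perturbed runs, and mainly with the higher
--parallel value.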