For some reason, sort is now very slow here (Fedora 15, x86_64, with little free RAM), even for tiny inputs, but only some of the time, and not when I set OMP_NUM_THREADS=1. (I noticed because ./bootstrap was taking *forever*, traced it, which showed gnulib-tool taking all the time, then traced gnulib-tool, which showed slow sort -- it runs thousands of small ones.)
time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.00s user 3.18s system 98% cpu 3.218 total $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.01s user 39.90s system 99% cpu 40.077 total $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.00s user 5.65s system 99% cpu 5.679 total $ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d" 0.00s user 34.16s system 13% cpu 4:12.40 total Then I can run the very same command 40 times in a row, with it terminating in less than .3s total. Then once more and it takes 5 seconds. I managed to get some info out of one via strace -p PID: $ strace -p 21925 Process 21925 attached - interrupt to quit fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fded12000 read(0, "a\nb\n", 131072) = 4 read(0, "", 126976) = 0 fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8fded11000 write(1, "a\nb\n", 4) = 4 munmap(0x7f8fddcd0000, 16912384) = 0 lseek(0, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) close(0) = 0 munmap(0x7f8fded12000, 4096) = 0 close(1) = 0 munmap(0x7f8fded11000, 4096) = 0 close(2) = 0 exit_group(0) = ? Process 21925 detached That fstat (STDIN_FILENO, ... call comes from sort->merge's call to avoid_trashing_input. Top reported this: top - 23:18:49 up 25 days, 15:21, 38 users, load average: 0.11, 0.30, 0.45 Tasks: 285 total, 1 running, 278 sleeping, 5 stopped, 1 zombie Cpu(s): 0.1%us, 4.4%sy, 0.1%ni, 95.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 12323704k total, 11880864k used, 442840k free, 84120k buffers Swap: 4194300k total, 4194252k used, 48k free, 958840k cached Unfortunately, the above was last night, and this morning I've been unable to reproduce it. This strace -c diff shows how the default with 6/12 cores (equiv. to OMP_NUM_THREADS=12) each of sort's munmap call takes 50% longer than when I set OMP_NUM_THREADS=1: --- /t/sort-omp1.log 2011-07-28 10:48:39.284061665 +0200 +++ /t/sort.log 2011-07-28 10:48:27.088061664 +0200 @@ -1,6 +1,6 @@ % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- -100.00 0.000761 127 6 munmap +100.00 0.001151 192 6 munmap 0.00 0.000000 0 6 read 0.00 0.000000 0 1 write 0.00 0.000000 0 5 open @@ -16,8 +16,9 @@ 0.00 0.000000 0 1 execve 0.00 0.000000 0 4 getrlimit 0.00 0.000000 0 1 arch_prctl + 0.00 0.000000 0 1 sched_getaffinity 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 1 1 fadvise64 0.00 0.000000 0 1 set_robust_list ------ ----------- ----------- --------- --------- ---------------- -100.00 0.000761 95 3 total +100.00 0.001151 96 3 total This shows how with OMP_NUM_THREADS=12 the max-resident RAM is nearly triple what you'll see when using OMP_NUM_THREADS=1: $ :|env time env LC_ALL=C OMP_NUM_THREADS=12 sort 0.00user 0.01system 0:00.01elapsed 88%CPU (0avgtext+0avgdata 17076maxresident)k 0inputs+0outputs (0major+4496minor)pagefaults 0swaps $ :|env time env LC_ALL=C OMP_NUM_THREADS=1 sort 0.00user 0.00system 0:00.00elapsed 83%CPU (0avgtext+0avgdata 6780maxresident)k 0inputs+0outputs (0major+1937minor)pagefaults 0swaps Comparing the two, I see where the extra memory is being allocated, and will probably remove this pre-optimization to fix it: [I'll adjust the log, but I doubt I'll be able to add a test] >From fe20dc12a0bacb81e0474c12dbb29a1afa8b28d1 Mon Sep 17 00:00:00 2001 From: Jim Meyering <meyer...@redhat.com> Date: Thu, 28 Jul 2011 14:03:09 +0200 Subject: [PATCH] sort: FIXME: do not allocate more space initially for multi-thread --- src/sort.c | 17 +---------------- 1 files changed, 1 insertions(+), 16 deletions(-) diff --git a/src/sort.c b/src/sort.c index 03f63fb..2aa5da2 100644 --- a/src/sort.c +++ b/src/sort.c @@ -3793,22 +3793,7 @@ sort (char *const *files, size_t nfiles, char const *output_file, char const *file = *files; FILE *fp = xfopen (file, "r"); FILE *tfp; - - size_t bytes_per_line; - if (nthreads > 1) - { - /* Get log P. */ - size_t tmp = 1; - size_t mult = 1; - while (tmp < nthreads) - { - tmp *= 2; - mult++; - } - bytes_per_line = (mult * sizeof (struct line)); - } - else - bytes_per_line = sizeof (struct line) * 3 / 2; + size_t bytes_per_line = sizeof (struct line) * 3 / 2; if (! buf.alloc) initbuf (&buf, bytes_per_line, -- 1.7.6.639.g50fb9