parallel sort strangeness

Jim Meyering Thu, 28 Jul 2011 05:16:18 -0700

For some reason, sort is now very slow here (Fedora 15, x86_64, with
little free RAM), even for tiny inputs, but only some of the time,
and not when I set OMP_NUM_THREADS=1.
(I noticed because ./bootstrap was taking *forever*, traced it,
which showed gnulib-tool taking all the time, then traced gnulib-tool,
which showed slow sort -- it runs thousands of small ones.)


time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 3.18s system 98% 
cpu 3.218 total
$ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.01s user 39.90s system 99% 
cpu 40.077 total
$ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 5.65s system 99% 
cpu 5.679 total
$ time sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"
sh -c "printf 'a\nb\n'|LC_ALL=C sort | uniq -d"  0.00s user 34.16s system 13% 
cpu 4:12.40 total

Then I can run the very same command 40 times in a row, with it
terminating in less than .3s total.

Then once more and it takes 5 seconds.

I managed to get some info out of one via strace -p PID:

  $ strace -p 21925
  Process 21925 attached - interrupt to quit
  fstat(0, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f8fded12000
  read(0, "a\nb\n", 131072)               = 4
  read(0, "", 126976)                     = 0
  fstat(1, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7f8fded11000
  write(1, "a\nb\n", 4)                   = 4
  munmap(0x7f8fddcd0000, 16912384)        = 0
  lseek(0, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
  close(0)                                = 0
  munmap(0x7f8fded12000, 4096)            = 0
  close(1)                                = 0
  munmap(0x7f8fded11000, 4096)            = 0
  close(2)                                = 0
  exit_group(0)                           = ?
  Process 21925 detached

That fstat (STDIN_FILENO, ... call
comes from sort->merge's call to avoid_trashing_input.
Top reported this:

  top - 23:18:49 up 25 days, 15:21, 38 users,  load average: 0.11, 0.30, 0.45
  Tasks: 285 total,   1 running, 278 sleeping,   5 stopped,   1 zombie
  Cpu(s):  0.1%us,  4.4%sy,  0.1%ni, 95.4%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
  Mem:  12323704k total, 11880864k used,   442840k free,    84120k buffers
  Swap:  4194300k total,  4194252k used,       48k free,   958840k cached

Unfortunately, the above was last night,
and this morning I've been unable to reproduce it.

This strace -c diff shows how the default with 6/12 cores
(equiv. to OMP_NUM_THREADS=12) each of sort's munmap call takes 50% longer
than when I set OMP_NUM_THREADS=1:

--- /t/sort-omp1.log    2011-07-28 10:48:39.284061665 +0200
+++ /t/sort.log 2011-07-28 10:48:27.088061664 +0200
@@ -1,6 +1,6 @@
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
-100.00    0.000761         127         6           munmap
+100.00    0.001151         192         6           munmap
   0.00    0.000000           0         6           read
   0.00    0.000000           0         1           write
   0.00    0.000000           0         5           open
@@ -16,8 +16,9 @@
   0.00    0.000000           0         1           execve
   0.00    0.000000           0         4           getrlimit
   0.00    0.000000           0         1           arch_prctl
+  0.00    0.000000           0         1           sched_getaffinity
   0.00    0.000000           0         1           set_tid_address
   0.00    0.000000           0         1         1 fadvise64
   0.00    0.000000           0         1           set_robust_list
 ------ ----------- ----------- --------- --------- ----------------
-100.00    0.000761                    95         3 total
+100.00    0.001151                    96         3 total

This shows how with OMP_NUM_THREADS=12 the max-resident RAM
is nearly triple what you'll see when using OMP_NUM_THREADS=1:

  $ :|env time env LC_ALL=C OMP_NUM_THREADS=12 sort
  0.00user 0.01system 0:00.01elapsed 88%CPU (0avgtext+0avgdata 
17076maxresident)k
  0inputs+0outputs (0major+4496minor)pagefaults 0swaps
  $ :|env time env LC_ALL=C OMP_NUM_THREADS=1 sort
  0.00user 0.00system 0:00.00elapsed 83%CPU (0avgtext+0avgdata 6780maxresident)k
  0inputs+0outputs (0major+1937minor)pagefaults 0swaps

Comparing the two, I see where the extra memory is being allocated,
and will probably remove this pre-optimization to fix it:
[I'll adjust the log, but I doubt I'll be able to add a test]

>From fe20dc12a0bacb81e0474c12dbb29a1afa8b28d1 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyer...@redhat.com>
Date: Thu, 28 Jul 2011 14:03:09 +0200
Subject: [PATCH] sort: FIXME: do not allocate more space initially for
 multi-thread

---
 src/sort.c |   17 +----------------
 1 files changed, 1 insertions(+), 16 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 03f63fb..2aa5da2 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -3793,22 +3793,7 @@ sort (char *const *files, size_t nfiles, char const 
*output_file,
       char const *file = *files;
       FILE *fp = xfopen (file, "r");
       FILE *tfp;
-
-      size_t bytes_per_line;
-      if (nthreads > 1)
-        {
-          /* Get log P. */
-          size_t tmp = 1;
-          size_t mult = 1;
-          while (tmp < nthreads)
-            {
-              tmp *= 2;
-              mult++;
-            }
-          bytes_per_line = (mult * sizeof (struct line));
-        }
-      else
-        bytes_per_line = sizeof (struct line) * 3 / 2;
+      size_t bytes_per_line = sizeof (struct line) * 3 / 2;

       if (! buf.alloc)
         initbuf (&buf, bytes_per_line,
--
1.7.6.639.g50fb9

parallel sort strangeness

Reply via email to