On 08-Jul-2015 15:43, Jonathan Barber wrote:
I think your process is being moved between NUMA nodes and you're losing locality to the data. Try confining the process and data to the same NUMA
node with the numactl command.

That's part of it.  I ran a bunch of commands like this:

 taskset -c 20 dd if=KTEMP1 of=KTEMP0 bs=120000 count=34000
 taskset -c 20 testprogram -in KTEMP0

with these results:

 count   size (GB)  time (s)   size (bytes)
 34000      ~4        ~3        4080000000
 68000      ~8        ~7        8160000000
 70000      ~8        ~3        8400000000
100000     ~12        ~3       12000000000
120000     ~14        ~7       14400000000
130000     ~16        ~9       15600000000
140000     ~17      >120       16800000000  (2^34 is 17179869184)

(I didn't wait for the 140000 case to complete; it could have gone on for another 5 minutes.) The variation between ~3 s and ~9 s isn't significant or repeatable; I think it represents the flush process getting in the way of the second command.
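For reference, the buffer sizes in the table follow directly from the dd parameters, and a quick shell calculation confirms that even the slow 140000-record case is still below 2^34 bytes:

```shell
# Each dd record is 120000 bytes; total buffer size is count * bs
bs=120000
for count in 130000 140000; do
  bytes=$((count * bs))
  echo "$count records = $bytes bytes (2^34 = $((1<<34)))"
done
```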

If the test was changed so that "-c 1" was used for the first command
and "-c 20" for the second, then the 130000-record case took
23 s. So there is definitely an advantage to having the file cache pages somehow associated with the CPU where they will be needed next.
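Confining both steps to one NUMA node with numactl, as suggested, might look like the sketch below. Node 0 is only a placeholder here; the actual node holding CPU 20 can be read off the topology listing first.

```shell
# Show the NUMA topology: nodes, their CPUs, and their memory
numactl --hardware

# Run writer and reader with CPU and memory both bound to the same
# node (node 0 is just an example)
numactl --cpunodebind=0 --membind=0 dd if=KTEMP1 of=KTEMP0 bs=120000 count=34000
numactl --cpunodebind=0 --membind=0 testprogram -in KTEMP0
```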

Now the mystery is why an fread() into a buffer that is close to, but just below, 2^34 bytes performs so badly.

Here is ulimit -a:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 4134441
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Nothing there screams 2^34 to me. Perhaps something crucial for performance needs to be locked into memory and grows beyond 64 kB at that buffer size, and that indirectly leads to the performance problem.
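If the 64 kB memlock limit is a suspect, a quick way to test it (just a sketch; raising the limit needs root, or an /etc/security/limits.conf entry plus a fresh login) would be:

```shell
# Current locked-memory limit for this shell, in kbytes
ulimit -l

# As root (or after adding "* hard memlock unlimited" to
# /etc/security/limits.conf and logging in again), raise it and rerun:
ulimit -l unlimited
taskset -c 20 testprogram -in KTEMP0
```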

As an aside, when the test program is locked to a CPU and a file which is "too big" is read, there is no migration/20 process using CPU time. Instead, an events/20 thread starts using a significant amount of CPU time (varying wildly around 30%). ksoftirqd/20 also comes and goes, so that could also be a factor.



Assuming your machine is NUMA (hwloc-ls will show you this), in my
experience some of the E5s have really poor performance for inter-NUMA
communication.

I don't have anything called hwloc-ls on this system. What package provides it? This is a CentOS system.
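(For what it's worth, yum can be asked which package contains a given file; on CentOS the binary is typically shipped in the hwloc package, though that's an assumption about this particular release:)

```shell
# Ask yum which package contains the hwloc-ls binary
yum whatprovides '*/hwloc-ls'

# then, most likely:
yum install hwloc
```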

Thanks,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
