Fergus Henderson wrote:
What's possibly more interesting are the performance results. I'm
compiling about 1300 source files on nodes that have 16 CPUs each.
Single node (plain GNU make, no distcc):
-j2 8m 19s
-j4 5m 46s
-j5 6m 39s
-j8 10m 35s
I don't understand this, but it is repeatable. Any ideas on that one?
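For reference, a sweep like the one above can be scripted so the runs are comparable; this is only a sketch, assuming the tree has a working "clean" target and GNU time is installed at /usr/bin/time:

```shell
# Hypothetical timing sweep over -j values. Run the whole sweep twice and
# keep the second (warm-cache) pass so the numbers are comparable.
for j in 2 4 5 8; do
    make clean >/dev/null
    /usr/bin/time -a -o sweep.log make -j"$j" >/dev/null
done
```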
It looks like your machine probably has 4 CPUs, with each job using
nearly 100% CPU.
My node actually has four quad-core processors.
make -j4 CPU utilization, according to top:
Cpu0 : 0.7%us, 4.0%sy, 0.0%ni, 95.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu1 : 13.6%us, 29.7%sy, 0.0%ni, 56.8%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu2 : 0.7%us, 1.8%sy, 0.0%ni, 97.4%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu3 : 0.7%us, 0.7%sy, 0.0%ni, 98.5%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu4 : 26.8%us, 22.4%sy, 0.0%ni, 49.6%id, 0.0%wa, 0.0%hi, 1.1%si
Cpu5 : 7.0%us, 15.8%sy, 0.0%ni, 77.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu6 : 2.9%us, 2.9%sy, 0.0%ni, 94.2%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu7 : 2.6%us, 0.7%sy, 0.0%ni, 96.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu8 : 0.7%us, 3.7%sy, 0.0%ni, 95.6%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu9 : 9.5%us, 15.8%sy, 0.0%ni, 74.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu10 : 0.7%us, 2.9%sy, 0.0%ni, 96.4%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu11 : 0.4%us, 0.4%sy, 0.0%ni, 99.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu12 : 1.5%us, 5.8%sy, 0.0%ni, 92.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu13 : 5.1%us, 5.1%sy, 0.0%ni, 89.8%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu14 : 6.6%us, 5.1%sy, 0.0%ni, 88.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu15 : 27.5%us, 30.0%sy, 0.0%ni, 41.8%id, 0.0%wa, 0.0%hi, 0.7%si
It bounces all over the place, but this is not an atypical snapshot. The
machine is mostly idle, according to the idle (%id) column. Is this somehow
a major resource contention issue? Disk access, maybe? Note that I have
tried local disk access on a /tmp partition, and while overall performance
improves a bit, the scaling with increasing -j does not: -j5 is still
slower than -j4.
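To turn those snapshots into a single number rather than eyeballing 16 rows, the columns can be averaged across CPUs. A small helper (my own sketch, not a distcc tool) that reads the "CpuN :" lines on stdin:

```shell
# Average the %us, %sy and %id columns across all "CpuN :" lines of a
# pasted top snapshot. Hypothetical helper; feed it via
#   grep '^Cpu' snapshot.txt | summarize
summarize() {
    awk -F',' '{
        split($1, a, ":")                  # "Cpu0 : 13.6%us" -> " 13.6%us"
        us += a[2] + 0                     # awk keeps the leading number
        sy += $2 + 0; id += $4 + 0; n++    # " 29.7%sy", " 56.8%id"
    } END {
        printf "cpus=%d mean us=%.1f%% sy=%.1f%% id=%.1f%%\n",
               n, us/n, sy/n, id/n
    }'
}
```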
A make -j8 run barely eats any more CPU. This is with a fully local disk,
too:
Cpu0 : 20.0%us, 38.2%sy, 0.0%ni, 41.8%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu1 : 5.4%us, 30.4%sy, 0.0%ni, 64.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu2 : 14.5%us, 21.8%sy, 0.0%ni, 63.6%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu3 : 12.7%us, 41.8%sy, 0.0%ni, 45.5%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu4 : 20.0%us, 27.3%sy, 0.0%ni, 52.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu5 : 3.7%us, 29.6%sy, 0.0%ni, 66.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu6 : 1.8%us, 32.7%sy, 0.0%ni, 65.5%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu7 : 1.8%us, 3.5%sy, 0.0%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu8 : 1.8%us, 30.9%sy, 0.0%ni, 67.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu9 : 5.4%us, 23.2%sy, 0.0%ni, 71.4%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu10 : 3.6%us, 12.7%sy, 0.0%ni, 83.6%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu11 : 0.0%us, 7.3%sy, 0.0%ni, 92.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu12 : 3.6%us, 32.7%sy, 0.0%ni, 63.6%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu13 : 5.4%us, 19.6%sy, 0.0%ni, 75.0%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu14 : 0.0%us, 5.5%sy, 0.0%ni, 94.5%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu15 : 3.5%us, 28.1%sy, 0.0%ni, 68.4%id, 0.0%wa, 0.0%hi, 0.0%si
Recall that -j8 results in a 10m build whereas -j4 results in a 5m build.
I'm not sure how to effectively profile this. All the sources are on
NFS.
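As for profiling, a few standard tools may help narrow down where the time goes; these are hedged suggestions (exact output and availability vary by distro), and the file name below is a placeholder:

```shell
# Client-side NFS operation counts; a spike in getattr/lookup during the
# build points at metadata traffic rather than data transfer.
nfsstat -c

# System-wide run queue, context switches and I/O-wait while the build
# runs, sampled once per second:
vmstat 1

# Syscall-time summary for a single compile (some_file.c is a placeholder),
# to see where its wall time actually goes:
strace -c -f gcc -c some_file.c
```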
So I then went multi-node with 4 jobs per node. Incidentally, using
localhost as a compile server only seems to slow things down.
1 node, -j4: 5m 28s (using distcc and 1 remote node)
2 nodes, -j8: 2m 57s
3 nodes, -j12: 2m 16s
4 nodes, -j16: 1m 58s
5 nodes, -j20: 2m 7s
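For concreteness, the 4-node case above corresponds to a host list along these lines (node names hypothetical). distcc's host syntax caps jobs per host with "/n", and localhost is simply left out of the list since it only slowed things down here:

```shell
# Four remote nodes, four jobs each; no compile slots on localhost.
export DISTCC_HOSTS="node1/4 node2/4 node3/4 node4/4"
make -j16 CC="distcc gcc"
```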
Scaling seems to break down around the 4-node mark. Our link step is
only 5-6 seconds, so we are not bound by that. Pushing -j higher
doesn't seem to help either. Any ideas for profiling this to find the
remaining bottlenecks?
First, try running "top" during the build to determine the CPU usage on
your local host. If it stays near 100%, then the bottleneck is local
jobs such as linking and/or include scanning, and top will show you
which jobs are using the CPU most. That's quite likely to be the
limiting factor if you have a large number of nodes.
Not surprisingly (now), the localhost CPU is mostly idle as well during
a multi-node build. A snapshot:
Cpu0 : 0.0%us, 11.5%sy, 0.0%ni, 88.5%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu1 : 1.9%us, 26.9%sy, 0.0%ni, 71.2%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu2 : 1.9%us, 29.6%sy, 0.0%ni, 68.5%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu3 : 1.9%us, 17.0%sy, 0.0%ni, 81.1%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu4 : 9.4%us, 43.4%sy, 0.0%ni, 43.4%id, 0.0%wa, 0.0%hi, 3.8%si
Cpu5 : 3.8%us, 28.3%sy, 0.0%ni, 67.9%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu6 : 1.9%us, 18.9%sy, 0.0%ni, 79.2%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu7 : 1.9%us, 28.8%sy, 0.0%ni, 69.2%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu8 : 1.9%us, 11.3%sy, 0.0%ni, 86.8%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu9 : 3.7%us, 37.0%sy, 0.0%ni, 59.3%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu10 : 1.9%us, 26.4%sy, 0.0%ni, 71.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu11 : 0.0%us, 11.3%sy, 0.0%ni, 88.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu12 : 1.9%us, 15.4%sy, 0.0%ni, 82.7%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu13 : 1.9%us, 30.2%sy, 0.0%ni, 67.9%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu14 : 3.8%us, 22.6%sy, 0.0%ni, 73.6%id, 0.0%wa, 0.0%hi, 0.0%si
Cpu15 : 1.9%us, 24.5%sy, 0.0%ni, 73.6%id, 0.0%wa, 0.0%hi, 0.0%si
Another possibility is lack of parallelism in your Makefile; you may
have 1300 source files, but the dependencies in your Makefile probably
mean that you can't actually run 1300 compiles in parallel. Maybe your
Makefile only allows about 16 compiles to run in parallel on average.
I believe I've fixed my makefiles so that, after a couple of short initial
serial steps, compilation is fully parallel, both across directories and
across source files within a directory. I do see directories being
interleaved in my output, as well as big bursts of files from the same
directory being launched.
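One way to check how much parallelism make actually achieves, rather than inferring it from interleaved output, is to log a timestamp at the start of every compile (e.g. by prefixing the compile rule with "date +%s >> /tmp/starts" -- a hypothetical wrapper, not something distcc provides) and then bucket the launches per second:

```shell
# Report the peak number of recipe launches in any one second, given a
# file of epoch timestamps (one per line, as written by the wrapper above).
peak_launch_rate() {
    awk '{ c[int($1)]++ }
         END { max = 0
               for (s in c) if (c[s] > max) max = c[s]
               printf "peak launches/sec: %d\n", max }' "$1"
}
```

If the peak never approaches the -j value, the dependency graph (or a recursive-make structure) is serializing the build regardless of how high -j is set.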
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson...@llnl.gov
Tel: 925-424-2858 Fax: 925-423-8704
__
distcc mailing list http://distcc.samba.org/
To unsubscribe or change options:
https://lists.samba.org/mailman/listinfo/distcc