Hi everyone,

We have a small cluster of 6 identical 48-core nodes for astrophysical
research, and we are struggling to get Open MPI to run efficiently on
them. The head node runs Ubuntu and Open MPI 1.6.5 from a local disk.
All worker nodes boot from an NFS-exported root that resides on a NAS,
also with Ubuntu and Open MPI 1.6.5. All nodes have Gbit Ethernet, and
the NAS is connected to the switch with 4 NICs. The motherboards are
Supermicro H8QG6 and the processors are 2.6 GHz AMD Opteron 6344s.
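
For concreteness, the kind of minimal test case I have in mind when
describing the behavior below is nothing fancier than a rank/hostname
report like this sketch (not one of our production codes), launched
with mpirun:

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, namelen;
      char name[MPI_MAX_PROCESSOR_NAME];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Get_processor_name(name, &namelen);

      /* Each rank reports which node it ended up on. */
      printf("rank %d of %d on %s\n", rank, size, name);

      MPI_Finalize();
      return 0;
  }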

When we run Open MPI jobs on the head node, everything works as
expected. But when we run them on any of the worker nodes, the
execution takes ~20+ times longer, and htop shows that all processes
spend the vast majority of their time in kernel time (shown in red).
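
To try to quantify that, one thing I mean to try is timing a
compute-only loop inside each rank and splitting user versus system
time with getrusage, roughly as in the sketch below (the loop body and
iteration count are arbitrary, just enough to keep a core busy;
getrusage reports cumulative times for the whole process, including
MPI_Init):

  #include <stdio.h>
  #include <sys/resource.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Pure local arithmetic: no MPI traffic, no file I/O. */
      double t0 = MPI_Wtime();
      volatile double x = 0.0;
      for (long i = 0; i < 200000000L; i++)
          x += 1.0e-9 * (double) i;
      double t1 = MPI_Wtime();

      /* Cumulative user/system CPU time for this process so far. */
      struct rusage ru;
      getrusage(RUSAGE_SELF, &ru);
      printf("rank %d: wall %.2f s, user %.2f s, sys %.2f s\n",
             rank, t1 - t0,
             ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
             ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6);

      MPI_Finalize();
      return 0;
  }

My (possibly naive) expectation is that the sys column should stay
close to zero for a loop like this on a healthy node.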

I have been trying to learn about profilers, MCA tuning and the like,
but it seems to me that a 20-fold hit in performance points to a much
more serious problem. For example, it might have to do with a buggy
BIOS that does not report the L3 cache correctly, which triggers the
hwloc warnings I reported here in the past. I flashed the BIOS to the
latest version, we are running the latest kernel, and I tried newer,
manually compiled hwloc/Open MPI, all to no avail. I am at my wits'
end about what to try next, and I would thoroughly appreciate any help
and guidance. Our cluster is idling until I resolve this, and quite a
few people are tapping me on the shoulder impatiently. And yes, I'm an
astronomer, not a sysadmin, so please excuse my ignorance.
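
Since those hwloc warnings keep nagging at me, I also mean to compare
what hwloc itself reports on the head node versus a worker, along the
lines of this sketch (lstopo gives the full picture including caches;
this just counts cores and hardware threads):

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;

      hwloc_topology_init(&topology);
      hwloc_topology_load(topology);

      /* Count the cores and hardware threads (PUs) hwloc detects. */
      int cores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
      int pus   = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PU);
      printf("hwloc sees %d cores and %d PUs\n", cores, pus);

      hwloc_topology_destroy(topology);
      return 0;
  }

If those numbers differed between the head node and a worker, I suppose
it would at least point back at the BIOS/topology theory, but I have no
idea whether that is the right place to look.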

Thanks a bunch,
Andrej
