We are doing a test build of a new cluster, reusing our Myrinet 10G gear from a previous cluster.

I have built OpenMPI 1.4.2 with PGI 10.4. We use this combination regularly on our InfiniBand-based cluster, so all of the install elements were readily available.
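For reference, the build was configured roughly along these lines; the install prefix and MX location are illustrative rather than our exact paths:

    # build OpenMPI 1.4.2 with the PGI 10.4 compilers and MX support (paths illustrative)
    ./configure CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 \
        --prefix=/opt/openmpi/1.4.2-pgi-10.4 \
        --with-mx=/opt/mx
    make all install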

After a few go-arounds with the Myrinet MX stack, we are now running MX 1.2.12, rebuilt to allow more than the default maximum of 16 endpoints. Each node has 24 cores.
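To confirm the rebuilt driver actually picked up the higher limit, we check the maximums reported by mx_info (the exact wording of its output varies between MX releases, so the grep is just a convenience):

    # the driver should now report more than 16 endpoints per NIC
    mx_info | grep -i endpoint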

The cluster is running Rocks 5.3.

As part of the initial build, I installed the Myrinet_MX Rocks Roll from Myricom. With the default limit of 16 endpoints we could not run on all nodes, so, as mentioned above, the MX stack was replaced.

Myricom provided a build of OpenMPI 1.4.1, and that build works. However, it is compiled only with gcc and gfortran, and we want it built with the compilers we normally use, e.g. PGI, PathScale, and Intel.
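ompi_info makes the difference between the two installs easy to see; the install prefixes below are illustrative:

    # Myricom's 1.4.1 tree reports gcc/gfortran; our 1.4.2 tree reports PGI
    /opt/openmpi-myricom/bin/ompi_info | grep -i compiler
    /opt/openmpi/1.4.2-pgi-10.4/bin/ompi_info | grep -i compiler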

We can compile with the OpenMPI 1.4.2 / PGI 10.4 build. However, we cannot launch jobs with its mpirun; it segfaults.
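The launch is roughly as follows; the process count, host file, test binary name, and install prefix are illustrative, not our exact values:

    # mpirun from the OpenMPI 1.4.2 / PGI 10.4 install
    /opt/openmpi/1.4.2-pgi-10.4/bin/mpirun -np 48 -machinefile hosts ./mpi_test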

--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
[enet1-head2-eth1:29532] *** Process received signal ***
[enet1-head2-eth1:29532] Signal: Segmentation fault (11)
[enet1-head2-eth1:29532] Signal code: Address not mapped (1)
[enet1-head2-eth1:29532] Failing at address: 0x6c
[enet1-head2-eth1:29532] *** End of error message ***
Segmentation fault

However, if we launch the job with the Myricom-supplied mpirun from their OpenMPI 1.4.1 tree, the job runs successfully. This works even with a test program compiled against the OpenMPI 1.4.2 / PGI 10.4 build.
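For comparison, the working launch uses the full path to Myricom's mpirun against the same test binary (prefix again illustrative):

    # same binary, launched with the mpirun from Myricom's 1.4.1 tree
    /opt/openmpi-myricom/bin/mpirun -np 48 -machinefile hosts ./mpi_test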

