Dear all,

I am a little bit confused about a problem I have encountered a few times now.
I have three clusters with InfiniBand networks. One of the older clusters has Mellanox MT23108 cards and a Voltaire sLB-24 switch; the newer cluster has Mellanox MT26428 cards with a QLogic 12300 switch. All clusters are running Debian Squeeze, all of them are 64-bit machines, and all of them have the required packages for the IB network installed. I have tested the IB network and, as far as I can tell, it is up and running without any problems.

Most of the programs I am using run well over the IB network; however, I have two which behave a bit oddly. I will give only one example. I do not expect a solution for the specific problem here, but I would like to understand what is going on.

If I compile the latest version of GAMESS-US with MPI support, it runs fine when I start it like this in the rungms wrapper script:

/opt/openmpi/gfortran/1.4.3/bin/mpirun -np 4 --hostfile /home/sassy/gamess/mpi/host /home/sassy/build/gamess/gamess.01.x

However, as Open MPI is 'clever' about picking its transports, I wanted to make sure that I am really using the IB network and not the gigabit network here. Thus, I added the flag to exclude the TCP BTL and started the program like this:

/opt/openmpi/gfortran/1.4.3/bin/mpirun -np 4 --hostfile /home/sassy/gamess/mpi/host --mca btl ^tcp /home/sassy/build/gamess/gamess.01.x

That crashes immediately; the verbose output is included below. So far, so good. However, if I do not use the cluster with the Voltaire switch (described above) but the one with the more recent QLogic switch, and simply _copy_ the binary over, it works: there is no crash when the IB network is used and the program runs.

My question is: why? My understanding was that MPI is an interface, and anything to do with node-to-node communication is handled by the MPI library: the program, GAMESS-US, just makes its calls to MPI, and Open MPI then handles the communication, regardless of the network. So if there is a TCP network around, Open MPI uses that, and if there is an IB network around, Open MPI uses that. However, from the above observation (and I have a very similar case with NWChem) it appears to me that GAMESS-US has problems with the Voltaire network but no problems with the QLogic network. That is something I find a bit puzzling.

As I said, I am not after a specific solution to that particular problem here; I really would like to understand why one IB network works while the _same_ binary fails on a different one. Recompiling GAMESS-US on the failing cluster does not help either; I get the same problems.

All the best from a foggy London

Jörg

--
*************************************************************
Jörg Saßmannshausen
University College London
Department of Chemistry
Gordon Street
London WC1H 0AJ

email: [email protected]
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html
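P.S. The verbose output mentioned above follows below. In case it helps with the diagnosis: I assume the complementary test to excluding TCP would be to request the IB transport explicitly, i.e. something along the lines of

/opt/openmpi/gfortran/1.4.3/bin/mpirun -np 4 --hostfile /home/sassy/gamess/mpi/host --mca btl openib,sm,self /home/sassy/build/gamess/gamess.01.x

and I believe

/opt/openmpi/gfortran/1.4.3/bin/ompi_info | grep btl

should show whether the openib BTL component was built into this Open MPI installation at all. I mention these only as a sketch of what I could try next; I have not run them yet.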
/opt/openmpi/gfortran/1.4.3/bin/mpirun -np 4 --hostfile /home/sassy/gamess/mpi/host --mca btl ^tcp --mca btl_openib_verbose 100 --mca orte_base_help_aggregate 0 --mca btl_base_verbose 30 /home/sassy/build/gamess/gamess.01.x
[node24:05105] mca: base: components_open: Looking for btl components
[node24:05106] mca: base: components_open: Looking for btl components
[node24:05106] mca: base: components_open: opening btl components
[node24:05106] mca: base: components_open: found loaded component self
[node24:05106] mca: base: components_open: component self has no register function
[node24:05105] mca: base: components_open: opening btl components
[node24:05106] mca: base: components_open: component self open function successful
[node24:05106] mca: base: components_open: found loaded component sm
[node24:05106] mca: base: components_open: component sm has no register function
[node24:05105] mca: base: components_open: found loaded component self
[node24:05105] mca: base: components_open: component self has no register function
[node24:05106] mca: base: components_open: component sm open function successful
[node24:05105] mca: base: components_open: component self open function successful
[node24:05105] mca: base: components_open: found loaded component sm
[node24:05105] mca: base: components_open: component sm has no register function
[node24:05105] mca: base: components_open: component sm open function successful
[node32:32503] mca: base: components_open: Looking for btl components
[node32:32504] mca: base: components_open: Looking for btl components
[node32:32503] mca: base: components_open: opening btl components
[node32:32503] mca: base: components_open: found loaded component self
[node32:32503] mca: base: components_open: component self has no register function
[node32:32503] mca: base: components_open: component self open function successful
[node32:32504] mca: base: components_open: opening btl components
[node32:32504] mca: base: components_open: found loaded component self
[node32:32504] mca: base: components_open: component self has no register function
[node32:32503] mca: base: components_open: found loaded component sm
[node32:32503] mca: base: components_open: component sm has no register function
[node32:32504] mca: base: components_open: component self open function successful
[node32:32504] mca: base: components_open: found loaded component sm
[node32:32504] mca: base: components_open: component sm has no register function
[node32:32503] mca: base: components_open: component sm open function successful
[node32:32504] mca: base: components_open: component sm open function successful
[node24:05106] select: initializing btl component self
[node24:05106] select: init of component self returned success
[node24:05105] select: initializing btl component self
[node24:05105] select: init of component self returned success
[node24:05105] select: initializing btl component sm
[node24:05105] select: init of component sm returned success
[node24:05106] select: initializing btl component sm
[node24:05106] select: init of component sm returned success
[node32:32504] select: initializing btl component self
[node32:32503] select: initializing btl component self
[node32:32503] select: init of component self returned success
[node32:32503] select: initializing btl component sm
[node32:32503] select: init of component sm returned success
[node32:32504] select: init of component self returned success
[node32:32504] select: initializing btl component sm
[node32:32504] select: init of component sm returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5187,1],1]) is on host: node24
  Process 2 ([[5187,1],0]) is on host: node32
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5187,1],2]) is on host: node32
  Process 2 ([[5187,1],1]) is on host: node24
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5187,1],3]) is on host: node24
  Process 2 ([[5187,1],0]) is on host: node32
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5187,1],0]) is on host: node32
  Process 2 ([[5187,1],1]) is on host: node24
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** Your MPI job will now abort.
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[node24:5105] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[node32:32504] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[node24:5106] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init_thread() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[node32:32503] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 32504 on
node node32 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
unset echo
