I'm not terribly well-versed in debugging mpich-gm problems, so I'm
posting this to the oscar-users list to see if anyone has any
input/insight.

Jason
--- Begin Message ---
Hi,

We are trying to test our RH9 + OSCAR 3.0 cluster with Myrinet.

When we run an MPICH job (one that works fine on our other Fast Ethernet
and Gigabit Ethernet clusters), we get the following output:

---------------------------------------------------------------------
[EMAIL PROTECTED] root]# mpirun -np 4 -v /home/oscartst/a.out
running /home/oscartst/a.out on 4 LINUX ch_gm processors
Program binary is: /home/oscartst/a.out
Machines file is /opt/mpich-1.2.5.9-ch_gm-gcc/share/machines.ch_gm.LINUX
Shared memory for intra-nodes coms is enabled.
GM receive mode used: polling.
4 processes will be spawned:
        Process 0 (/home/oscartst/a.out ) on oscarnode20.CELAR
        Process 1 (/home/oscartst/a.out ) on oscarnode20.CELAR
        Process 2 (/home/oscartst/a.out ) on oscarnode21.CELAR
        Process 3 (/home/oscartst/a.out ) on oscarnode21.CELAR
Open a socket on serveur.CELAR...
Got a first socket opened on port 8000.
Got a second socket opened on port 8001.
Shared memory file: /tmp/gmpi_shmem-4010002.tmp

ssh oscarnode20.CELAR cd /root ; env  GMPI_MASTER=serveur.CELAR
GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002
GMPI_ID=0 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
ssh oscarnode20.CELAR -n cd /root ; env  GMPI_MASTER=serveur.CELAR
GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002
GMPI_ID=1 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
ssh oscarnode21.CELAR -n cd /root ; env  GMPI_MASTER=serveur.CELAR
GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002
GMPI_ID=2 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
ssh oscarnode21.CELAR -n cd /root ; env  GMPI_MASTER=serveur.CELAR
GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002
GMPI_ID=3 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
Nombre de processus = 1 Mon rang = 0

Total : 0.000000
Nombre de processus = 1 Mon rang = 0

Total : 0.000000
Nombre de processus = 1 Mon rang = 0

Total : 0.000000
Nombre de processus = 1 Mon rang = 0

Total : 0.000000
All remote MPI processes have exited.
Cleaning up all remaining processes.
Reap remote processes:
-----------------------------------------------------------------
where, in the program's output:
        "Nombre de processus" (number of processes) is the value from MPI_COMM_SIZE
        "Mon rang" (my rank) is the value from MPI_COMM_RANK
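
For reference, the test program boils down to the usual size/rank check,
along these lines (a minimal C sketch, not our exact source; the print
labels are translated from the French output above):

        #include <stdio.h>
        #include <mpi.h>

        int main(int argc, char **argv)
        {
            int size, rank;

            MPI_Init(&argc, &argv);
            MPI_Comm_size(MPI_COMM_WORLD, &size);  /* "Nombre de processus" */
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* "Mon rang" */

            printf("Number of processes = %d My rank = %d\n", size, rank);

            /* the "Total" line comes from a computation in the real
               program, omitted here */

            MPI_Finalize();
            return 0;
        }

With -np 4 every rank should report a process count of 4 and a distinct
rank between 0 and 3; as shown above, each process instead reports 1 and 0.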

It looks like each process is launched as a separate standalone job.
The 2 hosts are listed in mpich_gm/share/machines.ch_gm.LINUX and in
/etc/hosts.
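
The entries in the machines file are just the node hostnames, one per
line, along these lines (a sketch assuming the usual MPICH machines-file
layout, not a verbatim copy of our file):

        oscarnode20.CELAR
        oscarnode21.CELAR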

mpich is launching 2 processes per machine (dual Xeon each) without
throwing any error.

The GMPI_BOARD=-1 in the debug output also seems a bit strange.

Do you have any idea what the problem might be?

Best regards,
Constantin CHARISSIS
DATASWIFT

--- End Message ---
