I'm not terribly well-versed in debugging mpich-gm problems, so I'm posting this to the oscar-users list to see if anyone has any input/insight.
Jason
--- Begin Message ---
Hi,

We are trying to test our rh9+oscar3.0 cluster with Myrinet. When running an mpich job (one that works fine on other Fast or Gigabit Ethernet clusters), we get the following output:

---------------------------------------------------------------------
[EMAIL PROTECTED] root]# mpirun -np 4 -v /home/oscartst/a.out
running /home/oscartst/a.out on 4 LINUX ch_gm processors
Program binary is: /home/oscartst/a.out
Machines file is /opt/mpich-1.2.5.9-ch_gm-gcc/share/machines.ch_gm.LINUX
Shared memory for intra-nodes coms is enabled.
GM receive mode used: polling.
4 processes will be spawned:
 Process 0 (/home/oscartst/a.out) on oscarnode20.CELAR
 Process 1 (/home/oscartst/a.out) on oscarnode20.CELAR
 Process 2 (/home/oscartst/a.out) on oscarnode21.CELAR
 Process 3 (/home/oscartst/a.out) on oscarnode21.CELAR
Open a socket on serveur.CELAR...
Got a first socket opened on port 8000.
Got a second socket opened on port 8001.
Shared memory file: /tmp/gmpi_shmem-4010002.tmp
ssh oscarnode20.CELAR cd /root ; env GMPI_MASTER=serveur.CELAR GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002 GMPI_ID=0 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
ssh oscarnode20.CELAR -n cd /root ; env GMPI_MASTER=serveur.CELAR GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002 GMPI_ID=1 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
ssh oscarnode21.CELAR -n cd /root ; env GMPI_MASTER=serveur.CELAR GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002 GMPI_ID=2 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
ssh oscarnode21.CELAR -n cd /root ; env GMPI_MASTER=serveur.CELAR GMPI_PORT1=8000 GMPI_PORT2=8001 GMPI_SHMEM=1 GMPI_MAGIC=4010002 GMPI_ID=3 GMPI_NP=4 GMPI_BOARD=-1 /home/oscartst/a.out
Nombre de processus = 1 Mon rang = 0 Total : 0.000000
Nombre de processus = 1 Mon rang = 0 Total : 0.000000
Nombre de processus = 1 Mon rang = 0 Total : 0.000000
Nombre de processus = 1 Mon rang = 0 Total : 0.000000
All remote MPI processes have exited.
Cleaning up all remaining processes.
Reap remote processes:
-----------------------------------------------------------------

Here "Nombre de processus" (number of processes) is the value returned by MPI_Comm_size and "Mon rang" (my rank) is the value returned by MPI_Comm_rank. It looks like each process is launched as a separate standalone job.

The two hosts are listed in mpich_gm/share/machine.ch_gm.LINUX and in /etc/hosts, and mpich launches two processes per machine (each node is a dual Xeon) without reporting any error. The GMPI_BOARD=-1 in the debug output also seems a bit strange.

Do you have any idea what the problem is?

Best regards,

Constantin CHARISSIS
DATASWIFT
--- End Message ---
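
For readers trying to reproduce the symptom: below is a minimal sketch of the kind of test program that would produce the output quoted in the message. The actual a.out is not included in the thread, so the reduction and the exact print format are assumptions; only the labels ("Nombre de processus", "Mon rang", "Total") are taken from the transcript.

  /* Minimal MPI sanity check (hypothetical reconstruction of a.out).
     Each rank contributes its own id to an all-reduce sum: with 4
     processes the result is 0+1+2+3 = 6.0, while a process running as
     an independent single-process job prints size 1, rank 0 and
     Total : 0.000000, exactly as in the transcript above. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      double contrib, total;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      contrib = (double) rank;
      MPI_Allreduce(&contrib, &total, 1, MPI_DOUBLE, MPI_SUM,
                    MPI_COMM_WORLD);

      printf("Nombre de processus = %d Mon rang = %d Total : %f\n",
             size, rank, total);

      MPI_Finalize();
      return 0;
  }

Compiled with the ch_gm mpicc (presumably the one under /opt/mpich-1.2.5.9-ch_gm-gcc) and launched with "mpirun -np 4", this should report 4 processes, ranks 0 through 3, and Total : 6.000000. One common cause of every rank reporting size 1, worth ruling out with ldd or by recompiling with the ch_gm mpicc, is that the binary was linked against a different MPICH device (e.g., a plain Ethernet build elsewhere on the path), in which case each process ignores the GMPI_* variables set by mpirun.ch_gm and initializes as its own single-process job. This is offered as a starting point, not a diagnosis of the poster's cluster.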
