I believe this -should- work, but I can't verify it myself. The most important
thing is to be sure you built with --enable-heterogeneous; otherwise it will
definitely fail.
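
One quick way to check (assuming ompi_info from that installation is on
your PATH) is:

  ompi_info | grep -i hetero

which I believe will show a "Heterogeneous support" line saying yes or no
for the build.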

Ralph



On 4/10/08 7:17 AM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:

> 
> On a CentOS Linux box, I see the following:
> 
>> grep 113 /usr/include/asm-i386/errno.h
> #define EHOSTUNREACH 113 /* No route to host */
> 
> I have also seen folks do this to figure out the errno.
> 
>> perl -e 'die$!=113'
> No route to host at -e line 1.
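> 
> If you prefer C, a tiny standalone program (not Open MPI specific, just
> a sketch) gives the same answer via strerror():
> 
>   #include <stdio.h>
>   #include <string.h>
> 
>   int main(void) {
>       /* 113 is the errno printed in the btl_tcp_endpoint message */
>       printf("errno 113: %s\n", strerror(113));
>       return 0;
>   }
> 
> On Linux this prints "errno 113: No route to host".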
> 
> I am not sure why this is happening, but you could also check the Open
> MPI User's Mailing List Archives where there are other examples of
> people running into this error.  A search for "113" turned up a few hits.
> 
> http://www.open-mpi.org/community/lists/users
> 
> Also, I assume you would see this problem with or without the
> MPI_Barrier if you add this parameter to your mpirun line:
> 
>      --mca mpi_preconnect_all 1
> 
> The MPI_Barrier is what exposes the bad behavior because, by default,
> connections are set up lazily. Therefore, only when the MPI_Barrier
> call is made and the processes start communicating and establishing
> connections do we start seeing the communication problems.
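> 
> For example (untested here; hosts and binaries taken from your earlier
> mail, and using -gmca so the setting applies to both app contexts):
> 
>      mpirun -gmca mpi_preconnect_all 1 -np 3 --host aim-plankton ./MPITest : -np 3 --host aim-fanta4 ./MPITest64
> 
> That way the connect() failures should show up right at startup instead
> of only at the first MPI_Barrier.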
> 
> Rolf
> 
> jody wrote:
>> Rolf,
>> I was able to run hostname on the two nodes that way,
>> and also a simplified version of my test program (without a barrier)
>> works. Only MPI_Barrier shows bad behaviour.
>> 
>> Do you know what this message means?
>> [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>> Does it give an idea what could be the problem?
>> 
>> Jody
>> 
>> On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
>> <rolf.vandeva...@sun.com> wrote:
>>> This worked for me although I am not sure how extensive our 32/64
>>> interoperability support is.  I tested on Solaris using the TCP
>>> interconnect and a 1.2.5 version of Open MPI.  Also, we configure
>>> with the --enable-heterogeneous flag, which may make a difference
>>> here.  Also, this did not work for me over the sm btl.
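>>> If the sm problem bites you too when 32-bit and 64-bit processes share
>>> a node, one workaround (untested here) would be to leave sm out of the
>>> btl list, e.g. -gmca btl tcp,self instead of -gmca btl sm,self,tcp.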
>>> 
>>> By the way, can you run a simple /bin/hostname across the two nodes?
>>> 
>>> 
>>>  burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
>>>  burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
>>>  burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
>>> [burl-ct-v20z-4]I am #0/6 before the barrier
>>> [burl-ct-v20z-5]I am #3/6 before the barrier
>>> [burl-ct-v20z-5]I am #4/6 before the barrier
>>> [burl-ct-v20z-4]I am #1/6 before the barrier
>>> [burl-ct-v20z-4]I am #2/6 before the barrier
>>> [burl-ct-v20z-5]I am #5/6 before the barrier
>>> [burl-ct-v20z-5]I am #3/6 after the barrier
>>> [burl-ct-v20z-4]I am #1/6 after the barrier
>>> [burl-ct-v20z-5]I am #5/6 after the barrier
>>> [burl-ct-v20z-5]I am #4/6 after the barrier
>>> [burl-ct-v20z-4]I am #2/6 after the barrier
>>> [burl-ct-v20z-4]I am #0/6 after the barrier
>>>  burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
>>> mpirun (Open MPI) 1.2.5r16572
>>> 
>>> Report bugs to http://www.open-mpi.org/community/help/
>>>  burl-ct-v20z-4 65 =>
>>> 
>>> 
>>> 
>>> 
>>> jody wrote:
>>>> I narrowed it down:
>>>> The majority of processes get stuck in MPI_Barrier.
>>>> My test application looks like this:
>>>> 
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include "mpi.h"
>>>> 
>>>> int main(int iArgC, char *apArgV[]) {
>>>>    int iResult = 0;
>>>>    int iRank1;
>>>>    int iNum1;
>>>> 
>>>>    char sName[256];
>>>>    gethostname(sName, 255);
>>>> 
>>>>    MPI_Init(&iArgC, &apArgV);
>>>> 
>>>>    MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>>>>    MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>>>> 
>>>>    printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1,
>>>> iNum1);
>>>>    MPI_Barrier(MPI_COMM_WORLD);
>>>>    printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1,
>>>> iNum1);
>>>> 
>>>>    MPI_Finalize();
>>>> 
>>>>    return iResult;
>>>> }
>>>> 
>>>> 
>>>> If I make this call:
>>>> mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>> 
>>>> (run_gdb.sh is a script which starts gdb in an xterm for each process)
>>>> Process 0 (on aim-plankton) passes the barrier and gets stuck in
>>>> PMPI_Finalize,
>>>> all other processes get stuck in PMPI_Barrier.
>>>> Process 1 (on aim-plankton) displays the message
>>>>   [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>> Process 2 (on aim-plankton) displays the same message twice.
>>>> 
>>>> Any ideas?
>>>> 
>>>>  Thanks Jody
>>>> 
>>>> On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
>>>>> Hi
>>>>> Using a more realistic application than a simple "Hello, world",
>>>>> even the --host version doesn't work correctly.
>>>>> Called this way:
>>>>> 
>>>>> mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>>>>> 
>>>>> the application starts but seems to hang after a while.
>>>>> 
>>>>> Running the application in gdb:
>>>>> 
>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>>>>> 
>>>>> I can see that the processes on aim-fanta4 have indeed gotten stuck
>>>>> after a few initial outputs,
>>>>> and the processes on aim-plankton all have a message:
>>>>> 
>>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
>>>>> 
>>>>> If I use only aim-plankton or only aim-fanta4, everything runs
>>>>> as expected.
>>>>> 
>>>>> BTW: I'm using Open MPI 1.2.2
>>>>> 
>>>>> Thanks
>>>>>  Jody
>>>>> 
>>>>> 
>>>>> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
>>>>>> Hi
>>>>>> In my network I have some 32-bit machines and some 64-bit machines.
>>>>>> With --host I successfully call my application:
>>>>>>  mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>> (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine)
>>>>>> 
>>>>>> But when I use hostfiles:
>>>>>>  mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>> all 6 processes are started on the 64-bit machine aim-fanta4.
>>>>>> 
>>>>>> hosts32:
>>>>>>   aim-plankton slots=3
>>>>>> hosts64:
>>>>>>   aim-fanta4 slots
>>>>>> 
>>>>>> Is this a bug or a feature?  ;)
>>>>>> 
>>>>>> Jody
>>>>>> 
>>>>> 
>>> 
>>> 
>>> --
>>> 
>>> =========================
>>> rolf.vandeva...@sun.com
>>> 781-442-3043
>>> =========================
>>> 
> 

