I believe this -should- work, but I can't verify it myself. The most important thing is to be sure you built with --enable-heterogeneous, or else it will definitely fail.

Ralph
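For reference: --enable-heterogeneous has to be given when Open MPI itself is configured; it cannot be switched on afterwards. A rough sketch of such a build, with a placeholder install prefix, plus a check via ompi_info (the exact wording of its output varies between versions):

  ./configure --enable-heterogeneous --prefix=/opt/openmpi
  make all install
  ompi_info | grep -i hetero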
On 4/10/08 7:17 AM, "Rolf Vandevaart" <rolf.vandeva...@sun.com> wrote:

>
> On a CentOS Linux box, I see the following:
>
>> grep 113 /usr/include/asm-i386/errno.h
> #define EHOSTUNREACH 113 /* No route to host */
>
> I have also seen folks do this to figure out the errno.
>
>> perl -e 'die$!=113'
> No route to host at -e line 1.
>
> I am not sure why this is happening, but you could also check the Open
> MPI User's Mailing List Archives, where there are other examples of
> people running into this error. A search for "113" had a few hits.
>
> http://www.open-mpi.org/community/lists/users
>
> Also, I assume you would see this problem with or without the
> MPI_Barrier if you add this parameter to your mpirun line:
>
> --mca mpi_preconnect_all 1
>
> The MPI_Barrier is triggering the bad behavior because, by default,
> connections are set up lazily. Therefore, only when the MPI_Barrier
> call is made and we start communicating and establishing connections do
> we start seeing the communication problems.
>
> Rolf
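Since errno=113 is EHOSTUNREACH, the failure usually lies in the network setup between the nodes (routing or a firewall) rather than in MPI itself. Two quick things worth trying, sketched here with a placeholder interface name (use whatever ifconfig reports on your machines, and make sure any firewall permits the TCP connections):

  # can the nodes reach each other at all?
  ping -c 1 aim-fanta4

  # pin Open MPI's TCP BTL to a single interface both sides can route to
  mpirun --mca btl_tcp_if_include eth0 ...

Note that Rolf's transcript further down uses the same btl_tcp_if_include parameter to select bge1 on his Solaris machines.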
>
> jody wrote:
>> Rolf,
>> I was able to run hostname on the two nodes that way,
>> and also a simplified version of my test program (without a barrier)
>> works. Only MPI_Barrier shows bad behaviour.
>>
>> Do you know what this message means?
>> [aim-plankton][0,1,2][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=113
>> Does it give an idea what could be the problem?
>>
>> Jody
>>
>> On Thu, Apr 10, 2008 at 2:20 PM, Rolf Vandevaart
>> <rolf.vandeva...@sun.com> wrote:
>>> This worked for me, although I am not sure how extensive our 32/64
>>> interoperability support is. I tested on Solaris using the TCP
>>> interconnect and a 1.2.5 version of Open MPI. Also, we configure with
>>> the --enable-heterogeneous flag, which may make a difference here.
>>> Also, this did not work for me over the sm btl.
>>>
>>> By the way, can you run a simple /bin/hostname across the two nodes?
>>>
>>> burl-ct-v20z-4 61 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m32 simple.c -o simple.32
>>> burl-ct-v20z-4 62 =>/opt/SUNWhpc/HPC7.1/bin/mpicc -m64 simple.c -o simple.64
>>> burl-ct-v20z-4 63 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -gmca btl_tcp_if_include bge1 -gmca btl sm,self,tcp -host burl-ct-v20z-4 -np 3 simple.32 : -host burl-ct-v20z-5 -np 3 simple.64
>>> [burl-ct-v20z-4]I am #0/6 before the barrier
>>> [burl-ct-v20z-5]I am #3/6 before the barrier
>>> [burl-ct-v20z-5]I am #4/6 before the barrier
>>> [burl-ct-v20z-4]I am #1/6 before the barrier
>>> [burl-ct-v20z-4]I am #2/6 before the barrier
>>> [burl-ct-v20z-5]I am #5/6 before the barrier
>>> [burl-ct-v20z-5]I am #3/6 after the barrier
>>> [burl-ct-v20z-4]I am #1/6 after the barrier
>>> [burl-ct-v20z-5]I am #5/6 after the barrier
>>> [burl-ct-v20z-5]I am #4/6 after the barrier
>>> [burl-ct-v20z-4]I am #2/6 after the barrier
>>> [burl-ct-v20z-4]I am #0/6 after the barrier
>>> burl-ct-v20z-4 64 =>/opt/SUNWhpc/HPC7.1/bin/mpirun -V
>>> mpirun (Open MPI) 1.2.5r16572
>>>
>>> Report bugs to http://www.open-mpi.org/community/help/
>>> burl-ct-v20z-4 65 =>
>>>
>>> jody wrote:
>>>> I narrowed it down:
>>>> The majority of processes get stuck in MPI_Barrier.
>>>> My test application looks like this:
>>>>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include "mpi.h"
>>>>
>>>> int main(int iArgC, char *apArgV[]) {
>>>>     int iResult = 0;
>>>>     int iRank1;
>>>>     int iNum1;
>>>>
>>>>     char sName[256];
>>>>     gethostname(sName, 255);
>>>>
>>>>     MPI_Init(&iArgC, &apArgV);
>>>>
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &iRank1);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &iNum1);
>>>>
>>>>     printf("[%s]I am #%d/%d before the barrier\n", sName, iRank1, iNum1);
>>>>     MPI_Barrier(MPI_COMM_WORLD);
>>>>     printf("[%s]I am #%d/%d after the barrier\n", sName, iRank1, iNum1);
>>>>
>>>>     MPI_Finalize();
>>>>
>>>>     return iResult;
>>>> }
>>>>
>>>> If I make this call:
>>>>
>>>> mpirun -np 3 --debug-daemons --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest32 : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>
>>>> (run_gdb.sh is a script which starts gdb in an xterm for each process)
>>>> Process 0 (on aim-plankton) passes the barrier and gets stuck in PMPI_Finalize;
>>>> all other processes get stuck in PMPI_Barrier.
>>>> Process 1 (on aim-plankton) displays the message
>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() failed with errno=113
>>>> Process 2 (on aim-plankton) displays the same message twice.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks Jody
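run_gdb.sh itself never appears in the thread. A minimal sketch of what such a wrapper might look like, assuming gdb and xterm are available and the forwarded DISPLAY works; the contents are inferred from jody's description, not taken from his actual script:

  #!/bin/sh
  # Open one xterm per MPI process and run the given program under gdb.
  # mpirun invokes this script with the target program and its arguments,
  # which arrive here as "$@".
  exec xterm -e gdb --args "$@"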
>>>>
>>>> On Thu, Apr 10, 2008 at 1:05 PM, jody <jody....@gmail.com> wrote:
>>>>> Hi
>>>>> Using a more realistic application than a simple "Hello, world",
>>>>> even the --host version doesn't work correctly.
>>>>> Called this way:
>>>>>
>>>>> mpirun -np 3 --host aim-plankton ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 ./QHGLauncher_64 --read-config=pureveg_new.cfg -o output.txt
>>>>>
>>>>> the application starts but seems to hang after a while.
>>>>>
>>>>> Running the application in gdb:
>>>>>
>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./QHGLauncher --read-config=pureveg_new.cfg -o output.txt : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./QHGLauncher_64 --read-config=pureveg_new.cfg -o bruzlopf -n 12 --seasonality=3,data/cai_temp2.clim,data/cai_precip2.clim
>>>>>
>>>>> I can see that the processes on aim-fanta4 have indeed gotten stuck
>>>>> after a few initial outputs,
>>>>> and the processes on aim-plankton all have a message:
>>>>>
>>>>> [aim-plankton][0,1,1][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
>>>>> connect() failed with errno=113
>>>>>
>>>>> If I only use aim-plankton alone or aim-fanta4 alone, everything runs
>>>>> as expected.
>>>>>
>>>>> BTW: I'm using Open MPI 1.2.2.
>>>>>
>>>>> Thanks
>>>>> Jody
>>>>>
>>>>> On Thu, Apr 10, 2008 at 12:40 PM, jody <jody....@gmail.com> wrote:
>>>>>> Hi
>>>>>> In my network I have some 32-bit machines and some 64-bit machines.
>>>>>> With --host I successfully call my application:
>>>>>>
>>>>>> mpirun -np 3 --host aim-plankton -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --host aim-fanta4 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>>
>>>>>> (MPITest64 has the same code as MPITest, but was compiled on the 64-bit machine.)
>>>>>>
>>>>>> But when I use hostfiles:
>>>>>>
>>>>>> mpirun -np 3 --hostfile hosts32 -x DISPLAY ./run_gdb.sh ./MPITest : -np 3 --hostfile hosts64 -x DISPLAY ./run_gdb.sh ./MPITest64
>>>>>>
>>>>>> all 6 processes are started on the 64-bit machine aim-fanta4.
>>>>>>
>>>>>> hosts32:
>>>>>> aim-plankton slots=3
>>>>>>
>>>>>> hosts64:
>>>>>> aim-fanta4 slots
>>>>>>
>>>>>> Is this a bug or a feature? ;)
>>>>>>
>>>>>> Jody
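Since the --host form above is reported to work, one workaround, assuming the host list is short enough to spell out, is to drop the hostfiles and name the hosts per app context directly:

  mpirun -np 3 --host aim-plankton ./MPITest : -np 3 --host aim-fanta4 ./MPITest64

Whether a separate --hostfile per app context is honored at all in the 1.2 series is worth double-checking against the Open MPI documentation; as far as I recall, per-app-context hostfile handling was reworked in later releases.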