Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand
Ralph, thanks for investigating. I've applied the two patches you mentioned earlier and ran with ompi-server. Although I was able to run our standalone test, when I integrated the changes into our code, the processes entered a crazy loop and allocated all the available memory when calling MPI_Comm_connect. I was not able to identify why it works standalone but not integrated with our code. If I find out why, I'll let you know. Looking forward to your findings. We'll be happy to test any patches if you have some!

p.

On Sat, Jul 17, 2010 at 9:47 PM, Ralph Castain wrote:
> Okay, I can reproduce this problem. Frankly, I don't think this ever worked with OMPI, and I'm not sure how the choice of BTL makes a difference.
>
> The program is crashing in the communicator definition, which involves a communication over our internal out-of-band messaging system. That system has zero connection to any BTL, so it should crash either way.
>
> Regardless, I will play with this a little as time allows. Thanks for the reproducer!
>
> On Jun 25, 2010, at 7:23 AM, Philippe wrote:
>
>> Hi,
>>
>> I'm trying to run a test program which consists of a server creating a port using MPI_Open_port and N clients using MPI_Comm_connect to connect to the server.
>>
>> I'm able to do so with 1 server and 2 clients, but with 1 server + 3 clients, I get the following error message:
>>
>> [node003:32274] [[37084,0],0]:route_callback tried routing message from [[37084,1],0] to [[40912,1],0]:102, can't find route
>>
>> This only happens with the openib BTL. With the tcp BTL it works perfectly fine (ofud also works, as a matter of fact...). This has been tested on two completely different clusters, with identical results. In both cases, the IB fabric works normally.
>>
>> Any help would be greatly appreciated! Several people in my team have looked at the problem. Google and the mailing list archive did not provide any clue.
>> I believe that from an MPI standpoint my test program is valid (and it works with TCP, which makes me feel better about the sequence of MPI calls).
>>
>> Regards,
>> Philippe.
>>
>> Background:
>>
>> I intend to use Open MPI to transport data inside a much larger application. Because of that, I cannot use mpiexec. Each process is started by our own "job management" and uses a name server to find out about the others. Once all the clients are connected, I would like the server to do MPI_Recv to get the data from all the clients. I don't care about the order or about which clients are sending data, as long as I can receive it with one call. To do that, the clients and the server go through a series of Comm_accept/Comm_connect/Intercomm_merge so that at the end, all the clients and the server are inside the same intracomm.
>>
>> Steps:
>>
>> I have a sample program that shows the issue. I tried to make it as short as possible. It needs to be executed on a shared file system like NFS because the server writes the port info to a file that the clients will read. To reproduce the issue, perform the following steps:
>>
>> 0. compile the test with "mpicc -o ben12 ben12.c"
>> 1. ssh to the machine that will be the server
>> 2. run ./ben12 3 1
>> 3. ssh to the machine that will be client #1
>> 4. run ./ben12 3 0
>> 5. repeat steps 3-4 for clients #2 and #3
>>
>> The server accepts the connection from client #1 and merges it into a new intracomm. It then accepts the connection from client #2 and merges it. When client #3 arrives, the server accepts the connection, but that causes clients #1 and #2 to die with the error above (see the complete trace in the tarball).
>>
>> The exact steps are:
>>
>> - server opens port
>> - server does accept
>> - client #1 does connect
>> - server and client #1 do merge
>> - server does accept
>> - client #2 does connect
>> - server, client #1 and client #2 do merge
>> - server does accept
>> - client #3 does connect
>> - server, client #1, client #2 and client #3 do merge
>>
>> My InfiniBand network works normally with other test programs and applications (MPI or others, like Verbs).
>>
>> Info about my setup:
>>
>> Open MPI version = 1.4.1 (I also tried 1.4.2, a nightly snapshot of 1.4.3, and a nightly snapshot of 1.5 --- all show the same error)
>> config.log in the tarball
>> "ompi_info --all" in the tarball
>> OFED version = 1.3, installed from RHEL 5.3
>> Distro = Red Hat Enterprise Linux 5.3
>> Kernel = 2.6.18-128.4.1.el5 x86_64
>> subnet manager = built-in SM of the Cisco/Topspin switch
>> output of ibv_devinfo included in the tarball (there are no "bad" nodes)
>> "ulimit -l" says "unlimited"
>>
>> The tarball contains:
>>
>> - ben12.c: my test program showing the behavior
>> - config.log / config.out / make.out / make-install.out / ifconfig.txt / ibv-devinfo.txt / ompi_info.txt
>> - trace-tcp.txt: output of the server and
[OMPI users] MPICH2 is working, OpenMPI is not
Hello,

I have developed a code which I tested with MPICH2, and it is working fine. But when I compile and run it with OpenMPI, it is not working. The output of the program, with the errors from OpenMPI, is below:

--
bibrak@barq:~/XXX> mpirun -np 4 ./exec 98
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
warning:regcache incompatible with malloc
Send count -- >> 25
Send count -- >> 25
Send count -- >> 24
Send count -- >> 24
Dis -- >> 0
Dis -- >> 25
Dis -- >> 50
Dis -- >> 74
0 d[0] = -14.025975
1 d[0] = -14.025975
-- 1 --
2 d[0] = -14.025975
-- 2 --
-- 0 --
3 d[0] = -14.025975
-- 3 --
[barq:27118] *** Process received signal ***
[barq:27118] Signal: Segmentation fault (11)
[barq:27118] Signal code: Address not mapped (1)
[barq:27118] Failing at address: 0x51681f96
[barq:27121] *** Process received signal ***
[barq:27121] Signal: Segmentation fault (11)
[barq:27121] Signal code: Address not mapped (1)
[barq:27121] Failing at address: 0x77b5685
[barq:27118] [ 0] [0xe410]
[barq:27118] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7d20f3c]
[barq:27118] [ 2] ./exec(main+0x2214) [0x804ad8d]
[barq:27118] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7cc9705]
[barq:27121] [ 0] [0xe410]
[barq:27121] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7d0ef3c]
[barq:27121] [ 2] ./exec(main+0x2214) [0x804ad8d]
[barq:27121] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7cb7705]
[barq:27121] [ 4] ./exec [0x8048b01]
[barq:27121] *** End of error message ***
[barq:27118] [ 4] ./exec [0x8048b01]
[barq:27118] *** End of error message ***
--
mpirun noticed that process rank 3 with PID 27121 on node barq exited on signal 11 (Segmentation fault).
--
[barq:27120] *** Process received signal ***
[barq:27120] Signal: Segmentation fault (11)
[barq:27120] Signal code: Address not mapped (1)
[barq:27120] Failing at address: 0x4bd1ca3e
[barq:27120] [ 0] [0xe410]
[barq:27120] [ 1] /lib/libc.so.6(cfree+0x9c) [0xb7c97f3c]
[barq:27120] [ 2] ./exec(main+0x2214) [0x804ad8d]
[barq:27120] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7c40705]
[barq:27120] [ 4] ./exec [0x8048b01]
[barq:27120] *** End of error message ***

Because of the "warning:regcache incompatible with malloc" warning, I did:

bibrak@barq:~/XXX> export MX_RCACHE=2

and then ignored the warning, but the error still remains.

I shall appreciate any help.

Bibrak Qamar
NUST-SEECS
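[Editor's note: the printed "Send count" and "Dis" values (25, 25, 24, 24 and 0, 25, 50, 74) look like counts and displacements for distributing 98 elements over 4 ranks, e.g. for an MPI_Scatterv. This is a guess at what the program computes, not Bibrak's actual code, but a crash inside cfree with that pattern very often means a buffer was allocated smaller than its count. The distribution itself can be checked in plain C:]

```c
/* Compute counts and displacements for spreading n elements over
 * `size` ranks, the first n % size ranks receiving one extra element.
 * For n = 98 and size = 4 this yields counts 25,25,24,24 and
 * displacements 0,25,50,74 -- exactly the values in the trace above. */
void block_distribution(int n, int size, int *counts, int *displs)
{
    int base = n / size, rem = n % size;
    for (int r = 0; r < size; r++) {
        counts[r] = base + (r < rem ? 1 : 0);
        displs[r] = (r == 0) ? 0 : displs[r - 1] + counts[r - 1];
    }
}
```

Each rank's receive buffer must then hold at least counts[rank] elements (e.g. malloc(counts[rank] * sizeof(double))); a buffer even one element short corrupts the heap, and glibc typically reports that later inside free()/cfree, as in the backtraces above.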
Re: [OMPI users] is loop unrolling safe for MPI logic?
On Sat, Jul 17, 2010 at 09:14:11AM -0700, Eugene Loh wrote:
> Jeff Squyres wrote:
>
> >On Jul 17, 2010, at 4:22 AM, Anton Shterenlikht wrote:
> >
> >>Is loop vectorisation/unrolling safe for MPI logic?
> >>I presume it is, but are there situations where
> >>loop vectorisation could e.g. violate the order
> >>of execution of MPI calls?
> >
> >I *assume* that the Intel compiler will not unroll loops that contain MPI function calls. That's obviously an assumption, but I would think that unless you put some pragmas in there that tell the compiler that it's safe to unroll, the compiler will be somewhat conservative about what it automatically unrolls.
>
> More generally, a Fortran compiler that optimizes aggressively could "break" MPI code.
>
> http://www.mpi-forum.org/docs/mpi-20-html/node236.htm#Node241
>
> That said, you may not need to worry about this in your particular case.

This is a very important point, many thanks Eugene. Fortran MPI programmers definitely need to pay attention to this. MPI-2.2 provides a slightly updated version of this guide:

http://www.mpi-forum.org/docs/mpi22-report/node343.htm#Node348

many thanks
anton

--
Anton Shterenlikht
Room 2.6, Queen's Building, Mech Eng Dept, Bristol University
University Walk, Bristol BS8 1TR, UK
Tel: +44 (0)117 331 5944  Fax: +44 (0)117 929 4423
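[Editor's note: for C code the question has a crisper answer than for Fortran. An MPI call is an opaque external function with observable side effects, so a conforming C compiler may unroll or vectorise the loop around it, but it cannot reorder or drop the calls themselves. A toy sketch of that observable-order guarantee, with a made-up `fake_mpi_send` standing in for a real MPI call:]

```c
/* Stand-in for an MPI call: an opaque function with a side effect.
 * The compiler is free to unroll the loop in send_all(), but the
 * observable order of the calls must match the source order. */
static int call_log[16];
static int call_count = 0;

static void fake_mpi_send(int tag)
{
    call_log[call_count++] = tag;   /* side effect that pins the order */
}

void send_all(int n)
{
    for (int i = 0; i < n; i++)     /* a candidate for unrolling */
        fake_mpi_send(i);
}
```

The Fortran hazards described at the links above (register optimisation across nonblocking calls, argument copy-in/copy-out) are a separate issue: there the compiler cannot see that MPI still holds a reference to the buffer, which is why the standard devotes a section to it.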
Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!
Check out PETSc: http://www.mcs.anl.gov/petsc/petsc-as/

On Jul 18, 2010, at 12:37 AM, Damien wrote:
> You should check out the MUMPS parallel linear solver.
>
> Damien
> Sent from my iPhone
>
> On 2010-07-17, at 5:16 PM, Daniel Janzon wrote:
>
>> Dear OpenMPI Users,
>>
>> I successfully installed OpenMPI on some FreeBSD machines and I can run MPI programs on the cluster. Yippie!
>>
>> But I'm not patient enough to write my own MPI-based routines, so I thought maybe I could ask here for suggestions. I am primarily interested in general linear algebra routines. The best would be to, for instance, start Octave and just use it as normal, only that all matrix operations would run on the cluster. Has anyone done that? The octave-parallel package seems to be something different.
>>
>> I installed ScaLAPACK and the test files ran successfully with mpirun (except a few of them). But the source code examples of ScaLAPACK look terrible. Is there no higher-level library that provides an API with matrix operations, with all the MPI parallelism handled for you in the background? Certainly a smart piece of software can decide better than me how to chunk up a matrix and pass it out to the available processes.
>>
>> All the best,
>> Daniel
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Ok, I've got OpenMPI set up, now what?!
You should check out the MUMPS parallel linear solver.

Damien
Sent from my iPhone

On 2010-07-17, at 5:16 PM, Daniel Janzon wrote:

> Dear OpenMPI Users,
>
> I successfully installed OpenMPI on some FreeBSD machines and I can run MPI programs on the cluster. Yippie!
>
> But I'm not patient enough to write my own MPI-based routines, so I thought maybe I could ask here for suggestions. I am primarily interested in general linear algebra routines. The best would be to, for instance, start Octave and just use it as normal, only that all matrix operations would run on the cluster. Has anyone done that? The octave-parallel package seems to be something different.
>
> I installed ScaLAPACK and the test files ran successfully with mpirun (except a few of them). But the source code examples of ScaLAPACK look terrible. Is there no higher-level library that provides an API with matrix operations, with all the MPI parallelism handled for you in the background? Certainly a smart piece of software can decide better than me how to chunk up a matrix and pass it out to the available processes.
>
> All the best,
> Daniel