Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
I'm responsible for some sm changes in 1.3.2, so I can try looking at this. Some questions below:

Simone Pellegrini wrote:
> Dear all,
> I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs the processes hang. This is the kernel of the application I am trying to run:
>
> MPI_Barrier(MPI_COMM_WORLD);
> total = MPI_Wtime();
> for(i=0; i0) MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
> for(k=0; k

Do you know if this kernel is sufficient to reproduce the problem? How large is N? Evidently it's greater than 1600, but I'm still curious how big. What are top and bottom? Are they rank+1 and rank-1?

> Sometimes the program terminates correctly, sometimes it doesn't!

Roughly, what fraction of runs hang? 50%? 1%? <0.1%?

> I am running the program using the shared-memory module, because I am using just one multi-core node, with the following command:
>
> mpirun --mca btl self,sm --np 32 ./my_prog prob_size

Any idea if this fails at lower np?

> If I print the index number during the program execution I can see that the program stops running around index value 1600... but it doesn't actually crash. It just stops! :(
> I ran the program under strace to see what's going on and this is the output: [...]
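A couple of narrowing-down runs along the lines of the questions above (my suggestion, not something from the thread; my_prog and prob_size are the poster's own placeholders):

mpirun --mca btl self,sm --np 16 ./my_prog prob_size      (same sm btl, half the processes)
mpirun --mca btl self,tcp --np 32 ./my_prog prob_size     (same process count, tcp instead of sm)

If the tcp run never hangs while the sm run does, that points at the shared-memory path rather than at gcc-4.4 code generation in the application itself.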
[OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Dear all,
I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs the processes hang. This is the kernel of the application I am trying to run:

MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime();
for(i=0; i0) MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
for(k=0; k

Sometimes the program terminates correctly, sometimes it doesn't!

I am running the program using the shared-memory module, because I am using just one multi-core node, with the following command:

mpirun --mca btl self,sm --np 32 ./my_prog prob_size

If I print the index number during the program execution I can see that the program stops running around index value 1600... but it doesn't actually crash. It just stops! :(

I ran the program under strace to see what's going on and this is the output:

[...]
futex(0x2b20c02d9790, FUTEX_WAKE, 1) = 1
futex(0x2afcf2b0, FUTEX_WAKE, 1) = 0
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
futex(0x2afcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2afcf5e0, FUTEX_WAKE, 1) = 0
writev(102, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN}, ...], 39, 1000) = 1
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
writev(109, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1

and the program keeps printing this poll() call till I stop it!

The program runs perfectly with my old configuration, which was OpenMPI 1.3.1. Actually, I see the same problem when I compile Openmpi-1.3.1 with Gcc 4.4. Is there any conflict that arises when gcc-4.4 is used?

Regards,
Simone
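Since the kernel as quoted above lost its loop bounds and conditionals in the mail, here is a minimal, self-contained sketch of the same communication pattern for anyone who wants to try reproducing the hang. The neighbour arithmetic, problem size, and iteration count are assumptions, not Simone's actual code.

/* repro.c - neighbour exchange over MPI_Sendrecv, modelled on the truncated kernel above */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, top, bottom;
    int N = 2048;        /* assumed message length (floats per row) */
    int iters = 5000;    /* assumed iteration count */
    float *sendrow, *row;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendrow = calloc(N, sizeof(float));   /* row sent to the "top" neighbour */
    row     = calloc(N, sizeof(float));   /* row received from the "bottom" neighbour */
    top     = (rank + 1) % size;          /* assumed periodic neighbour layout */
    bottom  = (rank - 1 + size) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    for (i = 0; i < iters; i++) {
        /* same call shape as the original kernel: send one row up, receive one row from below */
        MPI_Sendrecv(sendrow, N, MPI_FLOAT, top, 0,
                     row, N, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, &status);
        if (rank == 0 && i % 500 == 0)
            printf("iteration %d\n", i);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    free(sendrow);
    free(row);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched the same way as above (mpirun --mca btl self,sm -np 32 ./repro), it exercises the same MPI_Sendrecv traffic over the sm btl.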
Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
Immediately Sir !!! :)
Thanks again Ralph

Geoffroy

> Message: 2
> Date: Thu, 30 Apr 2009 06:45:39 -0600
> From: Ralph Castain
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users
> Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated.
>
> Ralph
>
> [...]
Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated.

Ralph

On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:

Ah now, I didn't say it -worked-, did I? :-)

Clearly a bug exists in the program. I'll try to take a look at it (if Lenny doesn't get to it first), but it won't be until later in the week.

On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:

I agree with you Ralph, and that's what I expect from openmpi, but my second example shows that it's not working:

cat hostfile.0
r011n002 slots=4
r011n003 slots=4

cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
### CRASHED

> Error, invalid rank (1) in the rankfile (rankfile.0)
> --
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
> --
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --
> --
> orterun noticed that the job aborted, but has no info as to the process that caused that situation.
> --
> orterun: clean termination accomplished

Message: 4
Date: Tue, 14 Apr 2009 06:55:58 -0600
From: Ralph Castain
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users
Message-ID:
Content-Type: text/plain; charset="us-ascii"; Format="flowed"; DelSp="yes"

The rankfile cuts across the entire job - it isn't applied on an app_context basis. So the ranks in your rankfile must correspond to the eventual rank of each process in the cmd line.

Unfortunately, that means you have to count ranks. In your case, you only have four, so that makes life easier. Your rankfile would look something like this:

rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

HTH
Ralph

On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:

> Hi,
>
> I agree that my examples are not very clear. What I want to do is to launch a multi-exe application (masters-slaves) and benefit from processor affinity.
> Could you show me how to convert this command, using the -rf option (whatever the affinity is):
>
> mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4
>
> Thanks for your help
>
> Geoffroy
>
> Message: 2
> Date: Sun, 12 Apr 2009 18:26:35 +0300
> From: Lenny Verkhovsky
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users
> Message-ID: <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> The first "crash" is OK, since your rankfile has ranks 0 and 1 defined, while n=1, which means only rank 0 is present and can be allocated.
>
> NP must be >= the largest rank in the rankfile.
>
> What exactly are you trying to do?
>
> I tried to recreate your segv but all I got was
>
> ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
> --
> It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer)
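To make the answer concrete: combining Ralph's rankfile above with Geoffroy's four app contexts would presumably look something like the following untested sketch (the hostfile contents are assumed, and whether the per-app-context -host arguments may be kept alongside the rankfile has not been verified here).

cat hostfile
r001n001 slots=4
r001n002 slots=4

cat rankfile
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

mpirun --hostfile hostfile -rf rankfile -n 1 master.x options1 : -n 1 master.x options2 : -n 1 slave.x options3 : -n 1 slave.x options4

The ranks count across all app contexts in order, so ranks 0 and 2 (master.x options1, slave.x options3) land on r001n001 and ranks 1 and 3 on r001n002, matching the placement of the original -host version.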
Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3
On Apr 30, 2009, at 5:33 AM, jan wrote:

> Thank You Jeff Squyres. Could you suggest the method to run layer 0 diagnostics to know if the fabric is clean? I have contacted Dell local (Taiwan); I don't think they are familiar with Open MPI or even the InfiniBand module.

Note that the layer 0 diagnostics I'm referring to are IB diagnostics, not Open MPI diagnostics. You need to ensure that your fabric is functioning properly.

> Does anyone have the IB stack hang problem with the Mellanox ConnectX product?

FWIW: we do quite a bit of development and automated regression testing on IB (including ConnectX) every day.

--
Jeff Squyres
Cisco Systems
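For the "how do I run layer 0 diagnostics" question: the usual starting points are the OFED / infiniband-diags tools; the names below are common ones rather than commands taken from this thread, so check the man pages of your OFED release for the exact options.

ibstat           (confirm each HCA port reports State: Active and Physical state: LinkUp)
ibdiagnet        (fabric-wide discovery and error reporting)
ibcheckerrors    (scan the fabric for ports whose error counters exceed thresholds)

If these come back clean and the stack still hangs, that points more toward a host-side driver or firmware issue than toward cabling or switch problems.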
[OMPI users] Problem with Filem
Hello,
I have a problem with the FileM module when I checkpoint on a remote host without a shared file system. I use the new Open MPI 1.3.2, and it is the same problem as in version 1.3.1. Indeed, when I use an NFS file system it works, so I guess the problem is with FileM.

[azur-6.fr:23223] filem:rsh: wait_all(): Wait failed (-1)
[azur-6.fr:23223] [[48784,0],0] ORTE_ERROR_LOG: Error in file /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c at line 1054

--
Cordialement,
Mohamed-Slim BOUGUERRA
PhD student, INRIA-Grenoble / Projet MOAIS
ENSIMAG - antenne de Montbonnot
ZIRST 51, avenue Jean Kuntzmann
38330 MONTBONNOT SAINT MARTIN, France
Tel: +33 (0)4 76 61 20 79
Fax: +33 (0)4 76 61 20 99
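For context (a sketch under assumptions, not part of the original report): the 1.3-series checkpoint/restart workflow that produces this error is roughly the one below, assuming a build configured with checkpoint/restart support (--with-ft=cr); the snapc_base_global_snapshot_dir parameter name comes from the 1.3 C/R documentation and is worth re-checking against ompi_info output.

mpirun -np 4 -am ft-enable-cr -mca snapc_base_global_snapshot_dir /path/to/snapshots ./my_app
ompi-checkpoint -v <pid_of_mpirun>
ompi-restart <snapshot_handle_reported_by_ompi-checkpoint>

Because the rsh FileM component is what stages the snapshot files when there is no shared filesystem, it is also worth confirming that passwordless ssh/rsh works in both directions between the compute nodes and the node running mpirun, and that the global snapshot directory exists and is writable there.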
Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3
Thank You Jeff Squyres. Could you suggest the method to run layer 0 diagnostics to know if the fabric is clean? I have contacted Dell local (Taiwan); I don't think they are familiar with Open MPI or even the InfiniBand module. Does anyone have the IB stack hang problem with the Mellanox ConnectX product?

Thank you again.
Best Regards,
Gloria Jan
Wavelink Technology Inc

I can confirm that I have exactly the same problem, also on a Dell system, even with the latest openmpi. Our system is:

Dell M905
OpenSUSE 11.1, kernel 2.6.27.21-0.1-default
ofed-1.4-21.12 from the SUSE repositories
OpenMPI-1.3.2

But what I can also add: it does not only affect openmpi. If these messages are triggered after mpirun:

[node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success

then the IB stack hangs. You cannot even reload it; you have to reboot the node.

Something that severe should not be able to be caused by Open MPI. Specifically: Open MPI should not be able to hang the OFED stack.

Have you run layer 0 diagnostics to know that your fabric is clean? You might want to contact your IB vendor to find out how to do that.

--
Jeff Squyres
Cisco Systems