Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
I'm responsible for some sm changes in 1.3.2, so I can try looking at this. Some questions below:

Simone Pellegrini wrote:
> Dear all,
> I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs the processes hang. This is the kernel of the application I am trying to run:
>
> MPI_Barrier(MPI_COMM_WORLD);
> total = MPI_Wtime();
> for(i=0; i0) MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
> for(k=0; k

Do you know if this kernel is sufficient to reproduce the problem? How large is N? Evidently it's greater than 1600, but I'm still curious how big. What are top and bottom? Are they rank+1 and rank-1?

> Sometimes the program terminates correctly, sometimes it doesn't!

Roughly, what fraction of runs hang? 50%? 1%? <0.1%?

> I am running the program using the shared-memory module, because I am using just one multi-core node, with the following command:
>
> mpirun --mca btl self,sm --np 32 ./my_prog prob_size

Any idea if this fails at lower np?

> If I print the index number during the program execution I can see that the program stops running around index value 1600... but it doesn't actually crash. It just stops! :(
> I ran the program under strace to see what's going on and this is the output: [...]
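A couple of narrowing-down runs along the lines of the questions above (my suggestion, not something from the thread; my_prog and prob_size are the poster's own placeholders):

mpirun --mca btl self,sm --np 16 ./my_prog prob_size      (same sm btl, half the processes)
mpirun --mca btl self,tcp --np 32 ./my_prog prob_size     (same process count, tcp instead of sm)

If the tcp run never hangs while the sm run does, that points at the shared-memory path rather than at gcc-4.4 code generation in the application itself.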
[OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
Dear all,
I have successfully compiled and installed openmpi 1.3.2 on an 8-socket quad-core machine from Sun. I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase, but when I try to run simple MPI programs the processes hang. This is the kernel of the application I am trying to run:

MPI_Barrier(MPI_COMM_WORLD);
total = MPI_Wtime();
for(i=0; i0) MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
for(k=0; k

Sometimes the program terminates correctly, sometimes it doesn't!

I am running the program using the shared-memory module, because I am using just one multi-core node, with the following command:

mpirun --mca btl self,sm --np 32 ./my_prog prob_size

If I print the index number during the program execution I can see that the program stops running around index value 1600... but it doesn't actually crash. It just stops! :(

I ran the program under strace to see what's going on and this is the output:

[...]
futex(0x2b20c02d9790, FUTEX_WAKE, 1) = 1
futex(0x2afcf2b0, FUTEX_WAKE, 1) = 0
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
futex(0x2afcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x2afcf5e0, FUTEX_WAKE, 1) = 0
writev(102, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN}, ...], 39, 1000) = 1
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"..., 36}], 1) = 36
readv(100, [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 1) = 28
futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
writev(109, [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"..., 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}], 2) = 64
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1

and the program keeps printing this poll() call till I stop it!

The program runs perfectly with my old configuration, which was OpenMPI 1.3.1. Actually, I see the same problem when I compile Openmpi-1.3.1 with Gcc 4.4. Is there any conflict that arises when gcc-4.4 is used?

Regards,
Simone
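Since the kernel as quoted above lost its loop bounds and conditionals in the mail, here is a minimal, self-contained sketch of the same communication pattern for anyone who wants to try reproducing the hang. The neighbour arithmetic, problem size, and iteration count are assumptions, not Simone's actual code.

/* repro.c - neighbour exchange over MPI_Sendrecv, modelled on the truncated kernel above */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i, top, bottom;
    int N = 2048;        /* assumed message length (floats per row) */
    int iters = 5000;    /* assumed iteration count */
    float *sendrow, *row;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendrow = calloc(N, sizeof(float));   /* row sent to the "top" neighbour */
    row     = calloc(N, sizeof(float));   /* row received from the "bottom" neighbour */
    top     = (rank + 1) % size;          /* assumed periodic neighbour layout */
    bottom  = (rank - 1 + size) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    for (i = 0; i < iters; i++) {
        /* same call shape as the original kernel: send one row up, receive one row from below */
        MPI_Sendrecv(sendrow, N, MPI_FLOAT, top, 0,
                     row, N, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, &status);
        if (rank == 0 && i % 500 == 0)
            printf("iteration %d\n", i);
    }
    MPI_Barrier(MPI_COMM_WORLD);

    free(sendrow);
    free(row);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched the same way as above (mpirun --mca btl self,sm -np 32 ./repro), it exercises the same MPI_Sendrecv traffic over the sm btl.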
Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
Immediately Sir !!! :)
Thanks again Ralph

Geoffroy

> Message: 2
> Date: Thu, 30 Apr 2009 06:45:39 -0600
> From: Ralph Castain
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users
> Message-ID: <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated.
>
> Ralph
>
> [...]
Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
I believe this is fixed now in our development trunk - you can download any tarball starting from last night and give it a try, if you like. Any feedback would be appreciated.

Ralph

On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:

Ah now, I didn't say it -worked-, did I? :-)

Clearly a bug exists in the program. I'll try to take a look at it (if Lenny doesn't get to it first), but it won't be until later in the week.

On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:

I agree with you Ralph, and that's what I expect from openmpi, but my second example shows that it's not working:

cat hostfile.0
r011n002 slots=4
r011n003 slots=4

cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
### CRASHED

> Error, invalid rank (1) in the rankfile (rankfile.0)
> --
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
> [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
> --
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared libraries on the remote node. You may set your LD_LIBRARY_PATH to have the location of the shared libraries on the remote nodes and this will automatically be forwarded to the remote nodes.
> --
> --
> orterun noticed that the job aborted, but has no info as to the process that caused that situation.
> --
> orterun: clean termination accomplished

Message: 4
Date: Tue, 14 Apr 2009 06:55:58 -0600
From: Ralph Castain
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users
Message-ID:
Content-Type: text/plain; charset="us-ascii"; Format="flowed"; DelSp="yes"

The rankfile cuts across the entire job - it isn't applied on an app_context basis. So the ranks in your rankfile must correspond to the eventual rank of each process in the cmd line.

Unfortunately, that means you have to count ranks. In your case, you only have four, so that makes life easier. Your rankfile would look something like this:

rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

HTH
Ralph

On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:

> Hi,
>
> I agree that my examples are not very clear. What I want to do is to launch a multi-exe application (masters-slaves) and benefit from processor affinity.
> Could you show me how to convert this command, using the -rf option (whatever the affinity is):
>
> mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4
>
> Thanks for your help
>
> Geoffroy
>
> Message: 2
> Date: Sun, 12 Apr 2009 18:26:35 +0300
> From: Lenny Verkhovsky
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users
> Message-ID: <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> The first "crash" is OK, since your rankfile has ranks 0 and 1 defined, while n=1, which means only rank 0 is present and can be allocated.
>
> NP must be >= the largest rank in the rankfile.
>
> What exactly are you trying to do?
>
> I tried to recreate your segv but all I got was
>
> ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux" uses an MCA interface that is not recognized (component MCA v1.0.0 != supported MCA v2.0.0) -- ignored
> --
> It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer)
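To make the answer concrete: combining Ralph's rankfile above with Geoffroy's four app contexts would presumably look something like the following untested sketch (the hostfile contents are assumed, and whether the per-app-context -host arguments may be kept alongside the rankfile has not been verified here).

cat hostfile
r001n001 slots=4
r001n002 slots=4

cat rankfile
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

mpirun --hostfile hostfile -rf rankfile -n 1 master.x options1 : -n 1 master.x options2 : -n 1 slave.x options3 : -n 1 slave.x options4

The ranks count across all app contexts in order, so ranks 0 and 2 (master.x options1, slave.x options3) land on r001n001 and ranks 1 and 3 on r001n002, matching the placement of the original -host version.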
Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3
On Apr 30, 2009, at 5:33 AM, jan wrote:

> Thank You Jeff Squyres. Could you suggest the method to run layer 0 diagnostics to know if the fabric is clean? I have contacted Dell local (Taiwan); I don't think they are familiar with Open MPI or even the InfiniBand module.

Note that the layer 0 diagnostics I'm referring to are IB diagnostics, not Open MPI diagnostics. You need to ensure that your fabric is functioning properly.

> Does anyone have the IB stack hang problem with the Mellanox ConnectX product?

FWIW: we do quite a bit of development and automated regression testing on IB (including ConnectX) every day.

--
Jeff Squyres
Cisco Systems
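For the "how do I run layer 0 diagnostics" question: the usual starting points are the OFED / infiniband-diags tools; the names below are common ones rather than commands taken from this thread, so check the man pages of your OFED release for the exact options.

ibstat           (confirm each HCA port reports State: Active and Physical state: LinkUp)
ibdiagnet        (fabric-wide discovery and error reporting)
ibcheckerrors    (scan the fabric for ports whose error counters exceed thresholds)

If these come back clean and the stack still hangs, that points more toward a host-side driver or firmware issue than toward cabling or switch problems.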
[OMPI users] Problem with Filem
Hello,
I have a problem with the FileM module when I checkpoint on a remote host without a shared file system. I use the new Open MPI 1.3.2, and it is the same problem as in version 1.3.1. Indeed, when I use an NFS file system it works, so I guess the problem is with FileM.

[azur-6.fr:23223] filem:rsh: wait_all(): Wait failed (-1)
[azur-6.fr:23223] [[48784,0],0] ORTE_ERROR_LOG: Error in file /home/grenoble/msbouguerra/openmpi-1.3.2/orte/mca/snapc/full/snapc_full_global.c at line 1054

--
Cordialement,
Mohamed-Slim BOUGUERRA
PhD student, INRIA-Grenoble / Projet MOAIS
ENSIMAG - antenne de Montbonnot
ZIRST 51, avenue Jean Kuntzmann
38330 MONTBONNOT SAINT MARTIN, France
Tel: +33 (0)4 76 61 20 79
Fax: +33 (0)4 76 61 20 99
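For context (a sketch under assumptions, not part of the original report): the 1.3-series checkpoint/restart workflow that produces this error is roughly the one below, assuming a build configured with checkpoint/restart support (--with-ft=cr); the snapc_base_global_snapshot_dir parameter name comes from the 1.3 C/R documentation and is worth re-checking against ompi_info output.

mpirun -np 4 -am ft-enable-cr -mca snapc_base_global_snapshot_dir /path/to/snapshots ./my_app
ompi-checkpoint -v <pid_of_mpirun>
ompi-restart <snapshot_handle_reported_by_ompi-checkpoint>

Because the rsh FileM component is what stages the snapshot files when there is no shared filesystem, it is also worth confirming that passwordless ssh/rsh works in both directions between the compute nodes and the node running mpirun, and that the global snapshot directory exists and is writable there.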
Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message3
Thank You Jeff Squyres. Could you suggest the method to run layer 0 diagnostics to know if the fabric is clean? I have contacted Dell local (Taiwan); I don't think they are familiar with Open MPI or even the InfiniBand module. Does anyone have the IB stack hang problem with the Mellanox ConnectX product?

Thank you again.
Best Regards,
Gloria Jan
Wavelink Technology Inc

I can confirm that I have exactly the same problem, also on a Dell system, even with the latest openmpi. Our system is:

Dell M905
OpenSUSE 11.1, kernel 2.6.27.21-0.1-default
ofed-1.4-21.12 from the SUSE repositories
OpenMPI-1.3.2

But what I can also add: it does not only affect openmpi. If these messages are triggered after mpirun:

[node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error polling HP CQ with -2 errno says Success

then the IB stack hangs. You cannot even reload it; you have to reboot the node.

Something that severe should not be able to be caused by Open MPI. Specifically: Open MPI should not be able to hang the OFED stack.

Have you run layer 0 diagnostics to know that your fabric is clean? You might want to contact your IB vendor to find out how to do that.

--
Jeff Squyres
Cisco Systems