Re: [OMPI users] MPI_ERR_COMM: invalid communicator using POP 1.2

2007-01-22 Thread Jeff Squyres
Looking at the web page for POP (http://climate.lanl.gov/Models/POP/index.shtml), it looks like POP 1.2 is pretty ancient.  I gather from your text that later versions work ok ("POP 2").


My first guess -- knowing nothing about the POP code itself -- is  
that there is a bug in the POP 1.2 code such that it is passing a bad  
parameter to MPI_CART_SHIFT, and that later versions (POP 2) fixed  
the problem.


Do you know if this is the case?


On Jan 19, 2007, at 8:06 PM, Axel Schweiger wrote:


I am having a problem running POP 1.2 (Parallel Ocean Model) with
OpenMPI version 1.1.2 compiled with PGI 6.2-4 on RH EL-4 Update 4
(configure result attached).

The error is as follows:

mpirun -v -np 4 -machinefile node18.dat pop
[node18:11220] *** An error occurred in MPI_Cart_shift
[node18:11221] *** An error occurred in MPI_Cart_shift
[node18:11221] *** on communicator MPI_COMM_WORLD
[node18:11221] *** MPI_ERR_COMM: invalid communicator
[node18:11221] *** MPI_ERRORS_ARE_FATAL (goodbye)
[node18:11220] *** on communicator MPI_COMM_WORLD
[node18:11220] *** MPI_ERR_COMM: invalid communicator
[node18:11220] *** MPI_ERRORS_ARE_FATAL (goodbye)
3 additional processes aborted (not shown)

The application runs fine with MPICH 1.2.6, and other applications
(POP 2) run fine with OpenMPI.


Any suggestions?

Thanks



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] OpenMPI/OpenIB/IMB hangs

2007-01-22 Thread Arif Ali
On Fri, 2007-01-19 at 20:15 -0500, Jeff Squyres wrote:
> On Jan 19, 2007, at 6:19 PM, Arif Ali wrote:
> 
> > > [0,1,59][btl_openib_component.c:1153:btl_openib_component_progress]
> > > from node16 to: node02 error polling HP CQ with status REMOTE ACCESS
> > > ERROR status number 10 for wr_id 268919352 opcode 256614836
> > > mpirun noticed that job rank 0 with PID 0 on node node02 exited on
> > > signal 15 (Terminated).
> > > 55 additional processes aborted (not shown)
> > does this happen with btl_openib_flags=1? Does this also happen
> > without this setting? This doesn't happen with OpenMPI-1.2b3, right?
> >
> > That's correct; I tried all the flags that were suggested, and a few
> > more, which I listed in previous mails
> 
> I can parse your text either way, so forgive me for belaboring the  
> point:

Sorry for not being clear

> - Does this happen with btl_openib_flags=1 on the nightly snapshot of  
> OMPI v1.2?

Yes

> - Does this happen without setting btl_openib_flags on the nightly  
> snapshot of OMPI v1.2?

Yes

> - What is the exact version of the nightly snapshot for OMPI v1.2  
> that you are using?

1.2b4r13137

> > Yes, correct, this doesn't happen with 1.2b3
> 
> Good to know.
> 
> Were you able to experiment with the various MCA parameters that I  
> described in the other mail to see if such problems went away?   
> (i.e., ensure that you're not running out of DMA-able memory)

Not yet; I'll be doing these today and will get back to you as soon as
I can.
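
For anyone reproducing this, MCA parameters such as btl_openib_flags can be set either on the mpirun command line or through the environment.  A sketch (the benchmark binary name here is hypothetical):

```shell
# Set an MCA parameter for a single run:
mpirun --mca btl_openib_flags 1 -np 4 ./IMB-MPI1

# Or export it with the OMPI_MCA_ prefix so every subsequent
# mpirun in this shell picks it up:
export OMPI_MCA_btl_openib_flags=1
mpirun -np 4 ./IMB-MPI1
```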

regards,
Arif 


Re: [OMPI users] MPI_ERR_COMM: invalid communicator using POP 1.2

2007-01-22 Thread Axel Schweiger

Jeff,
Thanks for your reply. Yes, POP 1.2 is dead w.r.t. development, but our
application still uses it. The 1.2-to-2.0 transition involves a lot of
physical differences, and for a while at least we are stuck with 1.2.


Can't say if there is a bug that was fixed, since there was a lot of
re-engineering going into 2.0. But I do know that POP 1.2 works fine
with the MPICH MPI implementation. Wouldn't you expect that a bad
parameter would produce the same error with MPICH?


Thanks much
Axel
Jeff Squyres wrote:
[quoted text trimmed]



Re: [OMPI users] MPI_ERR_COMM: invalid communicator using POP 1.2

2007-01-22 Thread Jeff Squyres

On Jan 22, 2007, at 11:53 AM, Axel Schweiger wrote:


Thanks for your reply. Yes, POP 1.2 is dead w.r.t. development, but our
application still uses it. The 1.2-to-2.0 transition involves a lot of
physical differences, and for a while at least we are stuck with 1.2.


Gotcha.


Can't say if there is a bug that was fixed, since there was a lot of
re-engineering going into 2.0. But I do know that POP 1.2 works fine
with the MPICH MPI implementation. Wouldn't you expect that a bad
parameter would produce the same error with MPICH?


Usually, but not always.  Mostly, this involves problems with C  
codes, but it can happen in Fortran as well.  Specifically, different  
run-time behaviors of MPI implementations can sometimes result in a  
code that runs under one MPI and not under another, typically (but  
not always) if the code makes some assumptions or violates the  
standard in some way.


I see in OMPI's MPI_CART_SHIFT, we only return the "bad communicator"  
error if we get an invalid communicator or an intercommunicator.  Are  
you familiar with the POP code at all to be able to dive into it to  
see where the problem is actually occurring?




Thanks much
Axel
Jeff Squyres wrote:
[quoted text trimmed]



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI users] MPI_ERR_COMM: invalid communicator using POP 1.2

2007-01-22 Thread Axel Schweiger

Jeff,

I'm afraid I'm not familiar enough to dive into it. I suspect that,
between the fact that we have a working MPI implementation (MPICH) and
the fact that this version of the POP model is superseded, it is
probably not worth the effort to spend a lot of time on it.

I was hoping that this was maybe a "typical" error that could be
treated with different compiler switches, or that it mapped to a known
bug/incompatibility in OpenMPI.

If this isn't the case, it's probably best to drop it?

Thanks for your offer to help though!

Axel
Jeff Squyres wrote:
[quoted text trimmed]


Re: [OMPI users] MPI_ERR_COMM: invalid communicator using POP 1.2

2007-01-22 Thread Jeff Squyres

On Jan 22, 2007, at 2:59 PM, Axel Schweiger wrote:

I'm afraid I'm not familiar enough to dive into it. I suspect that,
between the fact that we have a working MPI implementation (MPICH) and
the fact that this version of the POP model is superseded, it is
probably not worth the effort to spend a lot of time on it.

I was hoping that this was maybe a "typical" error that could be
treated with different compiler switches, or that it mapped to a known
bug/incompatibility in OpenMPI.


Sorry.  :-(


If this isn't the case, it's probably best to drop it?


I'm sorry that I don't have the cycles to dive into this.   I'm  
*guessing* that it's an application problem, but without actually  
looking into it, it's impossible to know.


--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems