Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION

2011-05-09 Thread Jeff Squyres
On May 9, 2011, at 6:57 AM, hi wrote:

> The test program works fine, but you can notice the difference between the
> callstack images of the test program and of my actual application.
> 
> In the test program it calls mca_coll_self_allreduce_intra while in my

It doesn't for me...?  

The "self" in there refers to the fact that this is a collective on 
MPI_COMM_SELF (or a dup of it).  If you run with -np 1, MPI_COMM_WORLD is 
effectively a dup of MPI_COMM_SELF.

Hence, it's essentially a no-op.

> application it calls mca_coll_basic_allreduce_intra.

The "basic" in there means that this is not a no-op and it needs to do 
something.

I ran your test program with -np 2 and -np 4 and it seemed to work ok.
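
For reference, here's a minimal C sketch of the same kind of reduction (my
assumed analogue of your Fortran test; MPI_DOUBLE on the C side plays the
role of MPI_DOUBLE_PRECISION).  With -np 1 it goes through the "self"
module; with -np 2 or more it goes through a real module such as "basic":

    /* allreduce_sum.c -- minimal sketch, not your actual test program */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = (double)(rank + 1);
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: sum = %f\n", rank, size, global);
        MPI_Finalize();
        return 0;
    }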

> So I want to know which parameter or setting makes the call go to
> mca_coll_basic_allreduce_intra rather than
> mca_coll_self_allreduce_intra; if you can comment on this, it would be
> helpful.
> 
> Just for more information:
> op->o_func.intrinsic.fns[27]  points to 0 when using
> MPI_Allreduce(...,...,...,MPI_DOUBLE_PRECISION, MPI_SUM,...,...)

You didn't answer my prior questions.  :-)

Also note that the op pointers are not set by communicator -- they're fixed for 
all uses of that op.
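
If you want to poke at which coll components are in play, a couple of things
you could try (just a sketch -- exact output depends on your install, and on
Windows you'd use findstr rather than grep):

    # list the coll components that were built into your install
    ompi_info | grep "MCA coll"

    # restrict a run to specific coll components, e.g. only basic + self
    mpirun --mca coll basic,self -np 2 mar_f_dp.exe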

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Request for F90 bindings for Windows Builds

2011-05-09 Thread Jeff Squyres
We have F90 support in Open MPI; it's a question of Fortran support in the 
Windows binary builds.

The Windows binaries are only available for the 1.5.x series (it is highly 
unlikely that we will provide Windows binaries for the 1.4.x series).  I 
*thought* that we had added Fortran support to those 1.5.x Windows binaries 
recently...

Ah, I see: 

http://www.open-mpi.org/software/ompi/v1.5/ms-windows.php

It specifically only lists F77 bindings.  

Shiqing -- can we add the F90 bindings, too?


On May 9, 2011, at 7:01 AM, hi wrote:

> I also vote for F90 support in OpenMPI.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Request for F90 bindings for Windows Builds

2011-05-09 Thread hi
I also vote for F90 support in OpenMPI.


Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION

2011-05-09 Thread hi
Hi Jeff,

The test program works fine, but you can notice the difference between the
callstack images of the test program and of my actual application.

In the test program it calls mca_coll_self_allreduce_intra while in my
application it calls mca_coll_basic_allreduce_intra.
So I want to know which parameter or setting makes the call go to
mca_coll_basic_allreduce_intra rather than
mca_coll_self_allreduce_intra; if you can comment on this, it would be
helpful.

Just for more information:
op->o_func.intrinsic.fns[27]  points to 0 when using
MPI_Allreduce(...,...,...,MPI_DOUBLE_PRECISION, MPI_SUM,...,...)

Thank you.
-Hiral


Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION

2011-05-09 Thread Jeff Squyres
Please send all the information listed here:

http://www.open-mpi.org/community/help/

I am able to run your test program with no problem, so I'm not quite sure what 
the issue is...?

If op->o_func.intrinsic.fns[27] initially points to a valid value and then 
later it points to 0, that could imply that there is memory corruption 
occurring in your application somewhere.  Have you tried running through a 
memory-checking debugger?
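
For example, on a Linux build of the same code, something like this (just a
sketch -- for your Windows binaries you'd use whatever memory checker you
have available there instead):

    mpirun -np 2 valgrind --track-origins=yes ./myApplication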


On May 6, 2011, at 9:56 AM, hi wrote:

> I am observing a crash in the MPI_Allreduce() call from my actual application.
> After debugging I found that with MPI_DOUBLE_PRECISION, MPI_Allreduce() ends
> up calling a NULL function pointer in the following code in op.h:
> 
> if (0 != (op->o_flags & OMPI_OP_FLAGS_INTRINSIC)) {
>     op->o_func.intrinsic.fns[ompi_op_ddt_map[dtype->id]](source, target,
>         &count, &dtype,
>         op->o_func.intrinsic.modules[ompi_op_ddt_map[dtype->id]]);
> 
> where, o_func.intrinsic.fns[27] points to 0.

> On further debugging, I found that it is making a call to
> mca_coll_basic_reduce_lin_intra(); see the trace below...
> 
>>  libmpid.dll!ompi_op_reduce(ompi_op_t * op, void * source, void * 
>> target, int count, ompi_datatype_t * dtype)  Line 500  C++
>   libmpid.dll!mca_coll_basic_reduce_lin_intra(void * sbuf, void *
> rbuf, int count, ompi_datatype_t * dtype, ompi_op_t * op, int root,
> ompi_communicator_t * comm, mca_coll_base_module_2_0_0_t * module)
> Line 249C++
>   libmpid.dll!mca_coll_sync_reduce(void * sbuf, void * rbuf, int
> count, ompi_datatype_t * dtype, ompi_op_t * op, int root,
> ompi_communicator_t * comm, mca_coll_base_module_2_0_0_t * module)
> Line 45 + 0xd4 bytesC++
>   libmpid.dll!mca_coll_basic_allreduce_intra(void * sbuf, void * rbuf,
> int count, ompi_datatype_t * dtype, ompi_op_t * op,
> ompi_communicator_t * comm, mca_coll_base_module_2_0_0_t * module)
> Line 57 + 0x58 bytesC++
>   libmpid.dll!MPI_Allreduce(void * sendbuf, void * recvbuf, int count,
> ompi_datatype_t * datatype, ompi_op_t * op, ompi_communicator_t *
> comm)  Line 107 + 0x5c bytesC++
>   libmpi_f77d.dll!mpi_allreduce_f(char * sendbuf, char * recvbuf, int
> * count, int * datatype, int * op, int * comm, int * ierr)  Line 79 +
> 0x34 bytes  C++
>   libmpi_f77d.dll!MPI_ALLREDUCE(char * sendbuf, char * recvbuf, int *
> count, int * datatype, int * op, int * comm, int * ierr)  Line 53 +
> 0x67 bytes  C++
> 
> 
> Now, to simulate this problem, the attached test program works fine, but
> I observed a completely different callstack; see the attached images...
> 
> Just for information: I am executing my application using the following command:
> c:/openmpi/bin/orterun -mca mca_component_show_load_errors 0 --prefix
> ... -x ... -x ...  --machinefile ... -np 2 myApplication
> 
> And the test program using the following command:
> c:/openmpi/bin/mpirun mar_f_dp.exe
> 
> 
> Please let me know based on what criteria "coll_reduce" points to
> mca_coll_basic_allreduce_intra() or mca_coll_self_allreduce_intra();
> this would help me to debug my application further.
> 
> Thank you in advance.
> -Hiral


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
On May 3, 2011, at 6:42 AM, Dave Love wrote:

>> We managed to have another user hit the bug that causes collectives (this 
>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>> 
>> btl_openib_cpc_include rdmacm
> 
> Could someone explain this?  We also have problems with collective hangs
> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
> see any relevant issues filed.  However, rdmacm isn't an available value
> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
> that I understand what these things are...).

Sorry for the delay -- perhaps an IB vendor can reply here with more detail...

We had a user-reported issue of some hangs that the IB vendors have been unable 
to replicate in their respective labs.  We *suspect* that it may be an issue 
with the oob openib CPC, but that code is pretty old and pretty mature, so all 
of us would be at least somewhat surprised if that were the case.  If anyone 
can reliably reproduce this error, please let us know and/or give us access to 
your machines -- we have not closed this issue, but are unable to move forward 
because the customers who reported this issue switched to rdmacm and moved on 
(i.e., we don't have access to their machines to test any more).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
Sorry for the delay on this -- it looks like the problem is caused by messages 
like this (from your first message):

[nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port

RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID 
where you want to use it.
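
For example (just a sketch; it assumes ib0 is the IPoIB interface for the
port in question):

    # verify that the port has an IPoIB interface with an IP address
    ifconfig ib0

    # then explicitly ask for the rdmacm CPC
    mpirun --mca btl_openib_cpc_include rdmacm -np 2 ./your_app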


On May 5, 2011, at 1:15 PM, Brock Palen wrote:

> Yeah, we have run into more issues, with rdmacm not being available on all of 
> our hosts.  So it would be nice to know what we can do to test that a host 
> would support rdmacm.
> 
> Example:
> 
> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:   nyx5067.engin.umich.edu
>  Local device: mlx4_0
>  Local port:   1
>  CPCs attempted:   rdmacm
> --
> 
> This is one of our QDR hosts that rdmacm generally works on, which this code 
> (CRASH) requires to avoid a collective hang in MPI_Allreduce().
> 
> I look on this hosts and I find:
> [root@nyx5067 ~]# rpm -qa | grep rdma
> librdmacm-1.0.11-1
> librdmacm-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-devel-1.0.11-1
> librdmacm-utils-1.0.11-1
> 
> So all the libraries are installed (I think); is there a way to verify this?  
> Or to have OpenMPI be more verbose about what caused rdmacm to fail as an oob 
> option?
> 
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 3, 2011, at 9:42 AM, Dave Love wrote:
> 
>> Brock Palen  writes:
>> 
>>> We managed to have another user hit the bug that causes collectives (this 
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>> 
>>> btl_openib_cpc_include rdmacm
>> 
>> Could someone explain this?  We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed.  However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] is there an equiv of iprobe for bcast?

2011-05-09 Thread Jeff Squyres
On May 3, 2011, at 8:20 PM, Randolph Pullen wrote:

> Sorry, I meant to say:
> - on each node there is 1 listener and 1 worker.
> - all workers act together when any of the listeners sends them a request.
> - currently I must use an extra clearinghouse process to receive from any of 
> the listeners and bcast to workers; this is unfortunate because of the 
> potential scaling issues.
> 
> I think you have answered this in that I must wait for MPI-3's non-blocking 
> collectives.

Yes and no.  If each worker starts N non-blocking broadcasts just to be able to 
test for completion of any of them, you might end up consuming a bunch of 
resources for them (I'm *anticipating* that pending non-blocking collective 
requests may be more heavyweight than pending non-blocking point-to-point 
requests).

But then again, if N is small, it might not matter.
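
To make that concrete, here's roughly what the MPI-3 interface should let you
write (a sketch against the proposed MPI_Ibcast / MPI_Testany calls -- not
available in today's releases, and all names below are assumptions):

    /* Sketch: each worker pre-posts one non-blocking bcast per listener
       communicator, then polls for whichever completes first.  Note that
       the other N-1 requests stay outstanding -- that's the resource
       concern mentioned above. */
    #include <mpi.h>

    #define N     4        /* number of listeners (assumed) */
    #define COUNT 1024

    void wait_for_any_bcast(MPI_Comm listener_comm[N], char buf[N][COUNT])
    {
        MPI_Request reqs[N];
        int i, idx, flag;

        for (i = 0; i < N; i++)
            MPI_Ibcast(buf[i], COUNT, MPI_BYTE, 0, listener_comm[i], &reqs[i]);

        do {
            /* flag becomes nonzero (and idx is set) once any bcast completes */
            MPI_Testany(N, reqs, &idx, &flag, MPI_STATUS_IGNORE);
            /* ... do other useful work here ... */
        } while (!flag);

        /* buf[idx] now holds the broadcast from listener idx */
    }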

> Can anyone suggest another way?  I don't like the serial clearinghouse 
> approach.

If you only have a few workers and/or the broadcast message is small and/or the 
broadcasts aren't frequent, then MPI's built-in broadcast algorithms might not 
offer much more optimization than doing your own with point-to-point 
mechanisms.  I don't usually recommend this, but it may be possible for your 
case.
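
For example, the worker side of a hand-rolled, probe-able "broadcast" might
look something like this (a sketch; it assumes each listener simply sends the
same message to every worker with an agreed-upon tag):

    /* Worker side: non-blocking check for a "broadcast" from any listener. */
    #include <mpi.h>

    #define REQ_TAG 42   /* tag the listeners use for these messages (assumed) */

    int poll_for_request(void *buf, int count, MPI_Comm comm)
    {
        int flag;
        MPI_Status status;

        /* has any listener sent us a request yet? */
        MPI_Iprobe(MPI_ANY_SOURCE, REQ_TAG, comm, &flag, &status);
        if (flag) {
            /* yes -- receive it from whoever sent it */
            MPI_Recv(buf, count, MPI_BYTE, status.MPI_SOURCE, REQ_TAG,
                     comm, MPI_STATUS_IGNORE);
            return 1;
        }
        return 0;   /* nothing pending; go do other work */
    }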

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/