[OMPI devel] Sending large messages over RDMA fails

2010-11-29 Thread Doron Shoham
Hi, The maximum message size of ConnectX HCAs is 1GB (older cards have a maximum of 2GB). Trying to send larger messages over the RDMA direct protocol will fail. A reminder: RDMA direct will be used if RDMA writes or reads are allowed by |btl_openib_flags| and the sender's message is already

Re: [OMPI devel] Sending large messages over RDMA fails

2010-12-05 Thread Doron Shoham
Jeff Squyres wrote: On Nov 29, 2010, at 3:51 AM, Doron Shoham wrote: If only the PUT flag is set and/or the btl supports only the PUT method, then the sender will allocate a rendezvous header and will not eagerly send any data. The receiver will schedule RDMA PUT(s) of the entire message. It is
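The thread states that a single RDMA transfer larger than the HCA's maximum message size fails, and that the receiver schedules one or more PUTs for the whole message. A minimal sketch of the arithmetic involved, assuming the fix is simply to cap each scheduled transfer at that maximum: the names `MAX_RDMA_MSG` and `split_into_segments` are hypothetical and are not part of Open MPI or the openib BTL, and the 1GB value is only the figure quoted in the first message.

```python
# Hypothetical illustration only: these names do not exist in Open MPI.
GIB = 1 << 30

# Assumed per-operation limit, taken from the 1GB figure quoted in the thread.
MAX_RDMA_MSG = GIB


def split_into_segments(total_bytes, max_segment=MAX_RDMA_MSG):
    """Return (offset, length) pairs that cover total_bytes.

    Each segment is at most max_segment bytes long, so no single
    transfer exceeds the assumed hardware limit.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if max_segment <= 0:
        raise ValueError("max_segment must be positive")

    segments = []
    offset = 0
    while offset < total_bytes:
        # Take as much as the limit allows, or whatever remains.
        length = min(max_segment, total_bytes - offset)
        segments.append((offset, length))
        offset += length
    return segments


if __name__ == "__main__":
    # A 2.5GB message, used here purely as an example size.
    for offset, length in split_into_segments(5 * GIB // 2):
        print(offset, length)
```

Whether the actual fix discussed in the thread splits messages this way, or does so on the sender or the receiver side, is not visible in the excerpts above.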

[OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-11 Thread Doron Shoham
Hi, All machines on the setup are IDataPlex with Nehalem 12 cores per node, 24GB memory. · *Problem 1 – OMPI 1.4.3 hangs in gather:* I’m trying to run the IMB gather operation with OMPI 1.4.3 (Vanilla). The hang happens when np >= 64 and the message size exceeds 4k: mpirun -np 64 -machinefile

Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-12 Thread Doron Shoham
58 PM, Doron Shoham wrote: > Hi > All machines on the setup are IDataPlex with Nehalem 12 cores per node, 24GB memory. > · Problem 1 – OMPI 1.4.3 hangs in gather: > I’m trying to run IMB and gather operation with OMPI 1.4.3 (Vanilla).

Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-16 Thread Doron Shoham
s issue there, the code is very-very >>> old, we did not touch it for a long time. >>> >>> Regards, >>> >>> Pavel (Pasha) Shamis >>> --- >>> Application Performance Tools Group >>> Computer Science and Math Division >>>

Re: [OMPI devel] OMPI 1.4.3 hangs in gather

2011-01-26 Thread Doron Shoham
Using the flag --mca mpi_preconnect_mpi seems to have solved the issue with the oob connection manager. This solution is not scalable, but it looks more and more like a connection establishment problem. I'm still trying to figure out what the root cause is and how to solve it. Any ideas will be m