Just an update for folks: the connection setup latency was a bug in my
iw_cxgb3 rdma driver. It wasn't turning off RX coalescing for the iwarp
connections. This resulted in 100-200ms of added latency, since the iwarp
connection setup uses TCP streaming mode messages to negotiate the iwarp
connection.
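For anyone who wants to see the effect directly, here's a rough sketch
(plain MPI, nothing driver-specific, program name and details are just
illustration): it times the first zero-byte ping-pong to each peer, and
with lazy connection setup that first exchange pays the full connect cost.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int peer = 1; peer < size; peer++) {
            double t0 = MPI_Wtime();
            /* First traffic to this peer: includes connection setup. */
            MPI_Send(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(NULL, 0, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("peer %d: first ping-pong took %.1f ms\n",
                   peer, (MPI_Wtime() - t0) * 1e3);
        }
    } else {
        MPI_Recv(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(NULL, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}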
I'll look into Solaris Studio. I think the connections are somehow
getting single-threaded or funneled by the gather algorithm. And since
each one takes ~160ms to set up, and there are ~3600 connections getting
set up, we end up with a 7 minute run time. Now, 160ms seems like a long
time to set up a single connection.
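(For scale: a fully wired NP64 job has 64 x 63 / 2 = 2016 distinct rank
pairs, on the order of 4000 one-way connects, which is in the same
ballpark as the ~3600 above; at ~160ms apiece, even partly serialized
setup easily reaches the multi-minute range.)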
Right, by default all connections are set up on the fly. So when an
MPI_Send is executed to a process that there is not yet a connection to,
a dance happens between the sender and the receiver. So why this
happens with np > 60 may have to do with how many connections are being
set up at the same time.
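As I understand it, mpi_preconnect_mpi just forces that dance for every
pair up front. A rough sketch of the equivalent done by hand (a
zero-byte exchange with every peer at startup; preconnect_all is my
name for it, not an Open MPI function):

#include <mpi.h>

/* Touch every rank pair once so lazy connections get established
 * before any timed collective runs. Sketch only. */
static void preconnect_all(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank)
            continue;
        /* A zero-byte exchange is enough to trigger connection setup. */
        MPI_Sendrecv(NULL, 0, MPI_BYTE, peer, 0,
                     NULL, 0, MPI_BYTE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}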
Does anyone have an NP64 IB cluster handy? I'd be interested to know if
IB behaves this way when running with the rdmacm connect method, i.e. with:
--mca btl_openib_cpc_include rdmacm --mca btl openib,sm,self
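(The full invocation would be something like mpirun -np 64 --mca
btl_openib_cpc_include rdmacm --mca btl openib,sm,self ./your_test,
where ./your_test is whatever gather benchmark you have handy.)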
Steve.
On 9/17/2010 10:41 AM, Steve Wise wrote:
Yes it does. With mpi_preconnect_mpi set to 1, NP64 doesn't stall. So
it's not the algorithm in and of itself, but rather some interplay
between the algorithm and connection setup, I guess.
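(That's with --mca mpi_preconnect_mpi 1 on the mpirun command line, so
every pair gets wired up at init time rather than on first send.)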
On 9/17/2010 5:24 AM, Terry Dontje wrote:
Does setting the mca parameter mpi_preconnect_mpi to 1 help at all? This
might help determine whether it is actually the connection setup between
processes being out of sync, as opposed to something in the actual
gather algorithm.
--td
Steve Wise wrote:
Here's a clue: ompi_coll_tuned_gather_intra_dec_fixed() changes its
algorithm to a binomial method for job sizes > 60. I changed the
threshold to 100 and my NP64 jobs run fine. Now to try to understand
what about ompi_coll_tuned_gather_intra_binomial() is causing these
connect delays...
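To see why the binomial gather changes the connection picture, here's a
sketch of the send pattern in a generic binomial-tree gather to root 0
(size a power of two for simplicity; this is the textbook pattern, not
Open MPI's exact code). Each non-root rank sends once, and many of
those sends target non-root peers, so a lazy-connect transport suddenly
has to wire up lots of fresh pairs that the linear, root-only gather
never touches.

#include <stdio.h>

int main(void)
{
    const int size = 64;    /* NP64; power of two keeps the sketch simple */
    for (int rank = 1; rank < size; rank++) {
        /* A rank drops out at the step of its lowest set bit, sending
         * its accumulated chunk to the peer with that bit cleared. */
        int mask = 1, step = 0;
        while (!(rank & mask)) {
            mask <<= 1;
            step++;
        }
        printf("step %d: rank %2d sends to rank %2d\n",
               step, rank, rank - mask);
    }
    return 0;
}

Note e.g. 3 -> 2, 5 -> 4, 7 -> 6 and so on: pairs the linear algorithm
never connects. As a runtime alternative to patching the threshold, I
believe the tuned component can be steered with --mca
coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm <n> to
pin a specific gather algorithm, though I haven't double-checked the
exact values.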