[OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-20 Thread Brock Palen
We managed to have another user hit the bug that causes collectives (this time MPI_Bcast()) to hang on IB, which was fixed by setting: btl_openib_cpc_include rdmacm. My question is: if we set this as the default on our system with an environment variable, does it introduce any performance or other
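Open MPI reads an MCA parameter from an environment variable named with the OMPI_MCA_ prefix, which is one way to make the setting above a default for every job. A minimal sketch, assuming a Python job wrapper is in use; the helper name mca_env is hypothetical and not part of Open MPI:

```python
import os

def mca_env(params, base=None):
    """Return a copy of an environment with MCA parameters added.

    Open MPI picks up the MCA parameter "name" from the environment
    variable "OMPI_MCA_name". mca_env is a hypothetical helper name.
    """
    env = dict(os.environ if base is None else base)
    for name, value in params.items():
        env["OMPI_MCA_" + name] = str(value)
    return env

# Start from an empty base so the result is easy to inspect.
env = mca_env({"btl_openib_cpc_include": "rdmacm"}, base={})
print(env)
```

In a real wrapper the base argument would be omitted and the result passed as env= to subprocess.run() when launching mpirun. Exporting OMPI_MCA_btl_openib_cpc_include=rdmacm in a shell profile should have the same effect.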

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Jeff Squyres
Over IB, I'm not sure there is much of a drawback. It might be slightly slower to establish QPs, but I don't think that matters much. Over iWARP, rdmacm can cause connection storms as you scale to thousands of MPI processes. On Apr 20, 2011, at 5:03 PM, Brock Palen wrote: > We managed to ha

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Brock Palen
Given that part of our cluster is TCP only, openib wouldn't even startup on those hosts and this would be ignored on hosts with IB adaptors? Just checking thanks! Brock Palen www.umich.edu/~brockp Center for Advanced Computing bro...@umich.edu (734)936-1985 On Apr 21, 2011, at 6:21 PM, Jeff

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-21 Thread Ralph Castain
On Apr 21, 2011, at 4:41 PM, Brock Palen wrote: > Given that part of our cluster is TCP only, openib wouldn't even startup on > those hosts That is correct - it would have no impact on those hosts > and this would be ignored on hosts with IB adaptors? Ummm...not sure I understand this one.

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-22 Thread Brock Palen
On Apr 21, 2011, at 6:49 PM, Ralph Castain wrote: > > On Apr 21, 2011, at 4:41 PM, Brock Palen wrote: > >> Given that part of our cluster is TCP only, openib wouldn't even startup on >> those hosts > > That is correct - it would have no impact on those hosts > >> and this would be ignored on

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-27 Thread Brock Palen
Argh, our messed-up environment with three generations of InfiniBand bit us. Setting btl_openib_cpc_include to rdmacm causes IB not to be used on our old DDR IB on some of our hosts. Note that jobs will never run across our old DDR IB and our new QDR stuff where rdmacm does work. I am doing some te

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Jeff Squyres
On Apr 27, 2011, at 10:02 AM, Brock Palen wrote: > Argh, our messed up environment with three generations on infiniband bit us, > Setting openib_cpc_include to rdmacm causes ib to not be used on our old DDR > ib on some of our hosts. Note that jobs will never run across our old DDR ib > and our

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-04-28 Thread Brock Palen
Attached is the output of running with verbose 100, mpirun --mca btl_openib_cpc_include rdmacm --mca btl_base_verbose 100 NPmpi [nyx0665.engin.umich.edu:06399] mca: base: components_open: Looking for btl components [nyx0666.engin.umich.edu:07210] mca: base: components_open: Looking for btl compo

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-03 Thread Dave Love
Brock Palen writes: > We managed to have another user hit the bug that causes collectives (this > time MPI_Bcast() ) to hang on IB that was fixed by setting: > > btl_openib_cpc_include rdmacm Could someone explain this? We also have problems with collective hangs with openib/mlx4 (specifically

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-05 Thread Brock Palen
Yeah, we have run into more issues, with rdmacm not being available on all of our hosts. So it would be nice to know what we can do to test that a host would support rdmacm. Example: -- No OpenFabrics connection schemes rep

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
Sorry for the delay on this -- it looks like the problem is caused by messages like this (from your first message): [nyx0665.engin.umich.edu:06399] openib BTL: rdmacm IP address not found on port RDMA CM requires IP addresses (i.e., IPoIB) to be enabled on every port/LID where you want to use i
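Since RDMA CM needs an IP address on each IB port, one rough pre-check for a host is whether its IPoIB interfaces have IPv4 addresses. A minimal sketch, assuming the text comes from `ip -o -4 addr show` and that IPoIB interfaces are named with an "ib" prefix (both are assumptions about the site, and the sample text below is made up for illustration). Passing this check does not prove that rdmacm will work; it only rules out the missing-address case.

```python
def ipoib_addresses(ip_output, prefix="ib"):
    """Map IPoIB interface names to their IPv4 addresses.

    ip_output is expected to look like `ip -o -4 addr show` output:
    "<index>: <ifname> inet <addr>/<prefixlen> ...", one line per address.
    """
    found = {}
    for line in ip_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[2] == "inet" and fields[1].startswith(prefix):
            found.setdefault(fields[1], []).append(fields[3].split("/")[0])
    return found

# Illustrative sample only, not output from any host in this thread.
sample = (
    "1: lo    inet 127.0.0.1/8 scope host lo\n"
    "2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0\n"
    "4: ib0    inet 192.168.1.5/24 brd 192.168.1.255 scope global ib0\n"
)
result = ipoib_addresses(sample)
print(result)
```

An empty result on a host with IB hardware would suggest that IPoIB is not configured there.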

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-09 Thread Jeff Squyres
On May 3, 2011, at 6:42 AM, Dave Love wrote: >> We managed to have another user hit the bug that causes collectives (this >> time MPI_Bcast() ) to hang on IB that was fixed by setting: >> >> btl_openib_cpc_include rdmacm > > Could someone explain this? We also have problems with collective han

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Brock Palen
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote: > On May 3, 2011, at 6:42 AM, Dave Love wrote: > >>> We managed to have another user hit the bug that causes collectives (this >>> time MPI_Bcast() ) to hang on IB that was fixed by setting: >>> >>> btl_openib_cpc_include rdmacm >> >> Could someo

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Ralph Castain
Sent from my iPad On May 11, 2011, at 2:05 PM, Brock Palen wrote: > On May 9, 2011, at 9:31 AM, Jeff Squyres wrote: > >> On May 3, 2011, at 6:42 AM, Dave Love wrote: >> We managed to have another user hit the bug that causes collectives (this time MPI_Bcast() ) to hang on IB that

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Dave Love
Jeff Squyres writes: > We had a user-reported issue of some hangs that the IB vendors have > been unable to replicate in their respective labs. We *suspect* that > it may be an issue with the oob openib CPC, but that code is pretty > old and pretty mature, so all of us would be at least somewhat

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-11 Thread Dave Love
Ralph Castain writes: > I'll go back to my earlier comments. Users always claim that their > code doesn't have the sync issue, but it has proved to help more often > than not, and costs nothing to try, Could you point to that post, or tell us what to try exactly, given we're running IMB? Thanks

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Ralph Castain
On May 11, 2011, at 4:27 PM, Dave Love wrote: > Ralph Castain writes: > >> I'll go back to my earlier comments. Users always claim that their >> code doesn't have the sync issue, but it has proved to help more often >> than not, and costs nothing to try, > > Could you point to that post, or te

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Jeff Squyres
On May 11, 2011, at 3:21 PM, Dave Love wrote: > We can reproduce it with IMB. We could provide access, but we'd have to > negotiate with the owners of the relevant nodes to give you interactive > access to them. Maybe Brock's would be more accessible? (If you > contact me, I may not be able to

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Brock Palen
On May 12, 2011, at 10:13 AM, Jeff Squyres wrote: > On May 11, 2011, at 3:21 PM, Dave Love wrote: > >> We can reproduce it with IMB. We could provide access, but we'd have to >> negotiate with the owners of the relevant nodes to give you interactive >> access to them. Maybe Brock's would be mor

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-12 Thread Brock Palen
I am pretty sure MTLs and BTLs are very different, but just as a note: this user's code (Crash) hangs at MPI_Allreduce() with openib, but runs on: tcp, psm (an MTL, different hardware). Putting it out there in case it has any bearing. Otherwise ignore. Brock Palen www.umich.edu/~brockp Center f

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Dave Love
Jeff Squyres writes: > On May 11, 2011, at 3:21 PM, Dave Love wrote: > >> We can reproduce it with IMB. We could provide access, but we'd have to >> negotiate with the owners of the relevant nodes to give you interactive >> access to them. Maybe Brock's would be more accessible? (If you >> con

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-13 Thread Brock Palen
On May 13, 2011, at 4:09 PM, Dave Love wrote: > Jeff Squyres writes: > >> On May 11, 2011, at 3:21 PM, Dave Love wrote: >> >>> We can reproduce it with IMB. We could provide access, but we'd have to >>> negotiate with the owners of the relevant nodes to give you interactive >>> access to them.

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Samuel K. Gutierrez
Hi, Just out of curiosity - what happens when you add the following MCA option to your openib runs? -mca btl_openib_flags 305 Thanks, Samuel Gutierrez Los Alamos National Laboratory On May 13, 2011, at 2:38 PM, Brock Palen wrote: > On May 13, 2011, at 4:09 PM, Dave Love wrote: > >> Jeff Squ

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Brock Palen
On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote: > Hi, > > Just out of curiosity - what happens when you add the following MCA option to > your openib runs? > > -mca btl_openib_flags 305 You, sir, found the magic combination. I verified this lets IMB and CRASH progress past their lock

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread Samuel K. Gutierrez
On May 16, 2011, at 8:53 AM, Brock Palen wrote: > > > > On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote: > >> Hi, >> >> Just out of curiosity - what happens when you add the following MCA option >> to your openib runs? >> >> -mca btl_openib_flags 305 > > You Sir found the magic co

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-16 Thread George Bosilca
Here is the output of the "ompi_info --param btl openib": MCA btl: parameter "btl_openib_flags" (current value: <306>, data source: default value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEN
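The flags value is a bit mask, so the suggested 305 and the default 306 can be compared bit by bit. A minimal sketch that names only the bits visible in the quoted ompi_info text (SEND=1, PUT=2, GET=4); the help string is cut off after that, so higher bits are printed as plain numbers instead of guessed names. The function name decode_flags is hypothetical.

```python
# Bit names taken from the ompi_info text quoted above; the quoted
# help string is cut off after GET=4, so higher bits stay unnamed.
KNOWN_FLAGS = {1: "SEND", 2: "PUT", 4: "GET"}

def decode_flags(value):
    """Split a btl_openib_flags value into its set bits, lowest first."""
    parts = []
    bit = 1
    while bit <= value:
        if value & bit:
            parts.append(KNOWN_FLAGS.get(bit, str(bit)))
        bit <<= 1
    return parts

print(decode_flags(306))  # default value reported by ompi_info
print(decode_flags(305))  # value suggested earlier in the thread
```

Under this decoding the two values differ only in the lowest two bits: 306 sets PUT and 305 sets SEND instead, with bits 16, 32 and 256 common to both.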

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-17 Thread Brock Palen
Thanks, I thought of looking at ompi_info after I sent that note, sigh. SEND_INPLACE appears to help performance of larger messages in my synthetic benchmarks over regular SEND. Also it appears that SEND_INPLACE still allows our code to run. We are working on getting devs access to our system and co

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-17 Thread Brock Palen
Sorry, typo: 314 not 313. Brock Palen www.umich.edu/~brockp Center for Advanced Computing bro...@umich.edu (734)936-1985 On May 17, 2011, at 2:02 PM, Brock Palen wrote: > Thanks, I though of looking at ompi_info after I sent that note sigh. > > SEND_INPLACE appears to help performance of large

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-18 Thread Brock Palen
Well, I have a new wrench in this situation. We had a power failure at our datacenter that took down our entire system: nodes, switch, SM. Now I am unable to reproduce the error with oob, default ibflags, etc. Does this shed any light on the issue? It also makes it hard to now debug the issue without b

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-05-24 Thread Dave Love
Brock Palen writes: > Well I have a new wrench into this situation. > We have a power failure at our datacenter took down our entire system > nodes,switch,sm. > Now I am unable to produce the error with oob default ibflags etc. As far as I know, we could still reproduce it. Mail me if you ne

Re: [OMPI users] btl_openib_cpc_include rdmacm questions

2011-07-27 Thread Brock Palen
Sorry to bring this back up. We recently had an outage, updated the firmware on our GD4700, and installed a new Mellanox-provided OFED stack, and the problem has returned. Specifically, I am able to produce the problem with IMB on 4 12-core nodes when it tries to go to 16 cores. I have verified that en