Re: [OMPI devel] periodic hangs of hello_usempi.x on uh jenkins slave

2015-09-03 Thread Ralph Castain
I’ve managed to create a 100% reproducer - I’ll try to track this down as quickly as I can. Meantime, I’m working on that internal timeout so we don’t hang in case anything else interferes. > On Sep 3, 2015, at 12:53 PM, Howard Pritchard wrote: > > Hi Ralph, > > If it's any help, the first r

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Ralph Castain
Yes, it actually is rather easy to do. I can check, but I think that should happen now (unless psm2 was set to auto-build if the lib was detected). Regardless, we can always have RH et al simply build with --enable-mca-no-build=mtl-psm2 and that will solve the problem. Please keep us posted - an
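For illustration, a minimal sketch of the build Ralph describes. The --enable-mca-no-build value is the one quoted in the message; the install prefix and the make parallelism are assumptions chosen only for this example.

```shell
# Build Open MPI with the PSM2 MTL component excluded.
# The prefix below is hypothetical; adjust to the packaging layout in use.
./configure --prefix="$HOME/opt/openmpi-1.10.0" --enable-mca-no-build=mtl-psm2
make -j4
make install
```

Run from the top of an unpacked Open MPI source tree.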

Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-03 Thread Ralph Castain
I see - yes, that would be true. It would not build without hwloc. An alternative would be to have hwloc return a neutral response that we check and ignore if hwloc isn’t “active”. Would that suffice? I’m just looking to remove all that #if cruft all over the place. > On Sep 3, 2015, at 4:02 PM

Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-03 Thread Paul Hargrove
Gilles, You have the nature of my question correct. To restate: Imagine somebody is developing an experimental platform (such as a research OS) and they want an MPI for it. Additionally assume that hwloc (the embedded one or otherwise) doesn't build at all for this platform. It is my understandin

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Friedley, Andrew
Hi Ralph & crew, I'm representing the Intel PSM team to Open MPI. They're aware of the problem and have seen the comments on both this thread and in OFI, and are working on solving the issue within PSM2. Current estimate is that it will take 3-4 weeks. If it comes to removing the PSM2 MTL fro

Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-03 Thread Gilles Gouaillardet
Ralph, just to be clear, your proposal is to abort if openmpi is configured with --without-hwloc, right ? ( the --with-hwloc option is not removed because we want to keep the option of using an external hwloc library ) if I understand correctly, Paul's point is that if openmpi is ported to a new

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-2436-g7adb9b7

2015-09-03 Thread Dave Goodell (dgoodell)
On Sep 3, 2015, at 3:40 PM, Burette, Yohann wrote: > I see what you are saying. Thank you for pointing it out. > > Would MTL_OFI_RETRY_UNTIL_DONE be better instead? Yes, I think that would be an improvement. Thanks, -Dave

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-2436-g7adb9b7

2015-09-03 Thread Burette, Yohann
I see what you are saying. Thank you for pointing it out. Would MTL_OFI_RETRY_UNTIL_DONE be better instead? Yohann -Original Message- From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Dave Goodell (dgoodell) Sent: Thursday, September 03, 2015 11:47 AM To: de...@open-mpi.org S

Re: [OMPI devel] periodic hangs of hello_usempi.x on uh jenkins slave

2015-09-03 Thread Howard Pritchard
Hi Ralph, If it's any help, the first run has yet to hang. It's always one of the subsequent mpirun runs (which is why it's the Fortran one) that shows this problem. Howard 2015-09-03 13:52 GMT-06:00 Ralph Castain : > Thanks! I’ll at least try, and can certainly provide some diag output > (just have

Re: [OMPI devel] periodic hangs of hello_usempi.x on uh jenkins slave

2015-09-03 Thread Ralph Castain
Thanks! I’ll at least try, and can certainly provide some diag output (just have to live thru it when it doesn’t fail, and hopefully it won’t change the timing so much that it won’t reproduce any more) > On Sep 3, 2015, at 12:44 PM, Howard Pritchard wrote: > > Hi Ralph, > > Warning that it se

Re: [OMPI devel] periodic hangs of hello_usempi.x on uh jenkins slave

2015-09-03 Thread Howard Pritchard
Hi Ralph, Warning that it seems to be hard to reproduce, at least on the UH server. Howard 2015-09-03 13:12 GMT-06:00 Ralph Castain : > I’ll try to replicate, and provide some diagnostics targeting this > exchange. What is happening is that the client process is attempting to > connect to the

Re: [OMPI devel] periodic hangs of hello_usempi.x on uh jenkins slave

2015-09-03 Thread Ralph Castain
I’ll try to replicate, and provide some diagnostics targeting this exchange. What is happening is that the client process is attempting to connect to the ORTE daemon, and for some reason the connection isn’t generating a response from the daemon. I’ll also add a timeout function in there so we

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-2436-g7adb9b7

2015-09-03 Thread Dave Goodell (dgoodell)
On Sep 3, 2015, at 1:03 PM, git...@crest.iu.edu wrote: > diff --git a/ompi/mca/mtl/ofi/mtl_ofi.h b/ompi/mca/mtl/ofi/mtl_ofi.h > index 3584d8a..a035b1c 100644 > --- a/ompi/mca/mtl/ofi/mtl_ofi.h > +++ b/ompi/mca/mtl/ofi/mtl_ofi.h > @@ -38,6 +38,14 @@ > #include "mtl_ofi_endpoint.h" > #include "mtl_o

Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-03 Thread Ralph Castain
No - hwloc is embedded in OMPI anyway. > On Sep 3, 2015, at 11:09 AM, Paul Hargrove wrote: > > > On Thu, Sep 3, 2015 at 8:03 AM, Ralph Castain > wrote: > Does anyone know of a reason why we shouldn’t do this? > > > Would doing this mean that a port to a new system w

[OMPI devel] periodic hangs of hello_usempi.x on uh jenkins slave

2015-09-03 Thread Howard Pritchard
Hi Folks, I'm seeing again a case of a hang (yes I'm going to start using timeout) of a two process run on the iu jenkins server for master. This is the --disable-dlopen jenkins project for the IU jenkins server. I attached to the hanging processes and get this for a backtrace: #0 0x7fdd4c
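Howard mentions starting to use a timeout. A minimal sketch of one way to do that, assuming he means the GNU coreutils `timeout` command (the message does not say which mechanism). The helper name `run_with_limit` and the 300-second limit are hypothetical.

```shell
# run_with_limit SECONDS CMD [ARGS...]
# Runs CMD under GNU coreutils `timeout`, so a hung job is terminated
# instead of blocking the CI slot. `timeout` exits with status 124 when
# the time limit is hit.
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
}

# Example invocation (commented out: needs an Open MPI install and the test binary):
#   run_with_limit 300 mpirun -np 2 ./hello_usempi.x
```

A CI script can check for exit status 124 to tell a hang apart from an ordinary test failure.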

Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-03 Thread Paul Hargrove
On Thu, Sep 3, 2015 at 8:03 AM, Ralph Castain wrote: > Does anyone know of a reason why we shouldn’t do this? > Would doing this mean that a port to a new system would require that one first perform a full hwloc port? -Paul -- Paul H. Hargrove phhargr...@lbl.gov Comp

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Howard Pritchard
I vote for Ralph's proposal. 2015-09-03 10:05 GMT-06:00 Ralph Castain : > As we discussed on the phone, I prefer the bullet #3 approach - ask RedHat > to build/distribute 1.10.0 without PSM2 support, and let Intel provide a > PSM2-enabled variant via their current proprietary distribution channel

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Ralph Castain
As we discussed on the phone, I prefer the bullet #3 approach - ask RedHat to build/distribute 1.10.0 without PSM2 support, and let Intel provide a PSM2-enabled variant via their current proprietary distribution channel until they can provide a “clean” solution to the community. If that hasn’t

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Jeff Squyres (jsquyres)
Ralph and I just chatted about this on the phone. I think I understand his position better now. Just to be clear/put some context in this conversation: 1. PSM (aka "PSM1") supports TrueScale Intel networks 2. PSM2 supports OmniScale Intel networks -- The following three solutions are more

[OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-03 Thread Ralph Castain
Hi folks We have carried the ability to build without hwloc since we first pulled that package into OMPI. This made sense initially as the code was still maturing, and we were concerned about the breadth of support. However, that has certainly changed, and I propose we remove this configure opt

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Gilles Gouaillardet
Ralph, if I correctly read between the lines of your second point, omnipath (PSM2) is working out of the box. I am not sure this is the case, and/or my extrapolation might be incorrect. if I understood correctly, psm2 is a new feature. from a distro point of view, that could be a new package (kno

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread hppritcha
Hi Jeff, to answer your question: I too find the PSM 1/2 situation weird and a real mess. Back to IB verbs? Howard Sent from my iPhone > On 03.09.2015 at 06:55, Jeff Squyres (jsquyres) wrote: > > I agree with what George says. > > AFAIK, Red Hat builds Open MPI support for dlopen, so the config

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Jeff Squyres (jsquyres)
On Sep 3, 2015, at 9:30 AM, Gilles Gouaillardet wrote: > > on second thought, wouldn't it be better to simply disable both PSM and PSM2 > in openmpi, > and let libfabric handle these conflicts ? There are two reasons: 1. Intel still wants to use their PSM and PSM2 MTLs. 2. The publicly-released

Re: [OMPI devel] orte-dvm and orte_max_vm_size

2015-09-03 Thread Ralph Castain
Hi Mark The purpose of orte_max_vm_size is to subdivide the allocation - i.e., for a given mpirun execution, you can specify to only use a certain number of the allocated nodes. If you want to further limit the VM to specific nodes in the allocation, then you would use -host option. It’s a lit

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Ralph Castain
I guess I didn’t make it clear in my prior comment, so let me try again. I understand about dlopen and the fix that George proposed - we had internally discussed this as well. However, the questions that raises are: 1. how does the distro (Michal) decide which PSM module to disable by default i

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Gilles Gouaillardet
Jeff, on second thought, wouldn't it be better to simply disable both PSM and PSM2 in openmpi, and let libfabric handle these conflicts ? does that make any sense ? Cheers, Gilles On Thursday, September 3, 2015, Jeff Squyres (jsquyres) wrote: > I agree with what George says. > > AFAIK, Red Ha

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Jeff Squyres (jsquyres)
I agree with what George says. AFAIK, Red Hat builds Open MPI support for dlopen, so the config file option is probably suitable. However, I have to admit that I resent the fact that PSM's poor upgrade path design is forcing both the Open MPI and libfabric communities to have similar confusing

[OMPI devel] orte-dvm and orte_max_vm_size

2015-09-03 Thread Mark Santcroos
Hi, I've been running into a funny issue using orte-dvm (Hi Ralph ;-) while trying to define the size of the created VM; for that I use "--mca orte_max_vm_size", which in general seems to work. In this example I have a PBS job of 4 nodes and want to run the DVM on < 4 nodes. If I creat

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Gilles Gouaillardet
Michael, if a solution with two packages is acceptable, then another and simpler option is to configure openmpi for PSM with --without-psm2, and openmpi for PSM2 with --without-psm. This is safe for --disable-dlopen or --enable-static, and you do not need to tweak the conf files. Cheers, Gilles
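The two-package approach Gilles suggests could look roughly like the sketch below. The --without-psm2 and --without-psm options are taken from his message; the install prefixes and the `make distclean` step between builds are assumptions made for this example.

```shell
# Build 1: PSM only (PSM2 support left out). Prefix is hypothetical.
./configure --prefix=/opt/openmpi-psm --without-psm2
make && make install

# Build 2: PSM2 only (PSM support left out), from a cleaned tree.
make distclean
./configure --prefix=/opt/openmpi-psm2 --without-psm
make && make install
```

Each build links against only one of the two libraries, so the symbol conflict cannot arise inside either package, whether or not dlopen support is enabled.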

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread George Bosilca
Hi Michael, I might have missed some context when proposing this solution. As Gilles suggested if you build Open MPI without support for dlopen (configure option --disable-dlopen) this simple solution will not work because the symbol conflict issue is generated deep inside the constructors of the

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Michal Schmidt
[I apologize for not threading the email properly. I was not subscribed before and found the conversation in the web archive.] Hello, I am the one who discovered the PSM vs. PSM2 library conflict and proposed the temporary workaround of having two builds of the openmpi package. George Bosilca wr

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread George Bosilca
On Thu, Sep 3, 2015 at 12:49 AM, Ralph Castain wrote: > George, I think you misunderstand the difference between the two modules. > PSM supports one type of fabric, and PSM2 supports a different one. They > are not interchangeable. > Ralph, what these two modules do is irrelevant. My point is th

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Ralph Castain
George, I think you misunderstand the difference between the two modules. PSM supports one type of fabric, and PSM2 supports a different one. They are not interchangeable. I agree with your second point. If you have a way of resolving it, I would welcome hearing it. So far, the problems have be

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Gilles Gouaillardet
George, about your third point : some libraries do stuff in the constructors, so "mtl = ^psm" might also not work if OMPI was configure'd with --disable-dlopen. as far as I know, --disable-dlopen is quite popular (and --disable-shared --enable-static is not so much) Cheers, Gilles On 9/3/
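For context, a sketch of the usual ways to apply the "mtl = ^psm" setting under discussion. The application name and process count are placeholders. As Gilles points out, none of these may help with a --disable-dlopen build, where library constructors can run before any component selection happens.

```shell
# Per run, on the mpirun command line (application name is a placeholder):
#   mpirun --mca mtl '^psm' -np 2 ./my_app

# Per shell session, through the environment:
export OMPI_MCA_mtl='^psm'

# Installation-wide, as one line in <prefix>/etc/openmpi-mca-params.conf:
#   mtl = ^psm
```

The leading caret means "exclude": every MTL component except psm stays eligible for selection.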

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread George Bosilca
I might have missed something here but: 1. I bet that, and I'm certainly using a lower bound here, 99.9% of our users will not even notice the issue between PSM and PSM2. 2. If there is anything that might negatively impact us as a community, it is the recurring screw-ups with our own releases. For

Re: [OMPI devel] 1.10.0 issue

2015-09-03 Thread Gilles Gouaillardet
Ralph, on one hand, I do not have a strong opinion about keeping PSM2 in the v1.10 series; on the other hand, I feel confused by this explanation ... if PSM2 is simply removed, only one version of ompi can be released, but there is no way to support PSM2 at all. how is this better than giving th