Re: [OMPI devel] Commit 6e6a3e96

2015-09-16 Thread Gilles Gouaillardet
George, I will revisit this. If I added a const modifier where the standard does not require one, it was not intentional; it was a mistake. Thanks for the report. Gilles On Wednesday, September 16, 2015, George Bosilca wrote: > Gilles, > > Your commit 6e6a3e96 is only partially correct. There is n

[OMPI devel] Interaction between orterun and user program

2015-09-16 Thread Kay Khandan (Hamed)
Hello everyone, My name is Kay. I’m a huge "oom-pi" fan, but only recently have I been looking at it from a devel perspective. I would appreciate it if somebody could show me the entry point into understanding how orterun and the user program interact, and more importantly how to change the way they interact. The rea

[OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Алексей Рыжих
Hi all, We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE support level) in which several threads submit many MPI_Irecv() requests simultaneously, and encountered an intermittent OMPI_ERR_TEMP_OUT_OF_RESOURCE error after MCA_PML_OB1_RECV_REQUEST_ALLOC() because OMPI_FREE_LIS

[OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Edgar Gabriel
something is borked right now on master in the management of inter vs. intra communicators. It looks like intra communicators are wrongly selecting the inter coll module thinking that it is an inter communicator, and we have hangs because of that. I attach a small replicator, where a bcast of a

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Nathan Hjelm
The formatting of the code got all messed up. Please send a diff and I will take a look. ompi free list no longer exists in master or the next release branch but the change may be worthwhile for the opal free list code. -Nathan On Wed, Sep 16, 2015 at 04:03:44PM +0300, Алексей Рыжих wrote: >

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Nathan Hjelm
The reproducer is working for me with master on OS X 10.10. Some changes to ompi_comm_set went in yesterday. Are you on the latest hash? -Nathan On Wed, Sep 16, 2015 at 08:49:59AM -0500, Edgar Gabriel wrote: > something is borked right now on master in the management of inter vs. intra > communica

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Edgar Gabriel
Yes, I did a fresh pull this morning; for me it deadlocks reliably with 2 or more processes. Thanks Edgar On 9/16/2015 10:42 AM, Nathan Hjelm wrote: The reproducer is working for me with master on OS X 10.10. Some changes to ompi_comm_set went in yesterday. Are you on the latest hash? -Nathan O

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Nathan Hjelm
I just realized my branch is behind master. Updating now and will retest. -Nathan On Wed, Sep 16, 2015 at 10:43:45AM -0500, Edgar Gabriel wrote: > Yes, I did a fresh pull this morning; for me it deadlocks reliably with 2 or > more processes. > > Thanks > Edgar > > On 9/16/2015 10:42 AM, Nathan H

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread George Bosilca
Alexey, This is not necessarily the fix for all cases. Most of the internal uses of the free_list can easily accommodate the fact that no more elements are available. Based on your description of the problem I would assume you encounter this problem once the MCA_PML_OB1_RECV_REQUEST_ALLOC is ca

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread George Bosilca
While looking into a possible fix for this problem, we should also clean up the leftovers from OMPI_FREE_LIST in the trunk. $ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} + ./opal/mca/btl/usnic/btl_usnic_compat.h:161:OMPI_FREE_LIST_GET_MT(list, (item)) ./ompi/mca/pml/bfo/pml_b

[OMPI devel] edison/hopper jenkins nodes back on line

2015-09-16 Thread Howard Pritchard
Hi Folks, I had to update my password for NERSC systems and that broke the credentials the IU jenkins was using to launch on those nodes. Should be working again. Sorry for the inconvenience, Howard

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Nathan Hjelm
iboffload and bfo are opal ignored by default. Neither exists in the release branch. -Nathan On Wed, Sep 16, 2015 at 12:02:29PM -0400, George Bosilca wrote: >While looking into a possible fix for this problem we should also cleanup >in the trunk the leftover from the OMPI_FREE_LIST. >

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Nathan Hjelm
I see the problem. Before my changes ompi_comm_dup signalled that the communicator was not an inter-communicator by setting remote_size to 0. The remote size is now from the remote group if one was supplied (which is the case with intra-communicators) so ompi_comm_dup needs to make sure NULL is pa

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread George Bosilca
As they don't even compile, why are we keeping them around? George. On Wed, Sep 16, 2015 at 12:05 PM, Nathan Hjelm wrote: > > iboffload and bfo are opal ignored by default. Neither exists in the > release branch. > > -Nathan > > On Wed, Sep 16, 2015 at 12:02:29PM -0400, George Bosilca wrote:

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Nathan Hjelm
Not sure. I give a +1 for blowing them away. We can bring them back later if needed. -Nathan On Wed, Sep 16, 2015 at 01:19:24PM -0400, George Bosilca wrote: >As they don't even compile why are we keeping them around? > George. >On Wed, Sep 16, 2015 at 12:05 PM, Nathan Hjelm wrote:

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Rolf vandeVaart
The bfo was my creation many years ago. Can we keep it around for a little longer? If we blow it away, then we should probably clean up all the code I also have in the openib BTL for supporting failover. There is also some configure code that would have to go as well. Rolf >-Original Me

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Владимир Трущин
George, You are right. The sequence of calls in our test is MPI_Irecv -> mca_pml_ob1_irecv -> MCA_PML_OB1_RECV_REQUEST_ALLOC. We will try to use OMPI_FREE_LIST_WAIT_MT. We saw the following problem in OMPI_FREE_LIST_WAIT_MT. It returned NULL when thread A was suspended after the call

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread Владимир Трущин
Sorry, “We saw the following problem in OMPI_FREE_LIST_GET_MT…”. *From:* Владимир Трущин [mailto:vdtrusc...@compcenter.org] *Sent:* Wednesday, September 16, 2015 10:09 PM *To:* 'Open MPI Developers' *Subject:* RE: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT() George, You are right. T

Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()

2015-09-16 Thread George Bosilca
On Wed, Sep 16, 2015 at 3:11 PM, Владимир Трущин wrote: > Sorry, “We saw the following problem in OMPI_FREE_LIST_GET_MT…”. > That's exactly what the WAIT macro is supposed to solve: wait (grow the freelist and call opal_progress) until an item becomes available. George. > > > *From:* Владим

[OMPI devel] --enable-spare-groups build broken

2015-09-16 Thread Jeff Squyres (jsquyres)
Did something change in the group structure in the last 24-48 hours? --enable-spare-groups builds are currently broken: make[2]: Entering directory `/home/jsquyres/git/ompi/ompi/debuggers' CC libdebuggers_la-ompi_debuggers.lo In file included from ../../ompi/communicator/communicator

Re: [OMPI devel] --enable-spare-groups build broken

2015-09-16 Thread Ralph Castain
Yes - Nathan made some changes related to the add_procs code. I doubt that configure option was checked... On Wed, Sep 16, 2015 at 7:13 PM, Jeff Squyres (jsquyres) wrote: > Did something change in the group structure in the last 24-48 hours? > > --enable-spare-groups groups are currently broken:

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Howard Pritchard
Edgar Do you have a simple test we could run with jenkins ghprb that would catch this going forward? I could add it to some of the checks we run on your UH slave node. Howard -- sent from my smart phonr so no good type. Howard On Sep 16, 2015 12:36 PM, "Nathan Hjelm" wrote: > > I se

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Ralph Castain
Actually, Edgar attached a simple reproducer to the first message in this thread. On Wed, Sep 16, 2015 at 7:27 PM, Howard Pritchard wrote: > Edgar > > Do you have a simple test we could run with jenkins ghprb that would catch > this going forward? > > I could add it to some of the checks we run

Re: [OMPI devel] inter vs. intra communicator problem on master

2015-09-16 Thread Howard Pritchard
thanks Ralph. I will add it to one of the UH jenkins scripts. -- sent from my smart phonr so no good type. Howard On Sep 16, 2015 10:28 PM, "Ralph Castain" wrote: > Actually, Edgar attached a simple reproducer to the first message in this > thread. > > > On Wed, Sep 16, 2015 at 7:27 P