Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Ralph Castain
I found the reason for the notification and fixed that as well - should all be done now.

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Ralph Castain
Kewl - thanks! I will take care of this, but to me the most pressing issue is why this event notification is being generated at all. It shouldn’t be.

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Gilles Gouaillardet
I finally got it :-) in send_notification() from orted_submit.c, info is OPAL_PMIX_EVENT_NON_DEFAULT, but in pmix2x.c and pmix_ext20.c, PMIX_EVENT_NON_DEFAULT is tested. If I use OPAL_PMIX_EVENT_NON_DEFAULT in pmix*, that fixes the issue. Cheers, Gilles
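[Editor's sketch] A minimal illustration of the bug class Gilles describes; the real code lives in opal/mca/pmix, and the struct, function name, and key string values below are invented for illustration. The handler tests the info key against the PMIx-level constant while the caller sets the OPAL-level one, so the "non-default event" flag is never recognized.

    /* sketch only -- names and key values are hypothetical */
    #include <stdbool.h>
    #include <string.h>

    #define OPAL_PMIX_EVENT_NON_DEFAULT "opal.event.nondefault"  /* assumed */
    #define PMIX_EVENT_NON_DEFAULT      "pmix.event.nondefault"  /* assumed */

    typedef struct { const char *key; } info_t;

    static bool event_is_nondefault(const info_t *info, size_t ninfo)
    {
        for (size_t i = 0; i < ninfo; i++) {
            /* broken: the code compared against PMIX_EVENT_NON_DEFAULT here */
            if (0 == strcmp(info[i].key, OPAL_PMIX_EVENT_NON_DEFAULT)) {
                return true;  /* the fix: match the key the caller actually sets */
            }
        }
        return false;
    }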

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Ralph Castain
Okay, I’ll investigate why that is happening - thanks!

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Gilles Gouaillardet
The parent job (e.g. the task that calls MPI_Comm_spawn) receives it. I cannot tell whether the child (e.g. the spawned task) receives it too or not. Cheers, Gilles

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Ralph Castain
I can fix the initialization. What puzzles me is that no debugger_release message should be sent unless a debugger is attached - in which case, the event should be registered. So why is it being sent? Is it the child job that is receiving it? Or is it the parent?

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-16 Thread Gilles Gouaillardet
I found some time to investigate this. tscon should initialize nondefault to false in both pmix2x.c and pmix_ext20.c. A better workaround is to update ompi_errhandler_callback so it does not invoke ompi_mpi_abort if status is OPAL_ERR_DEBUGGER_RELEASE. That still seems counterintuitive to me ...
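[Editor's sketch] A hedged sketch of the workaround Gilles proposes; the callback shape, the abort stub, and the numeric error value are assumptions, not the actual OMPI code.

    #include <stdio.h>
    #include <stdlib.h>

    #define OPAL_ERR_DEBUGGER_RELEASE (-59)   /* numeric value assumed */

    /* stand-in for ompi_mpi_abort(); the real function also takes a
     * communicator argument */
    static void mpi_abort_stub(int errcode)
    {
        fprintf(stderr, "aborting: error %d\n", errcode);
        exit(1);
    }

    /* the workaround: treat the debugger-release notification as a
     * benign event instead of a fatal error */
    static void errhandler_callback(int status)
    {
        if (OPAL_ERR_DEBUGGER_RELEASE == status) {
            return;   /* not an error -- do not abort */
        }
        mpi_abort_stub(status);
    }

    int main(void)
    {
        errhandler_callback(OPAL_ERR_DEBUGGER_RELEASE);  /* ignored */
        errhandler_callback(-1);                         /* aborts */
        return 0;
    }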

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Ralph Castain
Okay, I’ll take a look - thanks!

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Gilles Gouaillardet
Yep, the constructor of pmix2x_threadshift_t (tscon) does not initialize nondefault to false. I won't be able to investigate this until Monday, but so far, my guess is that if the constructor is fixed, then RHEL6 will fail like RHEL7 ... fwiw, the intercomm_create test used to fail in Cisco mtt because …
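[Editor's sketch] The initialization fix itself is a one-liner; a stand-alone sketch assuming a much-simplified struct (the real pmix2x_threadshift_t has many more members and is constructed through OPAL's OBJ class machinery):

    #include <stdbool.h>

    typedef struct {
        bool nondefault;   /* the field the constructor left uninitialized */
        /* ... many other members in the real struct ... */
    } pmix2x_threadshift_t;

    static void tscon(pmix2x_threadshift_t *p)
    {
        /* without this line the field holds whatever happened to be in
         * the freshly allocated memory -- which is why the symptom
         * could differ between RHEL6 and RHEL7 */
        p->nondefault = false;
    }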

Re: [OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Ralph Castain
That would break debugger attach. Sounds to me like it’s just an uninitialized variable for in_event_hdlr?

[OMPI devel] MPI_Comm_spawn broken on master on RHEL7

2016-07-15 Thread Gilles Gouaillardet
Ralph, i noticed MPI_Comm_spawn is broken on master and on RHEL7. For some reason i cannot yet explain, it works just fine on RHEL6 (!) mpirun -np 1 ./dynamic/intercomm_create from the ibm test suite can be used to reproduce the issue. i dug a bit and i found OPAL_ERR_DEBUGGER_RELEASE is …

Re: [OMPI devel] MPI_Comm_spawn crashes with the openib btl

2014-10-01 Thread Gilles Gouaillardet
Thanks Ralph! It did fix the problem. Cheers, Gilles

Re: [OMPI devel] MPI_Comm_spawn crashes with the openib btl

2014-09-30 Thread Ralph Castain
I fixed this in r32818 - the components shouldn't be passing back success if the requested info isn't found. Hope that fixes the problem.

[OMPI devel] MPI_Comm_spawn crashes with the openib btl

2014-09-30 Thread Gilles Gouaillardet
Folks, the dynamic/spawn test from the ibm test suite crashes if the openib btl is detected (the test can be run on one node with an IB port). here is what happens: in mca_btl_openib_proc_create, the macro OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version, p…
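[Editor's sketch] The crash pattern, sketched with invented names (the real code uses the OPAL_MODEX_RECV macro and fills a message buffer): a component reports success even when no modex data was found, so the caller dereferences a NULL pointer. Ralph's r32818 makes the miss visible instead.

    #include <stddef.h>

    #define OPAL_SUCCESS        0
    #define OPAL_ERR_NOT_FOUND (-13)   /* numeric value assumed */

    static int modex_recv(int found, void **data, size_t *size)
    {
        if (!found) {
            *data = NULL;
            *size = 0;
            /* the pre-r32818 behavior returned OPAL_SUCCESS here, and
             * the caller went on to dereference the NULL pointer */
            return OPAL_ERR_NOT_FOUND;   /* the fix: report the miss */
        }
        /* ... otherwise fill the outputs from the received modex ... */
        return OPAL_SUCCESS;
    }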

Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-25 Thread Ralph Castain
I see your point, but I don't know how to make that happen. The problem is that spawn really should fail under certain conditions because you asked us to do something we couldn't do - i.e., you asked that we launch and bind more processes than we could. Increasing the number of available resources …

Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Hi Ralph, On 2014/06/25 2:51, Ralph Castain wrote: …

Re: [OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Ralph Castain
Hi Gilles, Had a chance to review this with folks here, and we think that having oversubscribe automatically set overload makes some sense. However, we do want to retain the ability to separately specify oversubscribe and overload as well, since these two terms don't mean quite the same thing. Our …

[OMPI devel] MPI_Comm_spawn fails under certain conditions

2014-06-24 Thread Gilles Gouaillardet
Folks, this issue is related to the failures reported by mtt on the trunk when the ibm test suite invokes MPI_Comm_spawn. my test bed is made of 3 (virtual) machines with 2 sockets and 8 cpus per socket each. if i run on one host (without any batch manager): mpirun -np 16 --host slurm1 --oversubscribe …

Re: [OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-06 Thread Ralph Castain
Gilles Gouaillardet [gilles.gouaillar...@gmail.com] Sent: Thursday, June 05, 2014 1:20 AM To: Open MPI Developers Subject: [OMPI devel] MPI_Comm_spawn affinity and coll/ml …

Re: [OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Hjelm, Nathan T
Gilles Gouaillardet [gilles.gouaillar...@gmail.com] Sent: Thursday, June 05, 2014 1:20 AM To: Open MPI Developers Subject: [OMPI devel] MPI_Comm_spawn affinity and coll/ml …

[OMPI devel] MPI_Comm_spawn affinity and coll/ml

2014-06-05 Thread Gilles Gouaillardet
Folks, on my single socket four cores VM (no batch manager), i am running the intercomm_create test from the ibm test suite. mpirun -np 1 ./intercomm_create => OK mpirun -np 2 ./intercomm_create => HANG :-( mpirun -np 2 --mca coll ^ml ./intercomm_create => OK basically, the first two tasks …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Ralph Castain
On Feb 22, 2014, at 10:14 AM, Suraj Prabhakaran wrote: …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Suraj Prabhakaran
> Yeah, we added those capabilities specifically for this purpose. Indeed, > another researcher added this to Torque a couple of years ago, though it > didn't get pushed upstream. Also was added to Slurm. Thanks for your help. By any chance do you have more info on that one? Or a faint idea where …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Ralph Castain
On Feb 22, 2014, at 9:30 AM, Suraj Prabhakaran wrote: …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Suraj Prabhakaran
Thanks Ralph. I cannot get rid of Torque since I am actually working on dynamic allocation of nodes for a running job on Torque. What I actually want to do is spawn processes on the dynamically assigned nodes, since that is the easiest way to expand MPI processes when a resource allocation …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Ralph Castain
On Feb 21, 2014, at 5:55 PM, Suraj Prabhakaran wrote: … Understood - my point was that the output shows no errors or issues. For some reason, the progress thread appears to just stop. This usually in…

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
Hmm.. but actually the MPI_Comm_spawn of the parents and MPI_Init of the children never returned! I configured MPI with ./configure --prefix=/dir/ --enable-debug --with-tm=/usr/local/

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Ralph Castain
Strange - it all looks just fine. How was OMPI configured?

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
Ok, I figured out that it was not a problem with the node grsacc04 because I now conducted the same test on a totally different set of nodes. I must really say that with the --bind-to none option, the program completed "many" times compared to earlier, but still "sometimes" it hangs! Attaching now the output …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Ralph Castain
Well, that all looks fine. However, I note that the procs on grsacc04 all stopped making progress at the same point, which is why the job hung. All the procs on the other nodes were just fine. So let's try a couple of things: 1. add "--bind-to none" to your cmd line so we avoid any contention …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
Right, so I have the output here. Same case: mpiexec -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 -np 3 ./simple_spawn Output attached. Best, Suraj

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Ralph Castain
On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran wrote: … Ah, no - you forgot to mention that point. …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
Thanks Ralph! I should have mentioned, though: without the Torque environment, spawning with ssh works ok, but under the Torque environment it does not. I started the simple_spawn with 3 processes and spawned 9 processes (3 per node on 3 nodes). There is no problem with the Torque environment because …

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Ralph Castain
Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't work"? Is there some specific behavior you see? You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it.

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
I am using 1.7.4!

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Ralph Castain
What OMPI version are you using?

[OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
Hello! I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes on various nodes. To be more precise, "sometimes" it works, and "sometimes" it doesn't! Here is my case. I obtain 5 nodes, 3 cores per node, and my $PBS_NODEFILE looks like below. …
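[Editor's note] The nodefile itself did not survive the archive; a purely hypothetical $PBS_NODEFILE for this allocation (hostnames invented; Torque lists each host once per allocated core, so 5 nodes x 3 cores gives 15 lines):

    node01
    node01
    node01
    node02
    node02
    node02
    ...
    node05
    node05
    node05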

[OMPI devel] mpi_comm_spawn

2010-06-03 Thread KHALDI Dounia
Hello, I want to create a process y from a process x (the master, for example). Then, I want to communicate between y and other processes in the group of x (between the child and his uncles :) ). I have tried to use MPI_Comm_connect and MPI_Comm_accept, but that concept requires that there is no …
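[Editor's sketch] For this use case, MPI_Comm_spawn sidesteps the connect/accept rendezvous entirely, because it returns an intercommunicator linking x's group to y. A minimal self-spawning sketch using only standard MPI calls (compile one program for both roles):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm parent, intercomm;
        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);

        if (MPI_COMM_NULL == parent) {
            /* process x: spawn one child y and send it a message */
            int msg = 42;
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
            MPI_Send(&msg, 1, MPI_INT, 0, 0, intercomm);
        } else {
            /* process y: the parent group is reachable through the
             * intercommunicator returned by MPI_Comm_get_parent() */
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, parent, MPI_STATUS_IGNORE);
            printf("child received %d from parent\n", msg);
        }
        MPI_Finalize();
        return 0;
    }

Communication with the rest of x's group ("the uncles") works the same way: ranks in an intercommunicator address the remote group, so the child can receive from any rank of the parent job.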

[OMPI devel] MPI_COMM_SPAWN[_MULTIPLE] launching non-MPI jobs

2007-12-03 Thread Jeff Squyres
This has been a long-standing ticket: https://svn.open-mpi.org/trac/ompi/ticket/1106 Ralph just recently implemented this on his branch, and is open for suggestions as to what we should name the MPI_Info key. Anyone have any suggestions? -- Jeff Squyres Cisco Systems
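[Editor's note] Later Open MPI releases settled on the key "ompi_non_mpi" (stated here from memory of subsequent releases, not from this thread); usage would look roughly like this, with /bin/hostname standing in for any non-MPI binary:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm intercomm;
        MPI_Info info;
        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);
        /* tell the runtime not to expect the children to call MPI_Init */
        MPI_Info_set(info, "ompi_non_mpi", "true");
        MPI_Comm_spawn("/bin/hostname", MPI_ARGV_NULL, 2, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }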

Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Pak Lui
okay, sorry that I might have confused you before, but at least we are all clear about the issues. It now sounds like I'll have to try restricting the number of times the "qrsh" gets called in the sge pls module for the current release, as I initially thought, and revisit this issue later in ORTE …

Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Ralph Castain
Hmmm...no, it appears that we are not talking about the same problem at all. An internal comm_spawn is not at all equivalent to an external execution of another mpirun command - even though the same functions may get called, there is a very large fundamental difference between the two scenarios

Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Pak Lui
Ralph Castain wrote: First, the fact that an orted already exists on a node is not sufficient to allow us to use it again for another application. The orted must be persistent or else we do not allow a new application to re-use it. This is required because the existing orted will go away when

Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Ralph Castain
Pak Lui wrote: Ralph Castain wrote: …

Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Pak Lui
Ralph Castain wrote: …

Re: [OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Ralph Castain
Hi Pak, I'm afraid I don't fully understand your question, so forgive me if I don't seem to address the problem adequately. As I understand it, you are asking about the scenario where someone wants to execute multiple calls of mpirun, with the applications executing on the same set of nodes. …

[OMPI devel] MPI_Comm_spawn[_multiple] and orted

2006-05-31 Thread Pak Lui
Hi, When I run a spawn program over rsh/ssh, I notice that each time the child program gets spawned, it will need to establish a new rsh/ssh connection to the remote node to launch orted on that node, even if the parent executable and the orted are already running on that node. So I wonder if there is any …