Re: [OMPI devel] Parent terminates when child crashes/terminates (without finalizing)

2010-12-17 Thread Suraj Prabhakaran
t terminates too. Perhaps there is a way to avoid this kind of auto cleanup? Thanks, Suraj Prabhakaran

[OMPI devel] Problems with Connect/Accept and Disconnect

2010-12-17 Thread Suraj Prabhakaran
Hello, I have been having some problems with connect and disconnect between two processes. The processes seem to block indefinitely at the Connect/Accept stage or at the Disconnect stage. For example: Process A { MPI_Open_port(...); MPI_Publish_name(...); MPI_Comm_accept(... &b_comm)

Re: [OMPI devel] Parent terminates when child crashes/terminates (without finalizing)

2010-12-17 Thread Suraj Prabhakaran
On 12/17/2010 06:24 PM, George Bosilca wrote: Let me try to round the edges on this one. It is not that we couldn't or wouldn't like to have a more "MPI" compliant approach on this, but the definition of connected processes in the MPI standard is [kind of] shady. One thing is clear however, i

[OMPI devel] Connect/Accept and Disconnect

2010-12-21 Thread Suraj Prabhakaran
Hello, This is basically a repost of my previous mail regarding problems with connect/accept and disconnect (**this is not related to spawning, parent/child**). I *sometimes* find processes blocking indefinitely at Connect/Accept calls or at Disconnect calls. I have an example below. Process

Re: [OMPI devel] Connect/Accept and Disconnect

2010-12-21 Thread Suraj Prabhakaran
On 12/21/2010 03:12 PM, Ralph Castain wrote: Are you using ompi-server for pub/sub, or just letting it default to mpirun? You might want to output the return value from lookup_name and publish_name to see if they match. If they are different, then you will definitely hang. I used ompi-serv

[OMPI devel] Quick fix for MPI_Publish_name

2011-03-04 Thread Suraj Prabhakaran
other tool. Hope this is useful. Best, Suraj Prabhakaran

[OMPI devel] MPI_Comm_connect/accept does not work as it should

2012-09-12 Thread Suraj Prabhakaran
Dear all, I observed a strange behavior with MPI_Comm_connect and MPI_Comm_disconnect. In short, after two processes connect to each other with a port and merge to create an intra comm (rank 0 and rank 1), only one of them (the root) is thereafter able to reach a third new process through MPI_Co

Re: [OMPI devel] MPI_Comm_connect/accept does not work as it should

2012-09-12 Thread Suraj Prabhakaran
I am referring to v1.6. On Sep 12, 2012, at 5:27 PM, Ralph Castain wrote: > what version of ompi are you referring to? > > On Sep 12, 2012, at 8:13 AM, Suraj Prabhakaran > wrote: > >> Dear all, >> >> I observed a strange behavior with MPI_Comm_connect and

[OMPI devel] Intercomm Merge

2013-09-13 Thread Suraj Prabhakaran
Hello, Is there a plan to fix the problem with MPI_Intercomm_merge with 1.7.3 as stated in this ticket? We are really in need of this at the moment. Any hints? We face the following problem. Parents (x and y) spawn child (z). (all of them execute on separate nodes) x is the root. x,y and z do a

Re: [OMPI devel] Intercomm Merge

2013-09-13 Thread Suraj Prabhakaran
. I doubt it > will make 1.7.3, though if you need it, I'll give it a try. > > On Sep 13, 2013, at 7:21 AM, Suraj Prabhakaran < > suraj.prabhaka...@gmail.com> wrote: > > > Hello, > > > > Is there a plan to fix the problem with MPI_Intercomm_merge with 1.7

Re: [OMPI devel] Intercomm Merge

2013-09-17 Thread Suraj Prabhakaran
Hi Ralph, Thanks a lot!!! That's really cool!! Best, Suraj On Sep 15, 2013, at 5:01 PM, Ralph Castain wrote: > I fixed it and have filed a cmr to move it to 1.7.3 > > Thanks for your patience, and for reminding me > Ralph > > On Sep 13, 2013, at 12:05 PM, Suraj Pra

Re: [OMPI devel] Intercomm Merge

2013-09-21 Thread Suraj Prabhakaran
Dear all, Really thanks a lot for your efforts. I too downloaded the trunk to check if it works for my case and as of revision 29215, it works for the original case I reported. Although it works, I still see the following in the output. Does it mean anything? [grsacc17][[13611,1],0][btl_openib_

Re: [OMPI devel] Intercomm Merge

2013-09-22 Thread Suraj Prabhakaran
ilable - no Torque. > > You might add "-mca plm_base_verbose 5" to your cmd line to get some debug > output as to the problem. > > > On Sep 21, 2013, at 5:48 PM, Ralph Castain wrote: > >> >> On Sep 21, 2013, at 4:54 PM, Suraj Prabhakaran >

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
and current 1.7.3 branch for "add-host" and both worked just >>> fine. This was on my little test cluster which only has rsh available - no >>> Torque. >>> >>> You might add "-mca plm_base_verbose 5" to your cmd line to get some debug >

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
eck your > cables, subnet manager configuration, etc. The openib BTL will be > ignored for this job. > > Local host: %s > > Looks like at least one node being used doesn't have an active Infiniband > port on it? > > > On Sep 24, 2013, at 6:11 AM, Suraj Prabhaka

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
r_base_verbose 10 > > Thanks > Ralph > > On Sep 24, 2013, at 6:35 AM, Suraj Prabhakaran > wrote: > >> Hi Ralph, >> >> I always got this output from any MPI job that ran on our nodes. There seems >> to be a problem somewhere but it never stopped

Re: [OMPI devel] Intercomm Merge

2013-09-24 Thread Suraj Prabhakaran
little more info as to exactly what you are doing? Perhaps > send me your test code? > > On Sep 24, 2013, at 7:48 AM, Suraj Prabhakaran > wrote: > >> Hi Ralph, >> >> Output attached in a file. >> Thanks a lot! >> >> Best, >> Suraj >

Re: [OMPI devel] Intercomm Merge

2013-09-25 Thread Suraj Prabhakaran
Dear Ralph, I am sorry, but I think I forgot to set the plm verbosity to 5 last time. Here is the output of the complete program, with and without -novm added to the following mpiexec command: mpiexec -mca state_base_verbose 10 -mca errmgr_base_verbose 10 -mca plm_base_verbose 5 -mca btl tcp,sm,self -np 2 ./addho

Re: [OMPI devel] Intercomm Merge

2013-10-02 Thread Suraj Prabhakaran
Sorry for the very late reply. Everything works now! Thanks a lot!! On Sep 25, 2013, at 7:00 PM, Ralph Castain wrote: > I've committed a fix to the trunk (r29245) and scheduled it for v1.7.3 - > thanks for the debug info! > > Ralph > > On Sep 25, 2013, at 5:00 AM, Sura

[OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
Hello! I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes on various nodes. To be more precise, "sometimes" it works, and "sometimes" it doesn't! Here is my case. I obtain 5 nodes, 3 cores per node, and my $PBS_NODEFILE looks like below.

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
I am using 1.7.4! On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote: > What OMPI version are you using? > > On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran > wrote: > >> Hello! >> >> I am having a problem using MPI_Comm_spawn under Torque. It doesn'

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-20 Thread Suraj Prabhakaran
? Is there some specific behavior you see? > > You might try the attached program. It's a simple spawn test we use - 1.7.4 > seems happy with it. > > > > On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran > wrote: > >> I am using 1.7.4! >> >>

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
, 2014, at 7:05 PM, Suraj Prabhakaran > wrote: > >> Thanks Ralph! >> >> I should have mentioned this, though. Without the Torque environment, spawning with >> ssh works OK. But under the Torque environment, it does not. > > Ah, no - you forgot to mention that point.

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
OK, I figured out that it was not a problem with the node grsacc04, because I have now run the same test on a totally different set of nodes. I must say that with the --bind-to none option, the program completed "many" more times than before, but it still "sometimes" hangs! Attaching now the out

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-21 Thread Suraj Prabhakaran
red? > > On Feb 21, 2014, at 3:31 PM, Suraj Prabhakaran > wrote: > >> Ok, I figured out that it was not a problem with the node grsacc04 because I >> now conducted the same on totally different set of nodes. >> >> I must really say that with --bind-to none

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Suraj Prabhakaran
Ralph Castain wrote: > > On Feb 21, 2014, at 5:55 PM, Suraj Prabhakaran > wrote: > >> Hmm.. but in actual the MPI_Comm_spawn of parents and MPI_Init of children >> never returned! > > Understood - my point was that the output shows no errors or issues. For so

Re: [OMPI devel] MPI_Comm_spawn under Torque

2014-02-22 Thread Suraj Prabhakaran
a faint idea where I can find some info on that? I never found something like that Best, Suraj On Feb 22, 2014, at 6:38 PM, Ralph Castain wrote: > > On Feb 22, 2014, at 9:30 AM, Suraj Prabhakaran > wrote: > >> Thanks Ralph. >> >> I cannot get rid of T