Re: [OMPI devel] --with-devel-headers and internal 'hwloc' header on 'mpicc --showme:compile'

2017-03-06 Thread Mark Santcroos
Hi, Picking up on the thread that continued on the hwloc issue. Jeff wrote: > The --with-devel-headers option is really only intended for developers who > are building Open MPI components out of tree. It is not really intended for > end users (e.g., we don't even document it in the README file)

Re: [OMPI devel] RFC: warn if running a debug build

2016-03-02 Thread Mark Santcroos
> On 02 Mar 2016, at 14:54 , Ralph Castain wrote: > * remove the enable-debug-by-default logic Given that it currently depends on whether your VPATH is inside or outside the source tree, I think that is the only consistent decision :)

Re: [OMPI devel] RFC: warn if running a debug build

2016-03-02 Thread Mark Santcroos
> On 02 Mar 2016, at 5:06 , Gilles Gouaillardet wrote: > what about *not* issuing this warning if OpenMPI is built from git ? > that would be friendlier for OMPI developers, > and should basically *not* affect endusers, since they would rather build > OMPI from a tarball. VPATH builds aren't de

Re: [OMPI devel] question on ORTE_DAEMON_ADD_LOCAL_PROCS

2015-11-11 Thread Mark Santcroos
te > out-of-band with some non-local peer. The RTE has no idea of the possible > communication pattern, so we err on the side of more info to avoid sometime > having to say “I don’t know how to do that, Dave” > > >> On Nov 11, 2015, at 1:58 PM, Mark Santcroos >> wrote:

Re: [OMPI devel] question on ORTE_DAEMON_ADD_LOCAL_PROCS

2015-11-11 Thread Mark Santcroos
> On 11 Nov 2015, at 22:43 , Ralph Castain wrote: > You must have the “-d” option set on the orte-dvm cmd line? Yes. > You’ll get one of those from every daemon in the job each time you launch an > app. Ok, that explains the number better :) Why is every orted aware of a job on another orted?

[OMPI devel] question on ORTE_DAEMON_ADD_LOCAL_PROCS

2015-11-11 Thread Mark Santcroos
Hi, I'm seeing this message an awful lot of times (i.e. orders of magnitude more often than I launch processes). [nid15897:18424] [[2305,0],103] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS How should I interpret that? Thanks Mark

Re: [OMPI devel] PMIX deadlock

2015-11-09 Thread Mark Santcroos
It seems the change suggested by Nysal also allows me to run into the next problem ;-) Mark > On 09 Nov 2015, at 20:19 , George Bosilca wrote: > > All 10k tests completed successfully. Nysal pinpointed the real problem > behind the deadlocks. :+1: > > George. > > > On Mon, Nov 9, 2015 at

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-27 Thread Mark Santcroos
> On 24 Oct 2015, at 7:54 , Mark Santcroos wrote: > Will test it on real systems once it hits master. FYI: It's been holding up pretty well on real deployment too!

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-24 Thread Mark Santcroos
> On Oct 23, 2015, at 5:40 PM, Ralph Castain wrote: >> >> Could be - let me investigate this weekend. >> >> Thanks for all that parsing!!! >> >>> On Oct 23, 2015, at 5:00 PM, Mark Santcroos >>> wrote: >>> >>> Is this the

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
Is this the culprit? 'ACTIVATING PROC [[8679,2],0] STATE IOF COMPLETE PRI 4', 'state:base:track_procs called for proc [[8679,2],0] state RUNNING', That seems to be out of order for the hanging processes.

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
> On 23 Oct 2015, at 23:45 , Mark Santcroos wrote: > the second is output from my parser script I figured you might want the output of the succeeded jobs too, please see the updated output attached. Jobs started 16 Jobs completed: 15 Procs completed: 16 Communication "Error

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-23 Thread Mark Santcroos
> On 21 Oct 2015, at 2:50 , Ralph Castain wrote: > Can you do me a favor? Hi Ralph, It required some parsing-fu, but here you go! :-) Three text files attached. One is the raw log, the second is output from my parser script and the third is the output of pstree after it hangs. Hopefully this

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 16 Oct 2015, at 0:44 , Ralph Castain wrote: > > Hmmmok. I'll have to look at it this weekend when I return from travel. > Can you please send me your test program so I can try to locally reproduce it? Ok, thanks Ralph. Start the DVM with: orte-dvm --report-uri dvm_uri --debug-devel

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 16 Oct 2015, at 0:23 , Ralph Castain wrote: > Okay, that means that the dvm isn't recognizing that the jobs actually > completed. Ok. > So the question is: what is it about those jobs? They are all the same. > Are those 6 jobs very short-lived, and the others are longer-lived? All very

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 16 Oct 2015, at 0:09 , Ralph Castain wrote: > > Help me out a bit - how many jobs did you actually run? 42 tasks in total, 6 stalled, 36 returned.

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 15 Oct 2015, at 17:25 , Ralph Castain wrote: > > Interesting - I see why. Please try this version. Ok, that works as expected. I'll repeat the results with this version too: $ grep TERMINATED dvm_output-patched.txt |wc -l 36 $ grep NOTIFYING dvm_output-patched.txt |wc -l 36
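
The two `grep ... | wc -l` checks above compare how many jobs the DVM marked terminated against how many completion notifications it sent. A minimal Python sketch of the same comparison follows. It assumes only that each event appears on its own log line containing the literal marker `TERMINATED` or `NOTIFYING`, as in the grep commands; the function names are hypothetical and not part of Open MPI.

```python
def count_markers(lines, markers=("TERMINATED", "NOTIFYING")):
    """Count the lines containing each marker, like `grep MARKER | wc -l`."""
    counts = {marker: 0 for marker in markers}
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def missing_notifications(counts):
    """Jobs marked terminated for which no completion message was logged."""
    return counts["TERMINATED"] - counts["NOTIFYING"]


def count_markers_in_file(path):
    """Convenience wrapper: run the count over a saved DVM output file."""
    with open(path, errors="replace") as handle:
        return count_markers(handle)
```

A result of 0 from `missing_notifications` corresponds to the matching 36/36 counts reported here; a positive value would point at jobs whose requestor was never told about completion.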

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
> On 15 Oct 2015, at 4:38 , Ralph Castain wrote: > Okay, please try the attached patch. *scratch* Although I reported results with the patch earlier, I can't reproduce it anymore. Now orte-dvm shuts down after the first orte-submit completes with: [netbook:72038] [[9827,0],0] orted:comm:proc

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
Another data point, this only seems to happen for really short tasks, i.e. < 1 sec.

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-15 Thread Mark Santcroos
Hi! > On 15 Oct 2015, at 4:38 , Ralph Castain wrote: > > Okay, please try the attached patch. It will cause two messages to be output > for each job: one indicating the job has been marked terminated, and the > other reporting that the completion message was sent to the requestor. Let's > see

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Mark Santcroos
Hi Ralph, > On 15 Oct 2015, at 0:26 , Ralph Castain wrote: > Okay, so each orte-submit is reporting job has launched, which means the hang > is coming while waiting to hear the job completed. Are you sure that orte-dvm > believes the job has completed? No, I'm not. > In other words, when you

Re: [OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Mark Santcroos
Hi Ralph, > On 14 Oct 2015, at 21:50 , Ralph Castain wrote: > I wonder if they might be getting duplicate process names if started quickly > enough. Do you get the "job has launched" message (orte-submit outputs a > message after orte-dvm responds that the job launched)? Based on the output bel

[OMPI devel] orte-dvm / orte-submit race condition

2015-10-14 Thread Mark Santcroos
Hi, By hammering on a DVM with orte-submit I can reproducibly make orte-submit not return, but hang instead. The task is executed correctly though. It can be reproduced using the small snippet below. Switching from sequential to "concurrent" execution of the orte-submit calls triggers the effect.
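
The snippet referred to in this message is not preserved in the archive listing. As a stand-in, here is a minimal Python sketch of the sequential versus concurrent comparison it describes. It is not the original reproducer: the command to run is passed in by the caller, so the exact orte-submit command line is an assumption left to the reader, and the names `run_once` and `hammer` are hypothetical. A timeout is used to classify a submission that never returns as hung.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd, timeout):
    """Run cmd once; report 'ok', 'failed', or 'hung' (no return within timeout)."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"


def hammer(cmd, count, concurrent, timeout=60.0):
    """Launch cmd `count` times, one after another or all at once."""
    if not concurrent:
        return [run_once(cmd, timeout) for _ in range(count)]
    # One worker per submission so every launch is started immediately.
    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(lambda _: run_once(cmd, timeout), range(count)))
```

Calling `hammer(cmd, 42, concurrent=False)` and then `hammer(cmd, 42, concurrent=True)` with the same submit command would show whether only the concurrent run produces `'hung'` entries.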

Re: [OMPI devel] pmix warnings on cray with HEAD

2015-09-22 Thread Mark Santcroos
k there is no more mca_pmix_native.so. > you can confirm that by checking the timestamps of the libs after > running make install. > just remove your install dir, and run make install again, and that > will solve your issue. > > Cheers, > > Gilles > > > On Tue, Sep

[OMPI devel] pmix warnings on cray with HEAD

2015-09-22 Thread Mark Santcroos
Hi, On some Crays I see the following warning (regardless of whether I run through aprun, mpirun or orte-submit): [nid01926:30931] mca_base_component_repository_open: unable to open mca_pmix_native: /work/e290/e290/marksant/openmpi/installed/HEAD/lib/openmpi/mca_pmix_native.so: undefined symbo

Re: [OMPI devel] regression running mpi applications with dvm

2015-09-22 Thread Mark Santcroos
Thanks Ralph, 4899e7fe fixes it! Cheers, Mark

Re: [OMPI devel] orte-dvm and orte_max_vm_size

2015-09-17 Thread Mark Santcroos
he full output of params. Right I tried that. So I don't understand it completely or it doesn't work as expected, as I don't manage to get e.g. "orte_max_vm_size" as output from that. (I also believe that -all sets the level to 9 already) Thanks! Mark > > >> On

Re: [OMPI devel] regression running mpi applications with dvm

2015-09-17 Thread Mark Santcroos
> On 17 Sep 2015, at 20:48 , Ralph Castain wrote: > Might not - there has been a very large amount of change over the last few > months, and I confess I haven't been checking the DVM regularly. So let me > take a step back and look at that code. Ok. > I'll also include the extensions you requ

Re: [OMPI devel] regression running mpi applications with dvm

2015-09-17 Thread Mark Santcroos
didn't check every single version between March and now, but it's safe to assume that it didn't work in between either I guess. > > > On Thu, Sep 17, 2015 at 11:30 AM, Mark Santcroos > wrote: > Hi (Ralph), > > Over the last months I have been focussing on exec through

[OMPI devel] regression running mpi applications with dvm

2015-09-17 Thread Mark Santcroos
Hi (Ralph), Over the last months I have been focussing on exec throughput, and not so much on the application payload (read: mainly using /bin/sleep ;-) As things are stabilising now, I returned my attention to "real" applications. To discover that launching MPI applications (built with the same

[OMPI devel] orte-dvm and orte_max_vm_size

2015-09-03 Thread Mark Santcroos
Hi, I've been running into a funny issue with using orte-dvm (Hi Ralph ;-) and trying to define the size of the created vm and for that I use "--mca orte_max_vm_size" which in general seems to work. In this example I have a PBS job of 4 nodes and want to run the DVM on < 4 nodes. If I creat

Re: [OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-28 Thread Mark Santcroos
Hi Gilles, > On 28 Aug 2015, at 2:55 , Gilles Gouaillardet wrote: > what about : > - if only one interface is specified (e.g. *_if_include eth0), then bind to > that interface > - otherwise, bind to all interfaces I agree, with the notion that you don't really bind to interfaces, but to addres
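
The rule proposed in the quoted text (bind to the one selected interface, otherwise to everything) can be sketched in Python as below. Following the remark that one binds to addresses rather than interfaces, the sketch takes already-resolved IPv4 addresses as input; mapping an interface name such as eth0 to its address is left out. This is an illustration only, not the code in oob_tcp_listener.c, and both function names are hypothetical.

```python
import socket

WILDCARD = "0.0.0.0"  # IPv4 INADDR_ANY


def choose_bind_address(included_addresses):
    """Return the single selected address, or the wildcard in every other case."""
    if len(included_addresses) == 1:
        return included_addresses[0]
    return WILDCARD


def make_listener(included_addresses, port=0):
    """Create a listening TCP socket bound according to choose_bind_address.

    Port 0 asks the kernel to pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((choose_bind_address(included_addresses), port))
    sock.listen()
    return sock
```

With exactly one address the listener is reachable only through that address; with none or several it falls back to the wildcard, which matches the "otherwise, bind to all interfaces" branch.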

Re: [OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-28 Thread Mark Santcroos
Hi Ralph, > On 28 Aug 2015, at 2:50 , Ralph Castain wrote: > I committed the change that prevents orte-submit from binding a listener - > seems to work fine for me, so please let me know how it works for you. Great, works indeed! > The other issue - binding to all interfaces instead of only th

Re: [OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-27 Thread Mark Santcroos
> On 27 Aug 2015, at 17:58 , Ralph Castain wrote: > Okay, let me take a look Thanks Ralph, please let me know if I can be of any assistance!

Re: [OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-27 Thread Mark Santcroos
Hi Howard, > On 27 Aug 2015, at 17:59 , Mark Santcroos wrote: >> If you bind to ipogif0 then you should have much better luck, unless >> you're trying to have open mpi span outside the cray HPN. > > > Now you get me wondering. I actually played with both oob-t

Re: [OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-27 Thread Mark Santcroos
sed me up with > runs on carver > system at NERSC for a while. > > Howard > > > 2015-08-27 9:42 GMT-06:00 Mark Santcroos : > Hi, > > For some reason that is currently still beyond me, I can't bind to INADDR_ANY > for more than 74 ports on a Cray compute no

Re: [OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-27 Thread Mark Santcroos
> On 27 Aug 2015, at 17:44 , Ralph Castain wrote: > Just to be clear: you are saying that orte-submit is creating a listener? If > so, I can correct that as it doesn’t need to do so. Yes, I think it does indeed. At least it's hitting that code path that looks suspiciously like a listener! :)

[OMPI devel] bind to interface / address oob_tcp_listener.c:create_listen()

2015-08-27 Thread Mark Santcroos
Hi, For some reason that is currently still beyond me, I can't bind to INADDR_ANY for more than 74 ports on a Cray compute node, without getting EADDRINUSE. This impacts my use of the oob_tcp_listener.c:create_listen() code on that machine (through means of orte-submit). I've implemented a proo
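
A limit like the one described here can be probed independently of Open MPI. The Python sketch below keeps binding listening TCP sockets to INADDR_ANY until a bind fails and reports how many succeeded and which errno stopped it. It assumes kernel-chosen ports (port 0); whether create_listen() requests ports the same way is not stated in this message, so treat the probe as a rough approximation. The function name is hypothetical.

```python
import errno
import socket


def count_wildcard_binds(limit):
    """Bind up to `limit` listening sockets to INADDR_ANY on kernel-chosen ports.

    Returns (number_bound, errno_name_of_first_failure_or_None).
    Keep `limit` below the process file-descriptor limit.
    """
    sockets = []
    failure = None
    try:
        for _ in range(limit):
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.bind(("0.0.0.0", 0))  # wildcard address, any free port
                sock.listen()
            except OSError as exc:
                sock.close()
                failure = errno.errorcode.get(exc.errno, str(exc.errno))
                break
            sockets.append(sock)  # keep it open so the port stays in use
        return len(sockets), failure
    finally:
        for sock in sockets:
            sock.close()
```

On a node showing the behaviour reported above, `count_wildcard_binds(200)` would be expected to stop early with `'EADDRINUSE'`; on an unaffected machine it should bind all requested sockets and return `None` as the failure.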

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-22 Thread Mark Santcroos
Yep, it works again, thanks! > On 22 Aug 2015, at 0:00 , Mark Santcroos wrote: > > Thanks Ralph. > The machine in question is in maintenance currently, so can't check, will get > back to you as soon as I can. > >> On 21 Aug 2015, at 16:51 , Ralph Castain wr

Re: [OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Mark Santcroos
eproduce on nersc systems. >>> >>> -- >>> >>> sent from my smart phonr so no good type. >>> >>> Howard >>> >>> On Aug 21, 2015 7:51 AM, "Ralph Castain" wrote: >>> I’ll take a look at

[OMPI devel] orte-dvm startup fails on HEAD

2015-08-21 Thread Mark Santcroos
Hi all, I see the errors below on startup of orte-dvm on a Cray XE/XK hybrid. Didn't track the commit that caused it yet, but maybe somebody has a clue from the error already. Last known to work was on July 14. The 2.x branch works fine. Please let me know if this should be a ticket. Thanks Ma

Re: [OMPI devel] Proposal: update Open MPI's version number and release process

2015-05-18 Thread Mark Santcroos
Hi Jeff, all, Thanks for bringing this to the wider community. I hope this will eventually address my main concern: the relatively old versions that get deployed on HPC systems around the world, which I assume is/was because of the "odd ;-)" numbering. What I didn't see in the doc, will you co