Hi,
Picking up on the thread that continued on the hwloc issue.
Jeff wrote:
> The --with-devel-headers option is really only intended for developers who
> are building Open MPI components out of tree. It is not really intended for
> end users (e.g., we don't even document it in the README file)
> On 02 Mar 2016, at 14:54 , Ralph Castain wrote:
> * remove the enable-debug-by-default logic
Given that it currently depends on whether your VPATH is inside or outside the
source tree, I think that is the only consistent decision :)
> On 02 Mar 2016, at 5:06 , Gilles Gouaillardet wrote:
> what about *not* issuing this warning if OpenMPI is built from git ?
> that would be friendlier for OMPI developers,
> and should basically *not* affect endusers, since they would rather build
> OMPI from a tarball.
VPATH builds aren't de
> out-of-band with some non-local peer. The RTE has no idea of the possible
> communication pattern, so we err on the side of more info to avoid sometime
> having to say “I don’t know how to do that, Dave”
>
>
>> On Nov 11, 2015, at 1:58 PM, Mark Santcroos
>> wrote:
> On 11 Nov 2015, at 22:43 , Ralph Castain wrote:
> You must have the “-d” option set on the orte-dvm cmd line?
Yes.
> You’ll get one of those from every daemon in the job each time you launch an
> app.
Ok, that explains the number better :)
Why is every orted aware of a job on another orted?
Hi,
I'm seeing this message an awful lot of times (i.e. orders of magnitude more
often than I launch processes):
[nid15897:18424] [[2305,0],103] orted:comm:process_commands() Processing
Command: ORTE_DAEMON_ADD_LOCAL_PROCS
How should I interpret that?
Thanks
Mark
It seems the change suggested by Nysal also allows me to run into the next
problem ;-)
Mark
> On 09 Nov 2015, at 20:19 , George Bosilca wrote:
>
> All 10k tests completed successfully. Nysal pinpointed the real problem
> behind the deadlocks. :+1:
>
> George.
>
>
> On Mon, Nov 9, 2015 at
> On 24 Oct 2015, at 7:54 , Mark Santcroos wrote:
> Will test it on real systems once it hits master.
FYI: It's been holding up pretty well in real deployments too!
> On Oct 23, 2015, at 5:40 PM, Ralph Castain wrote:
>>
>> Could be - let me investigate this weekend.
>>
>> Thanks for all that parsing!!!
>>
>>> On Oct 23, 2015, at 5:00 PM, Mark Santcroos
>>> wrote:
>>>
>>> Is this the
Is this the culprit?
'ACTIVATING PROC [[8679,2],0] STATE IOF COMPLETE PRI 4',
'state:base:track_procs called for proc [[8679,2],0] state RUNNING',
That seems to be out of order for the hanging processes.
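For what it's worth, a minimal sketch (log file name assumed) of how the state
transitions for that proc could be pulled out of the raw log:

    grep '\[\[8679,2\],0\]' raw_log.txt | grep -E 'ACTIVATING PROC|track_procs'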
> On 23 Oct 2015, at 23:45 , Mark Santcroos wrote:
> the second is output from my parser script
I figured you might want the output of the succeeded jobs too; please see the
updated output attached.
Jobs started: 16
Jobs completed: 15
Procs completed: 16
Communication "Error
> On 21 Oct 2015, at 2:50 , Ralph Castain wrote:
> Can you do me a favor?
Hi Ralph,
It required some parsing-fu, but here you go! :-)
Three text files attached. One is the raw log, the second is output from my
parser script and the third is the output of pstree after it hangs.
Hopefully this
> On 16 Oct 2015, at 0:44 , Ralph Castain wrote:
>
> Hmmm, ok. I'll have to look at it this weekend when I return from travel.
> Can you please send me your test program so I can try to locally reproduce it?
Ok, thanks Ralph.
Start the DVM with: orte-dvm --report-uri dvm_uri --debug-devel
> On 16 Oct 2015, at 0:23 , Ralph Castain wrote:
> Okay, that means that the dvm isn't recognizing that the jobs actually
> completed.
Ok.
> So the question is: what is it about those jobs?
They are all the same.
> Are those 6 jobs very short-lived, and the others are longer-lived?
All very
> On 16 Oct 2015, at 0:09 , Ralph Castain wrote:
>
> Help me out a bit - how many jobs did you actually run?
42 tasks in total, 6 stalled, 36 returned.
> On 15 Oct 2015, at 17:25 , Ralph Castain wrote:
>
> Interesting - I see why. Please try this version.
Ok, that works as expected.
I'll repeat the results with this version too:
$ grep TERMINATED dvm_output-patched.txt |wc -l
36
$ grep NOTIFYING dvm_output-patched.txt |wc -l
36
> On 15 Oct 2015, at 4:38 , Ralph Castain wrote:
> Okay, please try the attached patch.
*scratch*
Although I reported results with the patch earlier, I can't reproduce it
anymore.
Now orte-dvm shuts down after the first orte-submit completes with:
[netbook:72038] [[9827,0],0] orted:comm:proc
Another data point: this only seems to happen for really short tasks, i.e. < 1
sec.
Hi!
> On 15 Oct 2015, at 4:38 , Ralph Castain wrote:
>
> Okay, please try the attached patch. It will cause two messages to be output
> for each job: one indicating the job has been marked terminated, and the
> other reporting that the completion message was sent to the requestor. Let's
> see
Hi Ralph,
> On 15 Oct 2015, at 0:26 , Ralph Castain wrote:
> Okay, so each orte-submit is reporting job has launched, which means the hang
> is coming while waiting to hear the job completed. Are you sure that orte-dvm
> believes the job has completed?
No, I'm not.
> In other words, when you
Hi Ralph,
> On 14 Oct 2015, at 21:50 , Ralph Castain wrote:
> I wonder if they might be getting duplicate process names if started quickly
> enough. Do you get the "job has launched" message (orte-submit outputs a
> message after orte-dvm responds that the job launched)?
Based on the output bel
Hi,
By hammering on a DVM with orte-submit I can reproducibly make orte-submit not
return, but hang instead.
The task is executed correctly though.
It can be reproduced using the small snippet below.
Switching from sequential to "concurrent" execution of the orte-submits
triggers the effect.
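The snippet itself didn't make it into this excerpt, so here is a minimal
sketch of that kind of reproducer; it assumes the DVM is started as described
elsewhere in the thread (orte-dvm --report-uri dvm_uri) and that orte-submit
picks up the contact info via --hnp file:dvm_uri:

    orte-dvm --report-uri dvm_uri &
    sleep 2                         # give the DVM time to write dvm_uri
    # "concurrent": background each orte-submit instead of waiting for it
    for i in $(seq 1 16); do
        orte-submit --hnp file:dvm_uri -np 1 /bin/sleep 0 &
    done
    wait                            # some orte-submits hang here instead of returning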
k there is no more mca_pmix_native.so.
> you can confirm that by checking the timestamps of the libs after
> running make install.
> just remove your install dir, and run make install again, and that
> will solve your issue.
>
> Cheers,
>
> Gilles
>
>
> On Tue, Sep
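A minimal sketch of the cleanup Gilles describes above (the install prefix is
a placeholder):

    rm -rf <prefix>                                  # wipe the stale install tree
    make install                                     # reinstall from the current build
    ls <prefix>/lib/openmpi | grep mca_pmix_native   # should find nothing afterwards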
Hi,
On some Crays I see the following warning (regardless of whether I run through
aprun, mpirun or orte-submit):
[nid01926:30931] mca_base_component_repository_open: unable to open
mca_pmix_native:
/work/e290/e290/marksant/openmpi/installed/HEAD/lib/openmpi/mca_pmix_native.so:
undefined symbo
Thanks Ralph, 4899e7fe fixes it!
Cheers,
Mark
he full output of params.
Right, I tried that. So either I don't understand it completely or it doesn't
work as expected, as I don't manage to get e.g. "orte_max_vm_size" as output
from that.
(I also believe that -all sets the level to 9 already)
Thanks!
Mark
>
>
>> On
> On 17 Sep 2015, at 20:48 , Ralph Castain wrote:
> Might not - there has been a very large amount of change over the last few
> months, and I confess I haven't been checking the DVM regularly. So let me
> take a step back and look at that code.
Ok.
> I'll also include the extensions you requ
didn't check every single version between March and now, but it's safe to
assume that it didn't work in between either, I guess.
>
>
> On Thu, Sep 17, 2015 at 11:30 AM, Mark Santcroos
> wrote:
> Hi (Ralph),
>
> Over the last months I have been focussing on exec through
Hi (Ralph),
Over the last months I have been focussing on exec throughput, and not so much
on the application payload (read: mainly using /bin/sleep ;-)
As things are stabilising now, I returned my attention to "real" applications.
To discover that launching MPI applications (built with the same
Hi,
I've been running into a funny issue using orte-dvm (Hi Ralph ;-) while trying
to define the size of the created VM; for that I use "--mca orte_max_vm_size",
which in general seems to work.
In this example I have a PBS job of 4 nodes and want to run the DVM on < 4
nodes.
If I creat
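The excerpt is cut off above, but a minimal sketch of the setup being described
(a DVM smaller than the PBS allocation, with the parameter simply passed on the
orte-dvm command line) could look like:

    # inside a PBS job that allocated 4 nodes, ask the DVM to use only 2 of them
    orte-dvm --report-uri dvm_uri --mca orte_max_vm_size 2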
Hi Gilles,
> On 28 Aug 2015, at 2:55 , Gilles Gouaillardet wrote:
> what about :
> - if only one interface is specified (e.g. *_if_include eth0), then bind to
> that interface
> - otherwise, bind to all interfaces
I agree, with the notion that you don't really bind to interfaces, but to
addresses
Hi Ralph,
> On 28 Aug 2015, at 2:50 , Ralph Castain wrote:
> I committed the change that prevents orte-submit from binding a listener -
> seems to work fine for me, so please let me know how it works for you.
Great, works indeed!
> The other issue - binding to all interfaces instead of only th
> On 27 Aug 2015, at 17:58 , Ralph Castain wrote:
> Okay, let me take a look
Thanks Ralph, please let me know if I can be of any assistance!
Hi Howard,
> On 27 Aug 2015, at 17:59 , Mark Santcroos wrote:
>> If you bind to ipogif0 then you should have much better luck, unless
>> you're trying to have open mpi span outside the cray HPN.
>
>
> Now you get me wondering. I actually played with both oob-t
sed me up with
> runs on carver
> system at NERSC for a while.
>
> Howard
>
>
> 2015-08-27 9:42 GMT-06:00 Mark Santcroos :
> Hi,
>
> For some reason that is currently still beyond me, I can't bind to INADDR_ANY
> for more than 74 ports on a Cray compute no
> On 27 Aug 2015, at 17:44 , Ralph Castain wrote:
> Just to be clear: you are saying that orte-submit is creating a listener? If
> so, I can correct that as it doesn’t need to do so.
Yes, I think it does indeed. At least it's hitting that code path that looks
suspiciously like a listener! :)
Hi,
For some reason that is currently still beyond me, I can't bind to INADDR_ANY
for more than 74 ports on a Cray compute node, without getting EADDRINUSE.
This impacts my use of the oob_tcp_listener.c:create_listen() code on that
machine (through means of orte-submit).
I've implemented a proo
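Not Cray-specific, but a couple of generic checks that might help show what is
already bound on the node before EADDRINUSE appears (assuming netstat and /proc
are available on the compute node):

    netstat -ant | grep LISTEN | wc -l           # listening sockets already present
    cat /proc/sys/net/ipv4/ip_local_port_range   # ephemeral port range the kernel uses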
Yep, it works again, thanks!
> On 22 Aug 2015, at 0:00 , Mark Santcroos wrote:
>
> Thanks Ralph.
> The machine in question is in maintenance currently, so can't check, will get
> back to you as soon as I can.
>
>> On 21 Aug 2015, at 16:51 , Ralph Castain wr
eproduce on nersc systems.
>>>
>>> --
>>>
>>> sent from my smart phonr so no good type.
>>>
>>> Howard
>>>
>>> On Aug 21, 2015 7:51 AM, "Ralph Castain" wrote:
>>> I’ll take a look at
Hi all,
I see the errors below on startup of orte-dvm on a Cray XE/XK hybrid.
Didn't track the commit that caused it yet, but maybe somebody has a clue from
the error already.
Last known to work was on July 14. The 2.x branch works fine.
Please let me know if this should be a ticket.
Thanks
Ma
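Since a last known good date is mentioned above (July 14), a sketch of how the
offending commit could be narrowed down with git bisect (the commit hash is a
placeholder):

    git bisect start
    git bisect bad HEAD
    git bisect good <commit-from-July-14>
    # rebuild and retest orte-dvm at each step, then mark the result:
    # git bisect good    or    git bisect bad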
Hi Jeff, all,
Thanks for bringing this to the wider community.
I hope this will eventually address my main concern: the relatively old
versions that get deployed on HPC systems around the world, which I assume
is/was because of the "odd ;-)" numbering.
What I didn't see in the doc, will you co