Just to wrap this up for the user list: this has now been fixed and added to
1.8.2 in the nightly tarball. The problem proved to be an edge case when
partial allocations were combined with coprocessor existence (hit a slightly
different code path).
Kewl - thanks! I'm a Purdue alum, if that helps :-)
On Jun 12, 2014, at 9:04 AM, Dan Dietz wrote:
That shouldn't be a problem. Let me figure out the process and I'll
get back to you.
Dan
On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain wrote:
Arggh - is there any way I can get access to this beast so I can debug this? I
can't figure out what in the world is going on, but it seems to be something
triggered by your specific setup.
On Jun 12, 2014, at 8:48 AM, Dan Dietz wrote:
Unfortunately, the nightly tarball appears to be crashing in a similar
fashion. :-( I used the latest snapshot 1.8.2a1r31981.
Dan
On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote:
It isn't a development tarball - it's the current state of the release branch
and is therefore managed much more strictly than the developer trunk. We are
preparing it now for a release candidate. I have about a dozen CMRs waiting for
final review before moving across to 1.8.2, and then we'll
Ralph,
That's still the development tarball, right? 1.8.2 remains unreleased?
Is the ETA for 1.8.2 the end of this month?
I've poked and prodded, and the 1.8.2 tarball seems to be handling this
situation just fine. I don't have access to a Torque machine, but I did set
everything to follow the same code path, added faux coprocessors, etc. - and it
ran just fine.
Can you try the 1.8.2 tarball and see if it solves
Okay, let me poke around some more. It is clearly tied to the coprocessors, but
I'm not yet sure just why.
One thing you might do is try the nightly 1.8.2 tarball - there have been a
number of fixes, and this may well have been caught there. Worth taking a look.
On Jun 11, 2014, at 6:44 AM,
Sorry - it crashes with both the Torque and rsh launchers. The output from
a gdb backtrace on the core files looks identical.
Dan
On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote:
Afraid I'm a little confused now - are you saying it works fine under Torque,
but segfaults under rsh? Could you please clarify your current situation?
On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote:
It looks like it is still segfaulting with the rsh launcher:
ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh
-np 4 -machinefile ./nodes ./hello
[conte-a084:51113] *** Process received signal ***
[conte-a084:51113] Signal: Segmentation fault (11)
[conte-a084:51113] Signal
Ack - that was my fault. Too early on a Monday morning. This seems to
work perfectly when I correctly submit a job! Thanks!
Dan
On Mon, Jun 9, 2014 at 9:34 AM, Dan Dietz wrote:
Yes, you're exactly right - this system has 2 Phi cards per node. I
believe the "PCI 8086" device in the lstopo output is them. Possibly
related, we've observed a weird bug with Torque and the allocation it
provides when you request the Phis. When requesting them you get a
nodefile with only 1
It's a good idea to provide the default setting for the modifier pe.
Okay, I can take a look to review but a bit busy now, so please give me
a few days.
Regards,
Tetsuya
I'm having no luck poking at this segfault issue. For some strange reason, we
seem to think there are coprocessors on those remote nodes - e.g., a Phi card.
Yet your lstopo output doesn't seem to show it.
Out of curiosity, can you try running this with "-mca plm rsh"? This will
substitute the
Okay, I revised the command line option to be a little more user-friendly. You
can now specify the equivalent of the old --cpus-per-proc as just "--map-by
:pe=N", leaving the mapping policy set as the default. We will default to NUMA
so the cpus will all be in the same NUMA region, if possible,
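As a usage sketch of the two spellings (command forms taken from Ralph's and Dan's messages in this thread; shown for illustration, not run here):

```shell
# Old 1.6-series spelling, as in Dan's original command:
#   mpirun -np 2 -cpus-per-proc 8 -machinefile ./nodes ./hello
# Equivalent new spelling per Ralph's description: leave the mapping
# policy at its default and set only the pe modifier:
#   mpirun -np 2 --map-by :pe=8 -machinefile ./nodes ./hello
```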
Hmmm - Tetsuya is quite correct. Afraid I got distracted by the segfault
(still investigating that one). Our default policy for 2 processes is to map-by
core, and that would indeed fail when cpus-per-proc > 1. However, that seems
like a non-intuitive requirement, so let me see if I can make
Hi Dan,
Please try:
mpirun -np 2 --map-by socket:pe=8 ./hello
or
mpirun -np 2 --map-by slot:pe=8 ./hello
You cannot bind 8 cpus to the object "core", which has
only one cpu. This limitation started with the 1.8 series.
The object "socket" has 8 cores in your case, so you
can do it. And, the
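The constraint Tetsuya describes can be sketched as a toy check (the counts below are hypothetical, matching Dan's nodes as reported later in the thread: 2 sockets, 8 cores per socket, no hardware threads - a pe=N request only fits in a mapping object that holds at least N cpus):

```shell
# Toy check: which mapping objects can hold pe=8 cpus per process?
pe=8
core_cpus=1     # a core exposes a single cpu when threading is off
socket_cpus=8   # a socket holds 8 cores

for obj_cpus in "core $core_cpus" "socket $socket_cpus"; do
  set -- $obj_cpus   # $1 = object name, $2 = cpus it contains
  if [ "$2" -ge "$pe" ]; then
    echo "map-by $1:pe=$pe fits"
  else
    echo "map-by $1:pe=$pe cannot fit ($1 has only $2 cpu(s))"
  fi
done
# -> map-by core:pe=8 cannot fit (core has only 1 cpu(s))
# -> map-by socket:pe=8 fits
```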
Okay, I'll poke into this - thanks!
On Jun 6, 2014, at 12:48 PM, Dan Dietz wrote:
No problem -
These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips.
2 per node, 8 cores each. No threading enabled.
$ lstopo
Machine (64GB)
NUMANode L#0 (P#0 32GB)
Socket L#0 + L3 L#0 (20MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
Sorry to pester with questions, but I'm trying to narrow down the issue.
* What kind of chips are on these machines?
* If they have h/w threads, are they enabled?
* You might have lstopo on one of those machines - could you pass along its
output? Otherwise, you can run a simple "mpirun -n 1
Thanks for the reply. I tried out the --display-allocation option with
several different combinations and have attached the output. I see
this behavior on RHEL6.4, RHEL6.5, and RHEL5.10 clusters.
Here's debugging info on the segfault. Does that help? FWIW this does
not seem to crash on the
On Jun 5, 2014, at 2:13 PM, Dan Dietz wrote:
Hello all,
I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP
codes. In OMPI 1.6.3, I can do:
$ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
I get one rank bound to procs 0-7 and the other bound to 8-15. Great!
But I'm having some difficulties doing this with
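For reference, the 1.6.3 layout Dan reports (rank 0 on cpus 0-7, rank 1 on cpus 8-15) can be sketched with plain shell arithmetic - this only illustrates the expected binding ranges and does not invoke mpirun:

```shell
# Compute the cpu range each rank gets under -np 2 -cpus-per-rank 8:
# rank r is bound to cpus r*8 .. r*8+7.
np=2
cpus_per_rank=8
r=0
while [ "$r" -lt "$np" ]; do
  lo=$((r * cpus_per_rank))
  hi=$((lo + cpus_per_rank - 1))
  echo "rank $r -> cpus $lo-$hi"
  r=$((r + 1))
done
# -> rank 0 -> cpus 0-7
# -> rank 1 -> cpus 8-15
```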