Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread Ben Menadue
Hi,

> On 28 Mar 2017, at 2:00 am, r...@open-mpi.org wrote:
> I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” 
> setting. So why would you expect different results?

Ahh — I didn’t realise it auto-detected this. I recall working on a system in 
the past where I needed to explicitly set this to get that behaviour, but that 
could have been due to some local site configuration.

> On 28 Mar 2017, at 2:51 am, Jeff Squyres (jsquyres)  
> wrote:
> 1. Recall that sched_yield() has effectively become a no-op in newer Linux 
> kernels.  Hence, Open MPI's "yield when idle" may not do much to actually 
> de-schedule a currently-running process.

In some kernel versions there’s a sysctl (/proc/sys/kernel/sched_compat_yield) 
that makes it mimic the older behaviour by putting the yielding task at the end 
of the tree instead of the start. It was introduced in 1799e35 (2.6.23-rc7) and 
removed in ac53db5 (2.6.39-rc1), so if you have a kernel in this range (e.g. 
the stock CentOS or RHEL 6 kernel), it might be available to you.
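
For reference, checking (and flipping) that knob is just a read/write on procfs. A minimal C sketch, with the echo equivalent in the comment (writing needs root):

/* Rough sketch: check for the sched_compat_yield knob and turn it on.
 * Equivalent to: echo 1 > /proc/sys/kernel/sched_compat_yield
 * (reading works as any user; writing needs root). */
#include <stdio.h>

int main(void)
{
    const char *knob = "/proc/sys/kernel/sched_compat_yield";
    FILE *f = fopen(knob, "r");
    int val = -1;

    if (!f) {
        printf("%s: not present on this kernel\n", knob);
        return 1;
    }
    if (fscanf(f, "%d", &val) == 1)
        printf("current value: %d\n", val);
    fclose(f);

    if (val == 0 && (f = fopen(knob, "w")) != NULL) {
        fputs("1\n", f);   /* restore the pre-CFS yield semantics */
        fclose(f);
    }
    return 0;
}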

As an aside, has anyone thought about changing the niceness of the thread 
before yielding (and then back when sched_yield returns)? I just tested that 
sequence and it seems to produce the desired behaviour.
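
The sequence I tried was essentially the following, as a rough sketch rather than the exact code (error handling omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Drop this thread's priority, yield, then restore the original niceness. */
static void nice_yield(void)
{
    pid_t tid = (pid_t) syscall(SYS_gettid);   /* this thread's kernel id */
    int old = getpriority(PRIO_PROCESS, tid);  /* current niceness */

    setpriority(PRIO_PROCESS, tid, 19);        /* lowest priority */
    sched_yield();                             /* give everyone else a go */
    setpriority(PRIO_PROCESS, tid, old);       /* put it back */
}

int main(void)
{
    /* Time this loop (e.g. with clock_gettime) and divide by the
     * iteration count to get a per-iteration figure. */
    for (int i = 0; i < 1000000; i++)
        nice_yield();
    return 0;
}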

On their own, and pinned to a single CPU, my test programs sat at 100% CPU 
utilisation with these timings:
spin: 2.1ns per iteration
yield: 450ns per iteration
nice+yield: 2330ns per iteration

When I ran spin and yield at the same time on the same CPU, each consumed 50% of 
the CPU and produced times about double those above — as expected given the new 
sched_yield behaviour.

On the other hand, when I ran spin and nice+yield together, spin sat at 98.5% 
CPU utilisation, nice+yield at 1.5%, and the timing for spin was identical to 
when it was on its own. 

The only problem I can see is that the timing for nice+yield increased 
dramatically — to 187,600 ns per iteration! Is this too long for a yield?

Cheers,
Ben

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread Jeff Hammond
On Sat, Mar 25, 2017 at 7:15 AM Jeff Squyres (jsquyres) 
wrote:

> On Mar 25, 2017, at 3:04 AM, Ben Menadue  wrote:
> >
> > I’m not sure about this. It was my understanding that HyperThreading is
> implemented as a second set of e.g. registers that share execution units.
> There’s no division of the resources between the hardware threads, but
> rather the execution units switch between the two threads as they stall
> (e.g. cache miss, hazard/dependency, misprediction, …) — kind of like a
> context switch, but much cheaper. As long as there’s nothing being
> scheduled on the other hardware thread, there’s no impact on the
> performance. Moreover, turning HT off in the BIOS doesn’t make more
> resources available to the now-single hardware thread.
>
> Here's an old post on this list where I cited a paper from the Intel
> Technology Journal.  The paper is pretty old at this point (2002, I
> believe?), but I believe it was published near the beginning of the HT
> technology at Intel:
>
>
> https://www.mail-archive.com/hwloc-users@lists.open-mpi.org/msg01135.html
>
> The paper is attached on that post; see, in particular, the section
> "Single-task and multi-task modes".
>
> All this being said, I'm a software wonk with a decent understanding of
> hardware.  But I don't closely follow all the specific details of all
> hardware.  So if Haswell / Broadwell / Skylake processors, for example, are
> substantially different than the HT architecture described in that paper,
> please feel free to correct me!
>

I don't know the details, but HPC centers like NERSC noticed a shift around
Ivy Bridge (Edison) that caused them to enable it.

https://www.nersc.gov/users/computational-systems/edison/performance-and-optimization/hyper-threading/

I know two of the authors of that 2002 paper on HT. Will ask them for
insight next time we cross paths.

Jeff

>
> > This matches our observations on our cluster — there was no
> statistically-significant change in performance between having HT turned
> off in the BIOS and turning the second hardware thread of each core off in
> Linux. We run a mix of architectures — Sandy, Ivy, Haswell, and Broadwell
> (all dual-socket Xeon E5s), and KNL, and this appears to hold true across
> all of these.
>
> These are very complex architectures; the impacts of enabling/disabling HT
> are going to be highly specific to both the platform and application.
>
> > Moreover, having the second hardware thread turned on in Linux but not
> used by batch jobs (by cgroup-ing them to just one hardware thread of each
> core) substantially reduced the performance impact and jitter from the OS —
> by ~10% in at least one synchronisation-heavy application. This is likely
> because the kernel began scheduling OS tasks (Lustre, IB, IPoIB, IRQs,
> Ganglia, PBS, …) on the second, unused hardware thread of each core, which
> were then run when the batch job’s processes stalled the CPU’s execution
> units. This is with both a CentOS 6.x kernel and a custom (tickless) 7.2
> kernel.
>
> Yes, that's a pretty clever use of HT in an HPC environment.  But be aware
> that to do this you are giving up on-core pipeline depth that applications
> could otherwise use.  In your setup, it sounds like this is still a net
> performance win (which is pretty sweet).  But that may not be a universal
> effect.
>
> This is probably a +3 on the existing trend from the prior emails in this
> thread: "As always, experiment to find the best for your hardware and
> jobs."  ;-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI-2.1.0 problem with executing orted when using SGE

2017-03-27 Thread Heinz-Ado Arnolds
Dear rhc,
dear Reuti,

thanks for your valuable help!

Kind regards,

Ado Arnolds

On 22.03.2017 15:55, r...@open-mpi.org wrote:
> Sorry folks - for some reason (probably timing for getting 2.1.0 out), the 
> fix for this got pushed to v2.1.1 - see the PR here: 
> https://github.com/open-mpi/ompi/pull/3163
> 
> 
>> On Mar 22, 2017, at 7:49 AM, Reuti wrote:
>>
>>>
>>> On 22.03.2017 at 15:31, Heinz-Ado Arnolds (arno...@mpa-garching.mpg.de) 
>>> wrote:
>>>
>>> Dear Reuti,
>>>
>>> thanks a lot, you're right! But why did the default behavior change but not 
>>> the value of this parameter:
>>>
>>> 2.1.0: MCA plm rsh: parameter "plm_rsh_agent" (current value: "ssh : rsh", 
>>> data source: default, level: 2 user/detail, type: string, synonyms: 
>>> pls_rsh_agent, orte_rsh_agent)
>>> The command used to launch executables on remote 
>>> nodes (typically either "ssh" or "rsh")
>>>
>>> 1.10.6:  MCA plm: parameter "plm_rsh_agent" (current value: "ssh : rsh", 
>>> data source: default, level: 2 user/detail, type: string, synonyms: 
>>> pls_rsh_agent, orte_rsh_agent)
>>> The command used to launch executables on remote 
>>> nodes (typically either "ssh" or "rsh")
>>>
>>> That means there must have been changes in the code regarding that, perhaps 
>>> for detecting SGE? Do you know of a way to revert to the old style (e.g. 
>>> configure option)? Otherwise all my users have to add this option.
>>
>> There was a discussion in https://github.com/open-mpi/ompi/issues/2947
>>
>> For now you can make use of 
>> https://www.open-mpi.org/faq/?category=tuning#setting-mca-params
>>
>> Essentially to have it set for all users automatically, put:
>>
>> plm_rsh_agent=foo
>>
>> in $prefix/etc/openmpi-mca-params.conf of your central Open MPI 2.1.0 
>> installation.
>>
>> -- Reuti
>>
>>
>>> Thanks again, and have a nice day
>>>
>>> Ado Arnolds
>>>
>>> On 22.03.2017 13:58, Reuti wrote:
 Hi,

> On 22.03.2017 at 10:44, Heinz-Ado Arnolds (arno...@mpa-garching.mpg.de) 
> wrote:
>
> Dear users and developers,
>
> first of all many thanks for all the great work you have done for OpenMPI!
>
> Up to OpenMPI-1.10.6 the mechanism for starting orted was to use SGE/qrsh:
> mpirun -np 8 --map-by ppr:4:node ./myid
> /opt/sge-8.1.8/bin/lx-amd64/qrsh -inherit -nostdin -V <Target Machine> orted --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess 
> "env" -mca orte_ess_jobid "1621884928" -mca orte_ess_vpid 1 -mca 
> orte_ess_num_procs "2" -mca orte_hnp_uri "1621884928.0;tcp://<Master>:41031" -mca plm "rsh" -mca rmaps_base_mapping_policy "ppr:4:node" 
> --tree-spawn
>
> Now with OpenMPI-2.1.0 (and the release candidates) "ssh" is used to 
> start orted:
> mpirun -np 8 --map-by ppr:4:node -mca mca_base_env_list OMP_NUM_THREADS=5 
> ./myid
> /usr/bin/ssh -x <Target Machine>  
> PATH=/afs/./openmpi-2.1.0/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/afs/./openmpi-2.1.0/lib:$LD_LIBRARY_PATH ; export 
> LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/afs/./openmpi-2.1.0/lib:$DYLD_LIBRARY_PATH ; 
> export DYLD_LIBRARY_PATH ;   /afs/./openmpi-2.1.0/bin/orted 
> --hnp-topo-sig 2N:2S:2L3:20L2:20L1:20C:40H:x86_64 -mca ess "env" -mca 
> ess_base_jobid "1626013696" -mca ess_base_vpid 1 -mca ess_base_num_procs 
> "2" -mca orte_hnp_uri "1626013696.0;usock;tcp:// Master>:43019" -mca plm_rsh_args "-x" -mca plm "rsh" -mca 
> rmaps_base_mapping_policy "ppr:4:node" -mca pmix "^s1,s2,cray"
>
> qrsh set the environment properly on the remote side, so that environment 
> variables from job scripts are properly transferred. With the ssh variant 
> the environment is not set properly on the remote side, and it seems that 
> there are handling problems with Kerberos tickets and/or AFS tokens.
>
> Is there any way to revert the 2.1.0 behavior to the 1.10.6 (use 
> SGE/qrsh) one? Are there mca params to set this?
>
> If you need more info, please let me know. (Job submitting machine and 
> target cluster are the same with all tests. SW is residing in AFS 
> directories visible on all machines. Parameter "plm_rsh_disable_qrsh" 
> current value: "false")

 It looks like `mpirun` still needs:

 -mca plm_rsh_agent foo

 to allow SGE to be detected.

 -- Reuti



 ___
 users mailing list
 users@lists.open-mpi.org 
 https://rfd.newmexicoconsortium.org/mailman/listinfo/users

>>> ___
>>> users mailing list
>>> users@lists.open-mpi.org 
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread Jeff Squyres (jsquyres)
On Mar 27, 2017, at 11:00 AM, r...@open-mpi.org wrote:
> 
> I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” 
> setting. So why would you expect different results?

A few additional points to Ralph's question:

1. Recall that sched_yield() has effectively become a no-op in newer Linux 
kernels.  Hence, Open MPI's "yield when idle" may not do much to actually 
de-schedule a currently-running process.

2. As for why there is a difference between version 1.10.1 and 1.10.2 in 
oversubscription behavior, we likely do not know offhand (as all of these 
emails have shown!).  Honestly, we don't really pay much attention to 
oversubscription performance -- our focus tends to be on 
under/exactly-subscribed performance, because that's the normal operating mode 
for MPI applications.  With oversubscription, we have typically just said "all 
bets are off" and left it at that.

3. I don't recall if there was a default affinity policy change between 1.10.1 
and 1.10.2.  Do you know that your taskset command is -- for absolutely sure -- 
overriding what Open MPI is doing?  Or is what Open MPI is doing in terms of 
affinity/binding getting merged with what your taskset call is doing 
somehow...?  (seems unlikely, but I figured I'd ask anyway)
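
One way to sanity-check this -- independent of anything Open MPI or taskset 
reports -- is to have each process print the affinity mask it actually ended up 
with.  A quick, untested sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    cpu_set_t set;
    char list[8192] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The affinity mask this process actually ended up with, after
     * mpirun, taskset, and anything else have had their say. */
    CPU_ZERO(&set);
    sched_getaffinity(0, sizeof(set), &set);

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            snprintf(list + strlen(list), sizeof(list) - strlen(list),
                     "%d ", cpu);

    printf("rank %d allowed CPUs: %s\n", rank, list);

    MPI_Finalize();
    return 0;
}

Running that with your exact mpirun/taskset incantation under both 1.10.1 and 
1.10.2 would show immediately whether the two versions leave the processes with 
different masks.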

One more question -- see below:

>> Thanks for your feedback. As described here 
>> (https://www.open-mpi.org/faq/?category=running#oversubscribing), OpenMPI 
>> detects that I'm oversubscribing and runs in degraded mode (yielding the 
>> processor). Anyway, I repeated the experiments setting explicitly the 
>> yielding flag, and I obtained the same weird results:
>> 
>> $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
>> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
>> $HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
>> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93

Per text later in your mail, "taskset -c 0-27" corresponds to the first 
hardware thread on each core.

Hence, this is effectively binding each process to the set of all "first 
hardware threads" across all cores.

>> Given these results, it seems that spin-waiting is not causing the issue.

I'm guessing that this difference will end up being a symptom of a highly 
complex system, in which spin-waiting plays a part.  I.e., if Open MPI weren't 
spin-waiting, this might not be happening.

-- 
Jeff Squyres
jsquy...@cisco.com

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread Jordi Guitart
I was not expecting different results. I just wanted to respond to Ben's 
suggestion, and demonstrate that the problem (the performance difference 
between v.1.10.1 and v.1.10.2) is not caused by spin-waiting.


On 27/03/2017 17:00, r...@open-mpi.org wrote:
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” 
setting. So why would you expect different results?


On Mar 27, 2017, at 3:52 AM, Jordi Guitart wrote:


Hi Ben,

Thanks for your feedback. As described here 
(https://www.open-mpi.org/faq/?category=running#oversubscribing), 
OpenMPI detects that I'm oversubscribing and runs in degraded mode 
(yielding the processor). Anyway, I repeated the experiments setting 
explicitly the yielding flag, and I obtained the same weird results:


$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 
36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in 
seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 
36 taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in 
seconds = 110.93


Given these results, it seems that spin-waiting is not causing the 
issue. I also agree that this should not be caused by HyperThreading, 
given that 0-27 correspond to single HW threads on distinct cores, as 
shown in the following output returned by the lstopo command:


[lstopo output snipped -- see the full listing in the original message below]

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread r...@open-mpi.org
I’m confused - mpi_yield_when_idle=1 is precisely the “oversubscribed” setting. 
So why would you expect different results?

> On Mar 27, 2017, at 3:52 AM, Jordi Guitart  wrote:
> 
> Hi Ben,
> 
> Thanks for your feedback. As described here 
> (https://www.open-mpi.org/faq/?category=running#oversubscribing 
> ), OpenMPI 
> detects that I'm oversubscribing and runs in degraded mode (yielding the 
> processor). Anyway, I repeated the experiments setting explicitly the 
> yielding flag, and I obtained the same weird results:
> 
> $HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
> $HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
> taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93
> 
> Given these results, it seems that spin-waiting is not causing the issue. I 
> also agree that this should not be caused by HyperThreading, given that 0-27 
> correspond to single HW threads on distinct cores, as shown in the following 
> output returned by the lstopo command:
> 
> [lstopo output snipped -- see the full listing in the original message below]

Re: [OMPI users] Performance degradation of OpenMPI 1.10.2 when oversubscribed?

2017-03-27 Thread Jordi Guitart

Hi Ben,

Thanks for your feedback. As described here 
(https://www.open-mpi.org/faq/?category=running#oversubscribing), 
OpenMPI detects that I'm oversubscribing and runs in degraded mode 
(yielding the processor). Anyway, I repeated the experiments setting 
explicitly the yielding flag, and I obtained the same weird results:


$HOME/openmpi-bin-1.10.1/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 82.79
$HOME/openmpi-bin-1.10.2/bin/mpirun --mca mpi_yield_when_idle 1 -np 36 
taskset -c 0-27 $HOME/NPB/NPB3.3-MPI/bin/bt.C.36 -> Time in seconds = 110.93


Given these results, it seems that spin-waiting is not causing the 
issue. I also agree that this should not be caused by HyperThreading, 
given that 0-27 correspond to single HW threads on distinct cores, as 
shown in the following output returned by the lstopo command:


Machine (128GB total)
  NUMANode L#0 (P#0 64GB)
Package L#0 + L3 L#0 (35MB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#28)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
PU L#2 (P#1)
PU L#3 (P#29)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
PU L#4 (P#2)
PU L#5 (P#30)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
PU L#6 (P#3)
PU L#7 (P#31)
  L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
PU L#8 (P#4)
PU L#9 (P#32)
  L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
PU L#10 (P#5)
PU L#11 (P#33)
  L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
PU L#12 (P#6)
PU L#13 (P#34)
  L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#35)
  L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
PU L#16 (P#8)
PU L#17 (P#36)
  L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
PU L#18 (P#9)
PU L#19 (P#37)
  L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
PU L#20 (P#10)
PU L#21 (P#38)
  L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
PU L#22 (P#11)
PU L#23 (P#39)
  L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
PU L#24 (P#12)
PU L#25 (P#40)
  L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
PU L#26 (P#13)
PU L#27 (P#41)
HostBridge L#0
  PCIBridge
PCI 8086:24f0
  Net L#0 "ib0"
  OpenFabrics L#1 "hfi1_0"
  PCIBridge
PCI 14e4:1665
  Net L#2 "eno1"
PCI 14e4:1665
  Net L#3 "eno2"
  PCIBridge
PCIBridge
  PCIBridge
PCIBridge
  PCI 102b:0534
GPU L#4 "card0"
GPU L#5 "controlD64"
  NUMANode L#1 (P#1 64GB) + Package L#1 + L3 L#1 (35MB)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
  PU L#28 (P#14)
  PU L#29 (P#42)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
  PU L#30 (P#15)
  PU L#31 (P#43)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
  PU L#32 (P#16)
  PU L#33 (P#44)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
  PU L#34 (P#17)
  PU L#35 (P#45)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
  PU L#36 (P#18)
  PU L#37 (P#46)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
  PU L#38 (P#19)
  PU L#39 (P#47)
L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
  PU L#40 (P#20)
  PU L#41 (P#48)
L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
  PU L#42 (P#21)
  PU L#43 (P#49)
L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
  PU L#44 (P#22)
  PU L#45 (P#50)
L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
  PU L#46 (P#23)
  PU L#47 (P#51)
L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
  PU L#48 (P#24)
  PU L#49 (P#52)
L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
  PU L#50 (P#25)
  PU L#51 (P#53)
L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
  PU L#52 (P#26)
  PU L#53 (P#54)
L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
  PU L#54 (P#27)
  PU L#55 (P#55)

On 26/03/2017 9:37, Ben Menadue wrote:
On 26 Mar 2017, at 2:22 am, Jordi Guitart wrote:
However, what is puzzling me is the performance difference between 
OpenMPI 1.10.1 (and prior versions) and OpenMPI 1.10.2 (and later 
versions) in my experiments with oversubscription, i.e. 82 seconds 
vs. 111 seconds.


You’re oversubscribing while letting the OS migrate individual threads 
between cores. That taskset will bind each

Re: [OMPI users] Suppressing Nvidia warnings

2017-03-27 Thread Roland Fehrenbacher
> "SJ" == Sylvain Jeaugey  writes:

Hi Sylvain,

thanks for looking into this further.

SJ> I'm still working to get a clear confirmation of what is
SJ> printing this error message and since when.

SJ> However, running strings, I could only find this string in
SJ> /usr/lib/libnvidia-ml.so, which comes with the CUDA driver, so
SJ> it should not be related to the CUDA runtime version ... but
SJ> again, until I find the code responsible for that, I can't say
SJ> for sure.

libcuda (in my case libcuda.so.367.57) also contains the string, and I'm
pretty sure that's where it's coming from. libcudart (linked into orted
and libmpi.so.x) seems to dlopen libcuda.so.1 (at least "strings libcudart"
suggests that) ...
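
A quick way to test that theory on a GPU-less node would be to dlopen the
driver library by hand and see whether the message appears at load time.
An untested sketch (the file name trycuda.c is just for illustration):

/* Untested sketch: load the CUDA driver library by hand on a node
 * without a GPU and watch for "NVIDIA: no NVIDIA devices found".
 * Build with: gcc -o trycuda trycuda.c -ldl */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("libcuda.so.1", RTLD_NOW);

    if (handle) {
        printf("libcuda.so.1 loaded without complaint\n");
        dlclose(handle);
    } else {
        printf("dlopen failed: %s\n", dlerror());
    }
    return 0;
}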

Best,

Roland

---
http://www.q-leap.com / http://qlustar.com
  --- HPC / Storage / Cloud Linux Cluster OS ---

SJ> I'm sorry it's taking so long -- I'm on it though.

SJ> On 03/24/2017 01:56 PM, Roland Fehrenbacher wrote:
>>> "SJ" == Sylvain Jeaugey  writes:
>> Hi Sylvain,
>>
SJ> Hi Roland, I can't find this message in the Open MPI source
SJ> code. Could it be hwloc ? Some other library you are using ?
>>
>> after a longer detour about the suspicion it might have something
>> to do with nvml support of hwloc, I now found that a change in
>> libcudart between 7.5 and 8.0 is the cause of the messages
>> appearing now. Our earlier 1.8 version was built against CUDA 7.5
>> and didn't show the problem, but a 1.8 version built against CUDA
>> 8 shows the same problem as 2.0.2 built against CUDA 8. Do you
>> think you could ask your team members at Nvidia how this new
>> behaviour in libcudart can be suppressed?
>>
>> BTW: Disabling nvml support for the internal hwloc has the effect
>> that OpenMPI doesn't link in libnvidia-ml.so.x anymore, but has
>> no effect on the messages.
>>
>> Thanks,
>>
>> Roland
>>
SJ> On 03/16/2017 04:23 AM, Roland Fehrenbacher wrote:
>> >> Hi,
>> >>
>> >> OpenMPI 2.0.2 built with cuda support brings up lots of
>> >> warnings like
>> >>
>> >> NVIDIA: no NVIDIA devices found
>> >>
>> >> when running on HW without Nvidia devices. Is there a way to
>> >> suppress these warnings? It would be quite a hassle to
>> >> maintain different OpenMPI builds on clusters with just some
>> >> GPU machines.
>> ___ users mailing
>> list users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users


SJ> ___ users mailing
SJ> list users@lists.open-mpi.org
SJ> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

-- 
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users