Re: [OMPI users] OpenMPI Profiling

2016-01-07 Thread Eugene Loh
I don't know specifically what you want to do, but there is a FAQ 
section on profiling and tracing.

http://www.open-mpi.org/faq/?category=perftools
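
If the goal is just the MPI profiling (PMPI) layer, no extra tooling is strictly required: every MPI_* routine is also exported under a PMPI_* name, so you can interpose your own timing wrappers and link (or LD_PRELOAD) them ahead of the application. A minimal sketch of such a wrapper (illustrative only, not a complete profiler; the file and variable names are just placeholders):

/* wrap_mpi.c -- minimal PMPI interposition sketch.
 * Compile into a library and link it before the MPI library
 * (e.g. mpicc -shared -fPIC wrap_mpi.c -o libwrap.so), or LD_PRELOAD it. */
#include <mpi.h>
#include <stdio.h>

static double send_time = 0.0;
static long long send_calls = 0;

/* Intercept MPI_Send, time it, and forward to the real implementation. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = PMPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    send_time += PMPI_Wtime() - t0;
    send_calls++;
    return rc;
}

/* Report per-rank totals when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "[rank %d] MPI_Send: %lld calls, %.3f s\n",
            rank, send_calls, send_time);
    return PMPI_Finalize();
}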

On 12/31/2015 9:03 AM, anil maurya wrote:
I have compiled HPL using OpenMPI and GotoBLAS. I want to do profiling 
and tracing.
I have compiled openmpi using -enable-profile-mpi. Please let me know 
how to do the profiling.


If I want to use PAPI for hardware-based profiling, do I need to
recompile HPL with PAPI support?


Re: [hwloc-users] [WARNING: A/V UNSCANNABLE] hwloc error after upgrading from Centos 6.5 to Centos 7 on Supermicro with AMD Opteron 6344

2016-01-07 Thread Brice Goglin
Hello

Good to know, thanks.

There are two ways to work around the issue:
* run "lstopo foo.xml" on a node that doesn't exhibit the bug, then export
HWLOC_XMLFILE=foo.xml and HWLOC_THISSYSTEM=1 on the buggy nodes (that's
what you call a "map" below); this works with very old hwloc releases
(a small C sketch of this approach follows below)
* export HWLOC_COMPONENTS=x86 (only works with hwloc >= 1.11.2)
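
For what it's worth, if the application talks to hwloc directly rather than through Open MPI, the first workaround can also be expressed in code instead of the environment. A rough sketch of the idea (assuming an hwloc 1.x installation and a foo.xml previously dumped with "lstopo foo.xml" on a good node; build with something like "cc load_xml_topology.c -lhwloc"):

/* load_xml_topology.c -- sketch: load a known-good XML dump instead of
 * relying on the buggy OS discovery; this is the programmatic equivalent
 * of HWLOC_XMLFILE=foo.xml plus HWLOC_THISSYSTEM=1. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;

    hwloc_topology_init(&topology);
    /* Use the XML file dumped with "lstopo foo.xml" on a non-buggy node. */
    if (hwloc_topology_set_xml(topology, "foo.xml") < 0) {
        fprintf(stderr, "failed to set XML topology source\n");
        return 1;
    }
    /* Declare that the XML really describes this machine, so binding works. */
    hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IS_THISSYSTEM);
    hwloc_topology_load(topology);

    printf("%d cores found\n",
           hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE));

    hwloc_topology_destroy(topology);
    return 0;
}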

Brice




Le 07/01/2016 16:20, David Winslow a écrit :
> Brice,
>
> Thanks for the information! It’s good to know it wasn’t a flaw in the
> upgrade. This bug must have been introduced in kernel 3.x. I ran lstopo on one
> of our servers that still runs CentOS 6.5, and it correctly reports one L3 cache
> per 6 cores, as shown below.
>
> We have 75 servers with the exact same specifications. I had only upgraded
> two when I came across this problem during testing. Since I have a correct
> map on the non-upgraded servers, can I use that map on the upgraded servers
> somehow? Essentially hard-code it?
>
> --- FROM Centos 6.5 ---
>   Socket L#0 (P#0 total=134215604KB CPUModel="AMD Opteron(tm) Processor 6344" CPUType=x86_64)
> NUMANode L#0 (P#0 local=67106740KB total=67106740KB)
>   L3Cache L#0 (size=6144KB linesize=64 ways=64)
> L2Cache L#0 (size=2048KB linesize=64 ways=16)
>   L1iCache L#0 (size=64KB linesize=64 ways=2)
> L1dCache L#0 (size=16KB linesize=64 ways=4)
>   Core L#0 (P#0)
> PU L#0 (P#0)
> L1dCache L#1 (size=16KB linesize=64 ways=4)
>   Core L#1 (P#1)
> PU L#1 (P#1)
> L2Cache L#1 (size=2048KB linesize=64 ways=16)
>   L1iCache L#1 (size=64KB linesize=64 ways=2)
> L1dCache L#2 (size=16KB linesize=64 ways=4)
>   Core L#2 (P#2)
> PU L#2 (P#2)
> L1dCache L#3 (size=16KB linesize=64 ways=4)
>   Core L#3 (P#3)
> PU L#3 (P#3)
> L2Cache L#2 (size=2048KB linesize=64 ways=16)
>   L1iCache L#2 (size=64KB linesize=64 ways=2)
> L1dCache L#4 (size=16KB linesize=64 ways=4)
>   Core L#4 (P#4)
> PU L#4 (P#4)
> L1dCache L#5 (size=16KB linesize=64 ways=4)
>   Core L#5 (P#5)
> PU L#5 (P#5)
>
>> On Jan 7, 2016, at 1:22 AM, Brice Goglin  wrote:
>>
>> Hello
>>
>> This is a kernel bug for 12-core AMD Bulldozer/Piledriver (62xx/63xx) 
>> processors. hwloc is just complaining about buggy L3 information. lstopo 
>> should report one L3 above each set of 6 cores below each NUMA node. Instead 
>> you get strange L3s with 2, 4 or 6 cores.
>>
>> If you're not binding tasks based on L3 locality and if your applications do 
>> not care about L3, you can pass HWLOC_HIDE_ERRORS=1 in the environment to 
>> hide the message.
>>
>> AMD was working on a kernel patch, but it doesn't seem to be in the upstream
>> Linux kernel yet. In hwloc v1.11.2, you can work around the problem by passing
>> HWLOC_COMPONENTS=x86 in the environment.
>>
>> I am not sure why CentOS 6.5 didn't complain. That 2.6.32 kernel should be 
>> buggy too, and old hwloc releases already complained about such bugs.
>>
>> thanks
>> Brice
>>
>>
>>
>>
>>
>>
>> Le 07/01/2016 04:10, David Winslow a écrit :
>>> I upgraded our servers from CentOS 6.5 to CentOS 7.2. Since then, when I run
>>> mpirun I get the following error, but the software continues to run and it
>>> appears to work fine.
>>>
>>> * hwloc 1.11.0rc3-git has encountered what looks like an error from the 
>>> operating system.
>>> *
>>> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f) 
>>> without inclusion!
>>> * Error occurred in topology.c line 983
>>> *
>>> * The following FAQ entry in the hwloc documentation may help:
>>> *   What should I do when hwloc reports "operating system" warnings?
>>> * Otherwise please report this error message to the hwloc user's mailing 
>>> list,
>>> * along with the output+tarball generated by the hwloc-gather-topology 
>>> script.
>>>
>>> I can replicate the error by simply running hwloc-info.
>>>
>>> The version of hwloc used with mpirun is 1.9. The version installed on the
>>> server, which is what I ran directly, is 1.7, as shipped with CentOS 7. They both give the error,
>>> with the minor differences shown below.
>>>
>>> With hwloc 1.7
>>> * object (L3 cpuset 0x03f0) intersection without inclusion!
>>> * Error occurred in topology.c line 753
>>>
>>> With hwloc 1.9
>>> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f) 
>>> without inclusion!
>>> * Error occurred in topology.c line 983
>>>
>>> The current kernel is 3.10.0-327.el7.x86_64. I’ve tried updating the kernel 
>>> to a minor release update and even tried to install kernel v4.4.3. None of 
>>> the kernels worked. Again, hwloc works fine in Centos 6.5 with kernel 
>>> 2.6.32-431.29.2.el6.x86_64.
>>>
>>> I’ve attached the files generated by hwloc-gather-topology.sh.  I compared 

Re: [OMPI users] Singleton process spawns additional thread

2016-01-07 Thread Sasso, John (GE Power, Non-GE)
Stefan,  I don't know if this is related to your issue, but FYI...


> Those are async progress threads - they block unless something requires doing
>
>
>> On Apr 15, 2015, at 8:36 AM, Sasso, John (GE Power & Water, Non-GE) 
>>  wrote:
>> 
>> I stumbled upon something while using 'ps -eFL' to view threads of 
>> processes, and Google searches have failed to answer my question.  This 
>> question holds for OpenMPI 1.6.x and even OpenMPI 1.4.x.
 >> 
>> For a program which is pure MPI (built and run using OpenMPI) and does not 
>> implement Pthreads or OpenMP, why is it that each MPI task appears as having 
>> 3 threads:
 >>
>> UID     PID   PPID   LWP  C NLWP     SZ    RSS PSR STIME TTY      TIME CMD
>> sasso 20512  20493 20512 99    3 187849 582420  14 11:01 ?    00:26:37 /home/sasso/mpi_example.exe
>> sasso 20512  20493 20588  0    3 187849 582420  11 11:01 ?    00:00:00 /home/sasso/mpi_example.exe
>> sasso 20512  20493 20599  0    3 187849 582420  12 11:01 ?    00:00:00 /home/sasso/mpi_example.exe
 >>
>> whereas if I compile and run a non-MPI program, 'ps -eFL' shows it running 
>> as a single thread?
>>
>> Granted, the CPU utilization (C) for 2 of the 3 threads is zero, but the
>> threads are bound to different processors (11, 12, 14). I am curious as to
>> why this is, not complaining that there is a problem. Thanks!
 >> 
>> --john



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Au Eelis
Sent: Thursday, January 07, 2016 7:10 AM
To: us...@open-mpi.org
Subject: [OMPI users] Singleton process spawns additional thread

Hi!

I have a weird problem with executing a singleton OpenMPI program, where an 
additional thread causes full load, while the master thread performs the actual 
calculations.

In contrast, executing "mpirun -np 1 [executable]" performs the same 
calculation at the same speed but the additional thread is idling.

In my understanding, both calculations should behave in the same way (i.e., one 
working thread) for a program which is simply moving some data around (mainly 
some MPI_BCAST and MPI_GATHER commands).

I could observe this behaviour in OpenMPI 1.10.1 with ifort 16.0.1 and gfortran 
5.3.0. I could create a minimal working example, which is appended to this mail.

Am I missing something?

Best regards,
Stefan

-

MWE: Compile this with "mpifort main.f90". When executing with "./a.out", there
is a thread wasting cycles while the master thread waits for input. When
executing with "mpirun -np 1 ./a.out", this thread is idling.

program main
 use mpi_f08
 implicit none

 integer :: ierror,rank

 call MPI_Init(ierror)
 call MPI_Comm_Rank(MPI_Comm_World,rank,ierror)

 ! let master thread wait on [RETURN]-key
 if (rank == 0) then
 read(*,*)
 end if

 write(*,*) rank

 call mpi_barrier(mpi_comm_world, ierror)
end program


[OMPI users] Singleton process spawns additional thread

2016-01-07 Thread Au Eelis

Hi!

I have a weird problem with executing a singleton OpenMPI program, where 
an additional thread causes full load, while the master thread performs 
the actual calculations.


In contrast, executing "mpirun -np 1 [executable]" performs the same 
calculation at the same speed but the additional thread is idling.


In my understanding, both calculations should behave in the same way 
(i.e., one working thread) for a program which is simply moving some 
data around (mainly some MPI_BCAST and MPI_GATHER commands).


I could observe this behaviour in OpenMPI 1.10.1 with ifort 16.0.1 and 
gfortran 5.3.0. I could create a minimal working example, which is 
appended to this mail.


Am I missing something?

Best regards,
Stefan

-

MWE: Compile this with "mpifort main.f90". When executing with
"./a.out", there is a thread wasting cycles while the master thread waits
for input. When executing with "mpirun -np 1 ./a.out", this thread is idling.


program main
  use mpi_f08
  implicit none

  integer :: ierror, rank

  call MPI_Init(ierror)
  call MPI_Comm_Rank(MPI_Comm_World, rank, ierror)

  ! let master thread wait on [RETURN]-key
  if (rank == 0) then
    read(*,*)
  end if

  write(*,*) rank

  call mpi_barrier(mpi_comm_world, ierror)
end program


[OMPI users] warnings building openmpi-v2.x-dev-950-g995993b

2016-01-07 Thread Siegmar Gross

Hi,

perhaps somebody is interested in some warnings that I got building
openmpi-v2.x-dev-950-g995993b on my machines (Solaris 10 Sparc,
Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-5.1.0
and Sun C 5.13. I got the same warnings on Solaris sparc and x86_64.

I've produced the attached files with the following command

grep -i warning \
  openmpi-v2.x-dev-950-g995993b-Linux.x86_64.64_cc/log.make.* | \
  grep -v atomic.h | grep -v relinking | grep -v "not reached" | \
  grep -v "failed to detect system" | grep -v "Optimizer level" | \
  grep -v "seems to be" | grep -v "attempted multiple" | \
  sort | uniq > /tmp/warning_Linux_x86_64_cc.txt


Kind regards

Siegmar
"../../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/hwloc/hwloc/hwloc/src/topology-custom.c",
 line 88: warning: initializer will be sign-extended: -1
"../../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/hwloc/hwloc/hwloc/src/topology-linux.c",
 line 2694: warning: initializer will be sign-extended: -1
"../../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/hwloc/hwloc/hwloc/src/topology-synthetic.c",
 line 851: warning: initializer will be sign-extended: -1
"../../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/hwloc/hwloc/hwloc/src/topology-xml.c",
 line 1667: warning: initializer will be sign-extended: -1
"../../../../../../openmpi-v2.x-dev-950-g995993b/ompi/mca/io/romio314/romio/adio/common/utils.c",
 line 97: warning: argument #3 is incompatible with prototype:
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/internal_functions.c",
 line 117: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_bfrop_get_data_type
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/open_close.c",
 line 56: warning: initialization type mismatch
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/open_close.c",
 line 57: warning: initialization type mismatch
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/open_close.c",
 line 58: warning: initialization type mismatch
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/open_close.c",
 line 59: warning: initialization type mismatch
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/open_close.c",
 line 60: warning: initialization type mismatch
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/buffer_ops/pack.c",
 line 63: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_bfrop_pack_buffer
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 299: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_remove_value_uint64
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 365: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_get_value_ptr
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 392: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_set_value_ptr
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 433: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_remove_value_ptr
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 466: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_get_first_key_uint32
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 493: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_get_next_key_uint32
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 538: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_get_first_key_uint64
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 565: warning: identifier redeclared: 
opal_pmix_pmix112_pmix_hash_table_get_next_key_uint64
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.c",
 line 73: warning: identifier redeclared: opal_pmix_pmix112_pmix_hash_table_init
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/class/pmix_hash_table.h",
 line 323: warning: implicit function declaration: __builtin_clz
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
 line 223: warning: identifier redeclared: OPAL_PMIX_PMIX112_PMIx_Init
"../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
 line 465: warning: identifier redeclared: OPAL_PMIX_PMIX112_PMIx_Abort

Re: [hwloc-users] error from the operating system - Solaris 11.3 - SOLVED

2016-01-07 Thread Brice Goglin
Thanks, I copied useful information from this thread and some links to
https://github.com/open-mpi/hwloc/issues/143

However, not sure I'll have time to look at this in the near future :/

Brice




Le 07/01/2016 09:03, Matthias Reich a écrit :
> Hello,
>
> To check whether kstat is able to report the psrset definitions, I
> defined a set consisting of 2 CPUs (psrset -c 1-2), CPU1 and CPU2. The
> remaining CPUs (CPU0, CPU3..CPU23) were left outside the set.
>
> On the machine, we can execute the "kstat" command and receive (among
> 1000s of lines) the following info:
>
> module: unix      instance: 0
> name:   pset      class:    misc
>     avenrun_15min   70
>     avenrun_1min    53
>     avenrun_5min    47
>     crtime          0
>     ncpus           22
>     runnable        1146912
>     snaptime        80083.491239257
>     updates         790784
>     waiting         0
>
>
> module: unix      instance: 1
> name:   pset      class:    misc
>     avenrun_15min   0
>     avenrun_1min    0
>     avenrun_5min    0
>     crtime          79983.070416351
>     ncpus           2
>     runnable        0
>     snaptime        80083.595839172
>     updates         1005
>     waiting         0
>
> which is not very comprehensive and doesn't even tell which CPUs are
> part of the particular set, but it could probably be used to at least warn
> about the existence of a CPU set and prevent the (not very intuitive)
> error message and the consequent abort.
>
> However, doing the same on the machine without the pset defined, we get:
>
> module: unix      instance: 0
> name:   pset      class:    misc
>     avenrun_15min   50
>     avenrun_1min    38
>     avenrun_5min    41
>     crtime          0
>     ncpus           24
>     runnable        1163866
>     snaptime        81105.346688035
>     updates         801003
>     waiting         0
>
> so the (only) processor set encompasses all 24 (virtual) cores. This
> may be the key to check for.
>
> The C-API to check for processor set(s) is available through the
> libpool library, which allows more resource pool configuration than
> just processor sets, but can probably act as an abstraction layer for
> different Solaris flavors...
>
> Matthias
>
>>  Hello
>> So processor sets are not taken into account when Solaris reports
>> topology information in kstat etc.
>> Do you know if hwloc can query processor sets from the C interface?
>> If so, we could apply the processor set mask to hwloc object cpusets
>> during discovery to avoid your error.
>> Brice
>>
>> Le 05/01/2016 10:18, Karl Behler a écrit :
>>> There was a processor set defined (command psrset) on this machine.
>>> Having removed the psrset hwloc-info produces a result without error
>>> messages:
>>>
>>> hwloc-info -v
>>> depth 0:        1 Machine (type #1)
>>>  depth 1:       2 NUMANode (type #2)
>>>   depth 2:      2 Package (type #3)
>>>    depth 3:     12 Core (type #5)
>>>     depth 4:    24 PU (type #6)
>>>
>>> It seems the concept of defining a psrset is in contradiction to what
>>> hwloc and/or openmpi expects/allows.
>>>
>>>
>>> On 04.01.16 18:16, Karl Behler wrote:
 We used to run our MPI application with the SUNWhpc implementation
 from Sun/Oracle. (This was derived from openmpi 1.5.)
 However, the Oracle HPC implementation fails for the new Solaris 11.3
 platform.
 So we downloaded and made openmpi 1.10.1 on this platform from
 scratch.

 All seems fine and a simple test application runs fine.
 However, with the real application we are running into a hwloc
 problem.

 So we also downloaded and made the hwloc package 1.11.2.

 Now examining hardware locality we get the following error:

 hwloc-info -v --whole-io
 


 * hwloc 1.11.2 has encountered what looks like an error from the
 operating system.
 *
 * Core (P#0 cpuset 0x1001) intersects with NUMANode (P#1 cpuset
 0x0003c001) without inclusion!
 * Error occurred in topology.c line 1046
 *
 * The following FAQ entry in the hwloc documentation may help:
 *   What should I do when hwloc reports "operating system" warnings?
 * Otherwise please report this error message to the hwloc user's
 mailing list,

Re: [hwloc-users] error from the operating system - Solaris 11.3 - SOLVED

2016-01-07 Thread Matthias Reich

Hello,

To check whether kstat is able to report the psrset definitions, I
defined a set consisting of 2 CPUs (psrset -c 1-2), CPU1 and CPU2. The
remaining CPUs (CPU0, CPU3..CPU23) were left outside the set.

On the machine, we can execute the "kstat" command and receive (among
1000s of lines) the following info:

module: unix      instance: 0
name:   pset      class:    misc
    avenrun_15min   70
    avenrun_1min    53
    avenrun_5min    47
    crtime          0
    ncpus           22
    runnable        1146912
    snaptime        80083.491239257
    updates         790784
    waiting         0


module: unix      instance: 1
name:   pset      class:    misc
    avenrun_15min   0
    avenrun_1min    0
    avenrun_5min    0
    crtime          79983.070416351
    ncpus           2
    runnable        0
    snaptime        80083.595839172
    updates         1005
    waiting         0

which is not very comprehensive and doesn't even tell which CPUs are
part of the particular set, but it could probably be used to at least warn
about the existence of a CPU set and prevent the (not very intuitive)
error message and the consequent abort.

However, doing the same on the machine without the pset defined, we get:

module: unix      instance: 0
name:   pset      class:    misc
    avenrun_15min   50
    avenrun_1min    38
    avenrun_5min    41
    crtime          0
    ncpus           24
    runnable        1163866
    snaptime        81105.346688035
    updates         801003
    waiting         0

so the (only) processor set encompasses all 24 (virtual) cores. This may 
be the key to check for.


The C-API to check for processor set(s) is available through the libpool
library, which allows more resource pool configuration than just
processor sets, but can probably act as an abstraction layer for
different Solaris flavors...
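
As a lower-level alternative to libpool, the basic processor-set information is also reachable through the pset_list(2)/pset_info(2) system calls; a rough, untested sketch of such a check (Solaris-specific, names and layout are only illustrative):

/* list_psets.c -- sketch: enumerate Solaris processor sets and their sizes,
 * e.g. to warn that a pset restricts the CPUs a process may use. */
#include <sys/pset.h>
#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    uint_t npsets = 0;

    /* A NULL list pointer just asks for the number of user-defined psets. */
    if (pset_list(NULL, &npsets) != 0) {
        perror("pset_list");
        return 1;
    }
    printf("%u user-defined processor set(s)\n", npsets);
    if (npsets == 0)
        return 0;

    psetid_t *psets = malloc(npsets * sizeof(*psets));
    if (psets == NULL || pset_list(psets, &npsets) != 0)
        return 1;

    for (uint_t i = 0; i < npsets; i++) {
        int type;
        uint_t ncpus = 0;
        /* With a NULL cpulist, pset_info() only reports the type and CPU count. */
        if (pset_info(psets[i], &type, &ncpus, NULL) == 0)
            printf("pset %d: %u CPUs\n", (int)psets[i], ncpus);
    }
    free(psets);
    return 0;
}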

Matthias


 Hello
So processor sets are not taken into account when Solaris reports
topology information in kstat etc.
Do you know if hwloc can query processor sets from the C interface?
If so, we could apply the processor set mask to hwloc object cpusets
during discovery to avoid your error.
Brice

Le 05/01/2016 10:18, Karl Behler a écrit :

There was a processor set defined (command psrset) on this machine.
Having removed the psrset hwloc-info produces a result without error
messages:

hwloc-info -v
depth 0:        1 Machine (type #1)
 depth 1:       2 NUMANode (type #2)
  depth 2:      2 Package (type #3)
   depth 3:     12 Core (type #5)
    depth 4:    24 PU (type #6)

It seems the concept of defining a psrset is in contradiction to what
hwloc and/or openmpi expects/allows.


On 04.01.16 18:16, Karl Behler wrote:

We used to run our MPI application with the SUNWhpc implementation
from Sun/Oracle. (This was derived from openmpi 1.5.)
However, the Oracle HPC implementation fails for the new Solaris 11.3
platform.
So we downloaded and made openmpi 1.10.1 on this platform from scratch.

All seems fine and a simple test application runs fine.
However, with the real application we are running into a hwloc problem.

So we also downloaded and made the hwloc package 1.11.2.

Now examining hardware locality we get the following error:

hwloc-info -v --whole-io


* hwloc 1.11.2 has encountered what looks like an error from the
operating system.
*
* Core (P#0 cpuset 0x1001) intersects with NUMANode (P#1 cpuset
0x0003c001) without inclusion!
* Error occurred in topology.c line 1046
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's
mailing list,
* along with any relevant topology information from your platform.


depth 0:        1 Machine (type #1)
 depth 1:       2 Package (type #3)
  depth 2:      2 NUMANode (type #2)
   depth 3:     1 Core (type #5)
    depth 4:    24 PU (type #6)

Since I could not find the mentioned FAQ topic I'm asking the list
for advice.

Our system is an Oracle/ Solaris 11.3 (latest patch level) on an
Intel hardware platform from Sun.

output of uname -a -> SunOS sxaug28 5.11 11.3 i86pc i386 i86pc
output of psrinfo -v ->

Re: [hwloc-users] [WARNING: A/V UNSCANNABLE] hwloc error after upgrading from Centos 6.5 to Centos 7 on Supermicro with AMD Opteron 6344

2016-01-07 Thread Brice Goglin
Hello

This is a kernel bug for 12-core AMD Bulldozer/Piledriver (62xx/63xx)
processors. hwloc is just complaining about buggy L3 information. lstopo
should report one L3 above each set of 6 cores below each NUMA node.
Instead you get strange L3s with 2, 4 or 6 cores.

If you're not binding tasks based on L3 locality and if your
applications do not care about L3, you can pass HWLOC_HIDE_ERRORS=1 in
the environment to hide the message.

AMD was working on a kernel patch, but it doesn't seem to be in the
upstream Linux kernel yet. In hwloc v1.11.2, you can work around the problem by
passing HWLOC_COMPONENTS=x86 in the environment.

I am not sure why CentOS 6.5 didn't complain. That 2.6.32 kernel should
be buggy too, and old hwloc releases already complained about such bugs.

thanks
Brice






Le 07/01/2016 04:10, David Winslow a écrit :
> I upgraded our servers from CentOS 6.5 to CentOS 7.2. Since then, when I run
> mpirun I get the following error, but the software continues to run and it
> appears to work fine.
>
> * hwloc 1.11.0rc3-git has encountered what looks like an error from the 
> operating system.
> *
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f) 
> without inclusion!
> * Error occurred in topology.c line 983
> *
> * The following FAQ entry in the hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's mailing list,
> * along with the output+tarball generated by the hwloc-gather-topology script.
>
> I can replicate the error by simply running hwloc-info.
>
> The version of hwloc used with mpirun is 1.9. The version installed on the
> server, which is what I ran directly, is 1.7, as shipped with CentOS 7. They both give the error,
> with the minor differences shown below.
>
> With hwloc 1.7
> * object (L3 cpuset 0x03f0) intersection without inclusion!
> * Error occurred in topology.c line 753
>
> With hwloc 1.9
> * L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f) 
> without inclusion!
> * Error occurred in topology.c line 983
>
> The current kernel is 3.10.0-327.el7.x86_64. I’ve tried updating the kernel 
> to a minor release update and even tried to install kernel v4.4.3. None of 
> the kernels worked. Again, hwloc works fine in Centos 6.5 with kernel 
> 2.6.32-431.29.2.el6.x86_64.
>
> I’ve attached the files generated by hwloc-gather-topology.sh.  I compared 
> what this script says is the expected output to the actual output and, from 
> what I can tell, they look the same. Maybe I’m missing something after 
> staring all day at the information.
>
> I did a clean install of the OS to perform the upgrade from 6.5.
>
> I’ve attached the results of the hwloc-gather-topology.sh script. Any help 
> will be greatly appreciated.
>
>
>
>
>
>



Re: [OMPI users] runtime errors with openmpi-v2.x-dev-950-g995993b

2016-01-07 Thread Gilles Gouaillardet

Siegmar,

thanks for the report

About the issue with the Sun compiler and helloworld: the root cause is
incorrect packaging, and a fix is available at
https://github.com/open-mpi/ompi/pull/1285


(note the issue only occurs when building from a tarball)

I will have a look at the other issues.

Cheers,

Gilles

On 1/6/2016 9:57 PM, Siegmar Gross wrote:

Hi,

I've successfully built openmpi-v2.x-dev-950-g995993b on my machines
(Solaris 10 Sparc, Solaris 10 x86_64, and openSUSE Linux 12.1
x86_64) with gcc-5.1.0 and Sun C 5.13. Unfortunately I get errors
running some small test programs. All programs work as expected
using my gcc or cc version of openmpi-v1.10.1-138-g0e3b111. I get
similar errors for the master openmpi-dev-3329-ge4bdad0.
I used the following commands to build the package for gcc.


mkdir openmpi-v2.x-dev-950-g995993b-${SYSTEM_ENV}.${MACHINE_ENV}.64_gcc
cd openmpi-v2.x-dev-950-g995993b-${SYSTEM_ENV}.${MACHINE_ENV}.64_gcc

../openmpi-v2.x-dev-950-g995993b/configure \
  --prefix=/usr/local/openmpi-2.0.0_64_gcc \
  --libdir=/usr/local/openmpi-2.0.0_64_gcc/lib64 \
  --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
  --with-jdk-headers=/usr/local/jdk1.8.0/include \
  JAVA_HOME=/usr/local/jdk1.8.0 \
  LDFLAGS="-m64" CC="gcc" CXX="g++" FC="gfortran" \
  CFLAGS="-m64" CXXFLAGS="-m64" FCFLAGS="-m64" \
  CPP="cpp" CXXCPP="cpp" \
  --enable-mpi-cxx \
  --enable-cxx-exceptions \
  --enable-mpi-java \
  --enable-heterogeneous \
  --enable-mpi-thread-multiple \
  --with-hwloc=internal \
  --without-verbs \
  --with-wrapper-cflags="-std=c11 -m64" \
  --with-wrapper-cxxflags="-m64" \
  --with-wrapper-fcflags="-m64" \
  --enable-debug \
  |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_gcc

make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
rm -r /usr/local/openmpi-2.0.0_64_gcc.old
mv /usr/local/openmpi-2.0.0_64_gcc /usr/local/openmpi-2.0.0_64_gcc.old
make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_gcc
make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_gcc



A simple "hello world" or "matrix multiplication" program works with
my gcc version but breaks with my cc version as you can see at the
bottom. Spawning processes breaks with both versions.
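
For readers who don't have Siegmar's test sources: the failing programs exercise the dynamic-process interface, i.e. roughly the pattern sketched below. This is an illustration only, not the actual spawn_multiple_master/spawn_intra_comm code, and "worker" is just a placeholder executable name.

/* spawn_sketch.c -- minimal illustration of the spawn pattern being tested. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int rank, nremote, errcodes[3];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        printf("Parent process %d: I create 3 slave processes.\n", rank);

    /* Collective over MPI_COMM_WORLD; rank 0 is the root of the spawn. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);

    MPI_Comm_remote_size(children, &nremote);
    if (rank == 0)
        printf("Parent sees %d spawned processes.\n", nremote);

    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}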


tyr spawn 128 mpiexec -np 1 --hetero-nodes --host 
tyr,sunpc1,linpc1,tyr spawn_multiple_master


Parent process 0 running on tyr.informatik.hs-fulda.de
  I create 3 slave processes.

[tyr.informatik.hs-fulda.de:22370] PMIX ERROR: UNPACK-PAST-END in file 
../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/server/pmix_server_ops.c 
at line 829
[tyr.informatik.hs-fulda.de:22370] PMIX ERROR: UNPACK-PAST-END in file 
../../../../../../openmpi-v2.x-dev-950-g995993b/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c 
at line 2176

[tyr:22378] *** An error occurred in MPI_Comm_spawn_multiple
[tyr:22378] *** reported by process [4047765505,0]
[tyr:22378] *** on communicator MPI_COMM_WORLD
[tyr:22378] *** MPI_ERR_SPAWN: could not spawn processes
[tyr:22378] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[tyr:22378] ***and potentially your MPI job)
tyr spawn 129





tyr spawn 151 mpiexec -np 1 --hetero-nodes --host sunpc1,linpc1,linpc1 
spawn_intra_comm

Parent process 0: I create 2 slave processes

Parent process 0 running on sunpc1
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0

Child process 1 running on linpc1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2

Child process 0 running on linpc1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:1
-- 

mpiexec noticed that process rank 0 with PID 16203 on node sunpc1 
exited on signal 13 (Broken Pipe).
-- 


tyr spawn 152



I don't see a broken pipe if I change the sequence of sunpc1 and
linpc1.

tyr spawn 146 mpiexec -np 1 --hetero-nodes --host linpc1,sunpc1,sunpc1 
spawn_intra_comm

Parent process 0: I create 2 slave processes

Child process 1 running on sunpc1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:2

Child process 0 running on sunpc1
MPI_COMM_WORLD ntasks:  2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:1

Parent process 0 running on linpc1
MPI_COMM_WORLD ntasks:  1
COMM_CHILD_PROCESSES ntasks_local:  1
COMM_CHILD_PROCESSES ntasks_remote: 2
COMM_ALL_PROCESSES ntasks:  3
mytid in COMM_ALL_PROCESSES:0



The process doesn't return and uses about 50% CPU time (1 of 2
processors) if I combine a x86_64