Re: [hwloc-users] Travis CI unit tests failing with HW "operating system" error

2018-09-14 Thread Jeff Hammond
With HWLOC_COMPONENTS=no_os, MPICH is now working fine but all tests now
fail with Open-MPI (see below).  I know how to resolve this, but am noting
it for the benefit of others.

--
All nodes which are allocated for this job are already filled.
--

Jeff
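
For anyone applying the workaround quoted below from a test driver rather than a shell, here is a minimal sketch. It copies the current environment, sets HWLOC_COMPONENTS, and launches a command. The helper names `hwloc_env` and `run_with_hwloc_env` are made up for illustration, and the default value `no_os,stop` is simply the one suggested in the quoted message; this thread also uses plain `no_os`.

```python
import os
import subprocess
import sys


def hwloc_env(components="no_os,stop", base=None):
    """Return a copy of the environment with HWLOC_COMPONENTS set.

    `base` defaults to the current process environment; passing a dict
    makes the helper easy to test in isolation.
    """
    env = dict(os.environ if base is None else base)
    env["HWLOC_COMPONENTS"] = components
    return env


def run_with_hwloc_env(cmd, components="no_os,stop"):
    """Run cmd (a list of arguments) with the hwloc workaround applied."""
    return subprocess.run(cmd, env=hwloc_env(components), check=False)


if __name__ == "__main__":
    # The child just prints the variable it sees; in a real CI job this
    # would be the mpiexec command line for the test suite (assumption).
    child = [sys.executable, "-c",
             "import os; print(os.environ['HWLOC_COMPONENTS'])"]
    run_with_hwloc_env(child)
```

Setting the variable only for the child process keeps the workaround scoped to the tests that need it.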

On Thu, Sep 13, 2018 at 10:36 PM, Brice Goglin 
wrote:

> If lstopo fails there, run "hwloc-gather-topology foo" and send foo.tar.bz2
>
> As a workaround for ARMCI, you may try setting HWLOC_COMPONENTS=no_os,stop
> in the environment so that hwloc behaves as if the operating system had no
> topology support.
>
> Brice
>
>
>
> On 14/09/2018 at 06:11, Jeff Hammond wrote:
>
> All of the job failures have this warning so I am inclined to think they
> are related.  I don't know what I should expect from lstopo inside of
> AWS, but I guess I'll try it.
>
> I was using the hwloc shipped with mpich-3.3b1.  Talk to the MPICH
> team if you want them to upgrade :-)
>
> Jeff
>
> On Thu, Sep 13, 2018 at 8:42 AM, Brice Goglin 
> wrote:
>
>> This is actually just a warning. Usually it causes the topology to be
>> wrong (like a missing object), but it shouldn't prevent the program from
>> working. Are you sure your programs are failing because of hwloc? Do you
>> have a way to run lstopo on that node?
>>
>> By the way, you shouldn't use hwloc 2.0.0rc2, at least because it's old,
>> it has a broken ABI, and it's an RC :)
>>
>> Brice
>>
>>
>>
>> On 13/09/2018 at 16:12, Jeff Hammond wrote:
>>
>> I am running ARMCI-MPI over MPICH in a Travis CI Linux instance and
>> topology is causing it to fail.  I do not care about topology in a
>> virtualized environment.  How do I fix this?
>>
>> 
>> 
>> * hwloc 2.0.0rc2-git has encountered what looks like an error from the operating system.
>> *
>> * Group0 (cpuset 0x,0x) intersects with L3 (cpuset 0x1000,0x0212) without inclusion!
>> * Error occurred in topology.c line 1384
>> *
>> * The following FAQ entry in the hwloc documentation may help:
>> *   What should I do when hwloc reports "operating system" warnings?
>> * Otherwise please report this error message to the hwloc user's mailing list
>> * along with the files generated by the hwloc-gather-topology script.
>> 
>> 
>>
>> https://travis-ci.org/jeffhammond/armci-mpi/jobs/425342479 has all of
>> the details.
>>
>> Jeff
>>
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> http://jeffhammond.github.io/
>>
>>
>> ___
>> hwloc-users mailing list
>> hwloc-users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
>>
>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>
>



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [hwloc-users] Travis CI unit tests failing with HW "operating system" error

2018-09-13 Thread Jeff Hammond
All of the job failures have this warning so I am inclined to think they
are related.  I don't know what I should expect from lstopo inside of
AWS, but I guess I'll try it.

I was using the hwloc shipped with mpich-3.3b1.  Talk to the MPICH team
if you want them to upgrade :-)

Jeff

On Thu, Sep 13, 2018 at 8:42 AM, Brice Goglin  wrote:

> This is actually just a warning. Usually it causes the topology to be
> wrong (like a missing object), but it shouldn't prevent the program from
> working. Are you sure your programs are failing because of hwloc? Do you
> have a way to run lstopo on that node?
>
> By the way, you shouldn't use hwloc 2.0.0rc2, at least because it's old,
> it has a broken ABI, and it's an RC :)
>
> Brice
>
>
>
> On 13/09/2018 at 16:12, Jeff Hammond wrote:
>
> I am running ARMCI-MPI over MPICH in a Travis CI Linux instance and
> topology is causing it to fail.  I do not care about topology in a
> virtualized environment.  How do I fix this?
>
> 
> 
> * hwloc 2.0.0rc2-git has encountered what looks like an error from the operating system.
> *
> * Group0 (cpuset 0x,0x) intersects with L3 (cpuset 0x1000,0x0212) without inclusion!
> * Error occurred in topology.c line 1384
> *
> * The following FAQ entry in the hwloc documentation may help:
> *   What should I do when hwloc reports "operating system" warnings?
> * Otherwise please report this error message to the hwloc user's mailing list
> * along with the files generated by the hwloc-gather-topology script.
> 
> ****
>
> https://travis-ci.org/jeffhammond/armci-mpi/jobs/425342479 has all of the
> details.
>
> Jeff
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>
>



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

[hwloc-users] Travis CI unit tests failing with HW "operating system" error

2018-09-13 Thread Jeff Hammond
I am running ARMCI-MPI over MPICH in a Travis CI Linux instance and
topology is causing it to fail.  I do not care about topology in a
virtualized environment.  How do I fix this?


* hwloc 2.0.0rc2-git has encountered what looks like an error from the operating system.
*
* Group0 (cpuset 0x,0x) intersects with L3 (cpuset 0x1000,0x0212) without inclusion!
* Error occurred in topology.c line 1384
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's mailing list
* along with the files generated by the hwloc-gather-topology script.


https://travis-ci.org/jeffhammond/armci-mpi/jobs/425342479 has all of the
details.

Jeff


--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [hwloc-users] Issue running hwloc on Xeon-Phi Coprocessor uOS

2017-01-17 Thread Jeff Hammond
See
https://software.intel.com/en-us/articles/building-a-native-application-for-intel-xeon-phi-coprocessors
where it says "From a console, set the SINK_LD_LIBRARY_PATH to the location
of the Intel compiler runtime libraries for Intel Xeon Phi coprocessors and
to the location of any other dynamic libraries required by the application."

If that page isn't sufficient to resolve the issue, please post on the Intel
developer forum and privately send me a link to your post so I can forward it
internally.

Best,

Jeff

On Tue, Jan 17, 2017 at 9:45 AM, Samuel Thibault <samuel.thiba...@inria.fr>
wrote:
>
> Hello,
>
> Jacob Peter Caswell, on Tue 17 Jan 2017 11:43:07 -0600, wrote:
> > However, I'm still getting some make issues. After running make, I'm
> > getting the following output:
> >
> > x86_64-k1om-linux-ld: warning: libimf.so, needed by
> > /home/caswell/hwloc/hwloc/.libs/libhwloc.so, not found (try using -rpath
> > or -rpath-link)
> >
> > I tried searching for any resolutions, and saw that it might be from an
> > improperly configured $LD_LIBRARY_PATH variable.
>
> Almost :) ld doesn't look at $LD_LIBRARY_PATH, but $LIBRARY_PATH. It's
> ld.so (at runtime) which looks at $LD_LIBRARY_PATH.
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users




--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [hwloc-users] Issue running hwloc on Xeon-Phi Coprocessor uOS

2017-01-16 Thread Jeff Hammond
You need to cross-compile binaries for Knights Corner (KNC) aka Xeon Phi 71xx 
if you're on a Xeon host. KNC is x86 but the binary format differs, as your 
analysis indicates. 

You can either ssh to the card and build natively, build on the host with the 
k1om GCC toolchain, or build on the host with the Intel compiler and -mmic.

If configure needs to execute binaries, you'll find compiling on the card to be 
the simplest method.

Disclaimer: I work for Intel but haven't touched KNC in a long time.

Jeff 
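
The mismatch that `file` reports in the quoted message below can also be checked programmatically. This is a minimal sketch, assuming the standard ELF header layout, where the 16-bit e_machine field sits at byte offset 18. The value 62 is EM_X86_64 and 181 is the EM_K1OM value defined by binutils; the function names are hypothetical.

```python
import struct

EM_X86_64 = 62   # standard x86-64 machine code
EM_K1OM = 181    # Intel K1OM (Knights Corner), as defined by binutils


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # Byte 5 (EI_DATA) gives the byte order: 1 = little endian, 2 = big endian.
    endian = "<" if header[5] == 1 else ">"
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return machine


def describe(path):
    """Name the target architecture of the ELF binary at `path`."""
    with open(path, "rb") as f:
        machine = elf_machine(f.read(20))
    names = {EM_X86_64: "x86-64", EM_K1OM: "k1om"}
    return names.get(machine, "other (%d)" % machine)
```

A host-built `hwloc-info` would be expected to come back as "x86-64", while a binary that can run on the card would come back as "k1om".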

Sent from my iPhone

> On Jan 16, 2017, at 9:12 AM, Jacob Peter Caswell  wrote:
> 
> Hello all,
> 
> I'm sorry if this has been brought up before, however I could not immediately 
> find a resolution to the problem I'm having.
> 
> I'm trying to run hwloc on the Xeon Phi MIC hardware, as I believe this 
> exchange noted that running hwloc-ls on the host only describes the location, 
> as well as the core count and memory. I have mounted a NFS drive at /install, 
> and have tried to install hwloc from the host onto the shared space using:
> # ./autogen.sh
> # ./configure --prefix=/install
> # make
> # make install
> However, when I run:
> # /install/bin/hwloc-info
> I get the following output:
> -bash: /install/bin/hwloc-info: cannot execute binary file
> According to this super user post, it may likely be that the architectures 
> are incompatible, and indeed:
> # file /install/bin/hwloc-info
> outputs:
> hwloc-info: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically 
> linked (uses shared libs), for GNU/Linux 2.6.32, 
> BuildID[sha1]=0x43de04bb145de2499255af0cfb1e21ae7736ba5f, not stripped
> And:
> # uname -a
> outputs:
> Linux PhiMachine 2.6.38.8+mpss3.3.5 #1 SMP Thu May 14 10:27:45 PDT 2015 k1om 
> GNU/Linux
> And
> # uname -m
> returns:
> k1om
> Is this why I cannot natively run hwloc on my MIC hardware? And if so, are 
> there any suggestions about potential workarounds?
> 
> Thank you all very much,
> Jake
> 
> 

Re: [hwloc-users] memory binding on Knights Landing

2016-09-08 Thread Jeff Hammond
On Thu, Sep 8, 2016 at 8:59 AM, Dave Love <d.l...@liverpool.ac.uk> wrote:

> Brice Goglin <brice.gog...@inria.fr> writes:
>
> > Hello
> > It's not a feature. This should work fine.
> > Random guess: do you have NUMA headers on your build machine? (package
> > libnuma-dev or numactl-devel)
> > (hwloc-info --support also reports whether membinding is supported or not)
> > Brice
>
> Oops, you're right!  Thanks.  I thought what I'm using elsewhere was
> built from the same srpm, but the rpm on the KNL box doesn't actually
> require libnuma.  After a rebuild, it's OK and I'm suitably embarrassed.
>
> By the way, is it expected that binding will be slow on it?  hwloc-bind
> is ~10 times slower (~1s) than on two-socket sandybridge, and ~3 times
> slower than on a 128-core, 16-socket system.
>
Is this a bottleneck in any application?  Are there codes binding memory
frequently?

Because most things inside the kernel are limited by single-threaded
performance, it is reasonable for them to be slower than on a Xeon
processor, but I've not seen slowdowns that high.

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/

Re: [hwloc-users] Selecting real cores vs HT cores

2014-12-11 Thread Jeff Hammond
Can someone post docs for this resource halving Squyres claims? I've never
heard of this.

Jeff

On Thursday, December 11, 2014, Samuel Thibault <samuel.thiba...@inria.fr>
wrote:

> Jeff Squyres (jsquyres), on Thu 11 Dec 2014 21:12:27 +, wrote:
> > When the BIOS is set to enable hyper threading, then several resources
> > on the core are split when the machine is booted up (e.g., some of the
> > queue depths for various processing units in the core are half the length
> > that they are when hyperthreading is disabled in the BIOS).
>
> Perhaps some queues get divided, but most of the resources (such as
> cache, TLB, etc.) are completely available when using only one
> hyperthread, like they would be with HT disabled.
>
> Samuel
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org <javascript:;>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post:
> http://www.open-mpi.org/community/lists/hwloc-users/2014/12/1132.php
>


-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [hwloc-users] BGQ question.

2014-03-25 Thread Jeff Hammond
I am inclined to think the issue is that ION Linux uses only 16 cores. When 
hwloc does n-- for CNK to skip the 17th (OS) core, it gets the wrong answer for 
Linux. 

Just check for Linux support and use /proc/cpuinfo and don't adjust manually. 

I'm not sure hwloc on a BGQ ION needs any special hook either. Isn't Linux 
PPC support enough?

Jeff 
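
To illustrate the "use /proc/cpuinfo and don't adjust manually" suggestion, here is a minimal sketch that counts `processor` entries, roughly what the `grep -w processor /proc/cpuinfo | wc -l` command in the quoted message does. The function names are hypothetical, and the figure of 4 hardware threads per core is an assumption based on the Blue Gene/Q description elsewhere on this list.

```python
import os


def count_processors(cpuinfo_text):
    """Count entries whose key is exactly 'processor'."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key = line.split(":", 1)[0].strip()
        if key == "processor":
            count += 1
    return count


def cores_from_threads(n_threads, threads_per_core=4):
    """Derive a core count with no manual 'minus one' adjustment."""
    return n_threads // threads_per_core


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as f:
            n = count_processors(f.read())
        print(n, "hardware threads,", cores_from_threads(n), "cores (if 4-way)")
```

With the 68 entries reported in the quoted message this gives 17 cores.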

Sent from my iPhone

> On Mar 25, 2014, at 7:28 AM, Chris Samuel  wrote:
> 
>> On Tue, 25 Mar 2014 06:51:49 AM Biddiscombe, John A. wrote:
>> 
>> I’m compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes of
>> a BGQ.
> 
> Out of interest, why on an I/O node?
> 
>> It seems to have compiled ok, and when I run lstopo, I get an output
>> like this (below), which looks reasonable, but there are 15 sockets instead
>> of 16.
> 
> I've not tried on our I/O nodes, but looking at /proc/cpuinfo on one it 
> reports 68 cores (hardware threads), thus 17 real cores (on CNK one of those 
> is reserved for the CNK so is not available for codes without determined 
> fiddling).
> 
> -bash-4.1# grep -w processor /proc/cpuinfo | wc -l
> 68
> 
> This is V1R2M0 (RHEL 6.3 based).
> 
> cheers,
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users


Re: [hwloc-users] BGQ question.

2014-03-25 Thread Jeff Hammond
I can arrange both BGQ Vesta access and the ability to get Linux-equipped BGQ 
nodes, i.e. basically IONs.

I'll send you details offline. 

Jeff

Sent from my iPhone

> On Mar 25, 2014, at 2:55 AM, "Biddiscombe, John A."  wrote:
> 
> Brice,
>  
> Correct: the IO nodes are running a full Linux install (RHEL 6.4) on the 
> same hardware as the CNK nodes.
>  
> On vesta I do not have an account and I am not certain the IO nodes are 
> available for direct login. I’m using the BGQ at CSCS which is an EPFL 
> machine. The IO nodes are open for some special projects where we are trying 
> to customise the IO.
>  
> JB
>  
> From: Brice Goglin [mailto:brice.gog...@inria.fr] 
> Sent: 25 March 2014 08:43
> To: Hardware locality user list; Biddiscombe, John A.
> Subject: Re: [hwloc-users] BGQ question.
>  
> Wait, I missed the "io node" part of your first mail. The bgq support is for 
> compute nodes running cnk. Are io nodes running linux on same hardware as the 
> compute nodes?
> 
> I have an account on vesta. Where should I logon to have a look?
> Brice
> 
> 
> On 25 March 2014 08:12:58 UTC+01:00, "Biddiscombe, John A." wrote:
> Brice,
>  
> lstopo --whole-system
> gives the same output and setting env var BG_THREADMODEL=2 does not appear to 
> make any visible difference.
>  
> my configure command for compiling hwloc had no special options,
> ./configure --prefix=/gpfs/bbp.cscs.ch/home/biddisco/apps/clang/hwloc-1.8.1
>  
> should I rerun with something set?
>  
> Thanks
>  
> JB
>  
> From: hwloc-users [mailto:hwloc-users-boun...@open-mpi.org] On Behalf Of 
> Brice Goglin
> Sent: 25 March 2014 08:04
> To: Hardware locality user list
> Subject: Re: [hwloc-users] BGQ question.
>  
> 
> On 25/03/2014 07:51, Biddiscombe, John A. wrote:
> I’m compiling hwloc using clang (bgclang++11 from ANL) to run on IO nodes of 
> a BGQ. It seems to have compiled ok, and when I run lstopo, I get an output 
> like this (below), which looks reasonable, but there are 15 sockets instead 
> of 16. I’m a little worried because the first time I compiled, I had problems 
> where apps would report an error from HWLOC on start and tell me to set 
> HWLOC_FORCE_BGQ=1. when I did set this env var, it would then report that 
> “topology became empty” and the app would segfault due to the unexpected 
> return from hwloc presumably.
> 
> Can you give a bit more details on what you did there? I'd like to check if 
> that case should be better supported or not.
> 
> 
> I wiped everything and recompiled (not sure what I did differently), and now 
> it behaves more sensibly, but with 15 instead of 16 sockets.
>  
> Should I be worried?
> 
> The topology detection is hardwired so you shouldn't be worried on the hardware 
> side.
> The problem could be related to how you reserved resources before running 
> lstopo.
> Does lstopo --whole-system see more sockets?
> Does BG_THREADMODEL=2 help?
> 
> Brice
> 
> 


Re: [hwloc-users] CPU info on ARM

2014-01-28 Thread Jeff Hammond
I don't know the answer but I'm happy to test on my ARM systems if you
have an experiment to perform.

Right now, I have an NVIDIA Kayla board and a Parallella board, both
of which are ARMv7.

Jeff
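
As a starting point for such an experiment, here is a minimal sketch that pulls the ARM identification fields out of /proc/cpuinfo-style text, using the sample from the quoted message below. The function names and the choice of which fields to keep are assumptions for illustration, not what hwloc does.

```python
SAMPLE = """Processor: Marvell PJ4Bv7 Processor rev 1 (v7l)
BogoMIPS: 1196.85
CPU implementer: 0x56
CPU architecture: 7
CPU variant: 0x1
CPU part: 0x581
CPU revision: 1
Hardware: Marvell Armada-370
"""


def parse_cpuinfo_block(text):
    """Turn 'key: value' lines into a dict, ignoring lines without a colon."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def arm_identity(fields):
    """Collect fields loosely analogous to the x86 vendor/model/family set."""
    return {
        "implementer": int(fields["CPU implementer"], 16),
        "part": int(fields["CPU part"], 16),
        "variant": int(fields["CPU variant"], 16),
        "revision": int(fields["CPU revision"]),
        "model": fields.get("Processor", ""),
    }


if __name__ == "__main__":
    print(arm_identity(parse_cpuinfo_block(SAMPLE)))
```

On a big.LITTLE system the fields may differ per core, so a real version would have to parse one block per processor rather than one per file.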

On Tue, Jan 28, 2014 at 6:51 AM, Jeff Squyres (jsquyres)
<jsquy...@cisco.com> wrote:
> I passed this on to my OMPI ARM contact (Leif Lindholm).  Here's what he said:
>
> "It gets a bit trickier on ARM... since we may also have (implementation
> time) configurable cache sizes and also big.LITTLE (different processor
> models executing in the same SMP system)."
>
> He passed the question on to another ARM guy, asking for further detail.  
> I'll pass on what he says.
>
>
>
> On Jan 28, 2014, at 3:39 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
>> Hello,
>>
>> Is anybody familiar with ARM CPUs?
>>
>> I am adding more CPU information because Intel needs more:
>> CPUVendor=GenuineIntel
>> CPUModel=Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
>> CPUModelNumber=45
>> CPUFamilyNumber=6
>>
>> Would something similar be useful for ARM? What are the fields below
>> from /proc/cpuinfo on ARM that would be useful to developers?
>> Processor: Marvell PJ4Bv7 Processor rev 1 (v7l)
>> BogoMIPS: 1196.85
>> Features: swp half thumb fastmult vfp edsp vfpv3 vfpv3d16 tls
>> CPU implementer: 0x56
>> CPU architecture: 7
>> CPU variant: 0x1
>> CPU part: 0x581
>> CPU revision: 1
>> Hardware: Marvell Armada-370
>> Revision: 
>> Serial: 
>>
>> thanks
>> Brice
>>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>



-- 
Jeff Hammond
jeff.scie...@gmail.com


Re: [hwloc-users] Many queries creating slow performance

2013-03-05 Thread Jeff Hammond
Si - Is your code that calls hwloc part of MPICH or OpenMPI or
something that can be made standalone and shared?

Brice - Do you have access to a MIC system for testing?  Write me
offline if you don't and I'll see what I can do to help.

If this affects MPICH i.e. Hydra, then I'm sure Intel will be
committed to helping fix it since Intel MPI is using Hydra as the
launcher on systems like Stampede.

Best,

Jeff
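
The "single XML file" timing in the quoted message below suggests discovering the topology once and letting every rank load the result. The quoted message does not say how that was set up; this minimal sketch assumes the route through hwloc's documented HWLOC_XMLFILE and HWLOC_THISSYSTEM environment variables, with a file exported beforehand by `lstopo`. The helper names and the file name are hypothetical.

```python
import os


def export_command(xml_path):
    """Command that exports the topology; lstopo picks XML from the extension."""
    return ["lstopo", xml_path]


def xml_topology_env(xml_path, base=None):
    """Environment telling hwloc to load the topology from xml_path.

    HWLOC_THISSYSTEM=1 asserts that the XML describes the machine the
    process is running on, so binding calls remain usable.
    """
    env = dict(os.environ if base is None else base)
    env["HWLOC_XMLFILE"] = os.path.abspath(xml_path)
    env["HWLOC_THISSYSTEM"] = "1"
    return env


if __name__ == "__main__":
    # Run the export once per node, then launch the ranks with the env below.
    print(" ".join(export_command("topo.xml")))
    env = xml_topology_env("topo.xml", base={})
    for name in sorted(env):
        print(name + "=" + env[name])
```

This avoids having every rank walk /proc and /sys at the same time, which is the suspected cause of the slowdown.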

On Tue, Mar 5, 2013 at 3:05 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
> Just tested on a 96-core shared-memory machine. Running OpenMPI 1.6 mpiexec
> lstopo, here's the execution time (mpiexec launch time is 0.2-0.4s)
>
> 1 rank  :  0.2s
> 8 ranks :  0.3-0.5s depending on binding (packed or scatter)
> 24 ranks:  0.8-3.7s depending on binding
> 48 ranks:  2.8-8.0s depending on binding
> 96 ranks: 14.2s
>
> 96 ranks from a single XML file: 0.4s (negligible against mpiexec launch
> time)
>
> Brice
>
>
>
> On 05/03/2013 at 20:23, Simon Hammond wrote:
>
> Hi HWLOC users,
>
> We are seeing some significant performance problems using HWLOC 1.6.2 on
> Intel's MIC products. In one of our configurations we create 56 MPI ranks,
> each rank then queries the topology of the MIC card before creating threads.
> We are noticing that if we run 56 MPI ranks as opposed to one, the calls to
> query the topology in HWLOC are very slow: runtime goes from seconds to
> minutes (and upwards).
>
> We guessed that this might be caused by the kernel serializing access to the
> /proc filesystem but this is just a hunch.
>
> Has anyone had this problem and found an easy way to change the library /
> calls to HWLOC so that the slow down is not experienced? Would you describe
> this as a bug?
>
> Thanks for your help.
>
>
> --
> Simon Hammond
>
> 1-(505)-845-7897 / MS-1319
> Scalable Computer Architectures
> Sandia National Laboratories, NM
>
>
>
>
>
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhamm...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond



Re: [hwloc-users] hwloc on Blue Gene/Q?

2013-01-08 Thread Jeff Hammond
These functions are returning the physical placement at the moment
they are called.  If a Pthread moves around, it will still return the
correct, current value.

You should not cache the output of these functions.  They require ~105
cycles per call (I just measured this for 1M calls, with 315-318M
cycles required for the test), which is cheaper than loading a stored
value if it's not in cache.

Jeff
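
For the ID ranges in the quoted message below (cores 0..16, hardware threads 0..67), here is a minimal sketch of the arithmetic for deciding whether two hardware threads sit on the same core. It assumes core-major numbering (hardware thread ID = 4 * core + thread), which this thread does not state, and the function names are hypothetical. In a real program the IDs would come from the SPI calls at run time rather than being computed.

```python
THREADS_PER_CORE = 4   # 4 hardware threads per core, per the quoted message
N_CORES = 17           # 16 application cores plus the 17th core


def split_hw_thread(proc_id):
    """Split a hardware thread ID (0..67) into (core, thread-within-core)."""
    if not 0 <= proc_id < N_CORES * THREADS_PER_CORE:
        raise ValueError("hardware thread ID out of range: %r" % (proc_id,))
    return divmod(proc_id, THREADS_PER_CORE)


def share_core(a, b):
    """True if hardware threads a and b are on the same core."""
    return split_hw_thread(a)[0] == split_hw_thread(b)[0]


if __name__ == "__main__":
    for proc_id in (0, 3, 4, 63, 67):
        print(proc_id, split_hw_thread(proc_id))
```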

On Tue, Jan 8, 2013 at 7:50 PM, Erik Schnetter <schnet...@cct.lsu.edu> wrote:
> Jeff
>
> Thanks, this is helpful. I am mostly interested in finding out which threads
> share the D1 cache. I guess that get_bgq_core returns this information.
>
> Is there a way to guarantee that this association doesn't change at run
> time? I guess I could just check periodically...
>
> -erik
>
>
>
> On Tue, Jan 8, 2013 at 5:33 PM, Jeff Hammond <jhamm...@alcf.anl.gov> wrote:
>>
>> As a temporary, non-portable substitute for hwloc, you can use the SPI
>> calls that are described on my Wiki:
>> https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q#Node_topology.
>> I presume that this is the means by which hwloc will support BGQ when
>> it does.
>>
>> Blue Gene/Q has 16+1 cores with 4 hw threads each.  Only 16 cores are
>> visible to applications but as users can, in theory, run code on the
>> 17th core (see
>> https://wiki.alcf.anl.gov/parts/index.php/Blue_Gene/Q#17th_Core_App_Agents
>> for how), it is important for these functions to return values in the
>> range 0..16 and 0..67 instead of 0..15 and 0..63.  I include this
>> information in case users are confused about the additional range
>> documented for these calls.
>>
>> Best,
>>
>> Jeff
>>
>> On Tue, Jan 8, 2013 at 11:10 AM, Brice Goglin <brice.gog...@inria.fr>
>> wrote:
>> > Hello Erik,
>> > We need specific BGQ binding support, the binding API is different. Also
>> > we don't detect the 16 4-way cores properly, we only see 64 identical
>> > PUs.
>> > I am supposed to get a BGQ account in the near future so I hope I will
>> > have everything working in v1.7.
>> > Stay tuned
>> > Brice
>> >
>> >
>> >
>> >
>> > On 08/01/2013 at 18:06, Erik Schnetter wrote:
>> >
>> > I am trying to use hwloc on a Blue Gene/Q. Building and installing
>> > worked fine, and it reports the system configuration fine as well (i.e.
>> > it shows all PUs). However, when I try to query the thread/core
>> > bindings, hwloc crashes with an error in libc's free(). This is both
>> > with 1.6 and 1.6.1rc1.
>> >
>> > The error occurs apparently in CPU_FREE called from
>> > hwloc_linux_find_kernel_nr_cpus.
>> >
>> > Does this ring a bell with anyone? I know this is not enough information
>> > to debug things, but do you have any pointers for things to look at?
>> >
>> > I remember reading somewhere that the last bit in a cpu_set_t cannot be
>> > used. A Blue Gene/Q has 64 PUs, and may be using 64-bit integers to hold
>> > cpu_set_t data. Could this be an issue?
>> >
>> > My goal is to examine and experiment with thread/core bindings with
>> > OpenMP to improve performance.
>> >
>> > -erik
>> >
>> > --
>> > Erik Schnetter <schnet...@gmail.com>
>> > http://www.perimeterinstitute.ca/personal/eschnetter/
>> >
>> >
>>
>>
>>
>> --
>> Jeff Hammond
>> Argonne Leadership Computing Facility
>> University of Chicago Computation Institute
>> jhamm...@alcf.anl.gov / (630) 252-5381
>> http://www.linkedin.com/in/jeffhammond
>> https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond
>>
>
>
>
>
> --
> Erik Schnetter <schnet...@cct.lsu.edu>
> http://www.perimeterinstitute.ca/personal/eschnetter/
>



-- 
Jeff Hammond
Argonne Leadership Computing Facility
University of Chicago Computation Institute
jhamm...@alcf.anl.gov / (630) 252-5381
http://www.linkedin.com/in/jeffhammond
https://wiki.alcf.anl.gov/parts/index.php/User:Jhammond