Re: [OMPI devel] 32 bit issues (was: 32 bit support needs a maintainer)

2023-02-12 Thread Brice Goglin via devel

Hello

This seems to be the code for sharing the hwloc topology in shared 
memory. This whole thing was designed for very large virtual address 
spaces (>=48 bits on most CPUs), where it's easy to find a virtual 
memory area that is unused in all participating processes. I am not 
sure it's worth fixing on 32 bits, since it's not clear it would often 
work there.


Brice




On 12/02/2023 at 23:06, Yatindra Vaishnav via devel wrote:


Hi Jeff,

I created a 32-bit environment and was able to see the issue in PMIx. 
I'm attaching the screenshot here; please confirm.



Regards,

--Yatindra.



*From: *Jeff Squyres (jsquyres) 
*Sent: *Tuesday, February 7, 2023 2:13 PM
*To: *Yatindra Vaishnav ; Open MPI 
Developers 

*Subject: *Re: 32 bit issues (was: 32 bit support needs a maintainer)

Be sure to see 
https://github.com/open-mpi/ompi/pull/11282#issuecomment-1421523518.


*From:*Jeff Squyres (jsquyres) 
*Sent:* Tuesday, February 7, 2023 4:07 PM
*To:* Yatindra Vaishnav ; Open MPI Developers 


*Subject:* 32 bit issues (was: 32 bit support needs a maintainer)

Just try to build Open PMIx -- by itself, not embedded in Open MPI 
-- in 32-bit mode and you'll see the compile failures.


I *think* that PMIx's configure has an "if building in 32-bit 
mode, print a message that this is not supported and abort" check (just 
like Open MPI).  You will need to remove that check in order to 
compile Open PMIx in 32-bit mode.


*From:*Yatindra Vaishnav 
*Sent:* Tuesday, February 7, 2023 4:02 PM
*To:* Open MPI Developers ; Jeff Squyres 
(jsquyres) 

*Subject:* RE: [OMPI devel] 32 bit support needs a maintainer

Hi Jeff,

Where can I see the bugs of 32-bit support on OpenPMIx? I see this link:

Memory Leaks · Issue #1276 · openpmix/openpmix (github.com) 



Regards,

--Yatindra.



*From: *Yatindra Vaishnav via devel 
*Sent: *Monday, February 6, 2023 1:27 PM
*To: *Jeff Squyres (jsquyres) ; 
devel@lists.open-mpi.org

*Cc: *Yatindra Vaishnav 
*Subject: *Re: [OMPI devel] 32 bit support needs a maintainer

Sure, Jeff. Let me take a look.

*From: *Jeff Squyres (jsquyres) 
*Sent: *Monday, February 6, 2023 1:03 PM
*To: *Yatindra Vaishnav ; 
devel@lists.open-mpi.org

*Subject: *Re: [OMPI devel] 32 bit support needs a maintainer

Ok.  OpenPMIx is the "bottom" of the stack of Open MPI --> PRRTE --> 
OpenPMIx, so fixing the 32-bit issues there first is what makes sense 
-- see 
https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/required-support-libraries.html#library-dependencies.


We have been trying to get Open MPI v5.0.0 out for quite a while, and 
we finally seem to be getting close to the finish line.  So getting 
these fixes in sooner rather than later would be better.


*From:*Yatindra Vaishnav 
*Sent:* Monday, February 6, 2023 2:56 PM
*To:* Jeff Squyres (jsquyres) ; 
devel@lists.open-mpi.org 

*Subject:* RE: [OMPI devel] 32 bit support needs a maintainer

Yes, Jeff, I can give roughly 5-8 hours a week. I can take care 
of the OpenPMIx bugs first and will look into the others afterwards.




*From: *Jeff Squyres (jsquyres) 
*Sent: *Monday, February 6, 2023 11:48 AM
*Subject: *Re: [OMPI devel] 32 bit support needs a maintainer

Ok.  What kind of timeframe do you have to work on this? Will you be 
able to look into the OpenPMIx 32-bit bugs in the immediate future, 
and then start testing with PRRTE and Open MPI?


*From:*Yatindra Vaishnav 
*Sent:* Monday, February 6, 2023 10:57 AM
*To:* Jeff Squyres (jsquyres) ; 
devel@lists.open-mpi.org 

*Subject:* Re: [OMPI devel] 32 bit support needs a maintainer

Hi Jeff,

Nice to see your response. Yes, I responded to the same mail. I 
registered myself with the Open MPI development community, and I'm 
aware of the responsibilities. I can set up 32-bit VMs to do bug 
fixes and testing, and I already have a server machine I can help with.



*From:* Jeff Squyres (jsquyres) 
*Sent:* Monday, February 6, 2023 7:45 AM
*To:* devel@lists.open-mpi.org 
*Cc:* Yatindra Vaishnav 
*Subject:* Re: [OMPI devel] 32 bit support needs a maintainer

Greetings Yatindra; thanks for responding.

Just curious: are you replying in response to the discussion that just 
came up a few days ago on https://github.com/open-mpi/ompi/pull/11282, 
where we set the 
upcoming Open MPI v5.0's configure script to abort in 32-bit 
environments?  I.e., are you part of the Debian community?


[hwloc-devel] hwloc 3.0 breaking the ABI again?

2022-03-03 Thread Brice Goglin

Hello

hwloc 2.0 was released 4 years ago, and that major release was painful 
because it broke the ABI and significantly changed the API. I want to 
avoid having users modify their code with #ifdef again. But there is at 
least one good reason to break the ABI in the future (support for 
32-bit PCI domains). That would mean you wouldn't have to modify your 
code, but you would have to rebuild it against the new hwloc.


I don't know yet whether we'll do it in 6 months (in 3.0, after 2.9?) 
or in 5 years (it currently depends only on whether the PCI-domain 
issue becomes very common). If you see something else to change in the 
API or ABI, or if you have any comment about all this, please let me know.
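For reference, the #ifdef churn from the 1.x to 2.0 API change typically looked like the guard below on HWLOC_API_VERSION. This is a fragment for illustration (it needs hwloc.h to compile); the memory-field move is one real example of that change:

```c
#include <hwloc.h>

/* Users had to guard code like this across the 1.x -> 2.0 API break.
 * An ABI-only break (as considered for 3.0) would require a rebuild
 * but no such source edits. */
static hwloc_uint64_t node_memory(hwloc_obj_t node)
{
#if HWLOC_API_VERSION >= 0x00020000
  return node->total_memory;        /* hwloc >= 2.0 */
#else
  return node->memory.total_memory; /* hwloc 1.x */
#endif
}
```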


Brice


___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel


Re: [OMPI devel] v5.0 equivalent of --map-by numa

2021-11-11 Thread Brice Goglin via devel

Hello Ben

It will be back, at least for the majority of platforms (those without 
heterogeneous memory).


See https://github.com/open-mpi/ompi/issues/8170 and 
https://github.com/openpmix/prrte/pull/1141


Brice



On 11/11/2021 at 05:33, Ben Menadue via devel wrote:

Hi,

Quick question: what's the equivalent of "--map-by numa" for the new
PRRTE-based runtime for v5.0? I can see "package" and "l3cache" in the
help, which are close, but don't quite match "numa" for our system.

In more detail...

We have dual-socket CLX- and SKL-based nodes with sub-NUMA clustering
enabled. This shows up in the OS as two packages, each with 1 L3 cache
domain and 2 NUMA domains. Even worse, each compute node effectively
has its own unique mapping of the cores of each socket between the NUMA
domains.

A common way of running for our users is with 1 MPI process per NUMA
domain and then some form of threading within the cores associated with
that domain. This effectively gives each MPI process its own memory
controller and DIMMs.

Using "--map-by numa" worked really well for this, since it took care
of the unique core numbering of each node. The only way I can think of
to set up something equivalent without that would be manually
enumerating the nodes in each job and building a rank file.

I've included an example topology below.

Or do you think this is better as a GitHub issue?

Thanks,
Ben

[bjm900@gadi-cpu-clx-0143 build]$ lstopo
Machine (189GB total)
   Package L#0 + L3 L#0 (36MB)
 Group0 L#0
   NUMANode L#0 (P#0 47GB)
   L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU
L#0 (P#0)
   L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU
L#1 (P#1)
   L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU
L#2 (P#2)
   L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU
L#3 (P#3)
   L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU
L#4 (P#7)
   L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU
L#5 (P#8)
   L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU
L#6 (P#9)
   L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU
L#7 (P#13)
   L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU
L#8 (P#14)
   L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU
L#9 (P#15)
   L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
+ PU L#10 (P#19)
   L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
+ PU L#11 (P#20)
   HostBridge
 PCI 00:11.5 (SATA)
 PCI 00:17.0 (SATA)
   Block(Disk) "sda"
 PCIBridge
   PCIBridge
 PCI 02:00.0 (VGA)
   HostBridge
 PCIBridge
   PCIBridge
 PCIBridge
   PCI 08:00.2 (Ethernet)
 Net "eno1"
 Group0 L#1
   NUMANode L#1 (P#1 47GB)
   L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
+ PU L#12 (P#4)
   L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
+ PU L#13 (P#5)
   L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
+ PU L#14 (P#6)
   L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
+ PU L#15 (P#10)
   L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
+ PU L#16 (P#11)
   L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
+ PU L#17 (P#12)
   L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
+ PU L#18 (P#16)
   L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
+ PU L#19 (P#17)
   L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
+ PU L#20 (P#18)
   L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
+ PU L#21 (P#21)
   L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
+ PU L#22 (P#22)
   L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
+ PU L#23 (P#23)
   HostBridge
 PCIBridge
   PCI 58:00.0 (InfiniBand)
 Net "ib0"
 OpenFabrics "mlx5_0"
   Package L#1 + L3 L#1 (36MB)
 Group0 L#2
   NUMANode L#2 (P#2 47GB)
   L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
+ PU L#24 (P#24)
   L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
+ PU L#25 (P#25)
   L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
+ PU L#26 (P#26)
   L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
+ PU L#27 (P#27)
   L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
+ PU L#28 (P#30)
   L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
+ PU L#29 (P#31)
   L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
+ PU L#30 (P#35)
   L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
+ PU L#31 (P#36)
   L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
+ PU L#32 (P#37)
 

Re: [hwloc-devel] Negative values for die_id and physical_package_id

2021-05-26 Thread Brice Goglin

On 26/05/2021 at 15:23, Samuel Thibault wrote:

Brice Goglin, on Wed. 26 May 2021 14:13:02 +0200, wrote:

os_index is already *unsigned* in the API (did you mean signed?). We cannot
change obj->os_index back to signed now; it would break existing users.

Mmm, it wouldn't break the ABI, only printf formats using %u?



Right. I tried the idea for hwloc 2.0, but I quickly stopped because of 
all the printf signedness warnings that our picky users were getting :/


Brice





Re: [hwloc-devel] Negative values for die_id and physical_package_id

2021-05-26 Thread Brice Goglin

On 26/05/2021 at 14:24, Jirka Hladky wrote:


 However maybe debugging would be easier if tools printed that
special value as -1 instead of 4294967295 (I'd need to check other
tools too, lstopo takes care of some of these values, maybe not all).

I agree.  So perhaps we can update the tools only, to print 4294967295 
as -1?



Opened as https://github.com/open-mpi/hwloc/issues/468

Hopefully we'll get some time to fix this before releasing 2.5.

Brice





Re: [hwloc-devel] Negative values for die_id and physical_package_id

2021-05-26 Thread Brice Goglin

On 26/05/2021 at 13:51, Jirka Hladky wrote:

Hi Brice,

I would like to get your opinion on the following issue. On IBM LPAR, 
the kernel reports die_id and physical_package_id to be -1.

See [0]


hwloc-calc converts these values into an unsigned integer, resulting 
in Socket ID 2^32-1:


hwloc-calc --physical-output --intersect socket core:0
4294967295

I'm not quite sure why die_id and physical_package_id are set to -1. 
Perhaps it signals some error condition.



Hello Jirka

die_id might only be implemented on x86, since x86 was the only 
architecture that could expose different dies within a package when die 
topology info was added to recent kernels.


Package ID seems to be properly set on the POWER8/9 machines I have 
access to. Maybe it's something related to the LPAR exposing a 
special/virtual topology, hence die and package IDs wouldn't make sense?



I will try to find out. However, I think that hwloc-calc should store 
the values as unsigned integers and represent them the same way as the 
kernel does. BTW, when using the hwloc API, I'm getting the correct values:


obj[0] = hwloc_get_pu_obj_by_os_index(topology, pu_hier);
obj[2] = hwloc_get_ancestor_obj_by_type (topology, HWLOC_OBJ_SOCKET, 
obj[0]);

obj[2]->os_index => -1

What are your thoughts?



os_index is already *unsigned* in the API (did you mean signed?). We 
cannot change obj->os_index back to signed now; it would break 
existing users.


But being signed wouldn't help much. -1 is the special value 
HWLOC_UNKNOWN_INDEX, it doesn't matter if it's stored as -1 or 
4294967295. Users shouldn't rely on these numbers anyway. However maybe 
debugging would be easier if tools printed that special value as -1 
instead of 4294967295 (I'd need to check other tools too, lstopo takes 
care of some of these values, maybe not all).


Brice



Re: [hwloc-devel] hwloc on IBM Power LPAR VMs

2021-04-26 Thread Brice Goglin
This patch should fix the issue. We had to fix the same issue for CPU#0 
being offline recently but I didn't know it could be needed for NUMA 
node#0 being offline too.


I am trying to release hwloc 2.5 "soon". If that's too slow, please let 
me know and I'll see if I can do a 2.4.1 earlier.


Brice




commit 7c159d723432e461b4e48cc2d38212913d2ba7c7
Author: Brice Goglin 
Date:   Mon Apr 26 20:35:42 2021 +0200

linux: fix support for NUMA node0 being offline

Just like we didn't support offline CPU#0 until commit
7bcc273efd50536961ba16d474efca4ae163229b, we need to
support node0 being offline as well.
It's not clear whether it's a new Linux feature or not,
this was reported on a POWER LPAR VM.

We opportunistically assume node0 is online to avoid
the overhead in the vast majority of cases. If node0
is missing, we parse "online" to find the first node.

Thanks to Jirka Hladky for the report.

Signed-off-by: Brice Goglin 


diff --git a/hwloc/topology-linux.c b/hwloc/topology-linux.c
index 94b242dd0..10e038e64 100644
--- a/hwloc/topology-linux.c
+++ b/hwloc/topology-linux.c
@@ -5264,6 +5264,9 @@ static const char *find_sysfs_cpu_path(int root_fd, int *old_filenames)
 
 static const char *find_sysfs_node_path(int root_fd)
 {
+  unsigned first;
+  int err;
+
   if (!hwloc_access("/sys/bus/node/devices", R_OK|X_OK, root_fd)
       && !hwloc_access("/sys/bus/node/devices/node0/cpumap", R_OK, root_fd))
     return "/sys/bus/node/devices";
@@ -5272,6 +5275,28 @@ static const char *find_sysfs_node_path(int root_fd)
       && !hwloc_access("/sys/devices/system/node/node0/cpumap", R_OK, root_fd))
     return "/sys/devices/system/node";
 
+  /* node0 might be offline, fallback to looking at the first online node.
+   * online contains comma-separated ranges, just read the first number.
+   */
+  hwloc_debug("Failed to find sysfs node files using node0, looking at online nodes...\n");
+  err = hwloc_read_path_as_uint("/sys/devices/system/node/online", &first, root_fd);
+  if (err) {
+    hwloc_debug("Failed to read /sys/devices/system/node/online.\n");
+  } else {
+    char path[PATH_MAX];
+    hwloc_debug("Found node#%u as first online node\n", first);
+
+    snprintf(path, sizeof(path), "/sys/bus/node/devices/node%u/cpumap", first);
+    if (!hwloc_access("/sys/bus/node/devices", R_OK|X_OK, root_fd)
+        && !hwloc_access(path, R_OK, root_fd))
+      return "/sys/bus/node/devices";
+
+    snprintf(path, sizeof(path), "/sys/devices/system/node/node%u/cpumap", first);
+    if (!hwloc_access("/sys/devices/system/node", R_OK|X_OK, root_fd)
+        && !hwloc_access(path, R_OK, root_fd))
+      return "/sys/devices/system/node";
+  }
+
   return NULL;
 }
 






On 26/04/2021 at 16:48, Brice Goglin wrote:


Hello,

Maybe we have something that assumes that the first NUMA node on Linux 
is #0. And something is wrong in the disallowed case anyway since the 
NUMA node physical number is 0 instead of 2 there.


Can you run "hwloc-gather-topology lpar" and send the resulting 
lpar.tar.bz2? (send it only to me if it's too big or somehow 
confidential).


Thanks

Brice



On 26/04/2021 at 16:40, Jirka Hladky wrote:

Hi Brice,

how are you doing? I hope you are fine. We are all well and safe.

I have been running hwloc on an IBM Power LPAR VM with only 1 CPU core 
and 8 PUs [1]. There is only one NUMA node. The numbering is, however, 
quite strange: the NUMA node number is "2".  See [2].


hwloc reports "Topology does not contain any NUMA node, aborting!"

$ lstopo
Topology does not contain any NUMA node, aborting!
hwloc_topology_load() failed (No such file or directory).

Could you please double-check if this behavior is correct? I believe 
hwloc should work on this HW setup.


FYI, we can get it working with the --disallowed option [3] (but I think 
it should work without this option as well)


Thanks a lot!
Jirka


[1] $ lscpu
Architecture:    ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   1
NUMA node(s):    1

[2] There is ONE NUMA node with the number "2":
$ numactl -H
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 7614 MB
node 2 free: 1098 MB
node distances:
node   2
 2:  10

[3]
$ lstopo --disallowed
Machine (7615MB total)
 Package L#0
   NUMANode L#0 (P#0 7615MB)
   L3 L#0 (4096KB) + L2 L#0 (1024KB) + Core L#0
 L1d L#0 (32KB) + L1i L#0 (48KB)
   Die L#0 + PU L#0 (P#0)
   PU L#1 (P#2)
   PU L#2 (P#4)
   PU L#3 (P#6)
 L1d L#1 (32KB) + L1i L#1 (48KB)
   PU L#4 (P#1)
   PU L#5 (P#3)
   PU L#6 (P#5)
   PU L#7 (P#7)
 Block(Disk) "sda"
 Net

Re: [hwloc-devel] hwloc on IBM Power LPAR VMs

2021-04-26 Thread Brice Goglin

Hello,

Maybe we have something that assumes that the first NUMA node on Linux 
is #0. And something is wrong in the disallowed case anyway since the 
NUMA node physical number is 0 instead of 2 there.


Can you run "hwloc-gather-topology lpar" and send the resulting 
lpar.tar.bz2? (send it only to me if it's too big or somehow confidential).


Thanks

Brice



On 26/04/2021 at 16:40, Jirka Hladky wrote:

Hi Brice,

how are you doing? I hope you are fine. We are all well and safe.

I have been running hwloc on an IBM Power LPAR VM with only 1 CPU core 
and 8 PUs [1]. There is only one NUMA node. The numbering is, however, 
quite strange: the NUMA node number is "2".  See [2].


hwloc reports "Topology does not contain any NUMA node, aborting!"

$ lstopo
Topology does not contain any NUMA node, aborting!
hwloc_topology_load() failed (No such file or directory).

Could you please double-check if this behavior is correct? I believe 
hwloc should work on this HW setup.


FYI, we can get it working with the --disallowed option [3] (but I think 
it should work without this option as well)


Thanks a lot!
Jirka


[1] $ lscpu
Architecture:    ppc64le
Byte Order:  Little Endian
CPU(s):  8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):   1
NUMA node(s):    1

[2] There is ONE NUMA node with the number "2":
$ numactl -H
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 7614 MB
node 2 free: 1098 MB
node distances:
node   2
 2:  10

[3]
$ lstopo --disallowed
Machine (7615MB total)
 Package L#0
   NUMANode L#0 (P#0 7615MB)
   L3 L#0 (4096KB) + L2 L#0 (1024KB) + Core L#0
 L1d L#0 (32KB) + L1i L#0 (48KB)
   Die L#0 + PU L#0 (P#0)
   PU L#1 (P#2)
   PU L#2 (P#4)
   PU L#3 (P#6)
 L1d L#1 (32KB) + L1i L#1 (48KB)
   PU L#4 (P#1)
   PU L#5 (P#3)
   PU L#6 (P#5)
   PU L#7 (P#7)
 Block(Disk) "sda"
 Net "env2"







Re: [OMPI devel] HWLOC duplication relief

2021-02-04 Thread Brice Goglin via devel
The text looks correct to me. I don't have any better suggestion for now.

I am thinking about adding an adopt() flag to say "adopt it, or give me a
pointer to the already-adopted one", but it's not clear to me how to
implement this safely. I opened a hwloc issue to discuss the details of
making sure both adopt() calls point to the very same shmem topology
file: https://github.com/open-mpi/hwloc/issues/449

Brice


On 04/02/2021 at 01:28, Ralph Castain via devel wrote:
> I have updated the site to reflect this discussion to-date. I'm still trying 
> to figure out what to do about low-level libs. For now, I've removed the 
> envars and modified suggestions.
>
> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>
> Further comment/input is welcome.
>
>
>> On Feb 3, 2021, at 8:09 AM, Ralph Castain via devel 
>>  wrote:
>>
>> What if we do this:
>>
>> - if you are using PMIx v4.1 or above, then there is no problem. Call 
>> PMIx_Load_topology and we will always return a valid pointer to the 
>> topology, subject to the caveat that all members of the process (as well as 
>> the server) must use the same hwloc version.
>>
>> - if you are using PMIx v4.0 or below, then first do a PMIx_Get for 
>> PMIX_TOPOLOGY. If "not found", then try to get the shmem info and adopt it. 
>> If the shmem info isn't found, then do a topology_load to discover the 
>> topology. Either way, when done, do a PMIx_Store_internal of the 
>> hwloc_topology_t using the PMIX_TOPOLOGY key.
>>
>> This still leaves open the question of what to do with low-level libraries 
>> that really don't want to link against PMIx. I'm not sure what to do there. 
>> I agree it is "ugly" to pass an addr in the environment, but there really 
>> isn't any cleaner option that I can see short of asking every library to 
>> provide us with the ability to pass hwloc_topology_t down to them. Outside 
>> of that obvious answer, I suppose we could put the hwloc_topology_t address 
>> into the environment and have them connect that way?
>>
>>
>>> On Feb 3, 2021, at 7:36 AM, Ralph Castain via devel 
>>>  wrote:
>>>
>>> I guess this begs the question: how does a library detect that the shmem 
>>> region has already been mapped? If we attempt to map it and fail, does that 
>>> mean it has already been mapped or that it doesn't exist?
>>>
>>> It isn't reasonable to expect that all the libraries in a process will 
>>> coordinate such that they "know" hwloc has been initialized by the main 
>>> program, for example. So how do they determine that the topology is 
>>> present, and how do they gain access to it?
>>>
>>>
>>>> On Feb 3, 2021, at 6:07 AM, Brice Goglin via devel 
>>>>  wrote:
>>>>
>>>> Hello Ralph
>>>>
>>>> One thing that isn't clear in this document: the hwloc shmem region may
>>>> only be mapped *once* per process (because the mmap address is always
>>>> the same). Hence, if a library calls adopt() in the process, others will
>>>> fail. This applies to the 2nd and 3rd case in "Accessing the HWLOC
>>>> topology tree from clients".
>>>>
>>>> For the 3rd case where low-level libraries don't want to depend on PMIx,
>>>> storing the pointer to the topology in an environment variable might be
>>>> an (ugly) solution.
>>>>
>>>> By the way, you may want to specify somewhere that all these libraries
>>>> using the topology pointer in the process must use the same hwloc
>>>> version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exporter
>>>> and importer are compatible. But passing the topology pointer doesn't
>>>> provide any way to verify that the caller doesn't use its own
>>>> incompatible embedded hwloc.
>>>>
>>>> Brice
>>>>
>>>>
>>>>> On 02/02/2021 at 18:32, Ralph Castain via devel wrote:
>>>>> Hi folks
>>>>>
>>>>> Per today's telecon, here is a link to a description of the HWLOC
>>>>> duplication issue for many-core environments and methods by which you
>>>>> can mitigate the impact.
>>>>>
>>>>> https://openpmix.github.io/support/faq/avoid-hwloc-dup
>>>>> <https://openpmix.github.io/support/faq/avoid-hwloc-dup>
>>>>>
>>>>> George: for lower-level libs like treematch or HAN, you might want to
>>>>> look at the envar method (described about half-way down the page) to
>>>>> avoid directly linking those libraries against PMIx. That wouldn't be
>>>>> a problem while inside OMPI, but could be an issue if people want to
>>>>> use them in a non-PMIx environment.
>>>>>
>>>>> Ralph
>>>>>
>>>
>>
>



Re: [OMPI devel] HWLOC duplication relief

2021-02-03 Thread Brice Goglin via devel
Hello Ralph

One thing that isn't clear in this document: the hwloc shmem region may
only be mapped *once* per process (because the mmap address is always
the same). Hence, if a library calls adopt() in the process, others will
fail. This applies to the 2nd and 3rd case in "Accessing the HWLOC
topology tree from clients".

For the 3rd case where low-level libraries don't want to depend on PMIx,
storing the pointer to the topology in an environment variable might be
an (ugly) solution.

By the way, you may want to specify somewhere that all these libraries
using the topology pointer in the process must use the same hwloc
version (e.g. not 2.0 vs 2.4). shmem_adopt() verifies that the exporter
and importer are compatible. But passing the topology pointer doesn't
provide any way to verify that the caller doesn't use its own
incompatible embedded hwloc.

Brice


On 02/02/2021 at 18:32, Ralph Castain via devel wrote:
> Hi folks
>
> Per today's telecon, here is a link to a description of the HWLOC
> duplication issue for many-core environments and methods by which you
> can mitigate the impact.
>
> https://openpmix.github.io/support/faq/avoid-hwloc-dup
> 
>
> George: for lower-level libs like treematch or HAN, you might want to
> look at the envar method (described about half-way down the page) to
> avoid directly linking those libraries against PMIx. That wouldn't be
> a problem while inside OMPI, but could be an issue if people want to
> use them in a non-PMIx environment.
>
> Ralph
>



Re: [OMPI devel] Git submodules are coming

2020-11-13 Thread Brice Goglin via devel
FYI, this was a git bug that will be fixed soon (the range of commits
being rebased was wrong).

https://lore.kernel.org/git/pull.789.git.1605314085.gitgitgad...@gmail.com/T/#t

https://lore.kernel.org/git/20d6104d-ca02-4ce4-a1c0-2f9386ded...@gmail.com/T/#t

Brice



On 07/02/2020 at 10:27, Brice Goglin wrote:
>
> Hello
>
> I have a git submodule issue that I don't understand.
>
> PR#7367 was initially on top of PR #7366. When Jeff merged PR#7366, I
> rebased my #7367 with git prrs and got this error:
>
> $ git prrs origin master
> From https://github.com/open-mpi/ompi
>  * branch  master -> FETCH_HEAD
> Fetching submodule opal/mca/hwloc/hwloc2/hwloc
> fatal: cannot rebase with locally recorded submodule modifications
>
> I didn't touch the hwloc submodule as far as I can see. The hwloc
> submodule also didn't change in origin/master between before and after
> the rebasing.
>
> $ git submodule status
>  38433c0f5fae0b761bd20e7b928c77f3ff2e76dc opal/mca/hwloc/hwloc2/hwloc 
> (hwloc-2.1.0rc2-33-g38433c0f)
> opal/mca/hwloc/hwloc2/hwloc $ git status
> HEAD detached from f1a2e22a
> nothing to commit, working tree clean
>
> I am not sure what this "HEAD detached ..." is doing here.
>
> I seem to be able to reproduce the issue in my master branch by doing
> "git reset --hard HEAD^". git prrs will then fail the same.
>
> I worked around the issue by manually reapplying all commits from my
> PR on top of master with git cherry-pick, but I'd like to understand
> what's going on. It looks like my submodule is clean but not clean
> enough for a rebase?
>
> Thanks
>
> Brice
>
>
>
On 07/01/2020 at 18:02, Jeff Squyres (jsquyres) via devel wrote:
>> We now have two PRs pending that will introduce the use of Git submodules 
>> (and there are probably more such PRs on the way).  At least one of these 
>> first two PRs will likely be merged "Real Soon Now".
>>
>> We've been talking about using Git submodules forever.  Now we're just about 
>> ready.
>>
>> **
>> *** DEVELOPERS: THIS AFFECTS YOU!! ***
>> **
>>
>> You cannot just "clone and build" any more:
>>
>> -
>> git clone g...@github.com:open-mpi/ompi.git
>> cd ompi && ./autogen.pl && ./configure ...
>> -
>>
>> You will *have* to initialize the Git submodule(s) -- either during or after 
>> the clone.  *THEN* you can build Open MPI.
>>
>> Go read this wiki: https://github.com/open-mpi/ompi/wiki/GitSubmodules
>>
>> May the force be with us!
>>


Re: [OMPI devel] Git submodules are coming

2020-02-07 Thread Brice Goglin via devel
Hello

I have a git submodule issue that I don't understand.

PR#7367 was initially on top of PR #7366. When Jeff merged PR#7366, I
rebased my #7367 with git prrs and got this error:

$ git prrs origin master
From https://github.com/open-mpi/ompi
 * branch  master -> FETCH_HEAD
Fetching submodule opal/mca/hwloc/hwloc2/hwloc
fatal: cannot rebase with locally recorded submodule modifications

I didn't touch the hwloc submodule as far as I can see. The hwloc
submodule also didn't change in origin/master between before and after
the rebasing.

$ git submodule status
 38433c0f5fae0b761bd20e7b928c77f3ff2e76dc opal/mca/hwloc/hwloc2/hwloc 
(hwloc-2.1.0rc2-33-g38433c0f)

opal/mca/hwloc/hwloc2/hwloc $ git status
HEAD detached from f1a2e22a
nothing to commit, working tree clean

I am not sure what this "HEAD detached ..." is doing here.

I seem to be able to reproduce the issue in my master branch by doing
"git reset --hard HEAD^". git prrs will then fail the same.

I worked around the issue by manually reapplying all commits from my PR
on top of master with git cherry-pick, but I'd like to understand what's
going on. It looks like my submodule is clean but not clean enough for a
rebase?

Thanks

Brice



On 07/01/2020 at 18:02, Jeff Squyres (jsquyres) via devel wrote:
> We now have two PRs pending that will introduce the use of Git submodules 
> (and there are probably more such PRs on the way).  At least one of these 
> first two PRs will likely be merged "Real Soon Now".
>
> We've been talking about using Git submodules forever.  Now we're just about 
> ready.
>
> **
> *** DEVELOPERS: THIS AFFECTS YOU!! ***
> **
>
> You cannot just "clone and build" any more:
>
> -
> git clone g...@github.com:open-mpi/ompi.git
> cd ompi && ./autogen.pl && ./configure ...
> -
>
> You will *have* to initialize the Git submodule(s) -- either during or after 
> the clone.  *THEN* you can build Open MPI.
>
> Go read this wiki: https://github.com/open-mpi/ompi/wiki/GitSubmodules
>
> May the force be with us!
>


Re: [OMPI devel] Git submodules are coming

2020-01-07 Thread Brice Goglin via devel
Thanks a lot for writing all this.


At the end
https://github.com/open-mpi/ompi/wiki/GitSubmodules#adding-a-new-submodule-pointing-to-a-specific-commit
should "bar" be "bar50x" in line "$ git add bar" ?

It seems to me that you are in opal/mca/foo and the new submodule is in
"bar50x" (according to "cd opal/mca/foo/bar50x" at the beginning).

There's also a "bar-50x" instead of "bar50x" in line "git submodule add
--name bar-50x ...". Should the submodule name match the directory name?


By the way, in
https://github.com/open-mpi/ompi/wiki/GitSubmodules#updating-the-commit-that-a-submodule-refers-to
you may want to rename hwloc201 into hwloc2 to avoid confusion and match
the current PR.

Brice (who cannot edit the wiki :))




Re: [hwloc-devel] Strange CPU topology numbering on dual socket ARM server with 2×ThunderX2 CN9975

2019-09-06 Thread Brice Goglin
Hello Jirka

I don't think there's a bug here.

physical_package_id values don't have to be between 0 and N-1; they just
have to be distinct so that packages (and cores across packages) can be
identified. Having other values is uncommon on x86 but quite common on
POWER at least.

core_id is even worse, but fortunately those values are basically not
used at all. They are often the same in both sockets, and often
discontiguous inside sockets (maybe because CPU vendors disable specific
cores in the middle of the die when a CPU doesn't have the maximum
number of cores). On a dual-socket 20-core Xeon (Cascade Lake), both
sockets have these core_ids:
0,4,1,3,2,12,8,11,9,10,16,20,17,19,18,28,24,27,25,26 (5-7, 13-15 and
21-23 are missing).

PUs and NUMA nodes often have contiguous OS indexes, but not necessarily
in order either.

FWIW, I get the same values as yours on a Gigabyte platform with 2x
ThunderX2 running RHEL7 4.14 kernel.

Brice



On 06/09/2019 at 15:29, Jiri Hladky wrote:
> Hi all! 
>
> We are seeing strange CPU topology/numbering on a dual-socket ARM
> server with 2×ThunderX2 CN9975 CPU [0].
>
> Package IDs:
> 36 and 3180
> cd /sys/devices/system/cpu
> $ cat cpu0/topology/physical_package_id
> 36
> Expected values: 0 and 1
>
> Core IDs on the second socket:
> 256-283
> $ cat cpu112/topology/core_id
> 256
> $ cat cpu223/topology/core_id
> 283
>
> Expected values for the second socket:
> 28 - 55
>
> (On the first socket, the core numbering is OK - 0-27)
>
> I assume this is a Linux kernel bug. Have you seen anything like this in
> the past? What might be the root cause? A Linux kernel bug or perhaps a BIOS
> issue?
>
> We see it on 5.3.0-0.rc7 and 4.18 kernels. I'm attaching lstopo and
> gather-topology output. I would appreciate any feedback on that. 
>
> Thank you!
> Jirka
>
>
> [0]
> https://en.wikichip.org/wiki/cavium/thunderx2
>
>
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-devel
___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

[hwloc-devel] Signed-off-by in commits now required

2019-07-09 Thread Brice Goglin
Hello

Starting today, we require that all hwloc commits have a Signed-off-by
line with your employer email. Our Github repo is not configured to
actually reject non-signed commits [1], so please try to include that line
so that we don't have to revert some commits later.

See what this line means in the very last section of
https://github.com/open-mpi/ompi/wiki/Administrative-rules

Documentation about signing commits
https://jjasghar.github.io/blog/2016/10/04/signing-commits-in-git/

And about git config variables
https://www.git-scm.com/book/en/v2/Customizing-Git-Git-Configuration

Thanks

Brice

[1] It looks like it would require switching to pull-request-only mode



Re: [OMPI devel] Gentle reminder: sign up for the face to face

2019-02-26 Thread Brice Goglin
Hello Jeff

Looks like I am not allowed to modify the page but I'll be at the meeting ;)

Brice



On 26/02/2019 at 17:13, Jeff Squyres (jsquyres) via devel wrote:
> Gentle reminder to please sign up for the face-to-face meeting and add your 
> items to the wiki:
>
> https://github.com/open-mpi/ompi/wiki/Meeting-2019-04
>

Re: [hwloc-devel] HWLOC - Memory bind

2018-12-03 Thread Brice Goglin
Hello


Which hwloc release are you using? Before hwloc 2.0, libnuma/numactl
devel headers were required for memory binding. The end of the output of
"configure" will tell you whether memory binding is supported when
building such an old release:


-
Hwloc optional build support status (more details can be found above):
[...]
libnuma memory support:  yes
[...]
-


Besides, memory binding is usually useless on laptops because there's a
single NUMA node. All your allocations can only go to one place, so
there's no need to enforce anything. Also some kernels may have NUMA
support disabled on such platforms. You need CONFIG_NUMA:


grep NUMA /boot/config-$(uname -r)
[...]
CONFIG_NUMA=y
[...]



For MPI on a laptop, you may still want to bind execution so that cache
locality is enforced. This is done with hwloc_set_cpubind(). But most
MPI implementations have easier ways to bind their ranks anyway.


Brice




On 03/12/2018 at 08:46, Sa3aD vIp wrote:
> Hello,
>
> first of all, thank you so much for this tool.
> Actually, I have tried to bind memory but with no success.
> I have got this message:
> Couldn't bind to cpuset 0x0001: Function not implemented
> Couldn't bind to cpuset 0x0020: Function not implemented
> Couldn't bind to cpuset 0x0004: Function not implemented
> Couldn't bind to cpuset 0x0008: Function not implemented
> Couldn't bind to cpuset 0x0010: Function not implemented
> Couldn't bind to cpuset 0x0002: Function not implemented
>
>
> My program code:
> int flags;
> hwloc_membind_policy_t policy;
> policy = HWLOC_MEMBIND_BIND;
> flags = HWLOC_MEMBIND_PROCESS;
> hwloc_cpuset_t cpuset;
> cpuset = hwloc_bitmap_alloc();
> if (!cpuset)
> {
>     fprintf(stderr, "failed to allocate a bitmap\n");
>     hwloc_topology_destroy(HWtopology);
> }
> hwloc_get_cpubind(HWtopology, cpuset, 0);
>
> if(hwloc_set_membind(HWtopology, cpuset, policy, flags))
> {
>  char *showStringMSGofERROR;
>  int errorNumber = errno;
>  hwloc_bitmap_asprintf(&showStringMSGofERROR, cpuset);
>  printf("Couldn't bind to cpuset %s: %s\n", showStringMSGofERROR,
> strerror(errorNumber));
>  free(showStringMSGofERROR);
> }
>
>
> If it's not possible to bind on my laptop, is there any option to
> bind memory or cache for a specific MPI process?
>
>
> Thanks a lot in advance.
> SAAD
>
>

Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-11 Thread Brice Goglin
Hello

We now have a CI slave running cygwin on Windows 10. Everything should
work fine in v2.0.3. In the meantime, the only patch that you need to
add on top of what you sent earlier is:

https://github.com/open-mpi/hwloc/commit/3ac7bd3b3bddd763b2e58eff77a2104ea79230af
(for fixing tests/hwloc/x86).

Thanks

Brice



On 30/09/2018 at 20:02, Marco Atzeri wrote:
> Trying to build 2.0.2 on cygwin 64 bit.
>
>   CC   diff.lo
>   CC   shmem.lo
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/hwloc/distances.c:347:5:
> error: conflicting types for ‘hwloc_distances_add’
>  int hwloc_distances_add(hwloc_topology_t topology,
>  ^~~
> In file included from
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/include/hwloc.h:2258:0,
>  from
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/hwloc/distances.c:9:
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/include/hwloc/distances.h:229:20:
> note: previous declaration of ‘hwloc_distances_add’ was here
>  HWLOC_DECLSPEC int hwloc_distances_add(hwloc_topology_t topology,
>     ^~~
> In file included from
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/include/private/private.h:29:0,
>  from
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/hwloc/distances.c:10:
> /cygdrive/d/cyg_pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/hwloc/distances.c:
> In function ‘hwloc__groups_by_distances’:
>
> also adding a HWLOC_DECLSPEC on the first case distances.c:347
> does not solve the issue as the two declaration are not the same.
>
> Suggestion ?
>
>
>
> ---
> Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
> https://www.avast.com/antivirus
>
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-devel


Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-02 Thread Brice Goglin
OK

I pushed your #ifdef fixes and I fixed the printf warning.

I opened 3 issues related to x86 cpuid and OpenProcess failing in lstopo
--ps. Hopefully we'll find a way to play with cygwin here for real in
the near future, and then add that config to our CI.

Brice



Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Brice Goglin
On 02/10/2018 at 00:28, Marco Atzeri wrote:
> On 01.10.2018 at 19:57, Brice Goglin wrote:
>> On 01/10/2018 at 19:22, Marco Atzeri wrote:
>>>
>>
>> Your own machine doesn't matter. None of these tests looks at your CPU or
>> topology. *All* of them should pass on all x86 machines.
>> CPUID calls are emulated by reading files; nothing is read from your local
>> machine's topology. There's just something wrong here that prevents these
>> emulated CPUID files from being read. "lstopo -i ..." will tell you.
>
> $
> /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/build/utils/lstopo/lstopo-no-graphics.exe
>  -i
> /pub/devel/hwloc/hwloc-2.0.2-1.x86_64/src/hwloc-2.0.2/tests/hwloc/x86/AMD-15h-Bulldozer-4xOpteron-6272/
>  --if cpuid --of xml -
> Ignoring dumped cpuid directory.
> 
>
>
> It works instead with "--if xml"
>
> IMHO, should be better to produce an error
> instead of the local machine output with a warning,
> if the input is not understandable

The input is understandable here, but there's a cygwin-related bug
somewhere when we actually try to use it.

--if xml makes no sense here since you're not giving any XML as input.

The error message comes from hwloc_x86_check_cpuiddump_input() failing
in hwloc/topology-x86.c.
That function always prints an error message before returning an error,
except when opendir() fails on the given directory.
The directory was passed by lstopo to the core using environment
variable HWLOC_CPUID_PATH.

Anyway, I have no way to debug this for now so you're stuck with not
running make check in that directory :/

Brice


Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Brice Goglin
On 01/10/2018 at 19:22, Marco Atzeri wrote:
>
>> Unfortunately that test script isn't easy to debug in the v2.x branch.
>> If that OpenProcess is where things fail, I assume that the line that
>> fails is "lstopo --ps". On MinGW, that code is ignored because /proc
>> doesn't exist. Does /proc exist on Cygwin? If so, we should just disable
>> that test line on Windows.
>
> /proc exists but Windows calls of course are not aware of it.
> The failing message in German is coming from the Windows layer,
> as my Cygwin environment is in English.

Actually the error message comes from lstopo itself. We list PIDs from
/proc and then pass them to OpenProcess, which likely fails for
administrator processes. And we abort() instead of returning an error.
I guess things could work but we'd need to setup a cygwin here for
testing, so it'll take some time.

>
>>>
>>> 
>>>
>>> And all the :
>>>
>>> FAIL: Intel-Skylake-2xXeon6140.output
>>> FAIL: Intel-Broadwell-2xXeon-E5-2650Lv4.output
>>> FAIL: Intel-Haswell-2xXeon-E5-2680v3.output
>>> FAIL: Intel-IvyBridge-12xXeon-E5-4620v2.output
>>> FAIL: Intel-SandyBridge-2xXeon-E5-2650.output
>>> FAIL: Intel-Westmere-2xXeon-X5650.output
>>> FAIL: Intel-Nehalem-2xXeon-X5550.output
>>> FAIL: Intel-Penryn-4xXeon-X7460.output
>>> FAIL: Intel-Core-2xXeon-E5345.output
>>> FAIL: Intel-KnightsLanding-XeonPhi-7210.output
>>> FAIL: Intel-KnightsCorner-XeonPhi-SE10P.output
>>> FAIL: AMD-17h-Zen-2xEpyc-7451.output
>>> FAIL: AMD-15h-Piledriver-4xOpteron-6348.output
>>> FAIL: AMD-15h-Bulldozer-4xOpteron-6272.output
>>> FAIL: AMD-K10-MagnyCours-2xOpteron-6164HE.output
>>> FAIL: AMD-K10-Istanbul-8xOpteron-8439SE.output
>>> FAIL: AMD-K8-SantaRosa-2xOpteron-2218.output
>>> FAIL: AMD-K8-SledgeHammer-2xOpteron-250.output
>>> FAIL: Zhaoxin-CentaurHauls-ZXD-4600.output
>>> FAIL: Zhaoxin-Shanghai-KaiSheng-ZXC+-FC1081.output
>>> ###
>>>
>>> But it is not clear to me how these tests should pass.
>>>
>>> The Laptop has a Quad Core I5
>>
>> These tests use a tarball of the output of the cpuid instruction to
>> emulate calling cpuid on those platforms.
>> Go to tests/hwloc/xml, unpack one of the tarballs, and run
>> "/path/to/utils/lstopo/lstopo -i ", you
>> should get more information about what's failing when reading these
>> dumped cpuid outputs.
>> If it doesn't work tests/hwloc/xml/Intel-Skylake-2xXeon6140.output.log
>> will show the difference between the expected and obtained topology when
>> exported to XML.
>
> I saw the difference, and as my machine is different from
> everyone on the list none of the tests can pass.

Your own machine doesn't matter. None of these tests looks at your CPU or
topology. *All* of them should pass on all x86 machines.
CPUID calls are emulated by reading files; nothing is read from your local
machine's topology. There's just something wrong here that prevents these
emulated CPUID files from being read. "lstopo -i ..." will tell you.

Brice


Re: [hwloc-devel] hwloc_distances_add conflicting declaration

2018-10-01 Thread Brice Goglin
On 01/10/2018 at 17:27, Marco Atzeri wrote:
> On 30.09.2018 at 20:11, Samuel Thibault wrote:
>> Marco Atzeri, on Sun. 30 Sept. 2018 20:02:59 +0200, wrote:
>>> also adding a HWLOC_DECLSPEC on the first case distances.c:347
>>> does not solve the issue as the two declaration are not the same.
>>>
>>> Suggestion ?
>>
>> Perhaps use hwloc_uint64_t instead of uint64_t in hwloc/distances.c?
>>
>> Samuel
>
> Thanks Samuel,
> it was that, in more than one place.
>
> The attached patch allowed the compilation on cygwin64 bit.

hwloc_uint64_t is currently defined to DWORDLONG (worked fine on MinGW
and MSVC so far). I'd like to see if there's an easier way to solve this
issue by just making that definition compatible for cygwin.

> FAIL: test-lstopo.sh
>
> that seems to be due to a mix of Cygwin and Windows
>
>  utils/lstopo/test-lstopo.sh.log #
>
> Machine (3665MB total) + Package L#0
>   NUMANode L#0 (P#0 3665MB)
>   L3 L#0 (6144KB)
>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>   PU L#0 (P#0)
>   PU L#1 (P#1)
>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>   PU L#2 (P#2)
>   PU L#3 (P#3)
>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>   PU L#4 (P#4)
>   PU L#5 (P#5)
>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>   PU L#6 (P#6)
>   PU L#7 (P#7)
> OpenProcess 13220 failed 5: Zugriff verweigert

Unfortunately that test script isn't easy to debug in the v2.x branch.
If that OpenProcess is where things fail, I assume that the line that
fails is "lstopo --ps". On MinGW, that code is ignored because /proc
doesn't exist. Does /proc exist on Cygwin? If so, we should just disable
that test line on Windows.


>
> 
>
> And all the :
>
> FAIL: Intel-Skylake-2xXeon6140.output
> FAIL: Intel-Broadwell-2xXeon-E5-2650Lv4.output
> FAIL: Intel-Haswell-2xXeon-E5-2680v3.output
> FAIL: Intel-IvyBridge-12xXeon-E5-4620v2.output
> FAIL: Intel-SandyBridge-2xXeon-E5-2650.output
> FAIL: Intel-Westmere-2xXeon-X5650.output
> FAIL: Intel-Nehalem-2xXeon-X5550.output
> FAIL: Intel-Penryn-4xXeon-X7460.output
> FAIL: Intel-Core-2xXeon-E5345.output
> FAIL: Intel-KnightsLanding-XeonPhi-7210.output
> FAIL: Intel-KnightsCorner-XeonPhi-SE10P.output
> FAIL: AMD-17h-Zen-2xEpyc-7451.output
> FAIL: AMD-15h-Piledriver-4xOpteron-6348.output
> FAIL: AMD-15h-Bulldozer-4xOpteron-6272.output
> FAIL: AMD-K10-MagnyCours-2xOpteron-6164HE.output
> FAIL: AMD-K10-Istanbul-8xOpteron-8439SE.output
> FAIL: AMD-K8-SantaRosa-2xOpteron-2218.output
> FAIL: AMD-K8-SledgeHammer-2xOpteron-250.output
> FAIL: Zhaoxin-CentaurHauls-ZXD-4600.output
> FAIL: Zhaoxin-Shanghai-KaiSheng-ZXC+-FC1081.output
> ###
>
> But it is not clear to me how these tests should pass.
>
> The Laptop has a Quad Core I5

These tests use a tarball of the output of the cpuid instruction to
emulate calling cpuid on those platforms.
Go to tests/hwloc/xml, unpack one of the tarballs, and run
"/path/to/utils/lstopo/lstopo -i ", you
should get more information about what's failing when reading these
dumped cpuid outputs.
If it doesn't work tests/hwloc/xml/Intel-Skylake-2xXeon6140.output.log
will show the difference between the expected and obtained topology when
exported to XML.

Brice


Re: [hwloc-devel] [PATCH] topology-x86: add support for Hygon Dhyana Family 18h processor

2018-07-24 Thread Brice Goglin
Thanks a lot, the code looks good. I pushed it to master. I'll backport
to stable branches once you send cpuid dumps for regression testing :)

Did you test the code by forcing hwloc's x86 backend with
HWLOC_COMPONENTS=x86? Otherwise hwloc uses the Linux backend by default
(which reads sysfs), then the x86 backend does pretty much nothing.

Brice



On 24/07/2018 at 14:49, Pu Wen wrote:
> Hygon Dhyana CPU shares similar architecture with AMD family 17h.
>
> In order to run tools such as lstopo successfully on Hygon platforms,
> enable Hygon support to topology-x86.c. To share AMD's flow, add code
> check for Hygon family ID 18h to run AMD family 17h code path.
>
> Also a new cpu type hygon is added to cpuid_type, and the processor
> vendor string "HygonGenuine" is added too.
>
> Signed-off-by: Pu Wen 
> ---
>  hwloc/topology-x86.c | 26 ++
>  1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/hwloc/topology-x86.c b/hwloc/topology-x86.c
> index ff9e70b..bcaa832 100644
> --- a/hwloc/topology-x86.c
> +++ b/hwloc/topology-x86.c
> @@ -210,6 +210,7 @@ enum cpuid_type {
>intel,
>amd,
>zhaoxin,
> +  hygon,
>unknown
>  };
>  
> @@ -293,13 +294,13 @@ static void look_proc(struct hwloc_backend *backend, 
> struct procinfo *infos, uns
>_extendedmodel  = (eax>>16) & 0xf;
>_family = (eax>>8) & 0xf;
>_extendedfamily = (eax>>20) & 0xff;
> -  if ((cpuid_type == intel || cpuid_type == amd) && _family == 0xf) {
> +  if ((cpuid_type == intel || cpuid_type == amd || cpuid_type == hygon) && 
> _family == 0xf) {
>  infos->cpufamilynumber = _family + _extendedfamily;
>} else {
>  infos->cpufamilynumber = _family;
>}
>if ((cpuid_type == intel && (_family == 0x6 || _family == 0xf))
> -  || (cpuid_type == amd && _family == 0xf)
> +  || ((cpuid_type == amd || cpuid_type == hygon) && _family == 0xf)
>|| (cpuid_type == zhaoxin && (_family == 0x6 || _family == 0x7))) {
>  infos->cpumodelnumber = _model + (_extendedmodel << 4);
>} else {
> @@ -387,12 +388,13 @@ static void look_proc(struct hwloc_backend *backend, 
> struct procinfo *infos, uns
>node_id = 0;
>nodes_per_proc = 1;
>  } else {
> +  /* AMD other families or Hygon family 18h */
>node_id = ecx & 0xff;
>nodes_per_proc = ((ecx >> 8) & 7) + 1;
>  }
>  infos->nodeid = node_id;
>  if ((infos->cpufamilynumber == 0x15 && nodes_per_proc > 2)
> - || (infos->cpufamilynumber == 0x17 && nodes_per_proc > 4)) {
> + || ((infos->cpufamilynumber == 0x17 || infos->cpufamilynumber == 0x18) 
> && nodes_per_proc > 4)) {
>hwloc_debug("warning: undefined nodes_per_proc value %u, assuming it 
> means %u\n", nodes_per_proc, nodes_per_proc);
>  }
>  
> @@ -492,7 +494,7 @@ static void look_proc(struct hwloc_backend *backend, 
> struct procinfo *infos, uns
>/* Get thread/core + cache information from cpuid 0x04
> * (not supported on AMD)
> */
> -  if (cpuid_type != amd && highest_cpuid >= 0x04) {
> +  if ((cpuid_type != amd && cpuid_type != hygon) && highest_cpuid >= 0x04) {
>  unsigned max_nbcores;
>  unsigned max_nbthreads;
>  unsigned level;
> @@ -672,6 +674,15 @@ static void look_proc(struct hwloc_backend *backend, 
> struct procinfo *infos, uns
>   cache->cacheid = (infos->apicid % legacy_max_log_proc) / 
> cache->nbthreads_sharing /* cacheid within the package */
> + 2 * (infos->apicid / legacy_max_log_proc); /* add 2 cache per 
> previous package */
>}
> +} else if (cpuid_type == hygon) {
> +  if (infos->cpufamilynumber == 0x18
> +   && cache->level == 3 && cache->nbthreads_sharing == 6) {
> +/* Hygon family 0x18 always shares L3 between 8 APIC ids,
> + * even when only 6 APIC ids are enabled and reported in 
> nbthreads_sharing
> + * (on 24-core CPUs).
> + */
> +cache->cacheid = infos->apicid / 8;
> +  }
>  }
>}
>  
> @@ -1122,6 +1133,11 @@ static void 
> hwloc_x86_os_state_restore(hwloc_x86_os_state_t *state __hwloc_attri
>  #define AMD_EDX ('e' | ('n'<<8) | ('t'<<16) | ('i'<<24))
>  #define AMD_ECX ('c' | ('A'<<8) | ('M'<<16) | ('D'<<24))
>  
> +/* HYGON "HygonGenuine" */
> +#define HYGON_EBX ('H' | ('y'<<8) | ('g'<<16) | ('o'<<24))
> +#define HYGON_EDX ('n' | ('G'<<8) | ('e'<<16) | ('n'<<24))
> +#define HYGON_ECX ('u' | ('i'<<8) | ('n'<<16) | ('e'<<24))
> +
>  /* (Zhaoxin) CentaurHauls */
>  #define ZX_EBX ('C' | ('e'<<8) | ('n'<<16) | ('t'<<24))
>  #define ZX_EDX ('a' | ('u'<<8) | ('r'<<16) | ('H'<<24))
> @@ -1221,6 +1237,8 @@ int hwloc_look_x86(struct hwloc_backend *backend, int 
> fulldiscovery)
>else if ((ebx == ZX_EBX && ecx == ZX_ECX && edx == ZX_EDX)
>  || (ebx == SH_EBX && ecx == SH_ECX && edx == SH_EDX))
>  cpuid_type = zhaoxin;
> +  else if (ebx == HYGON_EBX && ecx == HYGON_ECX && edx == HYGON_EDX)
> +cpuid_type = hygon;
>  
>hwloc_debug("highest cpuid %x, 

Re: [OMPI devel] About supporting HWLOC 2.0.x

2018-05-24 Thread Brice Goglin
I just pushed my patches rebased on master + update to hwloc 2.0.1 to
bgoglin/ompi (master branch).

My testing of mapping/ranking/binding looks good here (on dual xeon with
CoD, 2 sockets x 2 NUMA x 6 cores).

It'd be nice if somebody else could test on another platform with
different options and/or advanced options (PPR, PE, etc).

Brice




On 23/05/2018 at 17:07, Vallee, Geoffroy R. wrote:
> I totally missed that PR before I sent my email, sorry. It pretty much covers 
> all the modifications I made. :) Let me know if I can help in any way.
>
> Thanks,
>
>> On May 22, 2018, at 11:49 AM, Jeff Squyres (jsquyres)  
>> wrote:
>>
>> Geoffroy -- check out https://github.com/open-mpi/ompi/pull/4677.
>>
>> If all those issues are now moot, great.  I really haven't followed up much 
>> since I made the initial PR; I'm happy to have someone else take it over...
>>
>>
>>> On May 22, 2018, at 11:46 AM, Vallee, Geoffroy R.  wrote:
>>>
>>> Hi,
>>>
>>> HWLOC 2.0.x support was brought up during the call. FYI, I am currently 
>>> using (and still testing) hwloc 2.0.1 as an external library with master 
>>> and I did not face any major problem; I only had to fix minor things, 
>>> mainly for putting the HWLOC topology in a shared memory segment. Let me 
>>> know if you want me to help with the effort of supporting HWLOC 2.0.x.
>>>
>>> Thanks,
>>
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>>
>>


Re: [OMPI devel] About supporting HWLOC 2.0.x

2018-05-22 Thread Brice Goglin
Sorry guys, I think I've had all the patches ready since the F2F meeting,
but I couldn't test them enough because ranking was broken. I'll work on
that by next week.

Brice



On 22/05/2018 at 17:50, r...@open-mpi.org wrote:
> I’ve been running with hwloc 2.0.1 for quite some time now without problem, 
> including use of the shared memory segment. It would be interesting to hear 
> what changes you had to make.
>
> However, that said, there is a significant issue in ORTE when trying to 
> map-by NUMA as hwloc 2.0.1 no longer associates cpus with NUMA regions. So 
> you’ll get an error when you try it. Unfortunately, that is the default 
> mapping policy when #procs > 2.
>
>
>> On May 22, 2018, at 8:46 AM, Vallee, Geoffroy R.  wrote:
>>
>> Hi,
>>
>> HWLOC 2.0.x support was brought up during the call. FYI, I am currently 
>> using (and still testing) hwloc 2.0.1 as an external library with master and 
>> I did not face any major problem; I only had to fix minor things, mainly for 
>> putting the HWLOC topology in a shared memory segment. Let me know if you 
>> want me to help with the effort of supporting HWLOC 2.0.x.
>>
>> Thanks,

[hwloc-devel] libhwloc soname change in 2.0.1rc1

2018-03-21 Thread Brice Goglin
Hello

In case you missed the announcement yesterday, hwloc 2.0.1rc1 changes the
library soname from 12:0:0 to 15:0:0. On Linux, it means that we'll now
build libhwloc.so.15 instead of libhwloc.so.12. That means any
application built for hwloc 2.0.0 will need to be recompiled against 2.0.1.

I should have set the soname to 15:0:0 in 2.0.0 but I forgot. It may
cause issues because hwloc 1.11.x uses 12:x:y (we have "12" in both).
Given that 2.0.0 isn't widely used yet, I hope this way-too-late change
won't cause too many issues. Sorry.

As said on the download page, we want people to stop using 2.0.0 so that
we can forget this issue. If you already switched to hwloc 2.0.0 (and if
some applications are linked with libhwloc), please try to upgrade to
2.0.1 as soon as possible (final release expected next monday).

Brice



Re: [hwloc-devel] [SCM] open-mpi/hwloc branch master updated. 14e727976867931a2eb74f2630b0ce9137182874

2018-02-05 Thread Brice Goglin
configure only looks for CL/cl_ext.h before enabling the OpenCL backend.
Did it enable OpenCL on your machine?

On my VMs, there's
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/OpenCL.framework/Versions/A/Headers/cl_ext.h
(and OpenCL currently doesn't get enabled in hwloc)

Brice


On 05/02/2018 at 11:57, 'Gitdub ' wrote:
> This is an automated email from the git hooks/post-receive script. It was
> generated because a ref change was pushed to the repository containing
> the project "open-mpi/hwloc".
>
> The branch, master has been updated
>via  14e727976867931a2eb74f2630b0ce9137182874 (commit)
>   from  24214085271f0cc9865300ba0a4e34699e4ad892 (commit)
>
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
>
> - Log -
> https://github.com/open-mpi/hwloc/commit/14e727976867931a2eb74f2630b0ce9137182874
>
> commit 14e727976867931a2eb74f2630b0ce9137182874
> Author: Samuel Thibault 
> Date:   Mon Feb 5 11:56:42 2018 +0100
>
> Fix including OpenCL headers on MacOS
>
> diff --git a/hwloc/topology-opencl.c b/hwloc/topology-opencl.c
> index 3931977..e11dc60 100644
> --- a/hwloc/topology-opencl.c
> +++ b/hwloc/topology-opencl.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright © 2012-2018 Inria.  All rights reserved.
> - * Copyright © 2013 Université Bordeaux.  All right reserved.
> + * Copyright © 2013, 2018 Université Bordeaux.  All right reserved.
>   * See COPYING in top-level directory.
>   */
>  
> @@ -12,7 +12,11 @@
>  #include 
>  #include 
>  
> +#ifdef __APPLE__
> +#include 
> +#else
>  #include 
> +#endif
>  
>  static int
>  hwloc_opencl_discover(struct hwloc_backend *backend)
> diff --git a/include/hwloc/opencl.h b/include/hwloc/opencl.h
> index d97fe5d..058968d 100644
> --- a/include/hwloc/opencl.h
> +++ b/include/hwloc/opencl.h
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright © 2012-2018 Inria.  All rights reserved.
> - * Copyright © 2013 Université Bordeaux.  All right reserved.
> + * Copyright © 2013, 2018 Université Bordeaux.  All right reserved.
>   * See COPYING in top-level directory.
>   */
>  
> @@ -21,8 +21,13 @@
>  #include 
>  #endif
>  
> +#ifdef __APPLE__
> +#include 
> +#include 
> +#else
>  #include 
>  #include 
> +#endif
>  
>  #include 
>  
> diff --git a/tests/hwloc/opencl.c b/tests/hwloc/opencl.c
> index 87400e9..74592b7 100644
> --- a/tests/hwloc/opencl.c
> +++ b/tests/hwloc/opencl.c
> @@ -5,7 +5,11 @@
>  
>  #include 
>  #include 
> +#ifdef __APPLE__
> +#include 
> +#else
>  #include 
> +#endif
>  #include 
>  #include 
>  
>
>
> ---
>
> Summary of changes:
>  hwloc/topology-opencl.c | 6 +-
>  include/hwloc/opencl.h  | 7 ++-
>  tests/hwloc/opencl.c| 4 
>  3 files changed, 15 insertions(+), 2 deletions(-)
>
>
> hooks/post-receive


[OMPI devel] hwloc issues in this week telcon?

2018-01-31 Thread Brice Goglin
Hello

Two hwloc issues are listed in this week telcon:

"hwloc2 WIP, may need help with."
https://github.com/open-mpi/ompi/pull/4677
* Is this really a 3.0.1 thing? I thought hwloc2 was only for 3.1+
* As I replied in this PR, I have some patches but I need help for
testing them. Can you list some good test cases?

"Issue - hwloc can't handle cuda from a different location"
I have no idea what this is about. Is there a github issue for this?

Thanks
Brice



Re: [hwloc-devel] No opencl osdev for NVidia GPU devices

2018-01-08 Thread Brice Goglin
On 27/09/2017 at 20:39, Brice Goglin wrote:
>
> On 27/09/2017 at 18:58, Samuel Thibault wrote:
>> Hello,
>>
>> On systems with NVidia GPU devices, opencl devices don't show up in
>> lstopo. Running in debug mode shows that they are detected:
>>
>> 1 OpenCL platforms
>> This is opencl0d0
>>
>> but then topology-opencl.c stops there and does not create an osdev
>> object. AMD GPUs with non-PCIe device type are not reported either.
>>
>> I know that in these cases, we don't know where the GPUs are connected
>> exactly, but we already have code coping with this:
>>
>> if (!parent)
>>   parent = hwloc_get_root_obj(topology);
>>
>> Isn't it better to show OpenCL at the root rather then not at all?
>>
> As you want.
> If there's a need for these objects without any topology information,
> that's fine with me.
>

I think I have fixed everything needed to make this work in my opencl
branch.
https://github.com/open-mpi/hwloc/pull/266

Do we want to see OpenCL CPU devices too?
Something like this attached to the root (even on dual-socket machines):
Co-Processor(OpenCL) L#7 (Backend=OpenCL OpenCLDeviceType=CPU
GPUVendor=GenuineIntel GPUModel="Intel(R) Xeon(R) CPU E5-2650 0 @
2.00GHz" OpenCLPlatformIndex=0 OpenCLPlatformName="AMD Accelerated
Parallel Processing" OpenCLPlatformDeviceIndex=2 OpenCLComputeUnits=32
OpenCLGlobalMemorySize=32869688) "opencl0d2"

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel


Re: [hwloc-devel] get_last_cpu_location for process (Was: Hardware locality (hwloc) v2.0.0-beta1 released)

2017-12-29 Thread Brice Goglin
On 28/12/2017 at 19:20, Samuel Thibault wrote:
>> we can add a hwloc env var to disable process-wide asserts.
> So I would set it for all Debian builds? (we don't have a fixed set of
> archs which are built inside qemu-user)
>

Yes, if that's OK for you.

I couldn't test since binding doesn't seem to work in my qemu (always
goes to PU #0), even when using qemu-x86_64 on x86_64. Is this fixed
with your patches sent to qemu-devel yesterday?
Also sched_getcpu() isn't implemented in my qemu, but I can disable it
in config.h after configure.

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [hwloc-devel] get_last_cpu_location for process (Was: Hardware locality (hwloc) v2.0.0-beta1 released)

2017-12-28 Thread Brice Goglin
On 28/12/2017 at 16:18, Samuel Thibault wrote:
> Samuel Thibault, on Thu 28 Dec 2017 15:08:30 +0100, wrote:
>> Samuel Thibault, on Wed 20 Dec 2017 18:32:48 +0100, wrote:
>>> I have uploaded it to debian experimental, so when it passes NEW,
>>> various arch test results will show up on 
>>>
>>> https://buildd.debian.org/status/package.php?p=hwloc&suite=experimental
>>>
>>> so you can check the results on odd systems :)
>> FI, the failure on m68k is due to a bug in qemu's linux-user emulation,
>> which I'm currently fixing.
> There is however an issue with the hwloc_get_last_cpu_location test when
> run inside qemu's linux-user emulation, because in that case qemu
> introduces a thread for its own purposes in addition to the normal
> thread, and then the test looks like this:

Can you clarify what this qemu linux-user emulation does? Is it
emulating each process of a "fake-VM" inside a dedicated process on the
host?
Any idea when this could be useful besides (I guess) cross-building
platforms?

> I'm tid 15573
> trying 0x0003 1
> setaffinity 15573 3 gave 0
> setaffinity 15606 3 gave 0
> getting last location for 15573
> got 0
> getting last location for 15606
> got 2
> got 0x0005
> hwloc_get_last_cpu_location: hwloc_get_last_cpu_location.c:38: check: 
> Assertion `hwloc_bitmap_isincluded(last, set)' failed.
>
> I.e. when trying check(set, HWLOC_CPUBIND_PROCESS);, the
> hwloc_set_cpubind() call does bind the two threads of the process, and
> then looks for the CPU locations of the two threads, but probably thread
> 15606 didn't actually run in between, and thus the last CPU location is
> still with the old binding, and that fails the assertion.
>
> Of course, in the Debian package I could patch over this test to ignore
> the failure, possibly by blacklisting architectures which are known to
> be built inside qemu, but it could pose problem more generally. Perhaps
> we should use the attached patch, to try to check inclusion only from
> the result of the current-thread-only method?

If this failure isn't expected to matter in normal cases, I'd like to
keep the opportunity to check the process-wide cpulocation when possible.
I couldn't find an env var for detecting qemu user-emulation, but we can
add a hwloc env var to disable process-wide asserts. Something like the
attached patch (which adds lots of printf and just disables the assert
if flags!=THREAD or if HWLOC_TEST_DONTCHECK_PROC_CPULOCATION is set in
the env).

Brice

diff --git a/tests/hwloc/hwloc_get_last_cpu_location.c b/tests/hwloc/hwloc_get_last_cpu_location.c
index 03ab103b..01373a7d 100644
--- a/tests/hwloc/hwloc_get_last_cpu_location.c
+++ b/tests/hwloc/hwloc_get_last_cpu_location.c
@@ -5,6 +5,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 
@@ -13,21 +14,29 @@
 hwloc_topology_t topology;
 const struct hwloc_topology_support *support;
 
+static int checkprocincluded;
+
 /* check that a bound process execs on a non-empty cpuset included in the binding */
 static int check(hwloc_const_cpuset_t set, int flags)
 {
   hwloc_cpuset_t last;
   int ret;
 
+  printf("  binding\n");
   ret = hwloc_set_cpubind(topology, set, flags);
   if (ret)
 return 0;
 
+  printf("  getting cpu location\n");
   last = hwloc_bitmap_alloc();
   ret = hwloc_get_last_cpu_location(topology, last, flags);
   assert(!ret);
   assert(!hwloc_bitmap_iszero(last));
-  assert(hwloc_bitmap_isincluded(last, set));
+
+  if (flags == HWLOC_CPUBIND_THREAD || checkprocincluded) {
+printf("  checking inclusion\n");
+assert(hwloc_bitmap_isincluded(last, set));
+  }
 
   hwloc_bitmap_free(last);
   return 0;
@@ -35,12 +44,18 @@ static int check(hwloc_const_cpuset_t set, int flags)
 
 static int checkall(hwloc_const_cpuset_t set)
 {
-  if (support->cpubind->get_thisthread_last_cpu_location)
+  if (support->cpubind->get_thisthread_last_cpu_location) {
+printf(" with HWLOC_CPUBIND_THREAD...\n");
 check(set, HWLOC_CPUBIND_THREAD);
-  if (support->cpubind->get_thisproc_last_cpu_location)
+  }
+  if (support->cpubind->get_thisproc_last_cpu_location) {
+printf(" with HWLOC_CPUBIND_PROCESS...\n");
 check(set, HWLOC_CPUBIND_PROCESS);
-  if (support->cpubind->get_thisthread_last_cpu_location || support->cpubind->get_thisproc_last_cpu_location)
+  }
+  if (support->cpubind->get_thisthread_last_cpu_location || support->cpubind->get_thisproc_last_cpu_location) {
+printf(" with flags 0...\n");
 check(set, 0);
+  }
   return 0;
 }
 
@@ -49,24 +64,29 @@ int main(void)
   unsigned depth;
   hwloc_obj_t obj;
 
+  checkprocincluded = (NULL == getenv("HWLOC_TEST_DONTCHECK_PROC_CPULOCATION"));
+
   hwloc_topology_init();
   hwloc_topology_load(topology);
 
   support = hwloc_topology_get_support(topology);
 
   /* check at top level */
+  printf("testing at top level\n");
   obj = hwloc_get_root_obj(topology);
   checkall(obj->cpuset);
 
   depth = hwloc_topology_get_depth(topology);
   /* check at intermediate level if it exists */
   if (depth >= 3) {
+

Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v2.0.0-beta1 released

2017-12-22 Thread Brice Goglin
On 22/12/2017 at 11:42, Samuel Thibault wrote:
> Hello
>
> Brice Goglin, on Tue 19 Dec 2017 11:48:39 +0100, wrote:
>>   + Memory, I/O and Misc objects are now stored in dedicated children lists,
>> not in the usual children list that is now only used for CPU-side 
>> objects.
>> - hwloc_get_next_child() may still be used to iterate over these 4 lists
>>   of children at once.
> I hadn't realized this before: so the NUMA-related hierarchy level can
> not be easily obtained with hwloc_get_type_depth and such, that's really
> a concern. For instance in slurm-llnl one can find
>
>   if (hwloc_get_type_depth(topology, HWLOC_OBJ_NODE) >
>   hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET)) {
>
> and probably others are doing this too, e.g. looking up from a CPU to
> find the NUMA level becomes very different from looking up from a cPU to
> find the L3 level etc.
>
> Instead of moving these objects to another place which is very
> different to find, can't we rather create another type of object, e.g.
> HWLOC_OBJ_MEMORY, to represent the different kinds of memories that can
> be found in a given NUMA level, and keep HWLOC_OBJ_NODE as it is?
>

That won't work. You can have memory attached at different levels of the
hierarchy (things like HBM inside a die, normal memory attached to a
package, and slow memory attached to the memory interconnect). The
notion of NUMA node and proximity domain is changing. It's not a set of
CPU+memory anymore. Things are moving towards the separation of "memory
initiator" (CPUs) and "memory target" (memory banks, possibly behind
memory-side caches). And those targets can be attached to different things.



I agree that finding local NUMA nodes is harder now. I thought about
having an explicit type saying "I have memory children, other don't"
(you propose NUMA with MEMORY children, I rather thought about MEMORY
with NUMA children because people are used to NUMA node numbers, and
memory-bind to NUMA nodes). But again, there's no guarantee that they
will be at the same depth in the hierarchy since they might be attached
to different kinds of resources. Things like comparing their depth with
socket depth won't work either. So we'd end up with multiple levels just
like Groups.

I will add helpers to simplify the lookup (give me my local NUMA node if
there's a single one, give me the number of "normal" NUMA nodes so I can
split the machine in parts, ...) but it's too early to add these, we
need more feedback first.



About Slurm-llnl, their code is obsolete anyway. NUMA is inside Socket
in all modern architectures. So they expose a "Socket" resource that is
actually a NUMA node. They used an easy way to detect whether there are
multiple NUMAs per socket or the contrary. We can still detect that in
v2.0, even if the code is different. Once we understand what they
*really* want to do, we'll help them update that code to v2.0.

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [OMPI devel] hwloc2 and cuda and non-default cudatoolkit install location

2017-12-20 Thread Brice Goglin
On 20/12/2017 at 22:01, Howard Pritchard wrote:
>
> I can think of several ways to fix it.  Easiest would be to modify the
>
> opal/mca/hwloc/hwloc2a/configure.m4
>
> to not set --enable-cuda if --with-cuda is evaluated to something
> other than yes.
>
>
> Optionally, I could fix the hwloc configury to use a --with-cuda
> argument rather than an --enable-cuda configury argument.  Would 
>
> such a configury argument change be traumatic for the hwloc community?
>
> I think it would be weird to have both an --enable-cuda and a
> --with-cuda configury argument for hwloc.
>
>

Hello

hwloc currently only has --enable-foo configure options, but very few
--with-foo. We rely on pkg-config and variables for setting dependency
paths.

OMPI seems to use --enable for enabling features, and --with for
enabling dependencies and setting dependency paths. If that's the
official recommended way to choose between --enable and --with, maybe
hwloc should just replace many --enable-foo with --with-foo ? But I tend
to think we should support both to ease the transition?

Brice

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v2.0.0-beta1 released

2017-12-20 Thread Brice Goglin
Thanks.

Your machine is very similar to mine, running a similar Debian, with
4.14 too. And my local build still doesn't crash. Maybe a different
compiler is causing the bug to appear more often on yours (Debian's
7.2.0-16 here). Let's forget about it if it's fixed now :)

Brice



On 20/12/2017 at 18:26, Samuel Thibault wrote:
> Brice Goglin, on Wed 20 Dec 2017 18:16:34 +0100, wrote:
>> On 20/12/2017 at 18:06, Samuel Thibault wrote:
>>> It has only one NUMA node, thus triggering the code I patched over.
>> Well, this has been working fine for a while, since that's my daily
>> development machine and all our jenkins slaves.
>>
>> Can you give the usually requested details about the OS, kernel,
>> hwloc-gather-topology? hwloc-gather-cpuid if the x86 backend is involved?
> Your commit 301c0f94e0a54823bfd530c36b5f9c9d9862332b seems to have fixed
> it.
>
> It's Debian Buster, kernel 4.14.0, and attached gathers.
>
> Samuel
>
>
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v2.0.0-beta1 released

2017-12-20 Thread Brice Goglin
On 20/12/2017 at 18:06, Samuel Thibault wrote:
> It has only one NUMA node, thus triggering the code I patched over.

Well, this has been working fine for a while, since that's my daily
development machine and all our jenkins slaves.

Can you give the usually requested details about the OS, kernel,
hwloc-gather-topology? hwloc-gather-cpuid if the x86 backend is involved?

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [hwloc-devel] [hwloc-announce] Hardware locality (hwloc) v2.0.0-beta1 released

2017-12-20 Thread Brice Goglin
On 20/12/2017 at 17:49, Samuel Thibault wrote:
> Samuel Thibault, on Wed 20 Dec 2017 13:57:45 +0100, wrote:
>> Brice Goglin, on Tue 19 Dec 2017 11:48:39 +0100, wrote:
>>> The Hardware Locality (hwloc) team is pleased to announce the first
>>> beta release for v2.0.0:
>>>
>>>http://www.open-mpi.org/projects/hwloc/
>> I tried to build the Debian package, there are a few failures in the
>> testsuite:
>>
>> FAIL: test-lstopo.sh
>> FAIL: hwloc_bind
>> FAIL: hwloc_get_last_cpu_location
>> FAIL: hwloc_get_area_memlocation
>> FAIL: hwloc_object_userdata
>> FAIL: hwloc_backends
>> FAIL: hwloc_pci_backend
>> FAIL: hwloc_is_thissystem
>> FAIL: hwloc_topology_diff
>> FAIL: hwloc_topology_abi
>> FAIL: hwloc_obj_infos
>> FAIL: glibc-sched
>> ../.././config/test-driver: line 107: 27886 Segmentation fault  "$@" > 
>> $log_file 2>&1
>> FAIL: hwloc-hello
>> ../.././config/test-driver: line 107: 27905 Segmentation fault  "$@" > 
>> $log_file 2>&1
>> FAIL: hwloc-hello-cpp
>>
>> This is running inside a Debian Buster system.
> It seems to be fixed by the attached patch.
>

I can't reproduce the issue, what's specific about your system? I tried
inside a debian-build-chroot, etc, ...

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel


[hwloc-devel] RFCs about latest API changes

2017-11-19 Thread Brice Goglin
Hello

Here are 4 pull requests about the likely-last significant API changes
for hwloc 2.0. You'll get more details by clicking on the links. I'll
merge these next week unless somebody complains.

Only maintain allowed_cpuset and allowed_nodeset for the entire topology
https://github.com/open-mpi/hwloc/pull/277

Make all depths *signed* ints
https://github.com/open-mpi/hwloc/pull/276

Remove the "System" object type
https://github.com/open-mpi/hwloc/pull/275

Move local_memory to NUMA node specific attrs
https://github.com/open-mpi/hwloc/pull/274

Brice





On 26/10/2017 at 17:36, Brice Goglin wrote:
> Hello
>
> I finally merged the new memory model in master (mainly for properly
> supporting KNL-like heterogeneous memory). This was the main and last
> big change for hwloc 2.0. I still need to fix some caveats (and lstopo
> needs to better display NUMA nodes) but that part of the API should be
> ready.
>
> Now we encourage people to start porting their code to the new hwloc 2.0
> API. Here's a guide that should answer most questions about the upgrade:
> https://github.com/open-mpi/hwloc/wiki/Upgrading-to-v2.0-API
>
> The final 2.0 release isn't planned before at least the end of november,
> but we need to fix API issues before releasing it. So please start
> testing it and report issues, missing docs, etc. If there's any existing
> function/feature (either new or old) that needs to be changed, please
> report it too. We're only breaking the ABI once for 2.0, we cannot break
> it again 2 months later.
>
>
> Tarballs of git master are already available from
> https://ci.inria.fr/hwloc/job/master-0-tarball/lastBuild/
> and in nightly snapshots on the website starting tomorrow.
>
>
> There are still a couple things that may or may not change before the
> final 2.0 API. If you have an opinion, please let us know.
> * (likely) Make all depths *signed* ints: some objects have a negative
> depth (meaning "special depth", not a normal depth in the main tree).
> You'll have to cast to (int) whenever you printf a depth while
> supporting both hwloc 1.x and 2.x.
> * (likely) Drop obj->allowed_cpuset (and allowed_nodeset) and just keep
> one for the entire topology: It is very rarely used (only when you set
> HWLOC_TOPOLOGY_FLAG_WHOLESYSTEM) and can be emulated by doing a binary
> "and" or "intersects" between obj->cpuset and topology->allowed_cpuset.
> * (likely) obj->memory becomes obj->attr->numanode since it's only used
> for numa nodes. But obj->total_memory should remain in obj because it's
> available in all objects (accumulated memory in all children).
> * (likely) Remove HWLOC_OBJ_SYSTEM: not used anymore (we don't support
> multinode topologies anymore). The root is always MACHINE now. I guess
> we'd #define SYSTEM MACHINE so that you don't have to change your code.
> * (unlikely) rename some info objects for consistency (examples below).
>   + GPUVendor and PCIVendor and CPUVendor -> Vendor.
>   + GPUModel and PCIDevice and CPUModel -> Model
>   + NVIDIASerial and MICSerialNumber -> SerialNumber
> But that will make your life harder for looking up attributes while
> supporting hwloc 1.x and 2.x. And XML import from 1.x would be more
> expensive since we'd have to rename these.
> * (unlikely) Share information between osdev (e.g. eth0 or cuda0) and
> pcidev: Lots of attributes are identical (Vendor, Model, kind of device
> etc). We could merge those objects into a single generic "I/O object".
> However a single PCI device can contain multiple OS devices (for
> instance "mlx5_0"+"ib0", or "cuda0"+"opencl0d0", etc).
>
>
> --
> Brice
>

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [OMPI devel] HWLOC / rmaps ppr build failure

2017-10-04 Thread Brice Goglin
Looks like you're using a hwloc < 1.11. If you want to support this old
API while using the 1.11 names, you can add this to OMPI after #include <hwloc.h>:

#if HWLOC_API_VERSION < 0x00010b00
#define HWLOC_OBJ_NUMANODE HWLOC_OBJ_NODE
#define HWLOC_OBJ_PACKAGE HWLOC_OBJ_SOCKET
#endif

Brice




On 04/10/2017 at 19:54, Barrett, Brian via devel wrote:
> It looks like a change in either HWLOC or the rmaps ppr component is causing 
> Cisco build failures on master for the last couple of days:
>
>   https://mtt.open-mpi.org/index.php?do_redir=2486
>
> rmaps_ppr.c:665:17: error: ‘HWLOC_OBJ_NUMANODE’ undeclared (first use in this 
> function); did you mean ‘HWLOC_OBJ_NODE’?
>  level = HWLOC_OBJ_NUMANODE;
>  ^~
>  HWLOC_OBJ_NODE
> rmaps_ppr.c:665:17: note: each undeclared identifier is reported only once 
> for each function it
>
> Can someone take a look?
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [hwloc-devel] C99 will be back soon

2017-09-28 Thread Brice Goglin
Turns out Microsoft Visual Studio doesn't support most of C99
(at least dynamic arrays and mixed declaration/code).

I am using some glue for now. However I am not 100% sure
I'll remain happy to restrict hwloc just to support such a
dumb compiler (MinGW doesn't have any issue with C99).

Brice




On 27/09/2017 at 20:52, Brice Goglin wrote:
> Hello
>
> hwloc was using C99 7 years ago. We had to revert that in hwloc 1.2
> because some software embedding hwloc needed to support some non-C99
> compilers. Things have changed and C99 is well supported now. So we're
> going to use C99 in hwloc, at least for simple features such as dynamic
> arrays on the stack and designated structure initializer. configure will
> fail if the compiler isn't C99.
>
> Brice
>

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [hwloc-devel] Python bindings for hwloc

2017-09-02 Thread Brice Goglin
Hello
Thanks a lot!
I updated the links on the website.
Note that your code should work up to 1.11.8 except for one new topology
flag added in 1.11.6.
I am not very good at Python but I could help finishing CUDA support if
you tell me what's missing and where to look.
Brice


On 02/09/2017 at 00:32, Guy Streeter wrote:
> I'm still retired, but I found time to update python-hwloc for hwloc
> version 1.11.5, the version currently shipped in Fedora 26. I made
> some bug-fixes along the way, and built new versions for Fedora 25 and
> Cento 7 as well.
>
> There are 2 important hosting changes for python-hwloc: the
> fedorahosted git server has been deactivated, and I no longer host
> files at redhat.com.
>
> The new location of the python-hwloc git tree is
>  https://gitlab.com/guystreeter/python-hwloc
>
> RPM repos for Fedora 25 and 26 and Centos 7 (EPEL) are found at
>  https://copr.fedorainfracloud.org/coprs/streeter/python-hwloc/
>
> Other important changes:
>
> Documentation! I wrote a programmer's guide of sorts. It describes all
> the classes and their methods, and has a few examples.
>
> For non-RPM distros, setup.py should now be usable to build and
> install python-hwloc. Just run it with python2 or python3 (or both) to
> build the correct version. Building requires Cython, and the
> development files for several other packages. See the lines containing
> "BuildRequires" in the top-level python-hwloc.spec file for a list.
>
> One last note about the state of python-hwloc development: I don't
> have access to a system with CUDA development files or CUDA devices,
> so I have not completed the implementation of CUDA-related features.
>
> I'm in the process of writing (slowly, I'm still retired) a GUI
> program in Python to assist in setting process affinities. I'll post
> when I have it in a usable state.
>
> Let me know if you are interested in seeing any changes, or have any
> problems, in python-hwloc.
>
> regards,
> --Guy Streeter
>
>
>
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [OMPI devel] KNL/hwloc funny message question

2017-09-01 Thread Brice Goglin
Hello

This message is related to /var/cache/hwloc/knl_memoryside_cache. This
file exposes the KNL cluster and MCDRAM configuration, only accessible
from root-only files. hwloc-dump-hwdata runs at boot time to create that
file, and non-root hwloc users can read it later. Failing to read that
file basically means that hwloc won't be able to expose MCDRAM *cache*
information.

Intel had to change the file format in hwloc 1.11.4 to expose more info.
Earlier releases cannot read it unfortunately. So the error means that
your OMPI uses a hwloc 1.11.[23] while the above file was dumped by
hwloc >= 1.11.4 running in a boot-time service.

I thought we changed that warning to verbose-only, but looks like we didn't.


There's no human intervention after you subscribe on
https://lists.open-mpi.org/mailman/listinfo/hwloc-users. I don't see you
on the current members.

Brice






On 01/09/2017 at 18:26, Howard Pritchard wrote:
> Hi Folks,
>
> I just now subscribed to the hwloc user mail list, but I suspect that
> requires human intervention to get on, and that might not mean
> something happening till next week.
>
> Alas google has failed me in helping to understand the message.
>
> So, I decided to post to Open MPI devel list and see if I get a response.
>
> Here's what I see with Open MPI 2.1.1:
>
> srun -n 4 ./hello_c
>
> Invalid knl_memoryside_cache header, expected "version: 1".
>
> Invalid knl_memoryside_cache header, expected "version: 1".
>
> Invalid knl_memoryside_cache header, expected "version: 1".
>
> Invalid knl_memoryside_cache header, expected "version: 1".
>
> Hello, world, I am 0 of 4, (Open MPI v2.1.1rc1, package: Open MPI
> dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
> v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)
>
> Hello, world, I am 1 of 4, (Open MPI v2.1.1rc1, package: Open MPI
> dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
> v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)
>
> Hello, world, I am 2 of 4, (Open MPI v2.1.1rc1, package: Open MPI
> dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
> v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)
>
> Hello, world, I am 3 of 4, (Open MPI v2.1.1rc1, package: Open MPI
> dshrader@tt-fey1 Distribution, ident: 2.1.1rc1, repo rev:
> v2.1.1-4-g5ded3a2d, Unreleased developer copy, 142)
>
> Anyone know what might be causing hwloc to report this invalid
> knl_memoryside_cache header thingy?
>
> Thanks for any help,
>
> Howard
>
>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [hwloc-devel] [SCM] open-mpi/hwloc branch master updated. 37eb93c7dfeca1a0ce84474bac9d2f234bcbacd4

2017-08-29 Thread Brice Goglin


On 29/08/2017 at 18:58, Samuel Thibault wrote:
> Well, for coherency only, if you prefer long command-lines there, I'm
> fine with it.

That's actually the difference I like between lstopo --ps and hwloc-ps.
Few details but nice output for humans in the former. More details in a
machine-readable format in the latter, and easier to tweak with
--pid-cmd or pipes.

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel

Re: [hwloc-devel] [SCM] open-mpi/hwloc branch master updated. 37eb93c7dfeca1a0ce84474bac9d2f234bcbacd4

2017-08-29 Thread Brice Goglin
Contrary to lstopo, hwloc-ps has no problem with long command-lines.
What's the point of shortening to comm here?

Brice




On 29/08/2017 at 18:27, 'Gitdub ' wrote:
> This is an automated email from the git hooks/post-receive script. It was
> generated because a ref change was pushed to the repository containing
> the project "open-mpi/hwloc".
>
> The branch, master has been updated
>via  37eb93c7dfeca1a0ce84474bac9d2f234bcbacd4 (commit)
>   from  be52ce0ca44b36cc50b55b0ab2cf784f1a2a1c75 (commit)
>
> Those revisions listed above that are new to this repository have
> not appeared on any other notification email; so we list those
> revisions in full, below.
>
> - Log -
> https://github.com/open-mpi/hwloc/commit/37eb93c7dfeca1a0ce84474bac9d2f234bcbacd4
>
> commit 37eb93c7dfeca1a0ce84474bac9d2f234bcbacd4
> Author: Samuel Thibault 
> Date:   Tue Aug 29 18:25:36 2017 +0200
>
> hwloc-ps: harmonize with lstopo --ps
> 
> Harmonize the hwloc-ps source code for getting process name with lstopo 
> --ps
> source code.  This thus gets 11e1957 ('show only comm of processes').
> 
> (cherry picked from commit d801688ef7e5fbf22c7259f679d350d6f41ebea1)
>
> diff --git a/utils/hwloc/hwloc-ps.c b/utils/hwloc/hwloc-ps.c
> index 095e8c7..c5256a9 100644
> --- a/utils/hwloc/hwloc-ps.c
> +++ b/utils/hwloc/hwloc-ps.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright © 2009-2017 Inria.  All rights reserved.
> - * Copyright © 2009-2012 Université Bordeaux
> + * Copyright © 2009-2012, 2017 Université Bordeaux
>   * Copyright © 2009-2011 Cisco Systems, Inc.  All rights reserved.
>   * See COPYING in top-level directory.
>   */
> @@ -116,20 +116,69 @@ static void one_process(hwloc_topology_t topology, 
> hwloc_const_bitmap_t topocpus
>path = malloc(pathlen);
>snprintf(path, pathlen, "/proc/%ld/cmdline", pid);
>file = open(path, O_RDONLY);
> -  free(path);
> +  if (file < 0) {
> + /* Ignore errors */
> + free(path);
> + goto out;
> +  }
> +  n = read(file, name, sizeof(name) - 1);
> +  close(file);
>  
> -  if (file >= 0) {
> -n = read(file, name, sizeof(name) - 1);
> -close(file);
> +  if (n <= 0) {
> + /* Ignore kernel threads and errors */
> + free(path);
> + goto out;
> +  }
> +
> +  snprintf(path, pathlen, "/proc/%ld/comm", pid);
> +  file = open(path, O_RDONLY);
>  
> -if (n <= 0)
> -  /* Ignore kernel threads and errors */
> -  goto out;
> +  if (file >= 0) {
> + n = read(file, name, sizeof(name) - 1);
> + close(file);
> + if (n > 0) {
> +   name[n] = 0;
> +   if (n > 1 && name[n-1] == '\n')
> + name[n-1] = 0;
> + } else {
> +   snprintf(name, sizeof(name), "(unknown)");
> + }
> +  } else {
> + /* Old kernel, have to look at old file */
> + char stats[32];
> + char *parenl = NULL, *parenr;
>  
> -name[n] = 0;
> + snprintf(path, pathlen, "/proc/%ld/stat", pid);
> + file = open(path, O_RDONLY);
>  
> - if (only_name && !strstr(name, only_name))
> + if (file < 0) {
> +   /* Ignore errors */
> +   free(path);
> goto out;
> + }
> +
> + /* "pid (comm) ..." */
> + n = read(file, stats, sizeof(stats) - 1);
> + close(file);
> + if (n > 0) {
> +   stats[n] = 0;
> +   parenl = strchr(stats, '(');
> +   parenr = strchr(stats, ')');
> +   if (!parenr)
> +   parenr = &stats[sizeof(stats)-1];
> +   *parenr = 0;
> + }
> + if (!parenl) {
> +   snprintf(name, sizeof(name), "(unknown)");
> + } else {
> +   snprintf(name, sizeof(name), parenl+1);
> + }
> +  }
> +
> +  free(path);
> +
> +  if (only_name && !strstr(name, only_name)) {
> + goto out;
>}
>  }
>  #endif /* HWLOC_LINUX_SYS */
> diff --git a/utils/lstopo/lstopo.c b/utils/lstopo/lstopo.c
> index cb4f7dc..70c78eb 100644
> --- a/utils/lstopo/lstopo.c
> +++ b/utils/lstopo/lstopo.c
> @@ -217,12 +217,12 @@ static void add_process_objects(hwloc_topology_t 
> topology)
>{
>  /* Get threads */
>  DIR *task_dir;
> -struct dirent *task_dirent;
>  
>  snprintf(path, pathlen, "/proc/%s/task", dirent->d_name);
>  task_dir = opendir(path);
>  
>  if (task_dir) {
> +  struct dirent *task_dirent;
>while ((task_dirent = readdir(task_dir))) {
>  long local_tid;
>  char *task_end;
>
>
> ---
>
> Summary of changes:
>  utils/hwloc/hwloc-ps.c | 69 
> ++
>  utils/lstopo/lstopo.c  |  2 +-
>  2 files changed, 60 insertions(+), 11 deletions(-)
>
>
> hooks/post-receive

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org

Re: [OMPI devel] Coverity strangeness

2017-06-15 Thread Brice Goglin
You can email scan-ad...@coverity.com to report bugs and/or ask what's
going on.
Brice




On 16/06/2017 at 07:12, Gilles Gouaillardet wrote:
> Ralph,
>
>
> my 0.02 US$
>
>
> i noted the error message mentions 'holding lock
> "pmix_mutex_t.m_lock_pthread"', but it does not explicitly mentions
>
> 'pmix_global_lock' (!)
>
> at line 446, PMIX_WAIT_THREAD() does release 'cb.lock', which has the
> same type than 'pmix_global_lock', but is not the very same lock.
>
> so maybe coverity is being mislead by PMIX_WAIT_THREAD(), and hence
> the false positive
>
>
> if you have contacts at coverity, it would be interesting to report
> this false positive
>
>
>
> Cheers,
>
>
> Gilles
>
>
> On 6/16/2017 12:02 PM, r...@open-mpi.org wrote:
>> I’m trying to understand some recent coverity warnings, and I confess
>> I’m a little stumped - so I figured I’d ask out there and see if
>> anyone has a suggestion. This is in the PMIx repo, but it is reported
>> as well in OMPI (down in opal/mca/pmix/pmix2x/pmix). The warnings all
>> take the following form:
>>
>> 
>>
>> *** CID 145810:  Concurrent data access violations  (MISSING_LOCK)
>> /src/client/pmix_client.c: 451 in PMIx_Init()
>> 445 /* wait for the data to return */
>> 446 PMIX_WAIT_THREAD(&cb.lock);
>> 447 rc = cb.status;
>> 448 PMIX_DESTRUCT(&cb);
>> 449
>> 450 if (PMIX_SUCCESS == rc) {
> CID 145810:  Concurrent data access violations  (MISSING_LOCK)
> Accessing "pmix_globals.init_cntr" without holding lock
> "pmix_mutex_t.m_lock_pthread". Elsewhere,
> "pmix_globals_t.init_cntr" is accessed with
> "pmix_mutex_t.m_lock_pthread" held 10 out of 11 times.
>> 451 pmix_globals.init_cntr++;
>> 452 } else {
>> 453 PMIX_RELEASE_THREAD(&pmix_global_lock);
>> 454 return rc;
>> 455 }
>> 456 PMIX_RELEASE_THREAD(&pmix_global_lock);
>>
>> Now the odd thing is that the lock is in fact being held - it gets
>> released 5 lines lower down. However, the lock was taken nearly 100
>> lines above this point.
>>
>> I’m therefore inclined to think that the lock somehow “slid” outside
>> of Coverity’s analysis window and it therefore thought (erroneously)
>> that the lock isn’t being held. Has anyone else seen such behavior?
>>
>> Ralph
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
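The lock-window hypothesis is easy to reduce to a standalone test case when reporting the false positive. A minimal sketch of the shape being discussed (toy names, not the actual PMIx code): a mutex taken well above the flagged access, with an intervening helper that locks and unlocks a *different* mutex of the same type, which is what may mislead the analyzer.

```c
#include <pthread.h>

/* Stand-ins for pmix_global_lock and pmix_globals.init_cntr. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static int init_cntr = 0;

/* Stand-in for PMIX_WAIT_THREAD(): takes and releases an unrelated
 * mutex of the same type as global_lock. */
static void wait_on_other_lock(void)
{
    pthread_mutex_t other = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&other);
    pthread_mutex_unlock(&other);
}

int demo_init(void)
{
    pthread_mutex_lock(&global_lock);   /* taken "far above"... */

    wait_on_other_lock();               /* ...releases a *different* lock */

    init_cntr++;                        /* flagged as MISSING_LOCK, even
                                         * though global_lock is still held */

    pthread_mutex_unlock(&global_lock);
    return init_cntr;
}
```

Submitting a reduced case of this shape, rather than the full file, usually makes it much quicker for the analyzer vendor to confirm whether the analysis window or the same-typed lock is the culprit.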

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[hwloc-devel] who's using hwloc.m4 "embedding" feature?

2017-04-11 Thread Brice Goglin
Hello

Open MPI currently uses hwloc by "embedding" hwloc.m4 in its own
configure script (instead of calling hwloc's configure script
explicitly). Is there any other project doing this?

Thanks
Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel


[OMPI devel] anybody ported OMPI to hwloc 2.0 API?

2017-04-05 Thread Brice Goglin
Hello

Did anybody start porting OMPI to the new hwloc 2.0 API (currently in
hwloc git master)?
Gilles, I seem to remember you were interested a while ago?

I will have to do it in the near future. If anybody already started that
work, please let me know.

Brice

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [hwloc-devel] hwloc.m4: minor english fixes

2017-02-08 Thread Brice Goglin
FWIW, I didn't get the commit email either, and I am pretty sure it's
not the first time this has happened. There are no archives for this ML;
do we have a way to see the logs of the emailing script that runs on GitHub?

Brice



On 08/02/2017 16:19, Jeff Squyres (jsquyres) wrote:
> On Feb 7, 2017, at 12:12 PM, Jeff Squyres (jsquyres) wrote:
>> Fair enough, but the test itself is just a switch/case statement -- it's not 
>> an actual test to see if the system supports binding or not.  Hence, hedging 
>> the warning message a little seemed reasonable.
> I see you actually reverted my commit (somehow I didn't get an email about 
> that -- I only noticed it by chance today on GitHub.com).
>
> 1. You reverted an actual grammar fix: "support" -> "supported".
>
> 2. I don't think that "likely" is bad to have.  Like I said above, the test 
> itself is just a switch/case test based on a hard-coded list of OSs.  The 
> test does not *actually* test to see if the system supports binding.  So 
> weakening the language a little to say "likely" is not necessarily a bad 
> thing.
>
> Sure, in some (most? all?) cases, the likelihood of not supporting binding 
> will be 100%.  But a) that doesn't mean the use of "likely" is incorrect, and 
> b) allows for the possibility of not supporting binding to be less than 100% 
> in some future / unpredicted system.
>
> "Always" (and words/phrasing like it) is a very, very strong word.  It should 
> be avoided when possible.
>

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel


Re: [OMPI devel] hwloc missing NUMANode object

2017-01-05 Thread Brice Goglin


On 05/01/2017 07:07, Gilles Gouaillardet wrote:
> Brice,
>
> things would be much easier if there were an HWLOC_OBJ_NODE object in
> the topology.
>
> could you please consider backporting the relevant changes from master
> into the v1.11 branch ?
>
> Cheers,
>
> Gilles

Hello
Unfortunately, I can't backport this to 1.x. This is very intrusive and
would break other things.
However, what problem are you actually seeing? There is no NUMA node in
hwloc 1.x when the machine isn't NUMA (or when there's no NUMA support
in the operating system, but that's very unlikely). hwloc master would
show a single NUMA node that is equivalent to the entire machine, so
binding to it would be a no-op.
Regards
Brice

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [hwloc-devel] Git: open-mpi/hwloc annotated tag hwloc-1.11.5 deleted. hwloc-1.11.5

2016-11-25 Thread Brice Goglin
FWIW, I am trying to remove github automatic "release" tarballs from
the github page but it actually removes the corresponding git tag.

Those releases are not autogen'ed, do not contain the doc, etc.
Users are confused when they download tarballs from github instead of
from the hwloc website.

If anybody has a solution to hide or delete those github tarballs
without removing git tags, please let me know.

Brice



On 25/11/2016 13:18, git...@open-mpi.org wrote:
> This is an automated email from the git hooks/post-receive script. It was
> generated because a ref change was pushed to the repository containing
> the project "open-mpi/hwloc".
>
> The annotated tag, hwloc-1.11.5 has been deleted
>was  ada16725e68210552ae30b40edc00ae31bc34356
>
> - Log -
> d6f3ae30c9953bbc141b69b1bec2067570709293 v1.11.5rc1 released, doing rc2 now
> ---
>
>
> hooks/post-receive

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel


Re: [hwloc-devel] [RFC] applying native OS restrictions to XML imported topology

2016-10-21 Thread Brice Goglin
On 21/10/2016 17:21, r...@open-mpi.org wrote:
> I should add: this does beg the question of how a proc “discovers” its 
> resource constraints without having access to the hwloc tree. One possible 
> solution - the RM already knows the restrictions, and so it could pass those 
> down at proc startup (e.g., as part of the PMIx info). We could pass whatever 
> info hwloc would like passed into its calls - doesn’t have to be something 
> “understandable” by the proc itself.

Retrieving cgroups info from Linux isn't expensive so my feeling was to
still have compute processes do it. But, indeed, we could also avoid
that step by having the caller pass a hwloc_bitmap_t for allowed PUs and
another one for allowed NUMA nodes. More below.

>
>> On Oct 21, 2016, at 8:15 AM, r...@open-mpi.org wrote:
>>
>> Hmmm...I think maybe we are only seeing a small portion of the picture here. 
>> There are two pieces of the problem when looking at large SMPs:
>>
>> * time required for discovery - your proposal is attempting to address that, 
>> assuming that the RM daemon collects the topology and then communicates it 
>> to each process (which is today’s method)

There's actually an easy way to do that: export the topology to XML
during boot, then set HWLOC_XMLFILE=/path/to/xml in the environment of
the processes.

>>
>> * memory footprint. We are seeing over 40MBytes being consumed by hwloc 
>> topologies on fully loaded KNL machines, which is a disturbing number

I'd be interested in knowing whether these 40MB are hwloc_obj structures,
or bitmaps, or info strings, etc. Do you have this information already?

>> Where we are headed is to having only one copy of the hwloc topology tree on 
>> a node, stored in a shared memory segment hosted by the local RM daemon.

And would you need some sort of relative pointers so that all processes
can traverse parent/child/sibling/... pointers from their own mapping at
a random address in their address space?

>>  Procs will then access that tree to obtain any required info. Thus, we are 
>> less interested in each process creating its own tree based on an XML 
>> representation passed to it by the RM, and more interested in having the 
>> hwloc search algorithms correctly handle any resource restrictions when 
>> searching the RM’s tree.
>>
>> In other words, rather than (or perhaps, in addition to?) filtering the XML, 
>> we’d prefer to see some modification of the search APIs to allow a proc to 
>> pass in its resource constraints, and have the search algorithm properly 
>> consider them when returning the result. This eliminates all the XML 
>> conversion overhead, and resolves the memory footprint issue.

What do you call "search algorithm"? We have many functions to walk the
tree, as levels, from top to bottom, etc. Passing such a resource
constraints to all of them isn't easy. And we have explicit pointers
between objects too.

Maybe define a good basic set of interesting functions for your search
algorithm and duplicate these in a new hwloc/allowed.h with a new
allowed_cpuset attribute? Whenever they find an object, they check
whether hwloc_bitmap_intersects(obj->cpuset, allowed_cpuset). If FALSE,
ignore that object.
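The proposed check is just a per-object bitmap intersection. With plain fixed-width masks standing in for hwloc_bitmap_t (a sketch of the idea only; real code would call hwloc_bitmap_intersects() on obj->cpuset):

```c
#include <stdint.h>

struct obj { uint64_t cpuset; };

/* Count the objects whose cpuset intersects the allowed set; the
 * lookup helpers in a hypothetical hwloc/allowed.h would skip the
 * rest. */
static int count_allowed(const struct obj *objs, int n, uint64_t allowed)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (objs[i].cpuset & allowed)   /* hwloc_bitmap_intersects() analog */
            count++;
    return count;
}

int demo(void)
{
    struct obj objs[3] = { { 0x0F }, { 0xF0 }, { 0xFF } };
    /* e.g. a cgroup that allows only CPUs 0-3 */
    return count_allowed(objs, 3, 0x0F);
}
```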

There's also a related change that I wasn't ready/sure to try yet:
obj->allowed_cpuset is currently just a duplicate of obj->cpuset in the
default case. When the WHOLE_SYSTEM topology flag is set, it's a binary
AND between obj->cpuset and root->allowed_cpuset. Quite a lot of
duplication. We could remove all these allowed_{cpuset,nodeset} from
objects and have a topology->allowed_cpuset instead. Most users don't
care and wouldn't see the difference. Others would pass the WHOLE_SYSTEM
flag and use hwloc/allowed.h, or do things manually:
* ignore an object if !hwloc_bitmap_intersects(obj->cpuset,
allowed_cpuset) like what hwloc/allowed.h would do.
* bind using:
set = hwloc_bitmap_dup(obj->cpuset);
hwloc_bitmap_and(set, set, allowed_cpuset);
set_cpubind(set);
hwloc_bitmap_free(set);

allowed_cpuset can be either a new topology->allowed_cpuset retrieve by
the current process using the OS, or their own provided allowed_cpuset
that came from the RM.

I only talked about allowed_cpuset above, but there's also a
allowed_nodeset. What happens if a NUMA node is disallowed but its local
cores are allowed? We want to ignore that NUMA node when looking up NUMA
nodes for manipulating memory. But we don't want to ignore it when
looking up NUMA nodes and children for placing tasks. It's not clear to
me how to handle these cases. Should all new functions receive both
allowed_cpuset and allowed_nodeset, where one of them can be NULL?



By the way, obj->complete_{cpuset,nodeset} is also something we could
drop and just have a topology->complete_{cpuset,nodeset} saying "by the
way, there are other resources 

[hwloc-devel] [RFC] applying native OS restrictions to XML imported topology

2016-10-21 Thread Brice Goglin
Hello

Based on recent discussion about hwloc_topology_load() being slow on
some "large" platforms (almost 1 second on KNL), here's a new feature
proposal:

We've been recommending the use of XML to avoid multiple expensive
discovery: Export to XML once at boot, and reload from XML for each
actual process using hwloc. The main limitation is cgroups: resource
managers use cgroups to restrict the processors and memory that are
actually available to each job. So the topology of different jobs on the
same machine is actually slightly different from the main XML that
contained everything when it was created outside of cgroups during boot.

So we're looking at adding a new topology flag that loads the entire
machine from XML (or synthetic) and applies restrictions from the
local/native operating system.
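In miniature, the proposed flag boils down to an AND between the sets described by the boot-time XML and the allowed sets reported by the native OS backend. A sketch with plain masks standing in for hwloc bitmaps (all names here are hypothetical):

```c
#include <stdint.h>

/* Stand-in for the PUs described by the boot-time XML export
 * (the full machine: 8 CPUs). */
static const uint64_t xml_complete_pus = 0xFF;

/* Stand-in for what the native OS backend (e.g. Linux cgroups)
 * reports as actually available to this particular job. */
static uint64_t os_allowed_pus(void)
{
    return 0x3C;  /* CPUs 2-5 */
}

/* The proposed flag: load the whole tree from XML, then restrict it
 * using the local/native operating system's view. */
uint64_t load_with_native_restrictions(void)
{
    return xml_complete_pus & os_allowed_pus();
}
```

This keeps the expensive discovery amortized in the XML while each job still sees only its own cgroup-restricted resources.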

Details at https://github.com/open-mpi/hwloc/pull/212
Comments welcome here or there.

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel


Re: [hwloc-devel] Query on the output of hwloc API

2016-09-26 Thread Brice Goglin
To be future-proof (so that your code works both with current hwloc 1.x
and upcoming 2.0), the best check for non-NUMA machines is
   hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NODE) <= 1

Otherwise, yes. Infinite/full nodeset means no NUMA. To actually check
whether a nodeset is infinite, use hwloc_bitmap_weight(obj->nodeset). It
returns -1 on infinite/full bitmaps since there's no way to count an
infinite set of bits.
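The -1 convention is worth illustrating: an "infinite" bitmap has every bit set beyond the stored words, so its weight cannot be a finite count. A toy model of that convention (not hwloc's internal representation, which has more machinery):

```c
#include <stdint.h>

struct toy_bitmap {
    uint64_t bits;   /* the finite, stored part */
    int infinite;    /* nonzero: every bit past the stored part is set */
};

/* Mirrors the hwloc_bitmap_weight() convention: -1 on infinite/full
 * bitmaps, a popcount otherwise. */
static int toy_weight(const struct toy_bitmap *b)
{
    if (b->infinite)
        return -1;
    return __builtin_popcountll(b->bits);  /* GCC/Clang builtin */
}

int demo_finite(void)
{
    struct toy_bitmap b = { 0x0F, 0 };   /* four bits set */
    return toy_weight(&b);
}

int demo_infinite(void)
{
    struct toy_bitmap b = { ~0ULL, 1 };  /* "the entire machine memory" */
    return toy_weight(&b);
}
```

So a portable non-NUMA check compares the weight against 1 rather than iterating bits.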

Brice


On 27/09/2016 07:45, Swati Agrawal wrote:
> Thanks Brice for the detailed info.
> So, can I say that if obj->nodeset is all 1s, it is a non-NUMA setup?
>
> Thanks,
> Swati
>
> On Monday, September 26, 2016, Brice Goglin <brice.gog...@inria.fr> wrote:
>
> Hello
>
> If there's no NUMA node object in your hwloc topology, it means
> your machine isn't NUMA (there's a single NUMA node), or your
> system doesn't report NUMA information at all (missing NUMA
> support in the kernel, etc).
>
> This is an old design choice that is not convenient. So we'll
> change that in the upcoming hwloc 2.0. There will always be at
> least one NUMA node object (just like in lscpu).
>
> In the meantime, the meaning of obj->nodeset isn't very useful
> when there's no NUMA object anyway. If you really need to look at
> obj->nodeset on non-NUMA machines, you'll get either NULL or a
> "full" "infinite" bitmap (meaning "the entire machine memory", as
> explained in the description of the nodeset attribute of the
> object structure
> 
> https://www.open-mpi.org/projects/hwloc/doc/v1.11.4/a00038.php#a08f0d0e16c619a6e653526cbee4ffea3).
>
> By the way, you're not supposed to look at internal nodeset fields
> (ulongs and ulongs_count). For instance, there's another field
> saying that the bitmap is infinite. All these are private details
> not meant to be understood by users. Things like
> hwloc_bitmap_asprintf() or "lstopo -.xml" would show that
> obj->nodeset is 0xf...f which means "infinite" or "full".
>
> Again, these infinite nodesets will go away in the upcoming hwloc 2.0.
>
> Brice
>
>
>
>
>
>
On 27/09/2016 01:35, Swati Agrawal wrote:
>> Hi All,
>>
>> I have recently started using hwloc and am stuck with a case where
>> there are no NUMA nodes. I see that when i run "lscpu" command,
>> it shows me there is 1 Numa Node and all the PUs are in this node.
>> But when i try reading the nodeset for my object using
>> hwloc_get_non_io_ancestor_obj(..), I see below output:
>>
>> obj->nodeset->ulongs_count = 1;
>> obj->nodeset->ulongs[0] = 18446744073709551615 (UINT64 MAX Value).
>>
>> What does this actually mean?
>>
>> Thanks,
>> Swati
>>
>>
>> ___
>> hwloc-devel mailing list
>> hwloc-devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel
>
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel
___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel

Re: [hwloc-devel] Query on the output of hwloc API

2016-09-26 Thread Brice Goglin
Hello

If there's no NUMA node object in your hwloc topology, it means your
machine isn't NUMA (there's a single NUMA node), or your system doesn't
report NUMA information at all (missing NUMA support in the kernel, etc).

This is an old design choice that is not convenient. So we'll change
that in the upcoming hwloc 2.0. There will always be at least one NUMA
node object (just like in lscpu).

In the meantime, the meaning of obj->nodeset isn't very useful when
there's no NUMA object anyway. If you really need to look at
obj->nodeset on non-NUMA machines, you'll get either NULL or a "full"
"infinite" bitmap (meaning "the entire machine memory", as explained in
the description of the nodeset attribute of the object structure
https://www.open-mpi.org/projects/hwloc/doc/v1.11.4/a00038.php#a08f0d0e16c619a6e653526cbee4ffea3).

By the way, you're not supposed to look at internal nodeset fields
(ulongs and ulongs_count). For instance, there's another field saying
that the bitmap is infinite. All these are private details not meant to
be understood by users. Things like hwloc_bitmap_asprintf() or "lstopo
-.xml" would show that obj->nodeset is 0xf...f which means "infinite" or
"full".

Again, these infinite nodesets will go away in the upcoming hwloc 2.0.

Brice






On 27/09/2016 01:35, Swati Agrawal wrote:
> Hi All,
>
> I have recently started using hwloc and am stuck with a case where there
> are no NUMA nodes. I see that when i run "lscpu" command, it shows me
> there is 1 Numa Node and all the PUs are in this node.
> But when i try reading the nodeset for my object using
> hwloc_get_non_io_ancestor_obj(..), I see below output:
>
> obj->nodeset->ulongs_count = 1;
> obj->nodeset->ulongs[0] = 18446744073709551615 (UINT64 MAX Value).
>
> What does this actually mean?
>
> Thanks,
> Swati
>
>
> ___
> hwloc-devel mailing list
> hwloc-devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel

[hwloc-devel] new distances API

2016-09-20 Thread Brice Goglin
Hello

I just pushed a new distances API to simplify/cleanup things in hwloc
2.0. It will be merged into git master next week unless there are big complaints.

Details and comments in the pull request
https://github.com/open-mpi/hwloc/pull/210

Brice

___
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-devel


Re: [OMPI devel] Migration of mailman mailing lists

2016-07-18 Thread Brice Goglin
Yes, kill all netloc lists.
Brice


On 18 July 2016, 17:43:49 UTC+02:00, Josh Hursey wrote:
>Now that netloc has rolled into hwloc, I think it is safe to kill the
>netloc lists.
>
>mtt-devel-core and mtt-annouce should be kept. They probably need to be
>cleaned. But the hope is that we release MTT at some point in the
>near-ish
>future.
>
>On Mon, Jul 18, 2016 at 10:20 AM, Jeff Squyres (jsquyres) <
>jsquy...@cisco.com> wrote:
>
>> We're progressing in the migration plans.  Next up is the mailing
>lists.
>> The first step is to determine which lists to migrate, and which to
>delete
>> (because they're now no longer necessary, anyway).
>>
>> These are the lists we plan to keep/migrate, and the lists we plan to
>> delete/not migrate -- if you know of any list we mis-classified,
>please let
>> us know ASAP.  We plan to start this migration as early as Tuesday
>> afternoon.
>>
>> Lists that we want to keep:
>> • Admin
>> • Announce
>> • Devel
>> • Devel-core
>> • Hwloc-announce
>> • Hwloc-commits -- gitdub sends to this list; KEEP
>> • Hwloc-devel
>> • Hwloc-users
>> • Mirrors
>> • Mtt-commits
>> • Mtt-devel
>> • Mtt-results -- daily MTT results sent to this list; KEEP
>> • Mtt-users
>> • Ompi-commits -- gitdub sends to this list; KEEP
>> • Users
>>
>> Lists that we know that we do not want to migrate:
>> • Bugs -- Trac used to send to this list; it’s now moot and can be
>killed.
>> • Docs -- kill?
>> • mtt-announce -- kill?
>> • mtt-devel-core -- kill?
>> • Netloc-announce -- kill?
>> • Netloc-bugs -- kill?
>> • Netloc-commits -- kill?
>> • Netloc-devel -- kill?
>> • Netloc-users -- kill?
>> • ompi-user-docs-bugs -- kill?
>> • ompi-user-docs-svn -- kill?
>> • otpo-bugs -- kill?
>> • otpo-svn -- kill?
>> • otpo-users -- kill?
>> • Any glassbottom list
>> • Any orcm list
>> • Any pmix list
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/07/19231.php
>
>
>
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2016/07/19232.php


Re: [OMPI devel] [PATCH] Fix for xlc-13.1.0 ICE (hwloc)

2016-05-08 Thread Brice Goglin
Thanks, applied to hwloc. And PR for OMPI master at
https://github.com/open-mpi/ompi/pull/1657
Brice



On 06/05/2016 00:29, Paul Hargrove wrote:
> I have some good news:  I have a fix!!
>
> FWIW: I too can build w/ xlc 12.1 (also BG/Q).
> It is just the 13.1.0 on Power7 that crashes building hwloc.
> Meanwhile, 13.1.2 on Power8 little-endian does not crash (but is a
> different front-end than big-endian if I understand correctly).
>
> I started "bisecting" the file topology-xml-nolibxml.c and found that
> xlc is crashing on "__hwloc_attribute_may_alias".
> Simply disabling use of that attribute resolves the problem.
>
> So, here is the fix, which simply changes the check for this attribute
> to match the way in which hwloc uses it.
> It disqualifies the buggy compiler version(s) based on behavior,
> rather than us trying to list affected versions.
>
> --- config/hwloc_check_attributes.m4~   2016-05-05 17:18:10.380479303 -0500
> +++ config/hwloc_check_attributes.m4    2016-05-05 17:21:30.399799031 -0500
> @@ -322,9 +322,10 @@
>  # Attribute may_alias: No suitable cross-check available, that
> works for non-supporting compilers
>  # Ignored by intel-9.1.045 -- turn off with -wd1292
>  # Ignored by PGI-6.2.5; ignore not detected due to missing
> cross-check
> +# The test case is chosen to match hwloc's usage, and reproduces
> an xlc-13.1.0 bug.
>  #
>  _HWLOC_CHECK_SPECIFIC_ATTRIBUTE([may_alias],
> -[int * p_value __attribute__ ((__may_alias__));],
> +[struct { int i; } __attribute__ ((__may_alias__)) * p_value;],
>  [],
>  [])
>
>
> -Paul [proving that I am good for more than just *breaking* other
> people's software - I can fix things too]
>
> On Thu, May 5, 2016 at 2:28 PM, Jeff Squyres (jsquyres) wrote:
>
> > On May 5, 2016, at 5:27 PM, Josh Hursey wrote:
> >
> > Since this also happens with hwloc 1.11.3 standalone maybe hwloc
> folks can take point on further investigation?
>
> I think Brice would love your assistance in figuring this out,
> since I'm guessing he doesn't have access to these platforms,
> either.  :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/05/18917.php
>
>
>
>
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
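For anyone wanting to sanity-check the updated probe outside of autoconf: the construct it now compiles is a may_alias-qualified struct pointer, and the classic use of that attribute is type punning without violating strict aliasing. A sketch mirroring the usage (GCC/Clang extension; this is illustrative, not hwloc code):

```c
/* The construct the updated configure probe compiles: a may_alias
 * struct pointer, matching how hwloc uses the attribute. */
struct alias_int { int i; } __attribute__ ((__may_alias__));

/* Read a float's bit pattern through the aliasing struct without
 * tripping C's strict-aliasing rules (assumes 32-bit int and float). */
static int bits_of(float f)
{
    struct alias_int *p = (struct alias_int *) &f;
    return p->i;
}

int demo(void)
{
    return bits_of(0.0f);  /* positive zero: all bits clear */
}
```

Feeding exactly this shape to the probe is what lets configure disqualify compilers like xlc-13.1.0 that crash on it, instead of maintaining a version blacklist.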



Re: [OMPI devel] [PATCH] Fix for xlc-13.1.0 ICE (hwloc)

2016-05-06 Thread Brice Goglin
Thanks
I think I would be fine with that fix. Unfortunately I won't have a good
internet access until sunday night. I won't be able to test anything
properly earlier :/



On 06/05/2016 00:29, Paul Hargrove wrote:
> I have some good news:  I have a fix!!
>
> FWIW: I too can build w/ xlc 12.1 (also BG/Q).
> It is just the 13.1.0 on Power7 that crashes building hwloc.
> Meanwhile, 13.1.2 on Power8 little-endian does not crash (but is a
> different front-end than big-endian if I understand correctly).
>
> I started "bisecting" the file topology-xml-nolibxml.c and found that
> xlc is crashing on "__hwloc_attribute_may_alias".
> Simply disabling use of that attribute resolves the problem.
>
> So, here is the fix, which simply changes the check for this attribute
> to match the way in which hwloc uses it.
> It disqualifies the buggy compiler version(s) based on behavior,
> rather than us trying to list affected versions.
>
> --- config/hwloc_check_attributes.m4~   2016-05-05 17:18:10.380479303 -0500
> +++ config/hwloc_check_attributes.m4    2016-05-05 17:21:30.399799031 -0500
> @@ -322,9 +322,10 @@
>  # Attribute may_alias: No suitable cross-check available, that
> works for non-supporting compilers
>  # Ignored by intel-9.1.045 -- turn off with -wd1292
>  # Ignored by PGI-6.2.5; ignore not detected due to missing
> cross-check
> +# The test case is chosen to match hwloc's usage, and reproduces
> an xlc-13.1.0 bug.
>  #
>  _HWLOC_CHECK_SPECIFIC_ATTRIBUTE([may_alias],
> -[int * p_value __attribute__ ((__may_alias__));],
> +[struct { int i; } __attribute__ ((__may_alias__)) * p_value;],
>  [],
>  [])
>
>
> -Paul [proving that I am good for more than just *breaking* other
> people's software - I can fix things too]
>
> On Thu, May 5, 2016 at 2:28 PM, Jeff Squyres (jsquyres) wrote:
>
> > On May 5, 2016, at 5:27 PM, Josh Hursey wrote:
> >
> > Since this also happens with hwloc 1.11.3 standalone maybe hwloc
> folks can take point on further investigation?
>
> I think Brice would love your assistance in figuring this out,
> since I'm guessing he doesn't have access to these platforms,
> either.  :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/05/18917.php
>
>
>
>
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] [2.0.0rc2] build failure with ppc64 and "gcc -m32" (hwloc)

2016-05-03 Thread Brice Goglin
https://github.com/open-mpi/ompi/pull/1621 (against master, needs to go
to 2.0 later)


On 03/05/2016 08:22, Brice Goglin wrote:
> Yes we should backport this to OMPI master and v2.x.
> I am usually not the one doing the PR, I'd need to learn the exact
> procedure first :)
>
> Brice
>
>
>
> On 03/05/2016 08:15, Paul Hargrove wrote:
>> Thanks, Brice.
>>
>> Any plans to get this fix into Open MPI's embedded copy of hwloc
>> 1.11.2, and into v2.x in particular?
>> Or perhaps that is Jeff's job?
>>
>> -Paul
>>
>> On Mon, May 2, 2016 at 11:04 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>> Should be fixed by
>> 
>> https://github.com/open-mpi/hwloc/commit/9549fd59af04dca2e2340e17f0e685f8c552d818
>> Thanks for the report
>> Brice
>>
>>
>>
>>
>>> On 02/05/2016 21:53, Paul Hargrove wrote:
>>> I have a linux/ppc64 host running Fedora 20.
>>> I have configured the 2.0.0rc2 tarball with
>>>
>>> --prefix=[] --enable-debug \
>>> CFLAGS=-m32 --with-wrapper-cflags=-m32 \
>>> CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
>>> FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran
>>>
>>> [yes, I know the fortran flags are pointless with
>>> --disable-mpi-fortran]
>>>
>>> My build is failing (as shown at the bottom of this email) in
>>> tools/wrappers with undefined references to udev symbols.
>>> The udev configure probe run by the embedded hwloc seemed happy
>>> enough:
>>>
>>> --- MCA component hwloc:hwloc1112 (m4 configuration macro,
>>> priority 90)
>>> checking for MCA component hwloc:hwloc1112 compile mode...
>>> static
>>> checking hwloc building mode... embedded
>>> [...]
>>> checking libudev.h usability... yes
>>> checking libudev.h presence... yes
>>> checking for libudev.h... yes
>>> checking for udev_device_new_from_subsystem_sysname in
>>> -ludev... no
>>>
>>>
>>> However, looking at config.log one can see that despite the
>>> presence/usability of libudev.h there is NOT a libudev library
>>> present for "-m32".
>>> This is apparent because the probe
>>> for udev_device_new_from_subsystem_sysname failed with a message
>>> about the *library* not being found rather than about an
>>> undefined symbol.
>>>
>>>
>>> I *can* work-around this issue by passing  --disable-libudev to
>>> configure.
>>> However, it would seem appropriate to check for a usable libudev
>>> library in addition to the header.
>>>
>>> -Paul
>>>
>>>
>>> Making all in tools/wrappers
>>> make[2]: Entering directory
>>> 
>>> `/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/tools/wrappers'
>>> depbase=`echo opal_wrapper.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
>>> gcc -std=gnu99 "-DEXEEXT=\"\"" -I.
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/tools/wrappers
>>> -I../../../opal/include -I../../../ompi/include
>>> -I../../../oshmem/include
>>> -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen 
>>> -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
>>> -I../../../ompi/mpiext/cuda/c  
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2
>>> -I../../..
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/include
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/orte/include
>>> -I../../../orte/include
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/ompi/include
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/oshmem/include
>>>  
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/hwloc/hwloc1112/hwloc/include
>>> 
>>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/mca/hwloc/hwloc1112/hwloc/include
>>> 
>>> -I/home/phargrov/OMPI/openm

Re: [OMPI devel] [2.0.0rc2] build failure with ppc64 and "gcc -m32" (hwloc)

2016-05-03 Thread Brice Goglin
Yes we should backport this to OMPI master and v2.x.
I am usually not the one doing the PR, I'd need to learn the exact
procedure first :)

Brice



On 03/05/2016 08:15, Paul Hargrove wrote:
> Thanks, Brice.
>
> Any plans to get this fix into Open MPI's embedded copy of hwloc
> 1.11.2, and into v2.x in particular?
> Or perhaps that is Jeff's job?
>
> -Paul
>
> On Mon, May 2, 2016 at 11:04 PM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
> Should be fixed by
> 
> https://github.com/open-mpi/hwloc/commit/9549fd59af04dca2e2340e17f0e685f8c552d818
> Thanks for the report
> Brice
>
>
>
>
>> On 02/05/2016 21:53, Paul Hargrove wrote:
>> I have a linux/ppc64 host running Fedora 20.
>> I have configured the 2.0.0rc2 tarball with
>>
>> --prefix=[] --enable-debug \
>> CFLAGS=-m32 --with-wrapper-cflags=-m32 \
>> CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
>> FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran
>>
>> [yes, I know the fortran flags are pointless with
>> --disable-mpi-fortran]
>>
>> My build is failing (as shown at the bottom of this email) in
>> tools/wrappers with undefined references to udev symbols.
>> The udev configure probe run by the embedded hwloc seemed happy
>> enough:
>>
>> --- MCA component hwloc:hwloc1112 (m4 configuration macro,
>> priority 90)
>> checking for MCA component hwloc:hwloc1112 compile mode... static
>> checking hwloc building mode... embedded
>> [...]
>> checking libudev.h usability... yes
>> checking libudev.h presence... yes
>> checking for libudev.h... yes
>> checking for udev_device_new_from_subsystem_sysname in
>> -ludev... no
>>
>>
>> However, looking at config.log one can see that despite the
>> presence/usability of libudev.h there is NOT a libudev library
>> present for "-m32".
>> This is apparent because the probe
>> for udev_device_new_from_subsystem_sysname failed with a message
>> about the *library* not being found rather than about an
>> undefined symbol.
>>
>>
>> I *can* work-around this issue by passing  --disable-libudev to
>> configure.
>> However, it would seem appropriate to check for a usable libudev
>> library in addition to the header.
>>
>> -Paul
>>
>>
>> Making all in tools/wrappers
>> make[2]: Entering directory
>> 
>> `/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/tools/wrappers'
>> depbase=`echo opal_wrapper.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
>> gcc -std=gnu99 "-DEXEEXT=\"\"" -I.
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/tools/wrappers
>> -I../../../opal/include -I../../../ompi/include
>> -I../../../oshmem/include
>> -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
>> -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
>> -I../../../ompi/mpiext/cuda/c  
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2
>> -I../../..
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/include
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/orte/include
>> -I../../../orte/include
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/ompi/include
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/oshmem/include
>>  
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/hwloc/hwloc1112/hwloc/include
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/mca/hwloc/hwloc1112/hwloc/include
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/event/libevent2022/libevent
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/event/libevent2022/libevent/include
>> 
>> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/mca/event/libevent2022/libevent/include
>>  -m32 -g -finline-functions -fno-strict-aliasing -pthread -MT
>> opal_wrapper.o -MD -MP -MF $depbase.Tpo -c -o opal_wrapper.o
>> 
>> /home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi

Re: [OMPI devel] [2.0.0rc2] build failure with ppc64 and "gcc -m32" (hwloc)

2016-05-03 Thread Brice Goglin
Should be fixed by
https://github.com/open-mpi/hwloc/commit/9549fd59af04dca2e2340e17f0e685f8c552d818
Thanks for the report
Brice



On 02/05/2016 21:53, Paul Hargrove wrote:
> I have a linux/ppc64 host running Fedora 20.
> I have configured the 2.0.0rc2 tarball with
>
> --prefix=[] --enable-debug \
> CFLAGS=-m32 --with-wrapper-cflags=-m32 \
> CXXFLAGS=-m32 --with-wrapper-cxxflags=-m32 \
> FCFLAGS=-m32 --with-wrapper-fcflags=-m32 --disable-mpi-fortran
>
> [yes, I know the fortran flags are pointless with --disable-mpi-fortran]
>
> My build is failing (as shown at the bottom of this email) in
> tools/wrappers with undefined references to udev symbols.
> The udev configure probe run by the embedded hwloc seemed happy enough:
>
> --- MCA component hwloc:hwloc1112 (m4 configuration macro,
> priority 90)
> checking for MCA component hwloc:hwloc1112 compile mode... static
> checking hwloc building mode... embedded
> [...]
> checking libudev.h usability... yes
> checking libudev.h presence... yes
> checking for libudev.h... yes
> checking for udev_device_new_from_subsystem_sysname in -ludev... no
>
>
> However, looking at config.log one can see that despite the
> presence/usability of libudev.h there is NOT a libudev library present
> for "-m32".
> This is apparent because the probe
> for udev_device_new_from_subsystem_sysname failed with a message about
> the *library* not being found rather than about an undefined symbol.
>
>
> I *can* work-around this issue by passing  --disable-libudev to configure.
> However, it would seem appropriate to check for a usable libudev
> library in addition to the header.
>
> -Paul
>
>
> Making all in tools/wrappers
> make[2]: Entering directory
> `/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/tools/wrappers'
> depbase=`echo opal_wrapper.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
> gcc -std=gnu99 "-DEXEEXT=\"\"" -I.
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/tools/wrappers
> -I../../../opal/include -I../../../ompi/include
> -I../../../oshmem/include
> -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/private/autogen
> -I../../../opal/mca/hwloc/hwloc1112/hwloc/include/hwloc/autogen
> -I../../../ompi/mpiext/cuda/c  
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2 
> -I../../..
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/include
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/orte/include
> -I../../../orte/include
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/ompi/include
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/oshmem/include
>  
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/hwloc/hwloc1112/hwloc/include
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/mca/hwloc/hwloc1112/hwloc/include
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/event/libevent2022/libevent
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/mca/event/libevent2022/libevent/include
> -I/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/mca/event/libevent2022/libevent/include
>  -m32 -g -finline-functions -fno-strict-aliasing -pthread -MT
> opal_wrapper.o -MD -MP -MF $depbase.Tpo -c -o opal_wrapper.o
> /home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/openmpi-2.0.0rc2/opal/tools/wrappers/opal_wrapper.c
> &&\
> mv -f $depbase.Tpo $depbase.Po
> /bin/sh ../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -m32
> -g -finline-functions -fno-strict-aliasing -pthread   -o opal_wrapper
> opal_wrapper.o ../../../opal/libopen-pal.la 
> -lrt -lm -lutil
> libtool: link: gcc -std=gnu99 -m32 -g -finline-functions
> -fno-strict-aliasing -pthread -o .libs/opal_wrapper opal_wrapper.o
>  ../../../opal/.libs/libopen-pal.so -ldl -lrt -lm -lutil -pthread
> -Wl,-rpath
> -Wl,/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/INST/lib
> ../../../opal/.libs/libopen-pal.so: undefined reference to `udev_new'
> ../../../opal/.libs/libopen-pal.so: undefined reference to
> `udev_device_new_from_subsystem_sysname'
> ../../../opal/.libs/libopen-pal.so: undefined reference to `udev_unref'
> ../../../opal/.libs/libopen-pal.so: undefined reference to
> `udev_device_get_property_value'
> ../../../opal/.libs/libopen-pal.so: undefined reference to
> `udev_device_unref'
> collect2: error: ld returned 1 exit status
> make[2]: *** [opal_wrapper] Error 1
> make[2]: Leaving directory
> `/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal/tools/wrappers'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/home/phargrov/OMPI/openmpi-2.0.0rc2-linux-ppc32-gcc/BLD/opal'
> make: *** [all-recursive] Error 1
>
>
>
>
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> 
> Computer 

Re: [OMPI devel] Why is floating point number used for locality

2016-04-28 Thread Brice Goglin
It comes from the hwloc API. It doesn't use integers because some users
want to provide their own distance matrix that was generated by
benchmarks. Also we normalize the matrix to have latency 1 on the
diagonal (for local memory access latency) and that causes non-diagonal
items not to be integers anymore (Linux and ACPI SLIT report 10 for
local memory latency and custom values > 10 for non-local latency).

I am actually revisiting that hwloc API right now. I am open to comments
and suggestions about all this.

By the way, I talked to Jeff about this recently: the BTL should use the
distance in the hwloc tree first, instead of these latency values. I'll
try to send patches one day.

Brice



On 28/04/2016 20:00, dpchoudh . wrote:
> Hello all
>
> I am wondering about the rationale of using floating point numbers for
> calculating 'distances' in the openib BTL. Is it because some
> distances can be infinite and there is no (conventional) way to
> represent infinity using integers?
>
> Thanks for your comments
>
> Durga
>
>
> The surgeon general advises you to eat right, exercise regularly and
> quit ageing.
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/04/18830.php



Re: [hwloc-devel] Three patches for MSVC/ICL builds on Windows.

2016-04-05 Thread Brice Goglin
More comments about individual changes below.

> add-ifndef-guard-around-gnu-source.patch
> diff --git a/config/hwloc.m4 b/config/hwloc.m4
> index f249713..855244d 100644
> --- a/config/hwloc.m4
> +++ b/config/hwloc.m4
> @@ -486,7 +486,9 @@ EOF])
>  # program_invocation_name and __progname may be available but not 
> exported in headers
>  AC_MSG_CHECKING([for program_invocation_name])
>  AC_TRY_LINK([
> - #define _GNU_SOURCE
> + #ifndef _GNU_SOURCE
> + # define _GNU_SOURCE
> + #endif
>   #include 
>   #include 
>   extern char *program_invocation_name;
> [...]

This one is applied (not pushed yet).

> use-ac-check-decl.patch
> diff --git a/config/hwloc.m4 b/config/hwloc.m4
> index 855244d..49955a6 100644
> --- a/config/hwloc.m4
> +++ b/config/hwloc.m4
> @@ -367,7 +367,7 @@ EOF])
>  AC_CHECK_HEADERS([ctype.h])
>  
>  AC_CHECK_FUNCS([strncasecmp], [
> -  _HWLOC_CHECK_DECL([strncasecmp], [
> +  AC_CHECK_DECLS([strncasecmp], [
>   AC_DEFINE([HWLOC_HAVE_DECL_STRNCASECMP], [1], [Define to 1 if function 
> `strncasecmp' is declared by system headers])
>])
>  ])
> [...]

Samuel pushed a better fix (already in master, I'll backport to v1.11.x
after checking the configure logs on our regression platform)

> windows-compatibility-changes.patch
> diff --git a/config/hwloc.m4 b/config/hwloc.m4
> index 49955a6..12230e1 100644
> --- a/config/hwloc.m4
> +++ b/config/hwloc.m4
> @@ -362,7 +362,7 @@ EOF])
>  #
>  
>  AC_CHECK_HEADERS([unistd.h])
> -AC_CHECK_HEADERS([dirent.h])
> +AC_CHECK_HEADERS([dirent.h], [hwloc_have_dirent=yes])
>  AC_CHECK_HEADERS([strings.h])
>  AC_CHECK_HEADERS([ctype.h])

I am dropping the dirent changes and just disabling hwloc-ps entirely on
Windows.

> +AC_CHECK_LIB([user32], [PostQuitMessage], [hwloc_have_user32="yes"])
>

The user32 part is applied.


> @@ -381,6 +381,21 @@ static __hwloc_inline int hwloc_strncasecmp(const char 
> *s1, const char *s2, size
>  #endif
>  }
>  
> +static __hwloc_inline int hwloc_strcasecmp(const char *s1, const char *s2)
> +{
> +#ifdef HWLOC_HAVE_DECL_STRCASECMP
> +  return strcasecmp(s1, s2);
> +#else
> +  while (1) {
> +char c1 = tolower(*s1), c2 = tolower(*s2);
> +if (!c1 || !c2 || c1 != c2)
> +  return c1-c2;
> +s1++; s2++;
> +  }
> +  return 0;
> +#endif
> +}
> +
>  static __hwloc_inline hwloc_obj_type_t 
> hwloc_cache_type_by_depth_type(unsigned depth, hwloc_obj_cache_type_t type)
>  {
>if (type == HWLOC_OBJ_CACHE_INSTRUCTION) {
> @@ -407,4 +422,25 @@ static __hwloc_inline int hwloc_obj_type_is_io 
> (hwloc_obj_type_t type)
>return type >= HWLOC_OBJ_BRIDGE && type <= HWLOC_OBJ_OS_DEVICE;
>  }
>  
> +#ifdef HWLOC_WIN_SYS
> +#  ifndef HAVE_SSIZE_T
> +typedef SSIZE_T ssize_t;
> +#  endif
> +#  ifndef HAVE_SNPRINTF
> +#define snprintf hwloc_snprintf
> +#  endif
> +#  if !HAVE_DECL_STRTOULL && !defined(HAVE_STRTOULL)
> +#define strtoull _strtoui64
> +#  endif
> +#  if !HAVE_DECL_S_ISREG
> +#define S_ISREG(mode) (mode & _S_IFREG)
> +#  endif
> +#  if !HAVE_DECL_S_ISDIR
> +#define S_ISDIR(mode) (mode & _S_IFDIR)
> +#  endif
> +#  ifndef HAVE_STRCASECMP
> +#define strcasecmp hwloc_strcasecmp
> +#  endif
> +#endif
> +
>  #endif /* HWLOC_PRIVATE_MISC_H */

Overall this looks OK.

In the MSVC project under contrib/windows/, we use a hardwired
hwloc_config.h which says:
typedef SSIZE_T ssize_t;
#define snprintf _snprintf
#define strcasecmp _stricmp
#define strncasecmp _strnicmp
#define strdup _strdup
#define strtoull _strtoui64
#define strtoll _strtoi64
#define S_ISREG(m) ((m)&_S_IFREG)
#define S_ISDIR( m ) (((m) & S_IFMT) == S_IFDIR)
#define putenv _putenv

strncasecmp and strtoll don't seem needed anymore.

For strdup and putenv, my MSVC fails with "The POSIX name for this item
is deprecated. Instead use the ISO C++ conformant name: _foo."
I wonder why you didn't have this problem?

Is _stricmp() OK instead of your code for hwloc_strcasecmp()?

Don't you have S_IFMT and S_IFREG/DIR without _ prefix?

Brice



Re: [hwloc-devel] Three patches for MSVC/ICL builds on Windows.

2016-04-05 Thread Brice Goglin
On 05/04/2016 10:26, Samuel Thibault wrote:
> The bug here is that that HWLOC_CHECK_DECL assumed that availability
> of the function was tested before, i.e.
>> conftest.c(96) : fatal error C1083: Cannot open include file: 'sched.h': No
>> such file or directory
> was unexpected.
>

Adding a check for sched.h availability before CHECK_DECL() might be
enough for Jonathan's case. I am not sure I want to change this m4 code
in v1.11.3 since it has been working fine for years.

Brice



Re: [hwloc-devel] Three patches for MSVC/ICL builds on Windows.

2016-04-04 Thread Brice Goglin
On 04/04/2016 21:39, Peyton, Jonathan L wrote:
>
> Hello everyone,
>
>  
>
> I’ve been working on a build using both MSVC and the Intel Windows
> compiler (ICL).  These three patches allow building of hwloc + utils.
>
>  
>
> 1) add-ifndef-guard-around-gnu-source.patch – this minor change only
> adds #ifndef _GNU_SOURCE inside the hwloc.m4 tests because it seems
> to be defined on Linux systems beforehand causing a warning in these
> autoconf tests.
>

Hello

I am pushing this one thanks.

> 2) use-ac-check-decl.patch – this change replaces the
> _HWLOC_CHECK_DECL() macro with the autoconf AC_CHECK_DECLS() macro. 
> The problem I was having concerned how _HWLOC_CHECK_DECL() worked. 
> It has an expected failure structure where if say, sched_setaffinity,
> is already defined, then the AC_COMPILE_IFELSE() macro will fail and
> say it **is** declared (the AC_MSG_RESULT([yes]) is in the “if-false”
> part of the check).  This is problematic when using MSVC because it
> will say that sched_setaffinity is declared when it really isn’t.  The
> comment for _HWLOC_CHECK_DECL is also outdated so I think this can
> be safely removed.
>

I am not very confident about this one because this is really something
that was needed in the past. Unfortunately the very old commit
075eff1d1dd64292ff421a95f06d0151f1c246b5 doesn't give any detail.
Looking at the hwloc-devel archives in early 2009/11, it's likely related
to some PGCC issues.

What problem did you actually see?

>
> 3) windows-compatibility-changes.patch – this change adds necessary
> autoconf checks that I needed to get MSVC/ICL to compile hwloc.  For
> instance, ssize_t wasn’t declared and is defined from SSIZE_T instead,
> S_ISREG isn’t defined in the windows headers so it is defined
> correctly when it doesn’t exist, etc.  This also introduced
> hwloc_strcasecmp() which is modeled after hwloc_strncasecmp().  If
> strcasecmp() isn’t defined, then hwloc_strcasecmp() is used instead. 
> These MSVC/ICL auxiliary defines are put in include/private/misc.h and
> this header was added to some source files that needed it.
>
>  
>

There are some easy pieces that I will commit soon.
There are some harder ones like changing the strtoull() stuff, I need to
spend some time making sure it doesn't break anything.
By the way, hwloc-ps uses dirent for reading /proc, I think we should
just always disable that program on Windows.

Brice



Re: [OMPI devel] [OMPI users] configuring open mpi 10.1.2 with cuda on NVIDIA TK1

2016-01-22 Thread Brice Goglin
Hello
hwloc doesn't have any cuda specific configure variables. We just use
standard variables like LIBS and CPPFLAGS. I guess OMPI could propagate
--with-cuda directories to hwloc by setting LIBS and CPPFLAGS before
running hwloc m4 functions, but I don't think OMPI actually cares about
hwloc reporting CUDA device locality anyway, and OMPI might stop
embedding hwloc in the near future.
Brice



On 22/01/2016 23:34, Sylvain Jeaugey wrote:
> [Moving To Devel]
>
> I tried to look at the configure to understand why the hwloc part
> failed at getting the CUDA path. I guess the --with-cuda information
> is not propagated to the hwloc part of the configure.
>
> If an m4 expert has an idea of how to do this the The Right Way, that
> would help.
>
> Thanks,
> Sylvain
>
> On 01/22/2016 10:07 AM, Sylvain Jeaugey wrote:
>> It looks like the errors are produced by the hwloc configure ; this
>> one somehow can't find CUDA (I have to check if that's a problem
>> btw). Anyway, later in the configure, the VT configure finds cuda
>> correctly, so it seems specific to the hwloc configure.
>>
>> On 01/22/2016 10:01 AM, Kuhl, Spencer J wrote:
>>>
>>> Hi Sylvain,
>>>
>>>
>>> The configure does not stop, 'make all install' completes.  After
>>> remaking and recompiling then ignoring the configure errors, and
>>> confirming both a functional cuda install and functional openmpi
>>> install.  I went to the /usr/local/cuda/samples directory and ran
>>> 'make' and succesfully ran 'simpleMPI' provided by NVIDIA.  The
>>> output suggested that everything works perfectly fine between
>>> openMPI and cuda on my Jetson TK1 install.  Because of this, I think
>>> it is as you suspected; it was just ./configure output noise.  
>>>
>>>
>>> What a frustrating exercise.  Thanks for the suggestion.  I think I
>>> can say 'case closed'
>>>
>>>
>>> Spencer
>>>
>>>
>>>
>>>
>>> 
>>> *From:* users  on behalf of Sylvain
>>> Jeaugey 
>>> *Sent:* Friday, January 22, 2016 11:34 AM
>>> *To:* us...@open-mpi.org
>>> *Subject:* Re: [OMPI users] configuring open mpi 10.1.2 with cuda on
>>> NVIDIA TK1
>>>  
>>> Hi Spencer,
>>>
>>> Could you be more specific about what fails ? Did the configure stop
>>> at some point ? Or is it a compile error during the build ?
>>>
>>> I'm not sure the errors you are seeing in config.log are actually
>>> the real problem (I'm seeing the same error traces on a perfectly
>>> working machine). Not pretty, but maybe just noise.
>>>
>>> Thanks,
>>> Sylvain
>>>
>>> On 01/22/2016 06:48 AM, Kuhl, Spencer J wrote:

 Thanks for the suggestion Ryan, I will remove the symlinks and
 start try again.  I checked config.log, and it appears that the
 configure finds cuda support, (result: yes), but once configure
 checks for cuda.h usability, conftest.c reports that a fatal error
 occurred, 'cuda.h no such file or directory.'   


 I have copied here some grep'ed output of config.log


 $ ./configure --prefix=/usr/local --with-cuda=/usr/local/cuda-6.5
 --enable-mpi-java
 configure:9829: checking if --with-cuda is set
 configure:9883: result: found (/usr/local/cuda-6.5/include/cuda.h)
 | #include <cuda.h>
 configure:10055: checking if have cuda support
 configure:10058: result: yes (-I/usr/local/cuda-6.5)
 configure:66435: result:  '--prefix=/usr/local'
 '--with-cuda=/usr/local/cuda-6.5' '--enable-mpi-java'
 configure:74182: checking cuda.h usability
 conftest.c:643:18: fatal error: cuda.h: No such file or directory
  #include <cuda.h>
 | #include <cuda.h>
 configure:74182: checking cuda.h presence
 conftest.c:610:18: fatal error: cuda.h: No such file or directory
  #include <cuda.h>
 | #include <cuda.h>
 configure:74182: checking for cuda.h
 configure:74265: checking cuda_runtime_api.h usability
 conftest.c:643:30: fatal error: cuda_runtime_api.h: No such file or
 directory
  #include <cuda_runtime_api.h>
 | #include <cuda_runtime_api.h>
 configure:74265: checking cuda_runtime_api.h presence
 conftest.c:610:30: fatal error: cuda_runtime_api.h: No such file or
 directory
  #include <cuda_runtime_api.h>
 | #include <cuda_runtime_api.h>
 configure:74265: checking for cuda_runtime_api.h
 configure:97946: running /bin/bash './configure' --disable-dns
 --disable-http --disable-rpc --disable-openssl
 --enable-thread-support --disable-evport  '--prefix=/usr/local'
 '--with-cuda=/usr/local/cuda-6.5' '--enable-mpi-java'
 --cache-file=/dev/null --srcdir=. --disable-option-checking
 configure:187066: result: verbs_usnic, ugni, sm, verbs, cuda
 configure:193532: checking for MCA component common:cuda compile mode
 configure:193585: checking if MCA component common:cuda can compile



 
 *From:* users  on behalf of
 Novosielski, Ryan 

Re: [hwloc-devel] Static analysis

2016-01-14 Thread Brice Goglin
Hello

I looked at the full list of warnings. It doesn't look like this tool
will be very useful to hwloc. Many warnings are false positives caused by
the lack of precise control flow analysis (dead code, leaks). Also the
report isn't precise enough to explain which control-flow would actually
cause a leak. Coverity Scan does a much better job at detecting and
reporting these and I already try to fix all warnings it reports.

Also the tool lacks some understanding of how some APIs work. For
instance it knows posix_memalign() allocates things but doesn't know
that it doesn't allocate when it returns non-0. Or it complains about
pthread_mutex() return values without looking at how we initialized the
mutex.

Finally some warnings seem Microsoft-specific :)
Function 'memcpy' is deprecated. Replace with more secure equivalent
like 'memcpy_s', add missing logic, or re-architect.
Function 'sprintf' is deprecated. Replace with more secure equivalent
like 'sprintf_s', add missing logic, or re-architect.

Coverity lets you mark some warnings as false-positive so that next runs
don't report them. If you can hide all the above, we could look at the
remaining ones. But right now there are so many warnings that it's hard
to focus on the real bugs :/

Brice



On 12/01/2016 14:26, Odzioba, Lukasz wrote:
> Hi,
> I use klocwork, which doesn't mean it is better it just reports different 
> subset of potential errors.
>
> Ignoring malloc errors is your design decision, I don't mind it. 
> From a debugging perspective it makes it easier to track down, since you have a 
> null ptr dereference somewhere near the malloc.
> Malloc might start failing repeatedly, or just fail once in the process's life (i.e. 
> some other process requested free memory for a short period of time); if an 
> app is able to survive that's nice, if not then, well, we have to live with that.
>
> Thanks,
> Lukas
>
>
> -Original Message-
> From: hwloc-devel [mailto:hwloc-devel-boun...@open-mpi.org] On Behalf Of 
> Brice Goglin
> Sent: Tuesday, January 12, 2016 12:57 PM
> To: hwloc-de...@open-mpi.org
> Subject: Re: [hwloc-devel] Static analysis
>
> Hello
>
> We're running coverity every night and I try to address most of what it
> reports (except the netloc/ directory in git master, which still needs a
> lot of work). What tool do you use?
>
> It's true we don't check malloc() return values in many cases (hopefully
> only the small allocations), mostly because we're lazy (and also because
> many other things would go wrong when malloc starts failing :/)
>
> Brice
>
>
>
> Le 12/01/2016 12:23, Odzioba, Lukasz a écrit :
>> Hi,
>> The static analysis tool we use has found quite a lot of potential issues in 
>> hwloc.
>> Most of them are of the "NULL ptr dereference" type, i.e. when a pointer is not 
>> checked for null after allocation, but there are some more interesting cases 
>> as well.
>> My team distributes hwloc as a part of software package and we could just 
>> ignore those, but I wanted to let you know in case you are interested in 
>> fixing some or all of them.
>>
>> Please let me know if you would like the full list and I'll prepare it.
>>
>> Thanks,
>> Lukas
>>
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/hwloc-devel/2016/01/4698.php
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2016/01/4699.php
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2016/01/4700.php



[hwloc-devel] new filter/ignore API

2015-10-29 Thread Brice Goglin
Hello

I just pushed the "type-filter" branch. It's supposed to replace the old
"ignore" API as well as ICACHES and IO topology flags in v2.0.

Main points:
* you may now enable, disable and query the "ignoring" (now called
filter) for any object type
* instruction caches and I/O don't need special topology flags anymore,
you may just set their filtering to KEEP_ALL
* there are 4 kinds of filter. KEEP_ALL and KEEP_NONE (obvious).
KEEP_STRUCTURE (similar to what we have in v1.x). KEEP_IMPORTANT (for
I/Os without too many useless objects).
* defaults are unchanged, both in the C API and in command-line tools

If you want everything (like lstopo), you just call
hwloc_topology_set_all_types_filter(topology,
HWLOC_TYPE_FILTER_KEEP_ALL). There are variants for just changing the
filter for all I/O types, all caches, or all instruction caches. Examples:
https://github.com/open-mpi/hwloc/blob/type-filter/utils/hwloc/hwloc-diff.c#L72
https://github.com/open-mpi/hwloc/blob/type-filter/doc/examples/sharedcaches.c#L35
https://github.com/open-mpi/hwloc/blob/type-filter/tests/hwloc/cuda.c#L37

Full API:
https://github.com/open-mpi/hwloc/blob/type-filter/include/hwloc.h#L1919

Brice



Re: [hwloc-devel] Check the return value of getenv() in, hwloc_plugin_check_namespace

2015-10-09 Thread Brice Goglin
This broken code is only used when a plugin fails to find core symbols.
That only happens in case of namespace issues (for instance when libhwloc is
loaded by another layer of plugin, something we advise not to do).

Thanks, I'll apply this.

Brice



On 09/10/2015 23:04, Guy Streeter wrote:
> I am not able to explain why this doesn't fail everywhere. If
> HWLOC_PLUGINS_VERBOSE is not set, atoi() gets called with a NULL pointer, and
> the behavior in that case is undocumented.
>
> --Guy
>
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/10/4627.php



Re: [OMPI devel] PMIX vs Solaris

2015-09-28 Thread Brice Goglin
Sorry, I didn't see this report before the pull request.

I applied Gilles' "simple but arguable" fix to master and stable
branches up to v1.9. It could be too imperfect if somebody ever changes
the permissions of /devices/pci* but I guess that's not going to happen
in practice. Finding the right device path and checking permissions
inside hwloc looks more arguable to me.
Thanks!

I am adding a filter to my email client to avoid missing hwloc-related
things among OMPI mails.

Brice




On 28/09/2015 06:23, Gilles Gouaillardet wrote:
> Paul and Brice,
>
> the error message is displayed by libpciaccess when hwloc invokes
> pci_system_init
>
> on Solaris:
> crw---   1 root sys  182, 253 Sep 28 10:55
> /devices/pci@0,0:reg
>
> from libpciaccess
>
>snprintf(nexus_path, sizeof(nexus_path), "/devices%s", nexus_name);
> if ((fd = open(nexus_path, O_RDWR | O_CLOEXEC)) >= 0) {
> [...]
> } else {
> (void) fprintf(stderr, "Error opening %s: %s\n",
>nexus_path, strerror(errno));
> [...]  
> }
>
> i noted some TODO comments in the code to handle this.
> since this piece of code is deep inside libpciaccess, i guess a fix is
> not trivial.
> unless libpciaccess is modified (for example, do not fprintf if a
> given environment variable is set),
> hwloc should "emulate" pieces of libpciaccess to get the devices path,
> check the permissions and
> invoke pci_system_init only if everything is ok.
>
>
> another, simpler (but arguable...) option is not to probe the PCI
> bus on Solaris unless running as root
> i made PR #136 https://github.com/open-mpi/hwloc/pull/136 to implement
> this
>
> Cheers,
>
> Gilles
>
> On 9/26/2015 9:24 AM, Paul Hargrove wrote:
>> FYI:
>>
>> Things look fine today with last night's master tarball.
>>
>> I hope Brice has a way to eliminate the hwloc warning, since I am
>> sure I am not the only one with scripts that will notice "Error" in
>> the output.
>>
>> -Paul
>>
>> On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain > > wrote:
>>
>> Aha! Thanks - just what the doctor ordered!
>>
>>
>>> On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet
>>> > wrote:
>>>
>>> Ralph,
>>>
>>> the root cause is
>>> getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...)
>>> fails with errno ENOPROTOOPT on solaris 11.2
>>>
>>> the attached patch is a proof of concept and works for me :
>>> /* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/21/2015 2:16 PM, Paul Hargrove wrote:
 Ralph,

 Just as you say:
 The first 64s pause was before the hwloc error message appeared.
 The second was after the second server_setup_fork appears, and
 before whatever line came after that.

 I don't know if stdio buffering may be "distorting" the
 placement of the pause relative to the lines of output.
 However, prior to your patch the entire failed mpirun was
 around 1s.

 No allocation.
 No resource manager.
 Just a single workstation.

 -Paul

 On Sun, Sep 20, 2015 at 9:32 PM, Ralph Castain
 > wrote:

 ?? Just so this old fossilized brain gets this right: you
 are saying there was a 64s pause before the hwloc error
 appeared, and then another 64s pause after the second
 server_setup_fork message appeared?

 If that’s true, then I’m chasing the wrong problem - it
 sounds like something is messed up in the mpirun startup.
 Did you have more than one node in the allocation by
 chance? I’m wondering if we are getting held up by
 something in the daemon launch/callback area.



> On Sep 20, 2015, at 4:08 PM, Paul Hargrove
>  wrote:
>
> Ralph,
>
> Still failing with that patch, but with the addition of a
> fairly long pause (64s) before the first error message
> appears, and again after the second "server setup_fork"
> (64s again)
>
> New output is attached.
>
> -Paul
>
> On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain
>  wrote:
>
> Argh - found a typo in the output line. Could you
> please try the attached patch and do it again? This
> might fix it, but if not it will provide me with some
> idea of the returned error.
>
> Thanks
> Ralph
>
>
>> On Sep 20, 2015, at 12:40 PM, Paul Hargrove
>>  wrote:
>>
>> Yes, it is definitely at 10.
>>   

Re: [OMPI devel] HWLOC issue

2015-09-10 Thread Brice Goglin
Try this patch (it applies to hwloc v1.9-v1.11, it should be OK against
OMPI's tree).
Your bridge 22:00.0 says it contains the master bus 00. It causes a
cycle in hwloc's insert algorithm, caught by the assertion. The patch
just removes this invalid bridge entirely.

Brice



On 10/09/2015 21:23, George Bosilca wrote:
> It used to work. Now I don't know exactly when I last updated the
> trunk version on the cluster, but not more than 10 days ago.
>
> lstopo complains with the same assert. Interestingly enough, the same
> binary succeed on the other nodes of the same cluster ...
>
>   George.
>
>
> On Thu, Sep 10, 2015 at 3:20 PM, Brice Goglin <brice.gog...@inria.fr
> <mailto:brice.gog...@inria.fr>> wrote:
>
> Did it work on the same machine before? Or did OMPI enable hwloc's
> PCI discovery recently?
>
> Does lstopo complain the same?
>
> Brice
>
>
>
> Le 10/09/2015 21:10, George Bosilca a écrit :
>> With the current trunk version I keep getting an assert deep down
>> in orted.
>>
>> orted:
>> 
>> ../../../../../../../ompi/opal/mca/hwloc/hwloc1110/hwloc/src/pci-common.c:177:
>> hwloc_pci_try_insert_siblings_below_new_bridge: Assertion `comp
>> != HWLOC_PCI_BUSID_SUPERSET' failed.
>>
>> The stack looks like this:
>>
>> [dancer18:21100] *** Process received signal ***
>> [dancer18:21100] Signal: Aborted (6)
>> [dancer18:21100] Signal code:  (-6)
>> [dancer18:21100] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fc22ce61710]
>> [dancer18:21100] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fc22caf0625]
>> [dancer18:21100] [ 2] /lib64/libc.so.6(abort+0x175)[0x7fc22caf1e05]
>> [dancer18:21100] [ 3] /lib64/libc.so.6(+0x2b74e)[0x7fc22cae974e]
>> [dancer18:21100] [ 4]
>> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fc22cae9810]
>> [dancer18:21100] [ 5]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0a62)[0x7fc22ddc6a62]
>> [dancer18:21100] [ 6]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0b60)[0x7fc22ddc6b60]
>> [dancer18:21100] [ 7]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_insert_pci_device_list+0x8f)[0x7fc22ddc724c]
>> [dancer18:21100] [ 8]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xbf2d6)[0x7fc22ddd52d6]
>> [dancer18:21100] [ 9]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xd22f7)[0x7fc22dde82f7]
>> [dancer18:21100] [10]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_topology_load+0x1a3)[0x7fc22dde8ee1]
>> [dancer18:21100] [11]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc_base_get_topology+0x80)[0x7fc22ddb6ece]
>> [dancer18:21100] [12]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_ess_base_orted_setup+0x127)[0x7fc22e0b3523]
>> [dancer18:21100] [13]
>> 
>> /home/bosilca/opt/trunk/debug/lib/openmpi/mca_ess_env.so(+0xe45)[0x7fc22c6bbe45]
>> [dancer18:21100] [14]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_init+0x2c6)[0x7fc22e06b55a]
>> [dancer18:21100] [15]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_daemon+0x5c1)[0x7fc22e09a895]
>> [dancer18:21100] [16]
>> /home/bosilca/opt/trunk/debug/bin/orted[0x40082a]
>> [dancer18:21100] [17]
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fc22cadcd5d]
>> [dancer18:21100] [18]
>> /home/bosilca/opt/trunk/debug/bin/orted[0x4006e9]
>>
>> Any ideas?
>>
>>   George.
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/09/17993.php
>
>
> ___
> devel mailing list
> de...@open-mpi.org <mailto:de...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17994.php
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17995.ph

Re: [OMPI devel] HWLOC issue

2015-09-10 Thread Brice Goglin
I guess it could be some invalid bus information in the PCI bridges. Maybe
try shutting the node down completely and restarting it. I've seen other
strange PCI issues disappear like this in the past...

Otherwise, please send the tarball generated by "hwloc-gather-topology
--io foo". Send it only to me; it will likely be big because --io
gathers many more sysfs files for PCI.

Brice




Le 10/09/2015 21:23, George Bosilca a écrit :
> It used to work. Now I don't know exactly when I last updated the
> trunk version on the cluster, but not more than 10 days ago.
>
> lstopo complains with the same assert. Interestingly enough, the same
> binary succeeds on the other nodes of the same cluster ...
>
>   George.
>
>
> On Thu, Sep 10, 2015 at 3:20 PM, Brice Goglin <brice.gog...@inria.fr
> <mailto:brice.gog...@inria.fr>> wrote:
>
> Did it work on the same machine before? Or did OMPI enable hwloc's
> PCI discovery recently?
>
> Does lstopo complain the same?
>
> Brice
>
>
>
> Le 10/09/2015 21:10, George Bosilca a écrit :
>> With the current trunk version I keep getting an assert deep down
>> in orted.
>>
>> orted:
>> 
>> ../../../../../../../ompi/opal/mca/hwloc/hwloc1110/hwloc/src/pci-common.c:177:
>> hwloc_pci_try_insert_siblings_below_new_bridge: Assertion `comp
>> != HWLOC_PCI_BUSID_SUPERSET' failed.
>>
>> The stack looks like this:
>>
>> [dancer18:21100] *** Process received signal ***
>> [dancer18:21100] Signal: Aborted (6)
>> [dancer18:21100] Signal code:  (-6)
>> [dancer18:21100] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fc22ce61710]
>> [dancer18:21100] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fc22caf0625]
>> [dancer18:21100] [ 2] /lib64/libc.so.6(abort+0x175)[0x7fc22caf1e05]
>> [dancer18:21100] [ 3] /lib64/libc.so.6(+0x2b74e)[0x7fc22cae974e]
>> [dancer18:21100] [ 4]
>> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fc22cae9810]
>> [dancer18:21100] [ 5]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0a62)[0x7fc22ddc6a62]
>> [dancer18:21100] [ 6]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0b60)[0x7fc22ddc6b60]
>> [dancer18:21100] [ 7]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_insert_pci_device_list+0x8f)[0x7fc22ddc724c]
>> [dancer18:21100] [ 8]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xbf2d6)[0x7fc22ddd52d6]
>> [dancer18:21100] [ 9]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xd22f7)[0x7fc22dde82f7]
>> [dancer18:21100] [10]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_topology_load+0x1a3)[0x7fc22dde8ee1]
>> [dancer18:21100] [11]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc_base_get_topology+0x80)[0x7fc22ddb6ece]
>> [dancer18:21100] [12]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_ess_base_orted_setup+0x127)[0x7fc22e0b3523]
>> [dancer18:21100] [13]
>> 
>> /home/bosilca/opt/trunk/debug/lib/openmpi/mca_ess_env.so(+0xe45)[0x7fc22c6bbe45]
>> [dancer18:21100] [14]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_init+0x2c6)[0x7fc22e06b55a]
>> [dancer18:21100] [15]
>> 
>> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_daemon+0x5c1)[0x7fc22e09a895]
>> [dancer18:21100] [16]
>> /home/bosilca/opt/trunk/debug/bin/orted[0x40082a]
>> [dancer18:21100] [17]
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fc22cadcd5d]
>> [dancer18:21100] [18]
>> /home/bosilca/opt/trunk/debug/bin/orted[0x4006e9]
>>
>> Any ideas?
>>
>>   George.
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org <mailto:de...@open-mpi.org>
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2015/09/17993.php
>
>
> ___
> devel mailing list
> de...@open-mpi.org <mailto:de...@open-mpi.org>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/17994.php
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17995.php



Re: [OMPI devel] HWLOC issue

2015-09-10 Thread Brice Goglin
Did it work on the same machine before? Or did OMPI enable hwloc's PCI
discovery recently?

Does lstopo complain the same?

Brice


Le 10/09/2015 21:10, George Bosilca a écrit :
> With the current trunk version I keep getting an assert deep down in
> orted.
>
> orted:
> ../../../../../../../ompi/opal/mca/hwloc/hwloc1110/hwloc/src/pci-common.c:177:
> hwloc_pci_try_insert_siblings_below_new_bridge: Assertion `comp !=
> HWLOC_PCI_BUSID_SUPERSET' failed.
>
> The stack looks like this:
>
> [dancer18:21100] *** Process received signal ***
> [dancer18:21100] Signal: Aborted (6)
> [dancer18:21100] Signal code:  (-6)
> [dancer18:21100] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fc22ce61710]
> [dancer18:21100] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x7fc22caf0625]
> [dancer18:21100] [ 2] /lib64/libc.so.6(abort+0x175)[0x7fc22caf1e05]
> [dancer18:21100] [ 3] /lib64/libc.so.6(+0x2b74e)[0x7fc22cae974e]
> [dancer18:21100] [ 4]
> /lib64/libc.so.6(__assert_perror_fail+0x0)[0x7fc22cae9810]
> [dancer18:21100] [ 5]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0a62)[0x7fc22ddc6a62]
> [dancer18:21100] [ 6]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xb0b60)[0x7fc22ddc6b60]
> [dancer18:21100] [ 7]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_insert_pci_device_list+0x8f)[0x7fc22ddc724c]
> [dancer18:21100] [ 8]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xbf2d6)[0x7fc22ddd52d6]
> [dancer18:21100] [ 9]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(+0xd22f7)[0x7fc22dde82f7]
> [dancer18:21100] [10]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc1110_hwloc_topology_load+0x1a3)[0x7fc22dde8ee1]
> [dancer18:21100] [11]
> /home/bosilca/opt/trunk/debug/lib/libopen-pal.so.0(opal_hwloc_base_get_topology+0x80)[0x7fc22ddb6ece]
> [dancer18:21100] [12]
> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_ess_base_orted_setup+0x127)[0x7fc22e0b3523]
> [dancer18:21100] [13]
> /home/bosilca/opt/trunk/debug/lib/openmpi/mca_ess_env.so(+0xe45)[0x7fc22c6bbe45]
> [dancer18:21100] [14]
> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_init+0x2c6)[0x7fc22e06b55a]
> [dancer18:21100] [15]
> /home/bosilca/opt/trunk/debug/lib/libopen-rte.so.0(orte_daemon+0x5c1)[0x7fc22e09a895]
> [dancer18:21100] [16] /home/bosilca/opt/trunk/debug/bin/orted[0x40082a]
> [dancer18:21100] [17]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7fc22cadcd5d]
> [dancer18:21100] [18] /home/bosilca/opt/trunk/debug/bin/orted[0x4006e9]
>
> Any ideas?
>
>   George.
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17993.php



Re: [OMPI devel] RFC: Remove --without-hwloc configure option

2015-09-04 Thread Brice Goglin
Le 04/09/2015 00:36, Gilles Gouaillardet a écrit :
> Ralph,
>
> just to be clear, your proposal is to abort if openmpi is configured
> with --without-hwloc, right ?
> ( the --with-hwloc option is not removed because we want to keep the
> option of using an external hwloc library )
>
> if I understand correctly, Paul's point is that if openmpi is ported
> to a new architecture for which hwloc has not been ported yet
> (embedded hwloc or external hwloc), then the very first step is to
> port hwloc before ompi can be built.
>
> did I get it right Paul ?
>
> Brice, what would happen in such a case ?
> embedded hwloc cannot be built ?
> hwloc returns little or no information ?

If it's a new operating system and it supports at least things like
sysconf, you will get a Machine object with one PU per logical processor.

If it's a new platform running Linux, the platform is supposed to report
at least package/core/thread information to Linux. That's what we have
for ARM, for instance.

Missing topology detection can be worked around easily (with XML or a
synthetic description, which is what we did for BlueGene/Q before adding
manual support for that specific processor). Binding support can't.
And once you have binding, you get x86 topology detection (using cpuid)
even if the operating system isn't supported.

> for example, on a Fujitsu FX10 node (single socket, 16 cores), hwloc
> reports 16 sockets with one core each and no cache. Though this is not
> correct, OMPI can treat it as equivalent to the real config, so this is
> not really an issue for OMPI.

Can you help fix this?

The issue is indeed with supercomputers with uncommon architectures like
this one.

Brice


>
> Cheers,
>
> Gilles
>
> On Friday, September 4, 2015, Ralph Castain  > wrote:
>
> No - hwloc is embedded in OMPI anyway.
>
>> On Sep 3, 2015, at 11:09 AM, Paul Hargrove > > wrote:
>>
>>
>> On Thu, Sep 3, 2015 at 8:03 AM, Ralph Castain > > wrote:
>>
>> Does anyone know of a reason why we shouldn’t do this?
>>
>>
>>
>> Would doing this mean that a port to a new system would require
>> that one first perform a full hwloc port?
>>
>> -Paul
>>
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> 
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> 
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/09/17942.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/09/17952.php



Re: [OMPI devel] Dual rail IB card problem

2015-09-01 Thread Brice Goglin
Le 01/09/2015 15:59, marcin.krotkiewski a écrit :
> Dear Rolf and Brice,
>
> Thank you very much for your help. I have now moved the 'dubious' IB
> card from Slot 1 to Slot 5. It is now reported by hwloc as bound to a
> separate NUMA node. In this case OpenMPI works as could be expected:
>
>  - NUMA nodes that 'own' a card decide to use only that card (the
> other card is too far away)
>  - NUMA nodes that have no card decide to use both cards and both
> ports of each card, because they are equally far away
>
> I do not have practical experience with this, so I am not sure whether
> this is the best performing behavior, but I guess it does look reasonable.
>
> Another question is, how should OpenMPI deal with the hw configuration
> that originally caused problems. Now it is assumed that the distance
> to the 'dubious' card is 0 (close to everybody) and the other,
> 'correctly' inserted card is never used. Maybe it would be better to
> assign the maximum distance/latency instead of 0? It seems that in
> this scenario the system (at least my hw configuration) would
> effectively behave exactly as described above: 2 NUMA nodes use 1
> card, the other 2 use both cards. And in either case it would at least
> not eliminate the other card(s) from being used.

My feeling is: don't bother. I think all Intel processors now have their
PCI controllers inside sockets, AMD PCI controllers are connected to a
single NUMA node, same for Power8. So you're not supposed to get such
'dubious' cards anymore. If the BIOS is buggy, we have hwloc environment
variables to workaround it.

Now, how do you detect the issue? I wonder if hwloc should autofix
these, or at least issue a warning. We already do that in some cases for
E5v3. We could just warn whenever "Sandy Bridge or later" reports
PCI locality larger than a single NUMA node. However, the warning would
be quite common given how buggy BIOSes are.

Brice



Re: [OMPI devel] Dual rail IB card problem

2015-09-01 Thread Brice Goglin
Hello

It's a float because we normalize to 1 on the diagonal (some AMD
machines have values like 10 on the diagonal and 16 or 22 otherwise, so
you get 1.0, 1.6 or 2.2 after normalization), and also because some users
wanted to specify their own distance matrix.

I'd like to clean up the distance API in hwloc 2.0. Current ideas are:
1) Remove normalization+float? Should be possible.
2) Only support distance matrices that cover the entire machine?
Likely fine too.
3) Remove the ability for users to specify distances manually? It's
useful for adding locality based on benchmarks when the BIOS/kernel
doesn't report enough. Need to talk with users.
4) Only support NUMA distances. Depends on (3).
Comments are welcome.

Brice



Le 01/09/2015 01:50, Gilles Gouaillardet a écrit :
> Brice,
>
> as a side note, what is the rationale for defining the distance as a
> floating point number ?
>
> i remember i had to fix a bug in ompi a while ago
> /* e.g. replace if (d1 == d2) with if((d1-d2) < epsilon) */
>
> Cheers,
>
> Gilles
>
> On 9/1/2015 5:28 AM, Brice Goglin wrote:
>> The locality of mlx4_0 as reported by lstopo is "near the entire
>> machine" (while mlx4_1 is reported near NUMA node #3). I would vote
>> for buggy PCI-NUMA affinity being reported by the BIOS. But I am not
>> very familiar with 4x E5-4600 machines, so please make sure this PCI
>> slot is really attached to a single NUMA node (some older 4-socket
>> machines have some I/O hub attached to 2 sockets).
>>
>> Given the lspci output, mlx4_0 is likely on the PCI bus attached to
>> NUMA node #0, so you should be able to work-around the issue by
>> setting HWLOC_PCI__00_LOCALCPUS=0xfff in the environment.
>>
>> There are 8 hostbridges in this machine, 2 attached to each
>> processor, there are likely similar issues for others.
>>
>> Brice
>>
>>
>>
>> Le 31/08/2015 22:06, Rolf vandeVaart a écrit :
>>>
>>> There was a problem reported on the User's list about Open MPI
>>> always picking one Mellanox card when they were two in the machine.
>>>
>>>
>>> http://www.open-mpi.org/community/lists/users/2015/08/27507.php
>>>
>>>
>>> We dug a little deeper and I think this has to do with how hwloc is
>>> figuring out where one of the cards is located.  This verbose output
>>> (with some extra printfs) shows that it cannot figure out which NUMA
>>> node mlx4_0 is closest to. It can only determine it is located on
>>> HWLOC_OBJ_SYSTEM, and therefore Open MPI assumes a distance of 0.0.
>>> Because of this (smaller is better), the Open MPI library always picks
>>> mlx4_0 for all sockets.  I am trying to figure out if this is a
>>> hwloc or Open MPI bug. Any thoughts on this?
>>>
>>>
>>> [node1.local:05821] Checking distance for device=mlx4_1
>>> [node1.local:05821] hwloc_distances->nbobjs=4
>>> [node1.local:05821] hwloc_distances->latency[0]=1.00
>>> [node1.local:05821] hwloc_distances->latency[1]=2.10
>>> [node1.local:05821] hwloc_distances->latency[2]=2.10
>>> [node1.local:05821] hwloc_distances->latency[3]=2.10
>>> [node1.local:05821] hwloc_distances->latency[4]=2.10
>>> [node1.local:05821] hwloc_distances->latency[5]=1.00
>>> [node1.local:05821] hwloc_distances->latency[6]=2.10
>>> [node1.local:05821] hwloc_distances->latency[7]=2.10
>>> [node1.local:05821] ibv_obj->type = 4
>>> [node1.local:05821] ibv_obj->logical_index=1
>>> [node1.local:05821] my_obj->logical_index=0
>>> [node1.local:05821] Proc is bound: distance=2.10
>>>
>>> [node1.local:05821] Checking distance for device=mlx4_0
>>> [node1.local:05821] hwloc_distances->nbobjs=4
>>> [node1.local:05821] hwloc_distances->latency[0]=1.00
>>> [node1.local:05821] hwloc_distances->latency[1]=2.10
>>> [node1.local:05821] hwloc_distances->latency[2]=2.10
>>> [node1.local:05821] hwloc_distances->latency[3]=2.10
>>> [node1.local:05821] hwloc_distances->latency[4]=2.10
>>> [node1.local:05821] hwloc_distances->latency[5]=1.00
>>> [node1.local:05821] hwloc_distances->latency[6]=2.10
>>> [node1.local:05821] hwloc_distances->latency[7]=2.10
>>> [node1.local:05821] ibv_obj->type = 1
>>> <-HWLOC_OBJ_MACHINE
>>> [node1.local:05821] ibv_obj->type set to NULL
>>> [node1.local:05821] Proc is bound: distance=0.00
>>>
>>> [node1.local:05821] [rank=0] openib: skipping d

Re: [hwloc-devel] Add support for PCIe drives

2015-08-31 Thread Brice Goglin
I applied a slightly different patch to v1.11 (nothing is needed in
master since the discovery logic is different and more generic).
thanks
Brice



Le 28/08/2015 21:53, Tannenbaum, Barry M a écrit :
>
> PCIe drives (like the Intel DC P3500/P3600/P3700) do not have a
> controller – they appear directly on the PCIe bus.
>
>
> support-pcie-disk.patch
>
>
> diff --git a/src/topology-linux.c b/src/topology-linux.c
> --- a/src/topology-linux.c
> +++ b/src/topology-linux.c
> @@ -4656,6 +4656,11 @@
>/* restore parent path */
>pathlen -= devicedlen;
>path[pathlen] = '\0';
> +} else if (strcmp(devicedirent->d_name, "block") == 0) {
> +  /* found a block device - lookup block class for real */
> +  res += hwloc_linux_class_readdir(backend, pcidev, path,
> +   HWLOC_OBJ_OSDEV_BLOCK, "block",
> +   hwloc_linux_block_class_fillinfos);
>  }
>}
>closedir(devicedir);
>
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/08/4581.php



Re: [OMPI devel] Dual rail IB card problem

2015-08-31 Thread Brice Goglin
The locality of mlx4_0 as reported by lstopo is "near the entire
machine" (while mlx4_1 is reported near NUMA node #3). I would vote for
buggy PCI-NUMA affinity being reported by the BIOS. But I am not very
familiar with 4x E5-4600 machines, so please make sure this PCI slot is
really attached to a single NUMA node (some older 4-socket machines have
some I/O hub attached to 2 sockets).

Given the lspci output, mlx4_0 is likely on the PCI bus attached to NUMA
node #0, so you should be able to work-around the issue by setting
HWLOC_PCI__00_LOCALCPUS=0xfff in the environment.

There are 8 hostbridges in this machine, 2 attached to each processor,
there are likely similar issues for others.

Brice



Le 31/08/2015 22:06, Rolf vandeVaart a écrit :
>
> There was a problem reported on the User's list about Open MPI always
> picking one Mellanox card when they were two in the machine.
>
>
> http://www.open-mpi.org/community/lists/users/2015/08/27507.php
>
>
> We dug a little deeper and I think this has to do with how hwloc is
> figuring out where one of the cards is located.  This verbose output
> (with some extra printfs) shows that it cannot figure out which NUMA
> node mlx4_0 is closest to. It can only determine it is located on
> HWLOC_OBJ_SYSTEM, and therefore Open MPI assumes a distance of 0.0.
> Because of this (smaller is better), the Open MPI library always picks
> mlx4_0 for all sockets.  I am trying to figure out if this is a hwloc
> or Open MPI bug. Any thoughts on this?
>
>
> [node1.local:05821] Checking distance for device=mlx4_1
> [node1.local:05821] hwloc_distances->nbobjs=4
> [node1.local:05821] hwloc_distances->latency[0]=1.00
> [node1.local:05821] hwloc_distances->latency[1]=2.10
> [node1.local:05821] hwloc_distances->latency[2]=2.10
> [node1.local:05821] hwloc_distances->latency[3]=2.10
> [node1.local:05821] hwloc_distances->latency[4]=2.10
> [node1.local:05821] hwloc_distances->latency[5]=1.00
> [node1.local:05821] hwloc_distances->latency[6]=2.10
> [node1.local:05821] hwloc_distances->latency[7]=2.10
> [node1.local:05821] ibv_obj->type = 4
> [node1.local:05821] ibv_obj->logical_index=1
> [node1.local:05821] my_obj->logical_index=0
> [node1.local:05821] Proc is bound: distance=2.10
>
> [node1.local:05821] Checking distance for device=mlx4_0
> [node1.local:05821] hwloc_distances->nbobjs=4
> [node1.local:05821] hwloc_distances->latency[0]=1.00
> [node1.local:05821] hwloc_distances->latency[1]=2.10
> [node1.local:05821] hwloc_distances->latency[2]=2.10
> [node1.local:05821] hwloc_distances->latency[3]=2.10
> [node1.local:05821] hwloc_distances->latency[4]=2.10
> [node1.local:05821] hwloc_distances->latency[5]=1.00
> [node1.local:05821] hwloc_distances->latency[6]=2.10
> [node1.local:05821] hwloc_distances->latency[7]=2.10
> [node1.local:05821] ibv_obj->type = 1
> <-HWLOC_OBJ_MACHINE
> [node1.local:05821] ibv_obj->type set to NULL
> [node1.local:05821] Proc is bound: distance=0.00
>
> [node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too
> far away
> [node1.local:05821] [rank=0] openib: using port mlx4_0:1
> [node1.local:05821] [rank=0] openib: using port mlx4_0:2
>
>
> Machine (1024GB)
>   NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
> L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU
> L#0 (P#0)
> L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU
> L#1 (P#1)
> L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU
> L#2 (P#2)
> L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU
> L#3 (P#3)
> L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU
> L#4 (P#4)
> L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU
> L#5 (P#5)
> L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU
> L#6 (P#6)
> L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU
> L#7 (P#7)
> L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU
> L#8 (P#8)
> L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU
> L#9 (P#9)
> L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 +
> PU L#10 (P#10)
> L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 +
> PU L#11 (P#11)
>   NUMANode L#1 (P#1 256GB)
> Socket L#1 + L3 L#1 (30MB)
>   L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
> + PU L#12 (P#12)
>   L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
> + PU L#13 (P#13)
>   L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
> + PU L#14 (P#14)
>   L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
> + PU L#15 (P#15)
>   L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
> + PU L#16 (P#16)
>   L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
> + PU L#17 (P#17)
>   L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + 

Re: [hwloc-devel] Add support for PCIe drives

2015-08-28 Thread Brice Goglin
Le 28/08/2015 21:53, Tannenbaum, Barry M a écrit :
>
> PCIe drives (like the Intel DC P3500/P3600/P3700) do not have a
> controller – they appear directly on the PCIe bus.
>
>

Ah nice, I needed access to a machine with such a disk before adding
this support.

Unfortunately, the I/O device discovery code was totally reworked in
master. Can you try git master to see if it works? Or better, run
"hwloc-gather-topology foo" (from master) and send the resulting
foo.tar.bz2 ? (send it to me in private, it will be large).

thanks
Brice



Re: [OMPI devel] 1.10.0rc6 - slightly different mx problem

2015-08-25 Thread Brice Goglin
Le 25/08/2015 05:59, Christopher Samuel a écrit :
>
> INRIA does have Open-MX (Myrinet Express over Generic Ethernet
> Hardware), last release December 2014.  No idea if it's still developed
> or used..
>
> http://open-mx.gforge.inria.fr/
>
> Brice?
>
> Open-MPI is listed as working with it there. ;-)
>

It's not developed anymore. New releases just fix support for newer
kernels as long as the fix is easy.
There are still a couple of users, but I guess OMPI 1.8 is enough for them.

Brice



Re: [hwloc-devel] hwloc on Windows?

2015-08-17 Thread Brice Goglin
Nobody from Microsoft as far as I know.
Several Intel people are looking at hwloc and/or contributing, I'll send
the list in a private email.

Brice



Le 17/08/2015 19:38, Tannenbaum, Barry M a écrit :
>
> Thanks.
>
>  
>
> Are there any other Intel or Microsoft developers on this project?
>
>  
>
> -Barry
>
>  
>
> *From:*hwloc-devel [mailto:hwloc-devel-boun...@open-mpi.org] *On
> Behalf Of *Brice Goglin
> *Sent:* Monday, August 17, 2015 1:36 PM
> *To:* Hardware locality development list
> *Subject:* Re: [hwloc-devel] hwloc on Windows?
>
>  
>
> The master branch in git is under heavy development for the future
> v2.0. You should likely avoid it for now.
> v1.11 tarball or v1.11 git branch is fine (and would likely require
> the same changes for Windows).
>
> Brice
>
>
>
> Le 17/08/2015 19:17, Tannenbaum, Barry M a écrit :
>
> I pulled 1.11.0 from the download site
> (http://www.open-mpi.org/software/hwloc/v1.11/). Would you prefer
> me to work from the git repository?
>
>  
>
>     -    Barry
>
>  
>
> *From:*hwloc-devel [mailto:hwloc-devel-boun...@open-mpi.org] *On
> Behalf Of *Brice Goglin
> *Sent:* Monday, August 17, 2015 12:24 PM
> *To:* Hardware locality development list
> *Subject:* Re: [hwloc-devel] hwloc on Windows?
>
>  
>
> autoconf/configure just puts it with the others in the main
> autogen/config.h file:
> include/private/autogen/config.h:#define HWLOC_VERSION "2.0.0a1-git"
> include/private/autogen/config.h:#define PACKAGE_VERSION "2.0.0a1-git"
> include/private/autogen/config.h:#define VERSION "2.0.0a1-git"
>
> Brice
>
>
>
> Le 17/08/2015 17:39, Tannenbaum, Barry M a écrit :
>
> If you can give me a clue where HWLOC_VERSION is **supposed**
> to be defined, I’ll run through and at least get rid of the
> annoying warnings that Visual Studio is spewing. Things like
>     size conversion warnings and signed/unsigned mismatches.
>
>  
>
> -Barry
>
>  
>
> *From:*hwloc-devel [mailto:hwloc-devel-boun...@open-mpi.org]
> *On Behalf Of *Brice Goglin
> *Sent:* Saturday, August 15, 2015 2:46 AM
> *To:* hwloc-de...@open-mpi.org <mailto:hwloc-de...@open-mpi.org>
> *Subject:* Re: [hwloc-devel] hwloc on Windows?
>
>  
>
> Le 14/08/2015 23:44, Tannenbaum, Barry M a écrit :
>
> I’m trying to build/use hwloc on Windows.
>
>  
>
> The first question is does hwloc do anything to explore
> the storage devices on a Windows system?
>
>
> Hello
>
> Not yet. If I remember correctly, the main issue for I/O
> devices on Windows is that PCI locality is only available since
> Windows 8 (or something "recent" that I don't have). We have had
> https://github.com/open-mpi/hwloc/issues/108 open for a while
> but couldn't really look at it.
> If you have patches, we'll be happy to integrate them,
> assuming we don't need very recent windows releases :/
>
>
>
>
> The second question is how do you build hwloc on Windows?
> I’m building with Cygwin and Visual Studio 2013. I managed
> to coerce the configuration script to run, but when I
> tried to issue the “make” command, it bombed out, starting
> with a message from cl that it didn’t know the “-g” option.
>
>
> I don't know how to solve this BUT I recently found that hwloc
> 1.11 was being added to the cygwin distribution.
>   https://cygwin.com/ml/cygwin/2015-06/msg00418.html
> Could this help?
>
>
>
>
>  
>
> Trying to use the Visual Studio project provided, I got a
> pile of warnings, and then errors looking for definitions
> of HWLOC_VERSION.
>
>
> Unfortunately, we have no way to automate the testing of these
> files, and they get outdated quickly. They were written for
> 1.9. Since 1.10, we use HWLOC_VERSION instead of VERSION
> everywhere in the code, so I guess the project files should be
> updated to define HWLOC_VERSION as well.
>
>
>
>
>  
>
> Obviously I’m doing this wrong. Can someone suggest how to
> build hwloc on Windows?
>
>  
>
>
> Looks like you're not doing anything wrong. You're just unlucky,
> being one of the very few people who try to build recent releases :/

Re: [hwloc-devel] hwloc on Windows?

2015-08-17 Thread Brice Goglin
The master branch in git is under heavy development for the future v2.0.
You should likely avoid it for now.
v1.11 tarball or v1.11 git branch is fine (and would likely require the
same changes for Windows).

Brice



Le 17/08/2015 19:17, Tannenbaum, Barry M a écrit :
>
> I pulled 1.11.0 from the download site
> (http://www.open-mpi.org/software/hwloc/v1.11/). Would you prefer me
> to work from the git repository?
>
>  
>
> -Barry
>
>  
>
> *From:*hwloc-devel [mailto:hwloc-devel-boun...@open-mpi.org] *On
> Behalf Of *Brice Goglin
> *Sent:* Monday, August 17, 2015 12:24 PM
> *To:* Hardware locality development list
> *Subject:* Re: [hwloc-devel] hwloc on Windows?
>
>  
>
> autoconf/configure just puts it with the others in the main autogen/config.h
> file:
> include/private/autogen/config.h:#define HWLOC_VERSION "2.0.0a1-git"
> include/private/autogen/config.h:#define PACKAGE_VERSION "2.0.0a1-git"
> include/private/autogen/config.h:#define VERSION "2.0.0a1-git"
>
> Brice
>
>
>
> Le 17/08/2015 17:39, Tannenbaum, Barry M a écrit :
>
> If you can give me a clue where HWLOC_VERSION is **supposed** to
> be defined, I’ll run through and at least get rid of the annoying
> warnings that Visual Studio is spewing. Things like size
> conversion warnings and signed/unsigned mismatches.
>
>  
>
> -Barry
>
>  
>
> *From:*hwloc-devel [mailto:hwloc-devel-boun...@open-mpi.org] *On
> Behalf Of *Brice Goglin
> *Sent:* Saturday, August 15, 2015 2:46 AM
> *To:* hwloc-de...@open-mpi.org <mailto:hwloc-de...@open-mpi.org>
> *Subject:* Re: [hwloc-devel] hwloc on Windows?
>
>  
>
> Le 14/08/2015 23:44, Tannenbaum, Barry M a écrit :
>
> I’m trying to build/use hwloc on Windows.
>
>  
>
> The first question is does hwloc do anything to explore the
> storage devices on a Windows system?
>
>
> Hello
>
> Not yet. If I remember correctly, the main issue for I/O devices
> on Windows is that PCI locality is only available since Windows 8 (or
> something "recent" that I don't have). We have had
> https://github.com/open-mpi/hwloc/issues/108 open for a while but
> couldn't really look at it.
> If you have patches, we'll be happy to integrate them, assuming we
> don't need very recent windows releases :/
>
>
>
> The second question is how do you build hwloc on Windows? I’m
> building with Cygwin and Visual Studio 2013. I managed to
> coerce the configuration script to run, but when I tried to
> issue the “make” command, it bombed out, starting with a
> message from cl that it didn’t know the “-g” option.
>
>
> I don't know how to solve this BUT I recently found that hwloc
> 1.11 was being added to the cygwin distribution.
>   https://cygwin.com/ml/cygwin/2015-06/msg00418.html
> Could this help?
>
>
>
>  
>
> Trying to use the Visual Studio project provided, I got a pile
> of warnings, and then errors looking for definitions of
> HWLOC_VERSION.
>
>
> Unfortunately, we have no way to automate the testing of these
> files, and they get outdated quickly. They were written for 1.9.
> Since 1.10, we use HWLOC_VERSION instead of VERSION everywhere in
> the code, so I guess the project files should be updated to define
> HWLOC_VERSION as well.
>
>
>
>  
>
> Obviously I’m doing this wrong. Can someone suggest how to
> build hwloc on Windows?
>
>  
>
>
> Looks like you're not doing anything wrong. You're just unlucky,
> being one of the very few people who try to build recent
> releases :/
>
> Brice
>
>
>
>
> ___
>
> hwloc-devel mailing list
>
> hwloc-de...@open-mpi.org <mailto:hwloc-de...@open-mpi.org>
>
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/08/4551.php
>
>  
>
>
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/08/4553.php



Re: [hwloc-devel] hwloc on Windows?

2015-08-15 Thread Brice Goglin
On 14/08/2015 23:44, Tannenbaum, Barry M wrote:
>
> I’m trying to build/use hwloc on Windows.
>
>  
>
> The first question is does hwloc do anything to explore the storage
> devices on a Windows system?
>

Hello

Not yet. If I remember correctly, the main issue for I/O devices on
Windows is that PCI locality is only available since Windows 8 (or something
"recent" I don't have). We have had
https://github.com/open-mpi/hwloc/issues/108 open for a while but
couldn't really look at it.
If you have patches, we'll be happy to integrate them, assuming they don't
require very recent Windows releases :/

> The second question is how do you build hwloc on Windows? I’m building
> with Cygwin and Visual Studio 2013. I managed to coerce the
> configuration script to run, but when I tried to issue the “make”
> command, it bombed out, starting with a message from cl that it didn’t
> know the “-g” option.
>

I don't know how to solve this BUT I recently found that hwloc 1.11 was
being added to the cygwin distribution.
  https://cygwin.com/ml/cygwin/2015-06/msg00418.html
Could this help?

>  
>
> Trying to use the Visual Studio project provided, I got a pile of
> warnings, and then errors looking for definitions of HWLOC_VERSION.
>

Unfortunately, we have no way to automate the testing of these files,
and they get outdated quickly. They were written for 1.9. Since 1.10, we
use HWLOC_VERSION instead of VERSION everywhere in the code, so I guess
the project files should be updated to define HWLOC_VERSION as well.

>  
>
> Obviously I’m doing this wrong. Can someone suggest how to build hwloc
> on Windows?
>
>

Looks like you're not doing anything wrong. You're just unlucky, being
one of the very few people who try to build recent releases :/

Brice



Re: [OMPI devel] 1.8.8rc1 testing report

2015-07-31 Thread Brice Goglin
It was renamed from cpuid.h to cpuid-x86.h at some point. Can't check from here 
but the actual code should be the same in all these branches.
Brice


On 31 July 2015 at 22:19:47 UTC+02:00, Ralph Castain wrote:
>Yo Paul
>
>1.8.8 and 1.10 do not have hwloc-1.11 in them - they remain on
>hwloc-1.9. The referenced commit doesn’t apply to that version of hwloc
>because the affected file doesn’t exist there.
>
>
>> On Jul 31, 2015, at 9:07 AM, Paul Hargrove 
>wrote:
>> 
>> My testing has completed all but the last few QEMU-emulated ARM and
>MIPS platforms.
>> However I do have complete (successful) results from 1 MIPS and 2 ARM
>platforms at this point.
>> 
>> The only issue I encountered is one we learned of with 1.10.0rc2:
>"pgcc -m32" has issues with some inline asm in hwloc-1.11.0
>> Since hwloc's v1.11 branch has been updated to resolve that issue, I
>suggest cherry-picking the commit
>(https://github.com/open-mpi/hwloc/commit/46deaebf
>) that addresses
>this particular issue.
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>http://www.open-mpi.org/community/lists/devel/2015/07/17724.php
>
>
>
>
>
>___
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2015/07/17725.php


Re: [hwloc-devel] hwloc-1.11 failure with pgi compiler

2015-07-28 Thread Brice Goglin
On 28/07/2015 16:23, Samuel Thibault wrote:
> Brice Goglin, on Tue 28 Jul 2015 16:13:49 +0200, wrote:
>> and your commit is slightly different: (s/xchg/mov/ and removed last line).
> xchg is spurious here, mov is enough.  I didn't remove the last line, I
> just kept the original source, which uses +a instead of =a and a.
>
>> FWIW, in master we don't have multiple inlining anymore (there's a
>> wrapper function calling this inline asm).
> You mean the cpuid_or_from_dump function?
>

Yes.
Brice



Re: [hwloc-devel] hwloc-1.11 failure with pgi compiler

2015-07-28 Thread Brice Goglin
On 28/07/2015 15:55, Samuel Thibault wrote:
> Hello,
>
> Paul Hargrove, on Mon 20 Jul 2015 23:12:10 -0700, wrote:
>> I believe the following inline x86 asm is correct and more robust than the
>> existing code that pgi appears to reject:
> Indeed, in the 32bit case, we don't need to shuffle between 32 and 64bit
> values, so it's simpler to just use a register. It's surprising that
> letting the compiler decide the register fails more than just specifying
> SD, but since wide testing shows that, let's go with it.

My main concern about all this is that we talked about changing "m" into
"SD" but Paul's patch did much more than that:

--- a/include/private/cpuid-x86.h
+++ b/include/private/cpuid-x86.h
@@ -72,14 +72,12 @@ static __hwloc_inline void hwloc_x86_cpuid(unsigned *eax, 
unsigned *ebx, unsigne
   : "+a" (*eax), "=m" (*ebx), "=&r"(sav_rbx),
 "+c" (*ecx), "=d" (*edx));
 #elif defined(HWLOC_X86_32_ARCH)
-  unsigned long sav_ebx;
   __asm__(
-  "mov %%ebx,%2\n\t"
+  "xchg %%ebx,%1\n\t"
   "cpuid\n\t"
-  "xchg %2,%%ebx\n\t"
-  "movl %k2,%1\n\t"
-  : "+a" (*eax), "=m" (*ebx), "=&r"(sav_ebx),
-"+c" (*ecx), "=d" (*edx));
+  "xchg %%ebx,%1\n\t"
+  : "=a" (*eax), "=SD" (*ebx), "=c" (*ecx), "=d" (*edx)
+  : "0" (*eax), "2" (*ecx));
 #else
 #error unknown architecture
 #endif


and your commit is slightly different: (s/xchg/mov/ and removed last line).

--- a/include/private/cpuid-x86.h
+++ b/include/private/cpuid-x86.h
@@ -72,14 +72,11 @@ static __hwloc_inline void hwloc_x86_cpuid(unsigned *eax, 
unsigned *ebx, unsigne
   : "+a" (*eax), "=m" (*ebx), "=&r"(sav_rbx),
 "+c" (*ecx), "=d" (*edx));
 #elif defined(HWLOC_X86_32_ARCH)
-  unsigned long sav_ebx;
   __asm__(
-  "mov %%ebx,%2\n\t"
+  "mov %%ebx,%1\n\t"
   "cpuid\n\t"
-  "xchg %2,%%ebx\n\t"
-  "movl %k2,%1\n\t"
-  : "+a" (*eax), "=m" (*ebx), "=&r"(sav_ebx),
-"+c" (*ecx), "=d" (*edx));
+  "xchg %%ebx,%1\n\t"
+  : "+a" (*eax), "=SD" (*ebx), "+c" (*ecx), "=d" (*edx));
 #else
 #error unknown architecture
 #endif

Without much explanation in git log, the history of all these fragile
asm changes becomes quite hard to read for normal people :/

My regression testing is happy so far, but it would be nice if Paul
could check again.

> I'm however afraid that this code has again posed problem, even if we
> do test its compilation in configure.ac.  I'm wondering: instead of
> insisting on inlining this function, we should perhaps just put it in a
> separate .c file, which we try to compile from configure.ac exactly the
> same way as it will be for libhwloc.so?
>

FWIW, in master we don't have multiple inlining anymore (there's a
wrapper function calling this inline asm).

Brice



Re: [hwloc-devel] "make check" of 1.11 broken on x86 RedHat 8

2015-07-26 Thread Brice Goglin
Maybe try this. It should disable the entire BGQ backend
cross-build testing when Linux doesn't have enough pthread/cpuset support.

Brice



On 21/07/2015 22:02, Paul Hargrove wrote:
> I was, at Brice's request, trying out the hwloc-1.11.0 release on all
> sorts of x86 systems, with and without a patch for the inline asm for
> the cpuid instruction.
>
> I came across the following UNRELATED error during "make check" on a
> (very old) Red Hat 8 system (that would be something like "Fedora
> negative-3"):
>
> make[3]: Entering directory
> `/home/pcp1/phargrov/OMPI/hwloc-1.11.0-linux-x86-RH8/BLD/tests/ports'
>   CC   libhwloc_port_aix_la-topology-aix.lo
>   CCLD libhwloc-port-aix.la 
>   CC   libhwloc_port_bgq_la-topology-bgq.lo
> topology-bgq.c: In function `hwloc_bgq_get_thread_cpubind':
> topology-bgq.c:115: `cpu_set_t' undeclared (first use in this function)
> topology-bgq.c:115: (Each undeclared identifier is reported only once
> topology-bgq.c:115: for each function it appears in.)
> topology-bgq.c:115: parse error before "bg_set"
> topology-bgq.c:122: `bg_set' undeclared (first use in this function)
> topology-bgq.c: In function `hwloc_bgq_set_thread_cpubind':
> topology-bgq.c:151: `cpu_set_t' undeclared (first use in this function)
> topology-bgq.c:151: parse error before "bg_set"
> topology-bgq.c:168: `bg_set' undeclared (first use in this function)
> make[3]: *** [libhwloc_port_bgq_la-topology-bgq.lo] Error 1
>
> The following output from configure might be relevant:
>
> checking for sched_setaffinity... yes
> checking for sys/cpuset.h... no
> checking for cpuset_setaffinity... no
> checking for library containing pthread_getthrds_np... no
> checking for cpuset_setid... no
>
>
>
> -Paul
>
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/07/4508.php

diff --git a/src/topology-bgq.c b/src/topology-bgq.c
index 3998f31..4dfc2bf 100644
--- a/src/topology-bgq.c
+++ b/src/topology-bgq.c
@@ -15,6 +15,8 @@
 #include 
 #include 

+#ifndef HWLOC_DISABLE_BGQ_PORT_TEST
+
 static int
 hwloc_look_bgq(struct hwloc_backend *backend)
 {
@@ -244,3 +246,5 @@ const struct hwloc_component hwloc_bgq_component = {
   0,
   _bgq_disc_component
 };
+
+#endif /* !HWLOC_DISABLE_BGQ_PORT_TEST */
diff --git a/tests/ports/include/bgq/spi/include/kernel/location.h b/tests/ports/include/bgq/spi/include/kernel/location.h
index 4b67abb..883bb51 100644
--- a/tests/ports/include/bgq/spi/include/kernel/location.h
+++ b/tests/ports/include/bgq/spi/include/kernel/location.h
@@ -11,4 +11,10 @@
 uint32_t Kernel_ProcessorID( void );
 uint32_t Kernel_MyTcoord( void );

+/* don't try to cross-build BGQ port on old Linux platforms */
+#if (!HAVE_DECL_PTHREAD_GETAFFINITY_NP) || (!HAVE_DECL_PTHREAD_SETAFFINITY_NP) || (!defined HWLOC_HAVE_CPU_SET)
+#warning Disabling BGQ port cross-build on old Linux platform
+#define HWLOC_DISABLE_BGQ_PORT_TEST
+#endif
+
 #endif /* HWLOC_PORT_BGQ_KERNEL_LOCATION_H */


Re: [hwloc-devel] whelk warning

2015-07-26 Thread Brice Goglin
Applied a slightly better one, thanks.
https://github.com/open-mpi/hwloc/commit/6376438e558f4a1541f4bb1cef18c0e86f222821

Brice



On 21/07/2015 19:07, Balaji, Pavan wrote:
> Folks,
>
> We see a warning about assignment from a const string to a regular string 
> (CFLAGS="-Wall -Werror").  Please see the fix we are maintaining for this:
>
> http://git.mpich.org/mpich.git/commitdiff/5ce7102445fe0f6fbcf3fac0e49b092bf3069778
>
>
> Could you consider including this or an alternative (e.g., direct typecast of 
> the const string to "char *") in the next hwloc release?
>
> Thanks,
>
>   -- Pavan
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/07/4507.php



Re: [hwloc-devel] hwloc-1.11 failure with pgi compiler

2015-07-21 Thread Brice Goglin
Thanks.
Could you test this new asm on all your systems/compilers? I don't want to break 
that fragile code again.
Brice

On 21 July 2015 at 08:17:06 UTC+02:00, Paul Hargrove wrote:
>Oops - sent the wrong asm code.
>While "=S" is correct for the second constraint, I meant to send a
>version
>that had "=r" because it allows the compiler more choices.
>
>-Paul
>
>On Mon, Jul 20, 2015 at 11:12 PM, Paul Hargrove 
>wrote:
>
>> PGI-14.10 for 32-bit targets fails in the same manner as 13.7, 13.9
>and
>> 13.10.
>>
>> I believe the following inline x86 asm is correct and more robust
>than the
>> existing code that pgi appears to reject:
>>
>> #elif defined(HWLOC_X86_32_ARCH)
>>   __asm__(
>>   "xchg %%ebx,%1\n\t"
>>   "cpuid\n\t"
>>   "xchg %%ebx,%1\n\t"
>>   : "=a" (*eax), "=S" (*ebx), "=c" (*ecx), "=d" (*edx)
>>   : "0" (*eax), "2" (*ecx));
>> #else
>>
>> -Paul
>>
>> On Mon, Jul 20, 2015 at 9:50 PM, Paul Hargrove 
>wrote:
>>
>>> Pavan,
>>>
>>> I can confirm that I see the same with PGI-13.10.
>>>
>>> I have a couple systems with 14.x installed but neither with 32-bit
>>> support.
>>> I am downloading 32-bit support now (which I am assuming will work
>with
>>> the existing license) and will report back.
>>>
>>> -Paul
>>>
>>> On Mon, Jul 20, 2015 at 9:00 PM, Balaji, Pavan 
>wrote:
>>>
 Hello,

 The hwloc-1.11 build seems to fail with the pgi compiler on 32-bit
 platforms.  I see the following error:

 8<
   CC   topology-x86.lo
 PGC-F--Internal compiler error. unable to allocate a register
  8 (topology-x86.c: 87)
 PGC/x86 Linux 13.9-0: compilation aborted
 8<

 I only tried pgi-13.7 and 13.9 (I don't have access to later
>compiler
 versions).  It looks like the compiler doesn't like the assembly
>code in
 include/private/cpuid-x86.h for 32-bit platforms.



 Thanks,

   -- Pavan

 ___
 hwloc-devel mailing list
 hwloc-de...@open-mpi.org
 Subscription:
>http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
 Link to this post:

>http://www.open-mpi.org/community/lists/hwloc-devel/2015/07/4501.php

>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department   Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
>-- 
>Paul H. Hargrove  phhargr...@lbl.gov
>Computer Languages & Systems Software (CLaSS) Group
>Computer Science Department   Tel: +1-510-495-2352
>Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
>
>
>___
>hwloc-devel mailing list
>hwloc-de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>Link to this post:
>http://www.open-mpi.org/community/lists/hwloc-devel/2015/07/4505.php


Re: [hwloc-devel] Possible buffer overflow in topology-linux.c

2015-07-17 Thread Brice Goglin
Thanks, I'll fix this. I'll try strlcpy() in case it's widely available
enough. Otherwise I'll just add the ending \0 manually.

Brice



On 17/07/2015 12:56, Odzioba, Lukasz wrote:
> Hi,
> Static analysis detected inappropriate use of the strcpy function[1] in
> topology-linux.c.
> There are more places like this, but here the data comes from a dev
> configuration file, so I think we should fix it first.
>
> Below is the patch which fixes the ones that concern me.
> Unfortunately strncpy does not guarantee that the string will be
> NUL-terminated, which may cause other problems.
> I am leaving it up to you whether you want to address that or not.
>
> Thanks,
> Lukas
>
> [1]: http://cwe.mitre.org/data/definitions/676.html
>
> diff --git a/hwloc/topology-linux.c b/hwloc/topology-linux.c
> index 82423ff..0512bac 100644
> --- a/hwloc/topology-linux.c
> +++ b/hwloc/topology-linux.c
> @@ -4347,15 +4347,15 @@ hwloc_linux_block_class_fillinfos(struct 
> hwloc_backend *backend,
>  if (tmp)
>*tmp = '\0';
>  if (!strncmp(line, "E:ID_VENDOR=", strlen("E:ID_VENDOR="))) {
> -  strcpy(vendor, line+strlen("E:ID_VENDOR="));
> +  strncpy(vendor, line+strlen("E:ID_VENDOR="), sizeof(vendor));
>  } else if (!strncmp(line, "E:ID_MODEL=", strlen("E:ID_MODEL="))) {
> -  strcpy(model, line+strlen("E:ID_MODEL="));
> +  strncpy(model, line+strlen("E:ID_MODEL="), sizeof(model));
>  } else if (!strncmp(line, "E:ID_REVISION=", strlen("E:ID_REVISION="))) {
> -  strcpy(revision, line+strlen("E:ID_REVISION="));
> +  strncpy(revision, line+strlen("E:ID_REVISION="), sizeof(revision));
>  } else if (!strncmp(line, "E:ID_SERIAL_SHORT=", 
> strlen("E:ID_SERIAL_SHORT="))) {
> -  strcpy(serial, line+strlen("E:ID_SERIAL_SHORT="));
> +  strncpy(serial, line+strlen("E:ID_SERIAL_SHORT="), sizeof(serial));
>  } else if (!strncmp(line, "E:ID_TYPE=", strlen("E:ID_TYPE="))) {
> -  strcpy(blocktype, line+strlen("E:ID_TYPE="));
> +  strncpy(blocktype, line+strlen("E:ID_TYPE="), sizeof(blocktype));
>  }
>}
>fclose(fd);
> @@ -4493,7 +4493,7 @@ hwloc_linux_lookup_block_class(struct hwloc_backend 
> *backend,
>int dummy;
>int res = 0;
>
> -  strcpy(path, pcidevpath);
> +  strncpy(path, pcidevpath, sizeof(path));
>pathlen = strlen(path);
> 
>
> Intel Technology Poland sp. z o.o.
> ul. Slowackiego 173 | 80-298 Gdansk | Sad Rejonowy Gdansk Polnoc | VII 
> Wydzial Gospodarczy Krajowego Rejestru Sadowego - KRS 101882 | NIP 
> 957-07-52-316 | Kapital zakladowy 200.000 PLN.
>
> This e-mail and any attachments may contain confidential material for the 
> sole use of the intended recipient(s). If you are not the intended recipient, 
> please contact the sender and delete all copies; any review or distribution by
> others is strictly prohibited.
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/07/4494.php



Re: [hwloc-devel] v1.11.0

2015-06-13 Thread Brice Goglin
We're basically supposed to use HWLOC_VERSION everywhere.
But that requirement was added while the line below was being developed in
a branch at the same time. That's why it didn't get fixed.
I'll review the entire tree in case there's another one missing and
fix master and v1.11, thanks.

Brice




On 13/06/2015 19:01, Ralph Castain wrote:
> Hi folks
> 
> I’ve been working on updating the OMPI hwloc code to the 1.11 version. I
> reported the config issue via Jeff, so I updated to the latest
> nightly tarball of 1.11 to pick up that change. I’m now able to
> configure, but hit one last required change to make it build:
> 
> *diff --git a/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c
> b/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c*
> *index 8d129d0..01be274 100644*
> *--- a/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c*
> *+++ b/opal/mca/hwloc/hwloc1110/hwloc/src/topology.c*
> @@ -2599,7 +2599,7 @@next_noncpubackend:
>&& strcmp(topology->backends->component->name, "xml")) {
>  char *value;
>  /* add a hwlocVersion */
> -hwloc_obj_add_info(topology->levels[0][0], "hwlocVersion", VERSION);
> +hwloc_obj_add_info(topology->levels[0][0], "hwlocVersion",
> HWLOC_VERSION);
>  /* add a ProcessName */
>  value = hwloc_progname(topology);
>  if (value) {
> 
> 
> I’m not sure if this is a prefixing issue when embedded, or a more
> general problem. Any thoughts?
> Ralph
> 
> 
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/06/4468.php
> 



[hwloc-devel] v1.11 coming soon

2015-05-21 Thread Brice Goglin
Hello

I will likely release v1.11rc1 next week. The list of changes is
appended below.

For the record, the master branch is preparing hwloc v2.0 (which will
break the API), but it's still far from ready to release.
The v1.11 branch was added between v1.10 and v2.0 to release most
backward-compatible changes earlier.

FYI, changes in master are listed at
https://github.com/open-mpi/hwloc/blob/master/NEWS (but more will be
coming in the next months).

Brice




Version 1.11.0
--
* API
  + Socket objects are renamed into Package to align with the terminology
used by processor vendors. The old HWLOC_OBJ_SOCKET type and "Socket"
name are still supported for backward compatibility.
  + HWLOC_OBJ_NODE is replaced with HWLOC_OBJ_NUMANODE for clarification.
HWLOC_OBJ_NODE is still supported for backward compatibility.
"Node" and "NUMANode" strings are supported as in earlier releases.
* Detection improvements
  + Add support for Intel Knights Landing Xeon Phi.
Thanks to Grzegorz Andrejczuk and Lukasz Anaczkowski.
  + Add Vendor, Model, Revision, SerialNumber, Type and LinuxDeviceID
info attributes to Block OS devices on Linux. Thanks to Vineet Pedaballe
for the help.
- Add --disable-libudev to avoid dependency on the libudev library.
  + Add "MemoryDevice" Misc objects with information about DIMMs, on Linux
when privileged and when I/O is enabled.
Thanks to Vineet Pedaballe for the help.
  + Add a PCISlot attribute to PCI devices on Linux when supported to
identify the physical PCI slot where the board is plugged.
  + Add CPUStepping info attribute on x86 processors,
thanks to Thomas Röhl for the suggestion.
  + Ignore the device-tree on non-Power architectures to avoid buggy
detection on ARM. Thanks to Orion Poplawski for reporting the issue.
  + Work-around buggy Xeon E5v3 BIOS reporting invalid PCI-NUMA affinity
for the PCI links on the second processor.
  + Add support for CUDA compute capability 5.x, thanks Benjamin Worpitz.
  + Many fixes to the x86 backend
- Add L1i and fix L2/L3 type on old AMD processors without topoext support.
- Fix Intel CPU family and model numbers when basic family isn't 6 or 15.
- Fix package IDs on recent AMD processors.
- Fix misc issues due to incomplete APIC IDs on x2APIC processors.
- Avoid buggy discovery on old SGI Altix UVs with non-unique APIC IDs.
  + Gather total machine memory on NetBSD.
* Tools
  + lstopo
- Collapse identical PCI devices unless --no-collapse is given.
  This avoids gigantic outputs when a PCI device contains dozens of
  identical virtual functions.
- The ASCII art output is now called "ascii", for instance in
  "lstopo -.ascii".
  The former "txt" extension is retained for backward compatibility.
- Automatically scales graphical box width to the inner text in Cairo,
  ASCII and Windows outputs.
- Add --rect to lstopo to force rectangular layout even for NUMA nodes.
- Objects may have a Type info attribute to specify a better type name
  and display it in lstopo.
  + hwloc-annotate
- May now operate on all types of objects, including I/O.
- May now insert Misc objects in the topology.
- Do not drop instruction caches and I/O devices from the output anymore.
  + Fix lstopo path in hwloc-gather-topology after install.
* Misc
  + Fix PCI Bridge-specific depth attribute.
  + Fix hwloc_bitmap_intersect() for two infinite bitmaps.
  + Improve the performance of object insertion by cpuset for large
topologies.
  + Prefix verbose XML import errors with the source name.
  + Improve pkg-config checks and error messages.
  + Fix excluding after a component with an argument in the HWLOC_COMPONENTS
environment variable.
  + Fix the recommended way in documentation and examples to allocate memory
on some node, it should use HWLOC_MEMBIND_BIND.
Thanks to Nicolas Bouzat for reporting the issue.
  + Add a "Miscellaneous objects" section in the documentation.
  + Add a FAQ entry "What happens to my topology if I disable symmetric
multithreading, hyper-threading, etc. ?" to the documentation.



Re: [hwloc-devel] Using Hwloc in LLVM OpenMP

2015-05-18 Thread Brice Goglin
On 18/05/2015 20:49, Peyton, Jonathan L wrote:
>
> Hello Everyone,
>
>  
>
> We have been developing The LLVM OpenMP runtime library project and
> were hoping to incorporate the hwloc library as the primary affinity
> mechanism.  In order for this to happen though,
>
> a CMake build system would have to be created as it is now the primary
> build system of both LLVM and the LLVM OpenMP runtime library.  It
> offers better native Windows support (no config/compile cl hackery),
> just as much
>
> configuration capability as the autotools at a fraction of the
> effort.  It is also easier to maintain by more developers because the
> CMake language is easier to learn and has superior documentation.
>
>
> So a couple of questions:
>
> 1) Is anyone currently working on a CMake build system for hwloc?
>
> 2) Would someone inside hwloc development be interested in building a
> CMake build system?
>
> 3) If we were to implement a quality CMake build system, would it be
> accepted?
>
>  
>
> Plus, any other comments or questions are absolutely welcome.
>
>  
>
> -- Johnny
>
>  
>

Hello

I have spent a bit of time on CMakifing hwloc in the past, mostly for
windows support, but I didn't have much knowledge about CMake, so it
didn't go far. Somebody offered Windows vcxproj files later, so we
integrated those and I forgot about CMake. The main issue is about
periodic testing. I basically can't do it manually often enough (nightly
testing is done using MinGW only). Our vcxproj files are already outdated for
this reason.

So
(1) not currently, as far as I know
(2) yes
(3) it won't replace autotools, since we have autotools projects
embedding hwloc. If we can have both autotools and CMake without too
much trouble, I guess it's OK

Brice



Re: [hwloc-devel] [PATCH] utils/hwloc/Makefile.am: fix install-man race condition

2015-05-12 Thread Brice Goglin
On 12/05/2015 17:01, Peter Korsgaard wrote:
>  > Peter Korsgaard, on Tue 12 May 2015 16:09:55 +0200, wrote:
>  >> Make install contains a race condition in utils/hwloc, as both
>  >> install-exec-hook (through install-exec) and install-data trigger
>  >> install-man:
>
>  > I'm surprised: isn't make supposed to handle this kind of dependency
>  > concurrency?
>
> Within the same make instance, yes - but install-exec fires up a new
> make instance to handle install-exec-hook:
>
> install-exec-am: install-binPROGRAMS install-binSCRIPTS
> @$(NORMAL_INSTALL)
> $(MAKE) $(AM_MAKEFLAGS) install-exec-hook
>
> And this is in the Makefile.in code generated by automake, so I don't
> think there's any way around that.
>

For utils/lstopo, we really need the hook to run after install-man. I
think we can use install-data-hook instead; it generates the following code:

install-data: install-data-am

install-data-am: install-dist_APPLICATIONSDATA install-man
@$(NORMAL_INSTALL)
$(MAKE) $(AM_MAKEFLAGS) install-data-hook

Looks like it's officially documented that install-data depends on
install-man, so things should just work like this? (patch below)

Brice


diff --git a/utils/hwloc/Makefile.am b/utils/hwloc/Makefile.am
index b7f78d3..f20e58c 100644
--- a/utils/hwloc/Makefile.am
+++ b/utils/hwloc/Makefile.am
@@ -106,7 +106,7 @@ endif HWLOC_HAVE_LINUX
  -e 's/#HWLOC_DATE#/@HWLOC_RELEASE_DATE@/g' \
  > $@ < $<

-install-exec-hook: install-man
+install-exec-hook:
$(SED) -e 's/HWLOC_top_builddir\/utils\/hwloc/bindir/' -e 
's/HWLOC_top_builddir\/utils\/lstopo/bindir/' -e '/HWLOC_top_builddir/d' 
$(DESTDIR)$(bindir)/hwloc-compress-dir > 
$(DESTDIR)$(bindir)/hwloc-compress-dir.tmp && mv -f 
$(DESTDIR)$(bindir)/hwloc-compress-dir.tmp 
$(DESTDIR)$(bindir)/hwloc-compress-dir
chmod +x $(DESTDIR)$(bindir)/hwloc-compress-dir
 if HWLOC_HAVE_LINUX
diff --git a/utils/lstopo/Makefile.am b/utils/lstopo/Makefile.am
index fd9f5e0..984e263 100644
--- a/utils/lstopo/Makefile.am
+++ b/utils/lstopo/Makefile.am
@@ -74,7 +74,7 @@ endif
  -e 's/#HWLOC_DATE#/@HWLOC_RELEASE_DATE@/g' \
  > $@ < $<

-install-exec-hook: install-man
+install-exec-hook:
rm -f $(DESTDIR)$(bindir)/hwloc-ls$(EXEEXT)
cd $(DESTDIR)$(bindir) && $(LN_S) lstopo-no-graphics$(EXEEXT) 
hwloc-ls$(EXEEXT)
 if !HWLOC_HAVE_WINDOWS
@@ -83,6 +83,8 @@ if !HWLOC_HAVE_CAIRO
cd $(DESTDIR)$(bindir) && $(LN_S) lstopo-no-graphics$(EXEEXT) 
lstopo$(EXEEXT) || true
 endif
 endif
+
+install-data-hook:
rm -f $(DESTDIR)$(man1dir)/hwloc-ls.1
cd $(DESTDIR)$(man1dir) && $(LN_S) lstopo-no-graphics.1 hwloc-ls.1
rm -f $(DESTDIR)$(man1dir)/lstopo.1



Re: [hwloc-devel] [PATCH] utils/hwloc/Makefile.am: fix install-man race condition

2015-05-12 Thread Brice Goglin
Thanks.
Unfortunately we likely have the same problem under utils/lstopo where
the exec-hook needs that dependency :/
(everything was under utils/ in the past, and the dependency was
duplicated when splitting into utils/lstopo and utils/hwloc)

Brice



On 12/05/2015 16:09, Peter Korsgaard wrote:
> Make install contains a race condition in utils/hwloc, as both
> install-exec-hook (through intall-exec) and install-data trigger
> install-man:
>
> http://autobuild.buildroot.net/results/414/41403f8ce4751a27dd1bb9c43f5a97895dea3980/build-end.log
>
> The install-exec-hook target doesn't do anything with the manual pages, so
> fix the race condition by dropping the dependency.
>
> Signed-off-by: Peter Korsgaard 
> ---
>  utils/hwloc/Makefile.am | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/utils/hwloc/Makefile.am b/utils/hwloc/Makefile.am
> index b7f78d3..f20e58c 100644
> --- a/utils/hwloc/Makefile.am
> +++ b/utils/hwloc/Makefile.am
> @@ -106,7 +106,7 @@ endif HWLOC_HAVE_LINUX
> -e 's/#HWLOC_DATE#/@HWLOC_RELEASE_DATE@/g' \
> > $@ < $<
>  
> -install-exec-hook: install-man
> +install-exec-hook:
>   $(SED) -e 's/HWLOC_top_builddir\/utils\/hwloc/bindir/' -e 
> 's/HWLOC_top_builddir\/utils\/lstopo/bindir/' -e '/HWLOC_top_builddir/d' 
> $(DESTDIR)$(bindir)/hwloc-compress-dir > 
> $(DESTDIR)$(bindir)/hwloc-compress-dir.tmp && mv -f 
> $(DESTDIR)$(bindir)/hwloc-compress-dir.tmp 
> $(DESTDIR)$(bindir)/hwloc-compress-dir
>   chmod +x $(DESTDIR)$(bindir)/hwloc-compress-dir
>  if HWLOC_HAVE_LINUX



Re: [hwloc-devel] Lhwloc1 duplicate symbol issue

2015-03-25 Thread Brice Goglin
On 25/03/2015 21:00, Jeff Squyres (jsquyres) wrote:
> This has come up in multiple scenarios recently: when compiling OMPI (which 
> contains hwloc 1.9.1), you get a linker error complaining about a duplicate 
> symbol "Lhwloc1".
>
> Peter (CC'ed) was looking into this, but it came up again today with Nathan 
> (also CC'ed).  He did some experiments with hwloc itself (outside of OMPI) 
> before Peter was able to, and determined the following:
>
> - gcc 5.0 on OS X Yosemite, compiling with -m32
> - hwloc-1.9.1 tag in git: compile fails with Lhwloc1 dup symbol
> - hwloc-1.10 tag in git: works fine
> - master tag in git: works fine
>
> My question is: have you see this Lhwloc1 dup symbol issue before?
>
> I ask because on OMPI master, we can just upgrade to hwloc 1.10.  But in OMPI 
> v1.8.x, it's less attractive to upgrade -- it would be cool if there was a 
> simple fix that we could backport/patch the hwloc 1.9.1 in OMPI 1.8.x with 
> the fix.
>

Looks like I missed something in the OMPI discussion:
When you say symbol, do you mean asm label?
include/private/cpuid-x86.h: Assembler messages:
include/private/cpuid-x86.h:40: Error: symbol `Lhwloc1' is already defined
Like in the mail included at the end of
http://www.open-mpi.org/community/lists/hwloc-users/2014/11/1119.php

This is fixed by
https://github.com/open-mpi/hwloc/commit/790aa2e1e62be6b4f37622959de9ce3766ebc57e
(applied to all stable branches 4 months ago)

This is actually one of the reasons why OMPI upgraded to hwloc v1.9. But
I thought they were going to upgrade to hwloc v1.9 git HEAD, while they
only went to v1.9.1, which does not contain this fix.

There's a stable release/branch issue here. hwloc updates stable
*branches* up to what OMPI uses (hwloc v1.8), but usually we only
publish stable *releases* on the last stable branch (v1.10). We need to
clarify if OMPI wants official hwloc releases only, or if applying
(possibly many) hwloc patches is OK.

Brice



Re: [OMPI devel] [OMPI users] Configuration error with external hwloc

2015-03-24 Thread Brice Goglin
On 24/03/2015 at 20:47, Jeff Squyres (jsquyres) wrote:
> I talked to Peter off-list.
>
> We got a successful build going for him.
>
> Seems like we've identified a few issues here, though:
>
> 1. ./configure with gcc 4.7.2 on Debian (I didn't catch the precise version 
> of Debian) results in a Lhwloc1 duplicate symbol in OMPI's embedded hwloc.  
> This feels very much like a compiler error -- we got successful builds when 
> we forced the use of -O2 instead of the default -O3.  Peter and I will 
> investigate further.

4.7.2 is gcc in Debian wheezy (current stable).
Looks like the upcoming jessie will have 4.9.2

Brice



[hwloc-devel] hwloc v1.11 and/or v1.10.2 and v2.0

2015-03-18 Thread Brice Goglin
Hello

I just pushed a new v1.11 branch. The idea is that master (future v2.0)
won't be ready for release before several months, but it contains many
changes that do not need to wait for all ABI breaks and big changes to
be ready in v2.0. So I've backported these to v1.11. The summary is
https://github.com/open-mpi/hwloc/blob/v1.11/NEWS

I'd like to put a couple more new features in v1.11 before releasing it.
Depending on the timing, I'll do a v1.10.2 first or not.

Brice

PS: Jeff, no need to add nightly tarballs yet, I don't even know which
autotools versions we'll use for v1.11.


Re: [hwloc-devel] [mpich-devel] Build failure in OS X, libxml required?

2015-03-07 Thread Brice Goglin
On 07/03/2015 at 12:13, Lisandro Dalcin wrote:
> OK, I finally figured out what's going on.
>
> I recently upgraded to Mac OS X Yosemite, and I did not install the
> Xcode command line tools. My /usr/include directory was missing,
> however I had /usr/bin/clang (not sure whether it is there because of
> the default Xcode install, or it was a leftover of my previous OS X
> system). I also had pkg-config installed through Homebrew, and
> "pkg-config --cflags libxml-2.0" prints "-I/usr/include/libxml2",
> however remember I did not have "/usr/include". So this pkg-config
> install assumes /usr/include exists.
>
> The thing is that MPICH is able to build fine (and in fact, many other
> software packages) as long as you pass --disable-libxml2 to work around
> the hwloc issue. But after spending some time looking at hwloc build
> system, I realized the dependence on pkg-config to look for libxml2,
> and then there is no easy workaround.
>
> Finally, I've installed the Xcode command line tools (with
> "xcode-select --install") and MPICH + embedded hwloc built just fine. Now
> I have the command line tools on my system, but it makes sense anyway,
> after all I use them every day for development, and the lack of them will
> likely cause me headaches in the near future with some other software.
>
> Some final comment for hwloc folks:
>
> * Relying on pkg-config for building is totally fine, but please note
> this tool is non-standard in OS X.

It's not always available on Linux either. I got annoyed several
times in the past when scrolling up many pages of configure before I
found a small 'cannot check without pkg-config'. So upstream hwloc will
issue a warning when pkg-config is missing in next releases at the end
of configure (hwloc embedders will need a small change to do the same).

> * Looking in config/hwloc_pkg.m4, I noticed you somehow do check for
> broken stuff by attempting to link with package libraries. Well, it
> would be nice to extend this macro to also check that you are able to
> compile by #include'ing package headers. IMHO, such sanity checks save
> time in the long run for both developer and users. Sorry I do not
> offer a patch, I have very limited knowledge of autotools and M4.

That would be a reasonable idea. pkg-config doesn't tell us which
headers come with a package, so we may have to modify hwloc_pkg.m4
(which is already slightly modified from the official pkg.m4) to add one
header argument to the m4 macro. I'll see what I can do.
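
The header-argument idea could look roughly like this in configure.ac terms (a sketch using the standard PKG_CHECK_MODULES and AC_CHECK_HEADER macros, not a patch to hwloc_pkg.m4; wiring an extra header argument into that macro would follow the same pattern):

```m4
dnl Sketch: after the pkg-config link check succeeds, also verify that
dnl the package headers are usable, which catches installs where the
dnl reported -I directory does not exist.
PKG_CHECK_MODULES([LIBXML2], [libxml-2.0],
  [save_CPPFLAGS=$CPPFLAGS
   CPPFLAGS="$CPPFLAGS $LIBXML2_CFLAGS"
   AC_CHECK_HEADER([libxml/parser.h],
     [hwloc_libxml2_happy=yes],
     [hwloc_libxml2_happy=no])
   CPPFLAGS=$save_CPPFLAGS],
  [hwloc_libxml2_happy=no])
```

The compile-time probe is cheap and turns the failure Lisandro hit at `make` time into a clear configure-time result.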

Thanks for looking into this !

Brice



Re: [hwloc-devel] [mpich-devel] Build failure in OS X, libxml required?

2015-03-06 Thread Brice Goglin
Hello,

Sorry but we don't add configure checks without knowing what's causing
the problem. So far it only looks like a broken libxml install.

We cannot check for all broken installs in the world. Otherwise one day
somebody will remove printf from his libc and request a new configure
check for printf() as well. And another check in case he modified
printf() to return different values. And another check in case printf()
was renamed to pruntf(). Endless.

Please try to understand what caused this broken/partial libxml install
(do you have logs of the install?).

By the way, let's make this new hwloc feature official: hwloc can now
detect broken libxml installs by failing to build :)

Brice





On 06/03/2015 at 21:14, Balaji, Pavan wrote:
> Hi,
>
> This is a problem with hwloc, which I had reported in the past.  I believe 
> this is not fixed in hwloc yet.
>
> The suggestion given at that point was to remove and install libxml again, 
> which I did, and things started working correctly again.  But, I agree, hwloc 
> should detect this in configure and abort if there's an issue.
>
> I've cc'ed hwloc-devel.
>
>   -- Pavan
>
>> On Mar 6, 2015, at 2:10 AM, Lisandro Dalcin  wrote:
>>
>> This is with the 3.1.4 tarball. Not sure if this is actually a problem
>> in MPICH or in my system.
>>
>> I got the following build failure. Shouldn't configure catch that beforehand?
>>
>> Making all in tools/topo/hwloc/hwloc
>> Making all in src
>>  CC   topology.lo
>>  CC   traversal.lo
>>  CC   distances.lo
>>  CC   components.lo
>>  CC   bind.lo
>>  CC   bitmap.lo
>>  CC   pci-common.lo
>>  CC   diff.lo
>>  CC   misc.lo
>>  CC   base64.lo
>>  CC   topology-noos.lo
>>  CC   topology-synthetic.lo
>>  CC   topology-custom.lo
>>  CC   topology-xml.lo
>>  CC   topology-xml-nolibxml.lo
>>  CC   topology-xml-libxml.lo
>>  CC   topology-darwin.lo
>>  CC   topology-x86.lo
>> topology-xml-libxml.c:17:10: fatal error: 'libxml/parser.h' file not found
>> #include <libxml/parser.h>
>> ^
>> 1 error generated.
>> make[4]: *** [topology-xml-libxml.lo] Error 1
>> make[4]: *** Waiting for unfinished jobs
>> make[3]: *** [all-recursive] Error 1
>> make[2]: *** [all-recursive] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all] Error 2
>>
>>
>> -- 
>> Lisandro Dalcin
>> 
>> Research Scientist
>> Computer, Electrical and Mathematical Sciences & Engineering (CEMSE)
>> Numerical Porous Media Center (NumPor)
>> King Abdullah University of Science and Technology (KAUST)
>> http://numpor.kaust.edu.sa/
>>
>> 4700 King Abdullah University of Science and Technology
>> al-Khawarizmi Bldg (Bldg 1), Office # 4332
>> Thuwal 23955-6900, Kingdom of Saudi Arabia
>> http://www.kaust.edu.sa
>>
>> Office Phone: +966 12 808-0459
>> ___
>> To manage subscription options or unsubscribe:
>> https://lists.mpich.org/mailman/listinfo/devel
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
>
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/hwloc-devel/2015/03/4400.php



Re: [hwloc-devel] Coverity

2015-02-14 Thread Brice Goglin
On 14/02/2015 at 14:44, Jeff Squyres (jsquyres) wrote:
> I added a bunch of components into the hwloc coverity setup, as well as a 
> dummy model file (so that coverity doesn't complain that hwloc is not fully 
> configured).
>
> Brice: do you have a nightly coverity submission script running somewhere?
>

No, I just run it manually from time to time.

Brice


