Re: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, basesmuma, p2p

2022-11-07 Thread Ben Menadue via devel
Hi,

We see this on our cluster as well — we traced it to the fact that Python loads 
shared library extensions using RTLD_LOCAL.

The Python module (mpi4py?) has a dependency on libmpi.so, which in turn has a 
dependency on libhcoll.so. Since the Python module is being loaded with 
RTLD_LOCAL, anything that it pulls in with it also ends up being loaded the 
same way. Later, hcoll tries loading its own plugin .so files, but since 
libhcoll.so was loaded with RTLD_LOCAL that plugin library can’t resolve any 
symbols there.

It might be fixable by having the hcoll plugins linked against libhcoll.so, but 
since it’s just a pre-built bundle from Mellanox it’s not something I can test 
easily.

Otherwise, the solution we use is to just set LD_PRELOAD=libmpi.so when launching 
Python so that it gets loaded into the global namespace, as would happen with 
a “normal” compiled program.
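
For what it’s worth, here is a rough C sketch (my own illustration, assuming 
libmpi.so is on the loader’s search path) of what the preload achieves, namely 
putting libmpi.so and the libhcoll.so it drags in into the global symbol 
namespace before anything else loads them privately:

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Loading libmpi.so with RTLD_GLOBAL makes its symbols (and those of its
     * dependencies, e.g. libhcoll.so) available for resolving any library
     * loaded afterwards -- including the plugin .so files that hcoll later
     * dlopen()s. Python's extension loader uses RTLD_LOCAL instead, which is
     * what hides those symbols from the plugins. */
    void *handle = dlopen("libmpi.so", RTLD_NOW | RTLD_GLOBAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen(libmpi.so) failed: %s\n", dlerror());
        return 1;
    }
    /* ... carry on and load the Python extension / hcoll plugins ... */
    return 0;
}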

Cheers,
Ben



> On 8 Nov 2022, at 1:48 am, Tomislav Janjusic via devel 
>  wrote:
> 
> Ugh - runtime command is literally in the e-mail.
>  
> Sorry about that.
>  
>  
> --
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA 
>  
> From: Tomislav Janjusic 
> Sent: Monday, November 7, 2022 8:48 AM
> To: 'Open MPI Developers' ; Open MPI Users 
> 
> Cc: mrlong 
> Subject: RE: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available 
> but requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> What is the runtime command?
> It’s coming from HCOLL. If HCOLL is not needed feel free to disable it -mca 
> coll ^hcoll
>  
> Tomislav Janjusic
> Staff Eng., Mellanox, HPC SW
> +1 (512) 598-0386
> NVIDIA 
>  
> From: devel  > On Behalf Of mrlong via devel
> Sent: Monday, November 7, 2022 2:33 AM
> To: devel@lists.open-mpi.org ; Open MPI 
> Users mailto:us...@lists.open-mpi.org>>
> Cc: mrlong mailto:mrlong...@gmail.com>>
> Subject: [OMPI devel] [LOG_CAT_ML] component basesmuma is not available but 
> requested in hierarchy: basesmuma, basesmuma, ucx_p2p:basesmsocket, 
> basesmuma, p2p
>  
> External email: Use caution opening links or attachments
>  
> The execution of openmpi 5.0.0rc9 results in the following:
> 
> (py3.9) [user@machine01 share]$  mpirun -n 2 python test.py
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> [LOG_CAT_ML] component basesmuma is not available but requested in hierarchy: 
> basesmuma,basesmuma,ucx_p2p:basesmsocket,basesmuma,p2p
> [LOG_CAT_ML] ml_discover_hierarchy exited with error
> 
> Why is this message printed?
> 



Re: [OMPI devel] v5.0 equivalent of --map-by numa

2021-11-11 Thread Ben Menadue via devel
Hi Brice,

Great, thanks for that! I think I need to brush up on my searching skills, not 
sure why I didn't find that issue. Sorry for the noise.

Cheers,
Ben

On 11 Nov. 2021 19:05, Brice Goglin wrote:

Hello Ben

It will be back, at least for the majority of platforms (those without 
heterogeneous memory).

See https://github.com/open-mpi/ompi/issues/8170 and 
https://github.com/openpmix/prrte/pull/1141

Brice

On 11/11/2021 at 05:33, Ben Menadue via devel wrote:
> [quoted original message elided; it appears in full as the next post below]


[OMPI devel] v5.0 equivalent of --map-by numa

2021-11-10 Thread Ben Menadue via devel
Hi,

Quick question: what's the equivalent of "--map-by numa" for the new
PRRTE-based runtime for v5.0? I can see "package" and "l3cache" in the
help, which are close, but don't quite match "numa" for our system.

In more detail...

We have dual-socket CLX- and SKL-based nodes with sub-NUMA clustering
enabled. This shows up in the OS as two packages, each with 1 L3 cache
domain and 2 NUMA domains. Even worse, each compute node effectively
has its own unique mapping of the cores of each socket between the NUMA
domains.

A common way of running for our users is with 1 MPI process per NUMA
domain and then some form of threading within the cores associated with
that domain. This effectively gives each MPI process its own memory
controller and DIMMs.

Using "--map-by numa" worked really well for this, since it took care
of the unique core numbering of each node. The only way I can think of
to set up something equivalent without that would be manually
enumerating the nodes in each job and building a rank file.

I've include an example topology below.

Or do you think this is better as a GitHub issue?

Thanks,
Ben

[bjm900@gadi-cpu-clx-0143 build]$ lstopo
Machine (189GB total)
  Package L#0 + L3 L#0 (36MB)
Group0 L#0
  NUMANode L#0 (P#0 47GB)
  L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU
L#0 (P#0)
  L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU
L#1 (P#1)
  L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU
L#2 (P#2)
  L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU
L#3 (P#3)
  L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU
L#4 (P#7)
  L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU
L#5 (P#8)
  L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU
L#6 (P#9)
  L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU
L#7 (P#13)
  L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU
L#8 (P#14)
  L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU
L#9 (P#15)
  L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
+ PU L#10 (P#19)
  L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
+ PU L#11 (P#20)
  HostBridge
PCI 00:11.5 (SATA)
PCI 00:17.0 (SATA)
  Block(Disk) "sda"
PCIBridge
  PCIBridge
PCI 02:00.0 (VGA)
  HostBridge
PCIBridge
  PCIBridge
PCIBridge
  PCI 08:00.2 (Ethernet)
Net "eno1"
Group0 L#1
  NUMANode L#1 (P#1 47GB)
  L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
+ PU L#12 (P#4)
  L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
+ PU L#13 (P#5)
  L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
+ PU L#14 (P#6)
  L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
+ PU L#15 (P#10)
  L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16
+ PU L#16 (P#11)
  L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17
+ PU L#17 (P#12)
  L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
+ PU L#18 (P#16)
  L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
+ PU L#19 (P#17)
  L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
+ PU L#20 (P#18)
  L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
+ PU L#21 (P#21)
  L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22
+ PU L#22 (P#22)
  L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23
+ PU L#23 (P#23)
  HostBridge
PCIBridge
  PCI 58:00.0 (InfiniBand)
Net "ib0"
OpenFabrics "mlx5_0"
  Package L#1 + L3 L#1 (36MB)
Group0 L#2
  NUMANode L#2 (P#2 47GB)
  L2 L#24 (1024KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24
+ PU L#24 (P#24)
  L2 L#25 (1024KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25
+ PU L#25 (P#25)
  L2 L#26 (1024KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26
+ PU L#26 (P#26)
  L2 L#27 (1024KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27
+ PU L#27 (P#27)
  L2 L#28 (1024KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28
+ PU L#28 (P#30)
  L2 L#29 (1024KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29
+ PU L#29 (P#31)
  L2 L#30 (1024KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30
+ PU L#30 (P#35)
  L2 L#31 (1024KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31
+ PU L#31 (P#36)
  L2 L#32 (1024KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32
+ PU L#32 (P#37)
  L2 L#33 (1024KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33
+ PU L#33 (P#42)
  L2 L#34 (1024KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34
+ PU L#34 (P#43)
  L2 L#35 (1024KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35
+ PU L#35 (P#44)
Group0 L#3
  NUMANode L#3 (P#3 47GB)
  L2 L#36 (1024KB) + L1d 

[OMPI devel] GitHub v4.0.2 tag is broken

2020-04-01 Thread Ben Menadue via devel
Hi,

The v4.0.2 tag in GitHub is broken at the moment -- trying to go to it
just takes you to the v4.0.2 _branch_, which looks to be a separate,
much more recent fork from master:

https://github.com/open-mpi/ompi/tree/v4.0.2

Cheers,
Ben




Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi,

I haven’t heard back from the user yet, but I just put this example together 
which works on 1, 2, and 3 ranks but fails for 4. Unfortunately it needs a fair 
amount of memory, about 14.3GB per process, so I was running it with 
-map-by ppr:1:node.

It doesn’t fail with the segfault as the user’s code does, but it does SIGABRT:

16:12 bjm900@r4320 MPI_TESTS > mpirun -mca pml ob1 -mca coll ^fca,hcoll -map-by ppr:1:node -np 4 ./a.out
[r4450:11544] ../../../../../opal/datatype/opal_datatype_pack.h:53
Pointer 0x2bb7ceedb010 size 131040 is outside [0x2b9ec63cb010,0x2bad1458b010] for
base ptr 0x2b9ec63cb010 count 1 and data 
[r4450:11544] Datatype 0x145fe90[] size 3072000 align 4 id 0 length 7 used 6
true_lb 0 true_ub 6144000 (true_extent 6144000) lb 0 ub 6144000 (extent 6144000)
nbElems -909934592 loops 4 flags 104 (committed )-c-GD--[---][---]
   contain OPAL_FLOAT4:* 
--C[---][---]    OPAL_LOOP_S 192 times the next 2 elements extent 8000
--C---P-D--[---][---]    OPAL_FLOAT4 count 2000 disp 0xaba95 (4608000) blen 0 extent 4 (size 8000)
--C[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 4608000 size of data 8000
--C[---][---]    OPAL_LOOP_S 192 times the next 2 elements extent 8000
--C---P-D--[---][---]    OPAL_FLOAT4 count 2000 disp 0x0 (0) blen 0 extent 4 (size 8000)
--C[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 0 size of data 8000
---G---[---][---]    OPAL_LOOP_E prev 6 elements first elem displacement 4608000 size of data 655228928
Optimized description 
-cC---P-DB-[---][---]     OPAL_UINT1 count -1819869184 disp 0xaba95 (4608000) blen 1 extent 1 (size 1536000)
-cC---P-DB-[---][---]     OPAL_UINT1 count -1819869184 disp 0x0 (0) blen 1 extent 1 (size 1536000)
---G---[---][---]    OPAL_LOOP_E prev 2 elements first elem displacement 4608000 
[r4450:11544] *** Process received signal ***
[r4450:11544] Signal: Aborted (6)
[r4450:11544] Signal code:  (-6)

Cheers,
Ben

allgatherv_failure.c
Description: Binary data
> On 2 Nov 2018, at 12:09 pm, Ben Menadue <ben.mena...@nci.org.au> wrote:
> [quoted reply elided; it appears in full as the next post below]
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi Gilles,

> On 2 Nov 2018, at 11:03 am, Gilles Gouaillardet  wrote:
> I noted the stack traces refers opal_cuda_memcpy(). Is this issue specific to 
> CUDA environments ?

No, this is just on normal CPU-only nodes. But memcpy always goes through 
opal_cuda_memcpy when CUDA support is enabled, even if there’s no GPUs in use 
(or indeed, even installed).

> The coll/tuned default collective module is known not to work when tasks use 
> matching but different signatures.
> For example, one task sends one vector of N elements, and the other task 
> receives N elements.


This is the call that triggers it:

ierror = MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL, S[0], 
recvcounts, displs, mpitype_vec_nobs, node_comm);

(and changing the source datatype to MPI_BYTE to avoid the NULL handle doesn’t 
help).
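
For reference, here is a stripped-down sketch of that call pattern (my own 
illustration, not the user's code, and with small made-up sizes, so it won't 
reproduce the failure on its own):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A committed vector type standing in for mpitype_vec_nobs; with
     * stride == blocklength it is effectively contiguous and covers
     * 4 * 2000 = 8000 floats per rank. */
    MPI_Datatype vec;
    MPI_Type_vector(4, 2000, 2000, MPI_FLOAT, &vec);
    MPI_Type_commit(&vec);

    const int per_rank = 8000;                 /* floats covered by one vec */
    float *S = malloc((size_t)size * per_rank * sizeof(float));
    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    for (int i = 0; i < size; i++) {
        recvcounts[i] = 1;   /* one vec-typed element from each rank */
        displs[i] = i;       /* in units of the extent of vec */
    }
    for (int j = 0; j < per_rank; j++)         /* each rank fills its own slot */
        S[rank * per_rank + j] = (float)rank;

    /* With MPI_IN_PLACE the sendcount/sendtype are ignored and each rank's
     * contribution is taken from its own slot of the receive buffer. */
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                   S, recvcounts, displs, vec, MPI_COMM_WORLD);

    MPI_Type_free(&vec);
    free(S); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}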

> A workaround worth trying is to
> mpirun --mca coll basic ...


Thanks — using --mca coll basic,libnbc fixes it (basic on its own fails because 
it can’t work out what to use for Iallgather).

> Last but not least, could you please post a minimal example (and the number 
> of MPI tasks used) that can evidence the issue ?


I’m just waiting for the user to get back to me with the okay to share the 
code. Otherwise, I’ll see what I can put together myself. It works on 42 cores 
(at 14 per node = 3 nodes) but fails for 43 cores (so 1 rank on the 4th node). 
The communicator includes 1 rank per node, so it’s going from a three-rank 
communicator to a four-rank communicator — perhaps the tuned algorithm changes 
at that point?

Cheers,
Ben

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] 3.1.2: Datatype errors and segfault in MPI_Allgatherv

2018-11-01 Thread Ben Menadue
Hi,

One of our users is reporting an issue using MPI_Allgatherv with a large 
derived datatype — it segfaults inside OpenMPI. Using a debug build of OpenMPI 
3.1.2 produces a ton of messages like this before the segfault:

[r3816:50921] ../../../../../opal/datatype/opal_datatype_pack.h:53
Pointer 0x2acd0121b010 size 131040 is outside 
[0x2ac5ed268010,0x2ac980ad8010] for
base ptr 0x2ac5ed268010 count 1 and data 
[r3816:50921] Datatype 0x42998b0[] size 592000 align 4 id 0 length 7 used 6
true_lb 0 true_ub 1536000 (true_extent 1536000) lb 0 ub 1536000 
(extent 1536000)
nbElems 148000 loops 4 flags 104 (committed )-c-GD--[---][---]
   contain OPAL_FLOAT4:* 
--C[---][---]OPAL_LOOP_S 4 times the next 2 elements extent 8000
--C---P-D--[---][---]OPAL_FLOAT4 count 2000 disp 0x380743000 
(1504000) blen 0 extent 4 (size 8000)
--C[---][---]OPAL_LOOP_E prev 2 elements first elem displacement 
1504000 size of data 8000
--C[---][---]OPAL_LOOP_S 70 times the next 2 elements extent 
8000
--C---P-D--[---][---]OPAL_FLOAT4 count 2000 disp 0x0 (0) blen 0 extent 
4 (size 8000)
--C[---][---]OPAL_LOOP_E prev 2 elements first elem displacement 0 
size of data 8000
---G---[---][---]OPAL_LOOP_E prev 6 elements first elem displacement 
1504000 size of data 1625032704
Optimized description 
-cC---P-DB-[---][---] OPAL_UINT1 count 32000 disp 0x380743000 
(1504000) blen 1 extent 1 (size 32000)
-cC---P-DB-[---][---] OPAL_UINT1 count 1305032704 disp 0x0 (0) blen 1 
extent 1 (size 56)
---G---[---][---]OPAL_LOOP_E prev 2 elements first elem displacement 
1504000 size of d

Here is the backtrace:

 backtrace 
 0 0x0008987b memcpy()  ???:0
 1 0x000639b6 opal_cuda_memcpy()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_cuda.c:99
 2 0x0005cd7a pack_predefined_data()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_pack.h:56
 3 0x0005e845 opal_generic_simple_pack()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_datatype_pack.c:319
 4 0x0004ce6e opal_convertor_pack()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/datatype/../../../../../opal/datatype/opal_convertor.c:272
 5 0xe3b6 mca_btl_openib_prepare_src()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib.c:1609
 6 0x00023c75 mca_bml_base_prepare_src()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/bml/bml.h:341
 7 0x00027d2a mca_pml_ob1_send_request_schedule_once()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.c:995
 8 0x0002473c mca_pml_ob1_send_request_schedule_exclusive()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.h:313
 9 0x0002479d mca_pml_ob1_send_request_schedule()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.h:337
10 0x000256fe mca_pml_ob1_frag_completion()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/ompi/mca/pml/ob1/../../../../../../../ompi/mca/pml/ob1/pml_ob1_sendreq.c:321
11 0x0001baaf handle_wc()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3565
12 0x0001c20c poll_device()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3719
13 0x0001c6c0 progress_one_device()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3829
14 0x0001c763 btl_openib_component_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/mca/btl/openib/../../../../../../../opal/mca/btl/openib/btl_openib_component.c:3853
15 0x0002ff90 opal_progress()  
/short/z00/bjm900/build/openmpi-mofed4.2/openmpi-3.1.2/build/gcc/debug-1/opal/../../../../opal/runtime/opal_progress.c:228
16 0x0001114c ompi_request_wait_completion()  

Re: [OMPI devel] Removing the oob/ud component

2018-06-19 Thread Ben Menadue
Hi Jeff,

What’s the replacement that it should use instead? I’m pretty sure oob/ud is 
being picked by default on our IB cluster. Or is oob/tcp good enough?

Cheers,
Ben

> On 20 Jun 2018, at 5:20 am, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> We talked about this on the webex today, but for those of you who weren't 
> there: we're talking about removing the oob/ud component:
> 
>https://github.com/open-mpi/ompi/pull/5300
> 
> We couldn't find anyone who still cares about this component (e.g., LANL, 
> Mellanox, ...etc.), and no one is maintaining it.  If no one says anything 
> within the next 2 weeks, we'll remove oob/ud before the branch for v4.0.0.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread Ben Menadue
Hi Ralph,

Thanks for that. That would also explain why it works with OMPI 1.10.7. In 
which case, I’ll just suggest they continue using 1.10.7 for now.

I just went back over the doMPI R code, and it looks like it’s using 
MPI_Comm_spawn to create its “cluster” of MPI worker processes but then using 
MPI_Comm_disconnect when closing the cluster. I think the idea is that they can 
then create and destroy clusters several times within the same R script. But of 
course, that won’t work here when you can’t disconnect processes.
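
For reference, the pattern boils down to something like this minimal C sketch 
(my own illustration, not the actual Rmpi/doMPI code; it just re-spawns itself 
as the workers):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent, workers;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* "Parent" side: spawn the workers, use them, then tear the
         * "cluster" down so it could be re-created later in the same run. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &workers, MPI_ERRCODES_IGNORE);
        /* ... hand work out over the intercommunicator ... */
        MPI_Comm_disconnect(&workers);   /* this is where the hang shows up */
    } else {
        /* Worker side: do the work, then disconnect from the parent. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}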

Cheers,
Ben



> On 22 May 2018, at 1:09 pm, r...@open-mpi.org wrote:
> 
> Comm_connect and Comm_disconnect are both broken in OMPI v2.0 and above, 
> including OMPI master - the precise reasons differ across the various 
> releases. From what I can tell, the problem is in the OMPI side (as opposed 
> to PMIx). I’ll try to file a few issues (since the problem is different in 
> the various releases) in the next few days that points to the problems.
> 
> Comm_spawn is okay, FWIW
> 
> Ralph
> 
> 
>> On May 21, 2018, at 8:00 PM, Ben Menadue <ben.mena...@nci.org.au 
>> <mailto:ben.mena...@nci.org.au>> wrote:
>> 
>> Hi,
>> 
>> Moving this over to the devel list... I’m not sure if it's is a problem with 
>> PMIx or with OMPI’s integration with that. It looks like wait_cbfunc 
>> callback enqueued as part of the PMIX_PTL_SEND_RECV at 
>> pmix_client_connect.c:329 is never called, and so the main thread is never 
>> woken from the PMIX_WAIT_THREAD at pmix_client_connect.c:232. (This is for 
>> PMIx v2.1.1.) But I haven’t worked out why that callback is not being called 
>> yet… looking at the output, I think that it’s expecting a message back from 
>> the PMIx server that it’s never getting.
>> 
>> [raijin7:05505] pmix: disconnect called
>> [raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to 
>> server
>> [raijin7:05505] posting recv on tag 119
>> [raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
>> [raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 
>> 1746468864:0 tag 119 with NON-NULL msg
>> [raijin7:05505] ptl:base:send_handler SENDING MSG
>> [raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 
>> 1746468865:0
>> [raijin7:05493] ptl:base:recv:handler allocate new recv msg
>> [raijin7:05493] ptl:base:recv:handler read hdr on socket 27
>> [raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
>> [raijin7:05493] ptl:base:recv:handler allocate data region of size 645
>> [raijin7:05505] ptl:base:send_handler MSG SENT
>> [raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES 
>> FOR TAG 119 ON PEER SOCKET 27
>> [raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post 
>> msg
>> [raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on 
>> socket 27
>> [raijin7:05493] checking msg on tag 119 for tag 0
>> [raijin7:05493] checking msg on tag 119 for tag 4294967295
>> [raijin7:05505] pmix: disconnect completed
>> [raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
>> [raijin7:05493] SWITCHYARD for 1746468865:0:27
>> [raijin7:05493] recvd pmix cmd 11 from 1746468865:0
>> [raijin7:05493] recvd CONNECT from peer 1746468865:0
>> [raijin7:05493] get_tracker called with 32 procs
>> [raijin7:05493] 1746468864:0 CALLBACK COMPLETE
>> 
>> Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of 
>> the MPI processes (i.e. the original one along with the dynamically launched 
>> ones) look to be waiting on the same pthread_cond_wait in the backtrace 
>> below, while the mpirun is just in the standard event loops 
>> (event_base_loop, oob_tcp_listener, opal_progress_threads, 
>> ptl_base_listener, and pmix_progress_threads).
>> 
>> That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 
>> 16 running here (i.e. 1 original + 15 spawned).
>> 
>> Or should I post this over in the PMIx list instead?
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au 
>>> <mailto:ben.mena...@nci.org.au>> wrote:
>>> 
>>> Hi,
>>> 
>>> I’m trying to debug a user’s program that uses dynamic process management 
>>> through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of 
>>> the processes is in
>>> 
>>> #0  0x7ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from 
>>> /lib64/libpthread.so.0
>>> #1  0x7ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<optimized out>, info=<optimized out>, 

Re: [OMPI devel] [OMPI users] 3.x - hang in MPI_Comm_disconnect

2018-05-21 Thread Ben Menadue
Hi,

Moving this over to the devel list... I’m not sure if it is a problem with 
PMIx or with OMPI’s integration with that. It looks like the wait_cbfunc callback 
enqueued as part of the PMIX_PTL_SEND_RECV at pmix_client_connect.c:329 is 
never called, and so the main thread is never woken from the PMIX_WAIT_THREAD 
at pmix_client_connect.c:232. (This is for PMIx v2.1.1.) But I haven’t worked 
out why that callback is not being called yet… looking at the output, I think 
that it’s expecting a message back from the PMIx server that it’s never getting.

[raijin7:05505] pmix: disconnect called
[raijin7:05505] [../../../../../src/mca/ptl/tcp/ptl_tcp.c:431] post send to 
server
[raijin7:05505] posting recv on tag 119
[raijin7:05505] QUEIENG MSG TO SERVER OF SIZE 645
[raijin7:05505] 1746468865:0 ptl:base:send_handler SENDING TO PEER 1746468864:0 
tag 119 with NON-NULL msg
[raijin7:05505] ptl:base:send_handler SENDING MSG
[raijin7:05493] 1746468864:0 ptl:base:recv:handler called with peer 1746468865:0
[raijin7:05493] ptl:base:recv:handler allocate new recv msg
[raijin7:05493] ptl:base:recv:handler read hdr on socket 27
[raijin7:05493] RECVD MSG FOR TAG 119 SIZE 645
[raijin7:05493] ptl:base:recv:handler allocate data region of size 645
[raijin7:05505] ptl:base:send_handler MSG SENT
[raijin7:05493] 1746468864:0 RECVD COMPLETE MESSAGE FROM SERVER OF 645 BYTES 
FOR TAG 119 ON PEER SOCKET 27
[raijin7:05493] [../../../../src/mca/ptl/base/ptl_base_sendrecv.c:507] post msg
[raijin7:05493] 1746468864:0 message received 645 bytes for tag 119 on socket 27
[raijin7:05493] checking msg on tag 119 for tag 0
[raijin7:05493] checking msg on tag 119 for tag 4294967295
[raijin7:05505] pmix: disconnect completed
[raijin7:05493] 1746468864:0 EXECUTE CALLBACK for tag 119
[raijin7:05493] SWITCHYARD for 1746468865:0:27
[raijin7:05493] recvd pmix cmd 11 from 1746468865:0
[raijin7:05493] recvd CONNECT from peer 1746468865:0
[raijin7:05493] get_tracker called with 32 procs
[raijin7:05493] 1746468864:0 CALLBACK COMPLETE

Here, 5493 is the mpirun and 5505 is one of the spawned processes. All of the 
MPI processes (i.e. the original one along with the dynamically launched ones) 
look to be waiting on the same pthread_cond_wait in the backtrace below, while 
the mpirun is just in the standard event loops (event_base_loop, 
oob_tcp_listener, opal_progress_threads, ptl_base_listener, and 
pmix_progress_threads).

That said, I’m not sure why get_tracker is reporting 32 procs — there’s only 16 
running here (i.e. 1 original + 15 spawned).

Or should I post this over in the PMIx list instead?

Cheers,
Ben


> On 17 May 2018, at 9:59 am, Ben Menadue <ben.mena...@nci.org.au> wrote:
> 
> Hi,
> 
> I’m trying to debug a user’s program that uses dynamic process management 
> through Rmpi + doMPI. We’re seeing a hang in MPI_Comm_disconnect. Each of the 
> processes is in
> 
> #0  0x7ff72513168c in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /lib64/libpthread.so.0
> #1  0x7ff7130760d3 in PMIx_Disconnect (procs=0x5b0d600, nprocs=<optimized out>, info=<optimized out>, ninfo=0) at 
> ../../src/client/pmix_client_connect.c:232
> #2  0x7ff7132fa670 in ext2x_disconnect (procs=0x7fff48ed6700) at 
> ext2x_client.c:1432
> #3  0x7ff71a3b7ce4 in ompi_dpm_disconnect (comm=0x5af6910) at 
> ../../../../../ompi/dpm/dpm.c:596
> #4  0x7ff71a402ff8 in PMPI_Comm_disconnect (comm=0x5a3c4f8) at 
> pcomm_disconnect.c:67
> #5  0x7ff71a7466b9 in mpi_comm_disconnect () from 
> /home/900/bjm900/R/x86_64-pc-linux-gnu-library/3.4/Rmpi/libs/Rmpi.so
> 
> This is using 3.1.0 against and external install of PMIx 2.1.1. But I see 
> exactly the same issue with 3.0.1 using its internal PMIx. It looks similar 
> to issue #4542, but the corresponding patch in PR#4549 doesn’t seem to help 
> (it just hangs in PMIx_fence instead of PMIx_disconnect).
> 
> Attached is the offending R script, it hangs in the “closeCluster” call. Has 
> anyone seen this issue? I’m not sure what approach to take to debug it, but I 
> have builds of the MPI libraries with --enable-debug available if needed.
> 
> Cheers,
> Ben
> 
> 
> ___
> users mailing list
> us...@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

[OMPI devel] Map by socket broken in 3.0.0?

2017-10-02 Thread Ben Menadue
Hi,

I’m having trouble using map by socket on remote nodes.

Running on the same node as mpirun works fine (except for that spurious 
debugging line):

$ mpirun -H localhost:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22248] SETTING BINDING TO CORE
 Data for JOB [11140,1] offset 0 Total slots allocated 16

    JOB MAP   

 Data for node: raijin7 Num slots: 16   Max slots: 0Num procs: 4
Process OMPI jobid: [11140,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]], socket 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], socket 
0[core 3[hwt 0]]:[B/B/B/B/./././.][./././././././.]
Process OMPI jobid: [11140,1] App: 0 Process rank: 1 Bound: socket 
0[core 4[hwt 0]], socket 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], socket 
0[core 7[hwt 0]]:[././././B/B/B/B][./././././././.]
Process OMPI jobid: [11140,1] App: 0 Process rank: 2 Bound: socket 
1[core 8[hwt 0]], socket 1[core 9[hwt 0]], socket 1[core 10[hwt 0]], socket 
1[core 11[hwt 0]]:[./././././././.][B/B/B/B/./././.]
Process OMPI jobid: [11140,1] App: 0 Process rank: 3 Bound: socket 
1[core 12[hwt 0]], socket 1[core 13[hwt 0]], socket 1[core 14[hwt 0]], socket 
1[core 15[hwt 0]]:[./././././././.][././././B/B/B/B]

 =
But the same on a remote node fails in a rather odd fashion:

$ mpirun -H r1:16 -map-by ppr:2:socket:PE=4 -display-map /bin/true
[raijin7:22291] SETTING BINDING TO CORE
[r1:10565] SETTING BINDING TO CORE
 Data for JOB [10879,1] offset 0 Total slots allocated 32

    JOB MAP   

 Data for node: r1  Num slots: 16   Max slots: 0Num procs: 4
Process OMPI jobid: [10879,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [10879,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [10879,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [10879,1] App: 0 Process rank: 3 Bound: N/A

 =
--
The request to bind processes could not be completed due to
an internal error - the locale of the following process was
not set by the mapper code:

  Process:  [[10879,1],2]

Please contact the OMPI developers for assistance. Meantime,
you will still be able to run your application without binding
by specifying "--bind-to none" on your command line.
--
--
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[10879,0],0] on node raijin7
  Remote daemon: [[10879,0],1] on node r1

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--

On the other hand, mapping by node works fine...

> mpirun -H r1:16 -map-by ppr:4:node:PE=4 -display-map /bin/true
[raijin7:22668] SETTING BINDING TO CORE
[r1:10777] SETTING BINDING TO CORE
 Data for JOB [9696,1] offset 0 Total slots allocated 32

    JOB MAP   

 Data for node: r1  Num slots: 16   Max slots: 0Num procs: 4
Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: N/A
Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: N/A
Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: N/A
Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: N/A

 =
 Data for JOB [9696,1] offset 0 Total slots allocated 32

    JOB MAP   

 Data for node: r1  Num slots: 16   Max slots: 0Num procs: 4
Process OMPI jobid: [9696,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], 
socket 0[core 3[hwt 0-1]]:[BB/BB/BB/BB/../../../..][../../../../../../../..]
Process OMPI jobid: [9696,1] App: 0 Process rank: 1 Bound: socket 
0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], 
socket 0[core 7[hwt 0-1]]:[../../../../BB/BB/BB/BB][../../../../../../../..]
Process OMPI jobid: [9696,1] App: 0 Process rank: 2 Bound: socket 
1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], 
socket 1[core 11[hwt 0-1]]:[../../../../../../../..][BB/BB/BB/BB/../../../..]
Process OMPI jobid: [9696,1] App: 0 Process rank: 3 Bound: socket 
1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], 
socket 1[core 15[hwt 0-1]]:[../../../../../../../..][../../../../BB/BB/BB/BB]

 =


[OMPI devel] 3.0.0 - extraneous "DONE" when mapping by core

2017-09-18 Thread Ben Menadue
Hi,

I’m seeing an extraneous “DONE” message being printed with OpenMPI 3.0.0 when 
mapping by core:

[bjm900@raijin7 pt2pt]$ mpirun -np 2 ./osu_bw > /dev/null
[bjm900@raijin7 pt2pt]$ mpirun -map-by core -np 2 ./osu_bw > /dev/null
[raijin7:14376] DONE

This patch gets rid of the offending line — but I’m not sure if you want to 
keep it and just make it only print for debug builds?

--- a/orte/mca/rmaps/base/rmaps_base_ranking.c.old
+++ b/orte/mca/rmaps/base/rmaps_base_ranking.c
@@ -561,7 +561,6 @@ int orte_rmaps_base_compute_vpids(orte_job_t *jdata)
 }
 ORTE_ERROR_LOG(rc);
 }
-opal_output(0, "DONE");
 return rc;
 }
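
If keeping the message is preferred, one possibility (just a sketch, and I'm 
assuming the rmaps framework output handle orte_rmaps_base_framework.framework_output 
is the right one to use here) would be to emit it through opal_output_verbose() 
so it only appears when the verbosity is raised:

    /* only printed when rmaps verbosity is raised, e.g. while debugging */
    opal_output_verbose(10, orte_rmaps_base_framework.framework_output, "DONE");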

Cheers,
Ben

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-25 Thread Ben Menadue
Hi Ralph,

Thanks for that, I'll test it out tomorrow. If it doesn't make it in 2.0.1, 
that's fine - I'll just apply the patch myself.

Cheers,
Ben


-Original Message-
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Thursday, 25 August 2016 2:12 PM
To: OpenMPI Devel <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

Okay, I found the issue and fixed it:

https://github.com/open-mpi/ompi-release/pull/1340

We are very close to v2.0.1 release, so it may not get into that one. Still, 
you are welcome to pull down the patch and locally apply it if it would help.

Ralph

> On Aug 24, 2016, at 5:29 PM, r...@open-mpi.org wrote:
> 
> Hmmm...bet I know why. Let me poke a bit.
> 
>> On Aug 24, 2016, at 5:18 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>> 
>> Actually, adding :oversubscribe to the --map-by option still disables 
>> binding, even with :overload on the --bind-to option. While the :overload 
>> option allows binding more than one process per CPU, it only has an effect 
>> if binding actually happens - i.e. without :oversubscribe.
>> 
>> So, on one of our login nodes (2x8-core),
>> 
>> mpirun --np 32 --bind-to core:overload --report-bindings true
>> 
>> works and does what you would expect (0 and 16 on core 0, 1 and 17 on core 
>> 1, ...), while inside a PBS job on a compute node (same hardware) it fails 
>> with "not enough slots available in the system". Adding --map-by 
>> core:oversubscribe makes this to work, but then doesn't have binding.
>> 
>> Cheers,
>> Ben
>> 
>> -Original Message-
>> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
>> Ben Menadue
>> Sent: Thursday, 25 August 2016 9:36 AM
>> To: 'Open MPI Developers' <devel@lists.open-mpi.org>
>> Subject: Re: [OMPI devel] Binding with --oversubscribe in 2.0.0
>> 
>> Hi Ralph,
>> 
>> Thanks for that... that option's not on the man page for mpirun, but I can 
>> see it in the --help message (as "overload-allowed", which also works).
>> 
>> Cheers,
>> Ben
>> 
>> 
>> -Original Message-
>> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
>> r...@open-mpi.org
>> Sent: Thursday, 25 August 2016 2:03 AM
>> To: OpenMPI Devel <devel@lists.open-mpi.org>
>> Subject: Re: [OMPI devel] Binding with --oversubscribe in 2.0.0
>> 
>> Actually, I stand corrected! Someone must have previously requested it, 
>> because support already exists.
>> 
>> What you need to do is simply specify the desired binding. If you don’t 
>> specify one, then we will disable it by default when oversubscribed. This 
>> was done to protect performance for those who don’t have such kind 
>> scenarios, and don’t realize we are otherwise binding by default.
>> 
>> So in your case, you’d want something like:
>> 
>> mpirun --map-by core:oversubscribe --bind-to core:overload
>> 
>> HTH
>> Ralph
>> 
>>> On Aug 24, 2016, at 7:33 AM, r...@open-mpi.org wrote:
>>> 
>>> Well, that’s a new one! I imagine we could modify the logic to allow a 
>>> combination of oversubscribe and overload flags. Won’t get out until 2.1, 
>>> though you could pull the patch in advance if it is holding you up.
>>> 
>>> 
>>>> On Aug 23, 2016, at 11:46 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> One of our users has noticed that binding is disabled in 2.0.0 when 
>>>> --oversubscribe is passed, which is hurting their performance, 
>>>> likely through migrations between sockets. It looks to be because 
>>>> of 294793c (PR#1228).
>>>> 
>>>> They need to use --oversubscribe as for some reason the developers 
>>>> decided to run two processes for each MPI task for some reason (a 
>>>> compute process and an I/O worker process, I think). Since the 
>>>> second process in the pair is mostly idle, there's (almost) no harm 
>>>> in launching two processes per core - and it's better than leaving 
>>>> half the cores idle most of the time. In previous versions they 
>>>> were binding each pair to a core and letting the hyper-threads 
>>>> argue over which of the two processes to run, since this gave the best 
>>>> performance.
>>>> 
>>>> I tried creating a rankfile and binding each process to its own 
>>>> hardware thread, but it refuses to

Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
Hi Ralph,

Thanks for that... that option's not on the man page for mpirun, but I can see 
it in the --help message (as "overload-allowed", which also works).

Cheers,
Ben


-Original Message-
From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of 
r...@open-mpi.org
Sent: Thursday, 25 August 2016 2:03 AM
To: OpenMPI Devel <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] Binding with --oversubscribe in 2.0.0

Actually, I stand corrected! Someone must have previously requested it, because 
support already exists.

What you need to do is simply specify the desired binding. If you don’t specify 
one, then we will disable it by default when oversubscribed. This was done to 
protect performance for those who don’t have such kind scenarios, and don’t 
realize we are otherwise binding by default.

So in your case, you’d want something like:

mpirun --map-by core:oversubscribe --bind-to core:overload

HTH
Ralph

> On Aug 24, 2016, at 7:33 AM, r...@open-mpi.org wrote:
> 
> Well, that’s a new one! I imagine we could modify the logic to allow a 
> combination of oversubscribe and overload flags. Won’t get out until 2.1, 
> though you could pull the patch in advance if it is holding you up.
> 
> 
>> On Aug 23, 2016, at 11:46 PM, Ben Menadue <ben.mena...@nci.org.au> wrote:
>> 
>> Hi,
>> 
>> One of our users has noticed that binding is disabled in 2.0.0 when 
>> --oversubscribe is passed, which is hurting their performance, likely 
>> through migrations between sockets. It looks to be because of 294793c 
>> (PR#1228).
>> 
>> They need to use --oversubscribe as for some reason the developers 
>> decided to run two processes for each MPI task for some reason (a 
>> compute process and an I/O worker process, I think). Since the second 
>> process in the pair is mostly idle, there's (almost) no harm in 
>> launching two processes per core - and it's better than leaving half 
>> the cores idle most of the time. In previous versions they were 
>> binding each pair to a core and letting the hyper-threads argue over 
>> which of the two processes to run, since this gave the best performance.
>> 
>> I tried creating a rankfile and binding each process to its own 
>> hardware thread, but it refuses to launch more processes than the 
>> number of cores (even if all these processes are on the first socket 
>> because of the binding) unless --oversubscribe is passed, and thus 
>> disabling the binding. Is there a way of bypassing the 
>> disable-binding-if-oversubscribing check introduced by that commit? Or can 
>> anyone think of a better way of running this program?
>> 
>> Alternatively, they could leave it with no binding at the mpirun 
>> level and do the binding in a wrapper.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

[OMPI devel] Binding with --oversubscribe in 2.0.0

2016-08-24 Thread Ben Menadue
Hi,

One of our users has noticed that binding is disabled in 2.0.0 when
--oversubscribe is passed, which is hurting their performance, likely
through migrations between sockets. It looks to be because of 294793c
(PR#1228).

They need to use --oversubscribe because, for some reason, the developers decided
to run two processes for each MPI task (a compute process
and an I/O worker process, I think). Since the second process in the pair is
mostly idle, there's (almost) no harm in launching two processes per core -
and it's better than leaving half the cores idle most of the time. In
previous versions they were binding each pair to a core and letting the
hyper-threads argue over which of the two processes to run, since this gave
the best performance.

I tried creating a rankfile and binding each process to its own hardware
thread, but it refuses to launch more processes than the number of cores
(even if all these processes are on the first socket because of the binding)
unless --oversubscribe is passed, which then disables the binding. Is there a
way of bypassing the disable-binding-if-oversubscribing check introduced by
that commit? Or can anyone think of a better way of running this program?

Alternatively, they could leave it with no binding at the mpirun level and
do the binding in a wrapper.

Thanks,
Ben



___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Ben Menadue
Hi Dave,

 

Our wrappers are custom written by us for our machines, rather than provided by 
OpenMPI. I’m not sure how the developers feel, but I’m guessing that’s probably 
how it should be since it’s so highly dependent on the system environment. 
Plus, the wrappers are at the ifort / gfortran level, rather than mpifort – 
this way it also works for all the other libraries that provide 
compiler-specific bindings (i.e. most packages).

 

Essentially, they check each path in the list of include paths for a GNU or 
Intel subdirectory and add that as needed – and similarly for the library 
directories. Of course, this breaks build systems (one relatively common one in 
particular) that expect the files to be in particular locations in the tree 
rather than just assuming the compiler knows what it’s doing.

 

We thought about doing “module load openmpi/intel/1.10.2” and so on (and 
similarly for the other packages – we have about 200-300 in our /apps tree 
now), but decided against it due to the extra complication for our users – 
especially since many just do “module load openmpi” and don’t care about the 
version (yuck). I think there were other reasons as well, but that was before 
my time.

 

Cheers,

Ben

 

 

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Dave Turner
Sent: Friday, 4 March 2016 3:28 PM
To: Ben Menadue <ben.mena...@nci.org.au>
Cc: Open MPI Developers <de...@open-mpi.org>
Subject: Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

 

All,

 

 Ben's suggestion seems reasonable to me.  How about having the 

mpifort script choose the correct mpif-sizeof.h header file based on

the OMPI_FC compiler given at compile time?

 

   Dave

 

On Thu, Mar 3, 2016 at 9:48 PM, Ben Menadue <ben.mena...@nci.org.au 
<mailto:ben.mena...@nci.org.au> > wrote:

Hi Dave,

 

The issue is the way MPI_Sizeof is handled; it's implemented as a series of 
interfaces that map the MPI_Sizeof call to the right function in the library. I 
suspect this is needed because that function doesn't take a datatype argument 
and instead infers this from the argument types - in Fortran, this is only 
possible if you use an interface to call a different function for each data 
type.

 

Since "real, dimension(:)" is different from "real, dimension(:, :)" from the 
perspective of the interface, you need a separate entry for each possible array 
rank. In Fortran 2008 the maximum rank was increased to 15. This is supported 
in Intel Fortran and so the mpif-sizeof.h from such a build needs to have 
interface blocks for up to rank-15 arrays. However, the version of gfortran 
that you're using doesn’t, and hence it complains when it sees the rank > 7 
interfaces in mpif-sizeof.h.

 

To make it compatible between Fortran 2008 compliant and non-compliant 
compilers, you would need to implement MPI_Sizeof in a totally different 
fashion - if that’s even possible.

 

FWIW: We maintain two copies of mpif-sizeof.h in separate subdirectories – one 
for gfortran and one for ifort. Then, there are wrappers around each of the 
compilers that adds the appropriate subdirectory to the include path. This 
makes it transparent to our users, and allows us to present a single build tree 
that works for both compilers.

 

Cheers,

Ben

 

 

 

From: devel [mailto: <mailto:devel-boun...@open-mpi.org> 
devel-boun...@open-mpi.org] On Behalf Of Dave Turner
Sent: Friday, 4 March 2016 2:23 PM
To: Larry Baker < <mailto:ba...@usgs.gov> ba...@usgs.gov>
Cc: Open MPI Developers < <mailto:de...@open-mpi.org> de...@open-mpi.org>
Subject: Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

 

All,

 

 I think that a GNU build of OpenMPI will allow compiling with both

gfortan and ifort, so I think OMPI_FC is useful.  I'd like to see it fully

supported if possible, so if the higher-dimensions in mpif-sizeof.h are

not vital and there is another way of accomplishing the same thing I think

it would be useful to address.

 If not, I would at least like to see some warnings in the documentation

of the OMPI_FC section that would list the cases like this where it will fail.

 

 Dave

 

On Thu, Mar 3, 2016 at 9:07 PM, Larry Baker <ba...@usgs.gov 
<mailto:ba...@usgs.gov> > wrote:

Dave,

 

Both Gilles and Chris raise important points.  You really cannot expect to mix 
modules from two different Fortran compilers in a single executable.  There is 
no requirement placed on a compiler by the Fortran standard for what object 
language it should use, how the information in modules is made available across 
compilation units, or the procedure calling conventions.  This makes me wonder, 
as you do, what the point is of the OMPI_CC and OMPI_FC environment variables?  
I do think that Intel has tried to make their objects interoperable with GCC 
objects.  That is a link-ti

Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

2016-03-03 Thread Ben Menadue
Hi Dave,

 

The issue is the way MPI_Sizeof is handled; it's implemented as a series of 
interfaces that map the MPI_Sizeof call to the right function in the library. I 
suspect this is needed because that function doesn't take a datatype argument 
and instead infers this from the argument types - in Fortran, this is only 
possible if you use an interface to call a different function for each data 
type.

 

Since "real, dimension(:)" is different from "real, dimension(:, :)" from the 
perspective of the interface, you need a separate entry for each possible array 
rank. In Fortran 2008 the maximum rank was increased to 15. This is supported 
in Intel Fortran and so the mpif-sizeof.h from such a build needs to have 
interface blocks for up to rank-15 arrays. However, the version of gfortran 
that you're using doesn’t support this, and hence it complains when it sees the rank > 7 
interfaces in mpif-sizeof.h.

 

To make it compatible between Fortran 2008 compliant and non-compliant 
compilers, you would need to implement MPI_Sizeof in a totally different 
fashion - if that’s even possible.

 

FWIW: We maintain two copies of mpif-sizeof.h in separate subdirectories – one 
for gfortran and one for ifort. Then, there are wrappers around each of the 
compilers that adds the appropriate subdirectory to the include path. This 
makes it transparent to our users, and allows us to present a single build tree 
that works for both compilers.

 

Cheers,

Ben

 

 

 

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Dave Turner
Sent: Friday, 4 March 2016 2:23 PM
To: Larry Baker 
Cc: Open MPI Developers 
Subject: Re: [OMPI devel] mpif.h on Intel build when run with OMPI_FC=gfortran

 

All,

 

 I think that a GNU build of OpenMPI will allow compiling with both

gfortan and ifort, so I think OMPI_FC is useful.  I'd like to see it fully

supported if possible, so if the higher-dimensions in mpif-sizeof.h are

not vital and there is another way of accomplishing the same thing I think

it would be useful to address.

 If not, I would at least like to see some warnings in the documentation

of the OMPI_FC section that would list the cases like this where it will fail.

 

 Dave

 

On Thu, Mar 3, 2016 at 9:07 PM, Larry Baker  > wrote:

Dave,

 

Both Gilles and Chris raise important points.  You really cannot expect to mix 
modules from two different Fortran compilers in a single executable.  There is 
no requirement placed on a compiler by the Fortran standard for what object 
language it should use, how the information in modules is made available across 
compilation units, or the procedure calling conventions.  This makes me wonder, 
as you do, what the point is of the OMPI_CC and OMPI_FC environment variables?  
I do think that Intel has tried to make their objects interoperable with GCC 
objects.  That is a link-time issue.  You are encountering compile-time issues. 
 Gilles says whatever mpif-sizeof.h was intended to define, it cannot be done 
in gfortran.  Even if mpif-sizeof.h generated for an Intel compiler was 
standard-conforming (so the errors you encountered are not show stoppers), I'm 
not sure you would be able to get past the incompatibility between the internal 
formats used by each compiler to store module definitions and declarations for 
later USE by another compilation unit.  I think your expectations cannot be 
fulfilled because of the compilers, not because of OpenMPI.

 

Larry Baker
US Geological Survey
  650-329-5608
  ba...@usgs.gov



 

On 3 Mar 2016, at 6:39 PM, Dave Turner wrote:

 

Gilles,

 

I don't see the point of having the OMPI_CC and OMPI_FC environment

variables at all if you're saying that we shouldn't expect them to work.  I 

actually do think they work fine if you do a GNU build and use them to

specify the Intel compilers.  I also think it works fine when you do an

Intel build and compile with gcc.  So to me it just looks like that one

include file is the problem.

 

  Dave

 

On Thu, Mar 3, 2016 at 8:02 PM, Gilles Gouaillardet  > wrote:

Dave,

you should not expect anything when mixing Fortran compilers
(and to be on the safe side, you might not expect much when mixing C/C++ 
compilers too,
for example, if you built ompi with intel and use gcc for your app, gcc might 
complain about unresolved symbols from the intel runtime)

if you compile OpenMPI with gfortran 4.8.5, the automatically generated 
mpif-sizeof.h contains

! Sad panda.
!
! This compiler does not support the Right Stuff to enable MPI_SIZEOF.
! Specifically: we need support for the INTERFACE keyword,
! ISO_FORTRAN_ENV, and the STORAGE_SIZE() intrinsic on all types.
! Apparently, this compiler does not support both of those things, so
! this file will be (effecitvely) blank (i.e., we didn't bother
! generating the 

[OMPI devel] XRC Support

2015-07-08 Thread Ben Menadue
Hi,

I just finished building 1.8.6 and master on our cluster and noticed that
for both, XRC support wasn't being detected because it didn't detect the
IBV_SRQT_XRC declaration:

checking whether IBV_SRQT_XRC is declared... (cached) no
...
checking if ConnectX XRC support is enabled... no
checking if ConnectIB XRC support is enabled... no

Both of these builds had --enable-openib-connectx-xrc. Having a look in the
config.log, I found this:

configure:191690: checking whether IBV_SRQT_XRC is declared
configure:191690: gcc -std=gnu99 -c -O3 -DNDEBUG -finline-functions
-fno-strict-aliasing -pthread
-I/short/z00/bjm900/build/openmpi/openmpi-1.8.6/opal/mca/hwloc/hwloc191/hwlo
c/include
-I/short/z00/bjm900/build/openmpi/openmpi-1.8.6/build/gnu/opal/mca/hwloc/hwl
oc191/hwloc/include
-I/short/z00/bjm900/build/openmpi/openmpi-1.8.6/opal/mca/event/libevent2021/
libevent
-I/short/z00/bjm900/build/openmpi/openmpi-1.8.6/opal/mca/event/libevent2021/
libevent/include
-I/short/z00/bjm900/build/openmpi/openmpi-1.8.6/build/gnu/opal/mca/event/lib
event2021/libevent/include  conftest.c >&5
conftest.c: In function 'main':
conftest.c:718: error: 'IBV_SRQT_XRC' undeclared (first use in this
function)
conftest.c:718: error: (Each undeclared identifier is reported only once
conftest.c:718: error: for each function it appears in.)
configure:191690: $? = 1

If you have a look at the test program, the failure is because it forgets to
include the infiniband/verbs.h header, and sure enough the configure script
bears this out:

ac_fn_c_check_decl "$LINENO" "IBV_SRQT_XRC"
"ac_cv_have_decl_IBV_SRQT_XRC" "$ac_includes_default"

Changing "$ac_includes_default" to "#include " and
reconfiguring allows it to detect this declaration and then enable support
for XRC:

checking whether IBV_SRQT_XRC is declared... (cached) yes
...
checking if ConnectX XRC support is enabled... yes
checking if ConnectIB XRC support is enabled... yes
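
For reference, the declaration check boils down to compiling something like the 
following (my own minimal version, assuming the libibverbs development headers 
are installed), which can only succeed if infiniband/verbs.h is actually included:

#include <infiniband/verbs.h>

int main(void)
{
    /* IBV_SRQT_XRC is an enumerator of enum ibv_srq_type declared in
     * infiniband/verbs.h; without that include the probe fails with
     * "'IBV_SRQT_XRC' undeclared", which is exactly what config.log shows. */
    enum ibv_srq_type t = IBV_SRQT_XRC;
    return (int) t;
}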

Cheers,
Ben