Nathan, thanks for taking care of this! I looked at the PR and wonder why we don't move the whole session directory to /dev/shm on Linux instead of introducing a new MCA parameter?

Joseph

On 05/24/2018 04:28 PM, Nathan Hjelm wrote:
PR is up

https://github.com/open-mpi/ompi/pull/5193


-Nathan

On May 24, 2018, at 7:09 AM, Nathan Hjelm <hje...@me.com> wrote:

Ok, thanks for testing that. I will open a PR for master changing the default 
backing location to /dev/shm on Linux. It will be PR’d to v3.0.x and v3.1.x.

-Nathan

On May 24, 2018, at 6:46 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

Thank you all for your input!

Nathan: thanks for that hint, this seems to be the culprit: with your patch, I 
do not observe a performance difference between the two memory allocations. I 
remembered that Open MPI allows changing the shmem allocator on the command 
line. Using vanilla Open MPI 3.1.0 and raising the priority of the POSIX shmem 
component with `--mca shmem_posix_priority 100` also leads to good performance. 
The reason could be that on the Bull machine /tmp is mounted on a disk 
partition (an SSD, iirc). Maybe there is actual I/O involved that hurts 
performance when the shmem backing file is located on a disk (even though the 
file is unlinked before the memory is accessed)?
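
For completeness, the invocation for that test looks roughly like this (the 
binary name here is just a placeholder):

    mpirun --mca shmem_posix_priority 100 -n 24 --bind-to socket ./mpiwin_vs_malloc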

Regarding the other hints: I tried using MPI_Win_allocate_shared with the 
noncontig hint. With POSIX shmem, I do not observe a performance difference 
compared to the other two options. With the disk-backed shmem file, the 
performance fluctuations are similar to MPI_Win_allocate.
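
For reference, passing the hint looks roughly like this (a minimal sketch, not 
the exact code I used; N and the variable names are placeholders):

    MPI_Info info;
    MPI_Info_create(&info);
    /* allow the implementation to place each rank's segment non-contiguously */
    MPI_Info_set(info, "alloc_shared_noncontig", "true");

    int *baseptr;
    MPI_Win win;
    /* the communicator must span a single shared-memory node;
       MPI_COMM_WORLD works here because the runs are single-node */
    MPI_Win_allocate_shared((MPI_Aint)N * sizeof(int), sizeof(int), info,
                            MPI_COMM_WORLD, &baseptr, &win);
    MPI_Info_free(&info);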

On this machine /proc/sys/kernel/numa_balancing is not available, so I assume 
that this is not the cause in this case. It's good to know for the future that 
this might become an issue on other systems.

Cheers
Joseph

On 05/23/2018 02:26 PM, Nathan Hjelm wrote:
Odd. I wonder if it is something affected by your session directory. It might 
be worth moving the segment to /dev/shm. I don’t expect it will have an impact 
but you could try the following patch:
diff --git a/ompi/mca/osc/sm/osc_sm_component.c b/ompi/mca/osc/sm/osc_sm_component.c
index f7211cd93c..bfc26b39f2 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, size_t size, int disp_unit
         posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
         if (0 == ompi_comm_rank (module->comm)) {
             char *data_file;
-            if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
-                         ompi_process_info.proc_session_dir,
+            if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
+                         ompi_process_info.my_name.jobid,
                          ompi_comm_get_cid(module->comm),
                          ompi_process_info.nodename) < 0) {
                 return OMPI_ERR_OUT_OF_RESOURCE;

On May 23, 2018, at 6:11 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 7.1.0 
on the Bull Cluster. I have only run on a single node so far and haven't tested 
what happens when more than one node is involved.

Joseph

On 05/23/2018 02:04 PM, Nathan Hjelm wrote:
What Open MPI version are you using? Does this happen when you run on a single 
node or multiple nodes?
-Nathan
On May 23, 2018, at 4:45 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

All,

We are observing some strange/interesting performance issues when accessing 
memory that has been allocated through MPI_Win_allocate. I am attaching our 
test case, which allocates memory for 100M integer values on each process, both 
through malloc and through MPI_Win_allocate, and writes to the local ranges sequentially.
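
The attached mpiwin_vs_malloc.c is the authoritative version; the following is 
only a simplified sketch of what it does:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 100000000  /* 100M ints per process, as in the attached test */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* plain local allocation */
        int *buf_malloc = malloc((size_t)N * sizeof(int));

        /* window allocation of the same size */
        int *buf_win;
        MPI_Win win;
        MPI_Win_allocate((MPI_Aint)N * sizeof(int), sizeof(int), MPI_INFO_NULL,
                         MPI_COMM_WORLD, &buf_win, &win);

        /* sequential writes to the local ranges, timed separately
           (the attached test repeats the loops and times each iteration) */
        double t = MPI_Wtime();
        for (int i = 0; i < N; i++) buf_malloc[i] = i;
        double t_malloc = MPI_Wtime() - t;

        t = MPI_Wtime();
        for (int i = 0; i < N; i++) buf_win[i] = i;
        double t_win = MPI_Wtime() - t;

        printf("[%d] malloc: %.3fs  win: %.3fs\n", rank, t_malloc, t_win);

        MPI_Win_free(&win);
        free(buf_malloc);
        MPI_Finalize();
        return 0;
    }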

On different systems (including SuperMUC and a Bull Cluster), we see that 
accessing the memory allocated through MPI is significantly slower than 
accessing the malloc'ed memory when multiple processes run on a single node, 
and the effect grows with the number of processes per node. As an example, with 
24 processes per node and the attached example, the operations on the malloc'ed 
memory take ~0.4s while those on the MPI-allocated memory take up to 10s.

After some experiments, I think there are two factors involved:

1) Initialization: it appears that the first iteration is significantly slower 
than any subsequent iteration (1.1s vs 0.4s with 12 processes on a single 
socket). Excluding the first iteration from the timing, or memsetting the range 
beforehand, leads to comparable performance. I assume this is due to page 
faults that stem from first accessing the mmap'ed memory that backs the shared 
memory used in the window. The effect of presetting the malloc'ed memory seems 
smaller (0.4s vs 0.6s).
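
Pre-touching the window memory before the timed loop (what "memsetting the 
range" refers to above) is simply, reusing the names from the sketch earlier:

    /* touch every page once so page-fault cost does not show up in the
       timing (requires <string.h>) */
    memset(buf_win, 0, (size_t)N * sizeof(int));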

2) NUMA effects: Given proper initialization, running on two sockets still 
leads to fluctuating performance degradation for the MPI window memory, in 
extreme cases by up to 20x. The performance of accessing the malloc'ed memory 
is rather stable. The difference seems to get smaller (but does not disappear) 
with an increasing number of repetitions. I am not sure what causes these 
effects, as each process should first-touch its local memory.

Are these known issues? Does anyone have any thoughts on my analysis?

It is problematic for us that replacing local memory allocation with MPI memory 
allocation leads to performance degradation, as we rely on this mechanism in 
our distributed data structures. While we can ensure proper initialization of 
the memory to mitigate 1) for performance measurements, I don't see a way to 
control the NUMA effects. If there is one, I'd be happy about any hints :)

I should note that we also tested MPICH-based implementations, which show 
similar effects (as they also mmap their window memory). Not surprisingly, 
using MPI_Alloc_mem and attaching that memory to a dynamic window does not 
cause these effects, while using shared memory windows does (a sketch of the 
dynamic-window variant follows after the command lines below). I ran my 
experiments using Open MPI 3.1.0 with the following command lines:

- 12 cores / 1 socket:
mpirun -n 12 --bind-to socket --map-by ppr:12:socket
- 24 cores / 2 sockets:
mpirun -n 24 --bind-to socket

and verified the binding using --report-bindings.
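
The dynamic-window variant mentioned above looks roughly like this (a minimal 
sketch; N and the variable names are placeholders):

    int *buf;
    MPI_Win dyn_win;
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &dyn_win);
    /* per the observation above, memory obtained via MPI_Alloc_mem and
       attached to a dynamic window does not show the slowdown */
    MPI_Alloc_mem((MPI_Aint)N * sizeof(int), MPI_INFO_NULL, &buf);
    MPI_Win_attach(dyn_win, buf, (MPI_Aint)N * sizeof(int));
    /* ... sequential local writes, as in the other variants ... */
    MPI_Win_detach(dyn_win, buf);
    MPI_Free_mem(buf);
    MPI_Win_free(&dyn_win);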

Any help or comment would be much appreciated.

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
<mpiwin_vs_malloc.c>