[OMPI users] Regression: multiple memory regions in dynamic windows

2016-08-25 Thread Joseph Schuchart

All,

It seems there is a regression in the handling of dynamic windows 
between Open MPI 1.10.3 and 2.0.0. I am attaching a test case that works 
fine with Open MPI 1.8.3 and fails with version 2.0.0 with the following 
output:


===
[0] MPI_Get 0 -> 3200 on first memory region
[cl3fr1:7342] *** An error occurred in MPI_Get
[cl3fr1:7342] *** reported by process [908197889,0]
[cl3fr1:7342] *** on win rdma window 3
[cl3fr1:7342] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[cl3fr1:7342] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[cl3fr1:7342] ***    and potentially your MPI job)
===

Expected output is:
===
[0] MPI_Get 0 -> 100 on first memory region:
[0] Done.
[0] MPI_Get 0 -> 100 on second memory region:
[0] Done.
===

The code allocates a dynamic window and attaches two memory regions to 
it before accessing both memory regions using MPI_Get. With Open MPI 
2.0.0, access to both memory regions fails. Access to the first 
memory region only succeeds if the second memory region is not attached. 
With Open MPI 1.10.3, all MPI operations succeed.


Please let me know if you need any additional information or think that 
my code example is not standard compliant.


Best regards
Joseph


--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

/*
 * mpi_dynamic_win.cc
 *
 *  Created on: Aug 24, 2016
 *  Author: joseph
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static int allocate_shared(size_t bufsize, MPI_Win win, MPI_Aint *disp_set) {
  int ret;
  char *sub_mem;
  MPI_Aint disp;
  MPI_Info win_info;
  MPI_Info_create(&win_info);
  MPI_Info_set(win_info, "alloc_shared_noncontig", "true");

  sub_mem = malloc(bufsize * sizeof(char));

  /* Attach the allocated shared memory to the dynamic window */
  ret = MPI_Win_attach(win, sub_mem, bufsize);

  if (ret != MPI_SUCCESS) {
printf("MPI_Win_attach failed!\n");
return -1;
  }

  /* Get the local address */
  ret = MPI_Get_address(sub_mem, &disp);

  if (ret != MPI_SUCCESS) {
printf("MPI_Get_address failed!\n");
return -1;
  }

  /* Publish addresses */
  ret = MPI_Allgather(&disp, 1, MPI_AINT, disp_set, 1, MPI_AINT, MPI_COMM_WORLD);

  if (ret != MPI_SUCCESS) {
printf("MPI_Allgather failed!\n");
return -1;
  }

  MPI_Info_free(&win_info);

  return 0;
}

int main(int argc, char **argv)
{
  MPI_Win win;
  const size_t nelems = 10*10;
  const size_t bufsize = nelems * sizeof(double);
  MPI_Aint   *disp_set, *disp_set2;
  int rank, size;

  double buf[nelems];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  disp_set  = (MPI_Aint*) malloc(size * sizeof(MPI_Aint));
  disp_set2 = (MPI_Aint*) malloc(size * sizeof(MPI_Aint));

  int ret = MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
  if (ret != MPI_SUCCESS) {
printf("MPI_Win_create_dynamic failed!\n");
exit(1);
  }

  
  MPI_Win_lock_all (0, win);

  /* Allocate two shared windows */
  allocate_shared(bufsize, win, disp_set);  
  allocate_shared(bufsize, win, disp_set2);  

  /* Initiate a get */
  {
int elem;
int neighbor = (rank + 1) / size;
if (rank == 0) printf("[%i] MPI_Get 0 -> %zu on first memory region: \n", rank, nelems);
for (elem = 0; elem < nelems -1; elem++) {
  MPI_Aint off = elem * sizeof(double);
  //MPI_Aint disp = MPI_Aint_add(disp_set[neighbor], off);
  MPI_Aint disp = disp_set[neighbor] + off;
  MPI_Get(&buf[elem], sizeof(double), MPI_BYTE, neighbor, disp, sizeof(double), MPI_BYTE, win);
}
MPI_Win_flush(neighbor, win);
if (rank == 0) printf("[%i] Done.\n", rank);
  }


  MPI_Barrier(MPI_COMM_WORLD);

  {
int elem;
int neighbor = (rank + 1) / size;
if (rank == 0) printf("[%i] MPI_Get 0 -> %zu on second memory region: \n", rank, nelems);
for (elem = 0; elem < nelems; elem++) {
  MPI_Aint off = elem * sizeof(double);
  //MPI_Aint disp = MPI_Aint_add(disp_set2[neighbor], off);
  MPI_Aint disp = disp_set[neighbor] + off;
  MPI_Get(&buf[elem], sizeof(double), MPI_BYTE, neighbor, disp, sizeof(double), MPI_BYTE, win);
}
MPI_Win_flush(neighbor, win);
if (rank == 0) printf("[%i] Done.\n", rank);
  }
  MPI_Barrier(MPI_COMM_WORLD);


  MPI_Win_unlock_all (win);

  free(disp_set);
  free(disp_set2);

  MPI_Finalize();
  return 0;
}

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Regression: multiple memory regions in dynamic windows

2016-08-25 Thread Joseph Schuchart

Gilles,

Thanks for your fast reply. I did some last minute changes to the 
example code and didn't fully check the consistency of the output. Also, 
thanks for pointing out the mistake in computing the neighbor rank. I am 
attaching a fixed version.


Best
Joseph

On 08/25/2016 03:11 PM, Gilles Gouaillardet wrote:

Joseph,

at first glance, there is a memory corruption (!)
the first printf should be 0 -> 100, instead of 0 -> 3200

this is very odd because nelems is const, and the compiler might not 
even allocate this variable.


I also noted some counter intuitive stuff in your test program
(which still looks valid to me)

neighbor = (rank +1) / size;
should it be
neighbor = (rank + 1) % size;
instead ?

the first loop is
for (elem=0; elem < nelems-1; elem++) ...
it could be
for (elem=0; elem < nelems; elem++) ...

the second loop uses disp_set, and I guess you meant to use disp_set2

I will try to reproduce this crash.
which compiler (vendor and version) are you using ?
which compiler options do you pass to mpicc ?


Cheers,

Gilles

On Thursday, August 25, 2016, Joseph Schuchart <schuch...@hlrs.de> wrote:


All,

It seems there is a regression in the handling of dynamic windows
between Open MPI 1.10.3 and 2.0.0. I am attaching a test case that
works fine with Open MPI 1.8.3 and fail with version 2.0.0 with
the following output:

===
[0] MPI_Get 0 -> 3200 on first memory region
[cl3fr1:7342] *** An error occurred in MPI_Get
[cl3fr1:7342] *** reported by process [908197889,0]
[cl3fr1:7342] *** on win rdma window 3
[cl3fr1:7342] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[cl3fr1:7342] *** MPI_ERRORS_ARE_FATAL (processes in this win will
now abort,
[cl3fr1:7342] ***and potentially your MPI job)
===

Expected output is:
===
[0] MPI_Get 0 -> 100 on first memory region:
[0] Done.
[0] MPI_Get 0 -> 100 on second memory region:
[0] Done.
===

The code allocates a dynamic window and attaches two memory
regions to it before accessing both memory regions using MPI_Get.
With Open MPI 2.0.0, access to both memory regions fails.
Access to the first memory region only succeeds if the second
memory region is not attached. With Open MPI 1.10.3, all MPI
operations succeed.

Please let me know if you need any additional information or think
that my code example is not standard compliant.

Best regards
    Joseph


-- 
Dipl.-Inf. Joseph Schuchart

High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de





--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

/*
 * mpi_dynamic_win.cc
 *
 *  Created on: Aug 24, 2016
 *  Author: joseph
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

static int allocate_shared(size_t bufsize, MPI_Win win, MPI_Aint *disp_set) {
  int ret;
  char *sub_mem;
  MPI_Aint disp;

  sub_mem = malloc(bufsize * sizeof(char));

  /* Attach the allocated shared memory to the dynamic window */
  ret = MPI_Win_attach(win, sub_mem, bufsize);

  if (ret != MPI_SUCCESS) {
printf("MPI_Win_attach failed!\n");
return -1;
  }

  /* Get the local address */
  ret = MPI_Get_address(sub_mem, &disp);

  if (ret != MPI_SUCCESS) {
printf("MPI_Get_address failed!\n");
return -1;
  }

  /* Publish addresses */
  ret = MPI_Allgather(&disp, 1, MPI_AINT, disp_set, 1, MPI_AINT, MPI_COMM_WORLD);

  if (ret != MPI_SUCCESS) {
printf("MPI_Allgather failed!\n");
return -1;
  }

  return 0;
}

int main(int argc, char **argv)
{
  MPI_Win win;
  const size_t nelems = 10*10;
  const size_t bufsize = nelems * sizeof(double);
  MPI_Aint   *disp_set, *disp_set2;
  int rank, size;

  double buf[nelems];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  disp_set  = (MPI_Aint*) malloc(size * sizeof(MPI_Aint));
  disp_set2 = (MPI_Aint*) malloc(size * sizeof(MPI_Aint));

  int ret = MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
  if (ret != MPI_SUCCESS) {
printf("MPI_Win_create_dynamic failed!\n");
exit(1);
  }

  
  MPI_Win_lock_all (0, win);

  /* Allocate two shared windows */
  allocate_shared(bufsize, win, disp_set);  
  allocate_shared(bufsize, win, disp_set2);  

  /* Initiate a get */
  {
int elem;
int neighbor = (rank + 1) % size;
if (rank == 0) printf("[%i] MPI_Get 0 -> %zu on first memory region: \n", rank, nelems);
for (elem = 0; elem < nel

Re: [OMPI users] Regression: multiple memory regions in dynamic windows

2016-08-26 Thread Joseph Schuchart

Nathan, all,

Thanks for the quick fix. I can confirm that the behavior with multiple 
windows is now as expected and as seen in 1.10.3.


Best
Joseph


On 08/25/2016 10:51 PM, Nathan Hjelm wrote:
Fixed on master. The fix will be in 2.0.2 but you can apply it to 
2.0.0 or 2.0.1:


https://github.com/open-mpi/ompi/commit/e53de7ecbe9f034ab92c832330089cf7065181dc.patch

-Nathan

On Aug 25, 2016, at 07:31 AM, Joseph Schuchart  wrote:


Gilles,

Thanks for your fast reply. I did some last minute changes to the 
example code and didn't fully check the consistency of the output. 
Also, thanks for pointing out the mistake in computing the neighbor 
rank. I am attaching a fixed version.


Best
Joseph

On 08/25/2016 03:11 PM, Gilles Gouaillardet wrote:

Joseph,

at first glance, there is a memory corruption (!)
the first printf should be 0 -> 100, instead of 0 -> 3200

this is very odd because nelems is const, and the compiler might not 
even allocate this variable.


I also noted some counter intuitive stuff in your test program
(which still looks valid to me)

neighbor = (rank +1) / size;
should it be
neighbor = (rank + 1) % size;
instead ?

the first loop is
for (elem=0; elem < nelems-1; elem++) ...
it could be
for (elem=0; elem < nelems; elem++) ...

the second loop uses disp_set, and I guess you meant to use disp_set2

I will try to reproduce this crash.
which compiler (vendor and version) are you using ?
which compiler options do you pass to mpicc ?


Cheers,

Gilles

On Thursday, August 25, 2016, Joseph Schuchart <schuch...@hlrs.de> wrote:


All,

It seems there is a regression in the handling of dynamic
windows between Open MPI 1.10.3 and 2.0.0. I am attaching a test
case that works fine with Open MPI 1.8.3 and fail with version
2.0.0 with the following output:

===
[0] MPI_Get 0 -> 3200 on first memory region
[cl3fr1:7342] *** An error occurred in MPI_Get
[cl3fr1:7342] *** reported by process [908197889,0]
[cl3fr1:7342] *** on win rdma window 3
[cl3fr1:7342] *** MPI_ERR_RMA_RANGE: invalid RMA address range
[cl3fr1:7342] *** MPI_ERRORS_ARE_FATAL (processes in this win
will now abort,
[cl3fr1:7342] ***and potentially your MPI job)
===

Expected output is:
===
[0] MPI_Get 0 -> 100 on first memory region:
[0] Done.
[0] MPI_Get 0 -> 100 on second memory region:
[0] Done.
===

The code allocates a dynamic window and attaches two memory
regions to it before accessing both memory regions using
MPI_Get. With Open MPI 2.0.0, access to both memory
regions fails. Access to the first memory region only succeeds
if the second memory region is not attached. With Open MPI
1.10.3, all MPI operations succeed.

Please let me know if you need any additional information or
think that my code example is not standard compliant.

Best regards
    Joseph


-- 
Dipl.-Inf. Joseph Schuchart

High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de





--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de





--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Valgrind errors related to MPI_Win_allocate_shared

2016-11-14 Thread Joseph Schuchart

All,

I am investigating an MPI application using Valgrind and see a load of 
memory leaks reported in MPI-related code. Please find the full log 
attached. Some observations/questions:


1) According to the information available at 
https://www.open-mpi.org/faq/?category=debugging#valgrind_clean the 
suppression file should help get a clean run of an MPI application 
despite several buffers not being free'd by MPI_Finalize. Is this 
assumption still valid? If so, maybe the suppression file needs an 
update as I still see reports on leaked memory allocated in MPI_Init?


2) There seem to be several invalid reads and writes in the 
opal_shmem_segment_* functions. Are they significant or can we regard 
them as false positives?


3) The code example attached allocates memory using 
MPI_Win_allocate_shared and frees it using MPI_Win_free. However, 
Valgrind reports some memory to be leaking, e.g.:


==4020== 16 bytes in 1 blocks are definitely lost in loss record 21 of 234
==4020==at 0x4C2DB8F: malloc (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)

==4020==by 0xCFDCD47: component_select (osc_sm_component.c:277)
==4020==by 0x4F39FC3: ompi_osc_base_select (osc_base_init.c:73)
==4020==by 0x4E945DC: ompi_win_allocate_shared (win.c:272)
==4020==by 0x4EF6576: PMPI_Win_allocate_shared 
(pwin_allocate_shared.c:80)

==4020==by 0x400E96: main (mpi_dynamic_win_free.c:48)

Can someone please confirm that the way the shared window memory is 
free'd is actually correct? I noticed that the amount of memory that is 
reported to be leaking scales with the number of windows that are 
allocated and free'd. In our case this happens in a set of unit tests 
that all allocate their own shared memory windows and thus the amount of 
leaked memory piles up quite a bit.
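
For reference, the part in question boils down to roughly the following 
(a minimal sketch only, not the full attached reproducer, which 
additionally creates a dynamic window; loop count and sizes are made up):

```
#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  /* shared-memory windows need a communicator of processes on one node */
  MPI_Comm shm_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &shm_comm);

  /* repeatedly allocate and free shared windows, as our unit tests do */
  for (int i = 0; i < 10; i++) {
    int *baseptr;
    MPI_Win win;
    MPI_Win_allocate_shared(100 * sizeof(int), sizeof(int), MPI_INFO_NULL,
                            shm_comm, &baseptr, &win);
    /* ... use baseptr ... */
    MPI_Win_free(&win); /* expected to release all memory tied to the window */
  }

  MPI_Comm_free(&shm_comm);
  MPI_Finalize();
  return 0;
}
```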


I built the code using GCC 5.4.0 with OpenMPI 2.0.1 and ran it on a 
single node. How to reproduce:


$ mpicc -Wall -ggdb mpi_dynamic_win_free.c -o mpi_dynamic_win_free

$ mpirun -n 2 valgrind --leak-check=full 
--suppressions=$HOME/opt/openmpi-2.0.1/share/openmpi/openmpi-valgrind.supp 
./mpi_dynamic_win_free


Best regards,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

==5740== Memcheck, a memory error detector
==5740== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==5740== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==5740== Command: ./mpi_dynamic_win_free
==5740== 
==5741== Memcheck, a memory error detector
==5741== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==5741== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==5741== Command: ./mpi_dynamic_win_free
==5741== 
==5741== Invalid read of size 4
==5741==at 0x70D6166: segment_attach (shmem_mmap_module.c:494)
==5741==by 0x5B5BD3E: opal_shmem_segment_attach (shmem_base_wrappers.c:61)
==5741==by 0xCFDCD14: component_select (osc_sm_component.c:272)
==5741==by 0x4F39FC3: ompi_osc_base_select (osc_base_init.c:73)
==5741==by 0x4E945DC: ompi_win_allocate_shared (win.c:272)
==5741==by 0x4EF6576: PMPI_Win_allocate_shared (pwin_allocate_shared.c:80)
==5741==by 0x400CCE: create_dynamic (mpi_dynamic_win_free.c:32)
==5741==by 0x400DF3: main (mpi_dynamic_win_free.c:81)
==5741==  Address 0x6c659e8 is 264 bytes inside a block of size 4,608 alloc'd
==5741==at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5741==by 0xCFDC798: component_select (osc_sm_component.c:174)
==5741==by 0x4F39FC3: ompi_osc_base_select (osc_base_init.c:73)
==5741==by 0x4E945DC: ompi_win_allocate_shared (win.c:272)
==5741==by 0x4EF6576: PMPI_Win_allocate_shared (pwin_allocate_shared.c:80)
==5741==by 0x400CCE: create_dynamic (mpi_dynamic_win_free.c:32)
==5741==by 0x400DF3: main (mpi_dynamic_win_free.c:81)
==5741== 
==5741== Syscall param open(filename) points to unaddressable byte(s)
==5741==at 0x51C9CCD: ??? (syscall-template.S:84)
==5741==by 0x70D618A: segment_attach (shmem_mmap_module.c:495)
==5741==by 0x5B5BD3E: opal_shmem_segment_attach (shmem_base_wrappers.c:61)
==5741==by 0xCFDCD14: component_select (osc_sm_component.c:272)
==5741==by 0x4F39FC3: ompi_osc_base_select (osc_base_init.c:73)
==5741==by 0x4E945DC: ompi_win_allocate_shared (win.c:272)
==5741==by 0x4EF6576: PMPI_Win_allocate_shared (pwin_allocate_shared.c:80)
==5741==by 0x400CCE: create_dynamic (mpi_dynamic_win_free.c:32)
==5741==by 0x400DF3: main (mpi_dynamic_win_free.c:81)
==5741==  Address 0x6c65a08 is 296 bytes inside a block of size 4,608 alloc'd
==5741==at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5741==by 0xCFDC798: component_select (osc_sm_component.c:174)
==5741==by 0x4F39FC3: ompi_osc_base_select (osc_base_

Re: [OMPI users] Valgrind errors related to MPI_Win_allocate_shared

2016-11-15 Thread Joseph Schuchart

Hi Luke,

Thanks for your reply. From my understanding, the wrappers mainly help 
catch errors on the MPI API level. The errors I reported are well below 
the API layer (please correct me if I'm wrong here). However, I re-ran 
the code with the wrapper loaded via LD_PRELOAD and without the 
suppression file and the warnings issued by Valgrind for the shmem 
segment handling code and leaking memory from MPI_Win_allocate_shared 
are basically the same. Nevertheless, I am attaching the full log of 
that run as well.


Cheers
Joseph


On 11/14/2016 05:07 PM, D'Alessandro, Luke K wrote:

Hi Joseph,

I don’t have a solution to your issue, but I’ve found that the valgrind mpi 
wrapper is necessary to eliminate many of the false positives that the 
suppressions file can’t.

http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap.gettingstarted

You should LD_PRELOAD the libmpiwrap from your installation. If it’s not there 
then you can rebuild valgrind with CC=mpicc to have it built.

Hope this helps move you towards a solution.

Luke


On Nov 14, 2016, at 5:49 AM, Joseph Schuchart  wrote:

All,

I am investigating an MPI application using Valgrind and see a load of memory 
leaks reported in MPI-related code. Please find the full log attached. Some 
observations/questions:

1) According to the information available at 
https://www.open-mpi.org/faq/?category=debugging#valgrind_clean the suppression 
file should help get a clean run of an MPI application despite several buffers 
not being free'd by MPI_Finalize. Is this assumption still valid? If so, maybe 
the suppression file needs an update as I still see reports on leaked memory 
allocated in MPI_Init?

2) There seem to be several invalid reads and writes in the 
opal_shmem_segment_* functions. Are they significant or can we regard them as 
false positives?

3) The code example attached allocates memory using MPI_Win_allocate_shared and 
frees it using MPI_Win_free. However, Valgrind reports some memory to be 
leaking, e.g.:

==4020== 16 bytes in 1 blocks are definitely lost in loss record 21 of 234
==4020==at 0x4C2DB8F: malloc (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4020==by 0xCFDCD47: component_select (osc_sm_component.c:277)
==4020==by 0x4F39FC3: ompi_osc_base_select (osc_base_init.c:73)
==4020==by 0x4E945DC: ompi_win_allocate_shared (win.c:272)
==4020==by 0x4EF6576: PMPI_Win_allocate_shared (pwin_allocate_shared.c:80)
==4020==by 0x400E96: main (mpi_dynamic_win_free.c:48)

Can someone please confirm that the way the shared window memory is free'd 
is actually correct? I noticed that the amount of memory that is reported to be 
leaking scales with the number of windows that are allocated and free'd. In our 
case this happens in a set of unit tests that all allocate their own shared 
memory windows and thus the amount of leaked memory piles up quite a bit.

I built the code using GCC 5.4.0 with OpenMPI 2.0.1 and ran it on a single 
node. How to reproduce:

$ mpicc -Wall -ggdb mpi_dynamic_win_free.c -o mpi_dynamic_win_free

$ mpirun -n 2 valgrind --leak-check=full 
--suppressions=$HOME/opt/openmpi-2.0.1/share/openmpi/openmpi-valgrind.supp 
./mpi_dynamic_win_free

Best regards,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

==27025== Memcheck, a memory error detector
==27025== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==27025== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==27025== Command: ./mpi_dynamic_win_free
==27025== 
==27026== Memcheck, a memory error detector
==27026== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==27026== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==27026== Command: ./mpi_dynamic_win_free
==27026== 
valgrind MPI wrappers 27026: Active for pid 27026
valgrind MPI wrappers 27026: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 27025: Active for pid 27025
valgrind MPI wrappers 27025: Try MPIWRAP_DEBUG=help for possible options
==27026== Invalid read of size 4
==27026==at 0x731B166: segment_attach (shmem_mmap_module.c:494)
==27026==by 0x5DA0D3E: opal_shmem_segment_attach (shmem_base_wrappers.c:61)
==27026==by 0xD221D14: component_select (osc_sm_component.c:272)
==27026==by 0x517EFC3: ompi_osc_base_

Re: [OMPI users] Valgrind errors related to MPI_Win_allocate_shared

2016-11-21 Thread Joseph Schuchart

Gilles,

Thanks a lot for the fix. I just tested on master and can confirm that 
both the leak and the invalid read-writes are gone. However, the 
suppression file still does not filter the memory that is allocated in 
MPI_Init and not properly free'd at the end. Any chance this can be 
filtered using the suppression file?


Cheers,
Joseph

On 11/15/2016 04:39 PM, Gilles Gouaillardet wrote:

Joseph,

thanks for the report, this is a real memory leak.
i fixed it in master, and the fix is now being reviewed.
meanwhile, you can manually apply the patch available at
https://github.com/open-mpi/ompi/pull/2418.patch

Cheers,

Gilles


On Tue, Nov 15, 2016 at 1:52 AM, Joseph Schuchart  wrote:

Hi Luke,

Thanks for your reply. From my understanding, the wrappers mainly help catch
errors on the MPI API level. The errors I reported are well below the API
layer (please correct me if I'm wrong here). However, I re-ran the code with
the wrapper loaded via LD_PRELOAD and without the suppression file and the
warnings issued by Valgrind for the shmem segment handling code and leaking
memory from MPI_Win_allocate_shared are basically the same. Nevertheless, I
am attaching the full log of that run as well.

Cheers
Joseph



On 11/14/2016 05:07 PM, D'Alessandro, Luke K wrote:

Hi Joseph,

I don’t have a solution to your issue, but I’ve found that the valgrind
mpi wrapper is necessary to eliminate many of the false positives that the
suppressions file can’t.


http://valgrind.org/docs/manual/mc-manual.html#mc-manual.mpiwrap.gettingstarted

You should LD_PRELOAD the libmpiwrap from your installation. If it’s not
there then you can rebuild valgrind with CC=mpicc to have it built.

Hope this helps move you towards a solution.

Luke


On Nov 14, 2016, at 5:49 AM, Joseph Schuchart  wrote:

All,

I am investigating an MPI application using Valgrind and see a load of
memory leaks reported in MPI-related code. Please find the full log
attached. Some observations/questions:

1) According to the information available at
https://www.open-mpi.org/faq/?category=debugging#valgrind_clean the
suppression file should help get a clean run of an MPI application despite
several buffers not being free'd by MPI_Finalize. Is this assumption still
valid? If so, maybe the suppression file needs an update as I still see
reports on leaked memory allocated in MPI_Init?

2) There seem to be several invalid reads and writes in the
opal_shmem_segment_* functions. Are they significant or can we regard them
as false positives?

3) The code example attached allocates memory using
MPI_Win_allocate_shared and frees it using MPI_Win_free. However, Valgrind
reports some memory to be leaking, e.g.:

==4020== 16 bytes in 1 blocks are definitely lost in loss record 21 of
234
==4020==at 0x4C2DB8F: malloc (in
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==4020==by 0xCFDCD47: component_select (osc_sm_component.c:277)
==4020==by 0x4F39FC3: ompi_osc_base_select (osc_base_init.c:73)
==4020==by 0x4E945DC: ompi_win_allocate_shared (win.c:272)
==4020==by 0x4EF6576: PMPI_Win_allocate_shared
(pwin_allocate_shared.c:80)
==4020==by 0x400E96: main (mpi_dynamic_win_free.c:48)

Can someone please confirm that the way the shared window memory is
free'd is actually correct? I noticed that the amount of memory that is
reported to be leaking scales with the number of windows that are allocated
and free'd. In our case this happens in a set of unit tests that all
allocate their own shared memory windows and thus the amount of leaked
memory piles up quite a bit.

I built the code using GCC 5.4.0 with OpenMPI 2.0.1 and ran it on a
single node. How to reproduce:

$ mpicc -Wall -ggdb mpi_dynamic_win_free.c -o mpi_dynamic_win_free

$ mpirun -n 2 valgrind --leak-check=full
--suppressions=$HOME/opt/openmpi-2.0.1/share/openmpi/openmpi-valgrind.supp
./mpi_dynamic_win_free

Best regards,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de




--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de




--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center S

[OMPI users] MPI_Win_allocate: Memory alignment

2017-02-14 Thread Joseph Schuchart

Hi,

We have been experiencing strange crashes in our application that mostly 
operates on memory allocated through MPI_Win_allocate and 
MPI_Win_allocate_shared. We eventually realized that the application 
crashes if it is compiled with -O3 or -Ofast and run with an odd number 
of processors on our x86_64 machines.


After some debugging we found that the minimum alignment of the memory 
returned by MPI_Win_allocate is 4 Bytes, which is fine for 32b data 
types but causes problems with 64b data types (such as size_t) and 
automatic loop vectorization (tested with GCC 5.3.0). Here the compiler 
assumes a natural alignment, which should be at least 8 Byte on x86_64 
and is guaranteed by malloc and new.


Interestingly, the alignment of the returned memory depends on the 
number of processes running. I am attaching a small reproducer that 
prints the alignments of memory returned by MPI_Win_allocate, 
MPI_Win_allocate_shared, and MPI_Alloc_mem (the latter seems to be fine).


Example for 2 processes (correct alignment):

[MPI_Alloc_mem] Alignment of baseptr=0x260ac60: 32
[MPI_Win_allocate] Alignment of baseptr=0x7f94d7aa30a8: 40
[MPI_Win_allocate_shared] Alignment of baseptr=0x7f94d7aa30a8: 40

Example for 3 processes (alignment 4 Bytes even with 8 Byte displacement 
unit):


[MPI_Alloc_mem] Alignment of baseptr=0x115e970: 48
[MPI_Win_allocate] Alignment of baseptr=0x7f685f50f0c4: 4
[MPI_Win_allocate_shared] Alignment of baseptr=0x7fec618bc0c4: 4

Is this a known issue? I expect users to rely on basic alignment 
guarantees made by malloc/new to be true for any function providing 
malloc-like behavior, even more so as a hint on the alignment 
requirements is passed to MPI_Win_allocate in the form of the disp_unit 
argument.


I was able to reproduce this issue in both OpenMPI 1.10.5 and 2.0.2. I 
also tested with MPICH, which provides correct alignment.


Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

#include <stdio.h>
#include <stdint.h>
#include <mpi.h>

static void
test_allocmem()
{
char *baseptr;
MPI_Info win_info;
MPI_Info_create(&win_info);
MPI_Info_set(win_info, "alloc_shared_noncontig", "true");


MPI_Alloc_mem(sizeof(uint64_t), win_info, &baseptr);

printf("[MPI_Alloc_mem] Alignment of baseptr=%p: %li\n", baseptr, ((uint64_t)baseptr) % 64);
MPI_Info_free(&win_info);

MPI_Free_mem(baseptr);
}

static void
test_allocate()
{
char *baseptr;
MPI_Win win;

MPI_Info win_info;
MPI_Info_create(&win_info);
MPI_Info_set(win_info, "alloc_shared_noncontig", "true");


MPI_Win_allocate(
sizeof(uint64_t), 
sizeof(uint64_t), 
win_info, 
MPI_COMM_WORLD, 
&baseptr, 
&win);

printf("[MPI_Win_allocate] Alignment of baseptr=%p: %li\n", baseptr, ((uint64_t)baseptr) % 64);

MPI_Win_free(&win);

MPI_Info_free(&win_info);
}

static void 
test_allocate_shared()
{
char *baseptr;
MPI_Win win;

MPI_Comm sharedmem_comm;
MPI_Group sharedmem_group, group_all;
MPI_Comm_split_type(
MPI_COMM_WORLD,
MPI_COMM_TYPE_SHARED,
1,
MPI_INFO_NULL,
&sharedmem_comm);

MPI_Info win_info;
MPI_Info_create(&win_info);
MPI_Info_set(win_info, "alloc_shared_noncontig", "true");

MPI_Win_allocate_shared(
sizeof(uint64_t), 
sizeof(uint64_t), 
win_info, 
sharedmem_comm, 
&baseptr, 
&win);

printf("[MPI_Win_allocate_shared] Alignment of baseptr=%p: %li\n", baseptr, ((uint64_t)baseptr) % 64);

MPI_Win_free(&win);


MPI_Info_free(&win_info);


}

int main(int argc, char **argv)
{
MPI_Init(&argc, &argv);

test_allocmem();
MPI_Barrier(MPI_COMM_WORLD);
test_allocate();
MPI_Barrier(MPI_COMM_WORLD);
test_allocate_shared();

MPI_Finalize();

return 0;
}
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_Win_allocate: Memory alignment

2017-02-15 Thread Joseph Schuchart

Gilles,

Thanks for the quick reply and the immediate fix. I can confirm that 
allocations from both MPI_Win_allocate_shared and MPI_Win_allocate are 
now consistently aligned at 8-byte boundaries and the application runs 
fine now.


For the record, allocations from malloc and MPI_Alloc_mem are 
consistently aligned on 16 bytes on my machine. I have not investigated 
whether the difference in alignment has any impact on performance. 
Unfortunately, MPI in general does not seem to offer means for 
controlling the alignment as posix_memalign does, so we would have to 
ensure larger alignments ourselves if this was the case.
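
In case it is useful: one workaround would be to over-allocate the 
window and round the base pointer up ourselves. A rough sketch 
(untested; win_allocate_aligned is just an illustrative helper and 
align must be a power of two):

```
#include <mpi.h>
#include <stdint.h>

/* Allocate a window of at least `size` bytes and return a pointer into it
 * that is aligned to `align` bytes. The offset of the returned pointer
 * relative to the window base still has to be accounted for (or
 * communicated) when computing target displacements for RMA operations. */
static void *win_allocate_aligned(MPI_Aint size, size_t align,
                                  MPI_Comm comm, MPI_Win *win)
{
  char *base;
  MPI_Win_allocate(size + (MPI_Aint)align - 1, 1, MPI_INFO_NULL,
                   comm, &base, win);
  uintptr_t addr = (uintptr_t)base;
  return (void *)((addr + align - 1) & ~(uintptr_t)(align - 1));
}
```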


Best regards
Joseph

On 02/15/2017 05:45 AM, Gilles Gouaillardet wrote:


Joseph,


thanks for the report and the test program.


the memory allocated by MPI_Win_allocate_shared() is indeed aligned on 
(4*communicator_size).


i could not reproduce such a thing with MPI_Win_allocate(), but will 
investigate it.



i fixed MPI_Win_allocate_shared() in 
https://github.com/open-mpi/ompi/pull/2978,


meanwhile, you can manually download and apply the patch at 
https://github.com/open-mpi/ompi/pull/2978.patch



Cheers,


Gilles


On 2/14/2017 11:01 PM, Joseph Schuchart wrote:

Hi,

We have been experiencing strange crashes in our application that 
mostly works on memory allocated through MPI_Win_allocate and 
MPI_Win_allocate_shared. We eventually realized that the application 
crashes if it is compiled with -O3 or -Ofast and run with an odd 
number of processors on our x86_64 machines.


After some debugging we found that the minimum alignment of the 
memory returned by MPI_Win_allocate is 4 Bytes, which is fine for 32b 
data types but causes problems with 64b data types (such as size_t) 
and automatic loop vectorization (tested with GCC 5.3.0). Here the 
compiler assumes a natural alignment, which should be at least 8 Byte 
on x86_64 and is guaranteed by malloc and new.


Interestingly, the alignment of the returned memory depends on the 
number of processes running. I am attaching a small reproducer that 
prints the alignments of memory returned by MPI_Win_alloc, 
MPI_Win_alloc_shared, and MPI_Alloc_mem (the latter seems to be fine).


Example for 2 processes (correct alignment):

[MPI_Alloc_mem] Alignment of baseptr=0x260ac60: 32
[MPI_Win_allocate] Alignment of baseptr=0x7f94d7aa30a8: 40
[MPI_Win_allocate_shared] Alignment of baseptr=0x7f94d7aa30a8: 40

Example for 3 processes (alignment 4 Bytes even with 8 Byte 
displacement unit):


[MPI_Alloc_mem] Alignment of baseptr=0x115e970: 48
[MPI_Win_allocate] Alignment of baseptr=0x7f685f50f0c4: 4
[MPI_Win_allocate_shared] Alignment of baseptr=0x7fec618bc0c4: 4

Is this a known issue? I expect users to rely on basic alignment 
guarantees made by malloc/new to be true for any function providing 
malloc-like behavior, even more so as a hint on the alignment 
requirements is passed to MPI_Win_alloc in the form of the disp_unit 
argument.


I was able to reproduce this issue in both OpenMPI 1.10.5 and 2.0.2. 
I also tested with MPICH, which provides correct alignment.


Cheers,
Joseph





--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] MPI_THREAD_MULTIPLE: Fatal error on MPI_Win_create

2017-02-18 Thread Joseph Schuchart

All,

I am seeing a fatal error with OpenMPI 2.0.2 if requesting support for 
MPI_THREAD_MULTIPLE and afterwards creating a window using 
MPI_Win_create. I am attaching a small reproducer. The output I get is 
the following:


```
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
--
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this 
release.

Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.
--
[beryl:10705] *** An error occurred in MPI_Win_create
[beryl:10705] *** reported by process [2149974017,2]
[beryl:10705] *** on communicator MPI_COMM_WORLD
[beryl:10705] *** MPI_ERR_WIN: invalid window
[beryl:10705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[beryl:10705] ***and potentially your MPI job)
[beryl:10698] 3 more processes have sent help message help-osc-pt2pt.txt 
/ mpi-thread-multiple-not-supported
[beryl:10698] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
all help / error messages
[beryl:10698] 3 more processes have sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal

```

I am running on a single node (my laptop). Both OpenMPI and the 
application were compiled using GCC 5.3.0. Naturally, there is no 
support for Infiniband available. Should I signal OpenMPI that I am 
indeed running on a single node? If so, how can I do that? Can't this be 
detected by OpenMPI automatically? The test succeeds if I only request 
MPI_THREAD_SINGLE.


OpenMPI 2.0.2 has been configured using only 
--enable-mpi-thread-multiple and --prefix configure parameters. I am 
attaching the output of ompi_info.


Please let me know if you need any additional information.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <mpi.h>


int main(int argc, char **argv)
{
int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
printf("MPI_THREAD_MULTIPLE supported: %s\n", (provided == MPI_THREAD_MULTIPLE) ? "yes" : "no" );

MPI_Win win;
char *base = malloc(sizeof(uint64_t));

MPI_Win_create(base, sizeof(uint64_t), sizeof(uint64_t), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

MPI_Win_free(&win);
free(base);

MPI_Finalize();

return 0;
}

 Package: Open MPI joseph@beryl Distribution
Open MPI: 2.0.2
  Open MPI repo revision: v2.0.1-348-ge291d0e
   Open MPI release date: Jan 31, 2017
Open RTE: 2.0.2
  Open RTE repo revision: v2.0.1-348-ge291d0e
   Open RTE release date: Jan 31, 2017
OPAL: 2.0.2
  OPAL repo revision: v2.0.1-348-ge291d0e
   OPAL release date: Jan 31, 2017
 MPI API: 3.1.0
Ident string: 2.0.2
  Prefix: /home/joseph/opt/openmpi-2.0.2
 Configured architecture: x86_64-unknown-linux-gnu
  Configure host: beryl
   Configured by: joseph
   Configured on: Wed Feb  1 11:03:54 CET 2017
  Configure host: beryl
Built by: joseph
Built on: Wed Feb  1 11:09:15 CET 2017
  Built host: beryl
  C bindings: yes
C++ bindings: no
 Fort mpif.h: no
Fort use mpi: no
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
 Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: gcc
 C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
  C compiler version: 5.4.1
C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
   Fort compiler: none
   Fort compiler abs: none
 Fort ignore TKR: no
   Fort 08 assumed shape: no
  Fort optional args: no
  Fort INTERFACE: no
Fort ISO_FORTRAN_ENV: no
   Fort STORAGE_SIZE: no
  Fort BIND(C) (all): no
  Fort ISO_C_BINDING: no
 Fort SUBROUTINE BIND(C): no
   Fort TYPE,BIND(C): no
 Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
  Fort PROTECTED: no
   Fort ABSTRACT: no
   Fort ASYNCHRONOUS: no
  Fort PROCEDURE: no
 Fort USE...ONLY: no
   Fort C_FUNLOC: no
 Fort f08 using wrappers: no
 Fort MPI_SIZEOF: no
 C profiling: yes
   C++ profiling: no
   Fort mpif.h profiling: no
  Fort use mpi profiling: no
   Fort use mpi_f08 prof: no
  C++ exceptions: no
  Thread support: posix (MPI_THRE

Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error on MPI_Win_create

2017-02-19 Thread Joseph Schuchart

Hi Howard,

Thanks for your quick reply and your suggestions. I exported both 
variables as you suggested but neither has any impact. The error message 
stays the same with both env variables set. Is there any other way to 
get more information from OpenMPI?


Sorry for not mentioning my OS. I'm running on a Linux Mint 18.1 with 
stock kernel 4.8.0-36.


Joseph

On 02/18/2017 05:18 PM, Howard Pritchard wrote:

Hi Joseph

What OS are you using when running the test?

Could you try running with

export OMPI_mca_osc=^pt2pt
and
export OMPI_mca_osc_base_verbose=10

This error message was put in to this OMPI release because this part 
of the code has known problems when used multi threaded.




Joseph Schuchart <schuch...@hlrs.de> 
wrote on Sat, 18 Feb 2017 at 04:02:


All,

I am seeing a fatal error with OpenMPI 2.0.2 if requesting support for
MPI_THREAD_MULTIPLE and afterwards creating a window using
MPI_Win_create. I am attaching a small reproducer. The output I get is
the following:

```
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
--
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
release.
Workarounds are to run on a single node, or to use a system with
an RDMA
capable network such as Infiniband.
--
[beryl:10705] *** An error occurred in MPI_Win_create
[beryl:10705] *** reported by process [2149974017,2]
[beryl:10705] *** on communicator MPI_COMM_WORLD
[beryl:10705] *** MPI_ERR_WIN: invalid window
[beryl:10705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[beryl:10705] ***and potentially your MPI job)
[beryl:10698] 3 more processes have sent help message
help-osc-pt2pt.txt
/ mpi-thread-multiple-not-supported
[beryl:10698] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
[beryl:10698] 3 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
```

I am running on a single node (my laptop). Both OpenMPI and the
application were compiled using GCC 5.3.0. Naturally, there is no
support for Infiniband available. Should I signal OpenMPI that I am
indeed running on a single node? If so, how can I do that? Can't
this be
detected by OpenMPI automatically? The test succeeds if I only request
MPI_THREAD_SINGLE.

OpenMPI 2.0.2 has been configured using only
--enable-mpi-thread-multiple and --prefix configure parameters. I am
attaching the output of ompi_info.

Please let me know if you need any additional information.

Cheers,
Joseph

    --
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OMPI users] MPI_THREAD_MULTIPLE: Fatal error on MPI_Win_create

2017-02-19 Thread Joseph Schuchart

Gilles,

Sure, this time I see more output (seems there was a typo in the env 
variable earlier):


```
$ echo $OMPI_MCA_osc
^pt2pt
$ echo $OMPI_MCA_osc_base_verbose
10
$ mpirun -n 2 ./a.out
[beryl:12905] mca: base: components_register: registering framework osc 
components
[beryl:12904] mca: base: components_register: registering framework osc 
components

[beryl:12904] mca: base: components_register: found loaded component rdma
[beryl:12904] mca: base: components_register: component rdma register 
function successful

[beryl:12904] mca: base: components_register: found loaded component sm
[beryl:12904] mca: base: components_register: component sm has no 
register or open function

[beryl:12904] mca: base: components_open: opening osc components
[beryl:12904] mca: base: components_open: found loaded component rdma
[beryl:12904] mca: base: components_open: found loaded component sm
[beryl:12904] mca: base: components_open: component sm open function 
successful

[beryl:12905] mca: base: components_register: found loaded component rdma
[beryl:12905] mca: base: components_register: component rdma register 
function successful

[beryl:12905] mca: base: components_register: found loaded component sm
[beryl:12905] mca: base: components_register: component sm has no 
register or open function

[beryl:12905] mca: base: components_open: opening osc components
[beryl:12905] mca: base: components_open: found loaded component rdma
[beryl:12905] mca: base: components_open: found loaded component sm
[beryl:12905] mca: base: components_open: component sm open function 
successful

[beryl:12904] *** An error occurred in MPI_Win_create
[beryl:12904] *** reported by process [2609840129,0]
[beryl:12904] *** on communicator MPI_COMM_WORLD
[beryl:12904] *** MPI_ERR_WIN: invalid window
[beryl:12904] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[beryl:12904] ***and potentially your MPI job)
```

HTH. If not, I'd be happy to provide you with anything else that might help.

Best
Joseph

On 02/19/2017 01:13 PM, Gilles Gouaillardet wrote:

Joseph,

Would you mind trying again with
export OMPI_MCA_osc=^pt2pt
export OMPI_MCA_osc_base_verbose=10

If it still does not work, then please post the output

Cheers,

Gilles

Joseph Schuchart  wrote:

Hi Howard,

Thanks for your quick reply and your suggestions. I exported both 
variables as you suggested but neither has any impact. The error 
message stays the same with both env variables set. Is there any other 
way to get more information from OpenMPI?


Sorry for not mentioning my OS. I'm running on a Linux Mint 18.1 with 
stock kernel 4.8.0-36.


Joseph

On 02/18/2017 05:18 PM, Howard Pritchard wrote:

Hi Joseph

What OS are you using when running the test?

Could you try running with

export OMPI_mca_osc=^pt2pt
and
export OMPI_mca_osc_base_verbose=10

This error message was put in to this OMPI release because this part 
of the code has known problems when used multi threaded.




Joseph Schuchart <schuch...@hlrs.de> 
wrote on Sat, 18 Feb 2017 at 04:02:


All,

I am seeing a fatal error with OpenMPI 2.0.2 if requesting
support for
MPI_THREAD_MULTIPLE and afterwards creating a window using
MPI_Win_create. I am attaching a small reproducer. The output I
get is
the following:

```
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
MPI_THREAD_MULTIPLE supported: yes
--
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
release.
Workarounds are to run on a single node, or to use a system with
an RDMA
capable network such as Infiniband.
--
[beryl:10705] *** An error occurred in MPI_Win_create
[beryl:10705] *** reported by process [2149974017,2]
[beryl:10705] *** on communicator MPI_COMM_WORLD
[beryl:10705] *** MPI_ERR_WIN: invalid window
[beryl:10705] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator
will now abort,
[beryl:10705] ***and potentially your MPI job)
[beryl:10698] 3 more processes have sent help message
help-osc-pt2pt.txt
/ mpi-thread-multiple-not-supported
[beryl:10698] Set MCA parameter "orte_base_help_aggregate" to 0
to see
all help / error messages
[beryl:10698] 3 more processes have sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
```

I am running on a single node (my laptop). Both OpenMPI and the
application were compiled using GCC 5.3.0. Naturally, there is no
support for Infiniband available. Should I signal OpenMPI that I am
indeed running on a single node? If so, how can I do that? Can't
this be
detected by OpenMPI automatically? The test succeeds if I only
request
MPI_THREAD_SING

[OMPI users] MPI_THREAD_MULTIPLE: Fatal error in MPI_Win_flush

2017-02-19 Thread Joseph Schuchart

All,

We are trying to combine MPI_Put and MPI_Win_flush on locked (using 
MPI_Win_lock_all) dynamic windows to mimic a blocking put. The 
application is (potentially) multi-threaded and we are thus relying on 
MPI_THREAD_MULTIPLE support to be available.


When I try to use this combination (MPI_Put + MPI_Win_flush) in our 
application, I am seeing threads occasionally hang in MPI_Win_flush, 
probably waiting for some progress to happen. However, when I try to 
create a small reproducer (attached, the original application has 
multiple layers of abstraction), I am seeing fatal errors in 
MPI_Win_flush if using more than one thread:


```
[beryl:18037] *** An error occurred in MPI_Win_flush
[beryl:18037] *** reported by process [4020043777,2]
[beryl:18037] *** on win pt2pt window 3
[beryl:18037] *** MPI_ERR_RMA_SYNC: error executing rma sync
[beryl:18037] *** MPI_ERRORS_ARE_FATAL (processes in this win will now 
abort,

[beryl:18037] ***and potentially your MPI job)
```

I could only trigger this on dynamic windows with multiple concurrent 
threads running.


So: Is this a valid MPI program (except for the missing clean-up at the 
end ;))? It seems to run fine with MPICH but maybe they are more 
tolerant to some programming errors...


If it is a valid MPI program, I assume there is some race condition in 
MPI_Win_flush that leads to the fatal error (or the hang that I observe 
otherwise)?


I tested this with OpenMPI 1.10.5 on single node Linux Mint 18.1 system 
with stock kernel 4.8.0-36 (aka my laptop). OpenMPI and the test were 
both compiled using GCC 5.3.0. I could not run it using OpenMPI 2.0.2 
due to the fatal error in MPI_Win_create (which also applies to 
MPI_Win_create_dynamic, see my other thread, not sure if they are related).


Please let me know if this is a valid use case and whether I can provide 
you with additional information if required.


Many thanks in advance!

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <assert.h>
#include <omp.h>
#include <mpi.h>

static void
allocate_dynamic(size_t elsize, size_t count, MPI_Win *win, MPI_Aint *disp_set, char **b)
{
char *base;
MPI_Aint disp;
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, win) != MPI_SUCCESS) {
printf("Failed to create dynamic window!\n");
exit(1);
}

if (MPI_Alloc_mem(elsize*count, MPI_INFO_NULL, &base) != MPI_SUCCESS) {
printf("Failed to allocate memory!\n");
exit(1);
}


if (MPI_Win_attach(*win, base, elsize*count) != MPI_SUCCESS) {
printf("Failed to attach memory to dynamic window!\n");
exit(1);
}


MPI_Get_address(base, &disp);
printf("Offset at process %i: %p (%lu)\n", rank, base, disp);
MPI_Allgather(&disp, 1, MPI_AINT, disp_set, 1, MPI_AINT, MPI_COMM_WORLD);

MPI_Win_lock_all(0, *win);

*b = base;
}

static void
put_blocking(uint64_t value, int target, MPI_Aint offset, MPI_Win win)
{
MPI_Put(&value, 1, MPI_UNSIGNED_LONG, target, offset, 1, MPI_UNSIGNED_LONG, win);
MPI_Win_flush(target, win);
}

int main(int argc, char **argv)
{
int provided;
int rank, size;
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
printf("MPI_THREAD_MULTIPLE supported: %s\n", (provided == MPI_THREAD_MULTIPLE) ? "yes" : "no" );

MPI_Win win;
char *base;
// every thread writes so many values to our neighbor
// the offset is controlled by the thread ID
int elem_per_thread = 10;
int num_threads = omp_get_num_threads();

MPI_Aint *disp_set = calloc(size, sizeof(MPI_Aint));

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

allocate_dynamic(sizeof(uint64_t), elem_per_thread*num_threads, &win, disp_set, &base);

int target = (rank + 1) % size;
#pragma omp parallel 
{
int thread_num = omp_get_thread_num();
size_t offset  = disp_set[target] + (elem_per_thread * thread_num)*sizeof(uint64_t);
for (int i = 0; i < elem_per_thread; i++) {
printf("[%i:%i] win[%zu => %zu + (%i * %i)*%zu] <= %i\n", rank, thread_num, 
offset + (sizeof(uint64_t) * i), disp_set[target], elem_per_thread, thread_num, 
sizeof(uint64_t), thread_num);
put_blocking(thread_num, target, offset + (sizeof(uint64_t) * i), win);
}
}

MPI_Barrier(MPI_COMM_WORLD);

// check for the result locally (not very sophisticated)
#pragma omp parallel
{
int thread_num = omp_get_thread_num();
assert(((uint64_t*)base)[(elem_per_thread * thread_num)] == thread_num);
}
MPI_Win_unlock_all(win);

//MPI_Win_free(&win);

free(disp_s

Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error in MPI_Win_flush

2017-02-20 Thread Joseph Schuchart

Nathan,

Thanks for your clarification. Just so that I understand where my 
misunderstanding of this matter comes from: can you please point me to 
the place in the standard that prohibits thread-concurrent window 
synchronization using MPI_Win_flush[_all]? I can neither seem to find 
such a passage in 11.5.4 (Flush and Sync), nor in 12.4 (MPI and 
Threads). The latter explicitly excludes waiting on the same request 
object (which our code does not do) and collective operations on the same 
communicator (which MPI_Win_flush is not) but it fails to mention 
one-sided non-collective sync operations. Any hint would be much 
appreciated.


We will look at MPI_Rput and MPI_Rget. However, having a single put 
paired with a flush is just the simplest case. We also want to support 
multiple asynchronous operations that are eventually synchronized on a 
per-thread basis where keeping the request handles might not be feasible.
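
For illustration, a request-based variant of the put in the attached 
reproducer might look like the sketch below (put_request_based would be 
a hypothetical replacement for put_blocking; note the caveat on local 
vs. remote completion in the comment):

```
#include <mpi.h>
#include <stdint.h>

/* Each thread completes its own operation through its own request, so no
 * thread-concurrent MPI_Win_flush on the same target is needed. Note that
 * completing an MPI_Rput request only guarantees local completion (the
 * origin buffer may be reused); remote visibility still requires a flush
 * or unlock on the window. */
static void
put_request_based(uint64_t value, int target, MPI_Aint offset, MPI_Win win)
{
    MPI_Request req;
    MPI_Rput(&value, 1, MPI_UNSIGNED_LONG, target, offset,
             1, MPI_UNSIGNED_LONG, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```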


Thanks,
Joseph

On 02/20/2017 02:30 AM, Nathan Hjelm wrote:

You can not perform synchronization at the same time as communication on the 
same target. This means if one thread is in MPI_Put/MPI_Get/MPI_Accumulate 
(target) you can’t have another thread in MPI_Win_flush (target) or 
MPI_Win_flush_all(). If your program is doing that it is not a valid MPI 
program. If you want to ensure a particular put operation is complete try 
MPI_Rput instead.

-Nathan


On Feb 19, 2017, at 2:34 PM, Joseph Schuchart  wrote:

All,

We are trying to combine MPI_Put and MPI_Win_flush on locked (using 
MPI_Win_lock_all) dynamic windows to mimic a blocking put. The application is 
(potentially) multi-threaded and we are thus relying on MPI_THREAD_MULTIPLE 
support to be available.

When I try to use this combination (MPI_Put + MPI_Win_flush) in our 
application, I am seeing threads occasionally hang in MPI_Win_flush, probably 
waiting for some progress to happen. However, when I try to create a small 
reproducer (attached, the original application has multiple layers of 
abstraction), I am seeing fatal errors in MPI_Win_flush if using more than one 
thread:

```
[beryl:18037] *** An error occurred in MPI_Win_flush
[beryl:18037] *** reported by process [4020043777,2]
[beryl:18037] *** on win pt2pt window 3
[beryl:18037] *** MPI_ERR_RMA_SYNC: error executing rma sync
[beryl:18037] *** MPI_ERRORS_ARE_FATAL (processes in this win will now abort,
[beryl:18037] ***and potentially your MPI job)
```

I could only trigger this on dynamic windows with multiple concurrent threads 
running.

So: Is this a valid MPI program (except for the missing clean-up at the end 
;))? It seems to run fine with MPICH but maybe they are more tolerant to some 
programming errors...

If it is a valid MPI program, I assume there is some race condition in 
MPI_Win_flush that leads to the fatal error (or the hang that I observe 
otherwise)?

I tested this with OpenMPI 1.10.5 on single node Linux Mint 18.1 system with 
stock kernel 4.8.0-36 (aka my laptop). OpenMPI and the test were both compiled 
using GCC 5.3.0. I could not run it using OpenMPI 2.0.2 due to the fatal error 
in MPI_Win_create (which also applies to MPI_Win_create_dynamic, see my other 
thread, not sure if they are related).

Please let me know if this is a valid use case and whether I can provide you 
with additional information if required.

Many thanks in advance!

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Shared Windows and MPI_Accumulate

2017-03-01 Thread Joseph Schuchart

Hi all,

We are seeing issues in one of our applications, in which processes in a 
shared communicator allocate a shared MPI window and execute 
MPI_Accumulate simultaneously on it to iteratively update each process' 
values. The test boils down to the sample code attached. Sample output 
is as follows:


```
$ mpirun -n 4 ./mpi_shared_accumulate
[1] baseptr[0]: 1010 (expected 1010)
[1] baseptr[1]: 1011 (expected 1011)
[1] baseptr[2]: 1012 (expected 1012)
[1] baseptr[3]: 1013 (expected 1013)
[1] baseptr[4]: 1014 (expected 1014)
[2] baseptr[0]: 1005 (expected 1010) [!!!]
[2] baseptr[1]: 1006 (expected 1011) [!!!]
[2] baseptr[2]: 1007 (expected 1012) [!!!]
[2] baseptr[3]: 1008 (expected 1013) [!!!]
[2] baseptr[4]: 1009 (expected 1014) [!!!]
[3] baseptr[0]: 1010 (expected 1010)
[0] baseptr[0]: 1010 (expected 1010)
[0] baseptr[1]: 1011 (expected 1011)
[0] baseptr[2]: 1012 (expected 1012)
[0] baseptr[3]: 1013 (expected 1013)
[0] baseptr[4]: 1014 (expected 1014)
[3] baseptr[1]: 1011 (expected 1011)
[3] baseptr[2]: 1012 (expected 1012)
[3] baseptr[3]: 1013 (expected 1013)
[3] baseptr[4]: 1014 (expected 1014)
```

Each process should hold the same values but sometimes (not on all 
executions) random processes diverge (marked with [!!!]).


I made the following observations:

1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with 
MPICH 3.2.
2) The issue occurs only if the window is allocated through 
MPI_Win_allocate_shared, using MPI_Win_allocate works fine.
3) The code assumes that MPI_Accumulate atomically updates individual 
elements (please correct me if that is not covered by the MPI standard).


Both OpenMPI and the example code were compiled using GCC 5.4.1 and run 
on a Linux system (single node). OpenMPI was configured with 
--enable-mpi-thread-multiple and --with-threads but the application is 
not multi-threaded. Please let me know if you need any other information.


Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
MPI_Init(&argc, &argv);

int *baseptr;
MPI_Win win;
int local_elem_count = 5;
int comm_size;
int comm_rank;

int *local_elem_buf = malloc(sizeof(int) * local_elem_count);

MPI_Comm sharedmem_comm;
MPI_Group sharedmem_group, group_all;
MPI_Comm_split_type(
MPI_COMM_WORLD,
MPI_COMM_TYPE_SHARED,
1,
MPI_INFO_NULL,
&sharedmem_comm);

MPI_Comm_size(sharedmem_comm, &comm_size);
MPI_Comm_rank(sharedmem_comm, &comm_rank);

for (int i = 0; i < local_elem_count; ++i) {
local_elem_buf[i] = comm_rank + 1;
}

MPI_Info win_info;
MPI_Info_create(&win_info);
MPI_Info_set(win_info, "alloc_shared_noncontig", "true");

//NOTE: Using MPI_Win_allocate here works as expected
MPI_Win_allocate_shared(
local_elem_count*sizeof(int),
sizeof(int),
win_info, 
sharedmem_comm, 
&baseptr, 
&win);
MPI_Info_free(&win_info);

MPI_Win_lock_all(0, win);

int loffs = 0;
for (int i = 0; i < local_elem_count; ++i) {
  baseptr[i] = 1000 + loffs++;
}

// accumulate into each process' local memory
for (int i = 0; i < comm_size; i++) {
MPI_Accumulate(
local_elem_buf,
local_elem_count,
MPI_INT,
i, 0,
local_elem_count,
MPI_INT,
MPI_SUM,
win);
MPI_Win_flush(i, win);
}

// wait for completion of all processes
MPI_Barrier(sharedmem_comm);
// print local elements
for (int i = 0; i < local_elem_count; ++i) {
  int expected = (1000 + i) +
 ((comm_size * (comm_size + 1)) / 2);
  printf("[%i] baseptr[%i]: %i (expected %i)%s\n",
comm_rank, i, baseptr[i], expected,
(baseptr[i] != expected) ? " [!!!]" : "");
}

MPI_Win_unlock_all(win);

MPI_Win_free(&win);


MPI_Finalize();

return 0;
}
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error in MPI_Win_flush

2017-03-06 Thread Joseph Schuchart
Ping :) I would really appreciate any input on my question below. I 
crawled through the standard but cannot seem to find the wording that 
prohibits thread-concurrent access and synchronization.


Using MPI_Rget works in our case but MPI_Rput only guarantees local 
completion, not remote completion. Specifically, a thread-parallel 
application would have to go into some serial region just to issue an 
MPI_Win_flush before a thread can read a value previously written to the 
same target. Re-reading remote values in the same process/thread might 
not be efficient but is a valid use-case for us.
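To make the concern concrete, here is a minimal sketch of the pattern I have in
mind (assuming a window locked with MPI_Win_lock_all elsewhere; the names are
made up):

```
#include <mpi.h>

/* Sketch: MPI_Wait on an MPI_Rput request only gives local completion, so a
 * flush is still required before the new value can be observed at (or read
 * back from) the target. */
static void put_then_reread(MPI_Win win, int target, MPI_Aint disp, double val)
{
  MPI_Request req;
  double check;

  MPI_Rput(&val, 1, MPI_DOUBLE, target, disp, 1, MPI_DOUBLE, win, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);   /* local completion only */
  MPI_Win_flush(target, win);          /* remote completion     */

  MPI_Rget(&check, 1, MPI_DOUBLE, target, disp, 1, MPI_DOUBLE, win, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now reads the updated value */
}
```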


Best regards,
Joseph


On 02/20/2017 09:23 AM, Joseph Schuchart wrote:

Nathan,

Thanks for your clarification. Just so that I understand where my 
misunderstanding of this matter comes from: can you please point me to 
the place in the standard that prohibits thread-concurrent window 
synchronization using MPI_Win_flush[_all]? I can neither seem to find 
such a passage in 11.5.4 (Flush and Sync), nor in 12.4 (MPI and 
Threads). The latter explicitly excludes waiting on the same request 
object (which it does not) and collective operations on the same 
communicator (which MPI_Win_flush is not) but it fails to mention 
one-sided non-collective sync operations. Any hint would be much 
appreciated.


We will look at MPI_Rput and MPI_Rget. However, having a single put 
paired with a flush is just the simplest case. We also want to support 
multiple asynchronous operations that are eventually synchronized on a 
per-thread basis where keeping the request handles might not be feasible.


Thanks,
Joseph

On 02/20/2017 02:30 AM, Nathan Hjelm wrote:
You can not perform synchronization at the same time as communication 
on the same target. This means if one thread is in 
MPI_Put/MPI_Get/MPI_Accumulate (target) you can’t have another thread 
in MPI_Win_flush (target) or MPI_Win_flush_all(). If your program is 
doing that it is not a valid MPI program. If you want to ensure a 
particular put operation is complete try MPI_Rput instead.


-Nathan

On Feb 19, 2017, at 2:34 PM, Joseph Schuchart  
wrote:


All,

We are trying to combine MPI_Put and MPI_Win_flush on locked (using 
MPI_Win_lock_all) dynamic windows to mimic a blocking put. The 
application is (potentially) multi-threaded and we are thus relying 
on MPI_THREAD_MULTIPLE support to be available.


When I try to use this combination (MPI_Put + MPI_Win_flush) in our 
application, I am seeing threads occasionally hang in MPI_Win_flush, 
probably waiting for some progress to happen. However, when I try to 
create a small reproducer (attached, the original application has 
multiple layers of abstraction), I am seeing fatal errors in 
MPI_Win_flush if using more than one thread:


```
[beryl:18037] *** An error occurred in MPI_Win_flush
[beryl:18037] *** reported by process [4020043777,2]
[beryl:18037] *** on win pt2pt window 3
[beryl:18037] *** MPI_ERR_RMA_SYNC: error executing rma sync
[beryl:18037] *** MPI_ERRORS_ARE_FATAL (processes in this win will 
now abort,

[beryl:18037] ***and potentially your MPI job)
```

I could only trigger this on dynamic windows with multiple 
concurrent threads running.


So: Is this a valid MPI program (except for the missing clean-up at 
the end ;))? It seems to run fine with MPICH but maybe they are more 
tolerant to some programming errors...


If it is a valid MPI program, I assume there is some race condition 
in MPI_Win_flush that leads to the fatal error (or the hang that I 
observe otherwise)?


I tested this with OpenMPI 1.10.5 on single node Linux Mint 18.1 
system with stock kernel 4.8.0-36 (aka my laptop). OpenMPI and the 
test were both compiled using GCC 5.3.0. I could not run it using 
OpenMPI 2.0.2 due to the fatal error in MPI_Win_create (which also 
applies to MPI_Win_create_dynamic, see my other thread, not sure if 
they are related).


Please let me know if this is a valid use case and whether I can 
provide you with additional information if required.


Many thanks in advance!

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users




--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error in MPI_Win_flush

2017-03-08 Thread Joseph Schuchart

Jeff, Nathan,

Thank you for the positive feedback. I took a chance to look at current 
master and tried (based on my humble understanding of the OpenMPI 
internals) to remove the error check in ompi_osc_pt2pt_flush. Upon 
testing with the example code I sent initially, I saw a Segfault that 
stemmed from infinite recursion in ompi_osc_pt2pt_frag_alloc. I fixed it 
locally (see attached patch) and now I am seeing a Segfault somewhere 
below opal_progress():


```
Thread 5 "a.out" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdcf08700 (LWP 23912)]
0x7fffe77b48ad in mca_pml_ob1_recv_frag_callback_match () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_pml_ob1.so

(gdb) bt
#0  0x7fffe77b48ad in mca_pml_ob1_recv_frag_callback_match () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_pml_ob1.so
#1  0x7fffec985d0d in mca_btl_vader_component_progress () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_btl_vader.so
#2  0x76d4220c in opal_progress () from 
/home/joseph/opt/openmpi-master/lib/libopen-pal.so.0
#3  0x7fffe6754c55 in ompi_osc_pt2pt_flush_lock () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_osc_pt2pt.so
#4  0x7fffe67574df in ompi_osc_pt2pt_flush () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_osc_pt2pt.so
#5  0x77b608bc in PMPI_Win_flush () from 
/home/joseph/opt/openmpi-master/lib/libmpi.so.0

#6  0x00401149 in put_blocking ()
#7  0x0040140f in main._omp_fn ()
#8  0x778bfe46 in gomp_thread_start (xdata=) at 
../../../src/libgomp/team.c:119
#9  0x776936ba in start_thread (arg=0x7fffdcf08700) at 
pthread_create.c:333
#10 0x773c982d in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:109

```
I guess there is a race condition somewhere. Unfortunately, I cannot see 
symbols in the standard build (which is probably to be expected). I 
recompiled master with --enable-debug and now I am facing a Segfault in 
mpirun:


```
Thread 1 "mpirun" received signal SIGSEGV, Segmentation fault.
0x7fffd1e2590e in external_close () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_pmix_pmix3x.so

(gdb) bt
#0  0x7fffd1e2590e in external_close () from 
/home/joseph/opt/openmpi-master/lib/openmpi/mca_pmix_pmix3x.so
#1  0x777f23f5 in mca_base_component_close 
(component=0x7fffd20afde0 , output_id=-1) at 
../../../../opal/mca/base/mca_base_components_close.c:53
#2  0x777f24b5 in mca_base_components_close (output_id=-1, 
components=0x77abe250 , 
skip=0x7fffd23c9dc0 )

at ../../../../opal/mca/base/mca_base_components_close.c:85
#3  0x777f2867 in mca_base_select (type_name=0x7789652c 
"pmix", output_id=-1, components_available=0x77abe250 
, best_module=0x7fffd6a0, 
best_component=0x7fffd698,
priority_out=0x0) at 
../../../../opal/mca/base/mca_base_components_select.c:141
#4  0x7786d083 in opal_pmix_base_select () at 
../../../../opal/mca/pmix/base/pmix_base_select.c:35
#5  0x75339ca8 in rte_init () at 
../../../../../orte/mca/ess/hnp/ess_hnp_module.c:640
#6  0x77ae0b62 in orte_init (pargc=0x7fffd86c, 
pargv=0x7fffd860, flags=4) at ../../orte/runtime/orte_init.c:243
#7  0x77b2290b in orte_submit_init (argc=6, argv=0x7fffde58, 
opts=0x0) at ../../orte/orted/orted_submit.c:535
#8  0x004012d7 in orterun (argc=6, argv=0x7fffde58) at 
../../../../orte/tools/orterun/orterun.c:133
#9  0x00400fd6 in main (argc=6, argv=0x7fffde58) at 
../../../../orte/tools/orterun/main.c:13

```

I'm giving up here for tonight. Please let me know if I can help with 
anything else.



Joseph


On 03/07/2017 05:43 PM, Jeff Hammond wrote:
Nathan and I discussed at the MPI Forum last week. I argued that your 
usage is not erroneous, although certain pathological cases (likely 
concocted) can lead to nasty behavior.  He indicated that he would 
remove the error check, but it may require further discussion/debate 
with others.


You can remove the error check from the source and recompile if you 
are in a hurry, or you can use an MPICH-derivative (I have not 
checked, but I doubt MPICH errors on this code.).


Jeff

On Mon, Mar 6, 2017 at 8:30 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:


Ping :) I would really appreciate any input on my question below.
I crawled through the standard but cannot seem to find the wording
that prohibits thread-concurrent access and synchronization.

Using MPI_Rget works in our case but MPI_Rput only guarantees
local completion, not remote completion. Specifically, a
thread-parallel application would have to go into some serial
region just to issue an MPI_Win_flush before a thread can read a
value previously written to the same target. Re-reading remote
values in the same process/thread might not be efficient but is a
valid use-case for us.

Best regards,

Re: [OMPI users] Shared Windows and MPI_Accumulate

2017-03-09 Thread Joseph Schuchart
Well, that is embarrassing! Thank you so much for figuring this out and 
providing a detailed answer (also thanks to everyone else who tried to 
reproduce it). I guess I assumed some synchronization in lock_all even 
though I know that it is not collective. With an additional barrier 
between initialization and accumulate in our original application things 
work smoothly.
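For the archives, a minimal sketch of the fix, based on the example code from 
my first mail (the added MPI_Barrier separates the local initialization from 
the accumulate phase; all variables are as declared in the attached example):

```
/* Sketch based on the attached example code. */
MPI_Win_lock_all(0, win);

for (int i = 0; i < local_elem_count; ++i) {
  baseptr[i] = 1000 + i;            /* local initialization */
}

/* added: nobody may start accumulating before all buffers are initialized */
MPI_Barrier(sharedmem_comm);

for (int i = 0; i < comm_size; i++) {
  MPI_Accumulate(local_elem_buf, local_elem_count, MPI_INT,
                 i, 0, local_elem_count, MPI_INT, MPI_SUM, win);
  MPI_Win_flush(i, win);
}
```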


Best
Joseph


On 03/09/2017 03:10 PM, Steffen Christgau wrote:

Hi Joseph,

in your code, you are updating the local buffer, which is also exposed
via the window, right after the lock_all call, but the stores
(baseptr[i] = 1000 + loffs++, let's call those the buffer
initialization) may overwrite the outcome of other concurrent
operations, i.e. the accumulate calls in your case.

Another process that has already advanced to the accumulate loop may
change data in the local window, but your local process has not
completed the initialization. Thus you lose the outcome of the accumulates
to the initialization in case of process skew.

I provoked process skew by adding a

if (comm_rank == 0) {
sleep(1);
}

before the initialization loop, which enables me to reproduce the wrong
results using GCC 6.3 and OpenMPI 2.0.2 when executing the program with
two MPI processes.

The lock_all call after the buffer initialization gives you no
collective synchronization in the windows' communicator (as hinted on p.
446 in the 3.1 standard). That is, other processes have already
performed their accumulate phase while the local one is still (or not
yet) in the initialization and overwrites the data (see above).

You might consider an EXCLUSIVE lock around your initialization, but
this won't solve the issue, because any other process may do its
accumulate phase after the window creation but before you enter the
buffer initialization loop.

As far as I understand your MWE code, the initialization should complete
before the accumulate loop starts in order to get the correct results. I
suggest adding an MPI_Barrier before the accumulate loop. Since you are using
the unified model, you can omit the proposed exclusive lock (see above)
as well.

Hope this helps.

Regards, Steffen

On 03/01/2017 04:03 PM, Joseph Schuchart wrote:

Hi all,

We are seeing issues in one of our applications, in which processes in a
shared communicator allocate a shared MPI window and execute
MPI_Accumulate simultaneously on it to iteratively update each process'
values. The test boils down to the sample code attached. Sample output
is as follows:

```
$ mpirun -n 4 ./mpi_shared_accumulate
[1] baseptr[0]: 1010 (expected 1010)
[1] baseptr[1]: 1011 (expected 1011)
[1] baseptr[2]: 1012 (expected 1012)
[1] baseptr[3]: 1013 (expected 1013)
[1] baseptr[4]: 1014 (expected 1014)
[2] baseptr[0]: 1005 (expected 1010) [!!!]
[2] baseptr[1]: 1006 (expected 1011) [!!!]
[2] baseptr[2]: 1007 (expected 1012) [!!!]
[2] baseptr[3]: 1008 (expected 1013) [!!!]
[2] baseptr[4]: 1009 (expected 1014) [!!!]
[3] baseptr[0]: 1010 (expected 1010)
[0] baseptr[0]: 1010 (expected 1010)
[0] baseptr[1]: 1011 (expected 1011)
[0] baseptr[2]: 1012 (expected 1012)
[0] baseptr[3]: 1013 (expected 1013)
[0] baseptr[4]: 1014 (expected 1014)
[3] baseptr[1]: 1011 (expected 1011)
[3] baseptr[2]: 1012 (expected 1012)
[3] baseptr[3]: 1013 (expected 1013)
[3] baseptr[4]: 1014 (expected 1014)
```

Each process should hold the same values but sometimes (not on all
executions) random processes diverge (marked through [!!!]).

I made the following observations:

1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with
MPICH 3.2.
2) The issue occurs only if the window is allocated through
MPI_Win_allocate_shared, using MPI_Win_allocate works fine.
3) The code assumes that MPI_Accumulate atomically updates individual
elements (please correct me if that is not covered by the MPI standard).

Both OpenMPI and the example code were compiled using GCC 5.4.1 and run
on a Linux system (single node). OpenMPI was configure with
--enable-mpi-thread-multiple and --with-threads but the application is
not multi-threaded. Please let me know if you need any other information.

Cheers
Joseph



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] [Open MPI Announce] Open MPI v2.1.1 released

2017-05-10 Thread Joseph Schuchart

Dear OpenMPI developers,

Thanks for the great work and the many fixes in this release. Without 
meaning to nitpick here, but:


> - Fix memory allocated by MPI_WIN_ALLOCATE_SHARED to
>   be 64 byte aligned.

The alignment has actually been fixed to 64 *bits*, i.e., 8 bytes, just in case 
someone is relying on it or stumbles across that entry. Verified using 2.1.1 just now.
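
In case someone wants to verify this on their own installation, a minimal 
sketch (run on a single node; it prints the largest power of two dividing the 
returned address):

```
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv)
{
  MPI_Win win;
  double *baseptr;

  MPI_Init(&argc, &argv);
  /* MPI_COMM_WORLD is fine here as long as all ranks run on a single node */
  MPI_Win_allocate_shared(1024 * sizeof(double), sizeof(double),
                          MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);

  uintptr_t addr = (uintptr_t)baseptr;
  printf("alignment of %p: %llu\n", (void*)baseptr,
         (unsigned long long)(addr & -addr));

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```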


Cheers
Joseph

P.S.: The reply-to field in the announcement email states 
us...@open-mpi.org where it should be users@lists.open-mpi.org :)


--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] MPI_Win_allocate: Memory alignment

2017-05-17 Thread Joseph Schuchart

Gilles, all,

I can confirm that the fix has landed in OpenMPI 2.1.1. Unfortunately, 
1.10.7 still provides 4-Byte-aligned memory. Will this be fixed in the 
1.x branch at some point? We are selectively enabling the use of shared 
memory windows when using OpenMPI so it would be interesting to know 
whether it's sufficient to check for the 2.x branch :)
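
For reference, a sketch of the compile-time check we currently use on our 
side; it relies on the OPEN_MPI and OMPI_*_VERSION macros from Open MPI's 
mpi.h, so other MPI libraries need a different check:

```
#include <mpi.h>

/* Sketch: only enable shared-memory windows on Open MPI >= 2.1.1, where
 * MPI_Win_allocate_shared returns 8-byte-aligned memory. */
#if defined(OPEN_MPI) && \
    (OMPI_MAJOR_VERSION > 2 || \
     (OMPI_MAJOR_VERSION == 2 && OMPI_MINOR_VERSION > 1) || \
     (OMPI_MAJOR_VERSION == 2 && OMPI_MINOR_VERSION == 1 && \
      OMPI_RELEASE_VERSION >= 1))
#define USE_SHARED_WINDOWS 1
#else
#define USE_SHARED_WINDOWS 0
#endif
```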


Best
Joseph

On 02/15/2017 05:45 AM, Gilles Gouaillardet wrote:

Joseph,


thanks for the report and the test program.


the memory allocated by MPI_Win_allocate_shared() is indeed aligned on
(4*communicator_size).

i could not reproduce such a thing with MPI_Win_allocate(), but will
investigate it.


i fixed MPI_Win_allocate_shared() in
https://github.com/open-mpi/ompi/pull/2978,

meanwhile, you can manually download and apply the patch at
https://github.com/open-mpi/ompi/pull/2978.patch


Cheers,


Gilles


On 2/14/2017 11:01 PM, Joseph Schuchart wrote:

Hi,

We have been experiencing strange crashes in our application that
mostly works on memory allocated through MPI_Win_allocate and
MPI_Win_allocate_shared. We eventually realized that the application
crashes if it is compiled with -O3 or -Ofast and run with an odd
number of processors on our x86_64 machines.

After some debugging we found that the minimum alignment of the memory
returned by MPI_Win_allocate is 4 Bytes, which is fine for 32b data
types but causes problems with 64b data types (such as size_t) and
automatic loop vectorization (tested with GCC 5.3.0). Here the
compiler assumes a natural alignment, which should be at least 8 Byte
on x86_64 and is guaranteed by malloc and new.

Interestingly, the alignment of the returned memory depends on the
number of processes running. I am attaching a small reproducer that
prints the alignments of memory returned by MPI_Win_alloc,
MPI_Win_alloc_shared, and MPI_Alloc_mem (the latter seems to be fine).

Example for 2 processes (correct alignment):

[MPI_Alloc_mem] Alignment of baseptr=0x260ac60: 32
[MPI_Win_allocate] Alignment of baseptr=0x7f94d7aa30a8: 40
[MPI_Win_allocate_shared] Alignment of baseptr=0x7f94d7aa30a8: 40

Example for 3 processes (alignment 4 Bytes even with 8 Byte
displacement unit):

[MPI_Alloc_mem] Alignment of baseptr=0x115e970: 48
[MPI_Win_allocate] Alignment of baseptr=0x7f685f50f0c4: 4
[MPI_Win_allocate_shared] Alignment of baseptr=0x7fec618bc0c4: 4

Is this a known issue? I expect users to rely on basic alignment
guarantees made by malloc/new to be true for any function providing
malloc-like behavior, even more so as a hint on the alignment
requirements is passed to MPI_Win_alloc in the form of the disp_unit
argument.

I was able to reproduce this issue in both OpenMPI 1.10.5 and 2.0.2. I
also tested with MPICH, which provides correct alignment.

Cheers,
Joseph



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users




___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users




--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] Remote progress in MPI_Win_flush_local

2017-06-23 Thread Joseph Schuchart

All,

We employ the following pattern to send signals between processes:

```
int com_rank, root = 0;
// allocate MPI window
MPI_Win win = allocate_win();
// do some computation
...
// Process 0 waits for a signal
if (com_rank == root) {
  do {
MPI_Fetch_and_op(NULL, &res,
  MPI_INT, com_rank, 0, MPI_NO_OP, win);
MPI_Win_flush_local(com_rank, win);
  } while (res == 0);
} else {
  MPI_Accumulate(
&val, &res, 1, MPI_INT, root, 0, MPI_SUM, win);
  MPI_Win_flush(root, win);
}
[...]
```

We use MPI_Fetch_and_op to atomically query the local memory location 
for a signal and MPI_Accumulate to send the signal (I have omitted the 
reset and other details for simplicity).


If running on a single node (my laptop), this code snippet reproducibly 
hangs, with the root process indefinitely repeating the do-while-loop 
and all other processes being stuck in MPI_Win_flush.


An interesting observation here is that if I replace the 
MPI_Win_flush_local with MPI_Win_flush the application does not hang. 
However, my understanding is that a local flush should be sufficient for 
MPI_Fetch_and_op with MPI_NO_OP as remote completion is not required.


I do not observe this hang with MPICH 3.2 and I am aware that the 
progress semantics of MPI are rather vague. However, I'm curious whether 
this difference is intended and whether or not repeatedly calling into 
MPI communication functions (that do not block) should provide progress 
for incoming RMA operations?


Any input is much appreciated.

Cheers
Joseph
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] Issues with Large Window Allocations

2017-08-24 Thread Joseph Schuchart

All,

I have been experimenting with large window allocations recently and 
have made some interesting observations that I would like to share.


The system under test:
  - Linux cluster equipped with IB,
  - Open MPI 2.1.1,
  - 128GB main memory per node
  - 6GB /tmp filesystem per node

My observations:
1) Running with 1 process on a single node, I can allocate and write to 
memory up to ~110 GB through MPI_Alloc_mem, MPI_Win_allocate, and 
MPI_Win_allocate_shared.


2) If running with 1 process per node on 2 nodes, single large 
allocations succeed, but with the repeated allocate/free cycle in the 
attached code I see the application reproducibly being killed by 
the OOM killer at the 25GB allocation with MPI_Win_allocate_shared. When I 
try to run it under Valgrind I get an error from MPI_Win_allocate at ~50GB 
that I cannot make sense of:


```
MPI_Alloc_mem:  53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[n131302:11989] ***and potentially your MPI job)
```

3) If running with 2 processes on a node, I get the following error from 
both MPI_Win_allocate and MPI_Win_allocate_shared:

```
--
It appears as if there is not enough space for 
/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702 
(the shared-memory backing

file). It is likely that your MPI job will now either abort or experience
performance degradation.

  Local host:  n131702
  Space Requested: 6710890760 B
  Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Alloc_mem works 
as expected, i.e., I can allocate ~50GB per process. I understand that I 
can set $TMP to a bigger filesystem (such as Lustre) but then I am 
greeted with a warning on each allocation and performance seems to drop. 
Is there a way to fall back to the allocation strategy used in case 2)?


4) It is also worth noting the time it takes to allocate the memory: 
while the allocations are in the sub-millisecond range for both 
MPI_Alloc_mem and MPI_Win_allocate_shared, it takes >24s to allocate 
100GB using MPI_Win_allocate, with the time increasing linearly with the 
allocation size.


Are these issues known? Maybe there is documentation describing 
work-arounds? (esp. for 3) and 4))


I am attaching a small benchmark. Please make sure to adjust the 
MEM_PER_NODE macro to suit your system before you run it :) I'm happy to 
provide additional details if needed.


Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MEM_PER_NODE (100*1024*1024*1024UL)
#define NUM_STEPS 10

//#define USE_MEM(__mem, __size) memset(__mem, 0, __size)
//#define USE_MEM(__mem, __size)   parallel_memset(__mem, 0, __size)
#define USE_MEM(__mem, __size)  

static int comm_rank;


static void parallel_memset(char *baseptr, int val, size_t size) 
{
#pragma omp parallel for 
  for (size_t i = 0; i < size; ++i) {
baseptr[i] = val;
  }
}


static void test_alloc_mem(size_t size)
{
  char *baseptr;
  if (comm_rank == 0) {
printf("MPI_Alloc_mem: %12zu B ", size);
  }
  double start = MPI_Wtime();
  MPI_Alloc_mem(size, MPI_INFO_NULL, &baseptr);
  double end   = MPI_Wtime();
  if (comm_rank == 0) {
printf("(%fs)\n", end - start);
  }
  USE_MEM(baseptr, size);

  MPI_Free_mem(baseptr);
}

static void test_win_allocate(size_t size)
{
  char *baseptr;
  MPI_Win win;
  if (comm_rank == 0) {
printf("MPI_Win_allocate: %12zu B ", size);
  }
  double start = MPI_Wtime();
  MPI_Win_allocate(size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);
  double end   = MPI_Wtime();
  if (comm_rank == 0) {
printf("(%fs)\n", end - start);
  }

  USE_MEM(baseptr, size);
  MPI_Win_free(&win);
}

static void test_win_allocate_shared(size_t size, MPI_Comm sharedmem_comm)
{
  char *baseptr;
  MPI_Win win;
  
  if (comm_rank == 0) {
printf("MPI_Win_allocate_shared %12zu B ", size);
  }
  double start = MPI_Wtime();
  MPI_Win_allocate_shared(size, 1, MPI_INFO_NULL, sharedmem_comm, &baseptr, &win);
  double end   = MPI_Wtime();
  if (comm_rank == 0) {
printf("(%fs)\n", end - start);
  }

  USE_MEM(baseptr, size);
  MPI_Win_free(&win);
}



int main(int argc, char **argv) 
{
  MPI_Init(&argc, &argv);

  int procs_per_node;
  MPI_Comm sharedmem_comm;
  MPI_Comm_split_type(
MPI_COMM_WORLD,
MPI_COMM_TYPE_SHARED,
1,
MPI_INFO_NULL,
&sharedmem_comm);
  MPI_Comm_size(sharedmem_comm, &procs_

Re: [OMPI users] Issues with Large Window Allocations

2017-08-24 Thread Joseph Schuchart

Gilles,

Thanks for your swift response. On this system, /dev/shm only has 256M 
available so that is no option unfortunately. I tried disabling both 
vader and sm btl via `--mca btl ^vader,sm` but Open MPI still seems to 
allocate the shmem backing file under /tmp. From my point of view, 
missing the performance benefits of file backed shared memory as long as 
large allocations work but I don't know the implementation details and 
whether that is possible. It seems that the mmap does not happen if 
there is only one process per node.


Cheers,
Joseph

On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:

Joseph,

the error message suggests that allocating memory with
MPI_Win_allocate[_shared] is done by creating a file and then mmap'ing
it.
how much space do you have in /dev/shm ? (this is a tmpfs e.g. a RAM
file system)
there is likely quite some space here, so as a workaround, i suggest
you use this as the shared-memory backing directory

/* i am afk and do not remember the syntax, ompi_info --all | grep
backing is likely to help */

Cheers,

Gilles

On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart  wrote:

All,

I have been experimenting with large window allocations recently and have
made some interesting observations that I would like to share.

The system under test:
   - Linux cluster equipped with IB,
   - Open MPI 2.1.1,
   - 128GB main memory per node
   - 6GB /tmp filesystem per node

My observations:
1) Running with 1 process on a single node, I can allocate and write to
memory up to ~110 GB through MPI_Allocate, MPI_Win_allocate, and
MPI_Win_allocate_shared.

2) If running with 1 process per node on 2 nodes single large allocations
succeed but with the repeating allocate/free cycle in the attached code I
see the application being reproducibly being killed by the OOM at 25GB
allocation with MPI_Win_allocate_shared. When I try to run it under Valgrind
I get an error from MPI_Win_allocate at ~50GB that I cannot make sense of:

```
MPI_Alloc_mem:  53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
will now abort,
[n131302:11989] ***and potentially your MPI job)
```

3) If running with 2 processes on a node, I get the following error from
both MPI_Win_allocate and MPI_Win_allocate_shared:
```
--
It appears as if there is not enough space for
/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702 (the
shared-memory backing
file). It is likely that your MPI job will now either abort or experience
performance degradation.

   Local host:  n131702
   Space Requested: 6710890760 B
   Space Available: 6433673216 B
```
This seems to be related to the size limit of /tmp. MPI_Allocate works as
expected, i.e., I can allocate ~50GB per process. I understand that I can
set $TMP to a bigger filesystem (such as lustre) but then I am greeted with
a warning on each allocation and performance seems to drop. Is there a way
to fall back to the allocation strategy used in case 2)?

4) It is also worth noting the time it takes to allocate the memory: while
the allocations are in the sub-millisecond range for both MPI_Allocate and
MPI_Win_allocate_shared, it takes >24s to allocate 100GB using
MPI_Win_allocate and the time increasing linearly with the allocation size.

Are these issues known? Maybe there is documentation describing
work-arounds? (esp. for 3) and 4))

I am attaching a small benchmark. Please make sure to adjust the
MEM_PER_NODE macro to suit your system before you run it :) I'm happy to
provide additional details if needed.

Best
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users




--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Issues with Large Window Allocations

2017-08-29 Thread Joseph Schuchart

Jeff, all,

Thanks for the clarification. My measurements show that global memory 
allocations do not require the backing file if there is only one process 
per node, for an arbitrary number of processes. So I was wondering if it 
was possible to use the same allocation process even with multiple 
processes per node if there is not enough space available in /tmp. 
However, I am not sure whether the IB devices can be used to perform 
intra-node RMA. At least that would retain the functionality on this 
kind of system (which arguably might be a rare case).


On a different note, I found over the weekend that Valgrind only 
supports allocations up to 60GB, so my second point reported below may 
be invalid. Number 4 still seems curious to me, though.


Best
Joseph

On 08/25/2017 09:17 PM, Jeff Hammond wrote:
There's no reason to do anything special for shared memory with a 
single-process job because MPI_Win_allocate_shared(MPI_COMM_SELF) ~= 
MPI_Alloc_mem().  However, it would help debugging if MPI implementers 
at least had an option to take the code path that allocates shared 
memory even when np=1.


Jeff

On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:


Gilles,

Thanks for your swift response. On this system, /dev/shm only has
256M available so that is no option unfortunately. I tried disabling
both vader and sm btl via `--mca btl ^vader,sm` but Open MPI still
seems to allocate the shmem backing file under /tmp. From my point
of view, missing the performance benefits of file backed shared
memory as long as large allocations work but I don't know the
implementation details and whether that is possible. It seems that
the mmap does not happen if there is only one process per node.

Cheers,
Joseph


On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:

Joseph,

the error message suggests that allocating memory with
MPI_Win_allocate[_shared] is done by creating a file and then
mmap'ing
it.
how much space do you have in /dev/shm ? (this is a tmpfs e.g. a RAM
file system)
there is likely quite some space here, so as a workaround, i suggest
you use this as the shared-memory backing directory

/* i am afk and do not remember the syntax, ompi_info --all | grep
backing is likely to help */

Cheers,

Gilles

On Thu, Aug 24, 2017 at 10:31 PM, Joseph Schuchart <schuch...@hlrs.de> wrote:

All,

I have been experimenting with large window allocations
recently and have
made some interesting observations that I would like to share.

The system under test:
- Linux cluster equipped with IB,
- Open MPI 2.1.1,
- 128GB main memory per node
- 6GB /tmp filesystem per node

My observations:
1) Running with 1 process on a single node, I can allocate
and write to
memory up to ~110 GB through MPI_Allocate, MPI_Win_allocate, and
MPI_Win_allocate_shared.

2) If running with 1 process per node on 2 nodes single
large allocations
succeed but with the repeating allocate/free cycle in the
attached code I
see the application being reproducibly being killed by the
OOM at 25GB
allocation with MPI_Win_allocate_shared. When I try to run
it under Valgrind
I get an error from MPI_Win_allocate at ~50GB that I cannot
make sense of:

```
MPI_Alloc_mem:  53687091200 B
[n131302:11989] *** An error occurred in MPI_Alloc_mem
[n131302:11989] *** reported by process [1567293441,1]
[n131302:11989] *** on communicator MPI_COMM_WORLD
[n131302:11989] *** MPI_ERR_NO_MEM: out of memory
[n131302:11989] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator
will now abort,
[n131302:11989] ***and potentially your MPI job)
```

3) If running with 2 processes on a node, I get the
following error from
both MPI_Win_allocate and MPI_Win_allocate_shared:
```

--
It appears as if there is not enough space for

/tmp/openmpi-sessions-31390@n131702_0/23041/1/0/shared_window_4.n131702
(the
shared-memory backing
file). It is likely that your MPI job will now either abort
or experience
performance degradation.

Local host:  n131702
Space Requested: 6710890760 B
Space Available: 6433673216 B
```
This seems to be rel

Re: [OMPI users] Issues with Large Window Allocations

2017-09-04 Thread Joseph Schuchart

Jeff, all,

Unfortunately, I (as a user) have no control over the page size on our 
cluster. My interest in this is more of a general nature because I am 
concerned that our users who use Open MPI underneath our code run into 
this issue on their machine.


I took a look at the code for the various window creation methods and 
now have a better picture of the allocation process in Open MPI. I 
realized that memory in windows allocated through MPI_Win_allocate or 
created through MPI_Win_create is registered with the IB device using 
ibv_reg_mr, which takes significant time for large allocations (I assume 
this is where hugepages would help?). In contrast to this, it seems that 
memory attached through MPI_Win_attach is not registered, which explains 
the lower latency for these allocations I am observing (I seem to 
remember having observed higher communication latencies as well).
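
For reference, a minimal sketch of how I timed the attach path (only the 
attach/detach itself is measured, no communication is performed; the sizes are 
arbitrary):

```
#include <mpi.h>
#include <stdio.h>

/* Sketch: time MPI_Win_attach on a dynamic window. In my measurements this
 * is much cheaper than MPI_Win_allocate for the same size, presumably
 * because the memory is not registered with the IB device up front. */
static double time_attach(MPI_Win dynwin, size_t size)
{
  char *mem;
  MPI_Alloc_mem(size, MPI_INFO_NULL, &mem);

  double start = MPI_Wtime();
  MPI_Win_attach(dynwin, mem, size);
  double end = MPI_Wtime();

  MPI_Win_detach(dynwin, mem);
  MPI_Free_mem(mem);
  return end - start;
}

int main(int argc, char **argv)
{
  MPI_Win dynwin;

  MPI_Init(&argc, &argv);
  MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &dynwin);

  for (size_t size = 1UL << 30; size <= (8UL << 30); size *= 2) {
    printf("MPI_Win_attach %12zu B: %fs\n", size, time_attach(dynwin, size));
  }

  MPI_Win_free(&dynwin);
  MPI_Finalize();
  return 0;
}
```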


Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix 
component that uses shm_open to create a POSIX shared memory object 
instead of a file on disk, which is then mmap'ed. Unfortunately, if I 
raise the priority of this component above that of the default mmap 
component I end up with a SIGBUS during MPI_Init. No other errors are 
reported by MPI. Should I open a ticket on Github for this?


As an alternative, would it be possible to use anonymous shared memory 
mappings to avoid the backing file for large allocations (maybe above a 
certain threshold) on systems that support MAP_ANONYMOUS and distribute 
the result of the mmap call among the processes on the node?


Thanks,
Joseph

On 08/29/2017 06:12 PM, Jeff Hammond wrote:
I don't know any reason why you shouldn't be able to use IB for 
intra-node transfers.  There are, of course, arguments against doing it 
in general (e.g. IB/PCI bandwidth less than DDR4 bandwidth), but it 
likely behaves less synchronously than shared-memory, since I'm not 
aware of any MPI RMA library that dispatches the intranode RMA 
operations to an asynchronous agent (e.g. communication helper thread).


Regarding 4, faulting 100GB in 24s corresponds to 1us per 4K page, which 
doesn't sound unreasonable to me.  You might investigate if/how you can 
use 2M or 1G pages instead.  It's possible Open-MPI already supports 
this, if the underlying system does.  You may need to twiddle your OS 
settings to get hugetlbfs working.


Jeff

On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:


Jeff, all,

Thanks for the clarification. My measurements show that global
memory allocations do not require the backing file if there is only
one process per node, for arbitrary number of processes. So I was
wondering if it was possible to use the same allocation process even
with multiple processes per node if there is not enough space
available in /tmp. However, I am not sure whether the IB devices can
be used to perform intra-node RMA. At least that would retain the
functionality on this kind of system (which arguably might be a rare
case).

On a different note, I found during the weekend that Valgrind only
supports allocations up to 60GB, so my second point reported below
may be invalid. Number 4 seems still seems curious to me, though.

Best
Joseph

On 08/25/2017 09:17 PM, Jeff Hammond wrote:

There's no reason to do anything special for shared memory with
a single-process job because
MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem(). 
However, it would help debugging if MPI implementers at least

had an option to take the code path that allocates shared memory
even when np=1.

Jeff

    On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

 Gilles,

 Thanks for your swift response. On this system, /dev/shm
only has
 256M available so that is no option unfortunately. I tried
disabling
 both vader and sm btl via `--mca btl ^vader,sm` but Open
MPI still
 seems to allocate the shmem backing file under /tmp. From
my point
 of view, missing the performance benefits of file backed shared
 memory as long as large allocations work but I don't know the
 implementation details and whether that is possible. It
seems that
 the mmap does not happen if there is only one process per node.

 Cheers,
 Joseph


 On 08/24/2017 03:49 PM, Gilles Gouaillardet wrote:

 Joseph,

 the error message suggests that allocating memory with
 MPI_Win_allocate[_shared] is done by creating a file
and then
 mmap'ing
 it.
  

Re: [OMPI users] Issues with Large Window Allocations

2017-09-04 Thread Joseph Schuchart

Gilles,

On 09/04/2017 03:22 PM, Gilles Gouaillardet wrote:

Joseph,

please open a github issue regarding the SIGBUS error.


Done: https://github.com/open-mpi/ompi/issues/4166



As far as I understand, MAP_ANONYMOUS+MAP_SHARED can only be used
between related processes (e.g. parent and children).
In the case of Open MPI, MPI tasks are siblings, so this is not an option.



You are right, it doesn't work the way I expected. Should have tested it 
before :)
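
For the archives, a small sketch illustrating the point: the anonymous shared 
mapping has no name, so only a process forked after the mmap can see it; 
sibling MPI ranks launched by mpirun cannot attach to it.

```
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
  /* anonymous shared mapping: shared only with descendants created later */
  char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  if (mem == MAP_FAILED) return 1;

  if (fork() == 0) {                  /* the child inherits the mapping */
    strcpy(mem, "hello from the child");
    return 0;
  }
  wait(NULL);                         /* make sure the child has written */
  printf("parent sees: %s\n", mem);   /* works because they are related  */
  return 0;
}
```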


Best
Joseph


Cheers,

Gilles


On Mon, Sep 4, 2017 at 10:13 PM, Joseph Schuchart  wrote:

Jeff, all,

Unfortunately, I (as a user) have no control over the page size on our
cluster. My interest in this is more of a general nature because I am
concerned that our users who use Open MPI underneath our code run into this
issue on their machine.

I took a look at the code for the various window creation methods and now
have a better picture of the allocation process in Open MPI. I realized that
memory in windows allocated through MPI_Win_alloc or created through
MPI_Win_create is registered with the IB device using ibv_reg_mr, which
takes significant time for large allocations (I assume this is where
hugepages would help?). In contrast to this, it seems that memory attached
through MPI_Win_attach is not registered, which explains the lower latency
for these allocation I am observing (I seem to remember having observed
higher communication latencies as well).

Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix
component that uses shmem_open to create a POSIX shared memory object
instead of a file on disk, which is then mmap'ed. Unfortunately, if I raise
the priority of this component above that of the default mmap component I
end up with a SIGBUS during MPI_Init. No other errors are reported by MPI.
Should I open a ticket on Github for this?

As an alternative, would it be possible to use anonymous shared memory
mappings to avoid the backing file for large allocations (maybe above a
certain threshold) on systems that support MAP_ANONYMOUS and distribute the
result of the mmap call among the processes on the node?

Thanks,
Joseph

On 08/29/2017 06:12 PM, Jeff Hammond wrote:


I don't know any reason why you shouldn't be able to use IB for intra-node
transfers.  There are, of course, arguments against doing it in general
(e.g. IB/PCI bandwidth less than DDR4 bandwidth), but it likely behaves less
synchronously than shared-memory, since I'm not aware of any MPI RMA library
that dispatches the intranode RMA operations to an asynchronous agent (e.g.
communication helper thread).

Regarding 4, faulting 100GB in 24s corresponds to 1us per 4K page, which
doesn't sound unreasonable to me.  You might investigate if/how you can use
2M or 1G pages instead.  It's possible Open-MPI already supports this, if
the underlying system does.  You may need to twiddle your OS settings to get
hugetlbfs working.

Jeff

On Tue, Aug 29, 2017 at 6:15 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

 Jeff, all,

 Thanks for the clarification. My measurements show that global
 memory allocations do not require the backing file if there is only
 one process per node, for arbitrary number of processes. So I was
 wondering if it was possible to use the same allocation process even
 with multiple processes per node if there is not enough space
 available in /tmp. However, I am not sure whether the IB devices can
 be used to perform intra-node RMA. At least that would retain the
 functionality on this kind of system (which arguably might be a rare
 case).

 On a different note, I found during the weekend that Valgrind only
 supports allocations up to 60GB, so my second point reported below
 may be invalid. Number 4 seems still seems curious to me, though.

 Best
 Joseph

 On 08/25/2017 09:17 PM, Jeff Hammond wrote:

 There's no reason to do anything special for shared memory with
 a single-process job because
 MPI_Win_allocate_shared(MPI_COMM_SELF) ~= MPI_Alloc_mem().
However, it would help debugging if MPI implementers at least
 had an option to take the code path that allocates shared memory
 even when np=1.

 Jeff

     On Thu, Aug 24, 2017 at 7:41 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

  Gilles,

  Thanks for your swift response. On this system, /dev/shm
 only has
  256M available so that is no option unfortunately. I tried
 disabling
  both vader and sm btl via `--mca btl ^vader,sm` but Open
 MPI still
  seems to allocate the shmem backing file under /tmp. From
 my point
  of view, missing the performance benefits of file backed
shared
  memory as long as large allocations wo

Re: [OMPI users] Issues with Large Window Allocations

2017-09-08 Thread Joseph Schuchart
We are currently discussing internally how to proceed with this issue on 
our machine. We did a little survey to see the setup of some of the 
machines we have access to, which include an IBM machine, a Bull machine, and 
two Cray XC40 machines. To summarize our findings:


1) On the Cray systems, both /tmp and /dev/shm are mounted tmpfs and 
each limited to half of the main memory size per node.
2) On the IBM system, nodes have 64GB and /tmp is limited to 20 GB and 
mounted from a disk partition. /dev/shm, on the other hand, is sized at 
63GB.
3) On the above systems, /proc/sys/kernel/shm* is set up to allow the 
full memory of the node to be used as System V shared memory.
4) On the Bull machine, /tmp is mounted from a disk and fixed to ~100GB 
while /dev/shm is limited to half the node's memory (there are nodes 
with 2TB memory, huge page support is available). System V shmem on the 
other hand is limited to 4GB.


Overall, it seems that there is no globally optimal allocation strategy 
as the best matching source of shared memory is machine dependent.


Open MPI treats System V shared memory as the least favorable option, 
even giving it a lower priority than POSIX shared memory, where 
conflicting names might occur. What's the reason for preferring /tmp and 
POSIX shared memory over System V? It seems to me that the latter is a 
cleaner and safer way (provided that shared memory is not constrained by 
/proc, which could easily be detected) while mmap'ing large files feels 
somewhat hacky. Maybe I am missing an important aspect here though.


The reason I am interested in this issue is that our PGAS library is 
build on top of MPI and allocates pretty much all memory exposed to the 
user through MPI windows. Thus, any limitation from the underlying MPI 
implementation (or system for that matter) limits the amount of usable 
memory for our users.


Given our observations above, I would like to propose a change to the 
shared memory allocator: the priorities would be derived from the 
percentage of main memory each component can cover, i.e.,


Priority = 99*(min(Memory, SpaceAvail) / Memory)

At startup, each shm component would determine the available size (by 
looking at /tmp, /dev/shm, and /proc/sys/kernel/shm*, respectively) and 
set its priority between 0 and 99. A user could force Open MPI to use a 
specific component by manually setting its priority to 100 (which of 
course has to be documented). The priority could factor in other aspects 
as well, such as whether /tmp is actually tmpfs or disk-based if that 
makes a difference in performance.
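
To sketch the idea for the mmap component (Linux-specific, using statvfs and 
sysinfo; the sysv component would look at /proc/sys/kernel/shm* instead, and 
error handling is omitted):

```
#include <stdio.h>
#include <sys/statvfs.h>
#include <sys/sysinfo.h>

/* Sketch: derive a component priority in 0..99 from the fraction of main
 * memory that the backing directory could hold, following the proposal above. */
static int shmem_priority(const char *backing_dir)
{
  struct statvfs vfs;
  struct sysinfo si;
  if (statvfs(backing_dir, &vfs) != 0 || sysinfo(&si) != 0)
    return 0;

  unsigned long long avail = (unsigned long long)vfs.f_bavail * vfs.f_frsize;
  unsigned long long mem   = (unsigned long long)si.totalram * si.mem_unit;

  if (avail > mem) avail = mem;
  return (int)(99.0 * (double)avail / (double)mem);
}

int main(void)
{
  printf("/tmp:     priority %d\n", shmem_priority("/tmp"));
  printf("/dev/shm: priority %d\n", shmem_priority("/dev/shm"));
  return 0;
}
```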


This proposal of course assumes that shared memory size is the sole 
optimization goal. Maybe there are other aspects to consider? I'd be 
happy to work on a patch but would like to get some feedback before 
getting my hands dirty. IMO, the current situation is less than ideal 
and prone to cause pain to the average user. In my recent experience, 
debugging this has been tedious and the user in general shouldn't have 
to care about how shared memory is allocated (and administrators don't 
always seem to care, see above).


Any feedback is highly appreciated.

Joseph


On 09/04/2017 03:13 PM, Joseph Schuchart wrote:

Jeff, all,

Unfortunately, I (as a user) have no control over the page size on our 
cluster. My interest in this is more of a general nature because I am 
concerned that our users who use Open MPI underneath our code run into 
this issue on their machine.


I took a look at the code for the various window creation methods and 
now have a better picture of the allocation process in Open MPI. I 
realized that memory in windows allocated through MPI_Win_alloc or 
created through MPI_Win_create is registered with the IB device using 
ibv_reg_mr, which takes significant time for large allocations (I assume 
this is where hugepages would help?). In contrast to this, it seems that 
memory attached through MPI_Win_attach is not registered, which explains 
the lower latency for these allocation I am observing (I seem to 
remember having observed higher communication latencies as well).


Regarding the size limitation of /tmp: I found an opal/mca/shmem/posix 
component that uses shmem_open to create a POSIX shared memory object 
instead of a file on disk, which is then mmap'ed. Unfortunately, if I 
raise the priority of this component above that of the default mmap 
component I end up with a SIGBUS during MPI_Init. No other errors are 
reported by MPI. Should I open a ticket on Github for this?


As an alternative, would it be possible to use anonymous shared memory 
mappings to avoid the backing file for large allocations (maybe above a 
certain threshold) on systems that support MAP_ANONYMOUS and distribute 
the result of the mmap call among the processes on the node?


Thanks,
Joseph

On 08/29/2017 06:12 PM, Jeff Hammond wrote:
I don't know any reason why you shouldn't be able to use IB for 
intra-node transf

Re: [OMPI users] Issues with Large Window Allocations

2017-09-09 Thread Joseph Schuchart

Jeff, Gilles,

Thanks for your input. I am aware of the limitations of Sys5 shmem (the 
links you posted do not accurately reflect the description of SHMMAX, 
SHMALL, and SHMMNI found in the standard, though. See 
http://man7.org/linux/man-pages/man2/shmget.2.html).


However, these limitations can be easily checked by looking at the 
entries in /proc. If the limits of Sys5 shmem are smaller than for /tmp 
it would receive a lower priority and thus not be used in my proposal. 
While Sys5 shmem is limited by shmmax, POSIX shmem (shm_open that is) at 
least on Linux is limited by the size of /dev/shm, where ftruncate does 
not complain if the shared memory allocation grows beyond what is 
possible there. A user will learn this the hard way by debugging SIGBUS 
signals upon memory access. This is imo a flaw in the way shm_open is 
implemented on Linux (and/or a flaw of POSIX shm not providing a way to 
check such arbitrary limits). You have to guess the limit by looking at 
the space available in /dev/shm and hoping that the implementation 
didn't change. Apparently, some BSD flavors (such as FreeBSD) have 
implemented POSIX shmem as system calls to not rely on files in some 
tmpfs mounts, which gets rid of such size limitations.


I wasn't aware of the possibility of a memory leak when using Sys5. From 
what I understand, this might happen if the process receives a signal 
between shmget and the immediately following call to shmctl that marks 
the segment deletable. We could go to some lengths and install a signal 
handler in case another thread causes a SIGSEGV or the user tries to 
abort in exactly that moment (except if he uses SIGKILL, that is). I 
agree that all this is not nice but I would argue that it doesn't 
disqualify Sys5 in cases where it's the only way of allocating decent 
amounts of shared memory due to size limitations of the tmpfs mounts. 
What's more, Open MPI is not the only implementation that supports Sys5 
(it's a compile-time option in MPICH and I'm sure there are others using 
it) so automatic job epilogue scripts should clean up Sys5 shmem as well 
(which I'm sure they don't, mostly).


Open MPI currently supports three different allocation strategies for 
shared memory, so it can chose based on what is available (on a 
BlueGene, only POSIX shmem or mmap'ing from /tmp would be considered 
then) to maximize portability. I'm not proposing to make Sys5 the 
default (although I was wondering why it's not preferred by Open MPI, 
which I have a better understanding of now, thanks to your input). I 
just would like to draw something from my recent experience and make 
Open MPI more user friendly by deciding which shmem component to use 
automatically :)


Best
Joseph

On 09/08/2017 06:29 PM, Jeff Hammond wrote:
In my experience, POSIX is much more reliable than Sys5.  Sys5 depends 
on the value of shmmax, which is often set to a small fraction of node 
memory.  I've probably seen the error described on 
http://verahill.blogspot.com/2012/04/solution-to-nwchem-shmmax-too-small.html 
with NWChem a 1000 times because of this.  POSIX, on the other hand, 
isn't limited by SHMMAX (https://community.oracle.com/thread/3828422).


POSIX is newer than Sys5, and while Sys5 is supported by Linux and thus 
almost ubiquitous, it wasn't supported by Blue Gene, so in an HPC 
context, one can argue that POSIX is more portable.


Jeff

On Fri, Sep 8, 2017 at 9:16 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:


Joseph,

Thanks for sharing this !

sysv is imho the worst option because if something goes really
wrong, Open MPI might leave some shared memory segments behind when
a job crashes. From that perspective, leaving a big file in /tmp can
be seen as the lesser evil.
That being said, there might be other reasons that drove this design

Cheers,

Gilles

Joseph Schuchart <schuch...@hlrs.de> wrote:
 >We are currently discussing internally how to proceed with this
issue on
 >our machine. We did a little survey to see the setup of some of the
 >machines we have access to, which includes an IBM, a Bull machine, and
 >two Cray XC40 machines. To summarize our findings:
 >
 >1) On the Cray systems, both /tmp and /dev/shm are mounted tmpfs and
 >each limited to half of the main memory size per node.
 >2) On the IBM system, nodes have 64GB and /tmp is limited to 20 GB and
 >mounted from a disk partition. /dev/shm, on the other hand, is
sized at
 >63GB.
 >3) On the above systems, /proc/sys/kernel/shm* is set up to allow the
 >full memory of the node to be used as System V shared memory.
 >4) On the Bull machine, /tmp is mounted from a disk and fixed to
~100GB
 >while /dev/shm is limited to half the node's memory (there are nodes
 >with 

[OMPI users] Progress issue with dynamic windows

2017-11-01 Thread Joseph Schuchart

All,

I came across what I consider another issue regarding progress in Open 
MPI: consider one process (P1) polling locally on a regular window (W1) 
for a local value to change (using MPI_Win_lock+MPI_Get+MPI_Win_unlock) 
while a second process (P2) tries to read from a memory location in a 
dynamic window (W2) on process P1 (using MPI_Rget+MPI_Wait, other 
combinations affected as well). P2 will later update the memory location 
waited on by P1. However, the read on the dynamic window stalls as the 
(local) read on W1 on P1 does not trigger progress on the dynamic window 
W2, causing the application to deadlock.


It is my understanding that process P1 should guarantee progress on any 
communication it is involved in, regardless of the window or window 
type, and thus the communication should succeed. Is this assumption 
correct? Or is P1 required to access W2 as well to ensure progress? I 
can trigger progress on W2 on P1 by adding a call to MPI_Iprobe but that 
seems like a hack to me. Also, if both W1 and W2 are regular (allocated) 
windows the communication succeeds.


I am attaching a small reproducer, tested with Open MPI release 3.0.0 on 
a single GNU/Linux node (Linux Mint 18.2, gcc 5.4.1, Linux 
4.10.0-38-generic).


Many thanks in advance!

Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>


static void
allocate_dynamic(size_t elsize, size_t count, MPI_Win *win, MPI_Aint *disp_set, char **b)
{
  char *base;
  MPI_Aint disp;
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, win) != MPI_SUCCESS) {
    printf("Failed to create dynamic window!\n");
    exit(1);
  }

  if (MPI_Alloc_mem(elsize*count, MPI_INFO_NULL, &base) != MPI_SUCCESS) {
    printf("Failed to allocate memory!\n");
    exit(1);
  }

  if (MPI_Win_attach(*win, base, elsize*count) != MPI_SUCCESS) {
    printf("Failed to attach memory to dynamic window!\n");
    exit(1);
  }

  MPI_Get_address(base, &disp);
  printf("Offset at process %i: %p (%lu)\n", rank, base, disp);
  MPI_Allgather(&disp, 1, MPI_AINT, disp_set, 1, MPI_AINT, MPI_COMM_WORLD);

  MPI_Win_lock_all(0, *win);

  *b = base;
}

int main(int argc, char **argv) 
{
  int *baseptr1;
  int *baseptr2;
  MPI_Win win1, win2;
  int rank, size;

  MPI_Init(&argc, &argv);

  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Win_allocate(sizeof(int), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr1, &win1);
  MPI_Aint *disp_set = malloc(sizeof(MPI_Aint)*size);
  allocate_dynamic(sizeof(int), 1, &win2, disp_set, &baseptr2);

  *baseptr1 = rank;
  *baseptr2 = rank;
  MPI_Barrier(MPI_COMM_WORLD); 

  if (rank == 0) {
int local_val;
do {
  // trigger progress to avoid stalling read on rank 1
//  int flag;
//  MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, MPI_STATUS_IGNORE);
  MPI_Win_lock(MPI_LOCK_EXCLUSIVE, rank, 0, win1);
  MPI_Get(&local_val, 1, MPI_INT, rank, 0, 1, MPI_INT, win1);
  MPI_Win_flush(rank, win1);
  if (local_val != rank) { printf("Done!\n"); }
  MPI_Win_unlock(rank, win1);
} while (local_val == rank);
  } else if (rank == 1) {
MPI_Request req;
int val;
MPI_Rget(&val, 1, MPI_INT, 0, disp_set[0], 1, MPI_INT, win2, &req);
MPI_Wait(&req, MPI_STATUS_IGNORE);
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win1);
MPI_Put(&rank, 1, MPI_INT, 0, 0, 1, MPI_INT, win1);
MPI_Win_unlock(0, win1);
  }

  MPI_Win_free(&win1);
  MPI_Win_free(&win2);

  MPI_Finalize();

  return 0;
}


Re: [OMPI users] Progress issue with dynamic windows

2017-11-01 Thread Joseph Schuchart

Nathan,

Thank you for your reply. I opened an issue: 
https://github.com/open-mpi/ompi/issues/4434


Thanks,
Joseph

On 11/02/2017 01:04 PM, Nathan Hjelm wrote:

Hmm, though I thought we also make calls to opal_progress () in your case 
(calling MPI_Win_lock on self). Open a bug on github and I will double-check.


On Nov 1, 2017, at 9:54 PM, Nathan Hjelm  wrote:

This is a known issue when using osc/pt2pt. The only way to get progress is to 
enable it at the network level (btl); it is not on by default. How this is 
done depends on the underlying transport.

-Nathan


On Nov 1, 2017, at 9:49 PM, Joseph Schuchart  wrote:

All,

I came across what I consider another issue regarding progress in Open MPI: 
consider one process (P1) polling locally on a regular window (W1) for a local 
value to change (using MPI_Win_lock+MPI_Get+MPI_Win_unlock) while a second 
process (P2) tries to read from a memory location in a dynamic window (W2) on 
process P1 (using MPI_Rget+MPI_Wait, other combinations affected as well). P2 
will later update the memory location waited on by P1. However, the read on the 
dynamic window stalls as the (local) read on W1 on P1 does not trigger progress 
on the dynamic window W2, causing the application to deadlock.

It is my understanding that process P1 should guarantee progress on any 
communication it is involved in, regardless of the window or window type, and 
thus the communication should succeed. Is this assumption correct? Or is P1 
required to access W2 as well to ensure progress? I can trigger progress on W2 
on P1 by adding a call to MPI_Iprobe but that seems like a hack to me. Also, if 
both W1 and W2 are regular (allocated) windows the communication succeeds.

I am attaching a small reproducer, tested with Open MPI release 3.0.0 on a 
single GNU/Linux node (Linux Mint 18.2, gcc 5.4.1, Linux 4.10.0-38-generic).

Many thanks in advance!

Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de




--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


[OMPI users] Output redirection: missing output from all but one node

2018-02-09 Thread Joseph Schuchart

All,

I am trying to debug my MPI application using good ol' printf and I am 
running into an issue with Open MPI's output redirection (using 
--output-filename).


The system I'm running on is an IB cluster with the home directory 
mounted through NFS.


1) Sometimes I get the following error message and the application hangs:

```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file 
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c 
at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file 
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof_orted.c 
at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file 
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_base_setup.c 
at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file 
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odls_base_default_fns.c 
at line 1147

```

So far I have only seen this error when running straight out of my home 
directory, not when running from a subdirectory.


In case this error does not appear all log files are written correctly.

2) If I call mpirun from within a subdirectory I am only seeing output 
files from processes running on the same node as rank 0. I have not seen 
above error messages in this case.


Example:

```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0  rank.1
```

Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0  output.log.1.1
```

Any idea why this happens and/or how to debug this?

In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=,mountvers=3,mountport=,mountproto=udp,local_lock=none,addr=)

I also tested above commands with MPICH, which gives me the expected 
output for all processes on all nodes.


Any help would be much appreciated!

Cheers,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


Re: [OMPI users] Output redirection: missing output from all but one node

2018-02-13 Thread Joseph Schuchart

Christoph, all,

Thank you for looking into this. I can confirm that all but the 
processes running on the first node write their output into 
$HOME/ and that using an absolute path is a workaround.


I have created an issue for this on Github:
https://github.com/open-mpi/ompi/issues/4806

Cheers,
Joseph

On 02/09/2018 08:14 PM, Christoph Niethammer wrote:

Hi Joseph,

Thanks for reporting!

Regarding your second point about the missing output files there seems to be a
problem with the current working directory detection on the remote nodes:
while on the first node - on which mpirun is executed - the output folder is
created in the current working directory, the processes on the other nodes
seem to write the files into $HOME/output.log/

As a workaround you can use an absolute directory path:
--output-filename $PWD/output.log

Best
Christoph



On Friday, 9 February 2018 15:52:31 CET Joseph Schuchart wrote:

All,

I am trying to debug my MPI application using good ol' printf and I am
running into an issue with Open MPI's output redirection (using
--output-filename).

The system I'm running on is an IB cluster with the home directory
mounted through NFS.

1) Sometimes I get the following error message and the application hangs:

```
$ mpirun -n 2 -N 1 --output-filename output.log ls
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_
base_setup.c at line 314
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/orted/iof
_orted.c at line 184
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/iof/base/iof_
base_setup.c at line 237
[n121301:25598] [[60171,0],1] ORTE_ERROR_LOG: File open failure in file
/path/to/mpi/openmpi/3.0.0-intel-17.0.4/openmpi-3.0.0/orte/mca/odls/base/odl
s_base_default_fns.c at line 1147
```

So far I have only seen this error when running straight out of my home
directory, not when running from a subdirectory.

In case this error does not appear all log files are written correctly.

2) If I call mpirun from within a subdirectory I am only seeing output
files from processes running on the same node as rank 0. I have not seen
above error messages in this case.

Example:

```
# two procs, one per node
~/test $ mpirun -n 2 -N 1 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0
# two procs, single node
~/test $ mpirun -n 2 -N 2 --output-filename output.log ls
output.log
output.log
~/test $ ls output.log/*
rank.0  rank.1
```

Using Open MPI 2.1.1, I can observe a similar effect:
```
# two procs, one per node
~/test $ mpirun --output-filename output.log -n 2 -N 1 ls
~/test $ ls
output.log.1.0
# two procs, single node
~/test $ mpirun --output-filename output.log -n 2 -N 2 ls
~/test $ ls
output.log.1.0  output.log.1.1
```

Any idea why this happens and/or how to debug this?

In case this helps, the NFS mount flags are:
(rw,nosuid,nodev,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,pro
to=tcp,timeo=600,retrans=2,sec=sys,mountaddr=,mountvers=3,mountport=,mountproto=udp,local_lock=none,addr=)

I also tested above commands with MPICH, which gives me the expected
output for all processes on all nodes.

Any help would be much appreciated!

Cheers,
Joseph



--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


[OMPI users] Window memory alignment not suitable for long double

2018-03-09 Thread Joseph Schuchart
We have recently added support for long double to our distributed data 
structures and found similar crashes to the ones we have reported in 
[1]. I have not been able to reproduce this with a small example but 
some analysis indicates that the memory handed out by 
MPI_Win_allocate is only guaranteed to be 8-byte aligned while sizeof(long 
double) on my machine is 16 bytes. The clang 5.0 compiler issues vmovaps 
XMM stores that require the target memory to be aligned to 128-bit 
boundaries, potentially causing a segmentation fault on non-aligned 
memory accesses.


AFAICS, since MPI supports long double through MPI_LONG_DOUBLE, the 
memory that is allocated by MPI functions should meet the alignment 
requirements of this type. Open MPI should thus hand out memory in 
MPI_Alloc_mem, MPI_Win_allocate, and MPI_Win_allocate_shared that is 
aligned to 16 bytes to avoid this problem.
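
For illustration, a minimal check of the alignment that is actually handed out might look like this (assuming a C11 compiler; it only inspects the pointer and does not reproduce the crash):

```
#include <mpi.h>
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  MPI_Win win;
  long double *baseptr;

  MPI_Init(&argc, &argv);
  MPI_Win_allocate(100 * sizeof(long double), sizeof(long double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &baseptr, &win);

  /* report if the window base does not satisfy alignof(long double) */
  if ((uintptr_t)baseptr % alignof(long double) != 0) {
    printf("window base %p not aligned to %zu bytes\n",
           (void *)baseptr, alignof(long double));
  }

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```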


Cheers,
Joseph

[1] https://www.mail-archive.com/users@lists.open-mpi.org/msg30621.html

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


Re: [OMPI users] User-built OpenMPI 3.0.1 segfaults when storing into an atomic 128-bit variable

2018-05-03 Thread Joseph Schuchart
815== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
--
mpirun noticed that process rank 0 with PID 0 on node kamenice exited on signal 
9 (Killed).
--







--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


Re: [OMPI users] MPI-3 RMA on Cray XC40

2018-05-08 Thread Joseph Schuchart

Nathan,

Thanks for looking into that. My test program is attached.

Best
Joseph

On 05/08/2018 02:56 PM, Nathan Hjelm wrote:

I will take a look today. Can you send me your test program?

-Nathan


On May 8, 2018, at 2:49 AM, Joseph Schuchart  wrote:

All,

I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
(Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
Unfortunately, a simple (single-threaded) test case consisting of two processes 
performing an MPI_Rget+MPI_Wait hangs when running on two nodes. It succeeds if 
both processes run on a single node.

For completeness, I am attaching the config.log. The build environment was set 
up to build Open MPI for the login nodes (I wasn't sure how to properly 
cross-compile the libraries):

```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```

I am using mpirun to launch the test code. Below is the BTL debug log (with tcp 
disabled for clarity, turning it on makes no difference):

```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering framework btl 
components
[nid03060:36184] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: component self register 
function successful
[nid03060:36184] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: registering framework btl 
components
[nid03061:36208] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: found loaded component ugni
[nid03061:36208] mca: base: components_register: component self register 
function successful
[nid03061:36208] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: found loaded component ugni
[nid03060:36184] mca: base: components_register: component ugni register 
function successful
[nid03060:36184] mca: base: components_register: found loaded component vader
[nid03061:36208] mca: base: components_register: component ugni register 
function successful
[nid03061:36208] mca: base: components_register: found loaded component vader
[nid03060:36184] mca: base: components_register: component vader register 
function successful
[nid03060:36184] mca: base: components_open: opening btl components
[nid03060:36184] mca: base: components_open: found loaded component self
[nid03060:36184] mca: base: components_open: component self open function 
successful
[nid03060:36184] mca: base: components_open: found loaded component ugni
[nid03060:36184] mca: base: components_open: component ugni open function 
successful
[nid03060:36184] mca: base: components_open: found loaded component vader
[nid03060:36184] mca: base: components_open: component vader open function 
successful
[nid03060:36184] select: initializing btl component self
[nid03060:36184] select: init of component self returned success
[nid03060:36184] select: initializing btl component ugni
[nid03061:36208] mca: base: components_register: component vader register 
function successful
[nid03061:36208] mca: base: components_open: opening btl components
[nid03061:36208] mca: base: components_open: found loaded component self
[nid03061:36208] mca: base: components_open: component self open function 
successful
[nid03061:36208] mca: base: components_open: found loaded component ugni
[nid03061:36208] mca: base: components_open: component ugni open function 
successful
[nid03061:36208] mca: base: components_open: found loaded component vader
[nid03061:36208] mca: base: components_open: component vader open function 
successful
[nid03061:36208] select: initializing btl component self
[nid03061:36208] select: init of component self returned success
[nid03061:36208] select: initializing btl component ugni
[nid03061:36208] select: init of component ugni returned success
[nid03061:36208] select: initializing btl component vader
[nid03061:36208] select: init of component vader returned failure
[nid03061:36208] mca: base: close: component vader closed
[nid03061:36208] mca: base: close: unloading component vader
[nid03060:36184] select: init of component ugni returned success
[nid03060:36184] select: initializing btl component vader
[nid03060:36184] select: init of component vader returned failure
[nid03060:36184] mca: base: close: component vader closed
[nid03060:36184] mca: base: close: unloading component vader
[nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node 
nid03061
[nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node 
nid03060
[nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node 
(null)
[nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on

Re: [OMPI users] MPI-3 RMA on Cray XC40

2018-05-09 Thread Joseph Schuchart

Nathan,

Thank you, I can confirm that it works as expected with master on our 
system. I will stick to this version then until 3.1.1 is out.


Joseph

On 05/08/2018 05:34 PM, Nathan Hjelm wrote:


Looks like it doesn't fail with master so at some point I fixed this 
bug. The current plan is to bring all the master changes into v3.1.1. 
This includes a number of bug fixes.


-Nathan

On May 08, 2018, at 08:25 AM, Joseph Schuchart  wrote:


Nathan,

Thanks for looking into that. My test program is attached.

Best
Joseph

On 05/08/2018 02:56 PM, Nathan Hjelm wrote:

I will take a look today. Can you send me your test program?

-Nathan


On May 8, 2018, at 2:49 AM, Joseph Schuchart  wrote:

All,

I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
(Haswell-based nodes, Aries interconnect) for multi-threaded MPI 
RMA. Unfortunately, a simple (single-threaded) test case consisting 
of two processes performing an MPI_Rget+MPI_Wait hangs when running 
on two nodes. It succeeds if both processes run on a single node.


For completeness, I am attaching the config.log. The build 
environment was set up to build Open MPI for the login nodes (I 
wasn't sure how to properly cross-compile the libraries):


```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```

I am using mpirun to launch the test code. Below is the BTL debug 
log (with tcp disabled for clarity, turning it on makes no difference):


```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 
./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering 
framework btl components
[nid03060:36184] mca: base: components_register: found loaded 
component self
[nid03060:36184] mca: base: components_register: component self 
register function successful
[nid03060:36184] mca: base: components_register: found loaded 
component sm
[nid03061:36208] mca: base: components_register: registering 
framework btl components
[nid03061:36208] mca: base: components_register: found loaded 
component self
[nid03060:36184] mca: base: components_register: found loaded 
component ugni
[nid03061:36208] mca: base: components_register: component self 
register function successful
[nid03061:36208] mca: base: components_register: found loaded 
component sm
[nid03061:36208] mca: base: components_register: found loaded 
component ugni
[nid03060:36184] mca: base: components_register: component ugni 
register function successful
[nid03060:36184] mca: base: components_register: found loaded 
component vader
[nid03061:36208] mca: base: components_register: component ugni 
register function successful
[nid03061:36208] mca: base: components_register: found loaded 
component vader
[nid03060:36184] mca: base: components_register: component vader 
register function successful

[nid03060:36184] mca: base: components_open: opening btl components
[nid03060:36184] mca: base: components_open: found loaded component self
[nid03060:36184] mca: base: components_open: component self open 
function successful

[nid03060:36184] mca: base: components_open: found loaded component ugni
[nid03060:36184] mca: base: components_open: component ugni open 
function successful
[nid03060:36184] mca: base: components_open: found loaded component 
vader
[nid03060:36184] mca: base: components_open: component vader open 
function successful

[nid03060:36184] select: initializing btl component self
[nid03060:36184] select: init of component self returned success
[nid03060:36184] select: initializing btl component ugni
[nid03061:36208] mca: base: components_register: component vader 
register function successful

[nid03061:36208] mca: base: components_open: opening btl components
[nid03061:36208] mca: base: components_open: found loaded component self
[nid03061:36208] mca: base: components_open: component self open 
function successful

[nid03061:36208] mca: base: components_open: found loaded component ugni
[nid03061:36208] mca: base: components_open: component ugni open 
function successful
[nid03061:36208] mca: base: components_open: found loaded component 
vader
[nid03061:36208] mca: base: components_open: component vader open 
function successful

[nid03061:36208] select: initializing btl component self
[nid03061:36208] select: init of component self returned success
[nid03061:36208] select: initializing btl component ugni
[nid03061:36208] select: init of component ugni returned success
[nid03061:36208] select: initializing btl component vader
[nid03061:36208] select: init of component vader returned failure
[nid03061:36208] mca: base: close: component vader closed
[nid03061:36208] mca: base: close: unloading component vader
[nid03060:36184] select: init of component ugni returned success
[nid03060:36184] select: initializing btl component vader
[nid03060:36184] select: init of compo

Re: [OMPI users] MPI-3 RMA on Cray XC40

2018-05-11 Thread Joseph Schuchart

Nathan,

That is good news! Are the improvements that are scheduled for 4.0.0 
already stable enough to be tested? I'd be interested in trying them to 
see whether and how they affect our use-cases.


Also, thanks for pointing me to the RMA-MT benchmark suite, I wasn't 
aware of that project. I looked at the latency benchmarks and found that 
request-based transfer completion (using MPI_Rget+MPI_Wait/MPI_Test) is 
not covered. Would it make sense to add these cases or are there already 
plans to add them? The overhead of transfers with individual 
synchronization from multiple threads is of particular interest for my 
use-case. I'd be happy to contribute to the RMA-MT benchmarks if that 
would be useful.
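
To sketch what I have in mind (this is not code from the RMA-MT suite, just the pattern I would like to see measured; it assumes the window is already locked via MPI_Win_lock_all):

```
#include <stdint.h>
#include <mpi.h>

/* average latency of request-based completion: MPI_Rget + MPI_Wait */
static double rget_wait_latency(MPI_Win win, int target, int nreps)
{
  uint64_t buf;
  double start = MPI_Wtime();
  for (int i = 0; i < nreps; ++i) {
    MPI_Request req;
    MPI_Rget(&buf, 1, MPI_UINT64_T, target, 0, 1, MPI_UINT64_T, win, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
  }
  return (MPI_Wtime() - start) / nreps;
}
```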


Thanks
Joseph

On 05/10/2018 03:24 AM, Nathan Hjelm wrote:

Thanks for confirming that it works for you as well. I have a PR open on v3.1.x 
that brings osc/rdma up to date with master. I will also be bringing some code 
that greatly improves the multi-threaded RMA performance on Aries systems (at 
least with benchmarks— github.com/hpc/rma-mt). That will not make it into 
v3.1.x but will be in v4.0.0.

-Nathan


On May 9, 2018, at 1:26 AM, Joseph Schuchart  wrote:

Nathan,

Thank you, I can confirm that it works as expected with master on our system. I 
will stick to this version then until 3.1.1 is out.

Joseph

On 05/08/2018 05:34 PM, Nathan Hjelm wrote:

Looks like it doesn't fail with master so at some point I fixed this bug. The 
current plan is to bring all the master changes into v3.1.1. This includes a 
number of bug fixes.
-Nathan
On May 08, 2018, at 08:25 AM, Joseph Schuchart  wrote:

Nathan,

Thanks for looking into that. My test program is attached.

Best
Joseph

On 05/08/2018 02:56 PM, Nathan Hjelm wrote:

I will take a look today. Can you send me your test program?

-Nathan


On May 8, 2018, at 2:49 AM, Joseph Schuchart  wrote:

All,

I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
(Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
Unfortunately, a simple (single-threaded) test case consisting of two processes 
performing an MPI_Rget+MPI_Wait hangs when running on two nodes. It succeeds if 
both processes run on a single node.

For completeness, I am attaching the config.log. The build environment was set 
up to build Open MPI for the login nodes (I wasn't sure how to properly 
cross-compile the libraries):

```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```

I am using mpirun to launch the test code. Below is the BTL debug log (with tcp 
disabled for clarity, turning it on makes no difference):

```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering framework btl 
components
[nid03060:36184] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: component self register 
function successful
[nid03060:36184] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: registering framework btl 
components
[nid03061:36208] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: found loaded component ugni
[nid03061:36208] mca: base: components_register: component self register 
function successful
[nid03061:36208] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: found loaded component ugni
[nid03060:36184] mca: base: components_register: component ugni register 
function successful
[nid03060:36184] mca: base: components_register: found loaded component vader
[nid03061:36208] mca: base: components_register: component ugni register 
function successful
[nid03061:36208] mca: base: components_register: found loaded component vader
[nid03060:36184] mca: base: components_register: component vader register 
function successful
[nid03060:36184] mca: base: components_open: opening btl components
[nid03060:36184] mca: base: components_open: found loaded component self
[nid03060:36184] mca: base: components_open: component self open function 
successful
[nid03060:36184] mca: base: components_open: found loaded component ugni
[nid03060:36184] mca: base: components_open: component ugni open function 
successful
[nid03060:36184] mca: base: components_open: found loaded component vader
[nid03060:36184] mca: base: components_open: component vader open function 
successful
[nid03060:36184] select: initializing btl component self
[nid03060:36184] select: init of component self returned success
[nid03060:36184] select: initializing btl component ugni
[nid03061:36208] mca: base: components_register: component vader register 
function successful
[nid03061:36208] mca: base:

Re: [OMPI users] MPI-3 RMA on Cray XC40

2018-05-17 Thread Joseph Schuchart
cademic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x4687B6A: ompi_mpi_finalize (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libmpi.so.0.0.0)
==42751==by 0x20001F35: main (in 
/zhome/academic/HLRS/hlrs/hpcjschu/src/test/mpi_test_loop)
==42751==  Address 0xa3aa348 is 16,440 bytes inside a block of size 
16,568 free'd

==42751==at 0x4428CDA: free (vg_replace_malloc.c:530)
==42751==by 0x630FED2: opal_free_list_destruct (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x63160C1: opal_rb_tree_destruct (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x1076BACE: mca_mpool_hugepage_finalize (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/openmpi/mca_mpool_hugepage.so)
==42751==by 0x1076C202: mca_mpool_hugepage_close (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/openmpi/mca_mpool_hugepage.so)
==42751==by 0x633CED9: mca_base_component_close (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x633CE01: mca_base_components_close (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x63C6F31: mca_mpool_base_close (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x634AEF7: mca_base_framework_close (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libopen-pal.so.0.0.0)
==42751==by 0x4687B6A: ompi_mpi_finalize (in 
/zhome/academic/HLRS/hlrs/hpcjschu/opt-cray/openmpi-6093f2d-intel/lib/libmpi.so.0.0.0)
==42751==by 0x20001F35: main (in 
/zhome/academic/HLRS/hlrs/hpcjschu/src/test/mpi_test_loop)

```

I'm not sure whether the invalid writes (and reads) during 
initialization and communication are caused by Open MPI or uGNI itself 
and whether they are critical (the addresses seem to be "special"). The 
write-after-free in MPI_Finalize seems suspicious though. I cannot say 
whether that causes the memory corruption I am seeing but I thought I 
report it. I will dig further into this to try to figure out what causes 
the crashes (they are not deterministically reproducible, unfortunately).


Cheers,
Joseph

On 05/10/2018 03:24 AM, Nathan Hjelm wrote:

Thanks for confirming that it works for you as well. I have a PR open on v3.1.x 
that brings osc/rdma up to date with master. I will also be bringing some code 
that greatly improves the multi-threaded RMA performance on Aries systems (at 
least with benchmarks— github.com/hpc/rma-mt). That will not make it into 
v3.1.x but will be in v4.0.0.

-Nathan


On May 9, 2018, at 1:26 AM, Joseph Schuchart  wrote:

Nathan,

Thank you, I can confirm that it works as expected with master on our system. I 
will stick to this version then until 3.1.1 is out.

Joseph

On 05/08/2018 05:34 PM, Nathan Hjelm wrote:

Looks like it doesn't fail with master so at some point I fixed this bug. The 
current plan is to bring all the master changes into v3.1.1. This includes a 
number of bug fixes.
-Nathan
On May 08, 2018, at 08:25 AM, Joseph Schuchart  wrote:

Nathan,

Thanks for looking into that. My test program is attached.

Best
Joseph

On 05/08/2018 02:56 PM, Nathan Hjelm wrote:

I will take a look today. Can you send me your test program?

-Nathan


On May 8, 2018, at 2:49 AM, Joseph Schuchart  wrote:

All,

I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
(Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
Unfortunately, a simple (single-threaded) test case consisting of two processes 
performing an MPI_Rget+MPI_Wait hangs when running on two nodes. It succeeds if 
both processes run on a single node.

For completeness, I am attaching the config.log. The build environment was set 
up to build Open MPI for the login nodes (I wasn't sure how to properly 
cross-compile the libraries):

```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```

I am using mpirun to launch the test code. Below is the BTL debug log (with tcp 
disabled for clarity, turning it on makes no difference):

```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering framework btl 
components
[nid03060:36184] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: component self register 
function successful
[nid03060:36184] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: registering framework btl

[OMPI users] MPI Windows: performance of local memory access

2018-05-23 Thread Joseph Schuchart

All,

We are observing some strange/interesting performance issues in 
accessing memory that has been allocated through MPI_Win_allocate. I am 
attaching our test case, which allocates memory for 100M integer values 
on each process both through malloc and MPI_Win_allocate and writes to 
the local ranges sequentially.


On different systems (incl. SuperMUC and a Bull Cluster), we see that 
accessing the memory allocated through MPI is significantly slower than 
accessing the malloc'ed memory if multiple processes run on a single 
node, increasing the effect with increasing number of processes per 
node. As an example, running 24 processes per node with the example 
attached we see the operations on the malloc'ed memory to take ~0.4s 
while the MPI allocated memory takes up to 10s.


After some experiments, I think there are two factors involved:

1) Initialization: it appears that the first iteration is significantly 
slower than any subsequent accesses (1.1s vs 0.4s with 12 processes on a 
single socket). Excluding the first iteration from the timing or 
memsetting the range leads to comparable performance. I assume that this 
is due to page faults that stem from first accessing the mmap'ed memory 
that backs the shared memory used in the window. The effect of 
presetting the  malloc'ed memory seems smaller (0.4s vs 0.6s).


2) NUMA effects: Given proper initialization, running on two sockets 
still leads to fluctuating performance degradation under the MPI window 
memory, which ranges up to 20x (in extreme cases). The performance of 
accessing the malloc'ed memory is rather stable. The difference seems to 
get smaller (but does not disappear) with increasing number of 
repetitions. I am not sure what causes these effects as each process 
should first-touch their local memory.


Are these known issues? Does anyone have any thoughts on my analysis?

It is problematic for us that replacing local memory allocation with MPI 
memory allocation leads to performance degradation as we rely on this 
mechanism in our distributed data structures. While we can ensure proper 
initialization of the memory to mitigate 1) for performance 
measurements, I don't see a way to control the NUMA effects. If there is 
one I'd be happy about any hints :)
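
For reference, the warm-up we use boils down to something like the following sketch (allocate_warm_window is just an illustrative helper, not part of the attached benchmark):

```
#include <string.h>
#include <mpi.h>

static int *allocate_warm_window(size_t nelem, MPI_Win *win)
{
  int *baseptr;
  MPI_Win_allocate(nelem * sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &baseptr, win);
  /* touch every page once so the timed accesses later on do not pay for
   * the page faults of the freshly mmap'ed shared-memory segment */
  memset(baseptr, 0, nelem * sizeof(int));
  MPI_Barrier(MPI_COMM_WORLD);
  return baseptr;
}
```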


I should note that we also tested MPICH-based implementations, which 
showed similar effects (as they also mmap their window memory). Not 
surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic 
window does not cause these effects while using shared memory windows 
does. I ran my experiments using Open MPI 3.1.0 with the following 
command lines:


- 12 cores / 1 socket:
mpirun -n 12 --bind-to socket --map-by ppr:12:socket
- 24 cores / 2 sockets:
mpirun -n 24 --bind-to socket

and verified the binding using  --report-bindings.

Any help or comment would be much appreciated.

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

#include <mpi.h>

#include <sys/time.h>
#include <unistd.h>

#define MYTIMEVAL( tv_ )\
  ((tv_.tv_sec)+(tv_.tv_usec)*1.0e-6)

#define TIMESTAMP( time_ )  \
  { \
static struct timeval tv;   \
gettimeofday( &tv, NULL );  \
time_=MYTIMEVAL(tv);\
  }

//
// do some work and measure how long it takes
//
double do_work(int *beg, size_t nelem, int repeat)
{
  const int LCG_A = 1664525, LCG_C = 1013904223;
  
  int seed = 31337;
  double start, end;
  MPI_Barrier(MPI_COMM_WORLD);
  TIMESTAMP(start);
  for( int j=0; j

Re: [OMPI users] MPI Windows: performance of local memory access

2018-05-23 Thread Joseph Schuchart
I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 
7.1.0 on the Bull Cluster. I only ran on a single node but haven't 
tested what happens if more than one node is involved.


Joseph

On 05/23/2018 02:04 PM, Nathan Hjelm wrote:

What Open MPI version are you using? Does this happen when you run on a single 
node or multiple nodes?

-Nathan


On May 23, 2018, at 4:45 AM, Joseph Schuchart  wrote:

All,

We are observing some strange/interesting performance issues in accessing 
memory that has been allocated through MPI_Win_allocate. I am attaching our 
test case, which allocates memory for 100M integer values on each process both 
through malloc and MPI_Win_allocate and writes to the local ranges sequentially.

On different systems (incl. SuperMUC and a Bull Cluster), we see that accessing 
the memory allocated through MPI is significantly slower than accessing the 
malloc'ed memory if multiple processes run on a single node, increasing the 
effect with increasing number of processes per node. As an example, running 24 
processes per node with the example attached we see the operations on the 
malloc'ed memory to take ~0.4s while the MPI allocated memory takes up to 10s.

After some experiments, I think there are two factors involved:

1) Initialization: it appears that the first iteration is significantly slower 
than any subsequent accesses (1.1s vs 0.4s with 12 processes on a single 
socket). Excluding the first iteration from the timing or memsetting the range 
leads to comparable performance. I assume that this is due to page faults that 
stem from first accessing the mmap'ed memory that backs the shared memory used 
in the window. The effect of presetting the  malloc'ed memory seems smaller 
(0.4s vs 0.6s).

2) NUMA effects: Given proper initialization, running on two sockets still 
leads to fluctuating performance degradation under the MPI window memory, which 
ranges up to 20x (in extreme cases). The performance of accessing the malloc'ed 
memory is rather stable. The difference seems to get smaller (but does not 
disappear) with increasing number of repetitions. I am not sure what causes 
these effects as each process should first-touch their local memory.

Are these known issues? Does anyone have any thoughts on my analysis?

It is problematic for us that replacing local memory allocation with MPI memory 
allocation leads to performance degradation as we rely on this mechanism in our 
distributed data structures. While we can ensure proper initialization of the 
memory to mitigate 1) for performance measurements, I don't see a way to 
control the NUMA effects. If there is one I'd be happy about any hints :)

I should note that we also tested MPICH-based implementations, which showed 
similar effects (as they also mmap their window memory). Not surprisingly, 
using MPI_Alloc_mem and attaching that memory to a dynamic window does not 
cause these effects while using shared memory windows does. I ran my 
experiments using Open MPI 3.1.0 with the following command lines:

- 12 cores / 1 socket:
mpirun -n 12 --bind-to socket --map-by ppr:12:socket
- 24 cores / 2 sockets:
mpirun -n 24 --bind-to socket

and verified the binding using  --report-bindings.

Any help or comment would be much appreciated.

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


Re: [OMPI users] MPI Windows: performance of local memory access

2018-05-24 Thread Joseph Schuchart

Thank you all for your input!

Nathan: thanks for that hint, this seems to be the culprit: With your 
patch, I do not observe a difference in the performance between the two 
memory allocations. I remembered that Open MPI allows to change the 
shmem allocator on the command line. Using vanilla Open MPI 3.1.0 and 
increasing the priority of the POSIX shmem implementation using `--mca 
shmem_posix_priority 100` leads to good performance, too. The reason 
could be that on the Bull machine /tmp is mounted on a disk partition 
(SSD, iirc). Maybe there is actual I/O involved that hurts performance 
if the shm backing file is located on a disk (even though the file is 
unlinked before the memory is accessed)?


Regarding the other hints: I tried using MPI_Win_allocate_shared with 
the noncontig hint. Using POSIX shmem, I do not observe a difference in 
performance to the other two options. If using the disk-backed shmem 
file, performance fluctuations are similar to MPI_Win_allocate.
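
For completeness, the noncontig variant I tested looks roughly like this (sketch, error checking omitted):

```
#include <stddef.h>
#include <mpi.h>

static double *alloc_shared_noncontig(size_t nelem, MPI_Win *win)
{
  double *baseptr;
  MPI_Info info;
  MPI_Info_create(&info);
  /* allow the implementation to place each rank's segment separately */
  MPI_Info_set(info, "alloc_shared_noncontig", "true");
  MPI_Win_allocate_shared(nelem * sizeof(double), sizeof(double), info,
                          MPI_COMM_WORLD, &baseptr, win);
  MPI_Info_free(&info);
  return baseptr;
}
```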


On this machine /proc/sys/kernel/numa_balancing is not available, so I 
assume that this is not the cause in this case. It's good to know for 
the future that this might become an issue on other systems.


Cheers
Joseph

On 05/23/2018 02:26 PM, Nathan Hjelm wrote:

Odd. I wonder if it is something affected by your session directory. It might 
be worth moving the segment to /dev/shm. I don’t expect it will have an impact 
but you could try the following patch:


diff --git a/ompi/mca/osc/sm/osc_sm_component.c 
b/ompi/mca/osc/sm/osc_sm_component.c
index f7211cd93c..bfc26b39f2 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, 
size_t size, int disp_unit
  posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
  if (0 == ompi_comm_rank (module->comm)) {
  char *data_file;
-if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
- ompi_process_info.proc_session_dir,
+if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
+ ompi_process_info.my_name.jobid,
   ompi_comm_get_cid(module->comm),
   ompi_process_info.nodename) < 0) {
  return OMPI_ERR_OUT_OF_RESOURCE;



On May 23, 2018, at 6:11 AM, Joseph Schuchart  wrote:

I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 7.1.0 
on the Bull Cluster. I only ran on a single node but haven't tested what 
happens if more than one node is involved.

Joseph

On 05/23/2018 02:04 PM, Nathan Hjelm wrote:

What Open MPI version are you using? Does this happen when you run on a single 
node or multiple nodes?
-Nathan

On May 23, 2018, at 4:45 AM, Joseph Schuchart  wrote:

All,

We are observing some strange/interesting performance issues in accessing 
memory that has been allocated through MPI_Win_allocate. I am attaching our 
test case, which allocates memory for 100M integer values on each process both 
through malloc and MPI_Win_allocate and writes to the local ranges sequentially.

On different systems (incl. SuperMUC and a Bull Cluster), we see that accessing 
the memory allocated through MPI is significantly slower than accessing the 
malloc'ed memory if multiple processes run on a single node, increasing the 
effect with increasing number of processes per node. As an example, running 24 
processes per node with the example attached we see the operations on the 
malloc'ed memory to take ~0.4s while the MPI allocated memory takes up to 10s.

After some experiments, I think there are two factors involved:

1) Initialization: it appears that the first iteration is significantly slower 
than any subsequent accesses (1.1s vs 0.4s with 12 processes on a single 
socket). Excluding the first iteration from the timing or memsetting the range 
leads to comparable performance. I assume that this is due to page faults that 
stem from first accessing the mmap'ed memory that backs the shared memory used 
in the window. The effect of presetting the  malloc'ed memory seems smaller 
(0.4s vs 0.6s).

2) NUMA effects: Given proper initialization, running on two sockets still 
leads to fluctuating performance degradation under the MPI window memory, which 
ranges up to 20x (in extreme cases). The performance of accessing the malloc'ed 
memory is rather stable. The difference seems to get smaller (but does not 
disappear) with increasing number of repetitions. I am not sure what causes 
these effects as each process should first-touch their local memory.

Are these known issues? Does anyone have any thoughts on my analysis?

It is problematic for us that replacing local memory allocation with MPI memory 
allocation leads to performance degradation as we rely on this mechanism in our 
distributed data structures. While

Re: [OMPI users] MPI Windows: performance of local memory access

2018-05-24 Thread Joseph Schuchart
Nathan, thanks for taking care of this! I looked at the PR and wonder 
why we don't move the whole session directory to /dev/shm on Linux 
instead of introducing a new mca parameter?


Joseph

On 05/24/2018 04:28 PM, Nathan Hjelm wrote:

PR is up

https://github.com/open-mpi/ompi/pull/5193


-Nathan


On May 24, 2018, at 7:09 AM, Nathan Hjelm  wrote:

Ok, thanks for testing that. I will open a PR for master changing the default 
backing location to /dev/shm on linux. Will be PR’d to v3.0.x and v3.1.x.

-Nathan


On May 24, 2018, at 6:46 AM, Joseph Schuchart  wrote:

Thank you all for your input!

Nathan: thanks for that hint, this seems to be the culprit: With your patch, I 
do not observe a difference in the performance between the two memory 
allocations. I remembered that Open MPI allows to change the shmem allocator on 
the command line. Using vanilla Open MPI 3.1.0 and increasing the priority of 
the POSIX shmem implementation using `--mca shmem_posix_priority 100` leads to 
good performance, too. The reason could be that on the Bull machine /tmp is 
mounted on a disk partition (SSD, iirc). Maybe there is actual I/O involved 
that hurts performance if the shm backing file is located on a disk (even 
though the file is unlinked before the memory is accessed)?

Regarding the other hints: I tried using MPI_Win_allocate_shared with the 
noncontig hint. Using POSIX shmem, I do not observe a difference in performance 
to the other two options. If using the disk-backed shmem file, performance 
fluctuations are similar to MPI_Win_allocate.

On this machine /proc/sys/kernel/numa_balancing is not available, so I assume 
that this is not the cause in this case. It's good to know for the future that 
this might become an issue on other systems.

Cheers
Joseph

On 05/23/2018 02:26 PM, Nathan Hjelm wrote:

Odd. I wonder if it is something affected by your session directory. It might 
be worth moving the segment to /dev/shm. I don’t expect it will have an impact 
but you could try the following patch:
diff --git a/ompi/mca/osc/sm/osc_sm_component.c 
b/ompi/mca/osc/sm/osc_sm_component.c
index f7211cd93c..bfc26b39f2 100644
--- a/ompi/mca/osc/sm/osc_sm_component.c
+++ b/ompi/mca/osc/sm/osc_sm_component.c
@@ -262,8 +262,8 @@ component_select(struct ompi_win_t *win, void **base, 
size_t size, int disp_unit
 posts_size += OPAL_ALIGN_PAD_AMOUNT(posts_size, 64);
 if (0 == ompi_comm_rank (module->comm)) {
 char *data_file;
-if (asprintf(&data_file, "%s"OPAL_PATH_SEP"shared_window_%d.%s",
- ompi_process_info.proc_session_dir,
+if (asprintf(&data_file, "/dev/shm/%d.shared_window_%d.%s",
+ ompi_process_info.my_name.jobid,
  ompi_comm_get_cid(module->comm),
  ompi_process_info.nodename) < 0) {
 return OMPI_ERR_OUT_OF_RESOURCE;

On May 23, 2018, at 6:11 AM, Joseph Schuchart  wrote:

I tested with Open MPI 3.1.0 and Open MPI 3.0.0, both compiled with GCC 7.1.0 
on the Bull Cluster. I only ran on a single node but haven't tested what 
happens if more than one node is involved.

Joseph

On 05/23/2018 02:04 PM, Nathan Hjelm wrote:

What Open MPI version are you using? Does this happen when you run on a single 
node or multiple nodes?
-Nathan

On May 23, 2018, at 4:45 AM, Joseph Schuchart  wrote:

All,

We are observing some strange/interesting performance issues in accessing 
memory that has been allocated through MPI_Win_allocate. I am attaching our 
test case, which allocates memory for 100M integer values on each process both 
through malloc and MPI_Win_allocate and writes to the local ranges sequentially.

On different systems (incl. SuperMUC and a Bull Cluster), we see that accessing 
the memory allocated through MPI is significantly slower than accessing the 
malloc'ed memory if multiple processes run on a single node, increasing the 
effect with increasing number of processes per node. As an example, running 24 
processes per node with the example attached we see the operations on the 
malloc'ed memory to take ~0.4s while the MPI allocated memory takes up to 10s.

After some experiments, I think there are two factors involved:

1) Initialization: it appears that the first iteration is significantly slower 
than any subsequent accesses (1.1s vs 0.4s with 12 processes on a single 
socket). Excluding the first iteration from the timing or memsetting the range 
leads to comparable performance. I assume that this is due to page faults that 
stem from first accessing the mmap'ed memory that backs the shared memory used 
in the window. The effect of presetting the  malloc'ed memory seems smaller 
(0.4s vs 0.6s).

2) NUMA effects: Given proper initialization, running on two sockets still 
leads to fluctuating performance degradation under the MPI window memory, which 
ranges up to 20x (in e

[OMPI users] pt2pt osc required for single-node runs?

2018-09-06 Thread Joseph Schuchart

All,

I installed Open MPI 3.1.2 on my laptop today (up from 3.0.0, which 
worked fine) and ran into the following error when trying to create a 
window:


```
--
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this 
release.

Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.
--
[beryl:13894] *** An error occurred in MPI_Win_create
[beryl:13894] *** reported by process [2678849537,0]
[beryl:13894] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[beryl:13894] *** MPI_ERR_WIN: invalid window
[beryl:13894] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[beryl:13894] ***and potentially your MPI job)
```

I remember seeing this announced in the release notes. I wonder, 
however, why the pt2pt component is required for a run on a single node 
(as suggested by the error message). I tried to disable the pt2pt 
component, which gives a similar error but without the message about the 
pt2pt component:


```
$ mpirun -n 4 --mca osc ^pt2pt ./a.out
[beryl:13738] *** An error occurred in MPI_Win_create
[beryl:13738] *** reported by process [2621964289,0]
[beryl:13738] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[beryl:13738] *** MPI_ERR_WIN: invalid window
[beryl:13738] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[beryl:13738] ***and potentially your MPI job)
```

Is this a known issue with v3.1.2? Is there a way to get more 
information about what is going wrong in the second case? Is this the 
right way to disable the pt2pt component?


Cheers,
Joseph


[OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart

All,

I am currently experimenting with MPI atomic operations and wanted to 
share some interesting results I am observing. The numbers below are 
measurements from both an IB-based cluster and our Cray XC40. The 
benchmarks look like the following snippet:


```
  if (rank == 1) {
    uint64_t res, val;
    for (size_t i = 0; i < NUM_REPS; ++i) {
      MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
      MPI_Win_flush(target, win);
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I 
have tried to confirm that the operations are done in hardware by 
letting rank 0 sleep for a while and ensuring that communication 
progresses). Of particular interest for my use-case is fetch_op but I am 
including other operations here nevertheless:


* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies 
than operations on 32bit

b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and 
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate 
(compare_exchange seems to be somewhat of an outlier).


* Cray XC40,  Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as 
with IB and the latencies for 32bit vs 64bit are roughly the same 
(except for compare_exchange, it seems).


So my question is: is this to be expected? Is the higher latency when 
using a shared lock caused by an internal lock being acquired because 
the hardware operations are not actually atomic?


I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-06 Thread Joseph Schuchart
Thanks a lot for the quick reply, setting 
osc_rdma_acc_single_intrinsic=true does the trick for both shared and 
exclusive locks and brings it down to <2us per operation. I hope that 
the info key will make it into the next version of the standard, I 
certainly have use for it :)


Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the 
standard it is difficult to make use of network atomics even for 
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the 
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an extra 
lock step as part of the accumulate (it isn't needed if there is an 
exclusive lock). When setting the above parameter you are telling the 
implementation that you will only be using a single count and we can 
optimize that with the hardware. The RMA working group is working on an 
info key that will essentially do the same thing.


Note the above parameter won't help you with IB if you are using UCX 
unless you set this (master only right now):


btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  wrote:


All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as
with IB and the latencies for 32bit vs 64bit are roughly the same
(except for compare_exchange, it seems).

So my question is: is this to be expected? Is the higher latency when
using a shared lock caused by an internal lock being acquired because
the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de <mailto:schuch...@hlrs.de>
___
users mailing list
users@lists.open-mpi.org <mailto:users@lists.open-mpi.org>
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Joseph Schuchart
While using the mca parameter in a real application I noticed a strange 
effect, which took me a while to figure out: It appears that on the 
Aries network the accumulate operations are not atomic anymore. I am 
attaching a test program that shows the problem: all processes but one 
continuously increment a counter while rank 0 continuously subtracts a 
large value and adds it back, eventually checking for the correct number 
of increments. Without the mca parameter the test at the end succeeds as 
all increments are accounted for:


```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the mca parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1 
./mpi_fetch_op_local_remote

result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main: 
Assertion `sum == 1000*(comm_size-1)' failed.

```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM 
so I assume that the test in combination with the mca flag is correct. I 
cannot reproduce this issue on our IB cluster.
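
For reference, a minimal sketch of the pattern described above (this is 
not the actual attachment; the window allocation, `baseptr`, `rank`, and 
`comm_size` are assumed, and NITER matches the 1000 iterations implied 
by the assertion):

```
const int64_t NITER = 1000, LARGE = 1000000000;
int64_t one = 1, sub = -LARGE, add = LARGE, res;

MPI_Win_lock_all(0, win);
if (rank == 0) {
  /* repeatedly subtract a large value and add it back, atomically */
  for (int64_t i = 0; i < NITER; ++i) {
    MPI_Fetch_and_op(&sub, &res, MPI_INT64_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
    MPI_Fetch_and_op(&add, &res, MPI_INT64_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
  }
} else {
  /* all other ranks increment the counter on rank 0 */
  for (int64_t i = 0; i < NITER; ++i) {
    MPI_Fetch_and_op(&one, &res, MPI_INT64_T, 0, 0, MPI_SUM, win);
    MPI_Win_flush(0, win);
  }
}
MPI_Win_unlock_all(win);
MPI_Barrier(MPI_COMM_WORLD);

if (rank == 0) {
  int64_t sum = *(int64_t *)baseptr;        /* counter lives in rank 0's window */
  assert(sum == NITER * (comm_size - 1));   /* all increments must be visible */
}
```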


Is that an issue in Open MPI or is there some problem in the test case 
that I am missing?


Thanks in advance,
Joseph


On 11/6/18 1:15 PM, Joseph Schuchart wrote:
Thanks a lot for the quick reply, setting 
osc_rdma_acc_single_intrinsic=true does the trick for both shared and 
exclusive locks and brings it down to <2us per operation. I hope that 
the info key will make it into the next version of the standard, I 
certainly have use for it :)


Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the 
standard it is difficult to make use of network atomics even for 
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the 
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an extra 
lock step as part of the accumulate (it isn't needed if there is an 
exclusive lock). When setting the above parameter you are telling the 
implementation that you will only be using a single count and we can 
optimize that with the hardware. The RMA working group is working on 
an info key that will essentially do the same thing.


Note the above parameter won't help you with IB if you are using UCX 
unless you set this (master only right now):


btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  wrote:


All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange:

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2018-11-08 Thread Joseph Schuchart
Sorry for the delay, I wanted to make sure that I test the same version 
on both Aries and IB: git master bbe5da4. I realized that I had 
previously tested with 3.1.3 on the IB cluster, which ran fine. If I use 
the same version I run into the same problem on both systems (with --mca 
btl_openib_allow_ib true --mca osc_rdma_acc_single_intrinsic true). I 
have not tried using UCX for this.


Joseph

On 11/8/18 1:20 PM, Nathan Hjelm via users wrote:
Quick scan of the program and it looks ok to me. I will dig deeper and 
see if I can determine the underlying cause.


What Open MPI version are you using?

-Nathan

On Nov 08, 2018, at 11:10 AM, Joseph Schuchart  wrote:


While using the mca parameter in a real application I noticed a strange
effect, which took me a while to figure out: It appears that on the
Aries network the accumulate operations are not atomic anymore. I am
attaching a test program that shows the problem: all but one processes
continuously increment a counter while rank 0 is continuously
subtracting a large value and adding it again, eventually checking for
the correct number of increments. Without the mca parameter the test at
the end succeeds as all increments are accounted for:

```
$ mpirun -n 16 -N 1 ./mpi_fetch_op_local_remote
result:15000
```

When setting the mca parameter the test fails with garbage in the result:

```
$ mpirun --mca osc_rdma_acc_single_intrinsic true -n 16 -N 1
./mpi_fetch_op_local_remote
result:25769849013
mpi_fetch_op_local_remote: mpi_fetch_op_local_remote.c:97: main:
Assertion `sum == 1000*(comm_size-1)' failed.
```

All processes perform only MPI_Fetch_and_op in combination with MPI_SUM
so I assume that the test in combination with the mca flag is correct. I
cannot reproduce this issue on our IB cluster.

Is that an issue in Open MPI or is there some problem in the test case
that I am missing?

Thanks in advance,
Joseph


On 11/6/18 1:15 PM, Joseph Schuchart wrote:

Thanks a lot for the quick reply, setting
osc_rdma_acc_single_intrinsic=true does the trick for both shared and
exclusive locks and brings it down to <2us per operation. I hope that
the info key will make it into the next version of the standard, I
certainly have use for it :)

Cheers,
Joseph

On 11/6/18 12:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the
standard it is difficult to make use of network atomics even for
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:

osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an extra
lock step as part of the accumulate (it isn't needed if there is an
exclusive lock). When setting the above parameter you are telling the
implementation that you will only be using a single count and we can
optimize that with the hardware. The RMA working group is working on
an info key that will essentially do the same thing.

Note the above parameter won't help you with IB if you are using UCX
unless you set this (master only right now):

btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of
optimization. I don't know.


-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  
wrote:



All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but 
I am

including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint3

Re: [OMPI users] Cannot catch std::bac_alloc?

2019-04-03 Thread Joseph Schuchart

Zhen,

The "problem" you're running into is memory overcommit [1]. The system 
will happily hand you a pointer to memory upon calling malloc without 
actually allocating the pages (that's the first step in 
std::vector::resize) and then terminate your application as soon as it 
actually touches those pages if the system runs out of memory. This 
happens in std::vector::resize too, which sets each entry in the vector 
to its initial value. There is no way you can catch that. You might 
want to try to disable overcommit in the kernel and see if 
std::vector::resize throws an exception because malloc fails.
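
For illustration, a rough C sketch of the mechanism (the sizes are made 
up and this is not Zhen's reproducer): malloc succeeds thanks to 
overcommit, and the process is only killed with signal 9 once the pages 
are actually touched.

```
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* assumed to exceed physical memory once every rank requests it */
  size_t n = (size_t)8 * 1024 * 1024 * 1024;
  char *p = malloc(n);            /* typically succeeds due to overcommit */
  if (p == NULL) {
    /* only reliably reachable if overcommit is disabled in the kernel */
    fprintf(stderr, "[%d] malloc failed\n", rank);
  } else {
    memset(p, 0, n);              /* faulting the pages in is what can get the rank killed */
    free(p);
  }

  MPI_Finalize();
  return 0;
}
```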


HTH,
Joseph

[1] https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

On 4/3/19 3:26 PM, Zhen Wang wrote:

Hi,

I have difficulty catching std::bad_alloc in an MPI environment. The 
code is attached. I'm uisng gcc 6.3 on SUSE Linux Enterprise Server 11 
(x86_64). OpenMPI is built from source. The commands are as follows:


*Build*
g++ -I -L -lmpi memory.cpp

*Run*
 -n 2 a.out

*Output*
0
0
1
1
--
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--
--
mpiexec noticed that process rank 0 with PID 0 on node cdcebus114qa05 
exited on signal 9 (Killed).

--


If I uncomment the line //if (rank == 0), i.e., only rank 0 allocates 
memory, I'm able to catch bad_alloc as I expected. It seems that I am 
misunderstanding something. Could you please help? Thanks a lot.




Best regards,
Zhen

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-08 Thread Joseph Schuchart via users

Nathan,

Over the last couple of weeks I made some more interesting observations 
regarding the latencies of accumulate operations on both Aries and 
InfiniBand systems:


1) There seems to be a significant difference between 64bit and 32bit 
operations: on Aries, the average latency for compare-exchange on 64bit 
values is about 1.8us while on 32bit values it is 3.9us, a factor 
of >2x. On the IB cluster, all of fetch-and-op, compare-exchange, and 
accumulate show a similar difference between 32 and 64bit. There are no 
differences between 32bit and 64bit puts and gets on these systems.


2) On both systems, the latency for a single-value atomic load using 
MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM on 
64bit values, roughly matching the latency of 32bit compare-exchange 
operations.
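
For reference, the two single-element "atomic load" variants compared in 
(2) would look roughly like this (the window setup, `target`, and the 
usual passive-target epoch are assumed):

```
uint64_t dummy = 0, result;

/* variant 1: atomic read via MPI_NO_OP (the origin value is ignored) */
MPI_Fetch_and_op(&dummy, &result, MPI_UINT64_T, target, 0, MPI_NO_OP, win);
MPI_Win_flush(target, win);

/* variant 2: atomic fetch-and-add of zero via MPI_SUM */
MPI_Fetch_and_op(&dummy, &result, MPI_UINT64_T, target, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
```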


All measurements were done using Open MPI 3.1.2 with 
OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected 
as well?


Thanks,
Joseph


On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the 
standard it is difficult to make use of network atomics even for 
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the 
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an extra 
lock step as part of the accumulate (it isn't needed if there is an 
exclusive lock). When setting the above parameter you are telling the 
implementation that you will only be using a single count and we can 
optimize that with the hardware. The RMA working group is working on an 
info key that will essentially do the same thing.


Note the above parameter won't help you with IB if you are using UCX 
unless you set this (master only right now):


btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  wrote:


All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but I am
including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exchange: 1.855840us
accumulate: 5.212632us
get_accumulate: 5.396110us


The difference between exclusive and shared lock is about the same as
with IB and the latencies for 32bit vs 64bit are roughly the same
(except for compare_exchange, it seems).

So my question is: is this to be expected? Is the higher latency when
using a shared lock caused by an internal lock being acquired because
the hardware operations are not actually atomic?

I'd be grateful for any insight on this.

Cheers,
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttga

Re: [OMPI users] Latencies of atomic operations on high-performance networks

2019-05-09 Thread Joseph Schuchart via users

Benson,

I just gave 4.0.1 a shot and the behavior is the same (the reason I'm 
stuck with 3.1.2 is a regression with `osc_rdma_acc_single_intrinsic` on 
4.0 [1]).


The IB cluster has both Mellanox ConnectX-3 (w/ Haswell CPU) and 
ConnectX-4 (w/ Skylake CPU) nodes, the effect is visible on both node types.


Joseph

[1] https://github.com/open-mpi/ompi/issues/6536

On 5/9/19 9:10 AM, Benson Muite via users wrote:

Hi,

Have you tried anything with OpenMPI 4.0.1?

What are the specifications of the Infiniband system you are using?

Benson

On 5/9/19 9:37 AM, Joseph Schuchart via users wrote:

Nathan,

Over the last couple of weeks I made some more interesting 
observations regarding the latencies of accumulate operations on both 
Aries and InfiniBand systems:


1) There seems to be a significant difference between 64bit and 32bit 
operations: on Aries, the average latency for compare-exchange on 
64bit values takes about 1.8us while on 32bit values it's at 3.9us, a 
factor of >2x. On the IB cluster, all of fetch-and-op, 
compare-exchange, and accumulate show a similar difference between 32 
and 64bit. There are no differences between 32bit and 64bit puts and 
gets on these systems.


2) On both systems, the latency for a single-value atomic load using 
MPI_Fetch_and_op + MPI_NO_OP is 2x that of MPI_Fetch_and_op + MPI_SUM 
on 64bit values, roughly matching the latency of 32bit 
compare-exchange operations.


All measurements were done using Open MPI 3.1.2 with 
OMPI_MCA_osc_rdma_acc_single_intrinsic=true. Is that behavior expected 
as well?


Thanks,
Joseph


On 11/6/18 6:13 PM, Nathan Hjelm via users wrote:


All of this is completely expected. Due to the requirements of the 
standard it is difficult to make use of network atomics even for 
MPI_Compare_and_swap (MPI_Accumulate and MPI_Get_accumulate spoil the 
party). If you want MPI_Fetch_and_op to be fast set this MCA parameter:


osc_rdma_acc_single_intrinsic=true

Shared lock is slower than an exclusive lock because there is an 
extra lock step as part of the accumulate (it isn't needed if there 
is an exclusive lock). When setting the above parameter you are 
telling the implementation that you will only be using a single count 
and we can optimize that with the hardware. The RMA working group is 
working on an info key that will essentially do the same thing.


Note the above parameter won't help you with IB if you are using UCX 
unless you set this (master only right now):


btl_uct_transports=dc_mlx5

btl=self,vader,uct

osc=^ucx


Though there may be a way to get osc/ucx to enable the same sort of 
optimization. I don't know.



-Nathan


On Nov 06, 2018, at 09:38 AM, Joseph Schuchart  
wrote:



All,

I am currently experimenting with MPI atomic operations and wanted to
share some interesting results I am observing. The numbers below are
measurements from both an IB-based cluster and our Cray XC40. The
benchmarks look like the following snippet:

```
if (rank == 1) {
uint64_t res, val;
for (size_t i = 0; i < NUM_REPS; ++i) {
MPI_Fetch_and_op(&val, &res, MPI_UINT32_T, 0, 0, MPI_SUM, win);
MPI_Win_flush(target, win);
}
}
MPI_Barrier(MPI_COMM_WORLD);
```

Only rank 1 performs atomic operations, rank 0 waits in a barrier (I
have tried to confirm that the operations are done in hardware by
letting rank 0 sleep for a while and ensuring that communication
progresses). Of particular interest for my use-case is fetch_op but 
I am

including other operations here nevertheless:

* Linux Cluster, IB QDR *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 4.323384us
compare_exchange: 2.035905us
accumulate: 4.326358us
get_accumulate: 4.334831us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.438080us
compare_exchange: 2.398836us
accumulate: 2.435378us
get_accumulate: 2.448347us

Shared lock, MPI_UINT32_T:
fetch_op: 6.819977us
compare_exchange: 4.551417us
accumulate: 6.807766us
get_accumulate: 6.817602us

Shared lock, MPI_UINT64_T:
fetch_op: 4.954860us
compare_exchange: 2.399373us
accumulate: 4.965702us
get_accumulate: 4.977876us

There are two interesting observations:
a) operations on 64bit operands generally seem to have lower latencies
than operations on 32bit
b) Using an exclusive lock leads to lower latencies

Overall, there is a factor of almost 3 between SharedLock+uint32_t and
ExclusiveLock+uint64_t for fetch_and_op, accumulate, and get_accumulate
(compare_exchange seems to be somewhat of an outlier).

* Cray XC40, Aries *
average of 10 iterations

Exclusive lock, MPI_UINT32_T:
fetch_op: 2.011794us
compare_exchange: 1.740825us
accumulate: 1.795500us
get_accumulate: 1.985409us

Exclusive lock, MPI_UINT64_T:
fetch_op: 2.017172us
compare_exchange: 1.846202us
accumulate: 1.812578us
get_accumulate: 2.005541us

Shared lock, MPI_UINT32_T:
fetch_op: 5.380455us
compare_exchange: 5.164458us
accumulate: 5.230184us
get_accumulate: 5.399722us

Shared lock, MPI_UINT64_T:
fetch_op: 5.415230us
compare_exch

Re: [OMPI users] growing memory use from MPI application

2019-06-20 Thread Joseph Schuchart via users

Noam,

Another idea: check for stale files in /dev/shm/ (or a subdirectory that 
looks like it belongs to UCX/OpenMPI) and SysV shared memory using `ipcs 
-m`.


Joseph

On 6/20/19 3:31 PM, Noam Bernstein via users wrote:



On Jun 20, 2019, at 4:44 AM, Charles A Taylor wrote:


This looks a lot like a problem I had with OpenMPI 3.1.2.  I thought 
the fix was landed in 4.0.0 but you might
want to check the code to be sure there wasn’t a regression in 4.1.x. 
 Most of our codes are still running
3.1.2 so I haven’t built anything beyond 4.0.0 which definitely 
included the fix.


Unfortunately, 4.0.0 behaves the same.

One thing that I’m wondering if anyone familiar with the internals can 
explain is how you get a memory leak that isn’t freed when then program 
ends?  Doesn’t that suggest that it’s something lower level, like maybe 
a kernel issue?


Noam


U.S. NAVAL RESEARCH LABORATORY

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] CPC only supported when the first QP is a PP QP?

2019-08-05 Thread Joseph Schuchart via users
I'm trying to run an MPI RMA application on an IB cluster and find that 
Open MPI is using the pt2pt rdma component instead of openib (or UCX). I 
tried getting some logs from Open MPI (current 3.1.x git):


```
$ mpirun -n 2 --mca btl_base_verbose 100 --mca osc_base_verbose 100 
--mca osc_rdma_verbose 100 ./a.out
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: opening osc components
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component sm
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: component sm open function successful
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component monitoring
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component pt2pt
[taurusi6608.taurus.hrsk.tu-dresden.de:06550] mca: base: 
components_open: found loaded component rdma
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] rdmacm CPC only supported 
when the first QP is a PP QP; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] openib BTL: rdmacm CPC 
unavailable for use on mlx5_0:1; skipped
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] [rank=0] openib: using 
port mlx5_0:1
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: init of component 
openib returned success
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] select: initializing btl 
component tcp
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Searching for 
exclude address+prefix: 127.0.0.1 / 8
[taurusi6606.taurus.hrsk.tu-dresden.de:08214] btl: tcp: Found match: 
127.0.0.1 (lo)

```

Is there any information on what makes "rdmacm CPC unavailable for use"? 
I cannot make much sense of "rdmacm CPC only supported when the first QP 
is a PP QP"... Is this a configuration problem of the system? A problem 
with the software stack?


If I try the same using Open MPI 4.0.x it reports:
```
[taurusi6607.taurus.hrsk.tu-dresden.de:21681] Process is not bound: 
distance to device is 0.00

--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   taurusi6606
  Local device: mlx5_0
--
[taurusi6606.taurus.hrsk.tu-dresden.de:09069] select: init of component 
openib returned failure

```

The message about rdmacm does not show up.

The system has mlx5 devices:

```
$ ~/opt/openmpi-v3.1.x/bin/mpirun -n 2 ibv_devices
device node GUID
--  
mlx5_0  08003800013c7507
device node GUID
--  
mlx5_0  08003800013c773b
```

Any help would be much appreciated!

Thanks,
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] mpirun --output-filename behavior

2019-10-31 Thread Joseph Schuchart via users

On 10/30/19 2:06 AM, Jeff Squyres (jsquyres) via users wrote:


Oh, did the prior behavior *only* output to the file and not to 
stdout/stderr?  Huh.


I guess a workaround for that would be:

     mpirun  ... > /dev/null


Just to throw in my $0.02: I recently found that the output to 
stdout/stderr may not be desirable: in an application that writes a lot 
of log data to stderr on all ranks, stdout was significantly slower than 
the files I redirected stdio to (I ended up seeing the application 
complete in the file output while the terminal wasn't even halfway 
through). Redirecting stderr to /dev/null as Jeff suggests does not help 
much because the output first has to be sent to the head node.


Things got even worse when I tried to use the stdout redirection with 
DDT: it barfed at me for doing pipe redirection in the command 
specification! The DDT terminal is just really slow and made the whole 
exercise worthless.


Point to make: it would be nice to have an option to suppress the output 
on stdout and/or stderr when output redirection to file is requested. In 
my case, having stdout still visible on the terminal is desirable but 
having a way to suppress output of stderr to the terminal would be 
immensely helpful.


Joseph



--
Jeff Squyres
jsquy...@cisco.com 



Re: [OMPI users] mpirun --output-filename behavior

2019-11-01 Thread Joseph Schuchart via users

Gilles,

Thanks for your suggestions! I just tried both of them, see below:

On 11/1/19 1:15 AM, Gilles Gouaillardet via users wrote:

Joseph,


you can achieve this via an agent (and it works with DDT too)


For example, the nostderr script below redirects each MPI task's stderr 
to /dev/null (so it is not forwarded to mpirun)



$ cat nostderr
#!/bin/sh

exec 2> /dev/null

exec "$@"


and then you can simply


$ mpirun --mca orte_fork_agent /.../nostderr ...



It does not seem to work for me. No matter how I specify the fork agent 
(relative/absolute paths), mpirun can't seem to find it:


$ mpirun --mca orte_fork_agent ./nostderr -n 1 -N 1 hostname
--
The specified fork agent was not found:

  Node:nid06016
  Fork agent:  ./nostderr

The application cannot be launched.
--
$ ll nostderr
-rwx-- 1 hpcjschu s31540 41 Nov  1 13:56 nostderr





FWIW, and even simpler option (that might not work with DDT though) is to

$ mpirun bash -c './a.out 2> /dev/null'



That suppresses the output on stderr, but unfortunately also in the 
resulting output file. Where does Open MPI intercept stdout/stderr to 
tee it to the output file?


Cheers
Joseph



Cheers,


Gilles

On 11/1/2019 7:43 AM, Joseph Schuchart via users wrote:

On 10/30/19 2:06 AM, Jeff Squyres (jsquyres) via users wrote:


Oh, did the prior behavior *only* output to the file and not to 
stdout/stderr?  Huh.


I guess a workaround for that would be:

     mpirun  ... > /dev/null


Just to throw in my $0.02: I recently found that the output to 
stdout/stderr may not be desirable: in an application that writes a 
lot of log data to stderr on all ranks, stdout was significantly 
slower than the files I redirected stdio to (I ended up seeing the 
application complete in the file output while the terminal wasn't even 
halfway through). Redirecting stderr to /dev/null as Jeff suggests 
does not help much because the output first has to be sent to the head 
node.


Things got even worse when I tried to use the stdout redirection with 
DDT: it barfed at me for doing pipe redirection in the command 
specification! The DDT terminal is just really slow and made the whole 
exercise worthless.


Point to make: it would be nice to have an option to suppress the 
output on stdout and/or stderr when output redirection to file is 
requested. In my case, having stdout still visible on the terminal is 
desirable but having a way to suppress output of stderr to the 
terminal would be immensely helpful.


Joseph



--
Jeff Squyres
jsquy...@cisco.com <mailto:jsquy...@cisco.com>





[OMPI users] Question about UCX progress throttling

2020-02-07 Thread Joseph Schuchart via users
Today I came across the two MCA parameters osc_ucx_progress_iterations 
and pml_ucx_progress_iterations in Open MPI. My interpretation of the 
description is that in a loop such as below, progress in UCX is only 
triggered every 100 iterations (assuming opal_progress is only called 
once per MPI_Test call):


```
int flag = 0;
MPI_Request req;

while (!flag) {
  do_something_else();
  MPI_Test(&req, &flag);
}
```

Is that assumption correct? What is the reason behind this throttling? 
In combination with TAMPI, it appears that setting them to 1 yields a 
significant speedup. Is it safe to always set them to 1?


Thanks
Joseph
--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de


Re: [OMPI users] RMA in openmpi

2020-04-26 Thread Joseph Schuchart via users

Claire,

> Is it possible to use the one-sided communication without combining 
it with synchronization calls?


What exactly do you mean by "synchronization calls"? MPI_Win_fence is 
indeed synchronizing (basically flush+barrier) but MPI_Win_lock (and the 
passive target synchronization interface at large) is not. It does incur 
some overhead because the lock has to be taken somehow at some point. 
However, it does not require a matching call at the target to complete.


You can lock a window using a (shared or exclusive) lock, initiate RMA 
operations, flush them to wait for their completion, and initiate the 
next set of RMA operations to flush later. None of these calls are 
synchronizing. You will have to perform your own synchronization at some 
point though to make sure processes read consistent data.
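
A minimal sketch of that pattern (the window creation and the 
`target`/`disp` values are assumed):

```
MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);   /* open passive-target epoch */
MPI_Put(&local_val, 1, MPI_DOUBLE, target, disp, 1, MPI_DOUBLE, win);
MPI_Win_flush(target, win);                      /* wait for completion of the put */
MPI_Get(&remote_val, 1, MPI_DOUBLE, target, disp, 1, MPI_DOUBLE, win);
MPI_Win_flush(target, win);                      /* remote_val is now valid */
MPI_Win_unlock(target, win);                     /* no matching call needed at the target */
```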


HTH!
Joseph


On 4/24/20 5:34 PM, Claire Cashmore via users wrote:

Hello

I was wondering if someone could help me with a question.

When using RMA is there a requirement to use some type of 
synchronization? When using one-sided communication such as MPI_Get the 
code will only run when I combine it with MPI_Win_fence or 
MPI_Win_lock/unlock. I do not want to use MPI_Win_fence as I’m using the 
one-sided communication to allow some communication when processes are 
not synchronised, so this defeats the point. I could use 
MPI_Win_lock/unlock, however someone I’ve spoken to has said that I 
should be able to use RMA without any synchronization calls, if so then 
I would prefer to do this to reduce any overheads using MPI_Win_lock 
every time I use the one-sided communication may produce.


Is it possible to use the one-sided communication without combining it 
with synchronization calls?


(It doesn’t seem to matter what version of openmpi I use).

Thank you

Claire



Re: [OMPI users] RMA in openmpi

2020-04-27 Thread Joseph Schuchart via users

Hi Claire,

You cannot use MPI_Get (or any other RMA communication routine) on a 
window for which no access epoch has been started. MPI_Win_fence starts 
an active target access epoch, MPI_Win_lock[_all] start a passive target 
access epoch. Window locks are synchronizing in the sense that they 
provide a means for mutual exclusion if an exclusive lock is involved (a 
process holding a shared window lock allows for other processes to 
acquire shared locks but prevents them from taking an exclusive lock, 
and vice versa).


One common strategy is to call MPI_Win_lock_all on all processes to let 
all processes acquire a shared lock, which they hold until the end of 
the application run. Communication is then done using a combination of 
MPI_Get/MPI_Put/accumulate functions and flushes. As said earlier, you 
likely will need to take care of synchronization among the processes if 
they also modify data in the window.
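
A rough sketch of that strategy (window creation and the `target`/`disp` 
values are assumed):

```
MPI_Win_lock_all(0, win);          /* every process takes a shared lock once */

/* ... anywhere during the application run: ... */
MPI_Put(&val, 1, MPI_INT, target, disp, 1, MPI_INT, win);
MPI_Win_flush(target, win);        /* complete outstanding operations to target */
/* ... plus whatever user-level synchronization the algorithm needs ... */

MPI_Win_unlock_all(win);           /* at the very end of the run */
```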


Cheers
Joseph

On 4/27/20 12:14 PM, Claire Cashmore wrote:

Hi Joseph

Thank you for your reply. From what I had been reading I thought they were both called 
"synchronization calls" just that one was passive (lock) and one was active 
(fence), sorry if I've got confused!
So I'm asking: do I need either MPI_Win_fence or MPI_Win_unlock/lock in order to 
use one-sided calls, and is it not possible to use one-sided communication 
without them? So just a standalone MPI_Get, without the other calls before and 
after? It seems not from what you are saying, but I just wanted to confirm.

Thanks again

Claire

On 27/04/2020, 07:50, "Joseph Schuchart via users"  
wrote:

 Claire,

  > Is it possible to use the one-sided communication without combining
 it with synchronization calls?

 What exactly do you mean by "synchronization calls"? MPI_Win_fence is
 indeed synchronizing (basically flush+barrier) but MPI_Win_lock (and the
 passive target synchronization interface at large) is not. It does incur
 some overhead because the lock has to be taken somehow at some point.
 However, it does not require a matching call at the target to complete.

 You can lock a window using a (shared or exclusive) lock, initiate RMA
 operations, flush them to wait for their completion, and initiate the
 next set of RMA operations to flush later. None of these calls are
 synchronizing. You will have to perform your own synchronization at some
 point though to make sure processes read consistent data.

 HTH!
 Joseph


 On 4/24/20 5:34 PM, Claire Cashmore via users wrote:
 > Hello
 >
 > I was wondering if someone could help me with a question.
 >
 > When using RMA is there a requirement to use some type of
 > synchronization? When using one-sided communication such as MPI_Get the
 > code will only run when I combine it with MPI_Win_fence or
 > MPI_Win_lock/unlock. I do not want to use MPI_Win_fence as I’m using the
 > one-sided communication to allow some communication when processes are
 > not synchronised, so this defeats the point. I could use
 > MPI_Win_lock/unlock, however someone I’ve spoken to has said that I
 > should be able to use RMA without any synchronization calls, if so then
 > I would prefer to do this to reduce any overheads using MPI_Win_lock
 > every time I use the one-sided communication may produce.
 >
 > Is it possible to use the one-sided communication without combining it
 > with synchronization calls?
 >
 > (It doesn’t seem to matter what version of openmpi I use).
 >
 > Thank you
 >
 > Claire
 >



Re: [OMPI users] Coordinating (non-overlapping) local stores with remote puts form using passive RMA synchronization

2020-06-02 Thread Joseph Schuchart via users

Hi Stephen,

Let me try to answer your questions inline (I don't have extensive 
experience with the separate model and from my experience most 
implementations support the unified model, with some exceptions):


On 5/31/20 1:31 AM, Stephen Guzik via users wrote:

Hi,

I'm trying to get a better understanding of coordinating 
(non-overlapping) local stores with remote puts when using passive 
synchronization for RMA.  I understand that the window should be locked 
for a local store, but can it be a shared lock?


Yes. There is no reason why that cannot be a shared lock.

In my example, each 
process retrieves and increments an index (indexBuf and indexWin) from a 
target process and then stores it's rank into an array (dataBuf and 
dataWin) at that index on the target.  If the target is local, a local 
store is attempted:


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM, indexWin);
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion and time synchronization */
MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for a unified memory model but would probably 
fail for a separate model (unless a separate model very cleverly merges 
a private and public window?)  Is this understanding correct?  And if I 
instead use MPI_Put for the local write, then it should be valid for 
both memory models?


Yes, if you use RMA operations even on local memory it is valid for both 
memory models.


The MPI standard on page 455 (S3) states that "a store to process memory 
to a location in a window must not start once a put or accumulate update 
to that target window has started, until the put or accumulate update 
becomes visible in process memory." So there is no clever merging and it 
is up to the user to ensure that there are no puts and stores happening 
at the same time.




Another approach is specific locks.  I don't like this because it seems 
there are excessive synchronizations.  But if I really want to mix local 
stores and remote puts, is this the only way using locks?


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM, indexWin);
    MPI_Win_unlock(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
        dataBuf[myvals[procID]] = procID;
        MPI_Win_unlock(tgtProc, dataWin);  /*(A)*/
      }
    else
      {
        MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);
        MPI_Win_unlock(tgtProc, dataWin);
      }
  }
/* Proceed with local loads */

I believe this is also valid for both memory models?  An unlock must 
have followed the last access to the local window, before the exclusive 
lock is gained.  That should have synchronized the windows and another 
synchronization should happen at (A).  Is that understanding correct? 


That is correct for both memory models, yes. It is likely to be slower 
because locking and unlocking involves some effort. You are better off 
using put instead.


If you really want to use local stores you can check whether the 
MPI_WIN_MODEL attribute of the window is MPI_WIN_UNIFIED and fall back 
to using puts only for the separate model.
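
A sketch of that check, reusing the names from your example (the 
attribute value is returned as a pointer to an int):

```
int *model = NULL, flag = 0;
MPI_Win_get_attr(dataWin, MPI_WIN_MODEL, &model, &flag);
if (flag && *model == MPI_WIN_UNIFIED) {
  dataBuf[myvals[procID]] = procID;   /* local store is allowed (rules above still apply) */
} else {
  /* separate model: use MPI_Put even for the local update */
  MPI_Put(&procID, 1, MPI_INT, procID, myvals[procID], 1, MPI_INT, dataWin);
}
```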


> If so, how does one ever get into a situation where MPI_Win_sync must 
be used?


You can think of a synchronization scheme where each process takes a 
shared lock on a window, stores data to a local location, calls 
MPI_Win_sync and signals to other processes that the data is now 
available, e.g., through a barrier or a send. In that case processes 
keep the lock and use some non-RMA synchronization instead.
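
A rough sketch of such a scheme (`win` and its local buffer `winbuf` are 
assumed; every process keeps its shared lock):

```
MPI_Win_lock_all(0, win);
winbuf[0] = procID;              /* local store into the window memory */
MPI_Win_sync(win);               /* reconcile private and public window copies */
MPI_Barrier(MPI_COMM_WORLD);     /* non-RMA signal: "my data is visible now" */
/* other processes may now read the value, e.g. with MPI_Get + flush */
MPI_Barrier(MPI_COMM_WORLD);
MPI_Win_unlock_all(win);
```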




Final question.  In the first example, let's say there is a lot of 
computation in the loop and I want the MPI_Puts to immediately make 
progress.  Would it be sensible to follow the MPI_Put with a 
MPI_Win_flush_local to get things moving?  Or is it best to avoid any 
unnecessary synchronizations?


That is highly implementation-specific. Some implementations may buffer 
the puts and delay the transfer to the flush, some may initiate it 
immediately, and some may treat a local flush similar to a regular 
flush. I would not make any assumpt

Re: [OMPI users] Vader - Where to Look for Shared Memory Use

2020-07-22 Thread Joseph Schuchart via users

Hi John,

Depending on your platform the default behavior of Open MPI is to mmap a 
shared backing file that is either located in a session directory under 
/dev/shm or under $TMPDIR (I believe under Linux it is /dev/shm). You 
will find a set of files there that are used to back shared memory. They 
should be deleted automatically at the end of a run.


What symptoms are you experiencing and on what platform?

Cheers
Joseph

On 7/22/20 10:15 AM, John Duffy via users wrote:

Hi

I’m trying to investigate an HPL Linpack scaling issue on a single node, 
increasing from 1 to 4 cores.

Regarding single node messages, I think I understand that Open-MPI will select 
the most efficient mechanism, which in this case I think should be vader shared 
memory.

But when I run Linpack, ipcs -m gives…

-- Shared Memory Segments 
key  shmid  owner  perms  bytes  nattch  status


And, ipcs -u gives…

-- Messages Status 
allocated queues = 0
used headers = 0
used space = 0 bytes

-- Shared Memory Status 
segments allocated 0
pages allocated 0
pages resident  0
pages swapped   0
Swap performance: 0 attempts 0 successes

-- Semaphore Status 
used arrays = 0
allocated semaphores = 0


Am I looking in the wrong place to see how/if vader is using shared memory? I’m 
wondering if a slower mechanism is being used.

My ompi_info includes...

MCA btl: openib (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.0.3)
MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.0.3)


Best wishes



Re: [OMPI users] MPI test suite

2020-07-24 Thread Joseph Schuchart via users

You may want to look into MTT: https://github.com/open-mpi/mtt

Cheers
Joseph

On 7/23/20 8:28 PM, Zhang, Junchao via users wrote:

Hello,
   Does OMPI have a test suite that can let me validate MPI 
implementations from other vendors?


   Thanks
--Junchao Zhang





Re: [OMPI users] Silent hangs with MPI_Ssend and MPI_Irecv

2020-07-25 Thread Joseph Schuchart via users

Hi Sean,

Thanks for the report! I have a few questions/suggestions:

1) What version of Open MPI are you using?
2) What is your network? It sounds like you are on an IB cluster using 
btl/openib (which is essentially discontinued). Can you try the Open MPI 
4.0.4 release with UCX instead of openib (configure with --without-verbs 
and --with-ucx)?
3) If that does not help, can you boil your code down to a minimum 
working example? That would make it easier for people to try to 
reproduce what happens.


Cheers
Joseph

On 7/24/20 11:34 PM, Lewis,Sean via users wrote:

Hi all,

I am encountering a silent hang involving MPI_Ssend and MPI_Irecv. The 
subroutine in question is called by each processor and is structured 
similar to the pseudo code below. The subroutine is successfully called 
several thousand times before the silent hang behavior manifests and 
never resolves. The hang will occur in nearly (but not exactly) the same 
spot for bit-wise identical tests. During the hang, all MPI ranks will 
be at the Line 18 Barrier except for two. One will be waiting at Line 
17, waiting for its Irecv to complete, and the other at one of the Ssend 
Line 9 or 14. This suggests that a MPI_Irecv never completes and a 
processor is indefinitely blocked in the Ssend unable to complete the 
transfer.


I’ve found similar discussion of this kind of behavior on the OpenMPI 
mailing list: 
https://www.mail-archive.com/users@lists.open-mpi.org/msg19227.html 
ultimately resolving in setting the mca parameter btl_openib_flags to 
304 or 305 (default 310): 
https://www.mail-archive.com/users@lists.open-mpi.org/msg19277.html. I 
have seen some promising behavior by doing the same. As the mailer 
suggests, this implies a problem with the RDMA protocols in infiniband 
for large messages.


I wanted to breathe life back into this conversation as the silent hang 
issue is particularly debilitating and confusing to me. Increasing or 
decreasing the number of processors does not seem to alleviate the 
issue, and using MPI_Send results in the same behavior; perhaps a 
message has exceeded a memory limit? I am running a test now that 
reports the individual message sizes, but a switch I previously 
implemented to check for buffer size discrepancies is not triggered. In 
the meantime, has anyone run into similar issues or have thoughts on 
remedies for this behavior?


1:  call MPI_BARRIER(…)
2:  do i = 1,nprocs
3:    if(commatrix_recv(i) .gt. 0) then ! Identify which procs to receive from via predefined matrix
4:      call Mpi_Irecv(…)
5:    endif
6:  enddo
7:  do j = mype+1,nproc
8:    if(commatrix_send(j) .gt. 0) then ! Identify which procs to send to via predefined matrix
9:      MPI_Ssend(…)
10:   endif
11: enddo
12: do j = 1,mype
13:   if(commatrix_send(j) .gt. 0) then ! Identify which procs to send to via predefined matrix
14:     MPI_Ssend(…)
15:   endif
16: enddo
17: call MPI_Waitall(…) ! Wait for all Irecv to complete
18: call MPI_Barrier(…)

Cluster information:

30 processors

Managed by slurm

OS: Red Hat v. 7.7

Thank you for help/advice you can provide,

Sean

*Sean C. Lewis*

Doctoral Candidate

Department of Physics

Drexel University



Re: [OMPI users] Is the mpi.3 manpage out of date?

2020-08-31 Thread Joseph Schuchart via users

Andy,

Thanks for pointing this out. We have a merged a fix that corrects that 
stale comment in master :)


Cheers
Joseph

On 8/25/20 8:36 PM, Riebs, Andy via users wrote:
In searching to confirm my belief that recent versions of Open MPI 
support the MPI-3.1 standard, I was a bit surprised to find this in the 
mpi.3 man page from the 4.0.2 release:


“The  outcome,  known  as  the MPI Standard, was first published in 
1993; its most recent version (MPI-2) was published in July 1997. Open 
MPI 1.2 includes all MPI 1.2-compliant and MPI 2-compliant routines.”


(For those who are manpage-averse, see < 
https://www.open-mpi.org/doc/v4.0/man3/MPI.3.php>.)


I’m willing to bet that y’all haven’t been sitting on your hands since 
Open MPI 1.2 was released!


Andy

--

Andy Riebs

andy.ri...@hpe.com

Hewlett Packard Enterprise

High Performance Computing Software Engineering

+1 404 648 9024



Re: [OMPI users] Limiting IP addresses used by OpenMPI

2020-09-01 Thread Joseph Schuchart via users

Charles,

What is the machine configuration you're running on? It seems that there 
are two MCA parameters for the tcp btl: btl_tcp_if_include and 
btl_tcp_if_exclude (see ompi_info for details). There may be other knobs 
I'm not aware of. If you're using UCX then my guess is that UCX has its 
own way to choose the network interface to be used...


Cheers
Joseph

On 9/1/20 9:35 PM, Charles Doland via users wrote:
Yes. It is not unusual to have multiple network interfaces on each host 
of a cluster. Usually there is a preference to use only one network 
interface on each host due to higher speed or throughput, or other 
considerations. It would be useful to be able to explicitly specify the 
interface to use for cases in which the MPI code does not select the 
preferred interface.


Charles Doland
charles.dol...@ansys.com 
(408) 627-6621  [x6621]

*From:* users  on behalf of John 
Hearns via users 

*Sent:* Tuesday, September 1, 2020 12:22 PM
*To:* Open MPI Users 
*Cc:* John Hearns 
*Subject:* Re: [OMPI users] Limiting IP addresses used by OpenMPI

*[External Sender]*

Charles, I recall using the I_MPI_NETMASK to choose which interface for 
MPI to use.

I guess you are asking the same question for OpenMPI?

On Tue, 1 Sep 2020 at 17:03, Charles Doland via users 
mailto:users@lists.open-mpi.org>> wrote:


Is there a way to limit the IP addresses or network interfaces used
for communication by OpenMPI? I am looking for something similar to
the I_MPI_TCP_NETMASK or I_MPI_NETMASK environment variables for
Intel MPI.

The OpenMPI documentation mentions the btl_tcp_if_include
and btl_tcp_if_exclude MCA options. These do not  appear to be
present, at least in OpenMPI v3.1.2. Is there another way to do
this? Or are these options supported in a different version?

Charles Doland
charles.dol...@ansys.com 
(408) 627-6621  [x6621]



Re: [OMPI users] mpirun on Kubuntu 20.4.1 hangs

2020-10-22 Thread Joseph Schuchart via users

Hi Jorge,

Can you try to get a stack trace of mpirun using the following command 
in a separate terminal?


sudo gdb -batch -ex "thread apply all bt" -p $(ps -C mpirun -o pid= | 
head -n 1)


Maybe that will give some insight where mpirun is hanging.

Cheers,
Joseph

On 10/21/20 9:58 PM, Jorge SILVA via users wrote:

Hello Jeff,

The program is not executed; it seems to wait for something to connect to 
(why ctrl-C twice?)


jorge@gcp26:~/MPIRUN$ mpirun -np 1 touch /tmp/foo
^C^C

jorge@gcp26:~/MPIRUN$ ls -l /tmp/foo
ls: cannot access '/tmp/foo': No such file or directory

No file is created.

In fact, my question was whether there are differences in mpirun usage 
between these versions. The

mpirun -help

command gives a different output, as expected, but I tried a lot of 
options without any success.



Le 21/10/2020 à 21:16, Jeff Squyres (jsquyres) a écrit :
There are huge differences between Open MPI v2.1.1 and v4.0.3 (i.e., 
years of development effort); it would be very hard to categorize them 
all; sorry!


What happens if you

    mpirun -np 1 touch /tmp/foo

(Yes, you can run non-MPI apps through mpirun)

Is /tmp/foo created?  (i.e., did the job run, and mpirun is somehow 
not terminating)




On Oct 21, 2020, at 12:22 PM, Jorge SILVA via users 
mailto:users@lists.open-mpi.org>> wrote:


Hello Gus,

Thank you for your answer. Unfortunately my problem is much more 
basic. I didn't try to run the program on both computers, but just to 
run something on one computer. I just installed the new OS and openmpi 
on two different computers, in the standard way, with the same result.


For example:

In kubuntu20.4.1 LTS with openmpi 4.0.3-0ubuntu

jorge@gcp26:~/MPIRUN$ cat hello.f90
 print*,"Hello World!"
end
jorge@gcp26:~/MPIRUN$ mpif90 hello.f90 -o hello
jorge@gcp26:~/MPIRUN$ ./hello
 Hello World!
jorge@gcp26:~/MPIRUN$ mpirun -np 1 hello <---here  the program hangs 
with no output

^C^Cjorge@gcp26:~/MPIRUN$

The mpirun task sleeps with no output, and only pressing ctrl-C twice 
ends the execution:


jorge   5540  0.1  0.0 44768  8472 pts/8    S+   17:54   0:00 
mpirun -np 1 hello


In kubuntu 18.04.5 LTS with openmpi 2.1.1, of course, the same 
program gives


jorge@gcp30:~/MPIRUN$ cat hello.f90
 print*, "Hello World!"
 END
jorge@gcp30:~/MPIRUN$ mpif90 hello.f90 -o hello
jorge@gcp30:~/MPIRUN$ ./hello
 Hello World!
jorge@gcp30:~/MPIRUN$ mpirun -np 1 hello
 Hello World
jorge@gcp30:~/MPIRUN$


Even just typing mpirun hangs without the usual error message.

Are there any changes between the two versions of openmpi that I 
missed? Is mpirun missing some package?


Thank you again for your help

Jorge


Le 21/10/2020 à 00:20, Gus Correa a écrit :

Hi Jorge

You may have an active firewall protecting either computer or both,
preventing mpirun from starting the connection.
Your /etc/hosts file may also not have the computer IP addresses.
You may also want to try the --hostfile option.
Likewise, the --verbose option may also help diagnose the problem.
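
For example, for the --hostfile option mentioned above, a hostfile can 
be as simple as (hostnames are placeholders):

node1 slots=4
node2 slots=4

used as: mpirun --hostfile myhosts -np 8 ./my_program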

It would help if you send the mpirun command line, the hostfile (if 
any),

error message if any, etc.


These FAQs may help diagnose and solve the problem:

https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
https://www.open-mpi.org/faq/?category=running

I hope this helps,
Gus Correa

On Tue, Oct 20, 2020 at 4:47 PM Jorge SILVA via users 
mailto:users@lists.open-mpi.org>> wrote:


Hello,

I installed kubuntu 20.4.1 with openmpi 4.0.3-0ubuntu on two
different
computers in the standard way. Compiling with mpif90 works, but
mpirun
hangs with no output on both systems. Even the mpirun command without
parameters hangs, and only typing ctrl-C twice can end the sleeping
program. Only the command

 mpirun --help

gives the usual output.

It seems to be something related to the terminal output, but the
command
worked well on Kubuntu 18.04. Is there a way to debug or fix this
problem (without re-compiling from sources, etc.)? Is it a known
problem?

Thanks,

  Jorge




--
Jeff Squyres
jsquy...@cisco.com 



Re: [OMPI users] Issue with MPI_Get_processor_name() in Cygwin

2021-02-09 Thread Joseph Schuchart via users

Martin,

The name argument to MPI_Get_processor_name is a character string of 
length at least MPI_MAX_PROCESSOR_NAME, which in OMPI is 256. You are 
providing a character string of length 200, so OMPI is free to write 
past the end of your string and into some of your stack variables, hence 
you are "losing" the values of rank and size. The issue should be gone 
if you write `char hostName[MPI_MAX_PROCESSOR_NAME];`


Cheers
Joseph

On 2/9/21 9:14 PM, Martín Morales via users wrote:

Hello,

I have what could be memory corruption with MPI_Get_processor_name() 
in Cygwin.

I'm using OMPI 4.1.0; I also tried on Linux (same OMPI version) but 
there is no issue there.

Below is an example of a trivial spawn operation. It has two programs: 
spawned and spawner.

In the spawned program, if I move the MPI_Get_processor_name() line 
below MPI_Comm_size(), I lose the values of rank and size.

In fact, I declared some other variables on the `int hostName_len, rank, 
size;` line and I lost them too.


Regards,

Martín

---

*Spawned:*

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
    int hostName_len, rank, size;
    MPI_Comm parentcomm;
    char hostName[200];

    MPI_Init( NULL, NULL );
    MPI_Comm_get_parent( &parentcomm );
    MPI_Get_processor_name(hostName, &hostName_len);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (parentcomm != MPI_COMM_NULL) {
        printf("I'm the spawned h: %s  r/s: %i/%i\n", hostName, rank, size);
    }

    MPI_Finalize();
    return 0;
}

*Spawner:*

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char ** argv){
    int processesToRun;
    MPI_Comm intercomm;

    if(argc < 2 ){
        printf("Processes number needed!\n");
        return 0;
    }
    processesToRun = atoi(argv[1]);

    MPI_Init( NULL, NULL );
    printf("Spawning from parent:...\n");
    MPI_Comm_spawn( "./spawned", MPI_ARGV_NULL, processesToRun,
                    MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm,
                    MPI_ERRCODES_IGNORE);

    MPI_Finalize();
    return 0;
}



Re: [OMPI users] OpenMPI and maker - Multiple messages

2021-02-18 Thread Joseph Schuchart via users

Thomas,

The post you are referencing suggests to run

mpiexec -mca btl ^openib -n 40 maker -help

but you are running

mpiexec -mca btl ^openib -N 5 gcc --version

which will run 5 instances of GCC. The output you're seeing is exactly 
what is expected.


I don't think anyone here can help you with getting maker to work 
correctly with MPI. I suggest you first check whether maker is actually 
configured to use MPI. If the test using maker (as suggested in the 
post) fails, that does not necessarily mean that Open MPI isn't working; 
it might also happen if maker is not correctly configured.


One way to check whether Open MPI is working correctly on your system is 
to use a simple MPI program that prints the world communicator size and 
rank. Any MPI hello world program you find online should do.
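
For example, something as small as this (an untested sketch) would do:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* every process reports its rank and the size of MPI_COMM_WORLD */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

Compile it with mpicc (e.g. `mpicc hello.c -o hello`) and run it with 
`mpiexec -n 5 ./hello`: if Open MPI works you should see five lines with 
ranks 0-4 and a size of 5, whereas five lines all reporting a size of 1 
would indicate that the processes are not being connected by the 
launcher (e.g., a mismatch between mpiexec and the MPI library the 
program was linked against).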


Cheers
Joseph

On 2/18/21 1:39 PM, Thomas Eylenbosch via users wrote:

Hello

We are trying to run maker (http://www.yandell-lab.org/software/maker.html) 
in combination with OpenMPI.


But when we try to submit a job with maker and openmpi, we see the 
following error in the log file:

--Next Contig--

#-

Another instance of maker is processing this contig!!

SeqID: chrA10

Length: 17398227

#-

According to

http://gmod.827538.n3.nabble.com/Does-maker-support-muti-processing-for-a-single-long-fasta-sequence-using-openMPI-td4061342.html

We have to run the following command mpiexec -mca btl ^openib -n 40 
maker -help


“If you get a single help message then everything is fine.  If you get 
40 help messages, then MPI is not communicating correctly.”


We are using the following command to demonstrate what is going wrong:

mpiexec -mca btl ^openib -N 5 gcc --version

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

gcc (GCC) 10.2.0

Copyright (C) 2020 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So we are getting the message 5 times; does this mean that OpenMPI is 
not correctly installed on our cluster?


We are using EasyBuild to build/install our OpenMPI module (with the 
default OpenMPI easyblock module).


Best regards / met vriendelijke groeten
Thomas Eylenbosch
DevOps Engineer (OnSite), Gluo N.V.

Currently available at BASF Belgium Coordination Center CommV
Email: thomas.eylenbo...@partners.basf.com 

Postal Address: BASF Belgium Coordination Center CommV, Technologiepark 
101, 9052 Gent Zwijnaarde, Belgium


BASF Belgium Coordination Center CommV

Scheldelaan 600, 2040 Antwerpen, België

RPR Antwerpen (afd. Antwerpen)

BTW BE0862.390.376

www.basf.be

Deutsche Bank AG

IBAN: BE43 8262 8044 4801

BIC: DEUTBEBE

Information on data protection can be found here: 
https://www.basf.com/global/en/legal/data-protection-at-basf.html




Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-08 Thread Joseph Schuchart via users

Hi Denis,

Sorry if I missed it in your previous messages but could you also try 
running a different MPI implementation (MVAPICH) to see whether Open MPI 
is at fault or the system is somehow to blame for it?


Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:


Hi

Thanks for all these informations !


But I have to confess that in this multi-tuning-parameter space,
I got somewhat lost.

Furthermore, it sometimes mixes user-space and kernel-space settings.

I only have the possibility to act on the user space.


1) So I have on the system max locked memory:

                        - ulimit -l unlimited (default)

  and I do not see any warnings/errors related to that when launching MPI.


2) I tried different algorithms for the MPI_Allreduce op, all showing a
drop in BW for size=16384.


4) I disabled openIB (no RDMA) and used only TCP, and I noticed
the same behaviour.


3) I realized that increasing the so-called warm-up parameter in the
OSU benchmark (argument -x, 200 by default) reduces the discrepancy.

On the contrary, a lower value (-x 10) can increase this BW discrepancy
up to a factor of 300 at message size 16384 compared to message size
8192, for example.

So does this mean that there are some caching effects
in the internode communication?


From my experience, tuning parameters is a time-consuming and cumbersome
task.


Could it also be that the problem is not really in the openMPI 
implementation but in the
system?


Best

Denis


*From:* users  on behalf of Gus 
Correa via users 

*Sent:* Monday, February 7, 2022 9:14:19 PM
*To:* Open MPI Users
*Cc:* Gus Correa
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking 
Infiniband network

This may have changed since, but these used to be relevant points.
Overall, the Open MPI FAQ have lots of good suggestions:
https://www.open-mpi.org/faq/
some specific for performance tuning:
https://www.open-mpi.org/faq/?category=tuning
https://www.open-mpi.org/faq/?category=openfabrics

1) Make sure you are not using the Ethernet TCP/IP, which is widely 
available in compute nodes:

mpirun  --mca btl self,sm,openib  ...

https://www.open-mpi.org/faq/?category=tuning#selecting-components

However, this may have changed lately: 
https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
2) Maximum locked memory used by IB and their system limit. Start 
here: 
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage
3) The eager vs. rendezvous message size threshold. I wonder if it may 
sit right where you see the latency spike.

https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
4) Processor and memory locality/affinity and binding (please check 
the current options and syntax)

https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4

On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users 
 wrote:


Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php

mpirun --verbose --display-map

Have you tried newer OpenMPI versions?

Do you get similar behavior for the osu_reduce and osu_gather
benchmarks?

Typically internal buffer sizes as well as your hardware will affect
performance. Can you give specifications similar to what is
available at:
http://mvapich.cse.ohio-state.edu/performance/collectives/
where the operating system, switch, node type and memory are
indicated.

If you need good performance, you may want to also specify the algorithm
used. You can find some of the parameters you can tune using:

ompi_info --all

A particularly helpful parameter is:

MCA coll tuned: parameter "coll_tuned_allreduce_algorithm" (current
value: "ignore", data source: default, level: 5 tuner/detail, type: int)
    Which allreduce algorithm is used. Can be locked down to any of:
    0 ignore, 1 basic linear, 2 nonoverlapping (tuned reduce + tuned
    bcast), 3 recursive doubling, 4 ring, 5 segmented ring
    Valid values: 0:"ignore", 1:"basic_linear", 2:"nonoverlapping",
    3:"recursive_doubling", 4:"ring", 5:"segmented_ring", 6:"rabenseifner"
MCA coll tuned: parameter "coll_tuned_allreduce_algorithm_segmentsize"
(current value: "0", data source: default, level: 5 tuner/detail, type: int)
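
For example (a sketch, not verified here; osu_allreduce stands for
whatever benchmark binary you run), forcing the ring algorithm at run
time would look something like:

mpirun --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_allreduce_algorithm 4 -np 16 ./osu_allreduce

Note that coll_tuned_use_dynamic_rules has to be set for the forced
algorithm selection to take effect.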

For OpenMPI 4.0, there is a tuning program [2] that might also be
helpful.

[1]

https://stackoverflow.com/questions/36635061/how-to-check-which-mca-parameters-are-used-in-openmpi
[2] https://github.com/open-mpi/ompi-collectives-tuning

On 2/7/22 4:49 PM, Bertini, Denis Dr. wrote:
> Hi
>
> When i repeat i always got the huge discrepancy at the
>
> message size of 16384.
>
> May be there is a way to run mpi in verbose mode in order
>
> to further investigate this behaviour?
>
> Best
>
> Denis
>
>
   

Re: [OMPI users] Using OSU benchmarks for checking Infiniband network

2022-02-11 Thread Joseph Schuchart via users
I am not aware of anything similar in Open MPI. Maybe OSU-INAM can work 
with other MPI implementations? Would be worth investigating...


Joseph

On 2/11/22 06:54, Bertini, Denis Dr. wrote:


Hi Joseph

Looking at MVAPICH I noticed that this MPI implementation provides an 
InfiniBand Network Analysis and Profiling Tool:


OSU-INAM


Is there something equivalent for openMPI?

Best

Denis



*From:* users  on behalf of Joseph 
Schuchart via users 

*Sent:* Tuesday, February 8, 2022 4:02:53 PM
*To:* users@lists.open-mpi.org
*Cc:* Joseph Schuchart
*Subject:* Re: [OMPI users] Using OSU benchmarks for checking 
Infiniband network

Hi Denis,

Sorry if I missed it in your previous messages but could you also try
running a different MPI implementation (MVAPICH) to see whether Open MPI
is at fault or the system is somehow to blame for it?

Thanks
Joseph

On 2/8/22 03:06, Bertini, Denis Dr. via users wrote:
>
> Hi
>
> Thanks for all these informations !
>
>
> But i have to confess that in this multi-tuning-parameter space,
>
> i got somehow lost.
>
> Furthermore it is somtimes mixing between user-space and kernel-space.
>
> I have only possibility to act on the user space.
>
>
> 1) So i have on the system max locked memory:
>
>                         - ulimit -l unlimited (default )
>
>   and i do not see any warnings/errors related to that when 
launching MPI.

>
>
> 2) I tried differents algorithms for MPI_all_reduce op.  all showing
> drop in
>
> bw for size=16384
>
>
> 4) I disable openIB ( no RDMA, ) and used only TCP, and i noticed
>
> the same behaviour.
>
>
> 3) i realized that increasing the so-called warm up parameter  in the
>
> OSU benchmark (argument -x 200 as default) the discrepancy.
>
> At the contrary putting lower threshold ( -x 10 ) can increase this BW
>
> discrepancy up to factor 300 at message size 16384 compare to
>
> message size 8192 for example.
>
> So does it means that there are some caching effects
>
> in the internode communication?
>
>
> From my experience, to tune parameters is a time-consuming and 
cumbersome

>
> task.
>
>
> Could it also be the problem is not really on the openMPI
> implemenation but on the
>
> system?
>
>
> Best
>
> Denis
>
> 
> *From:* users  on behalf of Gus
> Correa via users 
> *Sent:* Monday, February 7, 2022 9:14:19 PM
> *To:* Open MPI Users
> *Cc:* Gus Correa
> *Subject:* Re: [OMPI users] Using OSU benchmarks for checking
> Infiniband network
> This may have changed since, but these used to be relevant points.
> Overall, the Open MPI FAQ have lots of good suggestions:
> https://www.open-mpi.org/faq/
> some specific for performance tuning:
> https://www.open-mpi.org/faq/?category=tuning
> https://www.open-mpi.org/faq/?category=openfabrics
>
> 1) Make sure you are not using the Ethernet TCP/IP, which is widely
> available in compute nodes:
> mpirun  --mca btl self,sm,openib  ...
>
> https://www.open-mpi.org/faq/?category=tuning#selecting-components
>
> However, this may have changed lately:
> https://www.open-mpi.org/faq/?category=tcp#tcp-auto-disable
> 2) Maximum locked memory used by IB and their system limit. Start
> here:
> 
https://www.open-mpi.org/faq/?category=openfabrics#limiting-registered-memory-usage

> 3) The eager vs. rendezvous message size threshold. I wonder if it may
> sit right where you see the latency spike.
> https://www.open-mpi.org/faq/?category=all#ib-locked-pages-user
> 4) Processor and memory locality/affinity and binding (please check
> the current options and syntax)
> https://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
>
> On Mon, Feb 7, 2022 at 11:01 AM Benson Muite via users
>  wrote:
>
> Following https://www.open-mpi.org/doc/v3.1/man1/mpirun.1.php
>
> mpirun --verbose --display-map
>
> Have you tried newer OpenMPI versions?
>
> Do you get similar behavior for the osu_reduce and osu_gather
> benchmarks?
>
> Typically internal buffer sizes as well as your hardware will affect
> performance. Can you give specifications similar to what is
> available at:
> http://mvapich.cse.ohio-state.edu/performance/collectives/
> where the operating system, switch, node type and memory are
> indicated.
>
> If you need good performance, may want to also specify the algorithm
> used. You can find some of the parameters you can tune using:
>
> ompi_info --all
>
> A particular helpful parameter is:
>
> MCA coll tuned: para

Re: [OMPI users] Check equality of a value in all MPI ranks

2022-02-17 Thread Joseph Schuchart via users

Hi Niranda,

A pattern I have seen in several places is to allreduce the pair p = 
{-x,x} with MPI_MIN or MPI_MAX. If in the resulting pair p[0] == -p[1], 
then everyone has the same value. If not, at least one rank had a 
different value. Example:


```
bool is_same(int x) {
  int p[2];
  p[0] = -x;
  p[1] = x;
  MPI_Allreduce(MPI_IN_PLACE, p, 2, MPI_INT, MPI_MIN, MPI_COMM_WORLD);
  return (p[0] == -p[1]);
}
```
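
A quick usage sketch (the surrounding code is made up): `if (!is_same(x)) 
MPI_Abort(MPI_COMM_WORLD, 1);`. Note that the call is collective, so 
every rank in MPI_COMM_WORLD has to call it.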

HTH,
Joseph

On 2/17/22 16:40, Niranda Perera via users wrote:

Hi all,

Say I have some int `x`. I want to check if all MPI ranks get the same 
value for `x`. What's a good way to achieve this using MPI collectives?


The simplest I could think of is, broadcast rank0's `x`, do the 
comparison, and allreduce-LAND the comparison result. This requires 
two collective operations.

```python
...
x = ... # each rank may produce different values for x
x_bcast = comm.bcast(x, root=0)
all_equal = comm.allreduce(x==x_bcast, op=MPI.LAND)
if not all_equal:
   raise Exception()
...
```
Is there a better way to do this?


--
Niranda Perera
https://niranda.dev/
@n1r44 





[OMPI users] 1st Future of MPI RMA Workshop: Call for Short Talks and Participation

2022-05-29 Thread Joseph Schuchart via users

[Apologies if you got multiple copies of this email.]

*1st Future of MPI RMA Workshop (FoRMA'22)*

https://mpiwg-rma.github.io/forma22/

The MPI RMA Working Group is organizing a workshop aimed at gathering 
inputs from users and implementors of MPI RMA with past experiences and 
ideas for improvements, as well as hardware vendors to discuss future 
developments of high-performance network hardware relevant to MPI RMA. 
The goal is to evaluate the current design of MPI RMA, to learn from the 
current state of practice, and to start the process of rethinking the 
design of one-sided communication in the MPI standard.


The workshop will be held entirely virtual on *June 16 & 17* and the 
results and contributions will be combined into a joint white paper.


*Call for Short Talks*

We are seeking input from users of MPI RMA willing to share their 
experiences, results, and lessons learned in short talks of 10-15 
minutes. No stand-alone publications are required. In particular, the 
following topics for contributions are of interest:


- Successful and unsuccessful attempts of using MPI RMA to improve the 
communication efficiency of applications and middle-ware;
- Challenges in implementing and porting applications and middle-ware on 
top of MPI RMA;

- Features that are missing from MPI RMA.

If interested, please send a short email with the title of the talk to 
schuch...@icl.utk.edu.


*Call for Participation*

We encourage everyone interested in one-sided communication models in 
general and MPI RMA in particular to join this two day workshop and 
contribute to its success through questions and comments. We encourage a 
lively and open exchange of ideas and discussions between the speakers 
and the audience. The connection information will be posted before the 
workshop at https://mpiwg-rma.github.io/forma22/. Please direct any 
questions to schuch...@icl.utk.edu.



*Registration*

To register for the workshop (and receive the access link) please use 
the registration site at 
https://tennessee.zoom.us/meeting/register/tJ0qduChrDgsGNdEG3MQeLB-DH3lZ6r-DZww. 
All participation is free.



*Organizing committee*

Joseph Schuchart, University of Tennessee, Knoxville
James Dinan, Nvidia Inc.
Bill Gropp, University of Illinois Urbana Champaign



Re: [OMPI users] MPI_THREAD_MULTIPLE question

2022-09-10 Thread Joseph Schuchart via users

Timesir,

It sounds like you're using the 4.0.x or 4.1.x release. The one-sided 
components were cleaned up in the upcoming 5.0.x release and the 
component in question (osc/pt2pt) was removed. You could also try to 
compile Open MPI 4.0.x/4.1.x against UCX and use osc/ucx (by passing 
`--mca osc ucx` to mpirun). In either case (using UCX or switching to 
5.0.x) you should be able to run MPI RMA codes without requiring an 
RDMA-capable network.
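
For example (a sketch; paths and process counts are placeholders):
configure Open MPI with `--with-ucx=/path/to/ucx`, rebuild, and then run
with `mpirun --mca osc ucx -np 2 ./your_rma_app` to make sure the UCX
one-sided component is selected.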


HTH
Joseph

On 9/10/22 06:55, mrlong336 via users wrote:

mpirun reports the following error:

The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this 
release.

Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.

Does this error mean that the network must support RDMA if it wants 
to run distributed? Will Gigabit/10 Gigabit Ethernet work?


Best regards,

Timesir
mrlong...@gmail.com






Re: [OMPI users] Tracing of openmpi internal functions

2022-11-16 Thread Joseph Schuchart via users

Arun,

You can use a small wrapper script like this one to store the perf data 
in separate files:


```
$ cat perfwrap.sh
#!/bin/bash
exec perf record -o perf.data.$OMPI_COMM_WORLD_RANK "$@"
```

Then do `mpirun -n  ./perfwrap.sh ./a.out` to run all processes under 
perf. You can also select a subset of processes based on 
$OMPI_COMM_WORLD_RANK.
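
Each per-rank file can then be inspected separately, e.g. with
`perf report -i perf.data.0` for rank 0.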


HTH,
Joseph


On 11/16/22 09:24, Chandran, Arun via users wrote:

Hi Jeff,

Thanks, I will check flamegraphs.

Sample generation with perf could be a problem: I don't think I can do
'mpirun -np <> perf record ' and get the sampling done on all the cores
and store each core's data (perf.data) separately to analyze it. Is it
possible to do?

I came to know that AMD uProf supports individual sample collection for
MPI apps running on multiple cores; I need to investigate this further.

--Arun

From: users  On Behalf Of Jeff Squyres 
(jsquyres) via users
Sent: Monday, November 14, 2022 11:34 PM
To: users@lists.open-mpi.org
Cc: Jeff Squyres (jsquyres) ; arun c 

Subject: Re: [OMPI users] Tracing of openmpi internal functions



Open MPI uses plug-in modules for its implementations of the MPI collective 
algorithms.  From that perspective, once you understand that infrastructure, 
it's exactly the same regardless of whether the MPI job is using intra-node or 
inter-node collectives.

We don't have much in the way of detailed internal function call tracing inside 
Open MPI itself, due to performance considerations.  You might want to look 
into flamegraphs, or something similar...?

--
Jeff Squyres
mailto:jsquy...@cisco.com

From: users  on behalf of arun c via users 

Sent: Saturday, November 12, 2022 9:46 AM
To: mailto:users@lists.open-mpi.org 
Cc: arun c 
Subject: [OMPI users] Tracing of openmpi internal functions
  
Hi All,


I am new to openmpi and trying to learn the internals (source code
level) of data transfer during collective operations. At first, I will
limit it to intra-node (between cpu cores, and sockets) to minimize
the scope of learning.

What are the best options (Looking for only free and open methods) for
tracing the openmpi code? (say I want to execute alltoall collective
and trace all the function calls and event callbacks that happened
inside the libmpi.so on all the cores)

The Linux kernel has something called ftrace; it gives a neat call
graph of all the internal functions inside the kernel with timing. Is
something similar available?

--Arun




Re: [OMPI users] MPI_Get is slow with structs containing padding

2023-03-30 Thread Joseph Schuchart via users

Hi Antoine,

That's an interesting result. I believe the problem with datatypes with 
gaps is that MPI is not allowed to touch the gaps. My guess is that for 
the RMA version of the benchmark the implementation either has to revert 
back to an active message packing the data at the target and sending it 
back or (which seems more likely in your case) transfer each object 
separately and skip the gaps. Without more information on your setup 
(using UCX?) and the benchmark itself (how many elements? what does the 
target do?) it's hard to be more precise.


A possible fix would be to drop the MPI datatype for the RMA use and 
transfer the vector as a whole, using MPI_BYTE. I think there is also a 
way to modify the upper bound of the MPI type to remove the gap, using 
MPI_TYPE_CREATE_RESIZED. I expect that that will allow MPI to touch the 
gap and transfer the vector as a whole. I'm not sure about the details 
there, maybe someone can shed some light.
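
For illustration, a minimal sketch of the MPI_BYTE variant (the struct, 
the helper name and its parameters are made up for this example; it 
assumes origin and target agree on the layout of the struct, so MPI does 
no datatype conversion for you):

```c
#include <mpi.h>

typedef struct { double d; int i; } elem_t;  /* 12 bytes of data + 4 bytes of padding */

/* Fetch n elements from the target window as raw bytes, padding included,
 * instead of using a struct datatype with a gap in its typemap. */
void get_raw(elem_t *buf, int n, int target, MPI_Aint target_disp, MPI_Win win)
{
    int nbytes = n * (int)sizeof(elem_t);
    MPI_Get(buf, nbytes, MPI_BYTE, target, target_disp, nbytes, MPI_BYTE, win);
}
```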


HTH
Joseph

On 3/30/23 18:34, Antoine Motte via users wrote:


Hello everyone,

I recently had to code an MPI application where I send std::vector 
contents in a distributed environment. In order to try different 
approaches I coded both 1-sided and 2-sided point-to-point 
communication schemes, the first one uses MPI_Window and MPI_Get, the 
second one uses MPI_SendRecv.


I had a hard time figuring out why my implementation with MPI_Get was 
between 10 and 100 times slower, and I finally found out that MPI_Get 
is abnormally slow when one tries to send custom datatypes including 
padding.


Here is a short example attached, where I send a struct {double, int} 
(12 bytes of data + 4 bytes of padding) vs a struct {double, int, int} 
(16 bytes of data, 0 bytes of padding) with both MPI_SendRecv and 
MPI_Get. I got these results :


mpirun -np 4 ./compareGetWithSendRecv
{double, int} SendRecv : 0.0303547 s
{double, int} Get : 1.9196 s
{double, int, int} SendRecv : 0.0164659 s
{double, int, int} Get : 0.0147757 s

I ran it with both Open MPI 4.1.2 and Intel MPI 2021.6 and got 
the same results.


Is this result normal? Do I have any solution other than adding 
garbage at the end of the struct or at the end of the MPI_Datatype to 
avoid padding?


Regards,

Antoine Motte





Re: [OMPI users] A make error when build openmpi-5.0.0 using the gcc 14.0.0 (experimental) compiler

2023-12-19 Thread Joseph Schuchart via users
Thanks for the report Jorge! I opened a ticket to track the build issues 
with GCC-14: https://github.com/open-mpi/ompi/issues/12169


Hopefully we will have Open MPI build with GCC-14 before it is released.

Cheers,
Joseph

On 12/17/23 06:03, Jorge D'Elia via users wrote:

Hi there,

I already overcame this problem: simply by using the gcc version (GCC) 
13.2.1 that comes with the Fedora 39 distribution, the openmpi build is 
now fine again, as it (almost) always is.

Greetings.
Jorge.

--
Jorge D'Elia via users  escribió:


Hi,

On a x86_64-pc-linux-gnu machine with Fedora 39:

$ uname -a
Linux amaral 6.6.6-200.fc39.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Dec 11 
17:29:08 UTC 2023 x86_64 GNU/Linux


and using:

$ gcc --version
gcc (GCC) 14.0.0 20231216 (experimental)

we tried to upgrade to the openmpi distribution:

41968409 Dec 16 09:15 openmpi-5.0.0.tar.gz

using the configuration flags (already used in previous versions of 
openmpi):


$ ../configure --enable-ipv6 --enable-sparse-groups --enable-mpi-ext 
--enable-oshmem --with-libevent=internal --with-hwloc=internal 
--with-ucx --with-pmix=internal --without-libfabric 
--prefix=${PREFIX} 2>&1 | tee configure.eco


$ make -j4 all 2>&1 | tee make-all.eco

but, today, we have the following make error:

...
/home/bigpack/openmpi-paq/openmpi-5.0.0/3rd-party/openpmix/include/pmix_deprecated.h:851:32: 
error: passing argument 2 of ‘PMIx_Data_buffer_unload’ from 
incompatible pointer type [-Wincompatible-pointer-types]

PMIx_Data_buffer_unload(b, &(d), &(s)) void **

We attached the configure.echo and make-all.echo files in a *.tgz 
compressed file.


Please, any clue on how to fix this? Thanks in advance.

Regards.
Jorge D'Elia.--






Re: [OMPI users] Seg error when using v5.0.1

2024-01-30 Thread Joseph Schuchart via users

Hello,

This looks like memory corruption. Do you have more details on what your 
app is doing? I don't see any MPI calls inside the call stack. Could you 
rebuild Open MPI with debug information enabled (by adding 
`--enable-debug` to configure)? If this error occurs on singleton runs 
(1 process) then you can easily attach gdb to it to get a better stack 
trace. Also, valgrind may help pin down the problem by telling you which 
memory block is being free'd here.
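
For example (paths and the app name are placeholders): reconfigure with 
`./configure --enable-debug ...`, rebuild and reinstall, then run the 
failing case as `mpirun -np 1 valgrind ./your_app`, or start it normally 
and attach with `gdb -p <pid>` to get a readable backtrace.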


Thanks
Joseph

On 1/30/24 07:41, afernandez via users wrote:

Hello,
I upgraded one of the systems to v5.0.1 and have compiled everything 
exactly as dozens of previous times with v4. I wasn't expecting any 
issue (and the compilations didn't report anything out of the ordinary) 
but running several apps has resulted in error messages such as:

Backtrace for this error:
#0  0x7f7c9571f51f in ???
        at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
#1  0x7f7c957823fe in __GI___libc_free
        at ./malloc/malloc.c:3368
#2  0x7f7c93a635c3 in ???
#3  0x7f7c95f84048 in ???
#4  0x7f7c95f1cef1 in ???
#5  0x7f7c95e34b7b in ???
#6  0x6e05be in ???
#7  0x6e58d7 in ???
#8  0x405d2c in ???
#9  0x7f7c95706d8f in __libc_start_call_main
        at ../sysdeps/nptl/libc_start_call_main.h:58
#10  0x7f7c95706e3f in __libc_start_main_impl
        at ../csu/libc-start.c:392
#11  0x405d64 in ???
#12  0x in ???
OS is Ubuntu 22.04, OpenMPI was built with GCC13.2, and before 
building OpenMPI, I had previously built the hwloc (2.10.0) library at 
/usr/lib/x86_64-linux-gnu. Maybe I'm missing something pretty basic, 
but the problem seems to be related to memory allocation.

Thanks.