Re: [hpx-users] Segmentation fault with mpi4py

2018-10-24 Thread Thomas Heller
Hi,

From looking at the stack trace, it is a segfault coming directly out of
MPI, which is then caught by our signal handlers.
In theory, there shouldn't be any problem with having multiple MPI
libraries running within HPX. The HPX parcelport tries to be a good citizen
and creates its own communicator. The problematic part, however, might be
that you either have multiple calls to MPI_Init (HPX itself should handle
that correctly) or that the MPI implementation you are using is not thread
safe. Keep in mind that HPX drives MPI from all of its worker threads.
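
As a quick sanity check for the thread-safety point, a tiny standalone
program can ask the MPI library which thread-support level it actually
grants; this uses only standard MPI calls, nothing HPX-specific:

    // Standalone check of the granted MPI thread-support level.
    // Plain MPI API only; nothing here is HPX-specific.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char* argv[])
    {
        int provided = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE)
            std::printf("granted level %d only; concurrent MPI calls "
                        "from multiple threads are not safe\n", provided);
        MPI_Finalize();
        return 0;
    }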
To keep non-thread-safe MPI implementations from crashing, we use a lock to
protect each and every call into MPI (
https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/plugins/parcelport/mpi/mpi_environment.hpp#L42).
If you take that lock around your mpi4py calls as well, it might just work.
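
For illustration, a minimal C++ sketch of taking that lock around a direct
MPI call; the scoped_lock type is assumed to be the one defined in the
header linked above, so check the exact name and namespace against your
HPX version:

    // Sketch: serialize a direct MPI call against HPX's own parcelport
    // traffic. scoped_lock is assumed from the header linked above.
    #include <hpx/plugins/parcelport/mpi/mpi_environment.hpp>
    #include <mpi.h>

    void guarded_bcast(void* buf, int count, MPI_Datatype type,
                       int root, MPI_Comm comm)
    {
        // Only one thread at a time may be inside MPI while this is held.
        hpx::util::mpi_environment::scoped_lock lock;
        MPI_Bcast(buf, count, type, root, comm);
    }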

The suspension of the runtime should work as well. As soon as all worker
threads are suspended, there won't be any calls to MPI anymore. There still
might be incoming messages from other localities, but that shouldn't be a
problem.
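
A rough sketch of how that could look from the embedding side, using
hpx::suspend()/hpx::resume(); the start/stop boilerplate here is
simplified and should be checked against init_globally.cpp for your HPX
version:

    // Sketch: quiesce HPX worker threads around external MPI work.
    // Simplified start/stop; compare with init_globally.cpp.
    #include <hpx/hpx_start.hpp>
    #include <hpx/include/apply.hpp>
    #include <hpx/include/runtime.hpp>

    int main(int argc, char* argv[])
    {
        hpx::start(nullptr, argc, argv);  // start runtime, don't block
        // ... run HPX work ...
        hpx::suspend();   // all worker threads go to sleep
        // ... safe window for direct MPI calls (e.g. via mpi4py) ...
        hpx::resume();    // wake the workers again
        hpx::apply([]() { hpx::finalize(); });
        return hpx::stop();
    }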

I hope that sheds some light on that problem.



Re: [hpx-users] Segmentation fault with mpi4py

2018-10-23 Thread Simberg Mikael
Hi,

hopefully someone else can chime in on the MPI and Python side of things, but 
thought I'd comment shortly on the runtime suspension since I implemented it.

The reason for requiring only a single locality for runtime suspension is 
simply that I never tested it with multiple localities. It may very well 
already work with multiple localities, but I didn't want users to get the 
impression that it's a well-tested feature. So if this is indeed useful for 
you, you could try removing the check (you probably already found it; let me 
know if that's not the case) and rebuilding HPX.

I suspect though that runtime suspension won't help you here since it doesn't 
actually disable MPI or anything else. All it does is put the HPX worker 
threads to sleep once all work is completed.

In this case there might be a problem with our MPI parcelport interfering with 
mpi4py. It's not entirely clear to me if you want to use the networking 
features of HPX in addition to MPI. If not you can also build HPX with 
HPX_WITH_NETWORKING=OFF which will... disable networking. This branch is also 
meant to disable some networking related features at runtime if you're only 
using one locality: https://github.com/STEllAR-GROUP/hpx/pull/3486.
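
For reference, that flag is a regular CMake option, so a reconfigure would
look roughly like this (placeholders stand in for your existing options and
source path):

    cmake -DHPX_WITH_NETWORKING=OFF <your other options> <hpx-source-dir>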

Kind regards,
Mikael


[hpx-users] Segmentation fault with mpi4py

2018-10-23 Thread Vance, James
Hi everyone,

I am trying to gradually port the molecular dynamics code Espresso++ from its 
current pure-MPI form to one that uses HPX for the critical parts of the code. 
It consists of a C++ and MPI-based shared library that can be imported into 
Python using the boost.python library, a collection of Python modules, and an 
mpi4py-based library for communication among the Python processes.

I was able to properly initialize and terminate the HPX runtime environment 
from Python using the methods in hpx/examples/quickstart/init_globally.cpp and 
phylanx/python/src/init_hpx.cpp. However, when I use mpi4py to perform 
MPI-based communication from within a Python script that also runs HPX, I 
encounter a segmentation fault with the following trace:

-
{stack-trace}: 21 frames:
0x2abc616b08f2  : ??? + 0x2abc616b08f2 in 
/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc616ad06c  : hpx::termination_handler(int) + 0x15c in 
/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc5979b370  : ??? + 0x2abc5979b370 in /lib64/libpthread.so.0
0x2abc62755a76  : mca_pml_cm_recv_request_completion + 0xb6 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc626f4ac9  : ompi_mtl_psm2_progress + 0x59 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc63383eec  : opal_progress + 0x3c in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20
0x2abc62630a75  : ompi_request_default_wait + 0x105 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267be92  : ompi_coll_base_bcast_intra_generic + 0x5b2 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267c262  : ompi_coll_base_bcast_intra_binomial + 0xb2 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6268803b  : ompi_coll_tuned_bcast_intra_dec_fixed + 0xcb in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc62642bc0  : PMPI_Bcast + 0x1a0 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc64cea17f  : ??? + 0x2abc64cea17f in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so
0x2abc59176f9b  : PyEval_EvalFrameEx + 0x923b in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5917879a  : PyEval_EvalCodeEx + 0x87a in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59178ba9  : PyEval_EvalCode + 0x19 in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919cb4a  : PyRun_FileExFlags + 0x8a in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919df25  : PyRun_SimpleFileExFlags + 0xd5 in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc591b44e1  : Py_Main + 0xc61 in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59bccb35  : __libc_start_main + 0xf5 in /lib64/libc.so.6
0x40071e: ??? + 0x40071e in python
{what}: Segmentation fault
{config}:
  HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
  HPX_WITH_APEX=OFF
  HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
  HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON
  HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE
  HPX_WITH_DEPRECATION_WARNINGS=ON
  HPX_WITH_GOOGLE_PERFTOOLS=OFF
  HPX_WITH_INCLUSIVE_SCAN_COMPATIBILITY=ON
  HPX_WITH_IO_COUNTERS=ON
  HPX_WITH_IO_POOL=ON
  HPX_WITH_ITTNOTIFY=OFF
  HPX_WITH_LOGGING=ON
  HPX_WITH_MORE_THAN_64_THREADS=OFF
  HPX_WITH_NATIVE_TLS=ON
  HPX_WITH_NETWORKING=ON
  HPX_WITH_PAPI=OFF
  HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
  HPX_WITH_PARCELPORT_LIBFABRIC=OFF
  HPX_WITH_PARCELPORT_MPI=ON
  HPX_WITH_PARCELPORT_MPI_MULTITHREADED=ON
  HPX_WITH_PARCELPORT_TCP=ON
  HPX_WITH_PARCELPORT_VERBS=OFF
  HPX_WITH_PARCEL_COALESCING=ON
  HPX_WITH_PARCEL_PROFILING=OFF
  HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
  HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
  HPX_WITH_STACKTRACES=ON
  HPX_WITH_SWAP_CONTEXT_EMULATION=OFF
  HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF
  HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
  HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
  HPX_WITH_THREAD_DEBUG_INFO=OFF
  HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
  HPX_WITH_THREAD_GUARD_PAGE=ON
  HPX_WITH_THREAD_IDLE_RATES=ON
  HPX_WITH_THREAD_LOCAL_STORAGE=OFF
  HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
  HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
  HPX_WITH_THREAD_STACK_MMAP=ON
  HPX_WITH_THREAD_STEALING_COUNTS=ON
  HPX_WITH_THREAD_TARGET_ADDRESS=OFF
  HPX_WITH_TIMER_POOL=ON
  HPX_WITH_TUPLE_RVALUE_SWAP=ON
  HPX_WITH_UNWRAPPED_COMPATIBILITY=ON
  HPX_WITH_VALGRIND=OFF