Hi,

Hopefully someone else can chime in on the MPI and Python side of things, but I 
thought I'd comment briefly on the runtime suspension since I implemented it.

The reason for requiring only a single locality for runtime suspension is 
simply that I never tested it with multiple localities. It may very well 
already work with multiple localities, but I didn't want users to get the 
impression that it's a well-tested feature. So if this is indeed useful for you, 
you could try removing the check (you probably already found it, let me know if 
that's not the case) and rebuilding HPX.

I suspect, though, that runtime suspension won't help you here, since it doesn't 
actually disable MPI or anything else. All it does is put the HPX worker 
threads to sleep once all work is completed.
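
For reference, the suspension API itself is quite small. Here is a rough, 
untested sketch of how it is meant to be used (assuming the runtime was already 
started elsewhere, e.g. via the init_globally approach you mention, that 
hpx/hpx_suspend.hpp is the right header for your HPX version, and with 
do_mpi_phase standing in for whatever MPI work you want to do):

    #include <hpx/hpx_suspend.hpp>

    void do_mpi_phase();   // hypothetical placeholder for the mpi4py/MPI work

    void run_mixed_phase()
    {
        // the HPX runtime was started earlier and all HPX work has finished;
        // suspend()/resume() must be called from outside the HPX runtime,
        // i.e. from the main (Python) thread, not from an HPX thread
        hpx::suspend();    // put the HPX worker threads to sleep

        do_mpi_phase();    // plain MPI communication while HPX is idle

        hpx::resume();     // wake the worker threads up again
    }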

In this case there might be a problem with our MPI parcelport interfering with 
mpi4py. It's not entirely clear to me whether you want to use the networking 
features of HPX in addition to MPI. If not, you can also build HPX with 
HPX_WITH_NETWORKING=OFF, which will... disable networking. This pull request is 
also meant to disable some networking-related features at runtime if you're 
only using one locality: https://github.com/STEllAR-GROUP/hpx/pull/3486.

Kind regards,
Mikael
________________________________
From: hpx-users-boun...@stellar.cct.lsu.edu 
[hpx-users-boun...@stellar.cct.lsu.edu] on behalf of Vance, James 
[va...@uni-mainz.de]
Sent: Tuesday, October 23, 2018 4:38 PM
To: hpx-users@stellar.cct.lsu.edu
Subject: [hpx-users] Segmentation fault with mpi4py

Hi everyone,

I am trying to gradually port the molecular dynamics code Espresso++ from its 
current pure-MPI form to one that uses HPX for the critical parts of the code. 
It consists of a C++ and MPI-based shared library that can be imported into 
Python using the Boost.Python library, a collection of Python modules, and an 
mpi4py-based library for communication among the Python processes.

I was able to properly initialize and terminate the HPX runtime environment 
from Python using the approach in hpx/examples/quickstart/init_globally.cpp and 
phylanx/python/src/init_hpx.cpp. However, when I use mpi4py to perform 
MPI-based communication from within a Python script that also runs HPX, I 
encounter a segmentation fault with the following trace:

---------------------------------
{stack-trace}: 21 frames:
0x2abc616b08f2  : ??? + 0x2abc616b08f2 in 
/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc616ad06c  : hpx::termination_handler(int) + 0x15c in 
/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc5979b370  : ??? + 0x2abc5979b370 in /lib64/libpthread.so.0
0x2abc62755a76  : mca_pml_cm_recv_request_completion + 0xb6 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc626f4ac9  : ompi_mtl_psm2_progress + 0x59 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc63383eec  : opal_progress + 0x3c in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20
0x2abc62630a75  : ompi_request_default_wait + 0x105 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267be92  : ompi_coll_base_bcast_intra_generic + 0x5b2 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267c262  : ompi_coll_base_bcast_intra_binomial + 0xb2 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6268803b  : ompi_coll_tuned_bcast_intra_dec_fixed + 0xcb in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc62642bc0  : PMPI_Bcast + 0x1a0 in 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc64cea17f  : ??? + 0x2abc64cea17f in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so
0x2abc59176f9b  : PyEval_EvalFrameEx + 0x923b in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5917879a  : PyEval_EvalCodeEx + 0x87a in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59178ba9  : PyEval_EvalCode + 0x19 in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919cb4a  : PyRun_FileExFlags + 0x8a in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919df25  : PyRun_SimpleFileExFlags + 0xd5 in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc591b44e1  : Py_Main + 0xc61 in 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59bccb35  : __libc_start_main + 0xf5 in /lib64/libc.so.6
0x40071e        : ??? + 0x40071e in python
{what}: Segmentation fault
{config}:
  HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
  HPX_WITH_APEX=OFF
  HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
  HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON
  HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE
  HPX_WITH_DEPRECATION_WARNINGS=ON
  HPX_WITH_GOOGLE_PERFTOOLS=OFF
  HPX_WITH_INCLUSIVE_SCAN_COMPATIBILITY=ON
  HPX_WITH_IO_COUNTERS=ON
  HPX_WITH_IO_POOL=ON
  HPX_WITH_ITTNOTIFY=OFF
  HPX_WITH_LOGGING=ON
  HPX_WITH_MORE_THAN_64_THREADS=OFF
  HPX_WITH_NATIVE_TLS=ON
  HPX_WITH_NETWORKING=ON
  HPX_WITH_PAPI=OFF
  HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
  HPX_WITH_PARCELPORT_LIBFABRIC=OFF
  HPX_WITH_PARCELPORT_MPI=ON
  HPX_WITH_PARCELPORT_MPI_MULTITHREADED=ON
  HPX_WITH_PARCELPORT_TCP=ON
  HPX_WITH_PARCELPORT_VERBS=OFF
  HPX_WITH_PARCEL_COALESCING=ON
  HPX_WITH_PARCEL_PROFILING=OFF
  HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
  HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
  HPX_WITH_STACKTRACES=ON
  HPX_WITH_SWAP_CONTEXT_EMULATION=OFF
  HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF
  HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
  HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
  HPX_WITH_THREAD_DEBUG_INFO=OFF
  HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
  HPX_WITH_THREAD_GUARD_PAGE=ON
  HPX_WITH_THREAD_IDLE_RATES=ON
  HPX_WITH_THREAD_LOCAL_STORAGE=OFF
  HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
  HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
  HPX_WITH_THREAD_STACK_MMAP=ON
  HPX_WITH_THREAD_STEALING_COUNTS=ON
  HPX_WITH_THREAD_TARGET_ADDRESS=OFF
  HPX_WITH_TIMER_POOL=ON
  HPX_WITH_TUPLE_RVALUE_SWAP=ON
  HPX_WITH_UNWRAPPED_COMPATIBILITY=ON
  HPX_WITH_VALGRIND=OFF
  HPX_WITH_VERIFY_LOCKS=OFF
  HPX_WITH_VERIFY_LOCKS_BACKTRACE=OFF
  HPX_WITH_VERIFY_LOCKS_GLOBALLY=OFF

  HPX_PARCEL_MAX_CONNECTIONS=512
  HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
  HPX_AGAS_LOCAL_CACHE_SIZE=4096
  HPX_HAVE_MALLOC=JEMALLOC
  HPX_PREFIX 
(configured)=/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install
  
HPX_PREFIX=/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install
{version}: V1.1.0-rc1 (AGAS: V3.0), Git: unknown
{boost}: V1.65.1
{build-type}: release
{date}: Sep 25 2018 11:01:34
{platform}: linux
{compiler}: GNU C++ version 6.3.0
{stdlib}: GNU libstdc++ version 20161221
[login21:18535] *** Process received signal ***
[login21:18535] Signal: Aborted (6)
[login21:18535] Signal code:  (-6)
[login21:18535] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2abc5979b370]
[login21:18535] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2abc59be01d7]
[login21:18535] [ 2] /lib64/libc.so.6(abort+0x148)[0x2abc59be18c8]
[login21:18535] [ 3] 
/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1(_ZN3hpx19termination_handlerEi+0x213)[0x2abc616ad123]
[login21:18535] [ 4] /lib64/libpthread.so.0(+0xf370)[0x2abc5979b370]
[login21:18535] [ 5] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(mca_pml_cm_recv_request_completion+0xb6)[0x2abc62755a76]
[login21:18535] [ 6] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_mtl_psm2_progress+0x59)[0x2abc626f4ac9]
[login21:18535] [ 7] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20(opal_progress+0x3c)[0x2abc63383eec]
[login21:18535] [ 8] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_request_default_wait+0x105)[0x2abc62630a75]
[login21:18535] [ 9] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x5b2)[0x2abc6267be92]
[login21:18535] [10] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0xb2)[0x2abc6267c262]
[login21:18535] [11] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_tuned_bcast_intra_dec_fixed+0xcb)[0x2abc6268803b]
[login21:18535] [12] 
/cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(PMPI_Bcast+0x1a0)[0x2abc62642bc0]
[login21:18535] [13] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so(+0xa517f)[0x2abc64cea17f]
[login21:18535] [14] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x923b)[0x2abc59176f9b]
[login21:18535] [15] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x87a)[0x2abc5917879a]
[login21:18535] [16] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x2abc59178ba9]
[login21:18535] [17] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x8a)[0x2abc5919cb4a]
[login21:18535] [18] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xd5)[0x2abc5919df25]
[login21:18535] [19] 
/cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(Py_Main+0xc61)[0x2abc591b44e1]
[login21:18535] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2abc59bccb35]
[login21:18535] [21] python[0x40071e]
[login21:18535] *** End of error message ***
---------------------------------

I think this error is related to 
https://github.com/STEllAR-GROUP/hpx/issues/949 and 
https://github.com/STEllAR-GROUP/hpx/pull/3129, so maybe the suspend and resume 
functions could be used. However, the documentation says this can only be done 
with a single locality.

Does anyone know of a way to still perform interprocess communication from 
within Python, separately from the communication layer provided by HPX? Thanks!

Best Regards,

James Vance


_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users