Hi, hopefully someone else can chime in on the MPI and Python side of things, but I thought I'd comment briefly on runtime suspension, since I implemented it.
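For context, the usage pattern for runtime suspension is roughly the following (a sketch only, assuming the hpx::suspend()/hpx::resume() API from <hpx/hpx_suspend.hpp> introduced with HPX 1.1; it needs HPX installed to build):

```cpp
// Sketch of how runtime suspension is meant to be used: put the HPX
// worker threads to sleep around a region of non-HPX work (for example,
// plain MPI communication driven from the host application).
#include <hpx/hpx_suspend.hpp>

void mpi_only_region()
{
    // Blocks until all scheduled HPX work has completed, then puts the
    // worker threads to sleep. Note that this does not shut down the
    // parcelports, MPI, or anything else.
    hpx::suspend();

    // ... plain MPI calls (e.g. an MPI_Bcast issued via mpi4py) ...

    // Wake the worker threads up again before scheduling more HPX work.
    hpx::resume();
}
```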
The reason for requiring only a single locality for runtime suspension is simply that I never tested it with multiple localities. It may very well already work with multiple localities, but I didn't want users to get the impression that it's a well-tested feature. So if this is indeed useful for you, you could try removing the check (you have probably already found it; let me know if that's not the case) and rebuilding HPX.

I suspect, though, that runtime suspension won't help you here, since it doesn't actually disable MPI or anything else. All it does is put the HPX worker threads to sleep once all work is completed. In this case there might be a problem with our MPI parcelport interfering with mpi4py. It's not entirely clear to me whether you want to use the networking features of HPX in addition to MPI. If not, you can also build HPX with HPX_WITH_NETWORKING=OFF, which will, as the name suggests, disable networking. This branch is also meant to disable some networking-related features at runtime if you're only using one locality: https://github.com/STEllAR-GROUP/hpx/pull/3486.

Kind regards,
Mikael

________________________________
From: hpx-users-boun...@stellar.cct.lsu.edu [hpx-users-boun...@stellar.cct.lsu.edu] on behalf of Vance, James [va...@uni-mainz.de]
Sent: Tuesday, October 23, 2018 4:38 PM
To: hpx-users@stellar.cct.lsu.edu
Subject: [hpx-users] Segmentation fault with mpi4py

Hi everyone,

I am trying to gradually port the molecular dynamics code Espresso++ from its current pure-MPI form to one that uses HPX for the critical parts of the code. It consists of a C++ and MPI-based shared library that can be imported into Python using the Boost.Python library, a collection of Python modules, and an mpi4py-based library for communication among the Python processes. I was able to properly initialize and terminate the HPX runtime environment from Python using the methods in hpx/examples/quickstart/init_globally.cpp and phylanx/python/src/init_hpx.cpp.
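The init_globally-style startup mentioned above boils down to something like the following (a sketch only, based on the approach in hpx/examples/quickstart/init_globally.cpp; the real example adds synchronization, configuration settings, and error handling, so treat the exact hpx::start overload and argv handling here as assumptions to check against the example):

```cpp
// Sketch: start the HPX runtime from a host application (such as a
// Python extension module) that does not own main(), and tie the
// runtime's lifetime to a global object in the shared library.
#include <hpx/hpx_start.hpp>
#include <hpx/include/apply.hpp>

struct manage_global_runtime
{
    manage_global_runtime()
    {
        // Hypothetical argv[0]; a real module would forward proper args.
        static char app[] = "host_app";
        static char* argv[] = { app, nullptr };

        // Passing nullptr as the entry point starts the runtime without
        // an hpx_main; hpx::start returns immediately and the runtime
        // keeps running in the background while the host (e.g. the
        // Python interpreter) continues.
        hpx::start(nullptr, 1, argv);
    }

    ~manage_global_runtime()
    {
        // Request shutdown from an HPX thread, then wait for the
        // runtime to actually stop.
        hpx::apply([]() { hpx::finalize(); });
        hpx::stop();
    }
};

// Constructed on library load, destroyed on unload.
manage_global_runtime init;
```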
However, when I use mpi4py to perform MPI-based communication from within a Python script that also runs HPX, I encounter a segmentation fault with the following trace:

---------------------------------
{stack-trace}: 21 frames:
0x2abc616b08f2 : ??? + 0x2abc616b08f2 in /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc616ad06c : hpx::termination_handler(int) + 0x15c in /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc5979b370 : ??? + 0x2abc5979b370 in /lib64/libpthread.so.0
0x2abc62755a76 : mca_pml_cm_recv_request_completion + 0xb6 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc626f4ac9 : ompi_mtl_psm2_progress + 0x59 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc63383eec : opal_progress + 0x3c in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20
0x2abc62630a75 : ompi_request_default_wait + 0x105 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267be92 : ompi_coll_base_bcast_intra_generic + 0x5b2 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267c262 : ompi_coll_base_bcast_intra_binomial + 0xb2 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6268803b : ompi_coll_tuned_bcast_intra_dec_fixed + 0xcb in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc62642bc0 : PMPI_Bcast + 0x1a0 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc64cea17f : ??? + 0x2abc64cea17f in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so
0x2abc59176f9b : PyEval_EvalFrameEx + 0x923b in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5917879a : PyEval_EvalCodeEx + 0x87a in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59178ba9 : PyEval_EvalCode + 0x19 in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919cb4a : PyRun_FileExFlags + 0x8a in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919df25 : PyRun_SimpleFileExFlags + 0xd5 in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc591b44e1 : Py_Main + 0xc61 in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59bccb35 : __libc_start_main + 0xf5 in /lib64/libc.so.6
0x40071e : ???
+ 0x40071e in python
{what}: Segmentation fault
{config}:
HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF
HPX_WITH_APEX=OFF
HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF
HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON
HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE
HPX_WITH_DEPRECATION_WARNINGS=ON
HPX_WITH_GOOGLE_PERFTOOLS=OFF
HPX_WITH_INCLUSIVE_SCAN_COMPATIBILITY=ON
HPX_WITH_IO_COUNTERS=ON
HPX_WITH_IO_POOL=ON
HPX_WITH_ITTNOTIFY=OFF
HPX_WITH_LOGGING=ON
HPX_WITH_MORE_THAN_64_THREADS=OFF
HPX_WITH_NATIVE_TLS=ON
HPX_WITH_NETWORKING=ON
HPX_WITH_PAPI=OFF
HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF
HPX_WITH_PARCELPORT_LIBFABRIC=OFF
HPX_WITH_PARCELPORT_MPI=ON
HPX_WITH_PARCELPORT_MPI_MULTITHREADED=ON
HPX_WITH_PARCELPORT_TCP=ON
HPX_WITH_PARCELPORT_VERBS=OFF
HPX_WITH_PARCEL_COALESCING=ON
HPX_WITH_PARCEL_PROFILING=OFF
HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF
HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF
HPX_WITH_STACKTRACES=ON
HPX_WITH_SWAP_CONTEXT_EMULATION=OFF
HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF
HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF
HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON
HPX_WITH_THREAD_DEBUG_INFO=OFF
HPX_WITH_THREAD_DESCRIPTION_FULL=OFF
HPX_WITH_THREAD_GUARD_PAGE=ON
HPX_WITH_THREAD_IDLE_RATES=ON
HPX_WITH_THREAD_LOCAL_STORAGE=OFF
HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON
HPX_WITH_THREAD_QUEUE_WAITTIME=OFF
HPX_WITH_THREAD_STACK_MMAP=ON
HPX_WITH_THREAD_STEALING_COUNTS=ON
HPX_WITH_THREAD_TARGET_ADDRESS=OFF
HPX_WITH_TIMER_POOL=ON
HPX_WITH_TUPLE_RVALUE_SWAP=ON
HPX_WITH_UNWRAPPED_COMPATIBILITY=ON
HPX_WITH_VALGRIND=OFF
HPX_WITH_VERIFY_LOCKS=OFF
HPX_WITH_VERIFY_LOCKS_BACKTRACE=OFF
HPX_WITH_VERIFY_LOCKS_GLOBALLY=OFF
HPX_PARCEL_MAX_CONNECTIONS=512
HPX_PARCEL_MAX_CONNECTIONS_PER_LOCALITY=4
HPX_AGAS_LOCAL_CACHE_SIZE=4096
HPX_HAVE_MALLOC=JEMALLOC
HPX_PREFIX (configured)=/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install
HPX_PREFIX=/lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install
{version}: V1.1.0-rc1 (AGAS: V3.0), Git: unknown
{boost}:
V1.65.1
{build-type}: release
{date}: Sep 25 2018 11:01:34
{platform}: linux
{compiler}: GNU C++ version 6.3.0
{stdlib}: GNU libstdc++ version 20161221
[login21:18535] *** Process received signal ***
[login21:18535] Signal: Aborted (6)
[login21:18535] Signal code: (-6)
[login21:18535] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2abc5979b370]
[login21:18535] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2abc59be01d7]
[login21:18535] [ 2] /lib64/libc.so.6(abort+0x148)[0x2abc59be18c8]
[login21:18535] [ 3] /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1(_ZN3hpx19termination_handlerEi+0x213)[0x2abc616ad123]
[login21:18535] [ 4] /lib64/libpthread.so.0(+0xf370)[0x2abc5979b370]
[login21:18535] [ 5] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(mca_pml_cm_recv_request_completion+0xb6)[0x2abc62755a76]
[login21:18535] [ 6] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_mtl_psm2_progress+0x59)[0x2abc626f4ac9]
[login21:18535] [ 7] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20(opal_progress+0x3c)[0x2abc63383eec]
[login21:18535] [ 8] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_request_default_wait+0x105)[0x2abc62630a75]
[login21:18535] [ 9] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x5b2)[0x2abc6267be92]
[login21:18535] [10] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+0xb2)[0x2abc6267c262]
[login21:18535] [11] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(ompi_coll_tuned_bcast_intra_dec_fixed+0xcb)[0x2abc6268803b]
[login21:18535] [12] /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20(PMPI_Bcast+0x1a0)[0x2abc62642bc0]
[login21:18535] [13] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so(+0xa517f)[0x2abc64cea17f]
[login21:18535] [14] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x923b)[0x2abc59176f9b]
[login21:18535] [15] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x87a)[0x2abc5917879a]
[login21:18535] [16] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x19)[0x2abc59178ba9]
[login21:18535] [17] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x8a)[0x2abc5919cb4a]
[login21:18535] [18] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0xd5)[0x2abc5919df25]
[login21:18535] [19] /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0(Py_Main+0xc61)[0x2abc591b44e1]
[login21:18535] [20] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2abc59bccb35]
[login21:18535] [21] python[0x40071e]
[login21:18535] *** End of error message ***
---------------------------------

I think this error is related to https://github.com/STEllAR-GROUP/hpx/issues/949 and https://github.com/STEllAR-GROUP/hpx/pull/3129, so maybe the suspend and resume functions could be used. However, the documentation says this can only be done with one locality. Does anyone know of a way for interprocess communication to still be possible within Python, separately from the communication layer provided by HPX?

Thanks!

Best Regards,
James Vance
_______________________________________________
hpx-users mailing list
hpx-users@stellar.cct.lsu.edu
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users