Re: [hpx-users] Segmentation fault with mpi4py
Hi,

From looking at the stack trace, it is a segfault coming directly out of MPI, which is then caught by our signal handlers. In theory, there shouldn't be any problem with having multiple MPI libraries running within HPX. The HPX parcelport tries to be a good citizen and creates its own communicator. The problematic part, however, might be that you either have multiple calls to MPI_Init (HPX itself should handle that correctly) or that the MPI implementation you are using is not thread safe. HPX is driving MPI from all of its worker threads. To keep non-thread-safe MPI implementations from crashing, we use a lock to protect each and every call into MPI (https://github.com/STEllAR-GROUP/hpx/blob/master/hpx/plugins/parcelport/mpi/mpi_environment.hpp#L42). If you acquire that lock around your mpi4py calls, it might just work.

The suspension of the runtime should work as well. As soon as all worker threads are suspended, there won't be any calls into MPI anymore. There still might be incoming messages from other localities, but that shouldn't be a problem.

I hope that sheds some light onto that problem.

On Tue, Oct 23, 2018 at 11:37 PM Simberg Mikael wrote:
> [...]
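The lock-around-every-MPI-call approach described above can be mirrored on the Python side. A minimal sketch follows; the lock here is a plain `threading.Lock` local to the Python process (not HPX's internal mutex, which is not exposed to Python), so it only serializes the mpi4py calls among themselves. The `serialized` helper and the demonstration function are illustrative names, not part of any library:

```python
import threading

# Process-wide lock serializing every call into a non-thread-safe MPI
# library, analogous to the mutex HPX takes around its own MPI calls
# (see mpi_environment.hpp in the HPX sources).
_mpi_lock = threading.Lock()

def serialized(fn):
    """Wrap an MPI-touching callable so invocations never overlap."""
    def wrapper(*args, **kwargs):
        with _mpi_lock:
            return fn(*args, **kwargs)
    return wrapper

# With mpi4py one would wrap the communicator methods, e.g.:
#   from mpi4py import MPI
#   bcast = serialized(MPI.COMM_WORLD.bcast)
#   data = bcast(data, root=0)

if __name__ == "__main__":
    # Stand-in for an MPI call: increments guarded by the same lock.
    counter = {"n": 0}

    @serialized
    def bump():
        for _ in range(1000):
            counter["n"] += 1

    threads = [threading.Thread(target=bump) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter["n"])  # 8000
```

Note that this only removes overlap between the wrapped Python-side calls; HPX's own parcelport calls are still serialized separately by its internal mutex, so the two locks together do not make the combination fully race-free unless the MPI library itself tolerates concurrent callers.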
Re: [hpx-users] Segmentation fault with mpi4py
Hi,

hopefully someone else can chime in on the MPI and Python side of things, but I thought I'd comment briefly on the runtime suspension since I implemented it.

The reason for requiring only a single locality for runtime suspension is simply that I never tested it with multiple localities. It may very well already work with multiple localities, but I didn't want users to get the impression that it's a well-tested feature. So if this is indeed useful for you, you could try removing the check (you probably already found it; let me know if that's not the case) and rebuilding HPX.

I suspect, though, that runtime suspension won't help you here, since it doesn't actually disable MPI or anything else. All it does is put the HPX worker threads to sleep once all work is completed.

In this case there might be a problem with our MPI parcelport interfering with mpi4py. It's not entirely clear to me whether you want to use the networking features of HPX in addition to MPI. If not, you can also build HPX with HPX_WITH_NETWORKING=OFF, which will... disable networking. This branch is also meant to disable some networking-related features at runtime if you're only using one locality: https://github.com/STEllAR-GROUP/hpx/pull/3486.

Kind regards,
Mikael

From: hpx-users-boun...@stellar.cct.lsu.edu [hpx-users-boun...@stellar.cct.lsu.edu] on behalf of Vance, James [va...@uni-mainz.de]
Sent: Tuesday, October 23, 2018 4:38 PM
To: hpx-users@stellar.cct.lsu.edu
Subject: [hpx-users] Segmentation fault with mpi4py

> [...]
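If the HPX networking layer is indeed not needed, the rebuild Mikael suggests could be configured roughly as follows; the build and source paths are placeholders, not taken from the thread:

```shell
# Reconfigure HPX without any parcelports; with networking disabled,
# HPX itself never calls into MPI, leaving the library to mpi4py alone.
cd /path/to/hpx-build                 # placeholder build directory
cmake -DHPX_WITH_NETWORKING=OFF /path/to/hpx-source
make -j && make install
```

With networking off, HPX can only run on a single locality, so this is only an option if all inter-process communication goes through mpi4py.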
[hpx-users] Segmentation fault with mpi4py
Hi everyone,

I am trying to gradually port the molecular dynamics code Espresso++ from its current pure-MPI form to one that uses HPX for the critical parts of the code. It consists of a C++ and MPI-based shared library that can be imported in Python using the Boost.Python library, a collection of Python modules, and an mpi4py-based library for communication among the Python processes.

I was able to properly initialize and terminate the HPX runtime environment from Python using the methods in hpx/examples/quickstart/init_globally.cpp and phylanx/python/src/init_hpx.cpp. However, when I use mpi4py to perform MPI-based communication from within a Python script that also runs HPX, I encounter a segmentation fault with the following trace:

{stack-trace}: 21 frames:
0x2abc616b08f2 : ??? + 0x2abc616b08f2 in /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc616ad06c : hpx::termination_handler(int) + 0x15c in /lustre/miifs01/project/m2_zdvresearch/vance/hpx/builds/gcc-openmpi-bench/install/lib/libhpx.so.1
0x2abc5979b370 : ??? + 0x2abc5979b370 in /lib64/libpthread.so.0
0x2abc62755a76 : mca_pml_cm_recv_request_completion + 0xb6 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc626f4ac9 : ompi_mtl_psm2_progress + 0x59 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc63383eec : opal_progress + 0x3c in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libopen-pal.so.20
0x2abc62630a75 : ompi_request_default_wait + 0x105 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267be92 : ompi_coll_base_bcast_intra_generic + 0x5b2 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6267c262 : ompi_coll_base_bcast_intra_binomial + 0xb2 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc6268803b : ompi_coll_tuned_bcast_intra_dec_fixed + 0xcb in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc62642bc0 : PMPI_Bcast + 0x1a0 in /cluster/easybuild/broadwell/software/mpi/OpenMPI/2.0.2-GCC-6.3.0/lib/libmpi.so.20
0x2abc64cea17f : ??? + 0x2abc64cea17f in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/python2.7/site-packages/mpi4py/MPI.so
0x2abc59176f9b : PyEval_EvalFrameEx + 0x923b in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5917879a : PyEval_EvalCodeEx + 0x87a in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59178ba9 : PyEval_EvalCode + 0x19 in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919cb4a : PyRun_FileExFlags + 0x8a in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc5919df25 : PyRun_SimpleFileExFlags + 0xd5 in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc591b44e1 : Py_Main + 0xc61 in /cluster/easybuild/broadwell/software/lang/Python/2.7.13-foss-2017a/lib/libpython2.7.so.1.0
0x2abc59bccb35 : __libc_start_main + 0xf5 in /lib64/libc.so.6
0x40071e : ??? + 0x40071e in python

{what}: Segmentation fault

{config}: HPX_WITH_AGAS_DUMP_REFCNT_ENTRIES=OFF HPX_WITH_APEX=OFF HPX_WITH_ATTACH_DEBUGGER_ON_TEST_FAILURE=OFF HPX_WITH_AUTOMATIC_SERIALIZATION_REGISTRATION=ON HPX_WITH_CXX14_RETURN_TYPE_DEDUCTION=TRUE HPX_WITH_DEPRECATION_WARNINGS=ON HPX_WITH_GOOGLE_PERFTOOLS=OFF HPX_WITH_INCLUSIVE_SCAN_COMPATIBILITY=ON HPX_WITH_IO_COUNTERS=ON HPX_WITH_IO_POOL=ON HPX_WITH_ITTNOTIFY=OFF HPX_WITH_LOGGING=ON HPX_WITH_MORE_THAN_64_THREADS=OFF HPX_WITH_NATIVE_TLS=ON HPX_WITH_NETWORKING=ON HPX_WITH_PAPI=OFF HPX_WITH_PARCELPORT_ACTION_COUNTERS=OFF HPX_WITH_PARCELPORT_LIBFABRIC=OFF HPX_WITH_PARCELPORT_MPI=ON HPX_WITH_PARCELPORT_MPI_MULTITHREADED=ON HPX_WITH_PARCELPORT_TCP=ON HPX_WITH_PARCELPORT_VERBS=OFF HPX_WITH_PARCEL_COALESCING=ON HPX_WITH_PARCEL_PROFILING=OFF HPX_WITH_SCHEDULER_LOCAL_STORAGE=OFF HPX_WITH_SPINLOCK_DEADLOCK_DETECTION=OFF HPX_WITH_STACKTRACES=ON HPX_WITH_SWAP_CONTEXT_EMULATION=OFF HPX_WITH_THREAD_BACKTRACE_ON_SUSPENSION=OFF HPX_WITH_THREAD_CREATION_AND_CLEANUP_RATES=OFF HPX_WITH_THREAD_CUMULATIVE_COUNTS=ON HPX_WITH_THREAD_DEBUG_INFO=OFF HPX_WITH_THREAD_DESCRIPTION_FULL=OFF HPX_WITH_THREAD_GUARD_PAGE=ON HPX_WITH_THREAD_IDLE_RATES=ON HPX_WITH_THREAD_LOCAL_STORAGE=OFF HPX_WITH_THREAD_MANAGER_IDLE_BACKOFF=ON HPX_WITH_THREAD_QUEUE_WAITTIME=OFF HPX_WITH_THREAD_STACK_MMAP=ON HPX_WITH_THREAD_STEALING_COUNTS=ON HPX_WITH_THREAD_TARGET_ADDRESS=OFF HPX_WITH_TIMER_POOL=ON HPX_WITH_TUPLE_RVALUE_SWAP=ON HPX_WITH_UNWRAPPED_COMPATIBILITY=ON HPX_WITH_VALGRIND=OFF