Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Ralph Castain via users writes:

> Just a point to consider. OMPI does _not_ want to get in the mode of
> modifying imported software packages. That is a blackhole of effort we
> simply cannot afford.

It's already done that, even in flatten.c; otherwise updating to the
current version would be trivial. I'll eventually make suggestions for
some changes in MPICH for standalone builds if I can verify that they
don't break things outside of OMPI. Meanwhile we don't have a recent
version that will even pass the tests recommended here, and we've long
been asking about MPI-IO on Lustre. We should probably move to some
flavour of MPICH for MPI-IO on what is probably the most common parallel
filesystem, as well as for RMA on the most common fabric.

> The correct thing to do would be to flag Rob Latham on that PR and ask
> that he upstream the fix into ROMIO so we can absorb it. We shouldn't
> be committing such things directly into OMPI itself.

It's already fixed differently in MPICH, but the simple patch is useful
if there's nothing else broken. I approve of sending fixes to MPICH, but
that will only do any good if OMPI's version gets updated from there,
which doesn't seem to happen.

> It's called "working with the community" as opposed to taking a
> point-solution approach :-)

The community has already done the work to fix this properly; it's a pity
that will be wasted. This bit of the community is grateful for the patch,
which is reasonable to carry in packaging for now, unlike a whole new
romio.
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Just a point to consider. OMPI does _not_ want to get in the mode of
modifying imported software packages. That is a blackhole of effort we
simply cannot afford.

The correct thing to do would be to flag Rob Latham on that PR and ask
that he upstream the fix into ROMIO so we can absorb it. We shouldn't be
committing such things directly into OMPI itself. It's called "working
with the community" as opposed to taking a point-solution approach :-)
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Mark,

Thanks so much for this - yes, applying that pull request against ompi
4.0.5 allows hdf5 1.10.7's parallel tests to pass on our Lustre
filesystem.

I'll certainly be applying it on our local clusters!

Best wishes,

Mark
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Mark Allen via users writes:

> At least for the topic of why romio fails with HDF5, I believe this is
> the fix we need (it has to do with how romio processes the MPI datatypes
> in its flatten routine). I made a different fix a long time ago in SMPI
> for that, then somewhat more recently it was re-broken and I had to
> re-fix it. So the below takes a little more aggressive approach: not
> totally redesigning the flatten function, but taking over how the array
> size counter is handled.
> https://github.com/open-mpi/ompi/pull/3975
>
> Mark Allen

Thanks. (As it happens, the system we're struggling on is an IBM one.)

In the meantime I've hacked in romio from mpich-3.4b1 without really
understanding what I'm doing; I think it needs some tidying up on both
the mpich and ompi sides. That passed make check in testpar, assuming the
complaints from testpflush are the expected ones. (I've not previously
had access to a filesystem with flock to run this.)

Perhaps it's time to update romio anyway. It may only be relevant to
Lustre, but I guess that's what most people have.
[OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
At least for the topic of why romio fails with HDF5, I believe this is
the fix we need (it has to do with how romio processes the MPI datatypes
in its flatten routine). I made a different fix a long time ago in SMPI
for that, then somewhat more recently it was re-broken and I had to
re-fix it. So the below takes a little more aggressive approach: not
totally redesigning the flatten function, but taking over how the array
size counter is handled.

https://github.com/open-mpi/ompi/pull/3975

Mark Allen
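For anyone trying to follow along: before doing file I/O, ROMIO
"flattens" a derived MPI datatype into lists of (offset, length) pairs,
and the fix above is in that bookkeeping (flatten.c). A minimal sketch of
the kind of request that goes through the flatten path - a collective
write through a strided file view - is below. It's illustrative only, not
the HDF5 reproducer; the file name and sizes are made up.

  /* Sketch: collective write through a derived datatype; requests like
   * this are what ROMIO's flatten routine must decompose internally.
   * Illustrative only, not the reproducer for the bug discussed above. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_File fh;
      MPI_Datatype filetype;
      int rank, nprocs, buf[64];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      for (int i = 0; i < 64; i++)
          buf[i] = rank;

      /* 16 blocks of 4 ints per rank, interleaved across ranks. */
      MPI_Type_vector(16, 4, 4 * nprocs, MPI_INT, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File_open(MPI_COMM_WORLD, "testfile", /* example name */
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, (MPI_Offset)rank * 4 * sizeof(int), MPI_INT,
                        filetype, "native", MPI_INFO_NULL);
      MPI_File_write_all(fh, buf, 64, MPI_INT, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
      MPI_Finalize();
      return 0;
  }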
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
On Fri, 27 Nov 2020, Dave Love wrote:

...
> It's less dramatic in the case I ran, but there's clearly something
> badly wrong which needs profiling. It's probably useful to know how many
> ranks that's with, and whether it's the default striping. (I assume with
> default ompio fs parameters.)

Hi Dave,

It was run the way hdf5's "make check" runs it - that's 6 ranks. I didn't
do anything interesting with striping so, unless t_bigio changed it, it'd
have a width of 1.

...
> I can have a look with the current or older romio, unless someone else
> is going to; we should sort this.

If you were willing, that would be brilliant, thanks :)

>> My concern is that openmpi 3.x is near, or at, end of life.
>
> 'Twas ever thus, but if it works?

Evidently it wouldn't fit the definition of "works" for some users,
otherwise there wouldn't have been a version 4! I just didn't want Lustre
MPI-IO support to be forgotten about, considering the 4.x series is 2
years old now.

All the best,

Mark
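In case it helps anyone reproducing this: the striping a test directory
inherits, and a wider layout to compare against, can be checked and set
with Lustre's standard lfs tool. The paths below are examples only.

  # Show the stripe count/size the test directory inherits
  lfs getstripe /lustre/scratch/testpar        # example path

  # Create a directory whose files will be striped across 8 OSTs
  mkdir /lustre/scratch/testpar-wide
  lfs setstripe -c 8 /lustre/scratch/testpar-wide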
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
As a check of mpiP, I ran HDF5 testpar/t_bigio under it. This was on one
node with four ranks (interactively) on Lustre with its default of one
1MB stripe, ompi-4.0.5 + ucx-1.9, hdf5-1.10.7, MCA defaults. I don't know
how useful it is, but here's the summary:

romio:

@--- Aggregate Time (top twenty, descending, milliseconds) ---
Call                Site       Time    App%    MPI%   Count    COV
File_write_at_all     26   2.58e+04   47.50   50.24      16   0.00
File_read_at_all      14   2.42e+04   44.47   47.03      16   0.00
File_set_view         29        515    0.95    1.00      16   0.09
File_set_view          3        382    0.70    0.74      16   0.00

ompio:

@--- Aggregate Time (top twenty, descending, milliseconds) ---
Call                Site       Time    App%    MPI%   Count    COV
File_read_at_all      14   3.32e+06   82.83   82.90      16   0.00
File_write_at_all     26   6.72e+05   16.77   16.78      16   0.02
File_set_view         11   1.14e+04    0.28    0.28      16   0.91
File_set_view         29        340    0.01    0.01      16   0.35

with call sites:

ID  Lev  File/Address  Line  Parent_Funct     MPI_Call
11    0  H5FDmpio.c    1651  H5FD_mpio_write  File_set_view
14    0  H5FDmpio.c    1436  H5FD_mpio_read   File_read_at_all
26    0  H5FDmpio.c    1636  H5FD_mpio_write  File_write_at_all

I also looked at the romio hang in testphdf5. In the absence of a
parallel debugger, strace and kill show an endless loop of read(...,"",0)
under this:

[login2:115045] [ 2] .../mca_io_romio321.so(ADIOI_LUSTRE_ReadContig+0xa8)[0x20003d1cab88]
[login2:115045] [ 3] .../mca_io_romio321.so(ADIOI_GEN_ReadStrided+0x528)[0x20003d1e4f08]
[login2:115045] [ 4] .../mca_io_romio321.so(ADIOI_GEN_ReadStridedColl+0x1084)[0x20003d1e4514]
[login2:115045] [ 5] .../mca_io_romio321.so(MPIOI_File_read_all+0x124)[0x20003d1c37c4]
[login2:115045] [ 6] .../mca_io_romio321.so(mca_io_romio_dist_MPI_File_read_at_all+0x34)[0x20003d1c41d4]
[login2:115045] [ 7] .../mca_io_romio321.so(mca_io_romio321_file_read_at_all+0x3c)[0x20003d1bdabc]
[login2:115045] [ 8] .../libmpi.so.40(PMPI_File_read_at_all+0x13c)[0x2078de4c]
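For anyone wanting to repeat the measurement: a dynamically linked binary
can be run under mpiP without relinking, roughly as follows. This assumes
Open MPI's mpirun (whose -x flag exports environment variables); the path
to libmpiP.so is site-specific.

  # Profile t_bigio under mpiP via LD_PRELOAD; adjust the library path
  mpirun -np 4 -x LD_PRELOAD=/path/to/libmpiP.so ./t_bigio
  # mpiP then writes a report (named like t_bigio.4.<pid>.1.mpiP) in the
  # working directory, containing tables like those quoted above.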
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Mark Dixon via users writes:

> But remember that IMB-IO doesn't cover everything.

I don't know what useful operations it omits, but it was the obvious
thing to run that should show up pathology, with simple things first. It
does at least run, which was the first concern.

> For example, hdf5's t_bigio parallel test appears to be a pathological
> case and OMPIO is 2 orders of magnitude slower on a Lustre filesystem:
>
> - OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
> - OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

It's less dramatic in the case I ran, but there's clearly something badly
wrong which needs profiling. It's probably useful to know how many ranks
that's with, and whether it's the default striping. (I assume with
default ompio fs parameters.)

> End users seem to have the choice of:
>
> - use openmpi 4.x and have some things broken (romio)
> - use openmpi 4.x and have some things slow (ompio)
> - use openmpi 3.x and everything works

I can have a look with the current or older romio, unless someone else is
going to; we should sort this.

> My concern is that openmpi 3.x is near, or at, end of life.

'Twas ever thus, but if it works?

[Posted in case it's useful, rather than discussing more locally.]
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Edgar,

Thank you so much for your reply. Having run a number of Lustre systems
over the years, I fully sympathise with your characterisation of Lustre
as being very unforgiving!

Best wishes,

Mark
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
I will have a look at the t_bigio tests on Lustre with ompio. We had some
reports from collaborators about performance problems similar to the one
that you mention here (which was the reason we were hesitant to make
ompio the default on Lustre), but part of the problem is that we were not
able to reproduce it reliably on the systems that we had access to, which
makes debugging and fixing the issue very difficult.

Lustre is a very unforgiving file system: if you get something wrong with
the settings, the performance is not just a bit off, but often orders of
magnitude (as in your measurements).

Thanks!
Edgar
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
On Wed, 25 Nov 2020, Dave Love via users wrote:

>> The perf test says romio performs a bit better. Also -- from overall
>> time -- it's faster on IMB-IO (which I haven't looked at in detail,
>> and ran with suboptimal striping).
>
> I take that back. I can't reproduce a significant difference for total
> IMB-IO runtime, with both run in parallel on 16 ranks, using either the
> system default of a single 1MB stripe or using eight stripes. I haven't
> teased out figures for different operations yet. That must have been
> done elsewhere, but I've never seen figures.

But remember that IMB-IO doesn't cover everything. For example, hdf5's
t_bigio parallel test appears to be a pathological case and OMPIO is 2
orders of magnitude slower on a Lustre filesystem:

- OMPI's default MPI-IO implementation on Lustre (ROMIO): 21 seconds
- OMPI's alternative MPI-IO implementation on Lustre (OMPIO): 2554 seconds

End users seem to have the choice of:

- use openmpi 4.x and have some things broken (romio)
- use openmpi 4.x and have some things slow (ompio)
- use openmpi 3.x and everything works

My concern is that openmpi 3.x is near, or at, end of life.

Mark

t_bigio runs on centos 7, gcc 4.8.5, ppc64le, openmpi 4.0.5, hdf5 1.10.7,
Lustre 2.12.5:

[login testpar]$ time mpirun -np 6 ./t_bigio
Testing Dataset1 write by ROW
Testing Dataset2 write by COL
Testing Dataset3 write select ALL proc 0, NONE others
Testing Dataset4 write point selection
Read Testing Dataset1 by COL
Read Testing Dataset2 by ROW
Read Testing Dataset3 read select ALL proc 0, NONE others
Read Testing Dataset4 with Point selection
***Express test mode on. Several tests are skipped

real    0m21.141s
user    2m0.318s
sys     0m3.289s

[login testpar]$ export OMPI_MCA_io=ompio
[login testpar]$ time mpirun -np 6 ./t_bigio
Testing Dataset1 write by ROW
Testing Dataset2 write by COL
Testing Dataset3 write select ALL proc 0, NONE others
Testing Dataset4 write point selection
Read Testing Dataset1 by COL
Read Testing Dataset2 by ROW
Read Testing Dataset3 read select ALL proc 0, NONE others
Read Testing Dataset4 with Point selection
***Express test mode on. Several tests are skipped

real    42m34.103s
user    213m22.925s
sys     8m6.742s
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
I wrote:

> The perf test says romio performs a bit better. Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).

I take that back. I can't reproduce a significant difference for total
IMB-IO runtime, with both run in parallel on 16 ranks, using either the
system default of a single 1MB stripe or using eight stripes. I haven't
teased out figures for different operations yet. That must have been done
elsewhere, but I've never seen figures.
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi All,

I opened a new issue to track the coll_perf failure, in case it's not
related to the HDF5 problem reported earlier:

https://github.com/open-mpi/ompi/issues/8246

Howard
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Mark Dixon via users writes:

> Surely I cannot be the only one who cares about using a recent openmpi
> with hdf5 on lustre?

I generally have similar concerns. I dug out the romio tests, assuming
something more basic is useful. I ran them with ompi 4.0.5+ucx on Mark's
lustre system (similar to a few nodes of Summit, apart from the
filesystem, but with quad-rail IB which doesn't give the bandwidth I
expected).

The perf test says romio performs a bit better. Also -- from overall time
-- it's faster on IMB-IO (which I haven't looked at in detail, and ran
with suboptimal striping).

Test: perf

romio321
  Access size per process = 4194304 bytes, ntimes = 5
  Write bandwidth without file sync = 19317.372354 Mbytes/sec
  Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
  Write bandwidth including file sync = 1081.096713 Mbytes/sec
  Read bandwidth after file sync = 47135.349155 Mbytes/sec

ompio
  Access size per process = 4194304 bytes, ntimes = 5
  Write bandwidth without file sync = 18442.698536 Mbytes/sec
  Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
  Write bandwidth including file sync = 1081.058583 Mbytes/sec
  Read bandwidth after file sync = 31506.854710 Mbytes/sec

However, romio coll_perf fails as follows, and ompio runs. Isn't there
mpi-io regression testing?

[gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1fffbc10)
backtrace (tid: 89063)
 0 0x0005453c ucs_debug_print_backtrace() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
 1 0x00041b04 ucp_rndv_pack_data() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
 2 0x0001c814 uct_self_ep_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
 3 0x0003f7ac uct_ep_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
 4 0x0003f7ac ucp_do_am_bcopy_multi() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
 5 0x0003f7ac ucp_rndv_progress_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
 6 0x00041cb8 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
 7 0x00041cb8 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
 8 0x00041cb8 ucp_rndv_rtr_handler() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
 9 0x0001c984 uct_iface_invoke_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
10 0x0001c984 uct_self_iface_sendrecv_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
11 0x0001c984 uct_self_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
12 0x0002ee30 uct_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
13 0x0002ee30 ucp_do_am_single() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
14 0x00042908 ucp_proto_progress_rndv_rtr() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
15 0x0003f4c4 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
16 0x0003f4c4 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
17 0x0003f4c4 ucp_rndv_req_send_rtr() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:423
18 0x00045214 ucp_rndv_matched() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1262
19 0x00046158 ucp_rndv_process_rts() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1280
20 0x00046268 ucp_rndv_rts_handler() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1304
21 0x0001c984 uct_iface_invoke_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcr
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Edgar,

Pity, that would have been nice! But thanks for looking.

Checking through the ompi github issues, I now realise I logged exactly
the same issue over a year ago (completely forgot - I've moved jobs since
then), including a script to reproduce the issue on a Lustre system.
Unfortunately there's been no movement:

https://github.com/open-mpi/ompi/issues/6871

If it helps anyone, I can confirm that hdf5 parallel tests pass with
openmpi 3.1.6, but not with 4.0.5.

Surely I cannot be the only one who cares about using a recent openmpi
with hdf5 on lustre?

Mark
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
hm, I think this sounds like a different issue; somebody who is more
invested in the ROMIO Open MPI work should probably have a look.

Regarding compiling Open MPI with Lustre support for ROMIO: I cannot test
it right now for various reasons, but if I recall correctly the trick was
to provide the --with-lustre option twice, once inside of the
"--with-io-romio-flags=" (along with the option that you provided), and
once outside (for ompio).

Thanks
Edgar
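If I've read that recollection right, the configure line would look
something like the following. This is an untested sketch of the
suggestion above (the prefix is an example), not a verified recipe:

  # Untested sketch: --with-lustre once for ompio, once for ROMIO,
  # alongside the --with-file-system flag mentioned earlier in the thread
  ./configure --prefix=/opt/openmpi-4.0.5 \
      --with-lustre \
      --with-io-romio-flags="--with-file-system=lustre+ufs --with-lustre"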
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi Edgar,

Thanks for this - good to know that ompio is an option, despite the
reference to potential performance issues.

I'm using openmpi 4.0.5 with ucx 1.9.0 and see the hdf5 1.10.7 test
"testphdf5" timeout (with the timeout set to an hour) using romio. Is it
a known issue there, please? When it times out, the last few lines to be
printed are these:

Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)
Testing  -- multi-chunk collective chunk io (cchunk3)

The other thing I note is that openmpi doesn't configure romio's lustre
driver, even when given "--with-lustre". Regardless, I see the same
result whether or not I add
"--with-io-romio-flags=--with-file-system=lustre+ufs".

Cheers,

Mark
Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
this is in theory still correct: the default MPI I/O library used by Open
MPI on Lustre file systems is ROMIO in all release versions. That being
said, ompio does have support for Lustre as well starting from the 2.1
series, so you can use that too. The main reason that we did not switch
to ompio as the default MPI I/O library on Lustre is a performance issue
that can arise under certain circumstances.

Which version of Open MPI are you using? There was a bug fix in the Open
MPI to ROMIO integration layer sometime in the 4.0 series that fixed a
datatype problem, which caused some problems in the HDF5 tests. You might
be hitting that problem.

Thanks
Edgar
[OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?
Hi all,

I'm confused about how openmpi supports mpi-io on Lustre these days, and
am hoping that someone can help.

Back in the openmpi 2.0.0 release notes, it said that OMPIO is the
default MPI-IO implementation on everything apart from Lustre, where
ROMIO is used. Those release notes are pretty old, but it still appears
to be true.

However, I cannot get HDF5 1.10.7 to pass its MPI-IO tests unless I tell
openmpi to use OMPIO (OMPI_MCA_io=ompio) and tell UCX not to print
warning messages (UCX_LOG_LEVEL=ERROR).

Can I just check: are we still supposed to be using ROMIO?

Thanks,

Mark
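For reference, the selection described above can be made either through
the environment, as in the message, or per-invocation on the mpirun
command line. Both forms below use only the values quoted in this thread;
testphdf5 stands in for whichever HDF5 test binary is being run.

  # Environment form, as used above: applies to subsequent mpirun calls
  export OMPI_MCA_io=ompio
  export UCX_LOG_LEVEL=ERROR

  # Equivalent one-off selection for a single run
  mpirun --mca io ompio -np 6 ./testphdf5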