Re: [OMPI users] MPI I/O question using MPI_File_write_shared
On Fri, 2020-06-05 at 19:52 -0400, Stephen Siegel via users wrote:
> Sure, I’ll ask the machine admins to update and let you know how it
> goes. In the meantime, I was just wondering if someone has run this
> little program with an up-to-date OpenMPI and if it worked. If so,
> then I will know the problem is with our setup.

I don't know what version of OpenMPI corresponds to
spectrum-mpi-10.3.1.2-20200121 on the ORNL Summit machine, but this
test passes with that implementation.

==rob
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
Sure, I’ll ask the machine admins to update and let you know how it
goes. In the meantime, I was just wondering if someone has run this
little program with an up-to-date OpenMPI and if it worked. If so,
then I will know the problem is with our setup.

Thanks
-Steve

> On Jun 5, 2020, at 7:45 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> You cited Open MPI v2.1.1. That's a pretty ancient version of Open MPI.
>
> Any chance you can upgrade to Open MPI 4.0.x?
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
You cited Open MPI v2.1.1. That's a pretty ancient version of Open MPI.

Any chance you can upgrade to Open MPI 4.0.x?

> On Jun 5, 2020, at 7:24 PM, Stephen Siegel wrote:
>
> So, I’ll ask my people to look into how they configured this.
>
> However, on the second machine which uses SLURM it consistently hangs
> on this example, although many other examples using MPI I/O work fine.
>
> -Steve

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
> On Jun 5, 2020, at 6:55 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> It looks like your output somehow doesn't include the actual error
> message.

You’re right, on this first machine I did not include all of the
output. It is:

siegel@ilyich:~/372/code/mpi/io$ mpiexec -n 4 ./a.out
--------------------------------------------------------------------------
[[171,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: ilyich

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------

So, I’ll ask my people to look into how they configured this.

However, on the second machine which uses SLURM it consistently hangs
on this example, although many other examples using MPI I/O work fine.

-Steve
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
OK, but then on this other machine it hangs. This one is using SLURM,
so I’m not exactly sure, but I think this tells me the OpenMPI version:

siegel@cisc372:~$ mpiexec.openmpi --version
mpiexec.openmpi (OpenRTE) 2.1.1

Report bugs to http://www.open-mpi.org/community/help/

siegel@cisc372:~/372/code/mpi/io$ mpicc io_byte_shared.c
siegel@cisc372:~/372/code/mpi/io$ srun -n 4 ./a.out
srun: job 143344 queued and waiting for resources
srun: job 143344 has been allocated resources
Proc 0: file has been opened.
Proc 0: About to write to file.
Proc 1: file has been opened.
Proc 2: file has been opened.
Proc 3: file has been opened.
^Csrun: interrupt (one more within 1 sec to abort)
srun: step:143344.0 tasks 0-3: running
^Csrun: sending Ctrl-C to job 143344.0
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd-beowulf: error: *** STEP 143344.0 ON beowulf CANCELLED AT 2020-06-05T19:03:59 ***

> On Jun 5, 2020, at 6:55 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>
> I.e., your program ran correctly, but it may have run slower than it
> could have if it were able to use HPC-class networks.
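One way to narrow down a hang like this is to check which MPI-I/O component the Open MPI build is using and force the alternative. This is a sketch, not something suggested in the thread: the component names below (`ompio`, `romio314`) are the ones shipped with Open MPI 2.x, and should be verified against what `ompi_info` actually reports on the machine.

```shell
# List the MPI-I/O components available in this Open MPI build
ompi_info | grep ' io '

# Re-run the failing test forcing the ROMIO-based component instead of
# the default OMPIO ("romio314" is the Open MPI 2.x name; adjust to
# whatever ompi_info printed above)
mpiexec --mca io romio314 -n 4 ./a.out
```

If the program completes under one component but hangs under the other, that points at the I/O layer rather than the application code.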
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
On Jun 5, 2020, at 6:35 PM, Stephen Siegel via users wrote:
>
> [ilyich:12946] 3 more processes have sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [ilyich:12946] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages

It looks like your output somehow doesn't include the actual error
message. That error message was sent to stderr, so you may not have
captured it if you only did "mpirun ... > foo.txt". The actual error
message template is this:

-----
%s: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: %s
Host: %s

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
-----

This is not actually an error -- just a warning. It typically means
that your Open MPI has support for HPC-class networking, Open MPI saw
some evidence of HPC-class networking on the nodes on which your job
ran, but ultimately didn't use any of those HPC-class networking
interfaces for some reason and therefore fell back to TCP.

I.e., your program ran correctly, but it may have run slower than it
could have if it were able to use HPC-class networks.

--
Jeff Squyres
jsquy...@cisco.com
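The capture pitfall described above is plain shell redirection: `>` sends only stdout to the file, so warnings on stderr never land in it. A minimal sketch, using an ordinary shell command (not a real MPI job) to stand in for `mpirun -n 4 ./a.out`:

```shell
# Stand-in for the MPI job: writes one line to stdout, one to stderr
job='echo result-line; echo warning-line >&2'

# "mpirun ... > foo.txt": only stdout is captured; warning-line goes to
# the terminal and never appears in foo.txt
sh -c "$job" > foo.txt

# "mpirun ... > foo.txt 2>&1": stderr is merged into stdout, so foo.txt
# now contains both result-line and warning-line
sh -c "$job" > foo.txt 2>&1
```

The `2>&1` must come after the `> foo.txt` redirection, since redirections are applied left to right.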
Re: [OMPI users] MPI I/O question using MPI_File_write_shared
Your code looks correct, and based on your output I would actually
suspect that the I/O part finished correctly. The error message that
you see is not an I/O error, but comes from the btl (which is
communication related).

What version of Open MPI are you using, and on what file system?

Thanks
Edgar

-----Original Message-----
From: users On Behalf Of Stephen Siegel via users
Sent: Friday, June 5, 2020 5:35 PM
To: users@lists.open-mpi.org
Cc: Stephen Siegel
Subject: [OMPI users] MPI I/O question using MPI_File_write_shared

I posted this question on StackOverflow and someone suggested I write
to the OpenMPI community.

https://stackoverflow.com/questions/62223698/mpi-i-o-why-does-my-program-hang-or-misbehave-when-one-process-writes-using-mpi
[OMPI users] MPI I/O question using MPI_File_write_shared
I posted this question on StackOverflow and someone suggested I write
to the OpenMPI community.

https://stackoverflow.com/questions/62223698/mpi-i-o-why-does-my-program-hang-or-misbehave-when-one-process-writes-using-mpi

Below is a little MPI program. It is a simple use of MPI I/O. Process 0
writes an int to the file using MPI_File_write_shared; no other process
writes anything. It works correctly using an MPICH installation, but on
two different machines using OpenMPI, it either hangs in the middle of
the call to MPI_File_write_shared, or it reports an error at the end.
Not sure if it is my misunderstanding of the MPI Standard or a bug or
configuration problem with my OpenMPI.

Thanks in advance if anyone can look at it,
Steve

#include <mpi.h>
#include <stdio.h>
#include <assert.h>

int nprocs, rank;

int main() {
  MPI_File fh;
  int err, count;
  MPI_Status status;

  MPI_Init(NULL, NULL);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  err = MPI_File_open(MPI_COMM_WORLD, "io_byte_shared.tmp",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  assert(err==0);
  err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
  assert(err==0);
  printf("Proc %d: file has been opened.\n", rank);
  fflush(stdout);
  // Proc 0 only writes header using shared file pointer...
  MPI_Barrier(MPI_COMM_WORLD);
  if (rank == 0) {
    int x = 1;  /* placeholder; the original value was elided in the archive */
    printf("Proc 0: About to write to file.\n");
    fflush(stdout);
    err = MPI_File_write_shared(fh, &x, 1, MPI_INT, &status);
    printf("Proc 0: Finished writing.\n");
    fflush(stdout);
    assert(err == 0);
  }
  MPI_Barrier(MPI_COMM_WORLD);
  printf("Proc %d: about to close file.\n", rank);
  fflush(stdout);
  err = MPI_File_close(&fh);
  assert(err==0);
  MPI_Finalize();
}

Example run:

$ mpicc io_byte_shared.c
$ mpiexec -n 4 ./a.out
Proc 0: file has been opened.
Proc 0: About to write to file.
Proc 0: Finished writing.
Proc 1: file has been opened.
Proc 2: file has been opened.
Proc 3: file has been opened.
Proc 0: about to close file.
Proc 1: about to close file.
Proc 2: about to close file.
Proc 3: about to close file.
[ilyich:12946] 3 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[ilyich:12946] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
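Since only process 0 writes, and at a known position, the same effect can be achieved without the shared file pointer at all. Below is a workaround sketch (not something proposed in the thread): the rank-0 write is redone with MPI_File_write_at and an explicit offset, which sidesteps the shared-file-pointer machinery entirely. The file name and header value are placeholders.

```c
/* Workaround sketch: write the header with an explicit offset instead of
 * the shared file pointer. Offsets passed to MPI_File_write_at are in
 * etype units relative to the view set by MPI_File_set_view, so offset 0
 * here means "the first MPI_INT in the file". */
#include <mpi.h>
#include <assert.h>

int main(void) {
  MPI_File fh;
  MPI_Status status;
  int err, rank;

  MPI_Init(NULL, NULL);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  err = MPI_File_open(MPI_COMM_WORLD, "io_byte_at.tmp",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  assert(err == MPI_SUCCESS);
  err = MPI_File_set_view(fh, 0, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
  assert(err == MPI_SUCCESS);
  if (rank == 0) {
    int x = 1;  /* placeholder header value */
    /* Explicit-offset write: no coordination with other ranks needed */
    err = MPI_File_write_at(fh, 0, &x, 1, MPI_INT, &status);
    assert(err == MPI_SUCCESS);
  }
  /* MPI_File_close is collective over the communicator used to open */
  err = MPI_File_close(&fh);
  assert(err == MPI_SUCCESS);
  MPI_Finalize();
  return 0;
}
```

This only substitutes for MPI_File_write_shared when the writing rank knows its offset; programs that rely on the shared pointer's ordering across ranks would need a different restructuring.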