Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4
Hi Rob,

As you pointed out in April, there are many cases that can trigger the ADIOI_Set_lock error. My code writes to a file at a location specified by a shared file pointer (it is a blocking, collective call):

    MPI_File_write_ordered(contactFile, const_cast<char*>(inf.str().c_str()), length, MPI_CHAR, &status);

That is why disabling data sieving does not work for me, even though I tested it with the latest openmpi-1.8.2 and gcc-4.9.1.

May I ask: apart from mounting Lustre with the "flock" option, is there any other workaround to avoid this ADIOI_Set_lock error in MPI-2 parallel I/O?

Thanks,
Beichuan

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rob Latham
Sent: Monday, April 14, 2014 14:24
To: Open MPI Users
Subject: Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4

On 04/08/2014 05:49 PM, Daniel Milroy wrote:
> Hello,
>
> The file system in question is indeed Lustre, and mounting with flock
> isn't possible in our environment. I recommended the following
> changes to the users' code:

Hi. I'm the ROMIO guy, though I do rely on the community to help me keep the Lustre driver up to snuff.

> MPI_Info_set(info, "collective_buffering", "true");
> MPI_Info_set(info, "romio_lustre_ds_in_coll", "disable");
> MPI_Info_set(info, "romio_ds_read", "disable");
> MPI_Info_set(info, "romio_ds_write", "disable");
>
> Which results in the same error as before. Are there any other MPI
> options I can set?

I'd like to hear more about the workload generating these lock messages, but I can tell you the situations in which ADIOI_SetLock gets called:

- everywhere in NFS. If you have a Lustre file system exported to some clients as NFS, you'll get NFS locking (er, that might not be true unless you pick up a recent patch).
- when writing a non-contiguous region in the file, unless you disable data sieving, as you did above.
  - note: you don't need to disable data sieving for reads, though you might want to if the data sieving algorithm is wasting a lot of data.
- if atomic mode was set on the file (i.e. you called MPI_File_set_atomicity).
- if you use any of the shared file pointer operations.
- if you use any of the ordered mode collective operations.

You've turned off data sieving for writes, which is what I would have first guessed would trigger this lock message. So I guess you are hitting one of the other cases.

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA
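For reference, Rob's list above means that any use of the shared file pointer or of the ordered-mode collectives takes the locking path no matter which data-sieving hints are set, so one possible workaround is to stop using MPI_File_write_ordered and compute each rank's offset explicitly. The sketch below is untested against this thread's setup: the function name, the filename argument, and the surrounding scaffolding are illustrative; only the `inf`/`length` usage and the data-sieving hint come from the thread. It derives rank-ordered offsets with MPI_Exscan and writes with MPI_File_write_at_all, which uses explicit offsets instead of the shared file pointer.

```
// Sketch only: write each rank's buffer at an explicitly computed offset
// instead of through the shared file pointer. Assumes every rank writes
// one contiguous block into a file opened fresh by this call; repeated
// calls would need to carry a running base offset forward.
#include <mpi.h>
#include <sstream>
#include <string>

void write_contact_block(MPI_Comm comm, const std::string& filename,
                         const std::ostringstream& inf)
{
    const std::string data = inf.str();
    long long length = static_cast<long long>(data.size());

    // Hint quoted earlier in the thread: disable data sieving for writes.
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, const_cast<char*>("romio_ds_write"),
                 const_cast<char*>("disable"));

    MPI_File fh;
    MPI_File_open(comm, const_cast<char*>(filename.c_str()),
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    // Exclusive prefix sum of the per-rank lengths reproduces the
    // rank-ordered layout that MPI_File_write_ordered would produce.
    long long my_offset = 0;
    MPI_Exscan(&length, &my_offset, 1, MPI_LONG_LONG, MPI_SUM, comm);
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0) my_offset = 0;  // MPI_Exscan leaves rank 0's result undefined

    MPI_Status status;
    MPI_File_write_at_all(fh, static_cast<MPI_Offset>(my_offset),
                          const_cast<char*>(data.c_str()),
                          static_cast<int>(data.size()), MPI_CHAR, &status);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
}
```

If the application makes several such calls, each rank would also need the global total from the previous call (e.g. via MPI_Allreduce) to advance a shared base offset, which is the bookkeeping MPI_File_write_ordered otherwise does internally. Whether this actually removes the ADIOI_Set_lock error on a Lustre mount without flock still depends on the other cases Rob lists (NFS export, atomic mode, non-contiguous writes with data sieving enabled).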
[OMPI users] --prefix, segfaulting
Hi all!

Today, I observed a really funky behavior of my stock
```
$ mpiexec --version
mpiexec (OpenRTE) 1.6.5

Report bugs to http://www.open-mpi.org/community/help/
```
on Ubuntu 14.04. When running one of my test codes with
```
$ mpiexec -n 2 ioTest
[...]
```
all is fine. If instead I use the full path of mpiexec, I get a different behavior
```
$ /usr/bin/mpiexec -n 2 ioTest
[...]
(exception thrown)
```
I was puzzled, so I skimmed the manpage and found that the `--prefix` option might have something to do with it. I played around and got
```
$ /usr/bin/mpiexec --prefix . -n 2 ioTest
[fuji:21003] *** Process received signal ***
[fuji:21003] Signal: Segmentation fault (11)
[fuji:21003] Signal code: Address not mapped (1)
[fuji:21003] Failing at address: 0x100dd
[fuji:21003] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f12e4069340]
[fuji:21003] [ 1] /lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x1d13) [0x7f12e3cde8f3]
[fuji:21003] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__vasprintf_chk+0xb5) [0x7f12e3d9e035]
[fuji:21003] [ 3] /usr/lib/libopen-rte.so.4(opal_show_help_vstring+0x343) [0x7f12e43043e3]
[fuji:21003] [ 4] /usr/lib/libopen-rte.so.4(orte_show_help+0xaf) [0x7f12e42a5faf]
[fuji:21003] [ 5] /usr/bin/mpiexec() [0x403ab3]
[fuji:21003] [ 6] /usr/bin/mpiexec() [0x40347d]
[fuji:21003] [ 7] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f12e3cb4ec5]
[fuji:21003] [ 8] /usr/bin/mpiexec() [0x403399]
[fuji:21003] *** End of error message ***
Segmentation fault (core dumped)
```
That's tough!

Should I try to reproduce this with a more recent version? Any idea what the reason for the different behavior of `mpiexec` and `/usr/bin/mpiexec` might be?

Cheers,
Nico
Re: [OMPI users] --prefix, segfaulting
You should check that your path would also hit /usr/bin/mpiexec and not some other version of it.

On Sep 17, 2014, at 4:01 PM, Nico Schlömer wrote:

> Should I try to reproduce this with a more recent version? Any idea
> what the reason for the different behavior of `mpiexec` and
> `/usr/bin/mpiexec` might be?
Re: [OMPI users] --prefix, segfaulting
> You should check that your path would also hit /usr/bin/mpiexec and not
> some other version of it

```
$ which mpiexec
/usr/bin/mpiexec
```
Is this what you mean?

–Nico

On Thu, Sep 18, 2014 at 1:04 AM, Ralph Castain wrote:
> You should check that your path would also hit /usr/bin/mpiexec and not
> some other version of it
Re: [OMPI users] --prefix, segfaulting
Yeah, just wanted to make sure you were seeing the same mpiexec in both cases. There shouldn't be any issue with providing the complete path, though I can take a look.

On Sep 17, 2014, at 4:29 PM, Nico Schlömer wrote:

> ```
> $ which mpiexec
> /usr/bin/mpiexec
> ```
> Is this what you mean?
>
> –Nico