Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4

2014-09-17 Thread Beichuan Yan
Hi Rob,

As you pointed out in April, there are many cases that can trigger the 
ADIOI_Set_lock error. My code writes to a file at a location specified by a 
shared file pointer (a blocking, collective call):

MPI_File_write_ordered(contactFile, const_cast<char*>(inf.str().c_str()), 
length, MPI_CHAR, &status);

That is why disabling data sieving does not work for me; I still see the error 
even with the latest Open MPI 1.8.2 and GCC 4.9.1.
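
A minimal, self-contained sketch of that call, with the file name and payload 
made up for illustration:
```c++
#include <mpi.h>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File contactFile;
    MPI_File_open(MPI_COMM_WORLD, "contact.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &contactFile);

    std::ostringstream inf;
    inf << "rank " << rank << " contact data\n";
    const std::string text = inf.str();

    // Ordered-mode collective write through the shared file pointer.
    // The const_cast matches pre-MPI-3 bindings, which take a non-const
    // buffer argument.
    MPI_Status status;
    MPI_File_write_ordered(contactFile, const_cast<char*>(text.c_str()),
                           static_cast<int>(text.size()), MPI_CHAR, &status);

    MPI_File_close(&contactFile);
    MPI_Finalize();
    return 0;
}
```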

May I ask: apart from mounting Lustre with the "flock" option, is there any 
other workaround to avoid this ADIOI_Set_lock error in MPI-2 parallel I/O?

Thanks,
Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rob Latham
Sent: Monday, April 14, 2014 14:24
To: Open MPI Users
Subject: Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4



On 04/08/2014 05:49 PM, Daniel Milroy wrote:
> Hello,
>
> The file system in question is indeed Lustre, and mounting with flock 
> isn't possible in our environment.  I recommended the following 
> changes to the users' code:

Hi.  I'm the ROMIO guy, though I do rely on the community to help me keep the 
Lustre driver up to snuff.

> MPI_Info_set(info, "collective_buffering", "true");
> MPI_Info_set(info, "romio_lustre_ds_in_coll", "disable");
> MPI_Info_set(info, "romio_ds_read", "disable");
> MPI_Info_set(info, "romio_ds_write", "disable");
>
> Which results in the same error as before.  Are there any other MPI 
> options I can set?
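
For reference, a self-contained sketch of how such hints are typically created 
and passed to MPI_File_open (the file name, access mode, and communicator are 
illustrative, not from this thread):
```c++
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Build the ROMIO hints discussed above.
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "collective_buffering", "true");
    MPI_Info_set(info, "romio_lustre_ds_in_coll", "disable");
    MPI_Info_set(info, "romio_ds_read", "disable");
    MPI_Info_set(info, "romio_ds_write", "disable");

    // Hints take effect when the file is opened (or via MPI_File_set_info).
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);   // the file handle keeps its own copy

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```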

I'd like to hear more about the workload generating these lock messages, but I 
can tell you the situations in which ADIOI_SetLock gets called:
- everywhere in NFS.  If you have a Lustre file system exported to some clients 
as NFS, those clients will use the NFS driver (er, that might not be true unless 
you pick up a recent patch)
- when writing a non-contiguous region of the file, unless you disable data 
sieving, as you did above.
- note: you don't need to disable data sieving for reads, though you might want 
to if the data-sieving algorithm is reading much more data than it returns.
- if atomic mode was set on the file (i.e. you called MPI_File_set_atomicity)
- if you use any of the shared file pointer operations
- if you use any of the ordered mode collective operations

You've turned off data-sieving writes, which is what I would have first guessed 
would trigger this lock message.  So I guess you are hitting one of the other 
cases.
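
For illustration, one way to avoid the shared-file-pointer and ordered-mode 
cases above is to compute each rank's offset explicitly with MPI_Exscan and 
write with MPI_File_write_at_all. This is a sketch of that general technique, 
not something proposed in this exchange, and every name in it is illustrative:
```c++
#include <mpi.h>
#include <sstream>
#include <string>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    std::ostringstream inf;
    inf << "rank " << rank << " data\n";
    const std::string buf = inf.str();

    // Exclusive prefix sum of the per-rank lengths gives each rank its
    // starting byte offset.  MPI_Exscan leaves rank 0's result undefined,
    // so initialize the offset to 0 beforehand.
    long long len = static_cast<long long>(buf.size());
    long long offset = 0;
    MPI_Exscan(&len, &offset, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "contact.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Collective write at an explicit offset: a contiguous write with no
    // shared file pointer, so it avoids the locking paths listed above.
    // The const_cast matches pre-MPI-3 bindings.
    MPI_File_write_at_all(fh, static_cast<MPI_Offset>(offset),
                          const_cast<char*>(buf.c_str()),
                          static_cast<int>(buf.size()), MPI_CHAR,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```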

==rob

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


[OMPI users] --prefix, segfaulting

2014-09-17 Thread Nico Schlömer
Hi all!

Today I observed some really funky behavior of my stock
```
$ mpiexec --version
mpiexec (OpenRTE) 1.6.5

Report bugs to http://www.open-mpi.org/community/help/
```
on Ubuntu 14.04. When running one of my test codes with
```
$ mpiexec -n 2 ioTest
[...]
```
all is fine. If instead I use the full path to mpiexec, I get
different behavior:
```
$ /usr/bin/mpiexec -n 2 ioTest
[...]
(exception thrown)
```
I was puzzled, so I skimmed the manpage and found that the `--prefix`
option might have something to do with it. I played around and got
```
$ /usr/bin/mpiexec --prefix . -n 2 ioTest
[fuji:21003] *** Process received signal ***
[fuji:21003] Signal: Segmentation fault (11)
[fuji:21003] Signal code: Address not mapped (1)
[fuji:21003] Failing at address: 0x100dd
[fuji:21003] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)
[0x7f12e4069340]
[fuji:21003] [ 1] /lib/x86_64-linux-gnu/libc.so.6(_IO_vfprintf+0x1d13)
[0x7f12e3cde8f3]
[fuji:21003] [ 2]
/lib/x86_64-linux-gnu/libc.so.6(__vasprintf_chk+0xb5) [0x7f12e3d9e035]
[fuji:21003] [ 3]
/usr/lib/libopen-rte.so.4(opal_show_help_vstring+0x343)
[0x7f12e43043e3]
[fuji:21003] [ 4] /usr/lib/libopen-rte.so.4(orte_show_help+0xaf)
[0x7f12e42a5faf]
[fuji:21003] [ 5] /usr/bin/mpiexec() [0x403ab3]
[fuji:21003] [ 6] /usr/bin/mpiexec() [0x40347d]
[fuji:21003] [ 7]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)
[0x7f12e3cb4ec5]
[fuji:21003] [ 8] /usr/bin/mpiexec() [0x403399]
[fuji:21003] *** End of error message ***
Segmentation fault (core dumped)
```
That's tough!

Should I try to reproduce this with a more recent version? Any idea
what the reason for the different behavior of `mpiexec` and
`/usr/bin/mpiexec` might be?
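
If I read the manpage right, `--prefix` takes the Open MPI installation
root rather than `.`; for a stock package installed under /usr that would
presumably be
```
$ /usr/bin/mpiexec --prefix /usr -n 2 ioTest
```
though whether that sidesteps the crash is another question.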

Cheers,
Nico


Re: [OMPI users] --prefix, segfaulting

2014-09-17 Thread Ralph Castain
You should check that the mpiexec in your PATH also resolves to 
/usr/bin/mpiexec and not to some other version of it.




Re: [OMPI users] --prefix, segfaulting

2014-09-17 Thread Nico Schlömer
> You should check that the mpiexec in your PATH also resolves to 
> /usr/bin/mpiexec and not to some other version of it.

```
$ which mpiexec
/usr/bin/mpiexec
```
Is this what you mean?
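
For good measure (on Ubuntu, `mpiexec` is typically a symlink managed by
update-alternatives, so the resolved target seems worth checking; output
omitted here):
```
$ readlink -f /usr/bin/mpiexec
$ ldd /usr/bin/mpiexec | grep libopen-rte
```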

–Nico



Re: [OMPI users] --prefix, segfaulting

2014-09-17 Thread Ralph Castain
Yeah, just wanted to make sure you were seeing the same mpiexec in both cases. 
There shouldn't be any issue with providing the complete path, though I can 
take a look.

