[OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-04 Thread Youri LACAN-BARTLEY
Hi,

 

This is my first post to this mailing-list, so I apologize if I'm a little
rough around the edges.

I've been digging into OpenMPI for a little while now and have come
across one issue that I just can't explain and I'm sincerely hoping
someone can put me on the right track here.

 

I'm using a fresh install of openmpi-1.2.7 and I systematically get a
segmentation fault at the end of my mpirun calls if I'm logged in as a
regular user.

However, as soon as I switch to the root account, the segfault does not
appear.

The jobs actually run to their term but I just can't find a good reason
for this to be happening and I haven't been able to reproduce the
problem on another machine.

 

Any help or tips would be greatly appreciated.

 

Thanks,

 

Youri LACAN-BARTLEY

 

Here's an example running osu_latency locally (I've "blacklisted" openib
to make sure it's not to blame):

 

[user@server ~]$ mpirun --mca btl ^openib  -np 2
/opt/scripts/osu_latency-openmpi-1.2.7

# OSU MPI Latency Test v3.3
# Size          Latency (us)
0               0.76
1               0.89
2               0.89
4               0.89
8               0.89
16              0.91
32              0.91
64              0.92
128             0.96
256             1.13
512             1.31
1024            1.69
2048            2.51
4096            5.34
8192            9.16
16384           17.47
32768           31.79
65536           51.10
131072          92.41
262144          181.74
524288          512.26
1048576         1238.21
2097152         2280.28
4194304         4616.67

[server:15586] *** Process received signal ***
[server:15586] Signal: Segmentation fault (11)
[server:15586] Signal code: Address not mapped (1)
[server:15586] Failing at address: (nil)
[server:15586] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
[server:15586] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
[server:15586] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
[server:15586] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) [0x3cd120fe61]
[server:15586] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
[server:15586] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
[server:15586] *** End of error message ***
[server:15587] *** Process received signal ***
[server:15587] Signal: Segmentation fault (11)
[server:15587] Signal code: Address not mapped (1)
[server:15587] Failing at address: (nil)
[server:15587] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
[server:15587] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
[server:15587] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
[server:15587] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) [0x3cd120fe61]
[server:15587] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
[server:15587] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
[server:15587] *** End of error message ***
mpirun noticed that job rank 0 with PID 15586 on node server exited on
signal 11 (Segmentation fault).
1 additional process aborted (not shown)
[server:15583] *** Process received signal ***
[server:15583] Signal: Segmentation fault (11)
[server:15583] Signal code: Address not mapped (1)
[server:15583] Failing at address: (nil)
[server:15583] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
[server:15583] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
[server:15583] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
[server:15583] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) [0x3cd120fe61]
[server:15583] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
[server:15583] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
[server:15583] *** End of error message ***
Segmentation fault



Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-17 Thread Jeff Squyres
Sorry for the delayed reply.

I'm afraid I haven't done much with SE Linux -- I don't know if there are any 
"gotchas" that would show up there.  SE Linux support is not something we've 
gotten a lot of requests for.  I doubt that anyone in the community has done 
much testing in this area.  :-\

I suspect that Open MPI is trying to access something that your user (under SE 
Linux) doesn't have permission to access.  

So I'm afraid I don't have much of an answer for you -- sorry!  If you do 
figure it out, though, and the fix is not too intrusive, we can probably 
incorporate it upstream.


On Mar 4, 2011, at 7:31 AM, Youri LACAN-BARTLEY wrote:

> Hi,
>  
> This is my first post to this mailing-list so I apologize for maybe being a 
> little rough on the edges.
> I’ve been digging into OpenMPI for a little while now and have come across 
> one issue that I just can’t explain and I’m sincerely hoping someone can put 
> me on the right track here.
>  
> I’m using a fresh install of openmpi-1.2.7 and I systematically get a 
> segmentation fault at the end of my mpirun calls if I’m logged in as a 
> regular user.
> However, as soon as I switch to the root account, the segfault does not 
> appear.
> The jobs actually run to their term but I just can’t find a good reason for 
> this to be happening and I haven’t been able to reproduce the problem on 
> another machine.
>  
> Any help or tips would be greatly appreciated.
>  
> Thanks,
>  
> Youri LACAN-BARTLEY
>  
> Here’s an example running osu_latency locally (I’ve “blacklisted” openib to 
> make sure it’s not to blame):
>  
> [user@server ~]$ mpirun --mca btl ^openib  -np 2 
> /opt/scripts/osu_latency-openmpi-1.2.7
> # OSU MPI Latency Test v3.3
> # Size          Latency (us)
> 0               0.76
> 1               0.89
> 2               0.89
> 4               0.89
> 8               0.89
> 16              0.91
> 32              0.91
> 64              0.92
> 128             0.96
> 256             1.13
> 512             1.31
> 1024            1.69
> 2048            2.51
> 4096            5.34
> 8192            9.16
> 16384           17.47
> 32768           31.79
> 65536           51.10
> 131072          92.41
> 262144          181.74
> 524288          512.26
> 1048576         1238.21
> 2097152         2280.28
> 4194304         4616.67
> [server:15586] *** Process received signal ***
> [server:15586] Signal: Segmentation fault (11)
> [server:15586] Signal code: Address not mapped (1)
> [server:15586] Failing at address: (nil)
> [server:15586] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
> [server:15586] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
> [server:15586] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
> [server:15586] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
> [0x3cd120fe61]
> [server:15586] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
> [server:15586] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
> [server:15586] *** End of error message ***
> [server:15587] *** Process received signal ***
> [server:15587] Signal: Segmentation fault (11)
> [server:15587] Signal code: Address not mapped (1)
> [server:15587] Failing at address: (nil)
> [server:15587] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
> [server:15587] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
> [server:15587] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
> [server:15587] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
> [0x3cd120fe61]
> [server:15587] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
> [server:15587] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
> [server:15587] *** End of error message ***
> mpirun noticed that job rank 0 with PID 15586 on node server exited on signal 
> 11 (Segmentation fault).
> 1 additional process aborted (not shown)
> [server:15583] *** Process received signal ***
> [server:15583] Signal: Segmentation fault (11)
> [server:15583] Signal code: Address not mapped (1)
> [server:15583] Failing at address: (nil)
> [server:15583] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
> [server:15583] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
> [server:15583] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
> [server:15583] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
> [0x3cd120fe61]
> [server:15583] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
> [server:15583] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
> [server:15583] *** End of error message ***
> Segmentation fault
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-18 Thread Prentice Bisbal
It's not hard to test whether or not SELinux is the problem. You can
turn SELinux off on the command-line with this command:

setenforce 0

Of course, you need to be root in order to do this.

After turning SELinux off, you can try reproducing the error. If it
still occurs, the problem is elsewhere; if it doesn't, SELinux is to
blame. When you're done, you can re-enable SELinux with

setenforce 1

If you're running your job across multiple nodes, you should disable
SELinux on all of them for testing.
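
For reference, the whole check-and-toggle sequence looks roughly like
this (just a sketch, assuming a RHEL-style system where getenforce,
sestatus and setenforce are available, and reusing the test binary from
your example):

  # check the current SELinux state (Enforcing / Permissive / Disabled)
  getenforce
  sestatus

  # as root: switch to permissive mode, re-run the test, then switch back
  setenforce 0
  mpirun --mca btl ^openib -np 2 /opt/scripts/osu_latency-openmpi-1.2.7
  setenforce 1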

Did you compile/install Open MPI yourself? If so, I suspect that the
SELinux context labels on your MPI binaries are incorrect.
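
If it does turn out to be a labelling problem, something along these
lines should show the contexts and, if needed, reset them (only a
sketch -- the /opt/openmpi prefix below is just a placeholder for
wherever you installed Open MPI):

  # show the SELinux context labels on the installed binaries and libraries
  ls -Z /opt/openmpi/bin/mpirun
  ls -Z /opt/openmpi/lib

  # relabel the install tree according to the active policy
  restorecon -Rv /opt/openmpi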

If you use the method above to determine that SELinux is the problem,
please post your results here and I may be able to help you set things
right. I have some experience with SELinux problems like this, but I'm
not exactly an expert.

--
Prentice


On 03/17/2011 11:01 AM, Jeff Squyres wrote:
> Sorry for the delayed reply.
> 
> I'm afraid I haven't done much with SE Linux -- I don't know if there are any 
> "gotchas" that would show up there.  SE Linux support is not something we've 
> gotten a lot of requests for.  I doubt that anyone in the community has done 
> much testing in this area.  :-\
> 
> I suspect that Open MPI is trying to access something that your user (under 
> SE Linux) doesn't have permission to.  
> 
> So I'm afraid I don't have much of an answer for you -- sorry!  If you do 
> figure it out, though, if a fix is not too intrusive, we can probably 
> incorporate it upstream.
> 
> 
> On Mar 4, 2011, at 7:31 AM, Youri LACAN-BARTLEY wrote:
> 
>> Hi,
>>  
>> This is my first post to this mailing-list so I apologize for maybe being a 
>> little rough on the edges.
>> I’ve been digging into OpenMPI for a little while now and have come across 
>> one issue that I just can’t explain and I’m sincerely hoping someone can put 
>> me on the right track here.
>>  
>> I’m using a fresh install of openmpi-1.2.7 and I systematically get a 
>> segmentation fault at the end of my mpirun calls if I’m logged in as a 
>> regular user.
>> However, as soon as I switch to the root account, the segfault does not 
>> appear.
>> The jobs actually run to their term but I just can’t find a good reason for 
>> this to be happening and I haven’t been able to reproduce the problem on 
>> another machine.
>>  
>> Any help or tips would be greatly appreciated.
>>  
>> Thanks,
>>  
>> Youri LACAN-BARTLEY
>>  
>> Here’s an example running osu_latency locally (I’ve “blacklisted” openib to 
>> make sure it’s not to blame):
>>  
>> [user@server ~]$ mpirun --mca btl ^openib  -np 2 
>> /opt/scripts/osu_latency-openmpi-1.2.7
>> # OSU MPI Latency Test v3.3
>> # Size          Latency (us)
>> 0               0.76
>> 1               0.89
>> 2               0.89
>> 4               0.89
>> 8               0.89
>> 16              0.91
>> 32              0.91
>> 64              0.92
>> 128             0.96
>> 256             1.13
>> 512             1.31
>> 1024            1.69
>> 2048            2.51
>> 4096            5.34
>> 8192            9.16
>> 16384           17.47
>> 32768           31.79
>> 65536           51.10
>> 131072          92.41
>> 262144          181.74
>> 524288          512.26
>> 1048576         1238.21
>> 2097152         2280.28
>> 4194304         4616.67
>> [server:15586] *** Process received signal ***
>> [server:15586] Signal: Segmentation fault (11)
>> [server:15586] Signal code: Address not mapped (1)
>> [server:15586] Failing at address: (nil)
>> [server:15586] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
>> [server:15586] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
>> [server:15586] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
>> [server:15586] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
>> [0x3cd120fe61]
>> [server:15586] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
>> [server:15586] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
>> [server:15586] *** End of error message ***
>> [server:15587] *** Process received signal ***
>> [server:15587] Signal: Segmentation fault (11)
>> [server:15587] Signal code: Address not mapped (1)
>> [server:15587] Failing at address: (nil)
>> [server:15587] [ 0] /lib64/libpthread.so.0 [0x3cd1e0eb10]
>> [server:15587] [ 1] /lib64/libc.so.6 [0x3cd166fdc9]
>> [server:15587] [ 2] /lib64/libc.so.6(__libc_malloc+0x167) [0x3cd1674dd7]
>> [server:15587] [ 3] /lib64/ld-linux-x86-64.so.2(__tls_get_addr+0xb1) 
>> [0x3cd120fe61]
>> [server:15587] [ 4] /lib64/libselinux.so.1 [0x3cd320f5cc]
>> [server:15587] [ 5] /lib64/libselinux.so.1 [0x3cd32045df]
>> [server:15587] *** End of error message ***
>> mpirun noticed that job rank 0 with PID 15586 on node server exited on 
>> signal 11 (Segmentation fault).
>> 1 additional process aborted (not shown)

Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-20 Thread Kevin . Buckley

> It's not hard to test whether or not SELinux is the problem. You can
> turn SELinux off on the command-line with this command:
>
> setenforce 0
>
> Of course, you need to be root in order to do this.
>
> After turning SELinux off, you can try reproducing the error. If it
> still occurs, the problem is elsewhere; if it doesn't, SELinux is to
> blame. When you're done, you can re-enable SELinux with
>
> setenforce 1
>
> If you're running your job across multiple nodes, you should disable
> SELinux on all of them for testing.

You are not actually disabling SELinux with setenforce 0, just
putting it into "permissive" mode: SELinux is still active.

Running SELinux in permissive mode, as opposed to disabling it at
boot time, means SELinux still logs everything that would have caused
it to step in, were it running in "enforcing" mode.

There's then a tool you can run over that log that will suggest
the ACL changes you need to make to fix the issue from an SELinux
perspective.
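
That tool is most likely audit2allow. A rough sketch of the workflow,
assuming the audit packages are installed and auditd is logging (the
module name "mpilocal" below is just a placeholder):

  # list the recent AVC denials recorded by the audit subsystem
  ausearch -m avc -ts recent

  # turn those denials into a suggested local policy module and load it
  ausearch -m avc -ts recent | audit2allow -M mpilocal
  semodule -i mpilocal.pp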

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-21 Thread Prentice Bisbal
On 03/20/2011 06:22 PM, kevin.buck...@ecs.vuw.ac.nz wrote:
> 
>> It's not hard to test whether or not SELinux is the problem. You can
>> turn SELinux off on the command-line with this command:
>>
>> setenforce 0
>>
>> Of course, you need to be root in order to do this.
>>
>> After turning SELinux off, you can try reproducing the error. If it
>> still occurs, the problem is elsewhere; if it doesn't, SELinux is to
>> blame. When you're done, you can re-enable SELinux with
>>
>> setenforce 1
>>
>> If you're running your job across multiple nodes, you should disable
>> SELinux on all of them for testing.
> 
> You are not actually disabling SELinux with setenforce 0, just
> putting it into "permissive" mode: SELinux is still active.
> 

That's correct. Thanks for catching my inaccurate choice of words.

> Running SELinux in its permissive mode, as opposed to disabling it
> at boot time, sees SELinux making a log of things that would cause
> it to dive in, were it running in "enforcing" mode.

I forgot about that. Checking those logs will make debugging even easier
for the original poster.
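
For reference, on a stock RHEL/CentOS system those denials normally end
up in /var/log/audit/audit.log (assuming auditd is running), so a quick
first look is something like:

  # show the most recent AVC denials, if there are any
  grep "avc.*denied" /var/log/audit/audit.log | tail
  ausearch -m avc -c mpirun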

> 
> There's then a tool you can run over that log that will suggest
> the ACL changes you need to make to fix the issue from an SELinux
> perspective.
> 

-- 
Prentice


Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

2011-03-23 Thread Youri LACAN-BARTLEY
Hi,

Thanks for your feedback and advice.

SELinux is currently disabled at runtime on all nodes as well as on the head 
node.
So I don't believe this is the issue here.

I have indeed compiled Open MPI myself and haven't specified anything peculiar 
other than a --prefix and --enable-mpirun-prefix-by-default.
Have I overlooked something?
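
In case it's relevant, the configure line was essentially of this shape
(the prefix below is only an example, not the actual path):

  ./configure --prefix=/opt/openmpi-1.2.7 --enable-mpirun-prefix-by-default
  make all install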

The problem doesn't occur with Open MPI 1.4.
I've tried running simple jobs directly on the head node to eliminate any 
networking or IB wizardry and mpirun systematically segfaults as a non-root 
user.

Here's one part of a strace call on mpirun that might be of some significance:
mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) 
= -1 ENOMEM (Cannot allocate memory)

For further information you can refer to the strace files attached to this 
email.
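
That mmap is asking for 4 GB of anonymous memory in a single call, so,
purely as a guess on my part, it may also be worth comparing the shell
limits seen by root and by a regular user, for example:

  # run once as root and once as the regular user, then compare
  ulimit -v   # max virtual memory, in kbytes
  ulimit -l   # max locked memory, in kbytes
  ulimit -a   # the full list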

Youri LACAN-BARTLEY

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf
Of Prentice Bisbal
Sent: Monday, March 21, 2011 14:56
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI 1.2.x segfault as regular user

On 03/20/2011 06:22 PM, kevin.buck...@ecs.vuw.ac.nz wrote:
> 
>> It's not hard to test whether or not SELinux is the problem. You can
>> turn SELinux off on the command-line with this command:
>>
>> setenforce 0
>>
>> Of course, you need to be root in order to do this.
>>
>> After turning SELinux off, you can try reproducing the error. If it
>> still occurs, the problem is elsewhere; if it doesn't, SELinux is to
>> blame. When you're done, you can re-enable SELinux with
>>
>> setenforce 1
>>
>> If you're running your job across multiple nodes, you should disable
>> SELinux on all of them for testing.
> 
> You are not actually disabling SELinux with setenforce 0, just
> putting it into "permissive" mode: SELinux is still active.
> 

That's correct. Thanks for catching my inaccurate choice of words.

> Running SELinux in its permissive mode, as opposed to disabling it
> at boot time, sees SELinux making a log of things that would cause
> it to dive in, were it running in "enforcing" mode.

I forgot about that. Checking those logs will make debugging even easier
for the original poster.

> 
> There's then a tool you can run over that log that will suggest
> the ACL changes you need to make to fix the issue from an SELinux
> perspective.
> 

-- 
Prentice
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Attachment: mpirun-strace.tar.gz