Hey! I ran into this problem many months ago, but it's been so elusive that I haven't nailed it down. The first time we saw this was last October. I did some MTT gleaning and could not find anyone but Solaris having this issue under MTT. What's interesting is that I gleaned Sun's MTT results and could not find any of these failures as far back as last October. What it looked like to me was that the shared memory segment might not have been initialized with zeros, thus allowing one of the processes to start dereferencing pointers that did not hold a valid address. However, when I was looking at this I was told the mmap file was created with ftruncate, which essentially should zero-fill the memory. So I was at a loss as to why this was happening.
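
One way to sanity-check the zero-fill assumption outside of Open MPI is a small standalone test along the lines below. This is only a sketch, not the sm BTL code: the function name and the temporary file template are made up for illustration, and it assumes a POSIX system where extending a file with ftruncate() makes the new bytes read back as zeros.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create an empty backing file, grow it to `len` bytes with ftruncate(),
 * map it MAP_SHARED, and count the bytes that are not zero.
 * Returns the count (0 means the segment was zero-filled), or -1 on error.
 * The name and the /tmp template are hypothetical, for illustration only.
 */
long count_nonzero_after_ftruncate(size_t len)
{
    char path[] = "/tmp/sm_zero_check_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);               /* the file persists until fd and mapping go away */

    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping keeps the file contents reachable */
    if (seg == MAP_FAILED) {
        return -1;
    }

    long nonzero = 0;
    for (size_t i = 0; i < len; i++) {
        if (seg[i] != 0) {
            nonzero++;
        }
    }

    munmap(seg, len);
    return nonzero;
}
```

Calling count_nonzero_after_ftruncate(1 << 20) from a small main() and printing the result is enough to try it on a given platform. A result of 0 only says the freshly created segment starts out zeroed; it says nothing about a second process reading the segment before the first one has finished setting it up.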

I was able to reproduce this for a little while by manually setting up a script that ran a small np=2 program over and over, for somewhat under 3-4 days. But around November I was unable to reproduce the issue after 4 days of runs, so I threw up my hands until I could find more failures under MTT, which for Sun I haven't.

Note that I was able to reproduce this issue with both SPARC and Intel based platforms.

--td

Ralph Castain wrote:
Hey Jeff

I seem to recall seeing the identical problem reported on the user list not long ago...or may have been the devel list. Anyway, it was during btl_sm_add_procs, and the code was segv'ing.

I don't have the archives handy here, but perhaps you might search them and see if there is a common theme here. IIRC, some of Eugene's fixes impacted this problem.

Ralph


On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:

On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:

Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT.  :-(
I can't reproduce them manually, but they seem to only happen in a
very small fraction of overall MTT runs.  I'm seeing at least 3
classes of errors:

1. btl_sm_add_procs.c:529 which is this:

       if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {

j = 3, my_smp_rank = 1.  mca_btl_sm_component.fifo[j][my_smp_rank]
appears to have a valid value in it (i.e., .fifo[3][0] = x,
.fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset, .fifo[3][3] = x+3*offset).
But gdb says:

(gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
Cannot access memory at address 0x2a96b73050



Bah -- this is a red herring; this memory is in the shared memory segment, and that memory is not saved in the corefile. So of course gdb can't access it (I just did a short controlled test and proved this to myself).
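
If the failing runs are on Linux (an assumption; the thread does not say), the kernel's per-process coredump_filter controls which mappings end up in the corefile, and file-backed shared mappings are left out by default. The sketch below shows one way a process could turn that bit on for itself. The helper names are hypothetical, and this is not something Open MPI is known to do.

```c
#include <stdio.h>

/* On Linux, bit 3 of /proc/<pid>/coredump_filter selects
 * file-backed shared mappings for inclusion in core dumps. */
#define DUMP_FILE_BACKED_SHARED (1UL << 3)

/* Pure helper: return `filter` with the file-backed shared bit set. */
unsigned long with_file_backed_shared(unsigned long filter)
{
    return filter | DUMP_FILE_BACKED_SHARED;
}

/*
 * Read this process's current filter, set the bit, and write it back.
 * Returns 0 on success, -1 if the file is missing or cannot be updated
 * (for example on a non-Linux system).
 */
int enable_shared_mapping_dumps(void)
{
    unsigned long filter = 0;

    FILE *f = fopen("/proc/self/coredump_filter", "r");
    if (f == NULL) {
        return -1;
    }
    if (fscanf(f, "%lx", &filter) != 1) {   /* the kernel prints the mask in hex */
        fclose(f);
        return -1;
    }
    fclose(f);

    f = fopen("/proc/self/coredump_filter", "w");
    if (f == NULL) {
        return -1;
    }
    fprintf(f, "0x%lx\n", with_file_backed_shared(filter));
    return fclose(f) == 0 ? 0 : -1;
}
```

Since the filter is inherited by child processes, setting it in the launching shell before starting the job should have the same effect without touching any code. Either way, corefiles get larger by roughly the size of the shared segment.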

But I don't understand why I would have a bunch of tests that all segv at btl_sm_add_procs.c:529. :-(

--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
