Could be true; it unfortunately doesn't help us for 1.3.1, though.  :-(

Maybe I'll add a big memset of 0 across the sm segment at the beginning of time and see if this problem goes away.
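
For the record, a minimal sketch of what that blanket zero-init could look like; the path, length, and error handling here are placeholders, not the actual sm BTL setup code:

    /* Sketch only: map the sm backing file and force every byte to a
     * known zero state before any FIFO structures are carved out of it. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void *map_and_zero_sm_segment(const char *path, size_t len)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, (off_t) len) != 0) {
            close(fd);
            return NULL;
        }
        void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (MAP_FAILED == seg) {
            return NULL;
        }
        memset(seg, 0, len);   /* the "big memset of 0" */
        return seg;
    }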



On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:

I actually wasn't implying that Eugene's changes -caused- the problem,
but rather that I thought they might have -fixed- the problem.

:-)


On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:

> I forgot to mention that, since I ran into this issue so long ago, I
> really doubt that Eugene's SM changes caused it.
>
> --td
>
> Terry Dontje wrote:
>> Hey!!!  I ran into this problem many months ago, but it's been so
>> elusive that I haven't nailed it down.  The first time we saw this
>> was last October.  I did some MTT gleaning and could not find
>> anyone but Solaris having this issue under MTT.  What's interesting
>> is that I gleaned Sun's MTT results and could not find any of these
>> failures going as far back as last October.
>> What it looked like to me was that the shared memory segment might
>> not have been initialized with 0's, thus allowing one of the
>> processes to start reading locations that did not yet hold a valid
>> address.  However, when I was looking at this I was told the mmap
>> file was created with ftruncate, which should essentially zero-fill
>> the memory.  So I was at a loss as to why this was happening.
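>>
>> A tiny standalone POSIX test of that assumption (nothing Open MPI
>> specific; the path and size below are just placeholders) would look
>> something like:
>>
>>     #include <fcntl.h>
>>     #include <stdio.h>
>>     #include <sys/mman.h>
>>     #include <unistd.h>
>>
>>     int main(void)
>>     {
>>         /* Placeholder path and size, just to exercise ftruncate + mmap. */
>>         int fd = open("/tmp/sm_zero_test", O_RDWR | O_CREAT | O_TRUNC, 0600);
>>         if (fd < 0 || ftruncate(fd, 4096) != 0) {
>>             return 1;
>>         }
>>
>>         void *m = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
>>         if (MAP_FAILED == m) {
>>             return 1;
>>         }
>>         unsigned char *p = m;
>>
>>         /* If ftruncate really zero-fills, every byte should read back as 0. */
>>         for (int i = 0; i < 4096; ++i) {
>>             if (p[i] != 0) {
>>                 printf("non-zero byte at offset %d\n", i);
>>                 return 1;
>>             }
>>         }
>>
>>         printf("all 4096 bytes are zero\n");
>>         munmap(m, 4096);
>>         close(fd);
>>         unlink("/tmp/sm_zero_test");
>>         return 0;
>>     }
>>
>> Per POSIX, the bytes a file gains from ftruncate read back as zeros,
>> so if something like this ever failed, it would point at the platform
>> rather than the sm BTL.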
>>
>> I was able to reproduce this for a little while by manually setting
>> up a script that ran a small np=2 program over and over for
>> somewhere under 3-4 days.  But around November I was unable to
>> reproduce the issue even after 4 days of runs, and I threw up my
>> hands until I could find more failures under MTT, which for Sun I
>> haven't.
>>
>> Note that I was able to reproduce this issue on both SPARC- and
>> Intel-based platforms.
>>
>> --td
>>
>> Ralph Castain wrote:
>>> Hey Jeff
>>>
>>> I seem to recall seeing the identical problem reported on the user
>>> list not long ago... or it may have been the devel list.  Anyway, it
>>> was during btl_sm_add_procs, and the code was segv'ing.
>>>
>>> I don't have the archives handy here, but perhaps you might search
>>> them and see if there is a common theme here. IIRC, some of
>>> Eugene's fixes impacted this problem.
>>>
>>> Ralph
>>>
>>>
>>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>>
>>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>>>
>>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>>>>> MTT.  :-(  I can't reproduce them manually, but they seem to only
>>>>> happen in a very small fraction of overall MTT runs.  I'm seeing
>>>>> at least 3 classes of errors:
>>>>>
>>>>> 1. btl_sm_add_procs.c:529, which is this:
>>>>>
>>>>>       if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>>>
>>>>> j = 3, my_smp_rank = 1.  mca_btl_sm_component.fifo[j][my_smp_rank]
>>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>>>> .fifo[3][3] = x+3*offset).  But gdb says:
>>>>>
>>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>>>> Cannot access memory at address 0x2a96b73050
>>>>>
>>>>
>>>>
>>>> Bah -- this is a red herring; this memory is in the shared memory
>>>> segment, and that memory is not saved in the corefile.  So of
>>>> course gdb can't access it (I just did a short controlled test
>>>> and proved this to myself).
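>>>>
>>>> One Linux-specific thing that might help with future core dumps (an
>>>> assumption about the debugging setup, not anything in the sm BTL
>>>> itself) is asking the kernel to also dump file-backed shared
>>>> mappings via /proc/self/coredump_filter, roughly:
>>>>
>>>>     #include <stdio.h>
>>>>
>>>>     /* Assumes Linux >= 2.6.23: bit 3 (0x8) of coredump_filter tells
>>>>      * the kernel to also dump file-backed shared mappings (which is
>>>>      * what the mmap'ed sm segment is); 0x33 is the usual default. */
>>>>     static void dump_shared_mappings_too(void)
>>>>     {
>>>>         FILE *f = fopen("/proc/self/coredump_filter", "w");
>>>>         if (NULL != f) {
>>>>             fputs("0x3b", f);
>>>>             fclose(f);
>>>>         }
>>>>     }
>>>>
>>>> called early in each MPI process, before the segv happens.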
>>>>
>>>> But I don't understand why I would have a bunch of tests that all
>>>> segv at btl_sm_add_procs.c:529.  :-(
>>>>
>>>> --
>>>> Jeff Squyres
>>>> Cisco Systems
>>>>
>>>
>>
>>
>

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Cisco Systems
