Well, I have a new wrench to throw into this situation.
We had a power failure at our datacenter that took down our entire system:
nodes, switch, and SM.
Now I am unable to reproduce the error with the oob default ibflags, etc.

Does this shed any light on the issue?  It also makes it hard to now debug the 
issue without being able to reproduce it.

Any thoughts?  Am I overlooking something? 

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On May 17, 2011, at 2:18 PM, Brock Palen wrote:

> Sorry, typo: 314, not 313.
> 
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
> 
> 
> 
> On May 17, 2011, at 2:02 PM, Brock Palen wrote:
> 
>> Thanks, I thought of looking at ompi_info after I sent that note, sigh.
>> 
>> SEND_INPLACE appears to help performance of larger messages in my synthetic 
>> benchmarks over regular SEND.  Also it appears that SEND_INPLACE still 
>> allows our code to run.
>> 
>> We are working on getting devs access to our system and code.
>> 
>> Brock Palen
>> www.umich.edu/~brockp
>> Center for Advanced Computing
>> bro...@umich.edu
>> (734)936-1985
>> 
>> 
>> 
>> On May 16, 2011, at 11:49 AM, George Bosilca wrote:
>> 
>>> Here is the output of the "ompi_info --param btl openib":
>>> 
>>>               MCA btl: parameter "btl_openib_flags" (current value: <306>,
>>>                        data source: default value)
>>>                        BTL bit flags (general flags: SEND=1, PUT=2, GET=4,
>>>                        SEND_INPLACE=8, RDMA_MATCHED=64,
>>>                        HETEROGENEOUS_RDMA=256; flags only used by the "dr"
>>>                        PML (ignored by others): ACK=16, CHECKSUM=32,
>>>                        RDMA_COMPLETION=128; flags only used by the "bfo"
>>>                        PML (ignored by others): FAILOVER_SUPPORT=512)
>>> 
>>> So the value 305 means: HETEROGENEOUS_RDMA | CHECKSUM | ACK | SEND. Most of 
>>> these flags are totally useless in the current version of Open MPI (DR is 
>>> not supported), so the only flags that really matter are SEND | 
>>> HETEROGENEOUS_RDMA.
>>> 
>>> If you want to enable the send protocol, try first with SEND | SEND_INPLACE 
>>> (9); if that does not work, downgrade to SEND (1).
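
The flag arithmetic above can be checked with a short script. This is only a sketch: the bit values are copied from the ompi_info help text quoted above, while the BtlFlag class and the decode helper are hypothetical names made up for illustration and are not part of Open MPI.

```python
from enum import IntFlag


class BtlFlag(IntFlag):
    """Bit values as listed in the ompi_info help text (names are illustrative)."""
    SEND = 1
    PUT = 2
    GET = 4
    SEND_INPLACE = 8
    ACK = 16                  # "dr" PML only
    CHECKSUM = 32             # "dr" PML only
    RDMA_MATCHED = 64
    RDMA_COMPLETION = 128     # "dr" PML only
    HETEROGENEOUS_RDMA = 256
    FAILOVER_SUPPORT = 512    # "bfo" PML only


def decode(value):
    """Return the names of the flags set in value, lowest bit first."""
    return [flag.name for flag in BtlFlag if value & flag]


if __name__ == "__main__":
    # 305 = 256 + 32 + 16 + 1
    print(decode(305))  # ['SEND', 'ACK', 'CHECKSUM', 'HETEROGENEOUS_RDMA']
    # Building a value from names: SEND | SEND_INPLACE
    print(int(BtlFlag.SEND | BtlFlag.SEND_INPLACE))  # 9
```

Decoding 305 this way shows that neither PUT (2) nor GET (4) is set, which matches the reading that 305 turns off RDMA PUT/GET and turns on SEND.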
>>> 
>>> george.
>>> 
>>> On May 16, 2011, at 11:33 , Samuel K. Gutierrez wrote:
>>> 
>>>> 
>>>> On May 16, 2011, at 8:53 AM, Brock Palen wrote:
>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On May 16, 2011, at 10:23 AM, Samuel K. Gutierrez wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Just out of curiosity - what happens when you add the following MCA 
>>>>>> option to your openib runs?
>>>>>> 
>>>>>> -mca btl_openib_flags 305
>>>>> 
>>>>> You Sir found the magic combination.
>>>> 
>>>> :-)  - cool.
>>>> 
>>>> Developers - does this smell like a registered memory availability hang?
>>>> 
>>>>> I verified this lets IMB and CRASH progress past their lockup points.
>>>>> I will have a user test this.
>>>> 
>>>> Please let us know what you find.
>>>> 
>>>>> Is this an ok option to put in our environment?  What does 305 mean?
>>>> 
>>>> There may be a performance hit associated with this configuration, but if 
>>>> it lets your users run, then I don't see a problem with adding it to your 
>>>> environment.
>>>> 
>>>> If I'm reading things correctly, 305 turns off RDMA PUT/GET and turns on 
>>>> SEND.
>>>> 
>>>> OpenFabrics gurus - please correct me if I'm wrong :-).
>>>> 
>>>> Samuel Gutierrez
>>>> Los Alamos National Laboratory
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Brock Palen
>>>>> www.umich.edu/~brockp
>>>>> Center for Advanced Computing
>>>>> bro...@umich.edu
>>>>> (734)936-1985
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Samuel Gutierrez
>>>>>> Los Alamos National Laboratory
>>>>>> 
>>>>>> On May 13, 2011, at 2:38 PM, Brock Palen wrote:
>>>>>> 
>>>>>>> On May 13, 2011, at 4:09 PM, Dave Love wrote:
>>>>>>> 
>>>>>>>> Jeff Squyres <jsquy...@cisco.com> writes:
>>>>>>>> 
>>>>>>>>> On May 11, 2011, at 3:21 PM, Dave Love wrote:
>>>>>>>>> 
>>>>>>>>>> We can reproduce it with IMB.  We could provide access, but we'd 
>>>>>>>>>> have to
>>>>>>>>>> negotiate with the owners of the relevant nodes to give you 
>>>>>>>>>> interactive
>>>>>>>>>> access to them.  Maybe Brock's would be more accessible?  (If you
>>>>>>>>>> contact me, I may not be able to respond for a few days.)
>>>>>>>>> 
>>>>>>>>> Brock has replied off-list that he, too, is able to reliably 
>>>>>>>>> reproduce the issue with IMB, and is working to get access for us.  
>>>>>>>>> Many thanks for your offer; let's see where Brock's access takes us.
>>>>>>>> 
>>>>>>>> Good.  Let me know if we could be useful
>>>>>>>> 
>>>>>>>>>>> -- we have not closed this issue,
>>>>>>>>>> 
>>>>>>>>>> Which issue?   I couldn't find a relevant-looking one.
>>>>>>>>> 
>>>>>>>>> https://svn.open-mpi.org/trac/ompi/ticket/2714
>>>>>>>> 
>>>>>>>> Thanks.  In case it's useful info, it hangs for me with 1.5.3 & np=32 
>>>>>>>> on ConnectX with more than one collective; I can't recall which.
>>>>>>> 
>>>>>>> Extra data point: that ticket said it ran with mpi_preconnect_mpi 1.  
>>>>>>> That doesn't help here; both my production code (CRASH) and IMB 
>>>>>>> still hang.
>>>>>>> 
>>>>>>> 
>>>>>>> Brock Palen
>>>>>>> www.umich.edu/~brockp
>>>>>>> Center for Advanced Computing
>>>>>>> bro...@umich.edu
>>>>>>> (734)936-1985
>>>>>>> 
>>>>>>>> 
>>>>>>>> -- 
>>>>>>>> Excuse the typping -- I have a broken wrist
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>>> 
>>>>>>>> 
>>> 
>>> George Bosilca
>>> Research Assistant Professor
>>> Innovative Computing Laboratory
>>> Department of Electrical Engineering and Computer Science
>>> University of Tennessee, Knoxville
>>> http://web.eecs.utk.edu/~bosilca/