We recently hit memory errors when scaling a code to 1000 cores.

Switching to SRQ, with queue values that are only a guess, appears to let the
code run:
S,4096,128:S,12288,128:S,65536,12 
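For reference, here is a rough sketch of what that spec costs in receive buffers. It assumes each colon-separated entry is type, buffer size in bytes, buffer count (the layout used by Open MPI's btl_openib_receive_queues parameter), and that a shared queue's buffers are allocated once per process rather than once per peer. The function name is made up for illustration, and the sketch ignores headers, alignment, and any optional trailing fields.

```python
def receive_queue_bytes(spec):
    """Split a queue spec into (type, size, count, total_bytes) tuples.

    Assumes entries look like "S,4096,128": queue type, buffer size in
    bytes, number of buffers. Any further fields are ignored.
    """
    queues = []
    for entry in spec.strip().split(":"):
        fields = entry.split(",")
        qtype = fields[0]
        size = int(fields[1])
        count = int(fields[2])
        queues.append((qtype, size, count, size * count))
    return queues


spec = "S,4096,128:S,12288,128:S,65536,12"
queues = receive_queue_bytes(spec)
total = sum(nbytes for _, _, _, nbytes in queues)

for qtype, size, count, nbytes in queues:
    print(f"{qtype} size={size} count={count} -> {nbytes} bytes")
print(f"total: {total} bytes ({total / 2**20:.2f} MiB)")
```

Under those assumptions the three queues come to 2883584 bytes (2.75 MiB) of receive buffers per process, so the buffers themselves are small; the estimate says nothing about which queue actually runs dry.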

Two questions:

This is a ConnectX fabric. Should I switch them to XRC queues? If so, is it
safe to assume the same queue sizes and counts?
X,4096,128:X,12288,128:X,65536,12 


When should I use one queue type over the other?

Is there a way to get statistics on how the shared queues (SRQ or XRC) are
actually being used?

For example, when running code that was not written here, I would like to be
told "hey, you are always running out of your queue of size X" or "your queue
of size Y is never used".

We are kinda blind for a lot of our applications :-)

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985



