We recently hit memory errors when scaling a code to 1000 cores.
Switching to SRQ, with some guessed queue values, appears to let the code run:

  S,4096,128:S,12288,128:S,65536,12

Two questions. This is a ConnectX fabric: should I switch these to XRC queues? And should I keep the same queue sizes and counts? Is it safe to assume the following is equivalent?

  X,4096,128:X,12288,128:X,65536,12

When should I use one queue type over the other?

Is there a way to get statistics on the use of the shared queues (SRQ or XRC)? For example, when running code "not from here", I would like to know "hey, you are always running out of your queue of size X" or "your queue of size Y is never used". We are kinda blind for a lot of our applications :-)

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985
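To reason about the queue strings above, it helps to read each colon-separated entry as type, buffer size in bytes, and buffer count. The sketch below assumes that field layout (as in Open MPI's openib receive-queue syntax, where optional trailing fields such as watermarks may follow the count). It parses a spec, reports size × count for each entry as a rough figure for the receive-buffer bytes that entry describes, and rewrites the S entries as X entries with identical sizes and counts. The helper names are hypothetical, and the sketch does not model how a given MPI library actually allocates or registers that memory per process.

```python
from dataclasses import dataclass


@dataclass
class RecvQueue:
    kind: str     # 'P' = per-peer, 'S' = shared (SRQ), 'X' = XRC (assumed letters)
    size: int     # buffer size in bytes
    count: int    # number of buffers
    extra: tuple  # optional trailing fields (e.g. watermarks), kept untouched


def parse_queues(spec: str) -> list[RecvQueue]:
    """Parse 'T,size,count[,...]:T,size,count[,...]' into RecvQueue objects."""
    queues = []
    for entry in spec.split(":"):
        fields = entry.split(",")
        if len(fields) < 3 or fields[0] not in ("P", "S", "X"):
            raise ValueError(f"unrecognized queue entry: {entry!r}")
        queues.append(RecvQueue(fields[0], int(fields[1]), int(fields[2]),
                                tuple(int(f) for f in fields[3:])))
    return queues


def buffer_bytes(q: RecvQueue) -> int:
    """Bytes of receive buffers described by one entry (size x count)."""
    return q.size * q.count


def to_xrc(spec: str) -> str:
    """Replace each entry's type letter with 'X', keeping all other fields."""
    return ":".join(",".join(["X"] + entry.split(",")[1:])
                    for entry in spec.split(":"))


if __name__ == "__main__":
    spec = "S,4096,128:S,12288,128:S,65536,12"
    for q in parse_queues(spec):
        print(q.kind, q.size, q.count, buffer_bytes(q))
    print("total bytes:", sum(buffer_bytes(q) for q in parse_queues(spec)))
    print(to_xrc(spec))
```

For the SRQ string in the question this gives 524288, 1572864, and 786432 bytes per entry (2883584 in total), and the XRC rewrite is exactly the X string quoted above. Whether those same sizes and counts are appropriate for XRC is the open question in the post; the sketch only shows that the two strings describe the same buffer geometry.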