Eugene Loh wrote:
Ralph Castain wrote:

Hi Bryan

I have seen similar issues on LANL clusters when message sizes were fairly large. How big are your buffers when you call Allreduce? Can you send us your Allreduce call params (e.g., the reduce operation, datatype, num elements)?

If you don't want to send that to the list, you can send it to me at LANL.

I haven't seen any updates on this. Please tell me Bryan sent info to Ralph at LANL and Ralph nailed this one. Please! :^)

Ralph and I took this offline.

So far I've been unable to reproduce the problem on a node of roadrunner, which has 4 x86_64 cores, openmpi 1.3.2, and sm for transport. That openmpi was built with some special platform files rather than a plain configure run. Ralph sent me the platform files, and I'm about to build my own version on the small 8-core machine where the problem first showed up.

I'll report more as soon as I know more.  Hopefully in the morning.

        - Bryan

--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico
