Eugene Loh wrote:
> Ralph Castain wrote:
>> Hi Bryan
>>
>> I have seen similar issues on LANL clusters when message sizes were
>> fairly large. How big are your buffers when you call Allreduce? Can
>> you send us your Allreduce call params (e.g., the reduce operation,
>> datatype, num elements)?
>>
>> If you don't want to send that to the list, you can send it to me at
>> LANL.
>
> I haven't seen any updates on this. Please tell me Bryan sent info to
> Ralph at LANL and Ralph nailed this one. Please! :^)

Ralph and I took this offline.
So far I've been unable to reproduce the problem on a node of Roadrunner,
which has 4 x86_64 cores, openmpi 1.3.2, and sm for transport. That
openmpi was built with some special platform files rather than a plain
configure run. Ralph sent me the platform files, and I'm about to build
my own version on the small 8-core machine where the problem first
showed up.
I'll report more as soon as I know more. Hopefully in the morning.
- Bryan
--
Bryan Lally, la...@lanl.gov
505.667.9954
CCS-2
Los Alamos National Laboratory
Los Alamos, New Mexico