I may have finally tracked this down. At least, I can now get the correct devel
map to come out, and found a memory corruption issue that only impacted hetero
operations. I can’t know if this is the root cause of the problem Bill is
seeing, however, as I have no way of actually running the job.
Hello Ralph and everybody,
The issue was finally tracked down. It had nothing to do with OpenMPI.
The LSF environment variable LSF_DJOB_DISABLED was set to 'y'. This was
preventing OpenMPI from launching jobs spanning multiple machines.
Thank you all for your help and suggestions.
Thanks,
Rahul
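
For anyone who hits the same symptom, a quick check of the variable might look something like the sketch below. It only inspects the environment of the process it runs in; the helper name djob_disabled is made up for illustration, and treating 'Y' the same as 'y' is an assumption rather than documented LSF behavior.

```python
import os


def djob_disabled(env=None):
    """Return True if LSF_DJOB_DISABLED is set to 'y' in the given environment.

    Hypothetical helper: comparing case-insensitively is an assumption,
    not something taken from the LSF documentation.
    """
    env = os.environ if env is None else env
    return env.get("LSF_DJOB_DISABLED", "").strip().lower() == "y"


if __name__ == "__main__":
    if djob_disabled():
        print("LSF_DJOB_DISABLED is set to 'y'; multi-host launches may be blocked")
    else:
        print("LSF_DJOB_DISABLED is not set to 'y'")
```

Running it from inside the job script, rather than from a login shell, should show what the launched job actually sees.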
I'm sorry I haven't been able to get the lstopo information for
all the nodes, but I had to get the latest version of hwloc installed
first. They've even added in some more modern blades that also
support hyperthreading, ugh. They've also been doing some memory
upgrades as well.
I'm trying to get
No need for the lstopo data anymore, Bill - I was able to recreate the
situation using some very nice hwloc functions plus your prior
descriptions. I'm not totally confident that this fix will resolve the
problem, but it will clear out at least one issue.
We'll just have to see what happens and a