On Mar 24, 2010, at 12:49 AM, Anton Starikov wrote:

> Two different OSes: centos 5.4 (2.6.18 kernel) and Fedora-12 (2.6.32 kernel)
> Two different CPUs: Opteron 248 and Opteron 8356.
> 
> same binary for OpenMPI. Same binary for user code (vasp compiled for older 
> arch)

Are you sure that the code is binary compatible between the two platforms?  Can 
you repeat the process with native builds of Open MPI and the app for both 
architectures?

> When I supply a rankfile, the results differ depending on the combination of 
> OS and CPU:
> 
> centos+Opt8356 : works
> centos+Opt248 : works
> fedora+Opt8356 : works
> fedora+Opt248 : fails
> 
> rankfile is (in case of Opt248)
> 
> rank 0=node014 slot=1
> rank 1=node014 slot=0
> 
> I tried playing with the format and leaving only one slot (starting one 
> process); it doesn't change the result.
> Without rankfile it works on all combos.

Nifty (meaning: ick!).

I wonder if the processor affinity code is causing the problem here...?  It 
could be a problem in a heterogeneous environment if the systems are "close" 
but not "exact" in terms of binary compatibility...?

> Just in case, all this happens inside of cpuset which always wraps all slots 
> given in rankfile (I just use torque with cpusets and my custom patch for 
> torque which also creates rankfile for openmpi, in this case MPI tasks are 
> bound to particular cores and multithreaded codes limited by given cpuset).
> 
> AFAIR, it also works without problem on both hardware setups with 1.3.x/1.4.0 
> and 2.6.30 kernel from OpenSuSE 11.1.
> 
> Strangely, when I run the OSU benchmarks (osu_bw etc.), they work without any 
> problems.

Can you re-run with a trivial test, like MPI hello world and/or ring?  See the 
examples/ directory.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

