On Mar 24, 2010, at 12:49 AM, Anton Starikov wrote:

> Two different OSes: centos 5.4 (2.6.18 kernel) and Fedora-12 (2.6.32 kernel)
> Two different CPUs: Opteron 248 and Opteron 8356.
>
> same binary for OpenMPI. Same binary for user code (vasp compiled for older
> arch)

Are you sure that the code is binary compatible between the two platforms? Can you repeat the process with native builds of Open MPI and the app for both architectures?

> When I supply rankfile, then depending on combo of OS and CPU results are
> different
>
> centos+Opt8356 : works
> centos+Opt248 : works
> fedora+Opt8356 : works
> fedora+Opt248 : fails
>
> rankfile is (in case of Opt248)
>
> rank 0=node014 slot=1
> rank 1=node014 slot=0
>
> I tried play with formats, leave one slot (and start one process) - it
> doesn't change result
> Without rankfile it works on all combos.

Nifty (meaning: ick!).

I wonder if the processor affinity code is causing the problem here...? It could be a problem in a heterogeneous environment if the systems are "close" but not "exact" in terms of binary compatibility...?

> Just in case, all this happens inside of cpuset which always wraps all slots
> given in rankfile (I just use torque with cpusets and my custom patch for
> torque which also creates rankfile for openmpi, in this case MPI tasks are
> bound to particular cores and multithreaded codes limited by given cpuset).
>
> AFAIR, it also works without problem on both hardware setups with 1.3.x/1.4.0
> and 2.6.30 kernel from OpenSuSE 11.1.
>
> Strangely, but when I run OSU benchmarks (osu_bw etc), it works without any
> problems.

Can you re-run with a trivial test, like MPI hello world and/or ring? See the examples/ directory.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
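
Since the quoted report depends on the rankfile slots lying inside the cpuset that Torque creates, one quick side check is to compare the two directly on the failing node. The following is only a rough sketch, not part of Open MPI: the helper names `parse_rankfile` and `slots_outside` are made up for illustration, it handles only the plain `slot=<integer>` form shown above (socket:core and range forms are skipped), and it assumes that a plain slot number is comparable to the CPU ids the Linux kernel reports for the current process.

```python
import os
import re

# Matches simple entries such as "rank 0=node014 slot=1".
# Assumption: only a single integer slot is handled; other slot forms are skipped.
RANK_LINE = re.compile(r"^\s*rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\d+)\s*$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot) tuples from simple rankfile text."""
    entries = []
    for line in text.splitlines():
        match = RANK_LINE.match(line)
        if match:
            rank, host, slot = match.groups()
            entries.append((int(rank), host, int(slot)))
    return entries


def slots_outside(entries, allowed_cpus, host=None):
    """Return entries whose slot is not in allowed_cpus, optionally for one host."""
    return [
        entry
        for entry in entries
        if (host is None or entry[1] == host) and entry[2] not in allowed_cpus
    ]


if __name__ == "__main__":
    # The rankfile from the report, inlined here for illustration.
    rankfile_text = "rank 0=node014 slot=1\nrank 1=node014 slot=0\n"
    entries = parse_rankfile(rankfile_text)
    # os.sched_getaffinity is Linux-only; it returns the CPUs this process may use.
    allowed = os.sched_getaffinity(0)
    print("allowed CPUs:", sorted(allowed))
    for rank, host, slot in slots_outside(entries, allowed):
        print(f"rank {rank} on {host}: slot {slot} is outside the allowed set")
```

Run from inside the same Torque job on the Opteron 248 node, this would show whether the kernel reports both slots as usable before Open MPI attempts any binding. It does not replace the hello world or ring test suggested above.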