Folks, I was unable to start a simple MPI job using the TCP BTL on a heterogeneous cluster with --mca mpi_add_procs_cutoff 0. The root cause was that the most significant bit of the jobid was set on some nodes but not on others.
This is what we have. From opal/dss/dss_types.h:

    typedef uint32_t opal_jobid_t;
    typedef uint32_t opal_vpid_t;

    typedef struct {
        opal_jobid_t jobid;
        opal_vpid_t vpid;
    } opal_process_name_t;

and from ompi/proc/proc.h:

    static inline intptr_t ompi_proc_name_to_sentinel (opal_process_name_t name)
    {
        return (*((intptr_t *) &name) << 1) | 0x1;
    }

    static inline opal_process_name_t ompi_proc_sentinel_to_name (intptr_t sentinel)
    {
        sentinel >>= 1;
        sentinel &= 0x7FFFFFFFFFFFFFFF;
        return *((opal_process_name_t *) &sentinel);
    }

From a pragmatic point of view, that means one bit of the process name is lost whenever we use the cutoff:
- on a little-endian machine, the lost bit is the MSB of the vpid, so we are just fine until two billion tasks in a job
- on a big-endian machine, the lost bit is the MSB of the jobid, and the MSB of my jobids happens to be set :-(

Note the jobid is internally used as two 16-bit unsigned ints: the most significant half is a hash of the mpirun pid and hostname, and the least significant half is the job "step". So on a big-endian machine, it is the MSB of that hash that is lost.

How should we move forward?
- the simplest option is to always clear the MSB of the hash, and keep in mind the vpid might only be 31 bits
- another option is to reimplement ompi_proc_name_to_sentinel and friend (depending on the endianness) so we always lose the same bit on all architectures. That bit could be the MSB of:
  - the vpid, if two billion tasks is an acceptable limit
  - the job hash, if a 15-bit job hash is acceptable
  - the job "step", if we really want a 16-bit job hash

A rough sketch of the second option is below my signature.

Any thoughts?

Cheers,

Gilles
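PS: to make the second option concrete, here is a rough, untested sketch that packs the two fields arithmetically instead of type-punning the struct, so the bit shifted out is always the MSB of the vpid, on little- and big-endian machines alike (31-bit vpid, full 32-bit jobid kept). Like the current code, it assumes a 64-bit intptr_t and the usual two's complement conversion; the main() is only there to show the round trip, and the jobid value with its MSB set is made up for the test.

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t opal_jobid_t;
    typedef uint32_t opal_vpid_t;

    typedef struct {
        opal_jobid_t jobid;
        opal_vpid_t vpid;
    } opal_process_name_t;

    /* build the 64-bit value with shifts instead of reading the struct's
     * bytes, so the result no longer depends on the host endianness:
     * the vpid always sits in the high word, and << 1 always drops its MSB */
    static inline intptr_t ompi_proc_name_to_sentinel (opal_process_name_t name)
    {
        uint64_t tmp = ((uint64_t) name.vpid << 32) | (uint64_t) name.jobid;
        return (intptr_t) ((tmp << 1) | 0x1);
    }

    static inline opal_process_name_t ompi_proc_sentinel_to_name (intptr_t sentinel)
    {
        uint64_t tmp = (uint64_t) sentinel >> 1;
        opal_process_name_t name;
        name.jobid = (uint32_t) (tmp & 0xFFFFFFFF);
        name.vpid = (uint32_t) (tmp >> 32);
        return name;
    }

    int main (void)
    {
        /* a made-up jobid with its MSB set, like the ones that failed here */
        opal_process_name_t in = { .jobid = 0x8abc0001, .vpid = 42 };
        intptr_t s = ompi_proc_name_to_sentinel (in);
        /* the low bit is set, so a sentinel cannot be mistaken for an
         * aligned ompi_proc_t pointer */
        assert (s & 0x1);
        opal_process_name_t out = ompi_proc_sentinel_to_name (s);
        assert (out.jobid == in.jobid && out.vpid == in.vpid);
        printf ("jobid 0x%08x vpid %u survived the round trip\n", out.jobid, out.vpid);
        return 0;
    }

With the current type-punning version, the same round trip on a big-endian node returns jobid 0x0abc0001 instead of 0x8abc0001, which is exactly the mismatch I am seeing.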