Folks,

I was unable to start a simple MPI job using the TCP btl on a
heterogeneous cluster with --mca mpi_add_procs_cutoff 0.
The root cause was that the most significant bit of the jobid was set
on some nodes but not on others.

This is what we have:

from opal/dss/dss_types.h

typedef uint32_t opal_jobid_t;
typedef uint32_t opal_vpid_t;
typedef struct {
    opal_jobid_t jobid;
    opal_vpid_t vpid;
} opal_process_name_t;


and from ompi/proc/proc.h

static inline intptr_t ompi_proc_name_to_sentinel (opal_process_name_t name)
{
  /* type-pun the 64-bit (jobid, vpid) pair and tag bit 0, so the
   * result can be told apart from a real (aligned) ompi_proc_t pointer */
  return (*((intptr_t *) &name) << 1) | 0x1;
}

static inline opal_process_name_t ompi_proc_sentinel_to_name (intptr_t sentinel)
{
  /* undo the shift; the mask clears whatever the arithmetic right
   * shift replicated into the top bit */
  sentinel >>= 1;
  sentinel &= 0x7FFFFFFFFFFFFFFF;
  return *((opal_process_name_t *) &sentinel);
}


From a pragmatic point of view, this means one bit of the process name
is lost whenever we use the cutoff:
- on a little endian machine, it is the MSB of the vpid, so we are
just fine up to about two billion tasks in a job
- on a big endian machine, it is the MSB of the jobid, and the MSB of
my jobids happens to be set :-(
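
For the record, here is a minimal standalone reproducer, assuming
64-bit pointers (the jobid value 0x80001234 is just a made-up example
with the MSB set):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t opal_jobid_t;
typedef uint32_t opal_vpid_t;
typedef struct {
    opal_jobid_t jobid;
    opal_vpid_t vpid;
} opal_process_name_t;

static inline intptr_t name_to_sentinel (opal_process_name_t name)
{
    return (*((intptr_t *) &name) << 1) | 0x1;
}

static inline opal_process_name_t sentinel_to_name (intptr_t sentinel)
{
    sentinel >>= 1;
    sentinel &= 0x7FFFFFFFFFFFFFFF;
    return *((opal_process_name_t *) &sentinel);
}

int main (void)
{
    /* made-up jobid with its MSB set */
    opal_process_name_t in = { .jobid = 0x80001234, .vpid = 42 };
    opal_process_name_t out = sentinel_to_name(name_to_sentinel(in));

    printf("in : jobid=0x%08" PRIx32 " vpid=%" PRIu32 "\n", in.jobid, in.vpid);
    printf("out: jobid=0x%08" PRIx32 " vpid=%" PRIu32 "\n", out.jobid, out.vpid);
    /* little endian: round trips fine (only the vpid MSB is at risk)
     * big endian   : out.jobid is 0x00001234, the hash MSB is gone */
    return 0;
}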

Note the jobid is internally made of two 16-bit unsigned integers:
the most significant half is a hash of the mpirun pid and hostname,
and the least significant half is the job "step".
On a big endian machine, it is therefore the MSB of the hash that is lost.
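
In other words, the layout is the following (make_jobid is an
illustrative helper, not the actual ORTE macro):

#include <stdint.h>

/* illustrative helper, not the actual ORTE macro:
 *   bits 31..16 : 16-bit hash of the mpirun pid and hostname
 *   bits 15..0  : job "step" within that mpirun instance */
static inline uint32_t make_jobid (uint16_t hash, uint16_t step)
{
    return ((uint32_t) hash << 16) | step;
}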


How should we move forward?

- The simplest option is to always clear the MSB of the hash, and keep
in mind the vpid might only be 31 bits.
- Another option is to reimplement ompi_proc_name_to_sentinel and
friend (depending on the endianness) so we always lose the same bit
(see the sketch after this list). This bit could be the MSB of:
  - the vpid, if 2 billion tasks is an acceptable limit
  - the job hash, if a 15-bit job hash is acceptable
  - the job "step", if we really want a 16-bit job hash


Any thoughts?

Cheers,

Gilles
