Maybe this should go to the devel list but I'll start here.

In tracking the way the PBS tm API propagates error information
back to clients, I noticed that Open MPI is making an incorrect
assumption.  (I'm looking at 1.3.2.) The relevant code in
orte/mca/plm/tm/plm_tm_module.c is:

    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                           " return status = %d", rc);
            goto cleanup;
        }
    }

My reading of the tm API is that tm_poll() can (and will) return
TM_SUCCESS (0) even when the tm_spawn event being waited on has failed,
i.e. local_err needs to be checked even when rc is 0.  The TM_ error
codes (the rc values) appear to cover tm protocol failures or incorrect
calls into tm, whereas local_err reports why the requested action itself
failed and is usually some internal PBSE_ error code.  In fact it's
probably always PBSE_SYSTEM (15010) - at least I think it is for tm_spawn().
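
To make the two levels concrete, a complete spawn-and-poll sequence
against the tm API looks something like the sketch below.  This is
hand-written for this message and untested; it assumes the usual tm.h
prototypes (tm_init/tm_nodeinfo/tm_spawn/tm_poll) and just spawns
/bin/hostname on the job's first node, checking both rc and local_err.

    #include <stdio.h>
    #include <stdlib.h>
    #include <tm.h>

    extern char **environ;

    int main(void)
    {
        struct tm_roots roots;
        tm_node_id *nodes;
        tm_task_id tid;
        tm_event_t event, polled;
        int nnodes, local_err = 0;
        char *spawn_argv[] = { "/bin/hostname", NULL };

        if (TM_SUCCESS != tm_init(NULL, &roots)) {
            fprintf(stderr, "tm_init failed\n");
            return EXIT_FAILURE;
        }
        if (TM_SUCCESS != tm_nodeinfo(&nodes, &nnodes) || nnodes < 1) {
            fprintf(stderr, "tm_nodeinfo failed\n");
            return EXIT_FAILURE;
        }
        if (TM_SUCCESS != tm_spawn(1, spawn_argv, environ, nodes[0],
                                   &tid, &event)) {
            fprintf(stderr, "tm_spawn failed\n");
            return EXIT_FAILURE;
        }

        /* Level 1: did the poll itself (the tm protocol) succeed? */
        if (TM_SUCCESS != tm_poll(TM_NULL_EVENT, &polled, 1, &local_err)) {
            fprintf(stderr, "tm_poll failed\n");
            return EXIT_FAILURE;
        }
        /* Level 2: did the spawn the event refers to succeed?  On
         * failure local_err holds a PBSE_ code (e.g. PBSE_SYSTEM)
         * even though tm_poll() itself returned TM_SUCCESS. */
        if (0 != local_err) {
            fprintf(stderr, "spawn failed, local_err = %d\n", local_err);
            return EXIT_FAILURE;
        }

        printf("spawned task %lu\n", (unsigned long) tid);
        tm_finalize();
        return EXIT_SUCCESS;
    }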

Something like the following is probably closer to what is needed.

    /* TM poll for all the spawns */
    for (i = 0; i < launched; ++i) {
        rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
        if (TM_SUCCESS != rc) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
                           " return status = %d", rc);
            goto cleanup;
        }
        if (0 != local_err) {
            errno = local_err;
            opal_output(0, "plm:tm: failed to spawn daemon,"
                           " error code = %d", errno);
            goto cleanup;
        }
    }

I checked torque 2.3.3 to confirm that its tm behaviour is the same as
OpenPBS in this respect. No idea about PBSPro.


David
