Hi David

You are quite correct. IIRC, we didn't bother checking local_err because
we found it to be unreliable - all Torque checks is that the program execs.
It doesn't report back an error if the program segfaults instantly, for
example, or aborts because it fails to find a required library. So we added
a simple timer that declares the launch a failure if the daemon(s) fail to
report back within a specified time.
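
For illustration only, the timer idea can be sketched roughly as below. This
is not the actual Open MPI code: the launch_timer_* names, the struct layout
and the choice to pass the current time in as a parameter are all assumptions
made to keep the example small and self-contained.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical launch tracker; none of these names come from Open MPI. */
typedef struct {
    time_t start;        /* when the daemons were spawned */
    double timeout_sec;  /* how long to wait for them to report back */
    int    expected;     /* number of daemons launched */
    int    reported;     /* number that have reported back so far */
} launch_timer_t;

typedef enum { LAUNCH_PENDING, LAUNCH_OK, LAUNCH_FAILED } launch_state_t;

/* Arm the timer right after the spawn requests have been issued. */
static void launch_timer_start(launch_timer_t *t, int expected,
                               double timeout_sec, time_t now)
{
    assert(t != NULL && expected > 0);
    t->start = now;
    t->timeout_sec = timeout_sec;
    t->expected = expected;
    t->reported = 0;
}

/* Call once each time a daemon reports back. */
static void launch_timer_daemon_reported(launch_timer_t *t)
{
    if (t->reported < t->expected) {
        t->reported++;
    }
}

/* Success once every daemon has reported; failure once the deadline
 * passes with daemons still missing; otherwise keep waiting. */
static launch_state_t launch_timer_check(const launch_timer_t *t, time_t now)
{
    if (t->reported >= t->expected) {
        return LAUNCH_OK;
    }
    if (difftime(now, t->start) >= t->timeout_sec) {
        return LAUNCH_FAILED;
    }
    return LAUNCH_PENDING;
}
```

A caller would arm the timer with time(NULL) after spawning and call
launch_timer_check() periodically. The point of the design is that a daemon
that execs and then dies immediately is still caught, because it never
reports back, regardless of what the spawn call itself returned.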
Hi David

This didn't cause any problems, so I went ahead and put it in our devel
trunk. Barring any subsequent error reports, I'll move it over to the 1.3
series.

Thanks!
Ralph




> However, it can't hurt to check the flag as well. I'll test it out first
> just to ensure we don't get false failures.
>
> Thanks
> Ralph
>
> On Aug 12, 2009, at 11:33 PM, David Singleton wrote:
>>
>>
>> Maybe this should go to the devel list but I'll start here.
>>
>> In tracking the way the PBS tm API propagates error information
>> back to clients, I noticed that Open MPI is making an incorrect
>> assumption.  (I'm looking at 1.3.2.) The relevant code in
>> orte/mca/plm/tm/plm_tm_module.c is:
>>
>>   /* TM poll for all the spawns */
>>   for (i = 0; i < launched; ++i) {
>>       rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>>       if (TM_SUCCESS != rc) {
>>           errno = local_err;
>>           opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>>                          " return status = %d", rc);
>>           goto cleanup;
>>       }
>>   }
>>
>> My reading of the way the tm API works is that tm_poll() can (will)
>> return TM_SUCCESS(0) even when the tm_spawn event being waited on failed,
>> i.e. local_err needs to be checked even if rc=0.  It looks like TM_
>> errors (rc values) come from tm protocol failures or incorrect calls
>> to tm, whereas local_err says why the requested action itself failed
>> and is usually some sort of internal PBSE_ error code.  In fact it's
>> probably always PBSE_SYSTEM (15010) - I think it is for tm_spawn().
>>
>> Something like the following is probably closer to what is needed.
>>
>>   /* TM poll for all the spawns */
>>   for (i = 0; i < launched; ++i) {
>>       rc = tm_poll(TM_NULL_EVENT, &event, 1, &local_err);
>>       if (TM_SUCCESS != rc) {
>>           errno = local_err;
>>           opal_output(0, "plm:tm: failed to poll for a spawned daemon,"
>>                          " return status = %d", rc);
>>           goto cleanup;
>>       }
>>       if (local_err != 0) {
>>           errno = local_err;
>>           opal_output(0, "plm:tm: failed to spawn daemon,"
>>                          " error code = %d", errno);
>>           goto cleanup;
>>       }
>>   }
>>
>> I checked Torque 2.3.3 to confirm that its tm behaviour is the same as
>> OpenPBS in this respect. No idea about PBSPro.
>>
>>
>> David
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
