David Maxwell has finally gotten the Coverity automation stuff working to download our nightly trunk tarballs and run them through their tool. I was skimming through the results this morning and noticed one that looked odd to me: in osc_rdma_component.c:619, we have:

619                     if (NULL == datatype) {
620                         opal_output(ompi_osc_base_output,
621 "Error recreating datatype. Aborting.");
622                         ompi_mpi_abort(module->m_comm, 1, false);
623                     }

And then shortly after line 623, we use/dereference datatype. Coverity marked this as "possible NULL dereference".

"Hah", I thought, "Coverity doesn't realize that ompi_mpi_abort() will always abort, so this is a false positive."

Just for the heckuvit, I checked: ompi_mpi_abort does *not* guarantee to abort. Indeed, there are two cases where it may actually return:

1. We have logic in ompi_mpi_abort to prevent recursive invocation (ompi_mpi_abort.c:60):

    /* Protection for recursive invocation */
    if (have_been_invoked) {
        return OMPI_SUCCESS;
    }
    have_been_invoked = true;

I added this check in r13354, but I don't remember the exact case in which it was happening. :-\ I'm wondering if we should loop forever (perhaps over sleep or something) instead of returning...

Do anyone know how it could be that ompi_mpi_abort() could be invoked *recursively*? I can imagine in a THREAD_MULTIPLE scenario that multiple threads could call ompi_mpi_abort (which is I *think* is why I added that logic). In that case, do we have enough protection, or do we really need to use atomic operations to test have_been_invoked?

2. In several places, we call orte_errmgr.error_detected(). I think that we will *always* have the orte "proxy" errmgr component loaded in MPI processes, in which case errmgr.error_detected() will always call exit(), so I think we're ok there.

--
Jeff Squyres
Cisco Systems

Reply via email to