WHAT: Allow the runtime to handle fail-stop failures of both runtime (daemon)
and application-level processes. This patch extends the orte_process_name_t
structure with a field that stores the process epoch (the number of times the
process has died so far) and adds an application failure notification callback.
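
To make the idea concrete, here is a minimal sketch of a process name that carries an epoch, together with a failure notification callback. It is not the actual patch: every identifier prefixed with demo_ is hypothetical, the field types are assumptions, and the real orte_process_name_t and callback signature may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ORTE types; not the real definitions. */
typedef uint32_t demo_jobid_t;
typedef uint32_t demo_vpid_t;
typedef uint32_t demo_epoch_t;

/* A process name extended with an epoch field. */
typedef struct {
    demo_jobid_t jobid;
    demo_vpid_t  vpid;
    demo_epoch_t epoch;  /* number of times this process has died so far */
} demo_process_name_t;

/* Assumed shape of an application failure notification callback. */
typedef void (*demo_failure_cb_t)(const demo_process_name_t *failed, void *ctx);

static demo_failure_cb_t demo_cb = NULL;
static void *demo_cb_ctx = NULL;

/* Register the callback to invoke when a process failure is reported. */
static void demo_register_failure_cb(demo_failure_cb_t cb, void *ctx)
{
    demo_cb = cb;
    demo_cb_ctx = ctx;
}

/* Record a fail-stop failure: bump the epoch, then notify the application. */
static void demo_report_failure(demo_process_name_t *proc)
{
    proc->epoch++;
    if (demo_cb != NULL) {
        demo_cb(proc, demo_cb_ctx);
    }
}

/* A name seen earlier is stale if the process has died since it was recorded. */
static bool demo_name_is_current(const demo_process_name_t *seen,
                                 const demo_process_name_t *current)
{
    return seen->jobid == current->jobid &&
           seen->vpid  == current->vpid  &&
           seen->epoch == current->epoch;
}

/* Example callback: counts failures through the user-supplied context. */
static void demo_count_failures(const demo_process_name_t *failed, void *ctx)
{
    (void)failed;
    (*(int *)ctx)++;
}
```

In this sketch, comparing epochs lets a peer tell a message addressed to a dead incarnation of a process apart from one addressed to its restarted successor, even though jobid and vpid are unchanged.
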
Dear all,
While trying to send a message of 1610612736 bytes (1.5 GiB), I get the
following error:
[[52363,1],1][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc]
from grsacc20 to: grsacc19 error polling LP CQ with status LOCAL LENGTH ERROR
status number 1 for wr_id