[OMPI devel] RFC: Resilient ORTE

2011-06-06 Thread George Bosilca
WHAT: Allow the runtime to handle fail-stop failures for both runtime (daemon) and application-level processes. This patch extends the orte_process_name_t structure with a field to store the process epoch (the number of times it has died so far), and adds an application failure notification callback
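
To make the idea concrete, here is a minimal sketch of a process name carrying an epoch, plus a failure notification callback. It is not the patch itself: every `sketch_` identifier is hypothetical, the `jobid`/`vpid` fields are an assumption about what the name already contains, and the staleness check is only one illustrative use of the epoch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a process name; NOT the real ORTE definition. */
typedef struct {
    uint32_t jobid;  /* assumed: identifies the job */
    uint32_t vpid;   /* assumed: identifies the process within the job */
    uint32_t epoch;  /* number of times this process has died so far */
} sketch_process_name_t;

/* Hypothetical callback type: invoked once per detected failure. */
typedef void (*sketch_failure_cb_t)(const sketch_process_name_t *failed,
                                    void *user_data);

static sketch_failure_cb_t registered_cb = NULL;
static void *registered_data = NULL;

/* Register the application-level failure notification callback. */
static void sketch_register_failure_cb(sketch_failure_cb_t cb, void *user_data)
{
    assert(cb != NULL);
    registered_cb = cb;
    registered_data = user_data;
}

/* Record a fail-stop failure: bump the epoch, then notify the application. */
static void sketch_report_failure(sketch_process_name_t *name)
{
    name->epoch++;
    if (registered_cb != NULL) {
        registered_cb(name, registered_data);
    }
}

/* Illustrative use of the epoch: a name that refers to an earlier
 * incarnation of the same process is considered stale. */
static bool sketch_is_stale(const sketch_process_name_t *current,
                            const sketch_process_name_t *addressed)
{
    return current->jobid == addressed->jobid &&
           current->vpid == addressed->vpid &&
           addressed->epoch < current->epoch;
}

/* Example callback: counts failures through the user_data pointer. */
static void count_failures(const sketch_process_name_t *failed, void *user_data)
{
    (void)failed;
    ++*(int *)user_data;
}
```

In this sketch an application would call `sketch_register_failure_cb(count_failures, &counter)` once at startup; each later `sketch_report_failure()` increments the epoch before the callback runs, so the callback always sees the new incarnation number.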

[OMPI devel] openib error for message size 1.5 GB

2011-06-06 Thread Sebastian Rinke
Dear all, While trying to send a message of size 1610612736 B (1.5 GB), I get the following error: [[52363,1],1][../../../../../../ompi/mca/btl/openib/btl_openib_component.c:2951:handle_wc] from grsacc20 to: grsacc19 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id