Hello again. I have been tracing through the PML_OB1 files. I started following an MPI_Ssend() to find where a message is stored (on the sender) when it is not sent until the receiver posts the recv, but I didn't find that place.

I've noticed that the message to be sent enters mca_pml_ob1_rndv_completion_request() (pml_ob1_sendreq.c), and that rc = send_request_pml_complete_check(sendreq) returns false when the request hasn't been completed, but the execution never passes through MCA_PML_OB1_PROGRESS_PENDING; at least, none of the possible options is executed.

So, re-orienting my question: where is this message stored until delivery? And is there any way to know that the receiver has gone down? With this information I will be able to detect the failure of the receiver and try to resend the message to another place.

Thanks again.

Hugo Meyer

2011/11/17 Hugo Daniel Meyer <[email protected]>

> Hello @ll.
>
> I'm making some changes in the communication framework. Right now I'm
> working on a "secure" MPI_Send. This send needs to know when an endpoint
> goes down, and then retry the communication by constructing a new endpoint,
> or at least by overwriting the data of the old endpoint with the new address
> of the receiver process. Overwriting the data of the endpoint is no longer a
> problem, because I've done that before.
>
> For example, consider a Master/Worker application where the master sends
> data to the workers and the workers start the computation. The master posts
> a send to worker1, which fails and gets restarted on another node, and in
> its new location worker1 posts the recv for the master's send. The problem
> here is that the master posted the send when the process was residing on one
> node, but the process now expects the message on another node. I need the
> sender to realize that the process is now on another node and to retry the
> communication with a modified endpoint. Could anyone please tell me where in
> the send code I can obtain the status of a message that hasn't been sent, so
> that I can resend it to a new location? I would also like to know where I
> can obtain information about an endpoint failure.
>
> Thanks in advance.
>
> Hugo
>
