At 11:42 AM 11/9/2005, Greg Lindahl wrote:
> On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote:
>
> > If an application takes any action assuming that send complete means
> > it is delivered, then it is subject to silent data corruption.
>
> Right. That's the same as pretty much all other *transport* layers. I
> don't think anyone's asserting RDS is any different: you can't assume
> the other side's application received and acted on your message until
> the other side's application tells you that it did.

So failures such as an HCA failure are not transparent: one cannot simply replay the operations, because the sender does not know what the other side actually saw unless the application performs the resync itself. Hence, while RDS can attempt to retransmit, the application must either deal with duplicates or note the error, resync, and then retransmit so that duplicates are avoided.
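
To make the "deal with duplicates" option concrete, here is a minimal sketch of receiver-side duplicate suppression. It assumes the application puts its own sequence number on every message; the class and method names are hypothetical and are not part of RDS or any IB API.

```python
class DedupReceiver:
    """Hands each sequence number to the application at most once,
    even if the sender blindly replays after a failure (hypothetical sketch)."""

    def __init__(self):
        self.next_expected = 0   # lowest sequence number not yet delivered
        self.delivered = []      # payloads handed to the application, in order

    def on_message(self, seq, payload):
        if seq < self.next_expected:
            return "duplicate"   # already seen: a replay, drop it
        if seq > self.next_expected:
            return "gap"         # something is missing: caller should resync
        self.delivered.append(payload)
        self.next_expected += 1
        return "delivered"


rx = DedupReceiver()
rx.on_message(0, "a")
rx.on_message(0, "a")            # replayed after a failure, ignored
rx.on_message(1, "b")
```

The point is only that this check lives in the application (or a library above the transport), because that is the only place that knows what was really consumed.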

BTW, host-based transport implementations can transparently recover from device failure on behalf of applications, since their state is in the host and not in the failed device; this is true for networking, storage, etc. HCA / RNIC / TOE / FC / etc. all lose state or cannot be trusted, and thus must rely upon upper-level software to perform the recovery, resync, retransmission, etc. Unless RDS has implemented its own state checkpoint between endnodes, this class of failures must be solved by the application, since it cannot be solved in the hardware. Hence, RDS may push some of its reliability requirements to the interconnect, but that does not eliminate all reliability requirements from the application or from RDS itself.
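
As a rough illustration of keeping the recovery state in the host rather than in the device, the sketch below has the sender hold every message in host memory until the peer application acknowledges it, and resync after a device failure by asking the peer which sequence number it expects next. All names are hypothetical; how the peer's answer is obtained (and over which path) is assumed, not specified here.

```python
from collections import OrderedDict


class ResyncSender:
    """Keeps unacknowledged messages in host memory so they survive
    the loss of device state (hypothetical sketch)."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = OrderedDict()   # seq -> payload, oldest first

    def send(self, payload):
        seq = self.next_seq
        self.unacked[seq] = payload    # retained until the peer *application* acks
        self.next_seq += 1
        return seq

    def on_app_ack(self, seq):
        """Peer application confirms everything up to and including seq."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]

    def resync(self, peer_next_expected):
        """After a failure: the peer reports the next sequence number it
        expects; return only the messages it has not seen."""
        self.on_app_ack(peer_next_expected - 1)
        return list(self.unacked.items())


tx = ResyncSender()
for p in ["a", "b", "c"]:
    tx.send(p)
to_resend = tx.resync(2)             # peer saw 0 and 1, so only seq 2 is resent
```

Retransmitting only what `resync` returns is the "note the error, resync, and retransmit" path: no duplicates are generated, at the cost of a round trip to learn the peer's position.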

Mike
_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
