RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
In the absence of any protocol-level ack (and regardless of protocol-level ack), it is the application which has to implement its own reliability. RDS becomes a passive channel passing packets back and forth, including duplicate packets. The responsibility then shifts to the application to figure out what is missing, what is duplicated, etc.

This would seem at odds with earlier assertions that, as long as there were another path to the endnode, RDS would transparently recover on behalf of the application. I thought Oracle stated that, for their application, a send failure would be interpreted as endnode failure and the peer cast out; perhaps I misread their usage model. Other applications that might want to use RDS could be designed to deal with the associated faults, but if one has to deal with recovery/resync at the application layer, then that is quite a bit of work to perform in every application, and is again at odds with the purpose of RDS, which is to move reliability into the interconnect to the extent possible, and into RDS, so that the UDP application does not need to take on this complex code and attempt to get it right.

[cait] I would agree that there isn't much point in defining a "reliable" datagram service unless it is more reliable than an unreliable one. To me that means that the transport should deal with all networking problems other than a *total* failure to re-establish contact with the remote end. That makes it basically the equivalent of a point-to-point Reliable Connection. The biggest difference, and the justification for having something like RDS, is to eliminate point-to-point flow control and allow it to be replaced with ULP-based flow control that is not point-to-point. The resources associated with tracking credits are where a lot of the overhead inherent in multiple point-to-point connections comes from (that, and the synchronization of that data over the network).
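To make the ULP-based flow control point concrete, here is a minimal sketch (hypothetical names, not RDS or verbs API code): the application itself tracks credits over a shared channel, rather than the transport keeping per-connection credit state for every peer pair.

```python
# Sketch of ULP-managed credits over a shared datagram-style channel.
# All names are hypothetical; this is not the RDS API.

class UlpFlowControl:
    """Sender-side credit tracking done by the application (ULP),
    not by per-connection transport state."""

    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.backlog = []          # messages waiting for credit

    def send(self, msg, wire_send):
        # Only put a message on the wire when the peer has granted credit.
        if self.credits > 0:
            self.credits -= 1
            wire_send(msg)
            return True
        self.backlog.append(msg)   # defer; no per-connection retry state
        return False

    def on_credit_grant(self, n, wire_send):
        # The peer's ULP piggybacks credit grants on its own traffic.
        self.credits += n
        while self.credits > 0 and self.backlog:
            self.credits -= 1
            wire_send(self.backlog.pop(0))

sent = []
fc = UlpFlowControl(initial_credits=1)
fc.send(b"m1", sent.append)        # goes out immediately
fc.send(b"m2", sent.append)        # deferred: no credit left
fc.on_credit_grant(1, sent.append) # m2 released
```

The design point being argued above is that this credit state lives once, in the ULP, instead of being replicated and synchronized for every point-to-point Reliable Connection.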
___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 12:49 PM 11/14/2005, Nitin Hande wrote: Michael Krause wrote: At 01:01 PM 11/11/2005, Nitin Hande wrote: Michael Krause wrote: At 10:28 AM 11/9/2005, Rick Frank wrote:

Yes, the application is responsible for detecting lost msgs at the application level; the transport cannot do this. RDS does not guarantee that a message has been delivered to the application, just that once the transport has accepted a msg it will deliver the msg to the remote node in order, without duplication, dealing with retransmissions, etc., due to sporadic / intermittent msg loss over the interconnect. If, after accepting the send, the current path fails, then RDS will transparently fail over to another path and, if required, will resend/send any already queued msgs to the remote node, again ensuring that no msg is duplicated and that they are in order. This is no different than APM, with the exception that RDS can do this across HCAs. The application (Oracle in this case) will deal with detecting a catastrophic path failure, either due to a send that does not arrive, a timed-out response, or a send failure returned from the transport. If there is no network path to a remote node, it is required that we remove the remote node from the operating cluster to avoid what is commonly termed a "split brain" condition, otherwise known as a "partition in time". BTW, in our case the application failure-domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we cannot talk to a remote node after some defined period of time, we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node, allowing the remaining nodes to continue. If, later on, communication to the remote node is restored, it will be allowed to rejoin the cluster and take on application load.

Please clarify the following, which was in the document provided by Oracle.
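The eviction policy Rick describes (remove a node after a defined period of silence; let it rejoin later) can be sketched as follows. The timeout value and all names are illustrative, not Oracle's actual implementation:

```python
# Sketch of timeout-based node eviction to avoid split-brain: any node
# not heard from within the defined period is cast out of the cluster.
# It may rejoin later once communication is restored.
# EVICTION_TIMEOUT and all names are hypothetical.

EVICTION_TIMEOUT = 30.0  # seconds; illustrative value

def evict_unresponsive(cluster, last_heard, now):
    """Return the surviving membership; nodes silent longer than the
    timeout are removed so the remaining nodes can recover their state
    and continue."""
    return {node for node in cluster
            if now - last_heard.get(node, float("-inf")) <= EVICTION_TIMEOUT}

cluster = {"a", "b", "c"}
last_heard = {"a": 100.0, "b": 95.0, "c": 60.0}
survivors = evict_unresponsive(cluster, last_heard, now=100.0)
# "c" has been silent for 40s, longer than the 30s timeout, so it is evicted
```

Note this logic is transport-independent, which matches the statement that the failure-domain handling is the same over UDP / uDAPL / iTAPI / TCP / SCTP.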
On page 3 of the RDS document, under the section "RDP Interface", the 2nd and 3rd paragraphs state:

* RDP does not guarantee that a datagram is delivered to the remote application.
* It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure.

The HCA is still a fault domain with RDS; it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsgs that it believes are pending, but it has no way to determine whether already-sent sendmsgs were actually successfully delivered to the remote application, unless it provides some level of resync of the outstanding sends not completed from an application's perspective, as well as of any state updated via RDMA operations, which may occur without an explicit send operation to flush to a known state.

If RDS could define a mechanism that the application could use to inform the sender to resync and replay on catastrophic failure, is that a correct understanding of your suggestion?

I'm not suggesting anything at this point. I'm trying to reconcile the documentation with the e-mail statements made by its proponents. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes.

Reading the doc and the thread, it looks like we need src/dst ports for multiplexing connections, we need seq/ack numbers for resyncing, and we need some kind of window availability for flow control. Aren't we very close to a TCP header?

TCP does not provide an end-to-end ack to the application as implemented by most OSes. Unless one ties the TCP ACK to the application's consumption of the receive data, there is no method to ascertain that the application really received the data. The application would be required to send its own application-level acknowledgement.
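Nitin's observation can be made concrete. A minimal header carrying exactly the fields listed (ports for multiplexing, seq/ack for resync, a window for flow control) lands only a few bytes away from TCP's fixed header. This layout is purely illustrative, not the actual RDS wire format:

```python
import struct

# Hypothetical minimal header for the fields mentioned in the thread:
# src/dst port (multiplexing), seq/ack (resync), window (flow control).
# This is NOT the real RDS wire format; it just shows how close those
# requirements come to a TCP-style header (14 bytes here vs TCP's
# fixed 20).
HDR = struct.Struct("!HHIIH")   # network byte order

def pack_hdr(src_port, dst_port, seq, ack, window):
    return HDR.pack(src_port, dst_port, seq, ack, window)

def unpack_hdr(raw):
    src, dst, seq, ack, win = HDR.unpack(raw)
    return {"src": src, "dst": dst, "seq": seq, "ack": ack, "win": win}

h = unpack_hdr(pack_hdr(4000, 4001, seq=7, ack=3, window=64))
```

What is left out relative to TCP (flags, checksum, options) is essentially the machinery RDS delegates to the underlying reliable interconnect.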
I believe the intent is for applications to remain responsible for the end-to-end receipt of data, and that RDS and the interconnect are simply responsible for the exchange at the lower levels.

Yes, a TCP ack only implies that the stack has received the data, and means nothing to the application. It is the application which has to send an application-level ack to its peer.

The TCP ACK was intended to be an end-to-end ACK, but implementations took it to a lower-level ACK only. A TCP stack linked into an application, as demonstrated by multiple IHVs and research efforts, does provide an end-to-end ACK and considerable performance improvements over the traditional network stack implementations. Some claim it is more than good enough to eliminate the need for protocol off-load / RDMA, which is true for many applications (certainly for most Sockets, etc.) but not true when one takes advantage of the RDMA comms paradigm, which has benefits for a number of applications.

Mike
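The application-level acknowledgement being discussed can be sketched in a few lines (illustrative names, not any real RDS or sockets API): the sender holds every message until the peer application, not the peer stack, explicitly confirms consumption.

```python
# Sketch of an application-level ack: a transport ACK only covers
# receipt by the remote stack, so the sender retains each message until
# the peer *application* acknowledges consuming it. Illustrative only.

class AppLevelSender:
    def __init__(self):
        self.next_id = 0
        self.unacked = {}            # msg_id -> payload, awaiting app ack

    def send(self, payload, wire_send):
        msg_id = self.next_id
        self.next_id += 1
        self.unacked[msg_id] = payload
        wire_send((msg_id, payload)) # transport-level delivery attempt
        return msg_id

    def on_app_ack(self, msg_id):
        # The peer application confirmed it consumed the data; only now
        # is the message truly delivered end to end.
        self.unacked.pop(msg_id, None)

wire = []
s = AppLevelSender()
m = s.send(b"query", wire.append)
# The transport may already have ACKed receipt, but until on_app_ack()
# fires the sender must be prepared to resend or report failure.
s.on_app_ack(m)
```

Anything still in `unacked` at failure time is exactly the set the application must resync or replay, which is the crux of the recovery discussion above.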
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 12:49 PM 11/14/2005, Nitin Hande wrote: Michael Krause wrote: At 01:02 PM 11/11/2005, Ranjit Pandit wrote: On 11/11/05, Michael Krause <[EMAIL PROTECTED]> wrote:
> Please clarify the following, which was in the document provided by Oracle.
>
> On page 3 of the RDS document, under the section "RDP Interface", the 2nd
> and 3rd paragraphs state:
>
> * RDP does not guarantee that a datagram is delivered to the remote
> application.
> * It is up to the RDP client to deal with datagrams lost due to transport
> failure or remote application failure.
>
> The HCA is still a fault domain with RDS - it does not address flushing data
> out of the HCA fault domain, nor does it sound like it ensures that CQE loss
> is recoverable.
>
> I do believe RDS will replay all of the sendmsgs that it believes are
> pending, but it has no way to determine if already-sent sendmsgs were
> actually successfully delivered to the remote application unless it provides
> some level of resync of the outstanding sends not completed from an
> application's perspective, as well as any state updated via RDMA operations
> which may occur without an explicit send operation to flush to a known
> state. I'm still trying to ascertain whether RDS completely recovers from
> HCA failure (assuming there is another HCA / path available) between the two
> endnodes.

RDS will replay the sends that are completed in error by the HCA, which typically would happen if the current path fails or the remote node/HCA dies.

Does this mean that the receiving RDS entity is responsible for dealing with duplicates?

I believe so...

A Send completion error does not mean that the receiving endnode did not receive the data, for either IB or iWARP; it only indicates that the Send operation failed, which could be just a loss of the receive ACK with the Send completing on the receiver. Such a scenario would imply that RDS would have to comprehend what buffers have actually been consumed before retransmission, i.e.
a resync is performed; else one could receive duplicate data at the application layer, which can cause corruption or other problems as a function of the application (tolerance will vary by application, thus the ULP must present consistent semantics to enable a broader set of applications to be supported than perhaps the initial targeted application).

In the absence of any protocol-level ack (and regardless of protocol-level ack), it is the application which has to implement its own reliability. RDS becomes a passive channel passing packets back and forth, including duplicate packets. The responsibility then shifts to the application to figure out what is missing, what is duplicated, etc.

Mike

Thanks
Nitin

In case of a catastrophic error on the local HCA, subsequent sends will fail (for a certain time, session_time_wait) as if there were no alternate path available at that time. On getting an error, the application should discard any sends unacknowledged by its peer and take corrective action.

Unacknowledged by the peer means at the interconnect or the application level? Again, how is the receive buffer management handled?
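The duplicate-suppression burden described here can be sketched as follows (hypothetical names, not RDS code): after a failover the sender replays everything it cannot prove was delivered, so a receiver that delivers in order can drop anything at or below the highest sequence number it has already handed to the application.

```python
# Sketch of receiver-side duplicate suppression for an in-order
# reliable datagram service: replayed datagrams from a failover carry
# sequence numbers the receiver has already delivered, so they are
# dropped rather than handed to the application twice. Illustrative.

class DedupReceiver:
    def __init__(self):
        self.last_delivered = -1     # highest seq given to the application
        self.delivered = []

    def on_datagram(self, seq, payload):
        if seq <= self.last_delivered:
            return False             # duplicate from a replay; drop it
        self.last_delivered = seq
        self.delivered.append(payload)
        return True

r = DedupReceiver()
r.on_datagram(0, b"a")
r.on_datagram(1, b"b")
r.on_datagram(1, b"b")               # replayed after a path failover
r.on_datagram(2, b"c")
```

This only works because the service promises in-order delivery; it says nothing about the harder problem raised above, namely state updated via RDMA operations with no send to sequence against.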
After the time_wait is over, subsequent sends will initiate a brand-new connection, which could use the alternate HCA (if the path is available).

This is understood.

Mike
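The session_time_wait behaviour Ranjit describes can be sketched as a small state machine (the wait value and all names are illustrative, not the RDS implementation):

```python
# Sketch of session_time_wait: after a catastrophic HCA error, sends
# fail as if no alternate path existed; once the wait expires, the next
# send initiates a brand-new connection, possibly over another HCA.
# SESSION_TIME_WAIT and all names are hypothetical.

SESSION_TIME_WAIT = 15.0  # seconds; illustrative value

class Session:
    def __init__(self):
        self.failed_at = None
        self.connected = False

    def on_fatal_error(self, now):
        self.connected = False
        self.failed_at = now

    def send(self, now, reconnect):
        if self.connected:
            return "sent"
        if self.failed_at is not None and now - self.failed_at < SESSION_TIME_WAIT:
            return "error"           # still inside the wait window
        self.connected = reconnect() # fresh connection on any usable path
        return "sent" if self.connected else "error"

s = Session()
s.connected = True
s.on_fatal_error(now=0.0)
r1 = s.send(now=5.0, reconnect=lambda: True)   # fails: inside the wait
r2 = s.send(now=20.0, reconnect=lambda: True)  # wait over: reconnects
```

During the window the application sees errors for every send, which is where the "discard unacknowledged sends and take corrective action" guidance above applies.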
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Michael Krause wrote: At 01:01 PM 11/11/2005, Nitin Hande wrote: Michael Krause wrote: At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport can not do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again insuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and or a timedout response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed as a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. 
On page 3 of the RDS document, under the section "RDP Interface", the 2nd and 3rd paragraphs are state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsg's that it believes are pending, but it has no way to determine if already sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. If RDS could define a mechanism that the application could use to inform the sender to resync and replay on catastrophic failure, is that a correct understanding of your suggestion ? I'm not suggesting anything at this point. I'm trying to reconcile the documentation with the e-mail statements made by its proponents. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes Reading at the doc and the thread, it looks like we need src/dst port for multiplexing connections, we need seq/ack# for resyncing, we need some kind of window availability for flow control. Are'nt we very close to tcp header ? .. TCP does not provide end-to-end to the application as implemented by most OS. Unless one ties TCP ACK to the application's consumption of the receive data, there is no method to ascertain that the application really received the data. The application would be required to send its own application-level acknowledgement. 
I believe the intent is for applications to remain responsible for the end-to-end receipt of data and that RDS and the interconnect are simply responsible for the exchange at the lower levels. Yes, a TCP ack only implies that it has received the data, and means nothing to the application. It is the application which has send a application level ack to its peer. Nitin Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Michael Krause wrote: At 01:02 PM 11/11/2005, Ranjit Pandit wrote: On 11/11/05, Michael Krause <[EMAIL PROTECTED]> wrote: > Please clarify the following which was in the document provided by Oracle. > > On page 3 of the RDS document, under the section "RDP Interface", the 2nd > and 3rd paragraphs are state: > >* RDP does not guarantee that a datagram is delivered to the remote > application. >* It is up to the RDP client to deal with datagrams lost due to transport > failure or remote application failure. > > The HCA is still a fault domain with RDS - it does not address flushing data > out of the HCA fault domain, nor does it sound like it ensures that CQE loss > is recoverable. > > I do believe RDS will replay all of the sendmsg's that it believes are > pending, but it has no way to determine if already sent sendmsgs were > actually successfully delivered to the remote application unless it provides > some level of resync of the outstanding sends not completed from an > application's perspective as well as any state updated via RDMA operations > which may occur without an explicit send operation to flush to a known > state. I'm still trying to ascertain whether RDS completely recovers from > HCA failure (assuming there is another HCA / path available) between the two > endnodes. RDS will replay the sends that are completed in error by the HCA, which typically would happen if the current path fails or the remote node/HCA dies. Does this mean that the receiving RDS entity is responsible for dealing with duplicates? I believe so... A Send completion error does not mean that the receiving endnode did not receive the data for either IB or iWARP; it only indicates that the Send operation failed which could be just a loss of the receive ACK with the Send completing on the receiver. Such a scenario would imply that RDS would have to comprehend what buffers have actually been consumed before retransmission, i.e. 
a resync is performed, else one could receive duplicate data at the application layer which can cause corruption or other problems as a function of the application (tolerance will vary by application thus the ULP must present consistent semantics to enable a broader set of applications than perhaps the initial targeted application to be supported). In absence of any protocol level ack (and regardless of protocol level ack), it is the application which has to implement its own reliability. RDS becomes a passive channel passing packet back and forth including duplicate packets. The responsibility then shifts to the application to figure out what is missing, duplicate's etc. Thanks Nitin In case of a catastrophic error on the local HCA, subsequent sends will fail (for a certain time (session_time_wait ) ) as if there was no alternate path available at that time. On getting an error the application should discard any sends unacknowledged by it's peer and take corrective action. Unacknowledged by the peer means at the interconnect or the application level? Again, how is the receive buffer management handled? After the time_wait is over, subsequent sends will initiate a brand new connection which could use the alternate HCA ( if the path is available). This is understood. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 01:02 PM 11/11/2005, Ranjit Pandit wrote: On 11/11/05, Michael Krause <[EMAIL PROTECTED]> wrote: > Please clarify the following which was in the document provided by Oracle. > > On page 3 of the RDS document, under the section "RDP Interface", the 2nd > and 3rd paragraphs are state: > > * RDP does not guarantee that a datagram is delivered to the remote > application. > * It is up to the RDP client to deal with datagrams lost due to transport > failure or remote application failure. > > The HCA is still a fault domain with RDS - it does not address flushing data > out of the HCA fault domain, nor does it sound like it ensures that CQE loss > is recoverable. > > I do believe RDS will replay all of the sendmsg's that it believes are > pending, but it has no way to determine if already sent sendmsgs were > actually successfully delivered to the remote application unless it provides > some level of resync of the outstanding sends not completed from an > application's perspective as well as any state updated via RDMA operations > which may occur without an explicit send operation to flush to a known > state. I'm still trying to ascertain whether RDS completely recovers from > HCA failure (assuming there is another HCA / path available) between the two > endnodes. RDS will replay the sends that are completed in error by the HCA, which typically would happen if the current path fails or the remote node/HCA dies. Does this mean that the receiving RDS entity is responsible for dealing with duplicates? A Send completion error does not mean that the receiving endnode did not receive the data for either IB or iWARP; it only indicates that the Send operation failed which could be just a loss of the receive ACK with the Send completing on the receiver. Such a scenario would imply that RDS would have to comprehend what buffers have actually been consumed before retransmission, i.e. 
a resync is performed, else one could receive duplicate data at the application layer which can cause corruption or other problems as a function of the application (tolerance will vary by application thus the ULP must present consistent semantics to enable a broader set of applications than perhaps the initial targeted application to be supported). In case of a catastrophic error on the local HCA, subsequent sends will fail (for a certain time (session_time_wait ) ) as if there was no alternate path available at that time. On getting an error the application should discard any sends unacknowledged by it's peer and take corrective action. Unacknowledged by the peer means at the interconnect or the application level? Again, how is the receive buffer management handled? After the time_wait is over, subsequent sends will initiate a brand new connection which could use the alternate HCA ( if the path is available). This is understood. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 01:01 PM 11/11/2005, Nitin Hande wrote: Michael Krause wrote: At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport can not do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again insuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and or a timedout response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed as a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. 
On page 3 of the RDS document, under the section "RDP Interface", the 2nd and 3rd paragraphs are state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsg's that it believes are pending, but it has no way to determine if already sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. If RDS could define a mechanism that the application could use to inform the sender to resync and replay on catastrophic failure, is that a correct understanding of your suggestion ? I'm not suggesting anything at this point. I'm trying to reconcile the documentation with the e-mail statements made by its proponents. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodesReading at the doc and the thread, it looks like we need src/dst port for multiplexing connections, we need seq/ack# for resyncing, we need some kind of window availability for flow control. Are'nt we very close to tcp header ? .. TCP does not provide end-to-end to the application as implemented by most OS. Unless one ties TCP ACK to the application's consumption of the receive data, there is no method to ascertain that the application really received the data. The application would be required to send its own application-level acknowledgement. 
I believe the intent is for applications to remain responsible for the end-to-end receipt of data and that RDS and the interconnect are simply responsible for the exchange at the lower levels. Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
On 11/11/05, Michael Krause <[EMAIL PROTECTED]> wrote: > Please clarify the following which was in the document provided by Oracle. > > On page 3 of the RDS document, under the section "RDP Interface", the 2nd > and 3rd paragraphs are state: > >* RDP does not guarantee that a datagram is delivered to the remote > application. >* It is up to the RDP client to deal with datagrams lost due to transport > failure or remote application failure. > > The HCA is still a fault domain with RDS - it does not address flushing data > out of the HCA fault domain, nor does it sound like it ensures that CQE loss > is recoverable. > > I do believe RDS will replay all of the sendmsg's that it believes are > pending, but it has no way to determine if already sent sendmsgs were > actually successfully delivered to the remote application unless it provides > some level of resync of the outstanding sends not completed from an > application's perspective as well as any state updated via RDMA operations > which may occur without an explicit send operation to flush to a known > state. I'm still trying to ascertain whether RDS completely recovers from > HCA failure (assuming there is another HCA / path available) between the two > endnodes. RDS will replay the sends that are completed in error by the HCA, which typically would happen if the current path fails or the remote node/HCA dies. In case of a catastrophic error on the local HCA, subsequent sends will fail (for a certain time (session_time_wait ) ) as if there was no alternate path available at that time. On getting an error the application should discard any sends unacknowledged by it's peer and take corrective action. After the time_wait is over, subsequent sends will initiate a brand new connection which could use the alternate HCA ( if the path is available). 
> > Mike > > ___ > openib-general mailing list > openib-general@openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Michael Krause wrote: At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport can not do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again insuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and or a timedout response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed as a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we can not talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. 
On page 3 of the RDS document, under the section "RDP Interface", the 2nd and 3rd paragraphs are state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsg's that it believes are pending, but it has no way to determine if already sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. If RDS could define a mechanism that the application could use to inform the sender to resync and replay on catastrophic failure, is that a correct understanding of your suggestion ? I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes Reading at the doc and the thread, it looks like we need src/dst port for multiplexing connections, we need seq/ack# for resyncing, we need some kind of window availability for flow control. Are'nt we very close to tcp header ? .. Nitin . Mike ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport cannot do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc. due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again ensuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and/or a timed-out response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we cannot talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. Please clarify the following which was in the document provided by Oracle. 
On page 3 of the RDS document, under the section "RDP Interface", the 2nd and 3rd paragraphs state: * RDP does not guarantee that a datagram is delivered to the remote application. * It is up to the RDP client to deal with datagrams lost due to transport failure or remote application failure. The HCA is still a fault domain with RDS - it does not address flushing data out of the HCA fault domain, nor does it sound like it ensures that CQE loss is recoverable. I do believe RDS will replay all of the sendmsgs that it believes are pending, but it has no way to determine if already-sent sendmsgs were actually successfully delivered to the remote application unless it provides some level of resync of the outstanding sends not completed from an application's perspective as well as any state updated via RDMA operations which may occur without an explicit send operation to flush to a known state. I'm still trying to ascertain whether RDS completely recovers from HCA failure (assuming there is another HCA / path available) between the two endnodes. Mike
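The resync Mike is asking about can be sketched generically: after a failover, the sender learns the highest sequence number the peer contiguously delivered, replays everything after it, and the receiver discards duplicates. This is an illustration of that general technique under assumed names, not the RDS reference implementation.

```python
# Sketch of seq-based resync-and-replay after a path failover. All class
# and method names are illustrative assumptions, not an actual RDS API.

class Path:
    """Toy transport path that can fail mid-stream, silently losing datagrams."""
    def __init__(self, receiver):
        self.receiver = receiver
        self.up = True

    def deliver(self, seq, payload):
        if self.up:
            self.receiver.on_datagram(seq, payload)
        # else: datagram silently lost -- the sender cannot tell from here

class Sender:
    def __init__(self):
        self.next_seq = 0
        self.unacked = {}  # seq -> payload, retained until acknowledged

    def send(self, path, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        path.deliver(seq, payload)

    def resync(self, path, receiver):
        # In a real protocol the peer's high-water mark would arrive as a
        # resync/ack message over the new path; here we query it directly.
        high = receiver.highest_delivered()
        for seq in sorted(s for s in self.unacked if s > high):
            path.deliver(seq, self.unacked[seq])  # replay only what's missing

class Receiver:
    def __init__(self):
        self.delivered = []
        self.next_expected = 0

    def on_datagram(self, seq, payload):
        if seq < self.next_expected:
            return  # duplicate from a replay; discard
        self.delivered.append(payload)
        self.next_expected += 1

    def highest_delivered(self):
        return self.next_expected - 1
```

The point of the sketch is that without the resync step (or an application-level equivalent), the sender cannot distinguish a datagram lost in the failed HCA from one the peer already consumed.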
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
My concern is the requirement that RDS resync the structures in the face of failure and know whether to re-transmit or will deal with duplicates. Having pre-posted buffers will help enable the resync to be accomplished but should not be equated to pre-post equals one can deal with duplicates or will verify to prevent duplicates from occurring. Mike The semantics should be that barring an error the flow between any two endpoints is reliable and ordered. The difference versus a normal point-to-point definition of reliable is that a) lack of a receive buffer is an error, b) the endpoint communicates with many known remote peers (as opposed to one known remote peer, or many unknown). Having an API with those semantics, particularly as an upgrade in semantics from SOCK_DGRAM while preserving SOCK_DGRAM syntax, is something that I believe is of distinct value to many cluster-based applications. Further, the API can be implemented in an offload device (IB or IP) more efficiently than if it is simply implemented on top of SOCK_STREAM sockets by the application. Documenting and clarifying the semantics to make its general applicability clearer should definitely be done, however.
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 10:48 AM 11/10/2005, Caitlin Bestler wrote: Mike Krause wrote in response to Greg Lindahl: > If it is to be reasonably robust, then RDS should be required to support > the resync between the two sides of the communication. This aligns with the > stated objective of implementing reliability in one location in software and > one location in hardware. Without such resync being required in the ULP, > then one ends up with a ULP that falls short of its stated objectives and > pushes complexity back up to the application which is where the advocates > have stated it is too complex or expensive to get it correct. I haven't reread all of the RDS fine print to double-check this, but my impression is that RDS semantics exactly match the subset of MPI point-to-point communications where the receiving rank is required to have pre-posted buffers before the send is allowed. My concern is the requirement that RDS resync the structures in the face of failure and know whether to re-transmit or will deal with duplicates. Having pre-posted buffers will help enable the resync to be accomplished but should not be equated to pre-post equals one can deal with duplicates or will verify to prevent duplicates from occurring. Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Yes, this is the case. - Original Message - From: "Caitlin Bestler" <[EMAIL PROTECTED]> To: Sent: Thursday, November 10, 2005 1:48 PM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB Mike Krause wrote in response to Greg Lindahl: If it is to be reasonably robust, then RDS should be required to support the resync between the two sides of the communication. This aligns with the stated objective of implementing reliability in one location in software and one location in hardware. Without such resync being required in the ULP, then one ends up with a ULP that falls short of its stated objectives and pushes complexity back up to the application which is where the advocates have stated it is too complex or expensive to get it correct. This sort of message service, by the way, has a long history in distributed computing. Yep. I haven't reread all of the RDS fine print to double-check this, but my impression is that RDS semantics exactly match the subset of MPI point-to-point communications where the receiving rank is required to have pre-posted buffers before the send is allowed.
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Mike Krause wrote in response to Greg Lindahl: > If it is to be reasonably robust, then RDS should be required to support > the resync between the two sides of the communication. This aligns with the > stated objective of implementing reliability in one location in software and > one location in hardware. Without such resync being required in the ULP, > then one ends up with a ULP that falls short of its stated objectives and > pushes complexity back up to the application which is where the advocates > have stated it is too complex or expensive to get it correct. >> This sort of message service, by the way, has a long history in distributed computing. > Yep. I haven't reread all of the RDS fine print to double-check this, but my impression is that RDS semantics exactly match the subset of MPI point-to-point communications where the receiving rank is required to have pre-posted buffers before the send is allowed.
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 02:09 PM 11/9/2005, Greg Lindahl wrote: On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote: > What you indicate above is that RDS > will implement a resync of the two sides of the association to determine > what has been successfully sent. More accurate to say that it "could" implement that. I'm just kibitzing on someone else's proposal. > This then implies that the reliability of the underlying > interconnect isn't as critical per se as the end-to-end RDS protocol > will assure that data is delivered to the RDS components in the face > of hardware failures. Correct? Yes. That's the intent that I see in the proposal. The implementation required to actually support this may not be what the proposers had in mind. If it is to be reasonably robust, then RDS should be required to support the resync between the two sides of the communication. This aligns with the stated objective of implementing reliability in one location in software and one location in hardware. Without such resync being required in the ULP, then one ends up with a ULP that falls short of its stated objectives and pushes complexity back up to the application which is where the advocates have stated it is too complex or expensive to get it correct. This sort of message service, by the way, has a long history in distributed computing. Yep. Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
On Wed, Nov 09, 2005 at 12:45:17PM -0800, Caitlin Bestler wrote: > ... Caitlin, I'm having problems reading the quoting "style" too. Please, can you take a look at "quotefix"? http://home.in.tum.de/~jain/software/outlook-quotefix/ thanks, grant > > > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Michael Krause > Sent: Wednesday, November 09, 2005 12:21 PM > To: Rick Frank; Ranjit Pandit > Cc: openib-general@openib.org > Subject: Re: [openib-general] [ANNOUNCE] Contribute > RDS(ReliableDatagramSockets) to OpenIB > > > > One could be able to talk to the remote node across > other HCA but that does not mean one has an understanding of the state > at the remote node unless the failure is noted and a resync of state > occurs or the remote is able to deal with duplicates, etc. This has > nothing to do with API or the transport involved but, as Caitlin noted, > the difference between knowing a send buffer is free vs. knowing that > the application received the data requested. Therefore, one has only > reduced the reliability / robustness problem space to some extent but > has not solved it by the use of RDS. > > > > Correct. When there are point-to-point credits (even if only > enforced/understood > at the ULP) then the application can correctly infer that message N was > successfully processed because the matching credit was restored. A > transport > neutral application can only communicate restoration of credits via ULP > messaging. When credits are shared across sessions then the ULP > has a much more complex task to properly communicate credits. > > The proposal I presented at RAIT for multistreamed MPA had a > non-highlighted > option for a "wildcard" endpoint. Without the option multistream MPA is > essentially > the SCTP adaptation for RDMA running over plain MPA/TCP. It achieves the > same reduction in reliable transport layer connections that RDS does, > but > does not reduce the number of RDMA endpoints. 
The wildcard option > reduces the number of RDMA endpoints as well, but greatly complicates > the RDMA state machines. RDS over IB faces similar problems, but solves > them slightly differently. > > Over iWARP I believe these complexities favor keeping the point-to-point > logical connection between QPs and only reducing the number of L4 > connections (from many TCP connections to a single TCP connection > or SCTP association). The advantage of that approach is that the API > from application to RDMA endpoint (QP) can be left totally unchanged. > But I do not see any such option over IB, unless RD is improved or a > new SCTP-like connection mode is defined. > > In my opinion the multi-streaming is the most important feature here, > but over IB I do not think there is a natural adaptation that provides > multi-streaming without also adding the any-to-any endpoint semantics. > Multistream MPA and SCTP can both support the any-to-any endpoint > semantics by moving the source to payload information rather than > transport information (by invoking "wildcard status" in MS-MPA or > by duplicating the field for SCTP). So the RDS API strikes me as > the best option for a transport neutral application. MS-MPA and SCTP > reductions in transport overhead would be available without special > API support.
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
On 11/9/05, Greg Lindahl <[EMAIL PROTECTED]> wrote: > On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote: > > > What you indicate above is that RDS > > will implement a resync of the two sides of the association to determine > > what has been successfully sent. > > More accurate to say that it "could" implement that. I'm just > kibitzing on someone else's proposal. > > > This then implies that the reliability of the underlying > > interconnect isn't as critical per se as the end-to-end RDS protocol > > will assure that data is delivered to the RDS components in the face > > of hardware failures. Correct? > > Yes. That's the intent that I see in the proposal. The implementation > required to actually support this may not be what the proposers had in > mind. The reference implementation of RDS already supports this. It supports failover across HCAs just like APM does across ports within an HCA. > > This sort of message service, by the way, has a long history in > distributed computing. > > -- greg
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
On 11/9/05, Michael Krause <[EMAIL PROTECTED]> wrote: > I hadn't assumed anything. I'm simply trying to understand the assertions > concerning availability and recovery. What you indicate above is that RDS > will implement a resync of the two sides of the association to determine > what has been successfully sent. It will then retransmit what has not been > successfully sent, transparent to the application. This then implies that the reliability of > the underlying interconnect isn't as critical per se as the end-to-end RDS > protocol will assure that data is delivered to the RDS components in the > face of hardware failures. Correct? > > Mike Correct. Ranjit
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
On Wed, Nov 09, 2005 at 01:57:06PM -0800, Michael Krause wrote: > What you indicate above is that RDS > will implement a resync of the two sides of the association to determine > what has been successfully sent. More accurate to say that it "could" implement that. I'm just kibitzing on someone else's proposal. > This then implies that the reliability of the underlying > interconnect isn't as critical per se as the end-to-end RDS protocol > will assure that data is delivered to the RDS components in the face > of hardware failures. Correct? Yes. That's the intent that I see in the proposal. The implementation required to actually support this may not be what the proposers had in mind. This sort of message service, by the way, has a long history in distributed computing. -- greg
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 01:24 PM 11/9/2005, Greg Lindahl wrote: On Wed, Nov 09, 2005 at 12:18:28PM -0800, Michael Krause wrote: > So, things like HCA failure are not transparent and one cannot simply > replay the operations since you don't know what was really seen by the > other side unless the application performs the resync itself. I think you are over-stating the case. On the remote end, the kernel piece of RDS knows what it presented to the remote application, ditto on the local end. If only an HCA fails, and not the sending and receiving kernels or applications, that knowledge is not lost. Perhaps you were assuming that RDS would be implemented only in firmware on the HCA, and there is no kernel piece that knows what's going on. I hadn't seen that stated by anyone, and of course there are several existing and contemplated OpenIB devices that are considerably different from the usual offload engine. You could also choose to implement RDS using an offload engine and still keep enough state in the kernel to recover. I hadn't assumed anything. I'm simply trying to understand the assertions concerning availability and recovery. What you indicate above is that RDS will implement a resync of the two sides of the association to determine what has been successfully sent. It will then retransmit what has not been successfully sent, transparent to the application. This then implies that the reliability of the underlying interconnect isn't as critical per se as the end-to-end RDS protocol will assure that data is delivered to the RDS components in the face of hardware failures. Correct? Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
On Wed, Nov 09, 2005 at 12:18:28PM -0800, Michael Krause wrote: > So, things like HCA failure are not transparent and one cannot simply > replay the operations since you don't know what was really seen by the > other side unless the application performs the resync itself. I think you are over-stating the case. On the remote end, the kernel piece of RDS knows what it presented to the remote application, ditto on the local end. If only an HCA fails, and not the sending and receiving kernels or applications, that knowledge is not lost. Perhaps you were assuming that RDS would be implemented only in firmware on the HCA, and there is no kernel piece that knows what's going on. I hadn't seen that stated by anyone, and of course there are several existing and contemplated OpenIB devices that are considerably different from the usual offload engine. You could also choose to implement RDS using an offload engine and still keep enough state in the kernel to recover. -- greg
RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Krause Sent: Wednesday, November 09, 2005 12:21 PM To: Rick Frank; Ranjit Pandit Cc: openib-general@openib.org Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB One could be able to talk to the remote node across other HCA but that does not mean one has an understanding of the state at the remote node unless the failure is noted and a resync of state occurs or the remote is able to deal with duplicates, etc. This has nothing to do with API or the transport involved but, as Caitlin noted, the difference between knowing a send buffer is free vs. knowing that the application received the data requested. Therefore, one has only reduced the reliability / robustness problem space to some extent but has not solved it by the use of RDS. Correct. When there are point-to-point credits (even if only enforced/understood at the ULP) then the application can correctly infer that message N was successfully processed because the matching credit was restored. A transport neutral application can only communicate restoration of credits via ULP messaging. When credits are shared across sessions then the ULP has a much more complex task to properly communicate credits. The proposal I presented at RAIT for multistreamed MPA had a non-highlighted option for a "wildcard" endpoint. Without the option multistream MPA is essentially the SCTP adaptation for RDMA running over plain MPA/TCP. It achieves the same reduction in reliable transport layer connections that RDS does, but does not reduce the number of RDMA endpoints. The wildcard option reduces the number of RDMA endpoints as well, but greatly complicates the RDMA state machines. RDS over IB faces similar problems, but solves them slightly differently. 
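Caitlin's point about point-to-point credits can be sketched as follows: the receiver grants one credit per pre-posted buffer, the sender spends a credit per message, and the credit is restored only when the peer's application consumes the buffer, which is exactly what lets the sender infer that the message was processed. Class and method names here are illustrative assumptions, not an RDS or iWARP API.

```python
from collections import deque

# Sketch of ULP-level, point-to-point credit flow control. Credits mirror
# the peer's free pre-posted receive buffers, so "lack of a receive buffer
# is an error" is detectable at the sender rather than causing silent loss.
class CreditedChannel:
    def __init__(self, posted_buffers):
        self.credits = posted_buffers  # one credit per pre-posted peer buffer
        self.in_flight = deque()       # messages sent but not yet consumed

    def send(self, payload):
        if self.credits == 0:
            # Refuse the send instead of dropping it: the sender knows the
            # peer has no buffer for it.
            raise BufferError("no receive credits; peer buffers exhausted")
        self.credits -= 1
        self.in_flight.append(payload)

    def peer_consumed(self):
        # Peer's application consumed a buffer and its credit was restored
        # (in a real protocol this arrives as a ULP credit-update message).
        # The sender can now infer that message was successfully processed.
        msg = self.in_flight.popleft()
        self.credits += 1
        return msg
```

Sharing credits across sessions, as the paragraph above notes, breaks this simple inference, which is where the extra ULP complexity comes from.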
Over iWARP I believe these complexities favor keeping the point-to-point logical connection between QPs and only reducing the number of L4 connections (from many TCP connections to a single TCP connection or SCTP association). The advantage of that approach is that the API from application to RDMA endpoint (QP) can be left totally unchanged. But I do not see any such option over IB, unless RD is improved or a new SCTP-like connection mode is defined. In my opinion the multi-streaming is the most important feature here, but over IB I do not think there is a natural adaptation that provides multi-streaming without also adding the any-to-any endpoint semantics. Multistream MPA and SCTP can both support the any-to-any endpoint semantics by moving the source to payload information rather than transport information (by invoking "wildcard status" in MS-MPA or by duplicating the field for SCTP). So the RDS API strikes me as the best option for a transport neutral application. MS-MPA and SCTP reductions in transport overhead would be available without special API support.
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At 10:28 AM 11/9/2005, Rick Frank wrote: Yes, the application is responsible for detecting lost msgs at the application level - the transport cannot do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc. due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again ensuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and/or a timed-out response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we cannot talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. 
One could be able to talk to the remote node across other HCA but that does not mean one has an understanding of the state at the remote node unless the failure is noted and a resync of state occurs or the remote is able to deal with duplicates, etc. This has nothing to do with API or the transport involved but, as Caitlin noted, the difference between knowing a send buffer is free vs. knowing that the application received the data requested. Therefore, one has only reduced the reliability / robustness problem space to some extent but has not solved it by the use of RDS. Mike - Original Message - From: Michael Krause To: Ranjit Pandit Cc: openib-general@openib.org Sent: Tuesday, November 08, 2005 4:08 PM Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA, and replays the failed messages. 
Using APM is not useful because it doesn't provide failover across HCAs. I think others may disagree about whether RDS solves the problem. You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary or application's involvement to see the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. Hence, RDS can replay to its heart's content but until there is an application or middleware level of acknowledgement, you have not solved the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server. APM's value is the ability to recover from link failure. It has the same value for any other ULP in that it recovers transparently to the ULP.
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 11:42 AM 11/9/2005, Greg Lindahl wrote: On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote: > If an application takes any action assuming that send complete means > it is delivered, then it is subject to silent data corruption. Right. That's the same as pretty much all other *transport* layers. I don't think anyone's asserting RDS is any different: you can't assume the other side's application received and acted on your message until the other side's application tells you that it did. So, things like HCA failure are not transparent and one cannot simply replay the operations since you don't know what was really seen by the other side unless the application performs the resync itself. Hence, while RDS can attempt to retransmit, the application must deal with duplicates, etc. or note the error, resync, and retransmit to avoid duplicates. BTW, host-based transport implementations can transparently recover from device failure on behalf of applications since their state is in the host and not in the failed device - this is true for networking, storage, etc. HCA / RNIC / TOE / FC / etc. all lose state or cannot be trusted thus must rely upon upper level software to perform the recovery, resync, retransmission, etc. Unless RDS has implemented its own state checkpoint between endnodes, this class of failures must be solved by the application since it cannot be solved in the hardware. Hence, RDS may push some of its reliability requirements to the interconnect but it does not eliminate all reliability requirements from the application or RDS itself. Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote: > If an application takes any action assuming that send complete means > it is delivered, then it is subject to silent data corruption. Right. That's the same as pretty much all other *transport* layers. I don't think anyone's asserting RDS is any different: you can't assume the other side's application received and acted on your message until the other side's application tells you that it did. -- greg
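Greg's distinction can be shown in a few lines: a transport-level send completion only frees the local buffer, while the message itself stays pending until an application-level ack arrives from the peer. The class below is a minimal sketch under assumed names, not code from any RDS implementation.

```python
# Minimal sketch of application-level acknowledgement: "send complete"
# does not retire a message; only the peer application's ack does.
class AppAckedSender:
    def __init__(self):
        self.pending = {}  # msg_id -> payload, awaiting the peer's app-level ack
        self.next_id = 0

    def send(self, payload):
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = payload
        # ... hand to the transport here; the transport's send-complete
        # event must NOT remove the entry, per the point above.
        return msg_id

    def on_app_ack(self, msg_id):
        # The peer application confirms it received and acted on the
        # message; only now is it safe to forget it.
        self.pending.pop(msg_id, None)
```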
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Yes, the application is responsible for detecting lost msgs at the application level - the transport cannot do this. RDS does not guarantee that a message has been delivered to the application - just that once the transport has accepted a msg it will deliver the msg to the remote node in order without duplication - dealing with retransmissions, etc. due to sporadic / intermittent msg loss over the interconnect. If after accepting the send - the current path fails - then RDS will transparently fail over to another path - and if required will resend / send any already queued msgs to the remote node - again ensuring that no msg is duplicated and they are in order. This is no different than APM - with the exception that RDS can do this across HCAs. The application - Oracle in this case - will deal with detecting a catastrophic path failure - either due to a send that does not arrive and/or a timed-out response or send failure returned from the transport. If there is no network path to a remote node - it is required that we remove the remote node from the operating cluster to avoid what is commonly termed a "split brain" condition - otherwise known as a "partition in time". BTW - in our case - the application failure domain logic is the same whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc. Basically, if we cannot talk to a remote node - after some defined period of time - we will remove the remote node from the cluster. In this case the database will recover all the interesting state that may have been maintained on the removed node - allowing the remaining nodes to continue. If later on, communication to the remote node is restored - it will be allowed to rejoin the cluster and take on application load. 
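The eviction logic Rick describes (remove a node we haven't heard from within some defined period, let it rejoin once communication is restored) can be sketched as below. The timeout value and all names are illustrative assumptions, not Oracle's actual clusterware.

```python
# Sketch of timeout-based cluster membership: evict a node not heard from
# within EVICT_AFTER seconds (avoiding split-brain), allow rejoin later.
class ClusterMembership:
    EVICT_AFTER = 30.0  # seconds of silence before eviction (assumed value)

    def __init__(self, nodes, now=0.0):
        self.last_heard = {n: now for n in nodes}
        self.members = set(nodes)

    def heard_from(self, node, now):
        self.last_heard[node] = now
        if node not in self.members:
            self.members.add(node)  # communication restored: node rejoins

    def check(self, now):
        for node in list(self.members):
            if now - self.last_heard[node] > self.EVICT_AFTER:
                # Evict: surviving nodes recover the state it maintained.
                self.members.discard(node)
        return self.members
```

Note this matches the thread's framing: eviction is driven purely by loss of communication, independent of which transport (UDP / uDAPL / iTAPI / TCP / SCTP) carried the traffic.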
- Original Message - From: Michael Krause To: Ranjit Pandit Cc: openib-general@openib.org Sent: Tuesday, November 08, 2005 4:08 PM Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as the application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA, and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCAs. I think others may disagree about whether RDS solves the problem. You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary or application's involvement to see the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. 
Hence, RDS can replay to its heart's content but until there is an application or middleware level of acknowledgement, you have not solved the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server. APM's value is the ability to recover from link failure. It has the same value for any other ULP in that it recovers transparently to the ULP. Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
Caitlin, Can you please use the standard quoting style? I can't tell which comments are yours. Thanks. -- greg
RE: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Krause Sent: Tuesday, November 08, 2005 1:08 PM To: Ranjit Pandit Cc: openib-general@openib.org Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as the application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA, and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCAs. I think others may disagree about whether RDS solves the problem. You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary or application's involvement to see the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. 
Hence, RDS can replay to its heart's content but until there is an application or middleware level of acknowledgement, you have not solved the fault domain issues. Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server. [cait] Applications should not infer anything from send completion other than that their source buffer is no longer required for the transmit to complete. That is the only assumption that can be supported in a transport neutral way. I'll also point out that even under InfiniBand the fact that a send or write has completed does NOT guarantee that the remote peer has *noticed* the data. The remote peer could fail *after* the data has been delivered to it and before it has had a chance to act upon it. A well-designed robust application should never rely on anything other than a peer ack to indicate that the peer has truly taken ownership of transmitted information. The essence of RDS, or any similar solution, is the delivery of messages with datagram semantics reliably over point-to-point reliable connections. So whatever reliability and fault-tolerance benefits the reliable connections provide are inherited by the RDS layer. After that it is mostly a matter of how you avoid head-of-line blocking problems when there is no receive buffer. You don't want to send an RNR (or drop the DDP Segment under iWARP) because *one* endpoint does not have available buffers. Other than that any reliable datagram service should be just as reliable as the underlying RC service.
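The peer-ack discipline described above (treat a message as delivered only when the remote application acknowledges it, never on local send completion) can be sketched like this. This is an illustrative pattern with hypothetical names, not code from any RDS implementation:

```python
class AckedSender:
    """Sketch: local send completion only means the source buffer is
    reusable; a message counts as owned by the peer only once an
    application-level ack for its sequence number arrives."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}  # seq -> payload, awaiting a peer ack

    def send(self, payload, transport_send):
        seq = self.next_seq
        self.next_seq += 1
        # Keep the message pending: transport acceptance proves nothing
        # about the remote application having noticed the data.
        self.pending[seq] = payload
        transport_send(seq, payload)
        return seq

    def peer_acked(self, seq):
        # Only now may the application consider the message delivered.
        self.pending.pop(seq, None)

    def unacknowledged(self):
        """Messages that must be resent or resolved on peer failure."""
        return dict(self.pending)
```

Anything still in `unacknowledged()` when a timeout fires is exactly the set the application must reason about during fault handling.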
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 12:33 PM 11/8/2005, Ranjit Pandit wrote: > Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as the application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA, and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCAs. I think others may disagree about whether RDS solves the problem. You have no way of knowing whether something was received or not into the other node's coherency domain without some intermediary or application's involvement to see the data arrived. As such, you might see many hardware level acks occur and not know there is a real failure. If an application takes any action assuming that send complete means it is delivered, then it is subject to silent data corruption. Hence, RDS can replay to its heart's content but until there is an application or middleware level of acknowledgement, you have not solved the fault domain issues. 
Some may be happy with this as they just cast out the endnode from the cluster / database but others see the loss of a server as a big deal so may not be happy to see this occur. It really comes down to whether you believe losing a server is worthwhile just for a local failure event which is not fatal to the rest of the server. APM's value is the ability to recover from link failure. It has the same value for any other ULP in that it recovers transparently to the ULP. Mike
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 12:37 PM 11/8/2005, Hal Rosenstock wrote: On Tue, 2005-11-08 at 15:33, Ranjit Pandit wrote: > Using APM is not useful because it doesn't provide failover across HCAs. Can't APM be made to work across HCAs ? No. It requires state that is only within the HCA and there are other aspects that prevent this, e.g. no single unified QP space across all HCAs, etc. Mike
RE: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael Krause Sent: Tuesday, November 08, 2005 11:52 AM To: Rimmer, Todd Cc: openib-general@openib.org Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB The entire discussion might be distilled into the following: - Datagram applications trade reliability for flexibility and resource savings. Reliable Datagram applications have endpoints that accept messages from multiple known sources, rather than from a single known source (TCP, RC) or multiple unknown sources (UDP, UD). This does save resources, but perhaps just as importantly it may reflect how the application truly thinks of its communication endpoints. Oracle is not unique in this communication requirement. This is essentially the interface MPI presents to its users as well. - Datagram applications that require reliability have to re-invent the wheel and given it is non-trivial, they often get variable quality and can suffer performance loss if done poorly or the network is very lossy. Given networks are a lot less lossy today than years past, sans congestion drops, one might argue about whether there is still a significant problem or not. [cait] Standardized congestion control that is not dependent on application specific control is highly desirable. In the IP world new ULPs based upon UDP are heavily discouraged for exactly this reason. - The reliable datagram model isn't new - been there, done that on earlier interconnects - but it isn't free. IB could have done something like RDS but the people who pushed the original requirements (some who are advocating RDS now) did not want to take on the associated software enablement thus it was subsumed into hardware and made slightly more restrictive as a result - perhaps more than some people may like. The only real delta between RDS in one sense and the current IB RD is the number of outstanding messages in flight on a given EEC. 
If RD were re-defined to allow software to recover some types of failures much like UC, then one could simply use RD. [cait] The RDS API should definitely be compatible with IB RD service, especially any later one that solves the crippling limitation on in-flight messages. Similarly the API should be compatible with IP based solutions, which since it is derived from SOCK_DGRAM isn't much of a challenge. - RDS does not solve a set of failure models. For example, if a RNIC / HCA were to fail, then one cannot simply replay the operations on another RNIC / HCA without extracting state, etc. and providing some end-to-end sync of what was really sent / received by the application. Yes, one can recover from cable or switch port failure by using APM style recovery but that is only one class of faults. The harder faults either result in the end node being cast out of the cluster or see silent data corruption unless additional steps are taken to transparently recover - again app writers don't want to solve the hard problems; they want that done for them. [cait] This goes to the question of where the Reliable Datagram Service is implemented. When done as middleware over existing reliable connection services then the middleware does have a few issues on handling flushed buffers after an RNIC failure. These issues make implementation of a zero-copy strategy more of an issue. But if the endpoint is truly a datagram endpoint then these issues are the same as for failover of connection-oriented endpoints between two RNICs/HCAs. - RNIC / HCA provide hardware acceleration and reliable delivery to the remote RNIC / HCA (not to the application since that is in a separate fault domain). Doing software multiplexing over such an interconnect as envisioned for IB RD is relatively straightforward in many respects but not a trivial exercise as some might contend. 
Yes, people can point to a small number of lines of code but that is just for the initial offering and is not an indication of what it might have to become long-term to add all of the bells-n-whistles that people have envisioned. [cait] IB RD is not transport neutral, and has the problem of severe in-flight limitations that would make it unacceptable to most applications that would benefit from RDS even if they were available. There is no way that iWARP vendors would ever implement a service designed to match IB RD. An RDS service could be implemented over TCP, MPA, MS-MPA or SCTP. - RDS is not an API but a ULP. It really uses a set of physical connections which are then used to set up logical application associations (often referred to as connections but really are not in terms of the interconnect). These associations can be quickly established as they are just control messages over the existing physical connections.
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
On Tue, 2005-11-08 at 15:33, Ranjit Pandit wrote: > Using APM is not useful because it doesn't provide failover across HCAs. Can't APM be made to work across HCAs ? -- Hal
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
> Mike wrote: > - RDS does not solve a set of failure models. For example, if a RNIC / HCA > were to fail, then one cannot simply replay the operations on another RNIC / > HCA without extracting state, etc. and providing some end-to-end sync of > what was really sent / received by the application. Yes, one can recover > from cable or switch port failure by using APM style recovery but that is > only one class of faults. The harder faults either result in the end node > being cast out of the cluster or see silent data corruption unless > additional steps are taken to transparently recover - again app writers > don't want to solve the hard problems; they want that done for them. The current reference implementation of RDS solves the HCA failure case as well. Since applications don't need to keep connection states, it's easier to handle cases like HCA and intermediate path failures. As far as the application is concerned, every sendmsg 'could' result in a new connection setup in the driver. If the current path fails, RDS reestablishes a connection, if available, on a different port or a different HCA, and replays the failed messages. Using APM is not useful because it doesn't provide failover across HCAs.
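The replay-without-duplication behaviour Ranjit describes (resend the unacknowledged queue over a new path, while the receiver stays in order and drops duplicates by sequence number) can be sketched in a few lines. This is an illustrative model with hypothetical names, not the reference implementation:

```python
class ReplayChannel:
    """Sketch: sender keeps every accepted message until acked and, on
    path failure, replays the whole unacked queue; the receiver discards
    anything at or below the last sequence it delivered, so replays
    produce neither duplication nor reordering."""

    def __init__(self):
        # Sender side.
        self.unacked = []         # (seq, msg) in send order
        self.next_seq = 0
        # Receiver side (folded into one object for brevity).
        self.last_delivered = -1
        self.delivered = []

    def send(self, msg):
        self.unacked.append((self.next_seq, msg))
        self.next_seq += 1

    def receive(self, seq, msg):
        if seq <= self.last_delivered:
            return  # duplicate from a replay: drop silently
        self.last_delivered = seq
        self.delivered.append(msg)

    def replay_after_failover(self):
        # Resend everything still unacknowledged over the new path.
        for seq, msg in self.unacked:
            self.receive(seq, msg)
```

Note this models only transport-level delivery; as the surrounding thread argues, it says nothing about whether the remote application acted on the data.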
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
At 03:02 PM 11/4/2005, Rimmer, Todd wrote: > Bob wrote, > Perhaps if tunneling udp packets over RC connections rather than > UD connections provides better performance, as was seen in the RDS > experiment, then why not just convert > IPoIB to use a connected model (rather than datagrams) > and then all existing IP upper level > protocols could benefit, TCP, UDP, SCTP, This would miss the second major improvement of RDS, namely removing the need for the application to perform timeouts and retries on datagram packets. If Oracle ran over UDP/IP/IPoIB it would not be guaranteed a loss-less reliable interface. If UDP/IP/IPoIB provided a loss-less reliable interface it would likely break or affect other UDP applications which are expecting a flow controlled interface. The entire discussion might be distilled into the following: - Datagram applications trade reliability for flexibility and resource savings. - Datagram applications that require reliability have to re-invent the wheel and given it is non-trivial, they often get variable quality and can suffer performance loss if done poorly or the network is very lossy. Given networks are a lot less lossy today than years past, sans congestion drops, one might argue about whether there is still a significant problem or not. - The reliable datagram model isn't new - been there, done that on earlier interconnects - but it isn't free. IB could have done something like RDS but the people who pushed the original requirements (some who are advocating RDS now) did not want to take on the associated software enablement thus it was subsumed into hardware and made slightly more restrictive as a result - perhaps more than some people may like. The only real delta between RDS in one sense and the current IB RD is the number of outstanding messages in flight on a given EEC. If RD were re-defined to allow software to recover some types of failures much like UC, then one could simply use RD. 
- RDS does not solve a set of failure models. For example, if a RNIC / HCA were to fail, then one cannot simply replay the operations on another RNIC / HCA without extracting state, etc. and providing some end-to-end sync of what was really sent / received by the application. Yes, one can recover from cable or switch port failure by using APM style recovery but that is only one class of faults. The harder faults either result in the end node being cast out of the cluster or see silent data corruption unless additional steps are taken to transparently recover - again app writers don't want to solve the hard problems; they want that done for them. - RNIC / HCA provide hardware acceleration and reliable delivery to the remote RNIC / HCA (not to the application since that is in a separate fault domain). Doing software multiplexing over such an interconnect as envisioned for IB RD is relatively straightforward in many respects but not a trivial exercise as some might contend. Yes, people can point to a small number of lines of code but that is just for the initial offering and is not an indication of what it might have to become long-term to add all of the bells-n-whistles that people have envisioned. - RDS is not an API but a ULP. It really uses a set of physical connections which are then used to set up logical application associations (often referred to as connections but really are not in terms of the interconnect). These associations can be quickly established as they are just control messages over the existing physical connections. Again, builds on concepts already shipping in earlier interconnects / solutions from a number of years back. Hence, for large scale applications which are association intensive, RDS is able to improve the performance of establishing these associations. 
While RDS improves the performance in this regard, its impacts on actual performance stem more from avoiding some operations thus nearly all of the performance numbers quoted are really an apples-to-oranges comparison. Nothing wrong with this but people need to keep in mind that things are not being compared with one another on the same level thus the results can look more dramatic. - One thing to keep in mind is that RDS is about not doing work to gain performance and to potentially improve code by eliminating software that was too complex / difficult to get clean when it was invoked to recover from fabric-related issues. This is somewhat the same logic as used by NFS when migrating to TCP from UDP. Could not get clean software so change the underlying comms to push the problem to a place where it is largely solved. Now, whether you believe RDS is great or not, it is an attempt to solve a problem plaguing one class of applications who'd rather not spend their resources on the problem. That is a fair thing to consider if someone else has already done it better using another technology. One could also consider having IB change the RD semantics to see if that would solve the problem sin
RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
-Original Message- From: [EMAIL PROTECTED] on behalf of Roland Dreier Sent: Fri 11/4/2005 6:49 PM To: Rick Frank Cc: openib-general@openib.org Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Rick> Do you mean use TCP and the RC transport in the ethernet Rick> verbs provider ? No, I mean just write RDS for ethernet on top of sockets. I don't think it's worth implementing a whole RDMA provider on top of ethernet just so you can use the same RDS code. The SilverStorm RDS code is only about 10K lines of code, and I think a sane implementation would probably be less than 5K, so you're not getting much benefit from all the effort of writing an RDMA provider. In fact I'm not sure that it doesn't make sense to implement RDS as a library + daemon completely in userspace. - R. [Caitlin] Correct, the idea of providing Reliable Datagram service over reliable point-to-point tunnels enables userspace solutions as long as they have access to a high-throughput reliable connection service. Whether a TCP service that provides no stateful acceleration qualifies is a topic that we do not need to take up here. [/Caitlin]
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Rick> Do you mean use TCP and the RC transport in the ethernet Rick> verbs provider ? No, I mean just write RDS for ethernet on top of sockets. I don't think it's worth implementing a whole RDMA provider on top of ethernet just so you can use the same RDS code. The SilverStorm RDS code is only about 10K lines of code, and I think a sane implementation would probably be less than 5K, so you're not getting much benefit from all the effort of writing an RDMA provider. In fact I'm not sure that it doesn't make sense to implement RDS as a library + daemon completely in userspace. - R.
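Roland's library-plus-daemon suggestion hinges on preserving datagram boundaries over a byte-stream transport such as TCP. A minimal framing sketch in Python; this is a hypothetical illustration of the technique, not the actual RDS wire format:

```python
import struct

def frame_datagram(payload: bytes) -> bytes:
    """Prefix a datagram with a 4-byte big-endian length so message
    boundaries survive the byte-stream transport (one simple way a
    userspace RDS-over-TCP layer might preserve datagram semantics)."""
    return struct.pack("!I", len(payload)) + payload

def unframe_stream(buf: bytes):
    """Split received stream bytes back into whole datagrams; returns
    (datagrams, leftover) so a partial read can be resumed later."""
    out = []
    while len(buf) >= 4:
        (n,) = struct.unpack("!I", buf[:4])
        if len(buf) < 4 + n:
            break  # this datagram has not fully arrived yet
        out.append(buf[4:4 + n])
        buf = buf[4 + n:]
    return out, buf
```

The leftover-bytes return value is what makes this usable against nonblocking `recv()`, which can hand back an arbitrary slice of the stream.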
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Rick> We (Oracle) are currently investigating / working on an RDS Rick> over Ethernet driver for Linux. Our current plans are to Rick> produce a new verbs provider that registers with Gen 2 IB Rick> verbs layer. This new driver will bind to a standard Rick> ethernet nic driver and implement the RC semantics. This Rick> will allow us to use 100% of the ported RDS ULP. That seems rather an awkward way to go about it. Why not just use TCP? - R.
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
At this point we really need to get RDS on IB ported to Gen 2 so we can get this into Linux distributions ASAP. We (Oracle) are currently investigating / working on an RDS over Ethernet driver for Linux. Our current plans are to produce a new verbs provider that registers with Gen 2 IB verbs layer. This new driver will bind to a standard ethernet nic driver and implement the RC semantics. This will allow us to use 100% of the ported RDS ULP. Note that RDS should also run over any other interconnect that registers with the verbs layer - such as iWARP, etc. - Original Message - From: "Bob Woodruff" <[EMAIL PROTECTED]> To: "'Ranjit Pandit'" <[EMAIL PROTECTED]> Cc: "Rick Frank" <[EMAIL PROTECTED]>; Sent: Friday, November 04, 2005 6:58 PM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Ranjit wrote, RDS is somewhat like SDP in that it offloads/accelerates SOCK_DGRAM instead of SOCK_STREAM. So back to the question from Roland that started this thread. When do you plan to re-work the code to use the OpenIB verbs and make it suitable for the kernel ? And do you plan to develop the code, or at least the infrastructure to allow multiple RDS providers to plug in so that it is ubiquitous - supported on all interconnects - to include simple Ethernet NICs ? woody
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Ranjit wrote, >RDS is somewhat like SDP in that it offloads/accelerates SOCK_DGRAM >instead of SOCK_STREAM. So back to the question from Roland that started this thread. When do you plan to re-work the code to use the OpenIB verbs and make it suitable for the kernel ? And do you plan to develop the code, or at least the infrastructure to allow multiple RDS providers to plug in so that it is ubiquitous - supported on all interconnects - to include simple Ethernet NICs ? woody
RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Rick wrote, >SCTP is connection based - we have many dependencies on our connectionless >datagram model. I think I get it now. I was just talking with Roy about SCTP, and he said the same thing, SCTP is a connected rather than datagram model, so SCTP does not seem to solve the problem since it has the same FD scaling problems as TCP. >Of course for this to work - we will need RDS to be ubiquitous - supported >on all interconnects - to include simple Ethernet NICs. Makes sense. woody
Re: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
SCTP is connection based - we have many dependencies on our connectionless datagram model. - Original Message - From: "Bob Woodruff" <[EMAIL PROTECTED]> To: "'Rimmer, Todd'" <[EMAIL PROTECTED]>; "Caitlin Bestler" <[EMAIL PROTECTED]>; "Rick Frank" <[EMAIL PROTECTED]>; "Pandit, Ranjit" <[EMAIL PROTECTED]>; "Grant Grundler" <[EMAIL PROTECTED]> Cc: Sent: Friday, November 04, 2005 6:10 PM Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB Todd wrote, >This would miss the second major improvement of RDS, namely removing the need for >the application to perform timeouts and retries on datagram packets. If Oracle >ran over UDP/IP/IPoIB it would not be guaranteed a loss-less reliable interface. >If UDP/IP/IPoIB provided a loss-less reliable interface it would likely break or >affect other UDP applications which are expecting a flow controlled interface. >Todd Rimmer Then use SCTP instead of UDP, which already provides a loss-less reliable interface. If SCTP has problems with the number of endpoints it can currently support, why not just fix that problem and fix IPoIB to use a connected model to increase performance, rather than inventing a completely new protocol and/or address family. Just a thought. woody
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
On 11/4/05, Bob Woodruff <[EMAIL PROTECTED]> wrote: > Woody wrote, > >Perhaps if tunneling udp packets over RC connections rather than > >UD connections provides better performance, as was seen in the RDS > >experiment, then why not just convert > >IPoIB to use a connected model (rather than datagrams) > >and then all existing IP upper level > >protocols could benefit, TCP, UDP, SCTP, > > Saying this another way. > Make the hardware run the existing protocols better, don't > design a new protocol to work around the problems with a > specific hardware transport. > What about SDP? Isn't SDP bypassing the existing TCP protocol stack to take advantage of a specific hardware transport - IB? RDS is somewhat like SDP in that it offloads/accelerates SOCK_DGRAM instead of SOCK_STREAM. > woody > > -Original Message- > From: Caitlin Bestler [mailto:[EMAIL PROTECTED] > Sent: Friday, November 04, 2005 2:31 PM > To: Woodruff, Robert J; Rick Frank; Ranjit Pandit; Grant Grundler > Cc: openib-general@openib.org > Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS ( > ReliableDatagramSockets) to OpenIB > > > -Original Message- > > From: [EMAIL PROTECTED] > > [mailto:[EMAIL PROTECTED] On Behalf Of Bob Woodruff > > Sent: Friday, November 04, 2005 2:15 PM > > To: 'Rick Frank'; Ranjit Pandit; Grant Grundler > > Cc: openib-general@openib.org > > Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS ( > > ReliableDatagramSockets) to OpenIB > > > > Rick wrote, > > >I've attached a draft proposal for RDS from Oracle which discusses > > >some of the motivation for RDS. > > > > Couple of questions/comments on the spec. > > > > AF_INET_OFFLOAD should be renamed to something like AF_INET_RDS. > > > > Would something like SCTP provide the same type of > > capabilities (reliable datagrams) that you are suggesting to > > add with RDP ? > > > > Each stream within an SCTP association provides a reliable, > ordered service. 
> > There would be two primary constraints in using SCTP for > this usage profile: > > 1) The Stream ID is 16 bits, and the natural mapping would >be to have each stream represent a source/destination >pairing. That would imply fewer than 256 endpoints per >host. If the source were encoded by hand then the limitation >would be 64K, but that's an awkard mix of application and >transport layer encoding. > 2) The network has to be composed of SCTP friendly equipment. >When IP network equipment operated exclusively at L2/L3, >and L4 was left to the endpoints, SCTP would have had no >problem being deployed. But because of security and IPV4 >address shortages there are a lot of middleboxes that are >L4 aware, and generally that L4 awareness is limited to >TCP and UDP. > > SCTP support would also have to be part of the offload device. > RDS enables reliable datagrams using existing offloaded RC > services (IB RC, iWARP, TOE). No NIC enhancements are required. > > > > > ___ > openib-general mailing list > openib-general@openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > ___ > openib-general mailing list > openib-general@openib.org > http://openib.org/mailman/listinfo/openib-general > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general > ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Todd wrote,
>This would miss the second major improvement of RDS, namely removing
>the need for the application to perform timeouts and retries on
>datagram packets. If Oracle ran over UDP/IP/IPoIB it would not be
>guaranteed a loss-less reliable interface. If UDP/IP/IPoIB provided a
>loss-less reliable interface it would likely break or affect other UDP
>applications which are expecting a flow controlled interface.
>Todd Rimmer

Then use SCTP instead of UDP, which already provides a loss-less
reliable interface. If SCTP has problems with the number of endpoints
it can currently support, why not just fix that problem and fix IPoIB
to use a connected model to increase performance, rather than inventing
a completely new protocol and/or address family. Just a thought.

woody
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Woody wrote,
>Perhaps if tunneling udp packets over RC connections rather than
>UD connections provides better performance, as was seen in the RDS
>experiment, then why not just convert IPoIB to use a connected model
>(rather than datagrams) and then all existing IP upper level
>protocols could benefit, TCP, UDP, SCTP.

Saying this another way:
Make the hardware run the existing protocols better, don't design a new
protocol to work around the problems with a specific hardware
transport.

woody
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
> Bob wrote,
> Perhaps if tunneling udp packets over RC connections rather than
> UD connections provides better performance, as was seen in the RDS
> experiment, then why not just convert IPoIB to use a connected model
> (rather than datagrams) and then all existing IP upper level
> protocols could benefit, TCP, UDP, SCTP.

This would miss the second major improvement of RDS, namely removing
the need for the application to perform timeouts and retries on
datagram packets. If Oracle ran over UDP/IP/IPoIB it would not be
guaranteed a loss-less reliable interface. If UDP/IP/IPoIB provided a
loss-less reliable interface it would likely break or affect other UDP
applications which are expecting a flow-controlled interface.

Todd Rimmer
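[Editor's note] Todd's point about per-application timeouts and retries can be made concrete. The sketch below is a hypothetical, minimal Python illustration (it is not from the RDS draft, and the function names are invented): every application running over plain UDP that wants reliability must carry a retransmit loop like `send_with_retry`, which is precisely the boilerplate a reliable datagram service absorbs into the transport.

```python
import socket
import threading

def send_with_retry(sock, payload, dest, retries=3, timeout=0.5):
    """Application-level reliability over plain UDP: retransmit until
    an acknowledgement arrives. A reliable datagram service such as RDS
    is meant to make this per-application loop unnecessary."""
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(payload, dest)
        try:
            ack, _ = sock.recvfrom(2048)
            return ack                  # peer answered; done
        except socket.timeout:
            continue                    # assume the datagram was lost
    raise TimeoutError("no ack after %d attempts" % retries)

def echo_peer(sock):
    """Stand-in remote endpoint: acknowledge one datagram, then exit."""
    data, addr = sock.recvfrom(2048)
    sock.sendto(b"ACK:" + data, addr)

# Loopback demo: a second socket in a thread plays the remote peer.
peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
peer.bind(("127.0.0.1", 0))
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.bind(("127.0.0.1", 0))

threading.Thread(target=echo_peer, args=(peer,), daemon=True).start()
print(send_with_retry(client, b"hello", peer.getsockname()).decode())
```

Multiply this loop (plus sequence numbers and duplicate suppression, which the sketch omits) across every datagram application, and the argument for pushing reliability into the transport becomes clear.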
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Caitlin wrote,
>SCTP support would also have to be part of the offload device.
>RDS enables reliable datagrams using existing offloaded RC
>services (IB RC, iWARP, TOE). No NIC enhancements are required.

BTW, SCTP runs in Linux today without any NIC enhancements or offload
support. Perhaps if tunneling udp packets over RC connections rather
than UD connections provides better performance, as was seen in the RDS
experiment, then why not just convert IPoIB to use a connected model
(rather than datagrams) and then all existing IP upper level protocols
could benefit, TCP, UDP, SCTP.

woody
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
> Rick wrote,
> >I've attached a draft proposal for RDS from Oracle which discusses
> >some of the motivation for RDS.
>
> Couple of questions/comments on the spec.
>
> AF_INET_OFFLOAD should be renamed to something like AF_INET_RDS.
>
> Would something like SCTP provide the same type of capabilities
> (reliable datagrams) that you are suggesting to add with RDS ?

Each stream within an SCTP association provides a reliable, ordered
service. There would be two primary constraints in using SCTP for this
usage profile:

1) The Stream ID is 16 bits, and the natural mapping would be to have
   each stream represent a source/destination pairing. That would imply
   fewer than 256 endpoints per host. If the source were encoded by
   hand then the limitation would be 64K, but that's an awkward mix of
   application and transport layer encoding.

2) The network has to be composed of SCTP-friendly equipment. When IP
   network equipment operated exclusively at L2/L3, and L4 was left to
   the endpoints, SCTP would have had no problem being deployed. But
   because of security and IPv4 address shortages there are a lot of
   middleboxes that are L4 aware, and generally that L4 awareness is
   limited to TCP and UDP.

SCTP support would also have to be part of the offload device. RDS
enables reliable datagrams using existing offloaded RC services (IB RC,
iWARP, TOE). No NIC enhancements are required.
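[Editor's note] The Stream ID constraint (item 1 above) is simple arithmetic; the following is a hypothetical back-of-envelope check, not code from any proposal. If each SCTP stream stands for one source/destination pairing, N endpoints need on the order of N * N streams, so a 16-bit Stream ID caps the endpoint count at roughly 256 per host.

```python
# SCTP's Stream ID field is 16 bits wide -> at most 2**16 streams
# per association (RFC 2960 / RFC 4960).
MAX_STREAMS = 2 ** 16

def max_endpoints(streams=MAX_STREAMS):
    """Largest N such that N * N source/destination pairings still
    fit in the available stream identifiers."""
    return int(streams ** 0.5)

print(max_endpoints())   # ceiling implied by the natural mapping
print(MAX_STREAMS)       # the 64K limit if sources are encoded by hand
```

This is why the natural mapping yields "fewer than 256 endpoints per host", while hand-encoding the source inside the stream number only raises the ceiling to 64K at the cost of mixing application and transport layer encoding.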
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Rick wrote,
>I've attached a draft proposal for RDS from Oracle which discusses
>some of the motivation for RDS.

Couple of questions/comments on the spec.

AF_INET_OFFLOAD should be renamed to something like AF_INET_RDS.

Would something like SCTP provide the same type of capabilities
(reliable datagrams) that you are suggesting to add with RDS ?

http://www.networksorcery.com/enp/protocol/sctp.htm
http://www.faqs.org/rfcs/rfc2960.html
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Rick Frank wrote:
> No, we do not use TCP sockets - we use too many connections for
> this, 100k+.

Isn't RDS implemented on top of reliable IB/RDMA connections anyway?

- Sean
Re: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
No, we do not use TCP sockets - we use too many connections for this,
100k+.

- Original Message -
From: "Bob Woodruff" <[EMAIL PROTECTED]>
To: "'Rick Frank'" <[EMAIL PROTECTED]>; "Ranjit Pandit" <[EMAIL PROTECTED]>; "Grant Grundler" <[EMAIL PROTECTED]>
Sent: Friday, November 04, 2005 11:35 AM
Subject: RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

> Rick wrote,
> >I've attached a draft proposal for RDS from Oracle which discusses
> >some of the motivation for RDS.
>
> I assume that you have a driver that uses TCP sockets, correct ?
> If so, have you compared the performance of RDS to SDP ?
>
> woody
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Rick wrote,
>I've attached a draft proposal for RDS from Oracle which discusses
>some of the motivation for RDS.

I assume that you have a driver that uses TCP sockets, correct ?
If so, have you compared the performance of RDS to SDP ?

woody
RE: [openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
Grant wrote,
>2) include some docs on its use and why RDS is better than SDP.
>3) nag people to review the ported code
>4) post functional test results

Looking at the code that is in the contrib branch, it looks like RDS
uses connected channels. Is that correct ? If so, I do not see that it
provides any value over SDP. If it indeed were using datagrams over IB,
then I could see that it might provide for better scaling than SDP,
since with very large numbers of connections memory usage becomes an
issue, but as it is currently coded, I don't see the point. I was
unable to attend the RDS talk at the OpenIB workshop, so perhaps Rick
can provide some reason why this protocol is better than SDP.

woody
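[Editor's note] The scaling concern raised here (and Rick's "100k+ connections" figure elsewhere in the thread) can be sketched with rough, hypothetical numbers; the node and process counts below are illustrative, not taken from Oracle's deployment. With a per-socket connected model (SDP or TCP), the socket count per node grows with the square of the process count, while an RDS-style design needs only one underlying RC connection per remote node, shared by all local processes.

```python
# Hypothetical cluster: 32 nodes, 100 database processes per node.
nodes = 32
procs_per_node = 100

# Connected-socket model: every local process may need a socket to
# every process on every remote node.
sockets_per_node = procs_per_node * procs_per_node * (nodes - 1)

# RDS-style model: all local sockets to a given remote node are
# multiplexed over a single RC connection.
rds_connections_per_node = nodes - 1

print(sockets_per_node)          # per-socket connections to track
print(rds_connections_per_node)  # RC connections to track
```

Under these assumptions the connected model tracks hundreds of thousands of connection states (and their flow-control credits) per node, versus a few dozen for the multiplexed design, which is the memory-usage argument for datagram semantics at the socket layer.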