RE: [openib-general] RDMA connection and address translation API
Sean wrote:
> >>It looks like this would work. If a client wanted to create multiple
> >>connections to the same remote service (for example, to separate
> >>control and data), then it seems more efficient to move the
> >>asynchronous at outside of the connect call.
> >>- Sean
>
> That's a good point. What I had in mind was mainly simplicity for the
> consumer - save him dealing with another upcall.
>
> Maybe caching in the at module would make things better, but I agree
> that for multiple connections to the same remote service, the
> asynchronous at approach seems more appropriate.

OTOH, after thinking about it some more, there might be problems in
letting each and every consumer do his own caching. at.c has a
(not yet implemented) mechanism with invalidate for caching tables.
Do we really want to let the consumer handle all the cases of routing
tables changing on the fly etc., or centralize it in one place
(i.e. at.c)?

What do you think, Sean?

Guy
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
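The centralized caching idea can be sketched as a small translation table with a single invalidate hook. This is only an illustration, not the at.c mechanism (which the message says is not implemented yet): the names `at_cache`, `at_cache_store`, `at_cache_lookup` and `at_cache_invalidate`, the fixed table size, and the generation-counter scheme are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define AT_CACHE_SLOTS 16  /* hypothetical fixed-size table */

/* Hypothetical cached result of one IP -> GID translation. */
struct at_cache_entry {
    uint32_t ip;          /* destination IPv4 address, host order */
    uint8_t  gid[16];     /* resolved IB GID */
    uint64_t generation;  /* cache generation when the entry was stored */
    int      in_use;
};

struct at_cache {
    struct at_cache_entry slot[AT_CACHE_SLOTS];
    uint64_t generation;  /* bumped on every invalidate */
};

static void at_cache_init(struct at_cache *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
    c->generation = 1;
}

/* Store a translation, overwriting whatever occupied the hashed slot. */
static void at_cache_store(struct at_cache *c, uint32_t ip,
                           const uint8_t gid[16])
{
    struct at_cache_entry *e = &c->slot[ip % AT_CACHE_SLOTS];

    e->ip = ip;
    memcpy(e->gid, gid, 16);
    e->generation = c->generation;
    e->in_use = 1;
}

/* Return 1 and fill gid on a fresh hit, 0 on a miss or a stale entry. */
static int at_cache_lookup(const struct at_cache *c, uint32_t ip,
                           uint8_t gid[16])
{
    const struct at_cache_entry *e = &c->slot[ip % AT_CACHE_SLOTS];

    if (!e->in_use || e->ip != ip || e->generation != c->generation)
        return 0;
    memcpy(gid, e->gid, 16);
    return 1;
}

/* One call makes every cached entry stale, e.g. after a routing change. */
static void at_cache_invalidate(struct at_cache *c)
{
    c->generation++;
}
```

With the table in one place, a routing change needs one `at_cache_invalidate()` call rather than a notification to every consumer, which is the trade-off the message is asking about.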
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: Guy German [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 26, 2005 12:28 PM
> To: Caitlin Bestler; Sean Hefty; James Lentini
> Cc: openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address
> translation API
>
> > What do you think about this flow ?
> > 1. resolve device and port from ip address - synchronous operation
> >    (like at.c resolve_ip)
> > 2. rdma_create_qp (device+port) - modifies qp to init with default
> >    pkey index
> > 3. ib_post_recvs(...);
> > 4. cma_connect - asynchronous at, modify qp with correct pkey index,
> >    cm_connect
>
> Caitlin wrote:
> >At least with iWARP a QP is not bound to a specific port, or even to
> >an IP Address. It is only bound to the RDMA Device (RNIC) and
> >Protection Domain. The same QP can be re-used for a new connection
> >with a new IP address. Indeed, that is exactly what would happen with
> >application-layer controlled failover (such as iSER).
>
> In ib, in order to post receive the QP needs to be in init.
> In order to modify qp to init, you need port and pkey_index.
> If iWARP can post receive without it, the iwarp
> implementation of "rdma_create_qp" can ignore the port attribute.

The closest equivalent of a pkey_index would be the VLAN ID, which is
at L2 and totally transparent to an iWARP QP. You can definitely post
receive buffers before knowing anything about the TCP connection (or
SCTP association/stream) that will provide the LLP service.

> The other option, that was suggested to solve the sync
> problem (need of post receive before connect) is to retrieve
> the path synchronously, which will require an unnecessary
> upcall handling for iwarp consumers.

The generic requirement is that the QP passed to the connect method is
ready to be moved to a connected state as soon as the connection
establishment exchanges have finished.

If I follow what you are proposing, you are trying to find a way to do
this for IB automatically as a by-product of determining what device
to use. I don't see any problem with this, as long as the "port" being
returned from the first call is defined in such a way that it can have
a void value when the transport does not need this refinement.

Avoiding transport-dependent steps is good for encouraging development
of RDMA-aware applications.
RE: [openib-general] RDMA connection and address translation API
>What do you think about this flow ?
>1. resolve device and port from ip address - synchronous operation
>   (like at.c resolve_ip)
>2. rdma_create_qp (device+port) - modifies qp to init with default
>   pkey index
>3. ib_post_recvs(...);
>4. cma_connect - asynchronous at, modify qp with correct pkey index,
>   cm_connect

>>It looks like this would work. If a client wanted to create multiple
>>connections to the same remote service (for example, to separate
>>control and data), then it seems more efficient to move the
>>asynchronous at outside of the connect call.
>>- Sean

That's a good point. What I had in mind was mainly simplicity for the
consumer - save him dealing with another upcall.

Maybe caching in the at module would make things better, but I agree
that for multiple connections to the same remote service, the
asynchronous at approach seems more appropriate.

So ... does everyone else think that we should change the API of the
cm abstraction to do the asynchronous at before the connection?
(This should concern mostly the iWARP guys - Caitlin, Tom etc.)

Thanks,
Guy
RE: [openib-general] RDMA connection and address translation API
>What do you think about this flow ?
>1. resolve device and port from ip address - synchronous operation
>   (like at.c resolve_ip)
>2. rdma_create_qp (device+port) - modifies qp to init with default
>   pkey index
>3. ib_post_recvs(...);
>4. cma_connect - asynchronous at, modify qp with correct pkey index,
>   cm_connect

It looks like this would work. If a client wanted to create multiple
connections to the same remote service (for example, to separate
control and data), then it seems more efficient to move the
asynchronous at outside of the connect call.

- Sean
RE: [openib-general] RDMA connection and address translation API
> What do you think about this flow ?
> 1. resolve device and port from ip address - synchronous operation
>    (like at.c resolve_ip)
> 2. rdma_create_qp (device+port) - modifies qp to init with
>    default pkey index
> 3. ib_post_recvs(...);
> 4. cma_connect - asynchronous at, modify qp with correct
>    pkey index, cm_connect

Caitlin wrote:
>At least with iWARP a QP is not bound to a specific port, or even
>to an IP Address. It is only bound to the RDMA Device (RNIC) and
>Protection Domain. The same QP can be re-used for a new connection
>with a new IP address. Indeed, that is exactly what would happen
>with application-layer controlled failover (such as iSER).

In ib, in order to post receive the QP needs to be in init.
In order to modify qp to init, you need port and pkey_index.
If iWARP can post receive without it, the iwarp implementation of
"rdma_create_qp" can ignore the port attribute.

The other option that was suggested to solve the sync problem
(need of post receive before connect) is to retrieve the path
synchronously, which will require an unnecessary upcall handling
for iwarp consumers.

Guy
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Guy German
> Sent: Friday, August 26, 2005 1:27 AM
> To: Sean Hefty; James Lentini
> Cc: openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address
> translation API
>
> >> We need to insert in here:
> >>
> >> ib_modify_qp(...); /* somehow uses address resolution... */
> >> ib_post_recvs(...);
> >>
> >
> >or add a new call to create the qp and modify it to init (an analog
> >to the socket(2) function).
>
> Sean> This approach seems reasonable to me. Maybe something like:
> Sean> rdma_create_qp(rdma_addr_info);
>
> Sean> Uses the output from the address resolution to create the QP on
> Sean> the correct device and transitions it to the INIT state. The
> Sean> user can now post any work requests that they want. For example,
> Sean> with iWarp, I believe that even send work requests can be posted
> Sean> in the INIT state.
>
> What do you think about this flow ?
> 1. resolve device and port from ip address - synchronous operation
>    (like at.c resolve_ip)
> 2. rdma_create_qp (device+port) - modifies qp to init with
>    default pkey index
> 3. ib_post_recvs(...);
> 4. cma_connect - asynchronous at, modify qp with correct pkey
>    index, cm_connect

At least with iWARP a QP is not bound to a specific port, or even to
an IP Address. It is only bound to the RDMA Device (RNIC) and
Protection Domain. The same QP can be re-used for a new connection
with a new IP address. Indeed, that is exactly what would happen with
application-layer controlled failover (such as iSER).
RE: [openib-general] RDMA connection and address translation API
On Thu, 25 Aug 2005, Sean Hefty wrote:

> >> Any way providing src/dst IPs in the CM Private data is simple,
> >> and we can come with IBTA extension blessing that data structure
> >> as a general way to map IP oriented protocols over IB (a 1-2 page
> >> draft at the most) This way it can also address Caitlin concerns
> >> regarding NFS & IETF (since now it's a transport specific issue)
> >
> >How long do you estimate it would take to standardize an IP<->GID
> >mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months?
> >A year?
> >
> >Let's assume that everyone on this list is in agreement.
>
> Does anyone in the IB world disagree with adding IP addresses in the
> CM private data area? Would we want to extend this concept to SIDR
> as well?

I think we should focus on providing a mechanism to allow ULPs to use
IP addresses on InfiniBand networks.

Service discovery (SIDR) seems like a separate issue. The ability to
ask "What UD QPN is this service using?" seems useful on its own.
RE: [openib-general] RDMA connection and address translation API
>> We need to insert in here:
>>
>> ib_modify_qp(...); /* somehow uses address resolution... */
>> ib_post_recvs(...);
>>
>
>or add a new call to create the qp and modify it to init (an analog to
>the socket(2) function).

Sean> This approach seems reasonable to me. Maybe something like:
Sean> rdma_create_qp(rdma_addr_info);

Sean> Uses the output from the address resolution to create the QP on the
Sean> correct device and transitions it to the INIT state. The user can
Sean> now post any work requests that they want. For example, with iWarp,
Sean> I believe that even send work requests can be posted in the INIT state.

What do you think about this flow ?
1. resolve device and port from ip address - synchronous operation
   (like at.c resolve_ip)
2. rdma_create_qp (device+port) - modifies qp to init with default
   pkey index
3. ib_post_recvs(...);
4. cma_connect - asynchronous at, modify qp with correct pkey index,
   cm_connect

Guy
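The four-step flow proposed in this message can be walked through with a toy QP state machine. Everything here is a mock: the `mock_` functions are hypothetical stand-ins for the calls named in the message, the real INIT -> RTR -> RTS transitions are collapsed into one step, and a port value of 0 is assumed to mean "not applicable" for transports such as iWARP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum mock_qp_state { MOCK_QPS_RESET, MOCK_QPS_INIT, MOCK_QPS_RTS };

struct mock_qp {
    enum mock_qp_state state;
    int      device;        /* index of the RDMA device the QP lives on */
    uint8_t  port;          /* IB port; 0 assumed to mean "not applicable" */
    uint16_t pkey_index;
    unsigned recvs_posted;
};

/* Step 1: synchronous lookup of local device and port for a destination. */
static int mock_resolve_ip(uint32_t dst_ip, int *device, uint8_t *port)
{
    assert(device != NULL && port != NULL);
    if (dst_ip == 0)
        return -1;          /* the mock treats 0.0.0.0 as unresolvable */
    *device = 0;            /* the mock pretends there is one device */
    *port = 1;
    return 0;
}

/* Step 2: create the QP and move it to INIT with default pkey index 0. */
static void mock_create_qp(struct mock_qp *qp, int device, uint8_t port)
{
    qp->state = MOCK_QPS_INIT;
    qp->device = device;
    qp->port = port;
    qp->pkey_index = 0;
    qp->recvs_posted = 0;
}

/* Step 3: posting receives is rejected until the QP has left RESET. */
static int mock_post_recv(struct mock_qp *qp)
{
    if (qp->state == MOCK_QPS_RESET)
        return -1;
    qp->recvs_posted++;
    return 0;
}

/*
 * Step 4, completion side: the asynchronous address translation supplies
 * the real pkey index, the QP is modified, and the connection completes.
 */
static int mock_connect_done(struct mock_qp *qp, uint16_t resolved_pkey_index)
{
    if (qp->state != MOCK_QPS_INIT)
        return -1;
    qp->pkey_index = resolved_pkey_index;
    qp->state = MOCK_QPS_RTS;
    return 0;
}
```

The point of the ordering is visible in `mock_post_recv()`: receives can be queued as soon as step 2 has run, before the slow asynchronous translation in step 4 has produced the final pkey index.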
Re: [openib-general] RDMA connection and address translation API
On Thu, Aug 25, 2005 at 01:18:06PM -0400, Talpey, Thomas wrote:
> At 12:56 PM 8/25/2005, Caitlin Bestler wrote:
> >Generic code MUST support both IPv4 and IPv6 addresses.
> >I've even seen code that actually does this.
>
> Let me jump ahead to the root question. How will the NFS layer know
> what address to resolve?
>
> On IB mounts, it will need to resolve a hostname or numeric string to
> a GID, in order to provide the address to connect. On TCP/UDP, or
> iWARP mounts, it must resolve to IP address. The mount command
> has little or no context to perform these lookups, since it does not
> know what interface will be used to form the connection.
>
> In exports, the server must inspect the source network of each
> incoming request, in order to match against /etc/exports. If there
> are wildcards in the file, a GID-specific algorithm must be applied.
> Historically, /etc/exports contains hostnames and IPv4 netmasks/
> addresses.
>
> In either case, I think it is a red herring to assume that the GID
> is actually an IPv6 address. They are not assigned by the sysadmin,
> they are not subnetted, and they are quite foreign to many users.
> IPv6 support for Linux NFS isn't even submitted yet, btw.
>
> With an IP address service, we don't have to change a line of
> NFS code.

I think this shows that using IP addresses in any service over
infiniband that isn't actually IP networking is extremely stupid. Just
stop living in the illusion that it makes sense and use IB-specific
addressing, namely IB, and stop all these layering violations into IP,
which is much higher up the stack.
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 25, 2005 2:37 PM
> To: 'James Lentini'; Yaron Haviv
> Cc: openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address translation API
>
> >> Any way providing src/dst IPs in the CM Private data is simple, and we
> >> can come with IBTA extension blessing that data structure as a general
> >> way to map IP oriented protocols over IB (a 1-2 page draft at the most)
> >> This way it can also address Caitlin concerns regarding NFS & IETF
> >> (since now it's a transport specific issue)
> >
> >How long do you estimate it would take to standardize an IP<->GID
> >mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A
> >year?
> >
> >Let's assume that everyone on this list is in agreement.
>
> Does anyone in the IB world disagree with adding IP addresses in the CM
> private data area? Would we want to extend this concept to SIDR as well?
>
> - Sean

I am re-sending my proposal from 2004 as text (attached).
It also addresses the ServiceID issue; this can be a baseline for
discussions.
Feel free to change.

Yaron


Mapping of iWarp/TCP connections to InfiniBand

AUTHOR  Yaron Haviv ([EMAIL PROTECTED])
VERSION 0.30, Mon June 28 2004

I. INTRODUCTION

InfiniBand and iWarp semantics are similar, especially with the latest
Verb Extensions. The major difference is in the way connections are
established: iWarp uses TCP-based connection establishment while
InfiniBand uses a CM for that. Another related difference is that in
iWarp a user can start in a standard TCP mode and migrate to RDMA
verbs in the middle of a session.

The following document provides a general mapping from iWarp/TCP
connection establishment to InfiniBand which can be used by ULPs over
InfiniBand or by any other future iWarp protocols. It imitates the SDP
connection establishment process and CM headers (it does not require
SDP, it just has the same data formats for CM messages).

II. Establishing TCP/iWarp-like connections over InfiniBand

In order to emulate an iWarp connection, it is required to open an
InfiniBand RC connection and associate it with IP addresses and TCP
ports. In addition, protocols may transfer control/login packets before
the migration to the RDMA mode; this requires exchanging receiver
buffer size and depth for initial usage (the ULPs will manage the flow
control for the duration of the connection).

The mapping uses the same data structures already defined for
connection establishment in SDP (IBTA Socket Direct Protocol), which
accomplish the same goal of mapping TCP Sockets addressing to
InfiniBand; the non-relevant SDP fields were Reserved.

iWarp emulation CM Request (Hello) Private Data header

   0                   1                   2                   3
   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
04|      MID      |     Rsvd      |             bufs              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
08|                              len                              |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
12|                           Reserved                            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
16|                           Reserved                            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
20| MajVer| MinVer| IPVer | FlowC |           Reserved            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
24|                          DesRemRcvSz                          |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
28|                          LocalRcvSz                           |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
32|          Local Port           |           Reserved            |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
36|                        Src IP (127-96)                        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
40|                        Src IP ( 95-64)                        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
44|                        Src IP ( 63-32)                        |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
48|                        Src IP ( 31-00
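The visible part of the Hello private data header could be carried in a C struct along these lines. This is a sketch under stated assumptions rather than the proposal's definition: the archived diagram is cut off inside the source address, so only the first 48 bytes are modelled; the field widths (8/8/16 bits in the first word, four 4-bit fields in the version word, 16-bit Local Port) are inferred from the SDP layout the proposal says it reuses; and the struct name, field names and nibble helpers are hypothetical. Multi-byte fields would be big-endian on the wire, which the sketch does not convert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the first 48 bytes of the Hello header. */
struct iwarp_emu_hello {
    uint8_t  mid;             /* message identifier */
    uint8_t  rsvd0;
    uint16_t bufs;            /* number of receive buffers advertised */
    uint32_t len;
    uint32_t rsvd1;
    uint32_t rsvd2;
    uint8_t  maj_min_ver;     /* MajVer in high nibble, MinVer in low (assumed) */
    uint8_t  ipver_flowc;     /* IPVer in high nibble, FlowC in low (assumed) */
    uint16_t rsvd3;
    uint32_t des_rem_rcv_sz;  /* desired remote receive size */
    uint32_t local_rcv_sz;    /* local receive size */
    uint16_t local_port;
    uint16_t rsvd4;
    uint8_t  src_ip[16];      /* source IP, most significant byte first */
};

/* Pack two 4-bit values into one byte, first field in the high nibble. */
static uint8_t hello_pack_nibbles(uint8_t hi, uint8_t lo)
{
    assert(hi < 16 && lo < 16);
    return (uint8_t)((hi << 4) | lo);
}

static uint8_t hello_hi_nibble(uint8_t b)
{
    return (uint8_t)(b >> 4);
}

static uint8_t hello_lo_nibble(uint8_t b)
{
    return (uint8_t)(b & 0x0f);
}
```

The source address lands at byte offset 32, which is consistent with the row labels in the diagram if they are read as the offset of the end of each 32-bit row.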
RE: [openib-general] RDMA connection and address translation API
>> Any way providing src/dst IPs in the CM Private data is simple, and we
>> can come with IBTA extension blessing that data structure as a general
>> way to map IP oriented protocols over IB (a 1-2 page draft at the most)
>> This way it can also address Caitlin concerns regarding NFS & IETF
>> (since now it's a transport specific issue)
>
>How long do you estimate it would take to standardize an IP<->GID
>mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A
>year?
>
>Let's assume that everyone on this list is in agreement.

Does anyone in the IB world disagree with adding IP addresses in the CM
private data area? Would we want to extend this concept to SIDR as well?

- Sean
RE: [openib-general] RDMA connection and address translation API
>Sean> Another possibility could be to add a list of receives to
>Sean> rdma_connect().
>
>Guy> I added this to both connect and accept calls
>
>I don't think this is a good idea. Let's try to streamline the
>connect call, not add every single possible feature to it.

I don't think that we want to add a list of receives to the connect
call either. I only mentioned that it was a possibility.

- Sean
RE: [openib-general] RDMA connection and address translation API
>> We need to insert in here:
>>
>> ib_modify_qp(...); /* somehow uses address resolution... */
>> ib_post_recvs(...);
>>
>
>or add a new call to create the qp and modify it to init (an analog to
>the socket(2) function).

This approach seems reasonable to me. Maybe something like:

rdma_create_qp(rdma_addr_info);

Uses the output from the address resolution to create the QP on the
correct device and transitions it to the INIT state. The user can now
post any work requests that they want. For example, with iWarp, I
believe that even send work requests can be posted in the INIT state.

- Sean
Re: [openib-general] RDMA connection and address translation API
On Tue, 23 Aug 2005, Roland Dreier wrote:

> The listen side is even simpler:
>
> rdma_listen():
>     inputs: local service, event callback, consumer context
>
>     Wait for connection requests and pass events to the consumer's
>     callback. I'm not sure if/how we want to support binding to
>     a particular IP address. The current IB CM in Linux doesn't
>     support binding a listen to a single device or port, and even
>     if it did it's not clear how to handle binding to one IP
>     address when a port has more than one IP.
>
>     I guess the event callback would receive a device pointer and
>     the same RDMA transport address union I talked about above
>     when discussing address resolution.
>
>     It would be possible to have another function like
>     rdma_getpeername() that takes the transport address and
>     returns a source IP address.

To be complete, the API needs an rdma_getpeername() function:

rdma_getpeername():
    inputs: connected QP
    outputs: peer IP address

>     In the IB case this would do an
>     ATS reverse lookup. However, I hate this idea. iSER already
>     uses the CM private data to pass the source IP in the IB case,
>     and I would much rather fix NFS/RDMA to do the same thing (so
>     we can just kill ATS as an address resolution method).
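One way to picture the proposed rdma_getpeername() (connected QP in, peer IP address out) is a lookup in a table that the passive side fills when a connection request arrives, for example from a source IP carried in the CM private data. The sketch below is a mock of that idea only: `mock_record_peer`, `mock_getpeername`, the fixed-size table and keying by QP number are assumptions, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MOCK_MAX_CONNS 8  /* hypothetical table size */

/* Peer address as it might arrive in CM private data. */
struct mock_peer_addr {
    int     is_ipv6;      /* 0: IPv4 in addr[0..3], 1: IPv6 in addr[0..15] */
    uint8_t addr[16];
};

struct mock_conn {
    uint32_t qp_num;
    int      connected;
    struct mock_peer_addr peer;
};

static struct mock_conn mock_conns[MOCK_MAX_CONNS];

/* Called when a connection is established: remember who the peer is. */
static int mock_record_peer(uint32_t qp_num, const struct mock_peer_addr *peer)
{
    size_t i;

    assert(peer != NULL);
    for (i = 0; i < MOCK_MAX_CONNS; i++) {
        if (!mock_conns[i].connected) {
            mock_conns[i].qp_num = qp_num;
            mock_conns[i].peer = *peer;
            mock_conns[i].connected = 1;
            return 0;
        }
    }
    return -1;  /* table full */
}

/* inputs: connected QP (by number); outputs: peer IP address. */
static int mock_getpeername(uint32_t qp_num, struct mock_peer_addr *out)
{
    size_t i;

    assert(out != NULL);
    for (i = 0; i < MOCK_MAX_CONNS; i++) {
        if (mock_conns[i].connected && mock_conns[i].qp_num == qp_num) {
            *out = mock_conns[i].peer;
            return 0;
        }
    }
    return -1;  /* QP not connected */
}
```

Because the address is recorded at connect time, the lookup needs no reverse query on the fabric, which is the property that makes the private-data approach attractive compared with an ATS reverse lookup.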
RE: [openib-general] RDMA connection and address translation API
At 12:56 PM 8/25/2005, Caitlin Bestler wrote:
>Generic code MUST support both IPv4 and IPv6 addresses.
>I've even seen code that actually does this.

Let me jump ahead to the root question. How will the NFS layer know
what address to resolve?

On IB mounts, it will need to resolve a hostname or numeric string to
a GID, in order to provide the address to connect. On TCP/UDP, or
iWARP mounts, it must resolve to IP address. The mount command
has little or no context to perform these lookups, since it does not
know what interface will be used to form the connection.

In exports, the server must inspect the source network of each
incoming request, in order to match against /etc/exports. If there
are wildcards in the file, a GID-specific algorithm must be applied.
Historically, /etc/exports contains hostnames and IPv4 netmasks/
addresses.

In either case, I think it is a red herring to assume that the GID
is actually an IPv6 address. They are not assigned by the sysadmin,
they are not subnetted, and they are quite foreign to many users.
IPv6 support for Linux NFS isn't even submitted yet, btw.

With an IP address service, we don't have to change a line of
NFS code.

Tom.

>So supporting GIDs is not that much of an issue as long
>as no IB network IDs are assigned with a meaning that
>conflicts with any reachable IPv6 network ID. (In other
>words, assign GIDs so that they are in fact valid IPv6
>addresses. Something that was always planned to be one
>option for GIDs).
>
>> -----Original Message-----
>> From: [EMAIL PROTECTED]
>> [mailto:[EMAIL PROTECTED] On Behalf Of James Lentini
>> Sent: Thursday, August 25, 2005 9:48 AM
>> To: Tom Tucker
>> Cc: openib-general@openib.org
>> Subject: RE: [openib-general] RDMA connection and address
>> translation API
>>
>> On Wed, 24 Aug 2005, Tom Tucker wrote:
>>
>> > > - It's not just preventing connections to the wrong local
>> > >   address. NFS-RDMA wants the remote source address (ie
>> > >   getpeername()) so that it can look it up in the exports list.
>> >
>> > Agreed. But you could also get rid of ATS by allowing GIDs to be
>> > specified in the exports file and then treating them like
>> > IPv6 addresses for the purpose of subnet comparisons.
>>
>> Could generic code use both GIDs and IPv4 addresses?
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Sean Hefty wrote:

> >With this in mind, I believe that the connection API needs to be
> >something more like the following:
> >
> >rdma_resolve_address():
> >    inputs: dest IP address, qos, npaths,
> >            done callback, opaque context
> >    done callback params: status, local RDMA device,
> >            RDMA transport address, context
> ...
> >rdma_connect():
> >    inputs: local QP, RDMA transport address, destination service,
> >            private data, timeout, event callback, opaque context
>
> Have we agreed that this is the functionality that we should be
> aiming towards?

I think so, but as you pointed out the local QP must be in the init
state.

> >rdma_resolve_address(...);
> >/* wait for resolution */
> >ib_create_qp(...) /* use device pointer we got from
> >                     rdma_resolve_address() */
>
> We need to insert in here:
>
> ib_modify_qp(...); /* somehow uses address resolution... */
> ib_post_recvs(...);

or add a new call to create the qp and modify it to init (an analog to
the socket(2) function).

> >rdma_connect(...); /* pass transport address we got from
> >                      rdma_resolve_address() */
> >/* wait for connection to finish... */
>
> Another possibility could be to add a list of receives to
> rdma_connect().

The caller might also want to setup memory windows. Requiring the qp
to be in the init state before calling connect seems cleaner to me.

> - Sean
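The callback-driven shape of the proposed rdma_resolve_address() can be shown with a single-request mock. The input and callback parameter lists follow the ones quoted in this message; everything else is assumed for illustration: the `sketch_` names, the contents of the transport address, the one-slot pending queue, and `sketch_run_pending()` standing in for whatever would actually complete the resolution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_device { int index; };

/* Hypothetical transport address: what an IB consumer might need. */
struct sketch_transport_addr {
    uint8_t  dgid[16];
    uint16_t dlid;
};

/* done callback params: status, local RDMA device, transport address, context */
typedef void (*sketch_resolve_done)(int status, struct sketch_device *dev,
                                    const struct sketch_transport_addr *addr,
                                    void *context);

struct sketch_pending {
    int active;
    uint32_t dst_ip;
    sketch_resolve_done done;
    void *context;
};

static struct sketch_pending sketch_pending_req;
static struct sketch_device sketch_dev0 = { 0 };

/* inputs: dest IP address, qos, npaths, done callback, opaque context */
static int sketch_resolve_address(uint32_t dst_ip, int qos, int npaths,
                                  sketch_resolve_done done, void *context)
{
    (void)qos;      /* ignored by the mock */
    (void)npaths;   /* ignored by the mock */
    assert(done != NULL);
    if (sketch_pending_req.active)
        return -1;  /* the mock holds only one outstanding request */
    sketch_pending_req.active = 1;
    sketch_pending_req.dst_ip = dst_ip;
    sketch_pending_req.done = done;
    sketch_pending_req.context = context;
    return 0;       /* result is delivered later through the callback */
}

/* Stand-in for the asynchronous completion of the resolution. */
static void sketch_run_pending(void)
{
    struct sketch_transport_addr addr = { {0}, 0 };
    struct sketch_pending req = sketch_pending_req;

    if (!req.active)
        return;
    sketch_pending_req.active = 0;
    if (req.dst_ip == 0) {
        req.done(-1, NULL, NULL, req.context);
        return;
    }
    addr.dgid[15] = (uint8_t)(req.dst_ip & 0xff);  /* fabricated mock result */
    addr.dlid = 1;
    req.done(0, &sketch_dev0, &addr, req.context);
}

/* Example consumer callback: records what it was told. */
struct sketch_result { int called; int status; int dev_index; uint16_t dlid; };

static void sketch_record_result(int status, struct sketch_device *dev,
                                 const struct sketch_transport_addr *addr,
                                 void *context)
{
    struct sketch_result *r = context;

    r->called = 1;
    r->status = status;
    r->dev_index = dev ? dev->index : -1;
    r->dlid = addr ? addr->dlid : 0;
}
```

In a real consumer the callback is where the QP would be created on the returned device, moved to INIT, and given its receives, after which the connect call can be issued with the returned transport address.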
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Fab Tillier wrote:

> Performing a forward lookup via ARP is going to be a lot faster than
> ATS if the ARP entry already exists.

ATS responses could also be cached.
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: James Lentini [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 25, 2005 12:21 PM
> To: Yaron Haviv
> Cc: Fab Tillier; Roland Dreier; openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address translation API
>
> On Wed, 24 Aug 2005, Yaron Haviv wrote:
>
> > Any way providing src/dst IPs in the CM Private data is simple, and we
> > can come with IBTA extension blessing that data structure as a general
> > way to map IP oriented protocols over IB (a 1-2 page draft at the most)
> > This way it can also address Caitlin concerns regarding NFS & IETF
> > (since now it's a transport specific issue)
>
> How long do you estimate it would take to standardize an IP<->GID
> mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A
> year?
>
> Let's assume that everyone on this list is in agreement.

James,

I can identify enough IBTA members on this list.
If the group is in agreement, I believe it's a rather short process,
since it's just a minor definition and the IBTA doesn't have much on
its agenda these days.

For example, Hal added a feature to the SM (client re-register ..) in
weeks, based on the OpenIB input.

We also don't have to wait for a finalized spec to implement, just as
we implement IPoIB without an IETF RFC (only a draft).

By the way, a quick path could be to define it in DAT and hand it over
to the IBTA; after all, ATS is also not an IBTA standard.

Yaron
RE: [openib-general] RDMA connection and address translation API
Generic code MUST support both IPv4 and IPv6 addresses.
I've even seen code that actually does this.

So supporting GIDs is not that much of an issue as long
as no IB network IDs are assigned with a meaning that
conflicts with any reachable IPv6 network ID. (In other
words, assign GIDs so that they are in fact valid IPv6
addresses. Something that was always planned to be one
option for GIDs).

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of James Lentini
> Sent: Thursday, August 25, 2005 9:48 AM
> To: Tom Tucker
> Cc: openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address
> translation API
>
> On Wed, 24 Aug 2005, Tom Tucker wrote:
>
> > > - It's not just preventing connections to the wrong local
> > >   address. NFS-RDMA wants the remote source address (ie
> > >   getpeername()) so that it can look it up in the exports list.
> >
> > Agreed. But you could also get rid of ATS by allowing GIDs to be
> > specified in the exports file and then treating them like
> > IPv6 addresses for the purpose of subnet comparisons.
>
> Could generic code use both GIDs and IPv4 addresses?
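Treating GIDs like IPv6 addresses for subnet comparisons comes down to a prefix match over 16 bytes, and IPv4 can share the same code path if it is first widened to the IPv4-mapped IPv6 form (::ffff:a.b.c.d). The helpers below are a minimal sketch of that comparison; the function names are hypothetical and nothing here is taken from an existing exports implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Widen an IPv4 address to the IPv4-mapped IPv6 form ::ffff:a.b.c.d. */
static void addr_from_ipv4(uint8_t out[16], const uint8_t v4[4])
{
    memset(out, 0, 10);
    out[10] = 0xff;
    out[11] = 0xff;
    memcpy(out + 12, v4, 4);
}

/*
 * Return 1 if the first prefix_bits bits of a and b are equal.
 * Works unchanged for IPv6 addresses and for 128-bit IB GIDs.
 */
static int addr_prefix_match(const uint8_t a[16], const uint8_t b[16],
                             unsigned prefix_bits)
{
    unsigned full = prefix_bits / 8;  /* whole bytes covered by the prefix */
    unsigned rem = prefix_bits % 8;   /* leftover bits in the next byte */
    uint8_t mask;

    assert(prefix_bits <= 128);
    if (memcmp(a, b, full) != 0)
        return 0;
    if (rem == 0)
        return 1;
    mask = (uint8_t)(0xff << (8 - rem));
    return (a[full] & mask) == (b[full] & mask);
}
```

An IPv4 /24 entry becomes a 96 + 24 = 120 bit prefix on the mapped form, while a GID entry would typically use the 64-bit subnet prefix, so one comparison routine covers both kinds of entry.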
Re: [openib-general] RDMA connection and address translation API
At 12:34 PM 8/25/2005, Roland Dreier wrote:
>All implementation of NFS/RDMA on top of IB had better interoperate,
>right? Which means that someone has to specify which address
>translation mechanism is the choice for NFS/RDMA.

Correct. At the moment the existing NFS/RDMA implementations use ATS
(Sun's and NetApp's).

>NFS/RDMA is being defined on top of an abstract RDMA interface.
>Someone has to write a spec for how that RDMA abstraction is
>translated into packets on the wire for each transport that NFS/RDMA
>will run on top of.

Well, we did. We specify the ULP payload of all the messages in those
two IETF documents. What we didn't do is define how each transport
handles IP addressing; that is a transport issue.

We don't need address translation over iWARP, since that uses IP.
Over IB, so far, we have used ATS. I am perfectly fine with a better
solution, but ATS has been fine too.

I am catching up to this discussion, so this is just one reply.

Tom.
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Tom Tucker wrote:

> > - It's not just preventing connections to the wrong local address.
> >   NFS-RDMA wants the remote source address (ie getpeername()) so that
> >   it can look it up in the exports list.
>
> Agreed. But you could also get rid of ATS by allowing GIDs to
> be specified in the exports file and then treating them like
> IPv6 addresses for the purpose of subnet comparisons.

Could generic code use both GIDs and IPv4 addresses?
Re: [openib-general] RDMA connection and address translation API
    Roland> No, I think we just need to realize that a perfectly
    Roland> transport neutral protocol implementation is not
    Roland> achievable.

    James> It is achievable. Although the IB and iWARP protocols are
    James> different, they can provide the same services to NFS-RDMA.

Not really. This is just hiding the transport dependence in some other
layer and then pretending it doesn't exist. IB and iWARP can provide
the same services to NFS/RDMA, but only through some intermediate
layer that implements the actual transport-dependent wire protocol.

    James> IB is missing one service that iWARP has, namely that nodes
    James> can be identified with IP addresses. The ATS mechanism
    James> provides this capability for IB networks. If there are
    James> better mechanisms that do the same thing, then NFS-RDMA can
    James> use them.

All implementations of NFS/RDMA on top of IB had better interoperate,
right? Which means that someone has to specify which address
translation mechanism is the choice for NFS/RDMA.

    James> The important thing is not to push this up into the
    James> ULPs. The NFS-RDMA protocol is being standardized in the
    James> IETF. There is no reason to upset that process. If an
    James> additional IB specific protocol is necessary, it should be
    James> standardized in the IBTA.

NFS/RDMA is being defined on top of an abstract RDMA interface.
Someone has to write a spec for how that RDMA abstraction is
translated into packets on the wire for each transport that NFS/RDMA
will run on top of.

 - R.
Re: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Roland Dreier wrote:

>     James> I agree with Caitlin. The eventual solution cannot force
>     James> protocol modifications in ULPs.
>
> Does this mean we're stuck with the current use of ATS in NFS-RDMA?

NFS-RDMA requires that the lower layer provide IP addressing. ATS is
one proposal and the only one being documented and standardized in a
standards organization. Any other solution that was documented and
standardized should be considered. Since this will involve the wire
protocol, it can't be OpenIB specific.

> Surely there's still time to fix the protocol.

I believe that a solution can be found without impacting the NFS-RDMA
specifications.
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Yaron Haviv wrote:

> Anyway, providing src/dst IPs in the CM private data is simple, and we
> can come up with an IBTA extension blessing that data structure as a
> general way to map IP oriented protocols over IB (a 1-2 page draft at
> the most). This way it can also address Caitlin's concerns regarding
> NFS & IETF (since now it's a transport specific issue).

How long do you estimate it would take to standardize an IP<->GID
mechanism (ATS, CM embedded, ...) in the IBTA? 3 months? 6 months? A
year? Let's assume that everyone on this list is in agreement.
Re: [openib-general] RDMA connection and address translation API
On Thu, 2005-08-25 at 08:58 -0700, Roland Dreier wrote:

>     Sean> Another possibility could be to add a list of receives to
>     Sean> rdma_connect().
>
>     Guy> I added this to both connect and accept calls
>
> I don't think this is a good idea. Let's try to streamline the
> connect call, not add every single possible feature to it.
>
>  - R.

I think it is a good solution for the sync problem that Sean raised,
in the case where we modify the qp inside the abstraction layer. We
can take it out (i.e. getting the path and modifying the qp to init
*before* connect), but I think this will be more complicated for the
consumers (especially the iwarp ones).

I am not saying we *have* to do it. This is just a suggestion.

Guy
Re: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Caitlin Bestler wrote:

> NFS over RDMA does not do that.
>
> Shouldn't that be the end of discussion on abusing CM private data
> unless you are talking *solely* about IB private data. And if that is
> the discussion, should not such a strategy be proposed to IETF
> and/or IBTA for an NFSoRDMA for IB official mapping?

Since this is IB specific, I think it should be addressed in the IBTA.

> The other end of the NFSoRDMA connection is not necessarily
> running OpenIB or even Linux and is not party to any of these
> discussions.
>
> > My resistance is that ATS is just complexity without any benefit. It
> > doesn't provide additional security. It doesn't solve the
> > multi-homing problem we're talking about now. Once you've thrown away
> > information by turning your IP address into an IB GID, there's no
> > magic way ATS can recreate that information and be psychic about which
> > of the multi-homed IPs you actually meant. So why not just put the IP
> > addressing information into the CM private data, the way that the SDP
> > protocol already does?
Re: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Roland Dreier wrote:

>     James> You need to consider what makes sense for *both* ib and
>     James> iwarp. Keep in mind that the correct API will allow a
>     James> consumer to use ib and iwarp devices transparently. In
>     James> other words there will be one code path that supports both.
>
>     James> If we were to adopt your proposal, the consumer would need
>     James> to perform unnecessary operations on iWARP.
>
> No, I think we just need to realize that a perfectly transport neutral
> protocol implementation is not achievable.

It is achievable. Although the IB and iWARP protocols are different,
they can provide the same services to NFS-RDMA.

IB is missing one service that iWARP has, namely that nodes can be
identified with IP addresses. The ATS mechanism provides this
capability for IB networks. If there are better mechanisms that do
the same thing, then NFS-RDMA can use them.

The important thing is not to push this up into the ULPs. The
NFS-RDMA protocol is being standardized in the IETF. There is no
reason to upset that process. If an additional IB specific protocol
is necessary, it should be standardized in the IBTA.

> It's unfortunate that kDAPL fooled people by hiding the details of
> the wire protocol under a supposedly "neutral API," but the fact is
> that mapping an abstract RDMA transport to a real implementation
> will always involve arbitrary transport-dependent choices.

The kDAPL API *is* transport neutral. This has been demonstrated at
several interoperability tests at which the same applications were
run on both IB and iWARP.

kDAPL isn't the only transport neutral networking API. The Sockets
API supports UDP and TCP transports via the same interface.

I believe we are very close to reaching agreement on a transport
neutral RDMA connection API. Comparing your API proposal to the API
that we proposed at the BOF, they are very similar. The most
important similarity is that both use IP addressing.

The only real point of debate is over how to perform the address
translation (IP <-> GID) on IB. I believe we should separate that
from the API discussion.
Re: [openib-general] RDMA connection and address translation API
    Sean> Another possibility could be to add a list of receives to
    Sean> rdma_connect().

    Guy> I added this to both connect and accept calls

I don't think this is a good idea. Let's try to streamline the
connect call, not add every single possible feature to it.

 - R.
RE: [openib-general] RDMA connection and address translation API
The data required when doing a qp-modify-to-rts is inherently
transport specific. IB requires a set of data obtained from the IB CM
protocol (or the equivalent data through application specific black
magic), while iWARP requires a handle for a TCP connection (assumed
to be a socket, but not explicitly required to be so).

The problem is that when the RDMAC specified the iWARP modify qp to
RTS behaviour they did not foresee the non-technical barriers to
simply using a socket handle to specify transfer of ownership of a
TCP connection from one stack to another.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of James Lentini
> Sent: Thursday, August 25, 2005 7:54 AM
> To: Roland Dreier
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
> On Wed, 24 Aug 2005, Roland Dreier wrote:
>
> >     Sean> Is the idea that the user calls connect() and then receives
> >     Sean> a single callback indicating that the connection has been
> >     Sean> established? If so, then the user may need to modify the QP
> >     Sean> to the INIT state, which would require some knowledge
> >     Sean> already of the path. We would also need to be clear on
> >     Sean> whether the QP is expected to be in the INIT state before
> >     Sean> connect is called, or if it could be in any arbitrary state.
> >     Sean> The other alternative is to provide multiple callbacks
> >     Sean> during connection establishment.
> >
> > To me it makes sense for the generic CM API to be defined so that an
> > IB QP must be in the INIT state before being passed to connect().
>
> Will the ib_modify_qp() function be made transport neutral? I
> see some fields in the ib_qp_attr structure that are IB specific.
>
> I think the RDMA connection API should perform all the QP
> state transitions for the ULP. How about a new call to create
> the QP and perform all QP state transitions necessary for
> posting receive work requests?
Re: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Roland Dreier wrote:

>     Sean> Is the idea that the user calls connect() and then receives
>     Sean> a single callback indicating that the connection has been
>     Sean> established? If so, then the user may need to modify the QP
>     Sean> to the INIT state, which would require some knowledge
>     Sean> already of the path. We would also need to be clear on
>     Sean> whether the QP is expected to be in the INIT state before
>     Sean> connect is called, or if it could be in any arbitrary state.
>     Sean> The other alternative is to provide multiple callbacks
>     Sean> during connection establishment.
>
> To me it makes sense for the generic CM API to be defined so that an
> IB QP must be in the INIT state before being passed to connect().

Will the ib_modify_qp() function be made transport neutral? I see
some fields in the ib_qp_attr structure that are IB specific.

I think the RDMA connection API should perform all the QP state
transitions for the ULP. How about a new call to create the QP and
perform all QP state transitions necessary for posting receive work
requests?
RE: [openib-general] RDMA connection and address translation API
Good point. But that's about wire behavior, not what an application
sees.

And yes, the RDMA device must behave as though its IP layer were part
of the host stack. That is a strong argument for standardizing many
of those interactions rather than relying on fully compliant parallel
processing.

-----Original Message-----
From: Christoph Hellwig [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 25, 2005 1:52 AM
To: Caitlin Bestler
Cc: Christoph Hellwig; openib-general@openib.org
Subject: Re: [openib-general] RDMA connection and address translation API

On Wed, Aug 24, 2005 at 02:22:31PM -0700, Caitlin Bestler wrote:
> Not if the host connects two disjoint networks and does not route
> between them. Such a host should/may be configured to reject any
> packet that arrives with a destination address that does not match the
> expected destination address for the port it arrives upon.

While you can configure a Linux system to reject such requests through
a bunch of crude hacks, the default and fully RFC compliant behaviour
is to always reply to ARP requests for any IP address assigned to the
system. RDMA CM implementations must work the same.
RE: [openib-general] RDMA connection and address translation API
On Wed, 2005-08-24 at 18:28 -0700, Sean Hefty wrote:

> Another possibility could be to add a list of receives to rdma_connect().

I added this to both connect and accept calls.

Guy
Re: [openib-general] RDMA connection and address translation API
On Wed, Aug 24, 2005 at 02:22:31PM -0700, Caitlin Bestler wrote:
> Not if the host connects two disjoint networks and does not route
> between them. Such a host should/may be configured to reject any
> packet that arrives with a destination address that does not match
> the expected destination address for the port it arrives upon.

While you can configure a Linux system to reject such requests through
a bunch of crude hacks, the default and fully RFC compliant behaviour
is to always reply to ARP requests for any IP address assigned to the
system. RDMA CM implementations must work the same.
Re: [openib-general] RDMA connection and address translation API
On Wed, Aug 24, 2005 at 02:15:09PM -0700, Roland Dreier wrote:
>     Roland> Well, that's not what I would expect. Suppose I have a
>     Roland> device configured with local addresses 192.168.11.12 and
>     Roland> 192.168.98.99 and I
>
>     Christoph> You never configure a device with local addresses. IP
>     Christoph> addresses are always a per-host attribute in Linux.
>
> I don't think this is really true. In some ways Linux behaves as if
> IP addresses are per-host (eg ARP responses can go out any interface)
> but really IP addresses are attached to an interface. Every struct
> net_device has a struct in_device, and every struct in_device has a
> list of struct in_ifaddrs for the device's IP addresses.

This is correct, but the user-visible effect is what I said above.
When you do an ARP query for any of the IP addresses of a Linux box
you'll get a response even if that interface isn't on the network.

Even if you don't think that's enough, you can formally assign any
number of IP and other networking addresses to a given device, which
renders the notion of an IP address <-> network device relation
rather moot.
RE: [openib-general] RDMA connection and address translation API
>With this in mind, I believe that the connection API needs to be
>something more like the following:
>
>rdma_resolve_address():
>    inputs: dest IP address, qos, npaths,
>            done callback, opaque context
>    done callback params: status, local RDMA device,
>            RDMA transport address, context
...
>rdma_connect():
>    inputs: local QP, RDMA transport address, destination service,
>            private data, timeout, event callback, opaque context

Have we agreed that this is the functionality that we should be
aiming towards?

>rdma_resolve_address(...);
>/* wait for resolution */
>ib_create_qp(...) /* use device pointer we got from rdma_resolve_address() */

We need to insert in here:

ib_modify_qp(...); /* somehow uses address resolution... */
ib_post_recvs(...);

>rdma_connect(...); /* pass transport address we got from rdma_resolve_address() */
>/* wait for connection to finish... */

Another possibility could be to add a list of receives to
rdma_connect().

- Sean
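
The ordering constraint in the flow above (the QP has to reach INIT
and have receives posted before the connect call) can be illustrated
with a small mock in C. This is only a sketch under stated
assumptions: struct mock_qp and the mock_* functions are hypothetical
stand-ins, not the ib_* or rdma_* calls named in the thread, and the
check inside mock_connect() for posted receives exists only to make
the ordering problem visible.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified QP states; a real IB QP has more (SQD, SQE, ERR). */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

/* Hypothetical stand-in for a queue pair. */
struct mock_qp {
    enum qp_state state;
    int posted_recvs;
};

/* RESET -> INIT needs port and pkey index, i.e. resolved address info. */
int mock_modify_qp_init(struct mock_qp *qp, int port, int pkey_index)
{
    assert(qp != NULL);
    (void)port;
    (void)pkey_index;
    if (qp->state != QP_RESET)
        return -1;
    qp->state = QP_INIT;
    return 0;
}

/* Receives cannot be posted while the QP is still in RESET. */
int mock_post_recv(struct mock_qp *qp)
{
    assert(qp != NULL);
    if (qp->state == QP_RESET)
        return -1;
    qp->posted_recvs++;
    return 0;
}

/*
 * Mock connect: insists on an INIT QP with at least one receive posted,
 * then jumps straight to RTS (the RTR step is folded in for brevity).
 */
int mock_connect(struct mock_qp *qp)
{
    assert(qp != NULL);
    if (qp->state != QP_INIT || qp->posted_recvs == 0)
        return -1;
    qp->state = QP_RTS;
    return 0;
}
```

Calling mock_connect() on a freshly created QP fails in this mock,
which is the gap that the extra modify and post-receive steps are
meant to close.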
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 7:29 PM
> To: Yaron Haviv
> Cc: James Lentini; Roland Dreier; openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address translation API
>
> Yaron, has anyone raised all this in the IBTA WG?

I raised it about a year ago, but didn't really follow up on it.
At the time IBTA was also busy with other more urgent stuff (verb ext..).
We work with a few key IBTA members to re-surface it with the need for
an abstract CM.

See the following text that was proposed (a year ago, as is).
It is slightly different than your proposal but can be altered if needed.

It basically uses the SDP header and marks one of the fields with 01
(FlowC) to indicate it's not SDP; this way even SDP can use it.
Also it covers a nice idea raised by MS & SUN to extend SDP to
accept PUT & GET operations for RDMA, so you can get a BSD like API with
a few additional APIs rather than have a totally new API like DAPL.

Establishing TCP/iWarp like connections over InfiniBand
=======================================================

In order to emulate an iWarp connection, it is required to open an
InfiniBand RC connection and associate it with IP addresses and TCP
ports. In addition, protocols may transfer control/login packets before
the migration to the RDMA mode; this requires exchanging receiver buffer
size and depth for initial usage (the ULPs will manage the flow control
for the duration of the connection).

The mapping uses the same data structures already defined for connection
establishment in SDP (IBTA Socket Direct Protocol), which accomplishes
the same goal of mapping TCP Sockets addressing to InfiniBand; the
non-relevant SDP fields are marked Reserved.
iWarp emulation CM Request (Hello) Private Data header 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 04| MID | Rsvd | bufs | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 08| len | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 12| Reserved| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 16| Reserved| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 20| MajVer| MinVer| IPVer | FlowC | Reserved| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 24| DesRemRcvSz | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 28| LocalRcvSz | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 32| Local Port| Reserved| +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 36| Src IP (127-96) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 40| Src IP ( 95-64) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 44| Src IP ( 63-32) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 48| Src IP ( 31-00) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 52| Dst IP (127-96) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 56| Dst IP ( 95-64) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 60| Dst IP ( 63-32) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 64| Dst IP ( 31-00) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 1 CM Hello private data structure iWarp emulation CM Response (HelloReply) Private Data header 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 04| MID | Rsvd | bufs | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RE: [openib-general] RDMA connection and address translation API
> From: James Lentini [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 1:58 PM
>
> On Wed, 24 Aug 2005, Fab Tillier wrote:
>
> > > From: Roland Dreier [mailto:[EMAIL PROTECTED]
> > > Sent: Wednesday, August 24, 2005 11:03 AM
> > >
> > >     Fab> Why can't the IPV field be ignored? If a listen wants only
> > >     Fab> IPV4 addresses, it would specify a 16-byte compare buffer
> > >     Fab> with the first 12 bytes zero, the next 4 filled with the IPV4
> > >     Fab> address, and would set the offset to that of the hello
> > >     Fab> message's destination address (32).
> > >
> > > Yes, you're right for SDP. I guess if we're comfortable mandating
> > > that all protocols put their source and destination IPs in the private
> > > data for the IB case, then this works. Of course it's somewhat
> > > awkward to pass this information into the transport-neutral CM API but
> > > I think this can be worked around.
> >
> > I don't know if we need to mandate IP usage - it's up to the
> > application. Any application that wants to have similar semantics
> > to the way socket listens work (especially when bound to one of
> > multiple IP addresses on a port) would have to define its private
> > data to accommodate this.
> >
> > At the IB level, the contents of the private data are still opaque,
> > even to the CM. The CM would only expose the ability to have it
> > perform an initial triage of requests by doing binary comparisons
> > over regions of private data. It doesn't know (or need to know)
> > what the data represents - it only cares about finding a match (or
> > not). The CM doesn't define any sort of policy here, and I don't
> > think it should. It's just bytes to the CM, and it's doing a blind
> > comparison without interpreting the contents.
>
> You need to consider what makes sense for *both* ib and iwarp. Keep in
> mind that the correct API will allow a consumer to use ib and iwarp
> devices transparently. In other words there will be one code path that
> supports both.

I believe using the private data makes the most sense from the IB
perspective. One could even argue that it is the only way to provide
positive "getpeername" functionality. Use of the IB private data does
not require identical use of private data in other technologies.

> If we were to adopt your proposal, the consumer would need to perform
> unnecessary operations on iWARP.

It doesn't have to impact the client if there's some intermediate
abstraction to isolate the client from the IB CM details (including
private data use).

> A transport neutral client would be forced to put IP information into
> its CM private data on iWARP.
>
> Likewise, a transport neutral server would be forced to pass a
> private data offset and binary blob to the listen API call on iWARP.
>
> Neither of these makes sense.

A higher-level CM abstraction could implement the policy of private
data use when running on IB without the client's involvement. The end
result still is that you end up with a wire protocol that needs to be
documented so that someone without that exact CM abstraction knows
where and how to format the private data as well as how to interpret
it.

If the IBTA defines something like this, all these issues go away. I
don't know if the IBTA can define this without affecting existing
protocols like SDP and iSER that already define how to encapsulate
the source and destination information in the private data.

Using the private data, either by the client or some IB-specific CM
abstraction, will remove the need for any reverse lookups. A forward
lookup to validate the incoming source GID to the source IP in the
private data can validate the IP address. Performing a forward lookup
via ARP is going to be a lot faster than ATS if the ARP entry already
exists. On large fabrics, ARP is also going to scale better since
there's not one single entity responsible for responding to every
node's requests.

> These API problems are secondary to the burden you would be placing on
> the protocols. As has been mentioned in a previous email, extending
> the current protocols to use this convention will require further
> standardization and in some cases may not be compatible with their
> current architecture.

I think biting the bullet now on establishing these standards for
applications using IP addressing over IB, whether in the IBTA or in
each application, is going to give us the best long term result.

- Fab
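
The blind comparison described in this message (a listener hands the
CM an offset and a byte pattern, and the CM matches incoming private
data without interpreting it) can be sketched in C as follows. This
is an illustration only: struct pdata_filter and both functions are
hypothetical names, not part of any existing CM interface, and the
16-byte slot with a zero-padded IPv4 address follows the layout
quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical listen filter: compare `len` bytes starting at `offset`. */
struct pdata_filter {
    size_t  offset;       /* where in the private data to compare */
    size_t  len;          /* how many bytes to compare */
    uint8_t pattern[16];  /* bytes the listener expects to see */
};

/* Returns 1 if the private data matches the filter, 0 otherwise. */
int pdata_filter_match(const struct pdata_filter *f,
                       const uint8_t *pdata, size_t pdata_len)
{
    assert(f != NULL && pdata != NULL);
    if (f->len > sizeof(f->pattern))
        return 0;
    /* Reject filters that would read past the end of the private data. */
    if (f->offset > pdata_len || f->len > pdata_len - f->offset)
        return 0;
    return memcmp(pdata + f->offset, f->pattern, f->len) == 0;
}

/*
 * Build a filter for an IPv4 destination held in a 16-byte slot:
 * first 12 bytes zero, last 4 bytes the address.
 */
void pdata_filter_ipv4(struct pdata_filter *f, size_t offset,
                       const uint8_t ipv4[4])
{
    assert(f != NULL && ipv4 != NULL);
    memset(f, 0, sizeof(*f));
    f->offset = offset;
    f->len = 16;
    memcpy(f->pattern + 12, ipv4, 4);
}
```

The matcher never looks at what the bytes mean, so the same routine
would serve a listener that filters on something other than an IP
address.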
Re: [openib-general] RDMA connection and address translation API
    Yaron> The current implementation may not use the private data
    Yaron> field (since it's not critical/mandatory) but the intention
    Yaron> is to add it to address multi homed hosts, we would like to
    Yaron> push such a definition into IBTA so every IP oriented ULP
    Yaron> can use it, several people expressed interest in such a
    Yaron> definition, this can also support NFS/RDMA or any other IP
    Yaron> based ULP.

Strange as it may seem, I agree completely with Yaron ;)

It would make perfect sense to take a couple of the reserved bits in
the CM REQ format and turn them into an "IP address present" field (a
couple of bits so we can distinguish between v4 and v6). When this
field is set, then the first (or last, or whatever) 32 bytes of the
private data would hold the source and destination IP address.

Having this standardized also gives us the ability to deal with the
concerns around connections initiated in userspace. The kernel proxy
for the user CM can make sure that any REQs sent with the "IP address
present" field set actually have an IP assigned to the local system.
Remote systems would still need to treat CM messages from QPs other
than QP 1 as untrusted.

Of course for real security some stronger authentication is needed in
any case (even in the iWARP case the source IP can't be trusted; an
attacker could DOS the real owner of the IP, flood the switches' MAC
tables so it becomes a hub, and then take over any IP it wants).

The only unfortunate thing about all this is that the SDP Hello
message format is already frozen, and it seems a little too
specialized for generic use (eg we don't want a "Max Zcopy
Advertisements" field).

Yaron, has anyone raised all this in the IBTA WG?

 - R.
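
A rough C sketch of the "IP address present" idea follows. Everything
in it is an assumption made for illustration: the enum values, the
choice of the first 32 bytes of private data, and the right-aligned,
zero-padded IPv4 layout are hypothetical, since the message
deliberately leaves those details open ("the first (or last, or
whatever)").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IP_PDATA_LEN 32  /* 16 bytes source + 16 bytes destination */

/* Hypothetical encoding of the two "IP address present" bits. */
enum ip_present { IP_ABSENT = 0, IP_PRESENT_V4 = 1, IP_PRESENT_V6 = 2 };

static size_t ip_addr_len(enum ip_present kind)
{
    return kind == IP_PRESENT_V4 ? 4 : kind == IP_PRESENT_V6 ? 16 : 0;
}

/* Write src/dst into the first 32 bytes of private data; 0 on success. */
int ip_pdata_pack(uint8_t *pdata, size_t pdata_len, enum ip_present kind,
                  const uint8_t *src, const uint8_t *dst)
{
    size_t alen = ip_addr_len(kind);

    assert(pdata != NULL && src != NULL && dst != NULL);
    if (alen == 0 || pdata_len < IP_PDATA_LEN)
        return -1;
    memset(pdata, 0, IP_PDATA_LEN);
    memcpy(pdata + 16 - alen, src, alen);  /* right-aligned in slot 0 */
    memcpy(pdata + 32 - alen, dst, alen);  /* right-aligned in slot 1 */
    return 0;
}

/* Read src/dst back out, given the kind carried in the REQ bits. */
int ip_pdata_unpack(const uint8_t *pdata, size_t pdata_len,
                    enum ip_present kind, uint8_t *src, uint8_t *dst)
{
    size_t alen = ip_addr_len(kind);

    assert(pdata != NULL && src != NULL && dst != NULL);
    if (alen == 0 || pdata_len < IP_PDATA_LEN)
        return -1;
    memcpy(src, pdata + 16 - alen, alen);
    memcpy(dst, pdata + 32 - alen, alen);
    return 0;
}
```

A kernel proxy of the kind mentioned above could unpack the source
slot from an outgoing REQ and compare it against the local address
list before letting the message go out.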
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: James Lentini [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 5:51 PM
> To: Yaron Haviv
> Cc: Roland Dreier; openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address translation API
>
> Which draft contains this? I found
>
> http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-04.txt

James,

You should look at:
http://www.haifa.il.ibm.com/satran/ips/draft-ietf-ips-iser-05-candidate.txt

The 05 rev really adds all the InfiniBand related stuff.
You can see how the association between IB & IP is done using IPoIB.

The current implementation may not use the private data field (since
it's not critical/mandatory), but the intention is to add it to address
multi homed hosts. We would like to push such a definition into IBTA so
every IP oriented ULP can use it. Several people expressed interest in
such a definition; this can also support NFS/RDMA or any other IP based
ULP.

> but the HELLO header in section 9.3 does not contain any IP address
> information.
>
> > I believe it can be a good idea to use the same approach for
> > NFS/RDMA and eliminate the need for reverse ATS lookup (there may be
> > some conflicts when multiple IPs exist per node). We may just use
> > the SDP hello header as is with unused fields zeroed. This will allow
> > all ULPs to use the same mechanism.
>
> NFS/RDMA is not specific to iWARP or InfiniBand. My understanding is
> that this could not be easily accommodated in the current standards
> for that reason.

Not sure why that is the case. If we add an IBTA definition of a CM
exchange for IP based ULPs (i.e. send src/dst IP and optionally ports),
you can now have an NFS/RDMA spec that doesn't need to have any
IB/iWarp specific definitions, since the differences are pushed down to
the IBTA.

In the case of NFS/RDMA over another (non IB or iWarp) transport you
can specify that providing the IP addressing is a responsibility of the
underlying transport.

Yaron
RE: [openib-general] RDMA connection and address translation API
> -Original Message- > From: Roland Dreier [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 4:03 PM > To: Tom Tucker > Cc: Sean Hefty; Roland Dreier; openib-general@openib.org > Subject: Re: [openib-general] RDMA connection and address > translation API > > Tom> The issue is that this connection will be established when > Tom> the server may only want to accept requests that are > Tom> targetted to the 10.10.1.1 address. I don't get why this is > Tom> such a big deal. You can preclude this behavior by simply > Tom> keeping a one to one mapping between the IPv4 addresses and > Tom> the GIDs using the existing protocols and without mandating a > Tom> private data format across *all* ulps and transports. > > Well, a few problems with what you say: > > - ATS does not help at all with the case of a multi-homed interface. >Unless the remote system puts the IP it's trying to connect to >somewhere in the connection request, there is no way to be psychic >and recover this information. I thought a single HCA could have multiple GIDs. All I'm advocating is that a "correct" multi-homed configuration has a one-to-one mapping between it's IP addresses and it's GIDS. > > - Mandating ATS use is dictating protocol design just as much as >requiring the CM private data to carry source and destination IP >addresses. I think ATS dictates the kinds of authentication that can be done by the server over an IB transport, but not the protocol design. Certainly the private data can have additional authentication data (which I think is what you're advocating). > > - It's not just preventing connections to the wrong local address. >NFS-RDMA wants the remote source address (ie getpeername()) so that >it can look it up in the exports list. Agreed. But you could also get rid of ATS by allowing GIDs to be specified in the exports file and then treating them like IPv6 addresses for the purpose of subnet comparisons. 
> > - Saying that a given GID may only have a single IP address is >definitely a case of the cure being worse than the disease. I >don't think we can forbid perfectly valid multi-homed >configurations just because it's inconvenient for us to > support them. I think our different perspectives come from what we consider to be "perfectly valid multi-homed configurations". One approach advocates overloading private data, the other advocates overloading address assignments. My approach suffers from the fact that multiple IP addresses for the same GID are just aliases that are interchangeable and at the remote end indistinguishable. The private data approach suffers from the need to mandate private data formats across all ulps and transports. I prefer the former limitation/cost. > > By the way, as far as I can tell, there is NO formal > documentation of the NFS-RDMA wire protocol. The current > draft (draft-ietf-nfsv4-rpcrdma-01.txt) simply says: > > This protocol is designed to function with equivalent semantics > over all appropriate RDMA transports. In its abstract form, this > protocol does not implement RDMA directly. [...] It therefore > becomes a useful, implementable standard when mapped onto a > specific RDMA transport, such as iWARP [RDDP] or Infiniband [IB]. > > [...] > > In setting up a new RDMA connection, the first action by an RPC > client will be to obtain a transport address for the server. The > mechanism used to obtain this address, and to open an RDMA > connection is dependent on the type of RDMA transport, > and outside > the scope of this protocol. > > So it seems perfectly reasonable and acceptable for the > mapping of NFS-RDMA onto IB to specify that the source and > destination IP addresses for an IB connection are placed in > the CM private data. > This seems much easier than trying to turn ATS into an IETF standard. > > - R. > I think there is a way to get rid of ATS as I described above without overloading the private data. 
Phew -- I'm exhausted. I'm going to go write code ;-) ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
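
Editorial sketch: Tom's suggestion of treating GIDs like IPv6 addresses for subnet comparisons in the exports file comes down to a prefix match over 128-bit values. The thread does not define any exports-file syntax or helper for this, so the function name `addr128_in_subnet` and its parameters below are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not an existing kernel or NFS function):
 * returns 1 if the first prefix_len bits of addr equal those of net.
 * Both values are 16 bytes, so a GID and an IPv6 address can be
 * compared the same way.
 */
static int addr128_in_subnet(const uint8_t addr[16], const uint8_t net[16],
			     unsigned prefix_len)
{
	unsigned full, rem;
	uint8_t mask;

	assert(prefix_len <= 128);
	full = prefix_len / 8;	/* whole bytes covered by the prefix */
	rem = prefix_len % 8;	/* leftover bits in the next byte */

	if (memcmp(addr, net, full) != 0)
		return 0;
	if (rem == 0)
		return 1;	/* also covers prefix_len == 128 */

	mask = (uint8_t)(0xff << (8 - rem));
	return (addr[full] & mask) == (net[full] & mask);
}
```

An exports entry written as a GID prefix plus a length could then be checked against the peer's GID with a single call. Whether that is acceptable to NFS administrators is a separate question that the thread leaves open.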
Re: [openib-general] RDMA connection and address translation API
    Roland> No, I think we just need to realize that a perfectly
    Roland> transport neutral protocol implementation is not
    Roland> achievable. It's unfortunate that kDAPL fooled people by
    Roland> hiding the details of the wire protocol under a supposedly
    Roland> "neutral API," but the fact is that mapping an abstract
    Roland> RDMA transport to a real implementation will always
    Roland> involve arbitrary transport-dependent choices.

Further: if we would be willing to say that transport-neutral
protocols must use a "kDAPL wire protocol," then there's no problem in
defining that wire protocol to put the source and destination IP
address somewhere in the CM private data. The current "kDAPL wire
protocol" happens to use ATS to try and achieve this (although it
doesn't handle the multi-homed case), but that is no more and no less
of an arbitrary protocol design choice.

So in a nutshell, my objection to using ATS is that it is an arbitrary
design choice that doesn't work as well as other equally valid
choices.

 - R.
Re: [openib-general] RDMA connection and address translation API
    James> NFS/RDMA is not specific to iWARP or InfiniBand. My
    James> understanding is that this could not be easily accommodated
    James> in the current standards for that reason.

Yes, it seems that there will need to be some additional NFS/RDMA
drafts describing the iWARP and IB wire protocols before the standard
is complete.

 - R.
RE: [openib-general] RDMA connection and address translation API
The requirement is solely that the System Administrators for each
host directly attached to Network X agree on the basic addressing
characteristics for Network X. This onerous challenge is successfully
overcome on every IP subnet in the world every day for such details as
what the subnet is, what the mask is, etc. Further, two adjoining
subnets won't be able to talk unless their administrators have
arranged for them to agree on what their network identifiers are, etc.

For the specific question it is even less of a problem than theory
suggests. A rule such as "non-IPv4 subnets are directly translated
while IPv4 subnets use IPv4" is actually quite simple to implement.
That could even be extended to allow *some* IPv6 subnets to be
translated so that multiple IPv6 aliases for a single GID could be
identified (that is, if anyone has a need for such a thing).

-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 24, 2005 2:45 PM
To: Caitlin Bestler
Cc: Roland Dreier; James Lentini; openib-general@openib.org
Subject: Re: [openib-general] RDMA connection and address translation API

    Caitlin> So with this wealth of options available, do you agree
    Caitlin> that there is no reason to elevate any of these issues to
    Caitlin> being visible to a transport neutral application?

No -- the fact that there are a wealth of options actually means that
picking one is an arbitrary choice we impose on transport neutral
implementations and is de facto mandating a wire protocol.

 - R.
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Yaron Haviv wrote:

> > On Tue, 23 Aug 2005, Roland Dreier wrote:
> >
> > > It would be possible to have another function like
> > > rdma_getpeername() that takes the transport address and returns
> > > a source IP address. In the IB case this would do an ATS
> > > reverse lookup. However, I hate this idea. iSER already uses
> > > the CM private data to pass the source IP in the IB case,
> >
> > I know this is how IB SDP works, but I don't think iSER works this
> > way.
> >
> > The code in the tree calls dat_ep_connect() with a NULL private
> > data pointer.
> >
> > There is an iSER HELLO message described in iser_header.h that
> > contains IP addresses, but I'm not certain that this is part of
> > the current protocol (ISER_HELLO_LEN and ISER_HELLO_REPLY_LEN are
> > unused).
>
> James,
>
> iSER doesn't mandate the source IP in general since it's doing a much
> stronger authentication during Login.
> However we believe using a similar header to SDP can help the Passive
> side
> a. know which destination IP was targeted (in a multi-homed
>    environment)
> b. for some implementations that want to validate the source for some
>    reason
>
> that's why the draft suggested adding the source/dst IP in the private
> data just like SDP does,

Which draft contains this? I found

http://www.ietf.org/internet-drafts/draft-ietf-ips-iser-04.txt

but the HELLO header in section 9.3 does not contain any IP address
information.

> I believe it can be a good idea to use the same approach for
> NFS/RDMA and eliminate the need for reverse ATS lookup (there may be
> some conflicts when multiple IPs exist per node). We may just use
> the SDP hello header as is with unused fields zeroed. This will allow
> all ULPs to use the same mechanism.

NFS/RDMA is not specific to iWARP or InfiniBand. My understanding is
that this could not be easily accommodated in the current standards
for that reason.
Re: [openib-general] RDMA connection and address translation API
    Caitlin> So with this wealth of options available, do you agree
    Caitlin> that there is no reason to elevate any of these issues to
    Caitlin> being visible to a transport neutral application?

No -- the fact that there are a wealth of options actually means that
picking one is an arbitrary choice we impose on transport neutral
implementations and is de facto mandating a wire protocol.

 - R.
RE: [openib-general] RDMA connection and address translation API
I think it would be more accurate to state that DAPL requires the
128-bit "IA Address space" to be administratively subdivided so that
each "subnet" unambiguously translates to a specific IA-reached
network, and that translation of the "IA Address" into and from that
network's wire protocol is not visible to the DAT Consumer.

ATS is indeed *one* solution for doing so. Adding RARP to IPoIB would
make for another solution. Direct translation is also a valid solution
for IPv6 compatible network IDs.

So with this wealth of options available, do you agree that there is
no reason to elevate any of these issues to being visible to a
transport neutral application?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
Sent: Wednesday, August 24, 2005 2:31 PM
To: James Lentini
Cc: openib-general@openib.org
Subject: Re: [openib-general] RDMA connection and address translation API

    Roland> No, I think we just need to realize that a perfectly
    Roland> transport neutral protocol implementation is not
    Roland> achievable. It's unfortunate that kDAPL fooled people by
    Roland> hiding the details of the wire protocol under a supposedly
    Roland> "neutral API," but the fact is that mapping an abstract
    Roland> RDMA transport to a real implementation will always
    Roland> involve arbitrary transport-dependent choices.

Further: if we would be willing to say that transport-neutral
protocols must use a "kDAPL wire protocol," then there's no problem in
defining that wire protocol to put the source and destination IP
address somewhere in the CM private data. The current "kDAPL wire
protocol" happens to use ATS to try and achieve this (although it
doesn't handle the multi-homed case), but that is no more and no less
of an arbitrary protocol design choice.

So in a nutshell, my objection to using ATS is that it is an arbitrary
design choice that doesn't work as well as other equally valid
choices.

 - R.
RE: [openib-general] RDMA connection and address translation API
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
Sent: Wednesday, August 24, 2005 2:03 PM
To: Tom Tucker
Cc: openib-general@openib.org
Subject: Re: [openib-general] RDMA connection and address translation API

By the way, as far as I can tell, there is NO formal documentation of
the NFS-RDMA wire protocol. The current draft
(draft-ietf-nfsv4-rpcrdma-01.txt) simply says:

    This protocol is designed to function with equivalent semantics
    over all appropriate RDMA transports. In its abstract form, this
    protocol does not implement RDMA directly. [...] It therefore
    becomes a useful, implementable standard when mapped onto a
    specific RDMA transport, such as iWARP [RDDP] or Infiniband [IB].

    [...]

    In setting up a new RDMA connection, the first action by an RPC
    client will be to obtain a transport address for the server. The
    mechanism used to obtain this address, and to open an RDMA
    connection is dependent on the type of RDMA transport, and outside
    the scope of this protocol.

So it seems perfectly reasonable and acceptable for the mapping of
NFS-RDMA onto IB to specify that the source and destination IP
addresses for an IB connection are placed in the CM private data.
This seems much easier than trying to turn ATS into an IETF standard.

 - R.

NFS over RDMA was intended to be implemented using DAPL in a
transport-neutral way.

Now having the transport layer *add* data before the private data is
legitimate for any specific transport. It would just have to be
defined independently of openib and linux.

Basically, any solution that allows NFS over RDMA to be coded with the
*same* set of kDAPL calls to listen/connect/accept/reject would be
compliant with the intent -- as long as the mapping to wire protocols
was straightforward and allowed non-kDAPL implementations.

For example, mapping the DAPL private data to the IETF MPA
Request/Reply frame Private Data certainly qualifies as
"straightforward".
Re: [openib-general] RDMA connection and address translation API
    James> You need to consider what makes sense for *both* ib and
    James> iwarp. Keep in mind that the correct API will allow a
    James> consumer to use ib and iwarp devices transparently. In
    James> other words there will be one code path that supports both.

    James> If we were to adopt your proposal, the consumer would need
    James> to perform unnecessary operations on iWARP.

No, I think we just need to realize that a perfectly transport neutral
protocol implementation is not achievable. It's unfortunate that
kDAPL fooled people by hiding the details of the wire protocol under a
supposedly "neutral API," but the fact is that mapping an abstract
RDMA transport to a real implementation will always involve arbitrary
transport-dependent choices.

To use an analogy, the IP layer is mostly insulated from the details
of the L2 transport it's using by the net_device abstraction.
However, there are a few things that require code like:

	int arp_mc_map(u32 addr, u8 *haddr, struct net_device *dev, int dir)
	{
		switch (dev->type) {
		case ARPHRD_ETHER:
		case ARPHRD_FDDI:
		case ARPHRD_IEEE802:
			ip_eth_mc_map(addr, haddr);
			return 0;
		case ARPHRD_IEEE802_TR:
			ip_tr_mc_map(addr, haddr);
			return 0;
		case ARPHRD_INFINIBAND:
			ip_ib_mc_map(addr, haddr);
			return 0;
		default:
			if (dir) {
				memcpy(haddr, dev->broadcast, dev->addr_len);
				return 0;
			}
		}
		return -EINVAL;
	}

 - R.
RE: [openib-general] RDMA connection and address translation API
Not if the host connects two disjoint networks and does not route
between them. Such a host should/may be configured to reject any
packet that arrives with a destination address that does not match
the expected destination address for the port it arrives upon.

One of the things that iWARP vendors strive for is to ensure that all
such existing filtering/safety rules on accepting connections are left
100% intact.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Christoph Hellwig
Sent: Wednesday, August 24, 2005 2:00 PM
To: Caitlin Bestler
Cc: openib-general@openib.org
Subject: Re: [openib-general] RDMA connection and address translation API

On Wed, Aug 24, 2005 at 11:14:08AM -0700, Caitlin Bestler wrote:
> The consensus when this issue was debated in the DAT Collaborative was
> that there was no transport neutral way to specify a set of addresses
> to listen on other than "all addresses supported by this device".

That doesn't make any sense at all for iWarp as that uses IP
addressing which in Linux is host-, not device-based.
Re: [openib-general] RDMA connection and address translation API
    Roland> Well, that's not what I would expect. Suppose I have a
    Roland> device configured with local addresses 192.168.11.12 and
    Roland> 192.168.98.99 and I

    Christoph> You never configure a device with local addresses. IP
    Christoph> addresses are always a per-host attribute in Linux.

I don't think this is really true. In some ways Linux behaves as if
IP addresses are per-host (eg ARP responses can go out any interface)
but really IP addresses are attached to an interface. Every struct
net_device has a struct in_device, and every struct in_device has a
list of struct in_ifaddrs for the device's IP addresses.

 - R.
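
Editorial sketch: the layout Roland describes (a net_device pointing at an in_device that owns a list of interface addresses) can be modelled in a few lines of user-space C. The `model_*` types and `dev_has_addr()` below are simplified stand-ins invented for illustration; they are not the kernel's actual structure definitions, and addresses are plain integers to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one interface address entry. */
struct model_ifaddr {
	uint32_t addr;			/* IPv4 address as an integer */
	struct model_ifaddr *next;	/* next address on the same device */
};

/* Simplified stand-in for the per-device IPv4 state. */
struct model_in_device {
	struct model_ifaddr *ifa_list;
};

/* Simplified stand-in for a network device. */
struct model_net_device {
	const char *name;
	struct model_in_device *ip;	/* NULL if IPv4 is not configured */
};

/* Returns 1 if addr is assigned to this particular device, else 0. */
static int dev_has_addr(const struct model_net_device *dev, uint32_t addr)
{
	const struct model_ifaddr *ifa;

	assert(dev != NULL);
	if (dev->ip == NULL)
		return 0;
	for (ifa = dev->ip->ifa_list; ifa != NULL; ifa = ifa->next)
		if (ifa->addr == addr)
			return 1;
	return 0;
}
```

With a per-device list like this, a listen bound to 192.168.11.12 can be resolved to the one device that carries that address, which is the behaviour Roland says he would expect.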
Re: [openib-general] RDMA connection and address translation API
    Tom> The issue is that this connection will be established when
    Tom> the server may only want to accept requests that are
    Tom> targeted to the 10.10.1.1 address. I don't get why this is
    Tom> such a big deal. You can preclude this behavior by simply
    Tom> keeping a one to one mapping between the IPv4 addresses and
    Tom> the GIDs using the existing protocols and without mandating a
    Tom> private data format across *all* ulps and transports.

Well, a few problems with what you say:

 - ATS does not help at all with the case of a multi-homed interface.
   Unless the remote system puts the IP it's trying to connect to
   somewhere in the connection request, there is no way to be psychic
   and recover this information.

 - Mandating ATS use is dictating protocol design just as much as
   requiring the CM private data to carry source and destination IP
   addresses.

 - It's not just preventing connections to the wrong local address.
   NFS-RDMA wants the remote source address (ie getpeername()) so that
   it can look it up in the exports list.

 - Saying that a given GID may only have a single IP address is
   definitely a case of the cure being worse than the disease. I
   don't think we can forbid perfectly valid multi-homed
   configurations just because it's inconvenient for us to support
   them.

By the way, as far as I can tell, there is NO formal documentation of
the NFS-RDMA wire protocol. The current draft
(draft-ietf-nfsv4-rpcrdma-01.txt) simply says:

    This protocol is designed to function with equivalent semantics
    over all appropriate RDMA transports. In its abstract form, this
    protocol does not implement RDMA directly. [...] It therefore
    becomes a useful, implementable standard when mapped onto a
    specific RDMA transport, such as iWARP [RDDP] or Infiniband [IB].

    [...]

    In setting up a new RDMA connection, the first action by an RPC
    client will be to obtain a transport address for the server. The
    mechanism used to obtain this address, and to open an RDMA
    connection is dependent on the type of RDMA transport, and outside
    the scope of this protocol.

So it seems perfectly reasonable and acceptable for the mapping of
NFS-RDMA onto IB to specify that the source and destination IP
addresses for an IB connection are placed in the CM private data.
This seems much easier than trying to turn ATS into an IETF standard.

 - R.
Re: [openib-general] RDMA connection and address translation API
On Wed, Aug 24, 2005 at 11:14:08AM -0700, Caitlin Bestler wrote:
> The consensus when this issue was debated in the DAT Collaborative was
> that there was no transport neutral way to specify a set of addresses
> to listen on other than "all addresses supported by this device".

That doesn't make any sense at all for iWarp as that uses IP
addressing which in Linux is host-, not device-based.
Re: [openib-general] RDMA connection and address translation API
On Wed, Aug 24, 2005 at 09:26:42AM -0700, Roland Dreier wrote:
>     Tom> I think I understand, but the purpose of specifying the IP
>     Tom> address in the listen is not to filter incoming connect
>     Tom> requests, but rather to determine which devices I listen
>     Tom> on. I think this works for the IB case as well. So the
>     Tom> utility of the IP address specified in the listen is only to
>     Tom> determine which devices the sid is created on. Does this make
>     Tom> sense or am I missing something?
>
> Well, that's not what I would expect. Suppose I have a device
> configured with local addresses 192.168.11.12 and 192.168.98.99 and I

You never configure a device with local addresses. IP addresses are
always a per-host attribute in Linux.
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Fab Tillier wrote:

> > From: Roland Dreier [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, August 24, 2005 11:03 AM
> >
> >     Fab> Why can't the IPV field be ignored? If a listen wants only
> >     Fab> IPV4 addresses, it would specify a 16-byte compare buffer
> >     Fab> with the first 12 bytes zero, the next 4 filled with the IPV4
> >     Fab> address, and would set the offset to that of the hello
> >     Fab> message's destination address (32).
> >
> > Yes, you're right for SDP. I guess if we're comfortable mandating
> > that all protocols put their source and destination IPs in the private
> > data for the IB case, then this works. Of course it's somewhat
> > awkward to pass this information into the transport-neutral CM API but
> > I think this can be worked around.
>
> I don't know if we need to mandate IP usage - it's up to the
> application. Any application that wants to have similar semantics
> to the way socket listens work (especially when bound to one of
> multiple IP addresses on a port) the application would have to
> define its private data to accommodate this.
>
> At the IB level, the contents of the private data are still opaque,
> even to the CM. The CM would only expose the ability to have it
> perform an initial triage of requests by doing binary comparisons
> over regions of private data. It doesn't know (or need to know)
> what the data represents - it only cares about finding a match (or
> not). The CM doesn't define any sort of policy here, and I don't
> think it should. It's just bytes to the CM, and it's doing a blind
> comparison without interpreting the contents.

You need to consider what makes sense for *both* ib and iwarp. Keep in
mind that the correct API will allow a consumer to use ib and iwarp
devices transparently. In other words there will be one code path
that supports both.

If we were to adopt your proposal, the consumer would need to perform
unnecessary operations on iWARP. A transport neutral client would be
forced to put IP information into its CM private data on iWARP.
Likewise, a transport neutral server would be forced to pass a private
data offset and binary blob to the listen API call on iWARP. Neither
of these makes sense.

These API problems are secondary to the burden you would be placing on
the protocols. As has been mentioned in a previous email, extending
the current protocols to use this convention will require further
standardization and in some cases may not be compatible with their
current architecture.
RE: [openib-general] RDMA connection and address translation API
So the listening server takes the IP address from the private data,
uses AT to get the GID and then compares it to the GID in the connect
request? It feels to me like this private data thing is a case of the
cure being worse than the disease.

As I understand it, we're trying to avoid the following:

server:
	dev = ib_get_device(10.10.1.1 /* src ip */, 0 /* dest ip */);
	/* GID has IP addresses 10.10.1.1, 10.10.1.2 */
	ib_listen(dev, 10.10.1.1 /* listen bind address */,
		  143 /* port */, 10 /* backlog */);

client:
	dev = ib_get_device(0 /* src wildcard */, 10.10.1.2 /* dest ip */);
	ib_connect(dev, 0 /* src */, 10.10.1.2 /* dest */, 143 /* port */, ...);

The issue is that this connection will be established when the server
may only want to accept requests that are targeted to the 10.10.1.1
address. I don't get why this is such a big deal. You can preclude
this behavior by simply keeping a one to one mapping between the IPv4
addresses and the GIDs using the existing protocols and without
mandating a private data format across *all* ulps and transports.

If I'm being painfully stupid...please feel free to tell me.

> -----Original Message-----
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 2:12 PM
> To: Tom Tucker; Roland Dreier
> Cc: openib-general@openib.org
> Subject: RE: [openib-general] RDMA connection and address
> translation API
>
> >Because it would be better to configure your network "properly".
> >Putting IP addresses in private data is fundamentally insecure since
> >any user mode client can spoof the IP address.
>
> A simple forward lookup could detect this.
>
> - Sean
Re: [openib-general] RDMA connection and address translation API
    Tom> Isn't this inevitable regardless of whether or not we have a
    Tom> transport independent connection API. I thought ATS was
    Tom> required by NFS for authentication/authorization. Sorry in
    Tom> advance if I'm confused --- again.

Current NFS-RDMA code uses and relies on ATS. However I hope that we
can fix the NFS-RDMA draft to get rid of this.

 - R.
RE: [openib-general] RDMA connection and address translation API
Isn't this inevitable regardless of whether or not we have a transport
independent connection API? I thought ATS was required by NFS for
authentication/authorization. Sorry in advance if I'm confused ---
again.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 3:27 PM
> To: James Lentini
> Cc: Caitlin Bestler; openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
>     James> I agree with Caitlin. The eventual solution cannot force
>     James> protocol modifications in ULPs.
>
> Does this mean we're stuck with the current use of ATS in NFS-RDMA?
> Surely there's still time to fix the protocol.
>
>  - R.
Re: [openib-general] RDMA connection and address translation API
    James> I agree with Caitlin. The eventual solution cannot force
    James> protocol modifications in ULPs.

Does this mean we're stuck with the current use of ATS in NFS-RDMA?
Surely there's still time to fix the protocol.

 - R.
Re: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Caitlin Bestler wrote:

> On 8/24/05, Fab Tillier <[EMAIL PROTECTED]> wrote:
> >
> > I think if all ULPs provide their source and destination IP in the
> > private data, you can eliminate the reverse lookup altogether. A
> > simple forward lookup is all that's needed to validate that the
> > source GID in the REQ matches the reported source IP in the
> > private data. The forward lookup could be done via ATS or via
> > ARP, but the CM doesn't need to care which method is used.
>
> That is not an option.
>
> The applications are expecting source/destination network addresses
> that come from a network layer, not from the peer application. IP
> has no problem meeting this requirement. This is an IB problem that
> needs to be solved within the scope of IB without changing any ULPs.

I agree with Caitlin. The eventual solution cannot force protocol
modifications in ULPs.
Re: [openib-general] RDMA connection and address translation API
    Sean> Is the idea that the user calls connect() and then receives
    Sean> a single callback indicating that the connection has been
    Sean> established? If so, then the user may need to modify the QP
    Sean> to the INIT state, which would require some knowledge
    Sean> already of the path. We would also need to be clear on
    Sean> whether the QP is expected to be in the INIT state before
    Sean> connect is called, or if it could be in any arbitrary state.
    Sean> The other alternative is to provide multiple callbacks
    Sean> during connection establishment.

To me it makes sense for the generic CM API to be defined so that an
IB QP must be in the INIT state before being passed to connect().

 - R.
RE: [openib-general] RDMA connection and address translation API
>If the connect call succeeds in establishing a connection, the ULP's
>QP should be ready for posting work requests. This simplifies the ULP
>considerably.
>
>The API should not create the QP. That would create race conditions
>for certain protocols. For example, consider a protocol in which the
>first message was a send from the server to the client. To properly
>implement such a protocol, the client must post a receive work request
>before initiating a connection.

Thanks for the clarification. This is similar to what I was thinking
as well.

I guess we should note that in order to post receives to the QP, it at
least needs to be in the INIT state. Would this be done by the CM
abstraction or the user? For IB, the following fields need to be set
when transitioning to INIT: enable RDMA, PKey index, and physical
port.

Is the idea that the user calls connect() and then receives a single
callback indicating that the connection has been established? If so,
then the user may need to modify the QP to the INIT state, which would
require some knowledge already of the path. We would also need to be
clear on whether the QP is expected to be in the INIT state before
connect is called, or if it could be in any arbitrary state. The
other alternative is to provide multiple callbacks during connection
establishment.

- Sean
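
Editorial sketch: the ordering constraint discussed here (move the QP to INIT with a port, PKey index and RDMA access setting, post receives, and only then connect) can be captured in a small state model. Everything below, including `model_qp`, `model_qp_to_init()`, `model_post_recv()` and `model_connect()`, is a hypothetical user-space model written for illustration. It is not the verbs API or the CM API being designed in this thread, and the INIT to RTR to RTS steps are collapsed into one.

```c
#include <assert.h>
#include <stdint.h>

enum qp_state { QP_RESET, QP_INIT, QP_RTS };

/* Hypothetical model of the QP attributes named in the thread. */
struct model_qp {
	enum qp_state state;
	uint8_t port;		/* physical port, numbered from 1 */
	uint16_t pkey_index;
	int rdma_enabled;
	unsigned posted_recvs;
};

/* RESET -> INIT needs the port, the PKey index and the RDMA setting. */
static int model_qp_to_init(struct model_qp *qp, uint8_t port,
			    uint16_t pkey_index, int rdma_enabled)
{
	assert(qp);
	if (qp->state != QP_RESET || port == 0)
		return -1;
	qp->port = port;
	qp->pkey_index = pkey_index;
	qp->rdma_enabled = rdma_enabled;
	qp->state = QP_INIT;
	return 0;
}

/* Receives may only be posted once the QP has left RESET. */
static int model_post_recv(struct model_qp *qp)
{
	assert(qp);
	if (qp->state == QP_RESET)
		return -1;
	qp->posted_recvs++;
	return 0;
}

/* The model's connect() insists on a QP that is already in INIT. */
static int model_connect(struct model_qp *qp)
{
	assert(qp);
	if (qp->state != QP_INIT)
		return -1;
	qp->state = QP_RTS;	/* intermediate transitions collapsed */
	return 0;
}
```

The model takes the option in which the caller, not connect(), performs the INIT transition. That is why the port and PKey index must be known before connect() is called, which is the path-knowledge concern Sean raises.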
RE: [openib-general] RDMA connection and address translation API
On Wed, 24 Aug 2005, Sean Hefty wrote:

> I guess that I'd like to clarify what the operation of a connect
> call would do. Would it be responsible for modifying the QP? If
> so, could such a call also allocate the QP? Note that I'm not
> advocating either of these, just trying to determine what the
> behavior of the API would be.

If the connect call succeeds in establishing a connection, the ULP's
QP should be ready for posting work requests. This simplifies the ULP
considerably.

The API should not create the QP. That would create race conditions
for certain protocols. For example, consider a protocol in which the
first message was a send from the server to the client. To properly
implement such a protocol, the client must post a receive work request
before initiating a connection.
RE: [openib-general] RDMA connection and address translation API
> -Original Message- > From: [EMAIL PROTECTED] [mailto:openib-general- > [EMAIL PROTECTED] On Behalf Of Fab Tillier > Sent: Wednesday, August 24, 2005 3:00 PM > To: 'Roland Dreier' > Cc: openib-general@openib.org > Subject: RE: [openib-general] RDMA connection and address translation API > > > From: Roland Dreier [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, August 24, 2005 11:03 AM > > > > Fab> Why can't the IPV field be ignored? If a listen wants only > > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > > Fab> address, and would set the offset to that of the hello > > Fab> message's destination address (32). > > > > Yes, you're right for SDP. I guess if we're comfortable mandating > > that all protocols put their source and destination IPs in the private > > data for the IB case, then this works. Of course it's somewhat > > awkward to pass this information into the transport-neutral CM API but > > I think this can be worked around. > > I don't know if we need to mandate IP usage - it's up to the application. > Any > application that wants to have similar semantics to the way socket listens > work > (especially when bound to one of multiple IP addresses on a port) the > application would have to define its private data to accommodate this. 
The context of this discussion is a common API for iWARP/IB ULPs. In that case they all use IP addresses (since that is the common addressing). If someone uses the IB-specific API under this abstraction level, he can provide whatever data he wants to the CM.

In any case, providing src/dst IPs in the CM private data is simple, and we can come up with an IBTA extension blessing that data structure as a general way to map IP-oriented protocols over IB (a 1-2 page draft at the most). This way it can also address Caitlin's concerns regarding NFS & IETF (since it is now a transport-specific issue).

Yaron
RE: [openib-general] RDMA connection and address translation API
> Because it would be better to configure your network "properly". Putting IP addresses in private data is fundamentally insecure since any user mode client can spoof the IP address.

A simple forward lookup could detect this.

- Sean
RE: [openib-general] RDMA connection and address translation API
> -Original Message- > From: Roland Dreier [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 1:17 PM > To: Tom Tucker > Cc: openib-general@openib.org > Subject: Re: [openib-general] RDMA connection and address > translation API > > Tom> Good point, although for iWARP it will work that way that you > Tom> expect. For IB, admitedly it's more complex and would > Tom> require ATS. There seems to be significant reluctance around > Tom> ATS and I don't understand the issues. Can you provide a > Tom> quick synopsis? > > My resistance is that ATS is just complexity without any benefit. IMHO the benefit is that you have a transport independent addressing mechanism -- albeit with some limitations as you've mentioned. In this case, the vast majority of clients enjoy the benefit without suffering the limitations. > ... It > doesn't provide additional security. It doesn't solve the > multi-homing problem we're talking about now. Whenever a single GID maps to multiple IP addresses, I agree, it is a limitation. However, I don't believe that this is strictly necessary. > ... Once you've thrown away > information by turning your IP address into an IB GID, there's no > magic way ATS can recreate that information and be psychic about which > of the multi-homed IPs you actually meant. I agree, so don't do that. If you want it to work properly, then you need to map GIDS to IP addresses. > ... So why not just put the IP > addressing information into the CM private data, the way that the SDP > protocol already does? > > - R. > Because it would be better to configure your network "properly". Putting IP addresses in private data is fundamentally insecure since any user mode client can spoof the IP address. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] RDMA connection and address translation API
> From: Sean Hefty [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 11:18 AM > > For IB, using private data to listen on a specific IP address seems the > easiest thing to do. (Maybe we could do it by mapping different IP > addresses to different service IDs, requiring registration and lookup?) The problem with the SID method is that the SID namespace is smaller than the IPV6 address name space. There's no way to get every possible IPV6 address represented by a 64-bit SID. This further ignores the rules for SIDs in the IB specification. I think private data is the only way to do this properly. > If the CM abstraction layer expected those values to be returned in the > REP message, it could validate that the remote side it using the same > protocol to ensure some degree of backwards compatibility. > > I don't know if it makes more sense to push private data checks into the > actual CM or keep them in a CM abstraction layer. My guess is that the > former may be the easier implementation. I think putting the checks in the CM makes the most sense, though it should be done in a generic fashion. A CM abstraction layer could then simply apply a policy for private data usage - where in the private data it stores the IP address information. Layering it this way allows the private data compare to be used for things other than IP addresses. Add functionality without imposing policy. - Fab ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] RDMA connection and address translation API
> From: Roland Dreier [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 11:03 AM > > Fab> Why can't the IPV field be ignored? If a listen wants only > Fab> IPV4 addresses, it would specify a 16-byte compare buffer > Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 > Fab> address, and would set the offset to that of the hello > Fab> message's destination address (32). > > Yes, you're right for SDP. I guess if we're comfortable mandating > that all protocols put their source and destination IPs in the private > data for the IB case, then this works. Of course it's somewhat > awkward to pass this information into the transport-neutral CM API but > I think this can be worked around. I don't know if we need to mandate IP usage - it's up to the application. Any application that wants to have similar semantics to the way socket listens work (especially when bound to one of multiple IP addresses on a port) the application would have to define its private data to accommodate this. At the IB level, the contents of the private data are still opaque, even to the CM. The CM would only expose the ability to have it perform an initial triage of requests by doing binary comparisons over regions of private data. It doesn't know (or need to know) what the data represents - it only cares about finding a match (or not). The CM doesn't define any sort of policy here, and I don't think it should. It's just bytes to the CM, and it's doing a blind comparison without interpreting the contents. - Fab ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [openib-general] RDMA connection and address translation API
> From: Caitlin Bestler [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 11:14 AM > > On 8/24/05, Fab Tillier <[EMAIL PROTECTED]> wrote: > > > From: Roland Dreier [mailto:[EMAIL PROTECTED] > > > Sent: Wednesday, August 24, 2005 10:16 AM > > > > > > Fab> Knowledge of actual IP addresses would be up to the consumer. > > > Fab> However, the IB CM can facilitate checks by allowing the user > > > Fab> to specify an offset and length in the private data to match > > > Fab> to for incoming requests. > > > > > > This seems too complex and at the same time too limited to me. For > > > one thing -- although I think ATS should die -- this doesn't support > > > ATS reverse lookups. > > > > I think if all ULPs provide their source and destination IP in the private > > data, you can eliminate the reverse lookup altogether. A simple forward > > lookup is all that's needed to validate that the source GID in the REQ > > matches the reported source IP in the private data. The forward lookup > > could be done via ATS or via ARP, but the CM doesn't need to care which > > method is used. > > That is not an option. > > The applications are expecting source/destination network addresses > that come from a network layer, not from the peer application. IP has > no problem meeting this requirement. This is an IB problem that needs > to be solved within the scope of IB without changing any ULPs. If the app wants to use source/destination network addresses, there isn't a problem. The problem is the app wants to use IP addresses, which are *not* network addresses in IB. So the app needs to decide between one of two things - be aware of IB network addresses, or provide meaning to IP addresses over IB. The latter can't be done reliably under the covers - ATS reverse lookups won't tell you the IP the source actually used, and there's no way to do so without either using private data in the CM REQ or requiring a 1:1 mapping of IB:IP addresses. 
The 1:1 IB:IP mapping is not feasible, so the only way to know what IP address the application used is to embed that into the private data. I would expect protocols that try to use IP as their addressing would accommodate this in their IB usage, just like SDP accommodates it in the hello message. - Fab
Re: [openib-general] RDMA connection and address translation API
NFS over RDMA does not do that. Shouldn't that be the end of the discussion on abusing CM private data, unless you are talking *solely* about IB private data? And if that is the discussion, should not such a strategy be proposed to IETF and/or IBTA as an official NFSoRDMA mapping for IB? The other end of the NFSoRDMA connection is not necessarily running OpenIB or even Linux and is not party to any of these discussions.

> My resistance is that ATS is just complexity without any benefit. It doesn't provide additional security. It doesn't solve the multi-homing problem we're talking about now. Once you've thrown away information by turning your IP address into an IB GID, there's no magic way ATS can recreate that information and be psychic about which of the multi-homed IPs you actually meant. So why not just put the IP addressing information into the CM private data, the way that the SDP protocol already does?
>
> - R.
RE: [openib-general] RDMA connection and address translation API
> From: Roland Dreier [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 10:16 AM > > Fab> Knowledge of actual IP addresses would be up to the consumer. > Fab> However, the IB CM can facilitate checks by allowing the user > Fab> to specify an offset and length in the private data to match > Fab> to for incoming requests. > > This seems too complex and at the same time too limited to me. For > one thing -- although I think ATS should die -- this doesn't support > ATS reverse lookups. I think if all ULPs provide their source and destination IP in the private data, you can eliminate the reverse lookup altogether. A simple forward lookup is all that's needed to validate that the source GID in the REQ matches the reported source IP in the private data. The forward lookup could be done via ATS or via ARP, but the CM doesn't need to care which method is used. > For another, it doesn't handle something like > the SDP Hello header, where the IP version is at a certain offset, and > then the IP address is interpreted according to the IP address. Why can't the IPV field be ignored? If a listen wants only IPV4 addresses, it would specify a 16-byte compare buffer with the first 12 bytes zero, the next 4 filled with the IPV4 address, and would set the offset to that of the hello message's destination address (32). > What makes it really ugly is that it's perfectly reasonable for one > consumer to listen to a service at 192.168.11.12 and another consumer > to listen to the same service at 192.168.98.99. How do we handle this > in the IB case?? As long as the service IP address (the local address on the listening side) is always advertised in the same place in the private data, this isn't a problem. The compare lengths and offsets would be identical for both services, but the compare buffer contents would differ. Did I miss what you were getting at? 
- Fab
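The forward-lookup validation described in this message could look roughly like the following. This is a sketch under assumptions: `forward_lookup` stands in for whatever resolver is used (ATS or ARP, per the message), the table contents and GID strings are made up, and none of these names are real CM functions.

```python
import ipaddress

# Made-up forward mapping: IP address -> GIDs of the ports that own it.
# A real system would query ATS or ARP instead of a static table.
FORWARD_TABLE = {
    ipaddress.ip_address("192.168.11.12"): {"fe80::2:c903:0:1"},
    ipaddress.ip_address("192.168.98.99"): {"fe80::2:c903:0:1"},
    ipaddress.ip_address("192.168.11.40"): {"fe80::2:c903:0:2"},
}


def forward_lookup(ip):
    """Return the set of GIDs known to own `ip` (empty if unknown)."""
    return FORWARD_TABLE.get(ipaddress.ip_address(ip), set())


def source_ip_is_plausible(req_source_gid, claimed_source_ip):
    """Accept the claimed IP only if it resolves to the GID the REQ came from."""
    return req_source_gid in forward_lookup(claimed_source_ip)


# Honest peer: claims an IP that really maps to its GID.
print(source_ip_is_plausible("fe80::2:c903:0:1", "192.168.98.99"))
# Spoofing peer: claims an IP owned by a different port.
print(source_ip_is_plausible("fe80::2:c903:0:2", "192.168.11.12"))
```

Note that two IP addresses map to the same GID in the made-up table. The forward direction still gives a definite answer there, whereas a reverse lookup from that GID could not say which of the two addresses the peer meant.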
RE: [openib-general] RDMA connection and address translation API
>> I think if all ULPs provide their source and destination IP in the private data, you can eliminate the reverse lookup altogether. A simple forward lookup is all that's needed to validate that the source GID in the REQ matches the reported source IP in the private data. The forward lookup could be done via ATS or via ARP, but the CM doesn't need to care which method is used.
>
> That is not an option.
>
> The applications are expecting source/destination network addresses that come from a network layer, not from the peer application. IP has no problem meeting this requirement. This is an IB problem that needs to be solved within the scope of IB without changing any ULPs.

IB can solve the problem by exposing fewer bytes of private data. ULPs do not need to know that part of the IB private data is actually used by the CM abstraction layer. ULPs that make use of this new interface change anyway.

- Sean
RE: [openib-general] RDMA connection and address translation API
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:openib-general-[EMAIL PROTECTED] On Behalf Of Caitlin Bestler
> Sent: Wednesday, August 24, 2005 2:14 PM
> To: Fab Tillier
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address translation API
>
> The applications are expecting source/destination network addresses that come from a network layer, not from the peer application. IP has no problem meeting this requirement. This is an IB problem that needs to be solved within the scope of IB without changing any ULPs.

To my understanding, IB private data fields are IB CM specific, so embedding src/dst IPs in them doesn't change the ULP and could be considered part of the IB CM. You can look at the private data in that case as a replacement for the TCP CM (SYN/SYN-ACK exchange), and the SYN packet includes IPs and ports.

Yaron
RE: [openib-general] RDMA connection and address translation API
>Fab> Why can't the IPV field be ignored? If a listen wants only >Fab> IPV4 addresses, it would specify a 16-byte compare buffer >Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 >Fab> address, and would set the offset to that of the hello >Fab> message's destination address (32). > >Yes, you're right for SDP. I guess if we're comfortable mandating >that all protocols put their source and destination IPs in the private >data for the IB case, then this works. Of course it's somewhat >awkward to pass this information into the transport-neutral CM API but >I think this can be worked around. For IB, using private data to listen on a specific IP address seems the easiest thing to do. (Maybe we could do it by mapping different IP addresses to different service IDs, requiring registration and lookup?) If the CM abstraction layer expected those values to be returned in the REP message, it could validate that the remote side it using the same protocol to ensure some degree of backwards compatibility. I don't know if it makes more sense to push private data checks into the actual CM or keep them in a CM abstraction layer. My guess is that the former may be the easier implementation. - Sean ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] RDMA connection and address translation API
Tom> Good point, although for iWARP it will work the way that you expect. For IB, admittedly, it's more complex and would require ATS. There seems to be significant reluctance around ATS and I don't understand the issues. Can you provide a quick synopsis?

My resistance is that ATS is just complexity without any benefit. It doesn't provide additional security. It doesn't solve the multi-homing problem we're talking about now. Once you've thrown away information by turning your IP address into an IB GID, there's no magic way ATS can recreate that information and be psychic about which of the multi-homed IPs you actually meant. So why not just put the IP addressing information into the CM private data, the way that the SDP protocol already does?

- R.
Re: [openib-general] RDMA connection and address translation API
On 8/24/05, Fab Tillier <[EMAIL PROTECTED]> wrote: > > From: Roland Dreier [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, August 24, 2005 10:16 AM > > > > Fab> Knowledge of actual IP addresses would be up to the consumer. > > Fab> However, the IB CM can facilitate checks by allowing the user > > Fab> to specify an offset and length in the private data to match > > Fab> to for incoming requests. > > > > This seems too complex and at the same time too limited to me. For > > one thing -- although I think ATS should die -- this doesn't support > > ATS reverse lookups. > > I think if all ULPs provide their source and destination IP in the private > data, > you can eliminate the reverse lookup altogether. A simple forward lookup is > all > that's needed to validate that the source GID in the REQ matches the reported > source IP in the private data. The forward lookup could be done via ATS or > via > ARP, but the CM doesn't need to care which method is used. > That is not an option. The applications are expecting source/destination network addresses that come from a network layer, not from the peer application. IP has no problem meeting this requirement. This is an IB problem that needs to be solved within the scope of IB without changing any ULPs. > > For another, it doesn't handle something like > > the SDP Hello header, where the IP version is at a certain offset, and > > then the IP address is interpreted according to the IP address. > > Why can't the IPV field be ignored? If a listen wants only IPV4 addresses, it > would specify a 16-byte compare buffer with the first 12 bytes zero, the next > 4 > filled with the IPV4 address, and would set the offset to that of the hello > message's destination address (32). > > > What makes it really ugly is that it's perfectly reasonable for one > > consumer to listen to a service at 192.168.11.12 and another consumer > > to listen to the same service at 192.168.98.99. How do we handle this > > in the IB case?? 
> > As long as the service IP address (the local address on the listening side) is always advertised in the same place in the private data, this isn't a problem. The compare lengths and offsets would be identical for both services, but the compare buffer contents would differ. Did I miss what you were getting at?

The consensus when this issue was debated in the DAT Collaborative was that there was no transport-neutral way to specify a set of addresses to listen on other than "all addresses supported by this device".

As noted in another posting, it is easy to support "all for device" and "this address only" with transport-neutral interfaces. Anything else is problematic.
RE: [openib-general] RDMA connection and address translation API
> -Original Message-
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 11:27 AM
> To: Tom Tucker
> Cc: Roland Dreier; openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address translation API
>
> Tom> I think I understand, but the purpose of specifying the IP address in the listen is not to filter incoming connect requests, but rather to determine which devices I listen on. I think this works for the IB case as well. So the utility of the IP address specified in the listen is only to determine which devices the sid is created on. Does this make sense or am I missing something?
>
> Well, that's not what I would expect. Suppose I have a device configured with local addresses 192.168.11.12 and 192.168.98.99 and I start listening for some service at the address 192.168.11.12. I don't think I should see a connection request if a remote system tries to connect to 192.168.98.99 (even though it's the same network interface as 192.168.11.12).
>
> - R.

Good point, although for iWARP it will work the way that you expect. For IB, admittedly, it's more complex and would require ATS. There seems to be significant reluctance around ATS and I don't understand the issues. Can you provide a quick synopsis?
Re: [openib-general] RDMA connection and address translation API
Fab> Why can't the IPV field be ignored? If a listen wants only Fab> IPV4 addresses, it would specify a 16-byte compare buffer Fab> with the first 12 bytes zero, the next 4 filled with the IPV4 Fab> address, and would set the offset to that of the hello Fab> message's destination address (32). Yes, you're right for SDP. I guess if we're comfortable mandating that all protocols put their source and destination IPs in the private data for the IB case, then this works. Of course it's somewhat awkward to pass this information into the transport-neutral CM API but I think this can be worked around. Roland> What makes it really ugly is that it's perfectly Roland> reasonable for one consumer to listen to a service at Roland> 192.168.11.12 and another consumer to listen to the same Roland> service at 192.168.98.99. How do we handle this in the IB Roland> case?? Fab> As long as the service IP address (the local address on the Fab> listening side) is always advertised in the same place in the Fab> private data, this isn't a problem. The compare lengths and Fab> offsets would be identical for both services, but the compare Fab> buffer contents would differ. Did I miss what you were Fab> getting at? No, I think I confused myself. As long as the CM can get at the IP information, it can figure out which consumer is which. - R. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
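The compare-buffer scheme quoted in this message can be sketched as below. The 16-byte buffer layout (12 zero bytes, then the IPv4 address) and the offset of 32 are taken from the thread. Everything else is an assumption for illustration: the helper names, the 64-byte private data length, and the all-zero filler are invented here, not taken from the SDP or IB specifications.

```python
import ipaddress

DST_ADDR_OFFSET = 32   # offset quoted in the thread for the destination address
ADDR_FIELD_LEN = 16    # 16-byte address field, as in the quoted compare buffer


def ipv4_compare_buffer(addr):
    """12 zero bytes followed by the 4-byte IPv4 address."""
    return bytes(12) + ipaddress.IPv4Address(addr).packed


def private_data_matches(private_data, offset, compare):
    """Blind byte comparison; the contents are never interpreted."""
    return private_data[offset:offset + len(compare)] == compare


def fake_hello(dst_ipv4, total_len=64):
    """Made-up private data: zeros except for the destination address field."""
    buf = bytearray(total_len)
    buf[DST_ADDR_OFFSET:DST_ADDR_OFFSET + ADDR_FIELD_LEN] = ipv4_compare_buffer(dst_ipv4)
    return bytes(buf)


# Two consumers listen on the same service with the same offset and length
# but different compare buffer contents.
listen_a = ipv4_compare_buffer("192.168.11.12")
listen_b = ipv4_compare_buffer("192.168.98.99")

# A request addressed to 192.168.98.99 should only match the second listen.
req = fake_hello("192.168.98.99")
print(private_data_matches(req, DST_ADDR_OFFSET, listen_a))
print(private_data_matches(req, DST_ADDR_OFFSET, listen_b))
```

The comparison function never parses an address, which is the point made in the thread: the CM only needs offset, length and bytes to tell the two consumers apart.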
RE: [openib-general] RDMA connection and address translation API
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:openib-general-[EMAIL PROTECTED] On Behalf Of James Lentini
> Sent: Wednesday, August 24, 2005 1:43 PM
> To: Roland Dreier
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address translation API
>
> On Tue, 23 Aug 2005, Roland Dreier wrote:
>
> > It would be possible to have another function like rdma_getpeername() that takes the transport address and returns a source IP address. In the IB case this would do an ATS reverse lookup. However, I hate this idea. iSER already uses the CM private data to pass the source IP in the IB case,
>
> I know this is how IB SDP works, but I don't think iSER works this way.
>
> The code in the tree calls dat_ep_connect() with a NULL private data pointer.
>
> There is an iSER HELLO message described in iser_header.h that contains IP addresses, but I'm not certain that this is part of the current protocol (ISER_HELLO_LEN and ISER_HELLO_REPLY_LEN are unused).

James,

iSER doesn't mandate the source IP in general, since it does much stronger authentication during Login. However, we believe that using a header similar to SDP's can help the passive side:

a. know which destination IP was targeted (in a multi-homed environment)
b. validate the source, for implementations that want to do so for some reason

That's why the draft suggested adding the source/dst IP in the private data, just like SDP does. I believe it can be a good idea to use the same approach for NFS/RDMA and eliminate the need for the reverse ATS lookup (which may have some conflicts when multiple IPs exist per node).
We may just use the SDP hello header as is, with unused fields zeroed. This will allow all ULPs to use the same mechanism.

Yaron
Re: [openib-general] RDMA connection and address translation API
On Tue, 23 Aug 2005, Roland Dreier wrote:

> It would be possible to have another function like rdma_getpeername() that takes the transport address and returns a source IP address. In the IB case this would do an ATS reverse lookup. However, I hate this idea. iSER already uses the CM private data to pass the source IP in the IB case,

I know this is how IB SDP works, but I don't think iSER works this way.

The code in the tree calls dat_ep_connect() with a NULL private data pointer.

There is an iSER HELLO message described in iser_header.h that contains IP addresses, but I'm not certain that this is part of the current protocol (ISER_HELLO_LEN and ISER_HELLO_REPLY_LEN are unused).
Re: [openib-general] RDMA connection and address translation API
Fab> I think the IB CM needs to be able to do two things. It needs to allow a listen to be bound to a specific port - using the port GUID or the LID or something along those lines.

Yes, this is probably a good idea.

Fab> Knowledge of actual IP addresses would be up to the consumer. However, the IB CM can facilitate checks by allowing the user to specify an offset and length in the private data to match to for incoming requests.

This seems too complex and at the same time too limited to me. For one thing -- although I think ATS should die -- this doesn't support ATS reverse lookups. For another, it doesn't handle something like the SDP Hello header, where the IP version is at a certain offset, and then the IP address is interpreted according to the IP version.

What makes it really ugly is that it's perfectly reasonable for one consumer to listen to a service at 192.168.11.12 and another consumer to listen to the same service at 192.168.98.99. How do we handle this in the IB case??

- R.
Re: [openib-general] RDMA connection and address translation API
> > However, there's another problem with trying to lump address translation and connection into a single "connect" call, and this problem looks fundamental and fatal to me. The connect call takes a QP pointer, but to create a QP the consumer needs to know which local device to use. However, the consumer doesn't know which device to use until the destination address has been resolved to a route, including a local interface.
>
> The proposition, also presented (I believe) in the OpenIB workshop, includes a function called ib_cma_get_device, that retrieves the device (for qp creation purposes) according to the destination address and the local routing table.

That function was included in the presentation. Given that the discussion focused on the proper location of address translation, it is understandable that its presence was overlooked.
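The thread only says that ib_cma_get_device picks a device from the destination address and the local routing table; its signature is not given. As a rough illustration of that idea only, the sketch below does a longest-prefix match over a made-up routing table. `get_device_for`, the `ROUTES` table and the device names are hypothetical and are not the proposed function.

```python
import ipaddress

# Made-up local routing table: (destination prefix, outgoing device name).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "dev0"),
    (ipaddress.ip_network("192.168.11.0/24"), "dev1"),
    (ipaddress.ip_network("192.168.98.0/24"), "dev2"),
]


def get_device_for(dst):
    """Longest-prefix match of `dst` against ROUTES; device name or None."""
    addr = ipaddress.ip_address(dst)
    candidates = [(net.prefixlen, dev) for net, dev in ROUTES if addr in net]
    if not candidates:
        return None
    # The most specific (longest) matching prefix wins.
    return max(candidates)[1]


# The consumer can resolve the device first, then create its QP on it.
print(get_device_for("192.168.11.77"))
print(get_device_for("10.1.2.3"))
```

A lookup of this kind is what lets the consumer learn the local device before QP creation, which addresses the ordering problem raised in the quoted text.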
RE: [openib-general] RDMA connection and address translation API
> From: Roland Dreier [mailto:[EMAIL PROTECTED] > Sent: Wednesday, August 24, 2005 9:27 AM > > Tom> I think I understand, but the purpose of specifying the IP > Tom> address in the listen is not to filter incoming connect > Tom> requests, but rather to determine which devices I listen > Tom> on. I think this works for the IB case as well. So the > Tom> utility of the IP address specified in the listen is only to > Tom> determine which devices the sid is created on. Does this make > Tom> sense or am I missing something? > > Well, that's not what I would expect. Suppose I have a device > configured with local addresses 192.168.11.12 and 192.168.98.99 and I > start listening for some service at the address 192.168.11.12. I > don't think I should see a connection request if a remote system tries > to connect to 192.168.98.99 (even though it's the same network > interface as 192.168.11.12). I think the IB CM needs to be able to do two things. It needs to allow a listen to be bound to a specific port - using the port GUID or the LID or something along those lines. The Windows CM currently take a port GUID as input to allow binding requests to a local IB port. Incoming MADs are matched based on which port they came in on. This does introduce the limitation that sending CM MADs to a port other than the one you wish to connect to won't have the desired result if the ULP performs port filtering. I don't think this is a big deal. Knowledge of actual IP addresses would be up to the consumer. However, the IB CM can facilitate checks by allowing the user to specify an offset and length in the private data to match to for incoming requests. ULPs that would want to distinguish between IP addresses on a given port would put the IP in their private data, and instruct the CM to compare a specific value at a specific offset and length for every incoming REQ. 
The Windows CM does this - a listen takes as input a private data compare buffer, buffer length, and offset within the REQ private data to perform the comparison. Without the CM performing the private data comparison for the client, there is no way for the CM to route to the proper person based on something like IP. Using a generic private data compare mechanism enables the users to do whatever they feel like, without putting knowledge of IP addresses and whatnot into the IB CM or dictating how clients must use their private data. A lookup of a listen for an incoming request changes from just being based on SID to taking as additional parameters the port GUID on which the REQ was received and the REQ's private data in case a private data compare needs to be performed. - Fab ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
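The listen lookup described in this message (SID, plus the port GUID the REQ arrived on, plus an optional private data compare) might be sketched as follows. This is a minimal illustration under assumptions: the `Listen` record, `find_listen`, the SID and GUID values, and the private data layout are all invented here and are not the Windows CM's or OpenIB's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Listen:
    owner: str
    sid: int
    port_guid: Optional[int] = None   # None: accept on any local port
    compare: bytes = b""              # empty: no private data filter
    offset: int = 0


def find_listen(listens, sid, port_guid, private_data):
    """Return the first listen matching SID, arrival port and private data bytes."""
    for entry in listens:
        if entry.sid != sid:
            continue
        if entry.port_guid is not None and entry.port_guid != port_guid:
            continue
        # Blind comparison over the requested region of the private data.
        window = private_data[entry.offset:entry.offset + len(entry.compare)]
        if window != entry.compare:
            continue
        return entry
    return None


SID = 0x1234               # made-up service ID
PORT_A, PORT_B = 0xA, 0xB  # made-up port GUIDs

listens = [
    Listen("consumer-1", SID, PORT_A, compare=bytes([192, 168, 11, 12]), offset=8),
    Listen("consumer-2", SID, PORT_A, compare=bytes([192, 168, 98, 99]), offset=8),
]

# Made-up REQ private data carrying 192.168.98.99 at offset 8.
req_private_data = bytes(8) + bytes([192, 168, 98, 99]) + bytes(4)
hit = find_listen(listens, SID, PORT_A, req_private_data)
miss = find_listen(listens, SID, PORT_B, req_private_data)
print(hit.owner if hit else None, miss)
```

Both listens share one SID and one port, and only the compare bytes separate them. The same request arriving on a different port matches nothing, which mirrors the port-filtering limitation mentioned earlier in the message.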
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 11:27 AM
> To: Tom Tucker
> Cc: openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
> Tom> I think I understand, but the purpose of specifying the IP
> Tom> address in the listen is not to filter incoming connect
> Tom> requests, but rather to determine which devices I listen
> Tom> on. I think this works for the IB case as well. So the
> Tom> utility of the IP address specified in the listen is only to
> Tom> determine which devices the sid is created on. Does this make
> Tom> sense or am I missing something?
>
> Well, that's not what I would expect. Suppose I have a device
> configured with local addresses 192.168.11.12 and 192.168.98.99 and I
> start listening for some service at the address 192.168.11.12. I
> don't think I should see a connection request if a remote system tries
> to connect to 192.168.98.99 (even though it's the same network
> interface as 192.168.11.12).
>

I agree, Roland. ULPs that listen on a specific address expect only connection requests that were sent to that IP address. I think we want to provide this functionality.
Re: [openib-general] RDMA connection and address translation API
Tom> I think I understand, but the purpose of specifying the IP
Tom> address in the listen is not to filter incoming connect
Tom> requests, but rather to determine which devices I listen
Tom> on. I think this works for the IB case as well. So the
Tom> utility of the IP address specified in the listen is only to
Tom> determine which devices the sid is created on. Does this make
Tom> sense or am I missing something?

Well, that's not what I would expect. Suppose I have a device
configured with local addresses 192.168.11.12 and 192.168.98.99 and I
start listening for some service at the address 192.168.11.12. I
don't think I should see a connection request if a remote system tries
to connect to 192.168.98.99 (even though it's the same network
interface as 192.168.11.12).

- R.
RE: [openib-general] RDMA connection and address translation API
> -----Original Message-----
> From: Roland Dreier [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 11:04 AM
> To: Tom Tucker
> Cc: Roland Dreier; openib-general@openib.org
> Subject: Re: [openib-general] RDMA connection and address
> translation API
>
> Tom> The listen side, however, I think needs a little tweaking. It
> Tom> would be beneficial if the client can specify either an IP
> Tom> address and port to listen on (effectively selecting a
> Tom> particular device), or a wild card (all RDMA devices). An NFS
> Tom> server is an example of the latter. This is trivial to do by
> Tom> providing an address to the listen call where a '0'
> Tom> represents a wild card.
>
> I agree that it's useful to be able to pass a sockaddr to
> bind a listen to (just like the bind() call in userspace).
> However, the problem is that in the IB world, an incoming
> connection request does not come with a destination IP
> address in any standard way. So I don't know the right way
> to implement bind() in the IB case.

I think I understand, but the purpose of specifying the IP address in the listen is not to filter incoming connect requests, but rather to determine which devices I listen on. I think this works for the IB case as well. So the utility of the IP address specified in the listen is only to determine which devices the sid is created on. Does this make sense or am I missing something?

>
> By the way, an IP address/port does not necessarily select a
> single RDMA device. It's a perfectly valid configuration to
> have 10 network interfaces all with the same local IP address.
>
> - R.
>

Yes, and in this case, all devices with the same IP address would end up listening, in the same way that specifying a wildcard (0) would result in multiple devices listening.
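The device-selection behaviour Tom describes (the listen address only picks which devices to listen on, '0' is a wildcard, and several devices sharing one IP all end up listening) might look roughly like this. It is a sketch under assumptions: `sketch_rdma_dev`, `SKETCH_ADDR_ANY` and `sketch_select_listen_devices` are hypothetical names, addresses are IPv4 only, and each device is modelled with a single address, which is a simplification given that an interface can carry more than one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wildcard value: 0 is assumed to mean "all RDMA devices". */
#define SKETCH_ADDR_ANY 0u

/* Hypothetical per-device record; one IPv4 address per device for brevity. */
struct sketch_rdma_dev {
    const char *name;
    uint32_t ipv4;   /* local address in host byte order */
};

/*
 * Collect every device the listen should be created on.
 * Returns the number of entries written to 'out' (at most out_cap).
 */
size_t sketch_select_listen_devices(const struct sketch_rdma_dev *devs,
                                    size_t ndevs, uint32_t listen_addr,
                                    const struct sketch_rdma_dev **out,
                                    size_t out_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < ndevs && n < out_cap; i++) {
        /* The wildcard selects everything; otherwise match the address. */
        if (listen_addr == SKETCH_ADDR_ANY || devs[i].ipv4 == listen_addr)
            out[n++] = &devs[i];
    }
    return n;
}
```

Note that this only chooses devices. It does not filter incoming requests by destination address, which is the separate question raised elsewhere in the thread.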
Re: [openib-general] RDMA connection and address translation API
Tom> The listen side, however, I think needs a little tweaking. It
Tom> would be beneficial if the client can specify either an IP
Tom> address and port to listen on (effectively selecting a
Tom> particular device), or a wild card (all RDMA devices). An NFS
Tom> server is an example of the latter. This is trivial to do by
Tom> providing an address to the listen call where a '0'
Tom> represents a wild card.

I agree that it's useful to be able to pass a sockaddr to bind a listen to (just like the bind() call in userspace). However, the problem is that in the IB world, an incoming connection request does not come with a destination IP address in any standard way. So I don't know the right way to implement bind() in the IB case.

By the way, an IP address/port does not necessarily select a single RDMA device. It's a perfectly valid configuration to have 10 network interfaces all with the same local IP address.

- R.
RE: [openib-general] RDMA connection and address translation API
Roland, this looks good! A few comments below... > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier > Sent: Wednesday, August 24, 2005 12:07 AM > To: openib-general@openib.org > Subject: [openib-general] RDMA connection and address translation API > > At the OpenIB workshop on Monday, we had some discussion about a > high-level transport-neutral API for connection handling. After > giving the topic some more thought, I've come to the conclusion that > neither the kDAPL API nor the new API that was presented are usable. > In this email, I'll try to detail my reasoning and sketch what I > believe is the correct API. > > The new API that we looked at was essentially the following (I'm > recreating this from memory, so I apologize if I misrepresent it): > > listen(local_ip_address, service_id, listen_callback) > connect(local_qp, remote_ip_address, qos, service_id, > private_data, connect_callback) > > We already discussed the problem with having the listen callback pass > the consumer a remote source address -- doing this requires the > connection handling module to do an ATS reverse lookup in the IB case, > which the consumer might not want. I think there's agreement that the > correct thing here is for the listen callback to pass a transport > address to the consumer and provide a function that the consumer can > call to perform an ATS reverse lookup if desired. This isn't a major > problem and can be dealt with. > > However, there's another problem with trying to lump address > translation and connection into a single "connect" call, and this > problem looks fundamental and fatal to me. The connect call takes a > QP pointer, but to create a QP the consumer needs to know which local > device to use. However, the consumer doesn't know which device to use > until the destination address has been resolved to a route, including > a local interface. 
> > As far as I can tell, kDAPL punts on this and simply requires the > consumer to handle the route lookup itself before calling > dat_ep_connect(). It seems that current kDAPL consumers similarly > punt on this issue: the iSER initiator and the NFS-RDMA client both > just use a single device which is statically discovered at init time. > Yes, DAPL punts on this. > It seems that the kDAPL connection model has a serious flaw, in that > it pushes the complexity of route lookup into the consumer. Further, > we have strong evidence that this routing code is hard to write and > that consumers will just ignore this complexity and hard-code > solutions that don't work under all configurations. > I agree! > With this in mind, I believe that the connection API needs to be > something more like the following: > > rdma_resolve_address(): > inputs: dest IP address, qos, npaths, > done callback, opaque context > done callback params: status, local RDMA device, > RDMA transport address, context > > This function starts the process of resolving an IP address to > an RDMA device and address. When the resolution is complete, > the callback is called with a status. If the status is > "success" then the callback also gets the device pointer and > transport address (as well as the original context that the > consumer passed in). > > The "RDMA transport address" type is a union containing > transport-dependent data. In the IB case, it's all of the > SGID, DGID, SLID, DLID, SL etc. that we know and love. In the > iWARP case, it's the source IP, destination IP and QOS. > > npaths can be either 1 or 2 in the IB case; if it's 2, then > the resolver will try to find a primary and alternate path for > APM. In the iWARP case, I guess npaths will always be 1, and > I guess anyone who wants to use iWARP over multihomed SCTP > will probably have to use some lower-level API. 
> > By the way, we may also have to have the option of passing in > a local netdev so that we can handle link-local IPv6 > addresses. There may be other cases I haven't thought of yet. > I just hope we can avoid going all the way to the horror of > the getaddrinfo() API. > > I also hope we can agree to use IPoIB ARP to resolve the > address in the IB case; having a flag or some other hack in > the API to expose the option of ATS seems unacceptably ugly. > > rdma_connect(): > inputs: local QP, RDMA transport address, destination service, > private data, timeout, event callback, opaque context > > This function takes the resolved address and actually > connects. > > I'm not sure how we want to abstract the IB service vs. iWARP > TCP port number difference. I guess it's OK to have iWARP > consumers stick their (16-bit) port number in a
RE: [openib-general] RDMA connection and address translation API
Roland:

Steve and I came to the same conclusion on the airplane ride back to Austin. Whereas plain old TCP/IP selects a device at the bottom of the stack, RDMA transports must select the device at the top because pre-connect resources must be allocated and these resources are associated with a particular device.

I think you've absolutely nailed the active side (by the way, I think the ib_at_route_by_ip service already performs the necessary routing function).

The listen side, however, I think needs a little tweaking. It would be beneficial if the client can specify either an IP address and port to listen on (effectively selecting a particular device), or a wild card (all RDMA devices). An NFS server is an example of the latter. This is trivial to do by providing an address to the listen call where a '0' represents a wild card.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Roland Dreier
> Sent: Wednesday, August 24, 2005 12:07 AM
> To: openib-general@openib.org
> Subject: [openib-general] RDMA connection and address translation API
>
> At the OpenIB workshop on Monday, we had some discussion
> about a high-level transport-neutral API for connection
> handling. After giving the topic some more thought, I've
> come to the conclusion that neither the kDAPL API nor the new
> API that was presented are usable.
> In this email, I'll try to detail my reasoning and sketch
> what I believe is the correct API.
> > The new API that we looked at was essentially the following > (I'm recreating this from memory, so I apologize if I > misrepresent it): > > listen(local_ip_address, service_id, listen_callback) > connect(local_qp, remote_ip_address, qos, service_id, > private_data, connect_callback) > > We already discussed the problem with having the listen > callback pass the consumer a remote source address -- doing > this requires the connection handling module to do an ATS > reverse lookup in the IB case, which the consumer might not > want. I think there's agreement that the correct thing here > is for the listen callback to pass a transport address to the > consumer and provide a function that the consumer can call to > perform an ATS reverse lookup if desired. This isn't a major > problem and can be dealt with. > > However, there's another problem with trying to lump address > translation and connection into a single "connect" call, and > this problem looks fundamental and fatal to me. The connect > call takes a QP pointer, but to create a QP the consumer > needs to know which local device to use. However, the > consumer doesn't know which device to use until the > destination address has been resolved to a route, including a > local interface. > > As far as I can tell, kDAPL punts on this and simply requires > the consumer to handle the route lookup itself before calling > dat_ep_connect(). It seems that current kDAPL consumers > similarly punt on this issue: the iSER initiator and the > NFS-RDMA client both just use a single device which is > statically discovered at init time. > > It seems that the kDAPL connection model has a serious flaw, > in that it pushes the complexity of route lookup into the > consumer. Further, we have strong evidence that this routing > code is hard to write and that consumers will just ignore > this complexity and hard-code solutions that don't work under > all configurations. 
> > With this in mind, I believe that the connection API needs to > be something more like the following: > > rdma_resolve_address(): > inputs: dest IP address, qos, npaths, > done callback, opaque context > done callback params: status, local RDMA device, > RDMA transport address, context > > This function starts the process of resolving an IP address to > an RDMA device and address. When the resolution is complete, > the callback is called with a status. If the status is > "success" then the callback also gets the device pointer and > transport address (as well as the original context that the > consumer passed in). > > The "RDMA transport address" type is a union containing > transport-dependent data. In the IB case, it's all of the > SGID, DGID, SLID, DLID, SL etc. that we know and love. In the > iWARP case, it's the source IP, destination IP and QOS. > > npaths can be either 1 or 2 in the IB case; if it's 2, then > the resolver will try to find a primary and alternate path for > APM. In the iWARP case, I guess npaths will always be 1, and > I guess anyone who wants to use iWARP over multihomed SCTP > will probably have to use some lower-level API. > > By the way, we may also have to have the option of passing in > a local netdev so that we can handle link-local IPv6 > addresses. There may be other
Re: [openib-general] RDMA connection and address translation API
Hi,

- Here is a header file for a CM abstraction API proposal.
- This is just a preliminary suggestion, for review.
- All comments are welcome.
- Please read the notes in the header remarks.
- I am attaching the file and will also send it later to the list in a separate message.
- I think that the ib_ prefix should be changed to rdma_, but that should be done for the rest of the verbs as well, if we are claiming that the ib verbs abstract iwarp.
- I think that the main difference between the 2 proposals is the question of whether or not to expose the consumer to the address resolution. I believe this suggestion (of covering it in the cma) is simpler, because it saves unnecessary upcall handling for the consumer. In any case, I don't believe this is clear cut, and I would like to hear other opinions from people on the list.
- Also, please see my embedded answers to this mail.

Thanks,
Guy.

> We already discussed the problem with having the listen callback pass
> the consumer a remote source address -- doing this requires the
> connection handling module to do an ATS reverse lookup in the IB case,
> which the consumer might not want. I think there's agreement that the
> correct thing here is for the listen callback to pass a transport
> address to the consumer and provide a function that the consumer can
> call to perform an ATS reverse lookup if desired. This isn't a major
> problem and can be dealt with.

I agree. This is corrected in the current suggestion.

> However, there's another problem with trying to lump address
> translation and connection into a single "connect" call, and this
> problem looks fundamental and fatal to me. The connect call takes a
> QP pointer, but to create a QP the consumer needs to know which local
> device to use. However, the consumer doesn't know which device to use
> until the destination address has been resolved to a route, including
> a local interface.

The proposal, also presented (I believe) at the OpenIB workshop, includes a function called ib_cma_get_device, which retrieves the device (for qp creation purposes) according to the destination address and the local routing table. This is done synchronously, and it is implemented today in the at module. If using link-local IPv6 addresses, I think that this function isn't even necessary (if I understand it correctly, you need to know which device to send from).

> As far as I can tell, kDAPL punts on this and simply requires the
> consumer to handle the route lookup itself before calling
> dat_ep_connect(). It seems that current kDAPL consumers similarly
> punt on this issue: the iSER initiator and the NFS-RDMA client both
> just use a single device which is statically discovered at init time.
>
> It seems that the kDAPL connection model has a serious flaw, in that
> it pushes the complexity of route lookup into the consumer. Further,
> we have strong evidence that this routing code is hard to write and
> that consumers will just ignore this complexity and hard-code
> solutions that don't work under all configurations.
>
> With this in mind, I believe that the connection API needs to be
> something more like the following:
>
> rdma_resolve_address():
> inputs: dest IP address, qos, npaths,
> done callback, opaque context
> done callback params: status, local RDMA device,
> RDMA transport address, context
>
> This function starts the process of resolving an IP address to
> an RDMA device and address. When the resolution is complete,
> the callback is called with a status. If the status is
> "success" then the callback also gets the device pointer and
> transport address (as well as the original context that the
> consumer passed in).

In the address resolution you have 2 upcalls (from ip to gid and from gid to path). So, if you are already covering one upcall in the cma, why not cover both?
> The "RDMA transport address" type is a union containing > transport-dependent data. In the IB case, it's all of the > SGID, DGID, SLID, DLID, SL etc. that we know and love. In the > iWARP case, it's the source IP, destination IP and QOS. > > npaths can be either 1 or 2 in the IB case; if it's 2, then > the resolver will try to find a primary and alternate path for > APM. In the iWARP case, I guess npaths will always be 1, and > I guess anyone who wants to use iWARP over multihomed SCTP > will probably have to use some lower-level API. > > By the way, we may also have to have the option of passing in > a local netdev so that we can handle link-local IPv6 > addresses. There may be other cases I haven't thought of yet. > I just hope we can avoid going all the way to the horror of > the getaddrinfo() API. > > I also hope we can agree to use IPoIB ARP to resolve the > address in the IB case; having a flag
RE: [openib-general] RDMA connection and address translation API
>However, there's another problem with trying to lump address
>translation and connection into a single "connect" call, and this
>problem looks fundamental and fatal to me. The connect call takes a
>QP pointer, but to create a QP the consumer needs to know which local
>device to use. However, the consumer doesn't know which device to use
>until the destination address has been resolved to a route, including
>a local interface.

I agree that this is a fairly serious issue with the proposed API. I guess that I'd like to clarify what the operation of a connect call would do. Would it be responsible for modifying the QP? If so, could such a call also allocate the QP? Note that I'm not advocating either of these, just trying to determine what the behavior of the API would be.

>Wait for connection requests and pass events to the consumer's
>callback. I'm not sure if/how we want to support binding to
>a particular IP address. The current IB CM in Linux doesn't
>support binding a listen to a single device or port, and even
>if it did it's not clear how to handle binding to one IP
>address when a port has more than one IP.

I don't think that it would be overly difficult to bind IB CM listen requests to a specific port or LID, or based on matching specific private data.

- Sean
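The ordering problem quoted above (a QP cannot be created until the local device is known, and the device is only known once the destination has been resolved) is what motivates the resolve-then-connect split. Below is a toy sketch of that control flow. Everything in it is an assumption made for illustration: the `sketch_` types and functions are hypothetical stand-ins, the "routing table" is a two-entry array matched on a /24 prefix, the resolver invokes the callback immediately rather than asynchronously, and no real verbs or CM calls are made.

```c
#include <stddef.h>
#include <stdint.h>

#define IPV4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
     ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical stand-ins for a device, a transport address and a QP. */
struct sketch_device { int id; uint32_t local_ip; };
struct sketch_addr   { uint32_t src_ip; uint32_t dst_ip; };
struct sketch_qp     { const struct sketch_device *dev; int connected; };

typedef void (*sketch_resolve_done)(int status,
                                    const struct sketch_device *dev,
                                    const struct sketch_addr *addr,
                                    void *context);

/* Toy routing table: each device serves the /24 its own address is in. */
static const struct sketch_device sketch_devs[] = {
    { 0, IPV4(192, 168, 11, 12) },
    { 1, IPV4(192, 168, 98, 99) },
};

/* Step 1: resolve a destination to a local device and transport address. */
void sketch_resolve_address(uint32_t dst_ip, sketch_resolve_done done,
                            void *context)
{
    for (size_t i = 0; i < sizeof(sketch_devs) / sizeof(sketch_devs[0]); i++) {
        if ((sketch_devs[i].local_ip >> 8) == (dst_ip >> 8)) {
            struct sketch_addr addr = { sketch_devs[i].local_ip, dst_ip };
            done(0, &sketch_devs[i], &addr, context);
            return;
        }
    }
    done(-1, NULL, NULL, context);   /* no route to the destination */
}

/* Step 2: connect a QP that already belongs to the resolved device. */
int sketch_connect(struct sketch_qp *qp, const struct sketch_addr *addr)
{
    if (qp->dev == NULL || addr == NULL)
        return -1;
    qp->connected = 1;
    return 0;
}

struct sketch_consumer { struct sketch_qp qp; int status; };

/* Consumer callback: the QP is only set up once the device is known. */
void sketch_consumer_resolved(int status, const struct sketch_device *dev,
                              const struct sketch_addr *addr, void *context)
{
    struct sketch_consumer *c = context;

    c->status = status;
    if (status != 0)
        return;
    c->qp.dev = dev;          /* stands in for creating the QP on 'dev' */
    c->qp.connected = 0;
    c->status = sketch_connect(&c->qp, addr);
}
```

Whether the connect step should also modify or even allocate the QP, as asked above, is left open here; the sketch simply keeps QP setup in the consumer's callback.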