Re: [openib-general] basic IB doubt
At 03:39 AM 8/26/2006, Gleb Natapov wrote: On Fri, Aug 25, 2006 at 03:53:12PM -0400, Talpey, Thomas wrote: Flush (sync for_device) before posting. Invalidate (sync for_cpu) before processing. So, before touching the data that was RDMAed into the buffer, the application should cache-invalidate the buffer; is this even possible from user space? (Not on x86, but it isn't needed there.) Interesting you should mention that. :-) There isn't a user verb for dma_sync, there's only deregister. The kernel can perform this for receive completions, and signaled RDMA Reads, but it can't do so for remote RDMA Writes. Only the upper layer knows where those went. There are two solutions: 1) (the practical one) user mappings must be fully consistent, within the capability of the hardware. Still, don't go depending on any specific ordering here. 2) user must deregister any mapping before inspecting the result. I doubt any of them do this, for that reason anyway. My opinion is that this will bite us in the a** some day. If anybody was running this code on the Sparc architecture it already would have. Tom. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
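The flush-before-posting / invalidate-before-processing rule amounts to an ownership handoff between CPU and device. A minimal sketch of that ordering, with the kernel's dma_sync_single_for_device()/dma_sync_single_for_cpu() calls (from <linux/dma-mapping.h>) modeled as stand-in functions, since the real API only exists in kernel context:

```c
#include <assert.h>

/* Who may currently touch the buffer's memory. */
enum buf_owner { OWNER_NONE, OWNER_DEVICE, OWNER_CPU };

struct dma_buf { enum buf_owner owner; };

/* stand-in for dma_sync_single_for_device(): flush CPU caches and
 * hand the buffer to the HCA, before posting the work request */
static void sync_for_device(struct dma_buf *b) { b->owner = OWNER_DEVICE; }

/* stand-in for dma_sync_single_for_cpu(): invalidate stale CPU cache
 * lines after the completion, before the data is inspected */
static void sync_for_cpu(struct dma_buf *b) { b->owner = OWNER_CPU; }

/* posting a receive is only legal once the device owns the buffer */
static int post_recv(struct dma_buf *b) { return b->owner == OWNER_DEVICE; }

/* reading DMAed data is only legal once it's synced back to the CPU */
static int read_data(struct dma_buf *b) { return b->owner == OWNER_CPU; }
```

The user-space gap Tom describes is exactly that nothing plays the role of sync_for_cpu() for a remote RDMA Write: only deregistration forces the handoff.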
Re: [openib-general] basic IB doubt
At 09:00 AM 8/28/2006, Gleb Natapov wrote: 2) user must deregister any mapping before inspecting the result. I doubt any of them do this, for that reason anyway. This may have big performance impact. You think? :-) My opinion is that this will bite us in the a** some day. If anybody was running this code on the Sparc architecture it already would have. AFAIK Sun runs MPI over uDAPL, but they have their own IB implementation, so maybe they handle all coherency issues in the uDAPL itself. The Sparc IOMMU supports consistent mappings, in which the i/o streaming caches are not used. There is a performance impact to using this mode, however. The best throughput is achieved using streaming with explicit software consistency. However, even in consistent mode, the Sparc API requires that the synchronization calls be made. I have never gotten a completely satisfactory answer as to why, but on the high-end server platforms, I think it's possible that the busses can't always snoop one another and the calls provide a push. Will turning on the Opteron's IOMMU introduce some of these issues to x86? Tom.
Re: [openib-general] basic IB doubt
At 12:22 PM 8/28/2006, Jason Gunthorpe wrote: On Mon, Aug 28, 2006 at 10:38:43AM -0400, Talpey, Thomas wrote: Will turning on the Opteron's IOMMU introduce some of these issues to x86? No, definitely not. The Opteron IOMMU (the GART) is a pure address translation mechanism and doesn't change the operation of the caches. Okay, that's good. However, doesn't it delay reads and writes until the necessary table walk / mapping is resolved? Because it passes all other cycles through, it seems to me that an interrupt may pass data, meaning that ordering (at least) may be somewhat different when it's present. And, those pending writes are not in the cache's consistency domain (i.e. they can't be snooped yet, right?). If Sun has a problem on larger systems I wonder if SGI Altix also has a problem? SGI Altix is definitely a real system that people use IB cards in today and it would be easy to imagine such a large system could have coherence issues with memory polling... I'd be interested in this too. Tom.
Re: [openib-general] basic IB doubt
At 12:40 PM 8/25/2006, Sean Hefty wrote: Thomas How does an adapter guarantee that no bridges or other Thomas intervening devices reorder their writes, or for that Thomas matter flush them to memory at all!? That's a good point. The HCA would have to do a read to flush the posted writes, and I'm sure it's not doing that (since it would add horrible latency for no good reason). I guess it's not safe to rely on ordering of RDMA writes after all. Couldn't the same point then be made that a CQ entry may come before the data has been posted? When the CQ entry arrives, the context that polls it off the queue must use the dma_sync_*() api to finalize any associated data transactions (known by the upper layer). This is basic, and it's the reason that a completion is so important. The completion, in and of itself, isn't what drives the synchronization. It's the transfer of control to the processor. Tom.
Re: [openib-general] basic IB doubt
At 03:23 PM 8/25/2006, Greg Lindahl wrote: On Fri, Aug 25, 2006 at 03:21:20PM -0400, [EMAIL PROTECTED] wrote: I presume you meant invalidate the cache, not flush it, before accessing DMA'ed data. Yes, this is what I meant. Sorry! Flush (sync for_device) before posting. Invalidate (sync for_cpu) before processing. On some architectures, these operations flush and/or invalidate i/o pipeline caches as well. As they should. Tom.
Re: [openib-general] basic IB doubt
At 07:46 PM 8/23/2006, Roland Dreier wrote: Greg Actually, that leads me to a question: does the vendor of Greg that adaptor say that this is actually safe? Just because Greg something behaves one way most of the time doesn't mean it Greg does it all of the time. So is it really smart to write Greg non-standard-conforming programs unless the vendor stands Greg behind that behavior? Yes, Mellanox documents that it is safe to rely on the last byte of an RDMA being written last. How does an adapter guarantee that no bridges or other intervening devices reorder their writes, or for that matter flush them to memory at all!? Without signalling the host processor, that is. Isn't that what the dma_sync() API is all about? Tom.
Re: [openib-general] ib_get_dma_mr and remote access
At 05:53 PM 8/15/2006, Louis Laborde wrote: Hi there, I would like to know if any application today uses ib_get_dma_mr verb with remote access flag(s). The NFS/RDMA client does this, if configured to do so. Otherwise, it registers specific byte regions when remote access is required. The client supports numerous memory registration strategies, to suit user requirements and HCA/RNIC limitations. It seems to me that such a dependency could first, create a security hole and second, make this verb hard to implement for some RNICs. Yes, and yes. If only local access is required for this special memory region, can it be implemented with the Reserved LKey or STag0, whichever way it's called? Sure, and I expect many consumers would be fine with this. Note however that iWARP RDMA Read requires remote write access to be granted on the destination sge's, unlike IB RDMA Read, which requires only local. Tom.
[openib-general] NFS/RDMA for Linux: client and server update release 6
Network Appliance is pleased to announce release 6 of the NFS/RDMA client and server for Linux 2.6.17. This update to the May 22 release fixes known issues, improves usability and server stability, and supports NFSv4. The code supports both InfiniBand and iWARP transports over the standard OpenFabrics Linux facility. http://sourceforge.net/projects/nfs-rdma/ https://sourceforge.net/project/showfiles.php?group_id=97628&package_id=199510 This code is running successfully at multiple user locations. A special thanks goes to Helen Chen and her team at Sandia Labs for their help in resolving multiple usability and stability issues. The code in the current release was used to produce the results reported in their presentation at the recent Commodity Cluster Computing Symposium in Baltimore. Tom Talpey, for the NFS/RDMA project.
--- Changes since RC5 ---
- 2.6.17.* kernel/transport switch target (also fixes IPv6 issues)
- NFS-RDMA client: support NFSv4
- NFS-RDMA server: kconfig changes; fully uses dma_map()/dma_unmap() api; fix race between connection acceptance and first client request; fix I/O thread not going to sleep; fix two issues in export cache handling; fix data corruption with certain pathological client alignments
- nfsrdmamount command: support NFSv4; runtime warnings on certain systems addressed
[openib-general] Fwd: WG Action: Conclusion of IP over InfiniBand (ipoib)
FYI... -- Forwarded Message -- To: ietf-announce@ietf.org From: IESG Secretary [EMAIL PROTECTED] Date: Wed, 05 Jul 2006 15:50:01 -0400 Cc: ipoverib@ietf.org, H.K. Jerry Chu [EMAIL PROTECTED], Bill Strahm [EMAIL PROTECTED] Subject: WG Action: Conclusion of IP over InfiniBand (ipoib) List-Id: ietf-announce.ietf.org List-Post: mailto:ietf-announce@ietf.org List-Help: mailto:[EMAIL PROTECTED] List-Subscribe: https://www1.ietf.org/mailman/listinfo/ietf-announce, mailto:[EMAIL PROTECTED] The IP over InfiniBand WG (ipoib) in the Internet Area has concluded. The IESG contact persons are Jari Arkko and Mark Townsley. +++ The IPOIB working group has completed its main task of defining how to run IP over InfiniBand. It has published three RFCs and a fourth one is in the RFC Editor's queue, soon to become an RFC as well. There are some additional work items in the milestone plan, a set of MIBs. But after reviewing the status and activity in the group it seems best to close the WG. There are a few individuals who are still interested in pursuing a part of the MIB work, and they are encouraged to submit their work as an AD sponsored document, when the work is completed. The mailing list for the group will remain active. ___ IETF-Announce mailing list IETF-Announce@ietf.org https://www1.ietf.org/mailman/listinfo/ietf-announce -- End of Forwarded Message --
Re: [openib-general] max_send_sge max_sge
Yep, you're confirming my comment that the sge size is dependent on the memory registration strategy (and not the protocol itself). Because you have a pool approach, you potentially have a lot of discontiguous regions. Therefore, you need more sge's. (You could have the same issue with large preregistrations, etc.) If it's just for RDMA Write, the penalty really isn't that high - you can easily break the i/o up into separate RDMA Write ops and pump them out in a sequence. The HCA streams them, and using unsignalled completion on the WRs means the host overhead can be low. For sends, it's more painful. You have to pull them up. Do you really need send inlines to be that big? I guess if you're supporting a writev() api over inline you don't have much control, but even writev has a maxiov. The approach the NFS/RDMA client takes is basically to have a pool of dedicated buffers for headers, with a certain amount of space for small sends. This maximum inline size is typically 1K or maybe 4K (it's configurable), and it copies send data into them if it fits. All other operations are posted as chunks, which are explicit protocol objects corresponding to { mr, offset, length } triplets. The protocol supports an arbitrary number of them, but typically 8 is plenty. Each chunk results in an RDMA op from the server. If the server is coded well, the RDMA streams beautifully and there is no bandwidth issue. Just some ideas. I feel your pain. Tom. At 04:34 PM 6/27/2006, Pete Wyckoff wrote: [EMAIL PROTECTED] wrote on Tue, 27 Jun 2006 09:06 -0400: At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: Unless you use it, passing the absolute maximum value supported by hardware does not seem, to me, to make sense - it will just slow you down, and waste resources. Is there a protocol out there that actually has a use for 30 sge? It's not a protocol thing, it's a memory registration thing. But I agree, that's a huge number of segments for send and receive. 2-4 is more typical. 
I'd be interested to know what wants 30 as well... This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport stack. The max_sge-1 aspect I'm complaining about isn't checked in yet. It's a file system application. The MPI-IO interface provides datatypes and file views that let a client write complex subsets of the in-memory data to a file with a single call. One case that happens is contiguous-in-file but discontiguous-in-memory, where the file system client writes data from multiple addresses to a single region in a file. The application calls MPI_File_write or a variant, and this complex buffer description filters all the way down to the OpenIB transport, which then has to figure out how to get the data to the server. These separate data regions may have been allocated all at once using MPI_Alloc_mem (rarely), or may have been used previously for file system operations so are already pinned in the registration cache. Are you implying there is more memory registration work that has to happen beyond making sure each of the SGE buffers is pinned and has a valid lkey? It would not be a major problem to avoid using more than a couple of SGEs; however, I didn't see any reason to avoid them. Please let me know if you see a problem with this approach. -- Pete
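The alternative Tom describes earlier in the thread, breaking a discontiguous buffer into a string of RDMA Writes with only the last one signalled, can be sketched as follows. The struct wr here is a hypothetical, simplified descriptor; real code would fill in struct ibv_send_wr and ibv_sge arrays instead:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work-request descriptor: which segments it carries
 * and whether its completion is signalled. */
struct wr { size_t first_seg, num_sge; int signaled; };

/* Split nsegs discontiguous segments into RDMA Write WRs of at most
 * max_sge SGEs each.  Only the final WR is signalled, so the CQ sees
 * one completion per logical i/o while the HCA streams the rest.
 * Returns the number of WRs built. */
static size_t build_rdma_writes(size_t nsegs, size_t max_sge,
                                struct wr *out)
{
    size_t n = 0, seg = 0;
    while (seg < nsegs) {
        size_t take = nsegs - seg < max_sge ? nsegs - seg : max_sge;
        out[n].first_seg = seg;
        out[n].num_sge = take;
        out[n].signaled = 0;
        seg += take;
        n++;
    }
    if (n)
        out[n - 1].signaled = 1;   /* completion fires once, at the end */
    return n;
}
```

With 30 segments and a max_sge of 4, this yields 8 back-to-back writes rather than one over-wide WR, which is why the penalty for a small max_sge is low on the RDMA Write side.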
Re: [openib-general] max_send_sge max_sge
At 08:42 AM 6/28/2006, Michael S. Tsirkin wrote: Quoting r. Talpey, Thomas [EMAIL PROTECTED]: Just some ideas. I feel your pain. Is there something that would make life easier for you? A work-request-based IBTA 1.2/iWARP-compliant FMR implementation. Please. :-) Tom.
Re: [openib-general] max_send_sge max_sge
At 10:51 AM 6/28/2006, Michael S. Tsirkin wrote: Yep. We could have an option to have the stack scale the requested values down to some legal set instead of failing an allocation. But we couldn't come up with a clean way to tell the stack e.g. what should it round down: the SGE or WR value. Do you think selecting something arbitrarily might still be a good idea? No! Well, not as the default. Otherwise, the consumer has to go back and check what happened even on success, which is a royal pain and highly inefficient. Maybe we should pass in an optional attribute structure, that is returned with the granted attributes on success, or the would-have-been attributes on failure? Tom.
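One consumer-side alternative to having the stack round values down silently is to retry the allocation with smaller caps until the provider accepts, so the consumer always knows the granted value without re-querying. A minimal sketch, with a fake provider limit standing in for whatever ibv_create_qp() would enforce:

```c
#include <assert.h>

/* Hypothetical provider limit, standing in for the hardware cap a
 * real ibv_create_qp() would enforce via a failure return. */
#define FAKE_MAX_SGE 4

/* stand-in for the verbs call: succeed (returning the granted SGE
 * count) only if the request is within the provider's limit */
static int try_create_qp(int requested_sge)
{
    return requested_sge <= FAKE_MAX_SGE ? requested_sge : -1;
}

/* Ask for what we want; on failure halve the request and retry.
 * The caller learns the granted value from the return, rather than
 * discovering a silent round-down after a "successful" create. */
static int create_qp_with_fallback(int wanted_sge)
{
    int sge = wanted_sge;
    while (sge > 0) {
        int got = try_create_qp(sge);
        if (got >= 0)
            return got;
        sge /= 2;
    }
    return -1;
}
```

This is the explicit-failure model Tom prefers: the consumer drives the scaling, and success always means "you got exactly what you asked for."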
Re: [openib-general] max_send_sge max_sge
At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote: Unless you use it, passing the absolute maximum value supported by hardware does not seem, to me, to make sense - it will just slow you down, and waste resources. Is there a protocol out there that actually has a use for 30 sge? It's not a protocol thing, it's a memory registration thing. But I agree, that's a huge number of segments for send and receive. 2-4 is more typical. I'd be interested to know what wants 30 as well... Tom.
Re: [openib-general] Local QP operation error
At 09:21 AM 6/27/2006, Ramachandra K wrote: Does this error point to some issue with the DMA address specified in the work request SGE ? Ding Ding Ding Ding! :-) We recently identified the exact issue in the NFS/RDMA server, which happened only when running on ia64. If you're not using the dma_map_* api, that's maybe something to look at. ;-) Tom.
Re: [openib-general] Mellanox HCAs: outstanding RDMAs
Mike, I am not arguing to change the standard. I am simply saying I do not want to be a victim of the default. It is my belief that very few upper layer programmers are aware of this, btw. The Linux NFS/RDMA upper layer implementation already deals with the issue, as I mentioned. It would certainly welcome a higher available IRD on Mellanox hardware however. Thanks for your comments. Tom. At 01:55 PM 6/15/2006, Michael Krause wrote: As one of the authors of IB and iWARP, I can say that both Roland and Todd's responses are correct and the intent of the specifications. The number of outstanding RDMA Reads are bounded and that is communicated during session establishment. The ULP can choose to be aware of this requirement (certainly when we wrote iSER and DA we were well aware of the requirement and we documented as such in the ULP specs) and track from above so that it does not see a stall or it can stay ignorant and deal with the stall as a result. This is a ULP choice and has been intentionally done that way so that the hardware can be kept as simple as possible and as low cost as well while meeting the breadth of ULP needs that were used to develop these technologies. Tom, you raised this issue during iWARP's definition and the debate was conducted at least several times. The outcome of these debates is reflected in iWARP and remains aligned with IB. So, unless you really want to have the IETF and IBTA go and modify their specs, I believe you'll have to deal with the issue just as other ULP are doing today and be aware of the constraint and write the software accordingly. The open source community isn't really the right forum to change iWARP and IB specifications at the end of the day. Build a case in the IETF and IBTA and let those bodies determine whether it is appropriate to modify their specs or not. 
And yes, it is a modification of the specs, and therefore of the hardware implementations as well, to address any interoperability requirements that would result (the change proposed could fragment the hardware offerings, as there are many thousands of devices in the market that would not necessarily support this change). Mike At 12:07 PM 6/6/2006, Talpey, Thomas wrote: Todd, thanks for the set-up. I'm really glad we're having this discussion! Let me give an NFS/RDMA example to illustrate why this upper layer, at least, doesn't want the HCA doing its flow control, or resource management. NFS/RDMA is a credit-based protocol which allows many operations in progress at the server. Let's say the client is currently running with an RPC slot table of 100 requests (a typical value). Of these requests, some workload-specific percentage will be reads, writes, or metadata. All NFS operations consist of one send from client to server, some number of RDMA writes (for NFS reads) or RDMA reads (for NFS writes), then terminated with one send from server to client. The number of RDMA read or write operations per NFS op depends on the amount of data being read or written, and also the memory registration strategy in use on the client. The highest-performing such strategy is an all-physical one, which results in one RDMA-able segment per physical page. NFS r/w requests are, by default, 32KB, or 8 pages typical. So, typically 8 RDMA requests (read or write) are the result. To illustrate, let's say the client is processing a multi-threaded workload, with (say) 50% reads, 20% writes, and 30% metadata such as lookup and getattr. A kernel build, for example. Therefore, of our 100 active operations, 50 are reads for 32KB each, 20 are writes of 32KB, and 30 are metadata (non-RDMA). To the server, this results in 100 requests, 100 replies, 400 RDMA writes, and 160 RDMA Reads. Of course, these overlap heavily due to the widely differing latency of each op and the highly distributed arrival times. 
But, for the example this is a snapshot of current load. The latency of the metadata operations is quite low, because lookup and getattr are acting on what is effectively cached data. The reads and writes however, are much longer, because they reference the filesystem. When disk queues are deep, they can take many ms. Imagine what happens if the client's IRD is 4 and the server ignores its local ORD. As soon as a write begins execution, the server posts 8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are sent, the fifth stalls, and stalls the send queue! Even when three RDMA Reads complete, the queue remains stalled, it doesn't unblock until the fourth is done and all the RDMA Reads have been initiated. But, what just happened to all the other server send traffic? All those metadata replies, and other reads which completed? They're stuck, waiting for that one write request. In my example, these number 99 NFS ops, i.e. 654 WRs! All for one NFS write! The client operation stream effectively became single threaded. What good is the rapid initiation of RDMA Reads you describe in the face of this?
[openib-general] Re: Mellanox HCAs: outstanding RDMAs
At 08:44 AM 6/6/2006, Michael S. Tsirkin wrote: MST, are you disagreeing that RDMA Reads can stall the queue? I don't disagree with this of course. I was simply suggesting to ULP designers to read the chapter 9.5 and become aware of the rules, taking them into account at early stages of protocol design. :-) RTFM? I still think flow control is a wrong and dangerous thing for RDMA Read. If it never happened, and the connections just failed, we'd never have the issue. Also, I'm certain we'll see upper layers that work on one provider, only to fail on another. Sigh. Tom.
[openib-general] Re: Mellanox HCAs: outstanding RDMAs
At 08:56 AM 6/6/2006, Michael S. Tsirkin wrote: The core spec does not require it. An implementation *may* enforce it, but is not *required* to do so. And as pointed out in the other message, there are repercussions of doing so. Interesting, I wasn't aware of such interpretation of the spec. When QP is modified to RTS, the initiator depth is passed to it, which suggests that the provider must obey, not ignore this parameter. No? This is the difference between may and must. The value is provided, but I don't see anything in the spec that makes a requirement on its enforcement. Table 107 says the consumer can query it, that's about as close as it comes. There's some discussion about CM exchange too. Don't forget about iWARP, btw. Tom.
Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
At 10:40 AM 6/6/2006, Roland Dreier wrote: Thomas This is the difference between may and must. The value Thomas is provided, but I don't see anything in the spec that Thomas makes a requirement on its enforcement. Table 107 says the Thomas consumer can query it, that's about as close as it Thomas comes. There's some discussion about CM exchange too. This seems like a very strained interpretation of the spec. For example, there's no explicit language in the IB spec that requires an HCA to use the destination LID passed via a modify QP operation, but I don't think anyone would seriously argue that an implementation that sent messages to some other random destination was compliant. In the same way, if I pass a limit for the number of outstanding RDMA/atomic operations in to a modify QP operation, I would expect the HCA to use that limit. - R. I don't see how strained has anything to do with it. It's not saying anything either way. So, a legal implementation can make either choice. We're talking about the spec! But, it really doesn't matter. The point is, an upper layer should be paying attention to the number of RDMA Reads it posts, or else suffer either the queue-stalling or connection-failing consequences. Bad stuff either way. Tom.
RE: [openib-general] Re: Mellanox HCAs: outstanding RDMAs
Todd, thanks for the set-up. I'm really glad we're having this discussion! Let me give an NFS/RDMA example to illustrate why this upper layer, at least, doesn't want the HCA doing its flow control, or resource management. NFS/RDMA is a credit-based protocol which allows many operations in progress at the server. Let's say the client is currently running with an RPC slot table of 100 requests (a typical value). Of these requests, some workload-specific percentage will be reads, writes, or metadata. All NFS operations consist of one send from client to server, some number of RDMA writes (for NFS reads) or RDMA reads (for NFS writes), then terminated with one send from server to client. The number of RDMA read or write operations per NFS op depends on the amount of data being read or written, and also the memory registration strategy in use on the client. The highest-performing such strategy is an all-physical one, which results in one RDMA-able segment per physical page. NFS r/w requests are, by default, 32KB, or 8 pages typical. So, typically 8 RDMA requests (read or write) are the result. To illustrate, let's say the client is processing a multi-threaded workload, with (say) 50% reads, 20% writes, and 30% metadata such as lookup and getattr. A kernel build, for example. Therefore, of our 100 active operations, 50 are reads for 32KB each, 20 are writes of 32KB, and 30 are metadata (non-RDMA). To the server, this results in 100 requests, 100 replies, 400 RDMA writes, and 160 RDMA Reads. Of course, these overlap heavily due to the widely differing latency of each op and the highly distributed arrival times. But, for the example this is a snapshot of current load. The latency of the metadata operations is quite low, because lookup and getattr are acting on what is effectively cached data. The reads and writes however, are much longer, because they reference the filesystem. When disk queues are deep, they can take many ms. 
Imagine what happens if the client's IRD is 4 and the server ignores its local ORD. As soon as a write begins execution, the server posts 8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads are sent, the fifth stalls, and stalls the send queue! Even when three RDMA Reads complete, the queue remains stalled, it doesn't unblock until the fourth is done and all the RDMA Reads have been initiated. But, what just happened to all the other server send traffic? All those metadata replies, and other reads which completed? They're stuck, waiting for that one write request. In my example, these number 99 NFS ops, i.e. 654 WRs! All for one NFS write! The client operation stream effectively became single threaded. What good is the rapid initiation of RDMA Reads you describe in the face of this? Yes, there are many arcane and resource-intensive ways around it. But the simplest by far is to count the RDMA Reads outstanding, and for the *upper layer* to honor ORD, not the HCA. Then, the send queue never blocks, and the operation streams never lose parallelism. This is what our NFS server does. As to the depth of IRD, this is a different calculation; it's the delay-bandwidth product of the RDMA Read stream. 4 is good for local, low latency connections. But over a complicated switch infrastructure, or heaven forbid a dark fiber long link, I guarantee it will cause a bottleneck. This isn't an issue except for operations that care, but it is certainly detectable. I would like to see if a pure RDMA Read stream can fully utilize a typical IB fabric, and how much headroom an IRD of 4 provides. Not much, I predict. Closing the connection if IRD is insufficient to meet goals isn't a good answer, IMO. How does that benefit interoperability? Thanks for the opportunity to spout off again. Comments welcome! Tom. 
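The upper-layer counting Tom advocates can be sketched as a software credit scheme (hypothetical structure and function names; a real server would keep the deferred reads on a list alongside their work requests rather than a bare counter):

```c
#include <assert.h>

/* Hypothetical upper-layer accounting: never hand the HCA more RDMA
 * Reads than the peer's advertised IRD; queue the excess in software
 * so the send queue never stalls and other sends keep flowing. */
struct rd_limiter {
    int ird;          /* peer's advertised inbound read depth */
    int outstanding;  /* reads currently posted to the HCA */
    int queued;       /* reads deferred in software */
};

/* try to issue a read; returns 1 if posted, 0 if deferred */
static int issue_read(struct rd_limiter *l)
{
    if (l->outstanding < l->ird) {
        l->outstanding++;
        return 1;
    }
    l->queued++;
    return 0;
}

/* a read completed: release the credit, issue one deferred read */
static void read_done(struct rd_limiter *l)
{
    l->outstanding--;
    if (l->queued > 0) {
        l->queued--;
        l->outstanding++;
    }
}
```

In the 8-read NFS write example with IRD 4, the limiter posts 4 reads and defers 4; each completion releases exactly one deferred read, and sends unrelated to the write are never blocked behind them.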
At 12:43 PM 6/6/2006, Rimmer, Todd wrote: Talpey, Thomas Sent: Tuesday, June 06, 2006 10:49 AM At 10:40 AM 6/6/2006, Roland Dreier wrote: Thomas This is the difference between may and must. The value Thomas is provided, but I don't see anything in the spec that Thomas makes a requirement on its enforcement. Table 107 says the Thomas consumer can query it, that's about as close as it Thomas comes. There's some discussion about CM exchange too. This seems like a very strained interpretation of the spec. For I don't see how strained has anything to do with it. It's not saying anything either way. So, a legal implementation can make either choice. We're talking about the spec! But, it really doesn't matter. The point is, an upper layer should be paying attention to the number of RDMA Reads it posts, or else suffer either the queue-stalling or connection-failing consequences. Bad stuff either way. Tom. Somewhere beneath this discussion is a bug in the application or IB stack. I'm not sure which "may" in the spec you are referring to, but the "may"s I have found all are for cases where the responder might support only 1 outstanding request
RE: [openib-general] Mellanox HCAs: outstanding RDMAs
At 10:03 AM 6/3/2006, Rimmer, Todd wrote: Yes, the limit of outstanding RDMAs is not related to the send queue depth. Of course you can post many more than 4 RDMAs to a send queue -- the HCA just won't have more than 4 requests outstanding at a time. To further clarify, this parameter only affects the number of concurrent outstanding RDMA Reads which the HCA will process. Once it hits this limit, the send Q will stall waiting for issued reads to complete prior to initiating new reads. It's worse than that - the send queue must stall for *all* operations. Otherwise the hardware has to track in-progress operations which are queued after stalled ones. It really breaks the initiation model. Semantically, the provider is not required to provide any such flow control behavior by the way. The Mellanox one apparently does, but it is not a requirement of the verbs, it's a requirement on the upper layer. If more RDMA Reads are posted than the remote peer supports, the connection may break. The number of outstanding RDMA Reads is negotiated by the CM during connection establishment and the QP which is sending the RDMA Read must have a value configured for this parameter which is <= the remote end's capability. In other words, we're probably stuck at 4. :-) I don't think there is any Mellanox-based implementation that has ever supported more than 4. In previous testing by Mellanox on SDR HCAs they indicated values beyond 2-4 did not improve performance (and in fact required more RDMA resources be allocated for the corresponding QP or HCA). Hence I suspect a very large value like 128 would offer no improvement over values in the 2-8 range. I am not so sure of that. For one thing, it's dependent on VERY small latencies. The presence of a switch, or link extenders will make a huge difference. Second, heavy multi-QP firmware loads will increase the latencies. Third, constants are pretty much never a good idea in networking. The NFS/RDMA client tries to set the maximum IRD value it can obtain. 
RDMA Read is used quite heavily by the server to fetch client data segments for NFS writes. Tom. ___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
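The negotiation rule Todd describes - the initiator's configured outstanding-RDMA-Read depth must not exceed the responder resources the remote end advertises - can be sketched as a toy model. The function name and structure are illustrative, not a real verbs or CM API:

```python
def negotiate_rdma_read_limits(local_initiator_depth, remote_responder_resources):
    """Model of the CM exchange described above: the depth of RDMA Reads a
    QP may have outstanding (its initiator depth) must be <= the responder
    resources advertised by the remote peer, so the connection settles on
    the minimum of the two. Illustrative only, not a real API."""
    return min(local_initiator_depth, remote_responder_resources)

# An initiator willing to issue 128 concurrent reads against a peer
# that only supports 4 is still limited to 4 ("stuck at 4").
print(negotiate_rdma_read_limits(128, 4))  # -> 4
```

Posting more reads than this negotiated value is an upper-layer protocol error: the verbs provider is not required to flow-control for you, and the connection may break.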
Re: [openib-general] Question about the IPoIB bandwidth performance ?
At 11:38 AM 6/5/2006, hbchen wrote: Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. IPoIB=420MB/sec bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
At 12:11 PM 6/5/2006, hbchen wrote: Perhaps you are out of CPU. Tom, I am HB Chen from LANL, not the Helen Chen from SNL. Oops, sorry! I have too many email messages going by. :-) HB, then. I didn't run out of CPU. It is about 70-80% of CPU utilization. But, is one CPU at 100%? Interrupt processing, for example. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth utilization, why not the IPoIB? I am not familiar with the implementation Myrinet uses. In any case, I am not saying that an emulation can't reach certain goals, just that they will pretty much always be inferior to native approaches. Sometimes far inferior. Tom.
Re: [openib-general] Question about the IPoIB bandwidth performance ?
Who said anything about Ethernet emulation. Hal said he is running straight Netperf over IB not ethernet emulation. I don't think that any IB HCAs today support offloaded checksum and large send. You are comparing apples and oranges. I consider IPoIB to be Ethernet emulation. As for apples and oranges, my point exactly. Tom. At 12:53 PM 6/5/2006, Bernard King-Smith wrote: Thomas Talpey said: At 11:38 AM 6/5/2006, hbchen wrote: Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization is still very low. IPoIB=420MB/sec bandwidth utilization= 420/1024 = 41.01% Helen, have you measured the CPU utilizations during these runs? Perhaps you are out of CPU. Outrageous opinion follows. Frankly, an IB HCA running Ethernet emulation is approximately the world's worst 10GbE adapter (not to put too fine of a point on it :-) ) There is no hardware checksumming, nor large-send offloading, both of which force overhead onto software. And, as you just discovered it isn't even 10Gb! In general, network emulation layers are always going to perform more poorly than native implementations. But this is only a generality learned from years of experience with them. Tom. Hold on here. Who said anything about Ethernet emulation. Hal said he is running straight Netperf over IB not ethernet emulation. I don't think that any IB HCAs today support offloaded checksum and large send. You are comparing apples and oranges. The only appropriate comparison is to use the IBM HCA compared to the mthca adapters. I think Hal's point is actually comparing any IB adapter against GigE and Myrinet. Both the mthca and IBM HCAs should get similar IPoIB performance using identical OpenIB stacks. Bernie King-Smith IBM Corporation Server Group Cluster System Performance [EMAIL PROTECTED] (845)433-8483 Tie. 293-8483 or wombat2 on NOTES We are not responsible for the world we are born into, only for the world we leave when we die. 
So we have to accept what has gone before us and work to change the only thing we can, -- The Future. William Shatner
Re: [openib-general] NFS/RDMA for Linux: client and server update release 5
[Cutting down the reply list to more relevant parties...] It's hard to say what is crashing, but I suspect the CM code, due to the process context being ib_cm. Is there some reason you're not getting symbols in the stack trace? If you could feed this oops text to ksymoops it will give us more information. In any case, it appears the connection is succeeding at the server, but the client RPC code isn't being signalled that it has done so. Perhaps this is due to a lost reply, but the NFS code hasn't actually started to do anything. So, I would look for IB-level issues. Is the client running the current OpenFabrics svn top-of-tree? Let's take this offline to diagnose, unless someone has an idea why the CM would be failing. The ksymoops analysis would help. Tom. At 07:19 PM 5/23/2006, helen chen wrote: Hi Tom, I have downloaded your release 5 of the NFS/RDMA and am having trouble mounting the rdma nfs, the ./nfsrdmamount -o rdma on16-ib:/mnt/rdma /mnt/rdma command never returned. and the dmesg for client and server are: -- dmesg from client -- RPCRDMA Module Init, register RPC RDMA transport Defaults: MaxRequests 50 MaxInlineRead 1024 MaxInlineWrite 1024 Padding 0 Memreg 5 RPC: Registered rdma transport module. RPC: Registered rdma transport module. 
RPC: xprt_setup_rdma: 140.221.134.221:2049 nfs: server on16-ib not responding, timed out Unable to handle kernel NULL pointer dereference at RIP: [] PGD a9f2b067 PUD a8ca2067 PMD 0 Oops: 0010 [1] PREEMPT SMP CPU 1 Modules linked in: xprtrdma ib_srp iscsi_tcp scsi_transport_iscsi scsi_mod Pid: 346, comm: ib_cm/1 Not tainted 2.6.16.16 #4 RIP: 0010:[] [] RSP: 0018:8100af5a1c30 EFLAGS: 00010246 RAX: 8100aeff2400 RBX: 8100aeff2400 RCX: 8100afc9e458 RDX: RSI: 8100af5a1d48 RDI: 8100aeff2440 RBP: 8100aeff2440 R08: R09: R10: 0003 R11: R12: 8100aeff2500 R13: ff99 R14: 8100af5a1d48 R15: 8036c72c FS: 00505ae0() GS:810003ce25c0() knlGS: CS: 0010 DS: 0018 ES: 0018 CR0: 8005003b CR2: CR3: ad587000 CR4: 06a0 Process ib_cm/1 (pid: 346, threadinfo 8100af5a, task 8100afea8100) Stack: 8802a331 8100aeff2500 0001 8100aeff2440 804011fd 8802a343 8100afdd6100 80364ee4 0100 Call Trace: [8802a331] [804011fd] [8802a343] [80364ee4] [80364341] [8036f85c] [8036fcf2] [8036baeb] [8036bdc1] [8036d6fe] [8036c72c] [801377b4] [801377fb] [8013a960] [80137900] [8012309b] [8013a960] [8012309b] [8013a960] [8013a937] [8010b8d6] [8013a960] [801160b9] [801160b9] [801160b9] [8013a86f] [8010b8ce] Code: Bad RIP value. RIP [] RSP 8100af5a1c30 CR2: --dmesg from server -- nfsd: request from insecure port 140.221.134.220, port=32768! svc_rdma_recvfrom: transport 81007e8f2800 is closing svc_rdma_put: Destroying transport 81007e8f2800, cm_id=81007e945200, sk_flags=154, sk_inuse=0 Did I forget to configure necessary components into my kernel? Thanks, Helen On Mon, 2006-05-22 at 13:25, Talpey, Thomas wrote: Network Appliance is pleased to announce release 5 of the NFS/RDMA client and server for Linux 2.6.16.16. This update to the April 19 release adds improved server parallel performance and fixes various issues. This code supports both Infiniband and iWARP transports. http://sourceforge.net/projects/nfs-rdma/ http://sourceforge.net/project/showfiles.php?group_id=97628package_id=191427 Comments and feedback welcome. 
We're especially interested in successful test reports! Thanks. Tom Talpey, for the various NFS/RDMA projects.
Re: [openib-general] NFS/RDMA for Linux: client and server update release 5
OBTW, I just noticed that your server printed the message: nfsd: request from insecure port 140.221.134.220, port=32768! This means the /mnt/rdma export isn't configured with insecure, which causes the server to close the connection. Because the IB CM does not allow the client to use so-called secure ports (< 1024), you need to set this flag on any RDMA exports; this is mentioned in our README. The jury is out on whether it's worth implementing the source port emulation in the IB CM. The problem is that to do so requires the CM to interface with the local IP port space, or manage one of its own. So for now, NFS/RDMA just recommends using the exports flag. Frankly, it provides no additional security, and is misnamed... Tom.
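The check behind that "insecure port" message can be modeled in a few lines. This is a sketch of the semantics described above (privileged ports are those below 1024; without the insecure export option the server rejects higher source ports), not the actual nfsd code; names are illustrative:

```python
NFS_PRIVILEGED_PORT_MAX = 1023  # ports < 1024 are the "secure"/privileged range

def export_accepts(client_port, insecure_export):
    """Toy model of the export check: a client coming from a privileged
    port is always accepted; otherwise the export must carry the
    'insecure' option or the server closes the connection."""
    if client_port <= NFS_PRIVILEGED_PORT_MAX:
        return True
    return insecure_export

# The IB CM cannot bind clients to privileged ports, so an RDMA mount
# arrives from a port like 32768 and needs the 'insecure' export flag.
print(export_accepts(32768, False))  # -> False (connection closed)
print(export_accepts(32768, True))   # -> True
```

As the message notes, the flag buys no real security either way; it only satisfies the historical port check.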
Re: [openib-general] Re: [PATCH] mthca: fix posting lists of 256 entries for tavor
At 10:52 AM 5/24/2006, Roland Dreier wrote: Michael No idea - the site seems to be down :) It's working from here -- must be an issue in your network. I saw the same error, but adding www. to the openib.org url fixes it. Tom. Anyway the report is: * Host Architecture : x86_64 Linux Distribution: Fedora Core release 4 (Stentz) Kernel Version: 2.6.11-1.1369_FC4smp Memory size : 4071672 kB Driver Version: OFED-1.0-rc5-pre5 HCA ID(s) : mthca0 HCA model(s) : 25208 FW version(s) : 4.7.600 Board(s) : MT_00A0010001 * posting a list of multiples of 256 WR to SRQ or QP may be corrupted. The WR list that is being posted may be posted to a different QP than the QP number of the QP handle. test to reproduce it: qp_test daemon: qp_test --daemon client: qp_test --thread=15 --oust=256 --srq CLIENT SR 1 1 or qp_test --thread=15 --oust=256 CLIENT SR 1 1
Re: [openib-general] [PATCH 1/2] mthca support for max_map_per_fmr device attribute
Doesn't this change only *increase* the window of vulnerability which FMRs suffer? I.e. when you say dirty, you mean still mapped, right? Tom. At 07:11 AM 5/23/2006, Or Gerlitz wrote: Or Gerlitz wrote: The max fmr remaps device attribute is not set by the driver, so the generic fmr_pool uses a default of 32. Enlarging this quantity would make the amortized cost of remaps lower. With the current mthca default profile on a memfull HCA, 17 bits are used for MPT addressing, so an FMR can be remapped 2^15 - 1 times, far more than 32. Actually, the bigger (than unmap amortized cost) problem i was facing with the unmap count being very low is the following: say my app publishes N credits and serving each credit consumes one FMR, so my app implementation created the pool with 2N FMRs and set the watermark to N. When requests come fast enough, there's a window in time when there's an unmapping of N FMRs running at batch, but out of the remaining N FMRs some are already dirty and can't be used to serve a credit. So the app fails temporarily... So, setting the watermark to 0.5N might solve this, but since enlarging the number of remaps is trivial, i'd like to do it first. The app i am talking about is a SCSI LLD (eg iSER, SRP) where each SCSI command consumes one FMR and the LLD posts to the SCSI ML how many commands can be issued in parallel. Or.
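The dirty-FMR window Or describes can be made concrete with a toy pool model: 2N FMRs, a flush watermark of N, and released FMRs sitting "dirty" (still mapped) until a batch flush. This is an illustrative simulation of the pool behavior discussed above, not the kernel fmr_pool implementation:

```python
class FmrPoolModel:
    """Toy model of the scenario above: a pool sized 2N with watermark N.
    A mapped FMR becomes 'dirty' on release and is unusable until the
    dirty count reaches the watermark and a batch flush runs. A fast
    burst can exhaust the clean list even though the pool nominally has
    capacity, which is the temporary failure described."""

    def __init__(self, n_credits):
        self.clean = 2 * n_credits
        self.dirty = 0
        self.watermark = n_credits

    def map_one(self):
        if self.clean == 0:
            return False        # no clean FMR available: request fails
        self.clean -= 1
        return True

    def unmap_one(self):
        self.dirty += 1
        if self.dirty >= self.watermark:
            self.clean += self.dirty  # batch flush returns FMRs to clean
            self.dirty = 0

n = 8
pool = FmrPoolModel(n)
# A burst that maps 2N+1 FMRs before any flush completes: the last fails.
results = [pool.map_one() for _ in range(2 * n + 1)]
print(results.count(False))  # -> 1
```

Lowering the watermark (to 0.5N, say) shrinks this window by flushing more often, at the cost of more frequent unmap batches; raising the remap limit instead reduces how often full deregistration is needed at all.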
[openib-general] NFS/RDMA for Linux: client and server update release 5
Network Appliance is pleased to announce release 5 of the NFS/RDMA client and server for Linux 2.6.16.16. This update to the April 19 release adds improved server parallel performance and fixes various issues. This code supports both Infiniband and iWARP transports. http://sourceforge.net/projects/nfs-rdma/ http://sourceforge.net/project/showfiles.php?group_id=97628&package_id=191427 Comments and feedback welcome. We're especially interested in successful test reports! Thanks. Tom Talpey, for the various NFS/RDMA projects.
RE: [openib-general] CMA IPv6 support
At 01:05 PM 5/15/2006, Sean Hefty wrote: I came to the same conclusion a couple of weeks ago. Rdma_create_id() will likely need an address family parameter, or the user must explicitly bind before calling listen. Rdma_create_id() already takes a struct sockaddr *, which has an address family selector (sa_family) to define the contained address format. Why is that one not sufficient? Tom.
RE: [openib-general] CMA IPv6 support
At 01:26 PM 5/15/2006, Talpey, Thomas wrote: At 01:05 PM 5/15/2006, Sean Hefty wrote: I came to the same conclusion a couple of weeks ago. Rdma_create_id() will likely need an address family parameter, or the user must explicitly bind before calling listen. Rdma_create_id() already takes a struct sockaddr *, which has an address family selector (sa_family) to define the contained address format. Why is that one not sufficient? Scratch that, I was looking at our usage one layer up in the NFS/RDMA code, which does have the struct sockaddr *. Looking at rdma_listen(), the code I see checks for bound state before proceeding to listen:

int rdma_listen(struct rdma_cm_id *id, int backlog)
{
	struct rdma_id_private *id_priv;
	int ret;

	id_priv = container_of(id, struct rdma_id_private, id);
	if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN))
		return -EINVAL;
	...

This makes sense, because sockets work this way, and servers generally want to listen on a port of their own choosing. So, I think it's already there. Right? Tom.
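The cma_comp_exch() call in that snippet is an atomic compare-and-exchange on the id's state: listen only proceeds if the id is currently ADDR_BOUND, and the id moves to LISTEN in the same step. A minimal model of that state transition, with the locking purely illustrative (the kernel uses its own primitives):

```python
import threading

ADDR_BOUND, LISTEN, IDLE = "ADDR_BOUND", "LISTEN", "IDLE"

class CmIdModel:
    """Toy rdma_cm_id holding just the state the snippet above checks."""
    def __init__(self, state=IDLE):
        self.state = state
        self._lock = threading.Lock()

    def comp_exch(self, expected, new):
        """Atomically move expected -> new; fail if state differs."""
        with self._lock:
            if self.state != expected:
                return False
            self.state = new
            return True

def rdma_listen(cm_id):
    # Mirrors the kernel logic: reject listen on an unbound id.
    if not cm_id.comp_exch(ADDR_BOUND, LISTEN):
        return -22  # -EINVAL
    return 0

print(rdma_listen(CmIdModel(IDLE)))        # -> -22: must bind first
print(rdma_listen(CmIdModel(ADDR_BOUND)))  # -> 0
```

This is why the bind-before-listen requirement makes the address family question moot: by listen time the id already carries a bound sockaddr whose sa_family settles it.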
RE: [openib-general] CMA IPv6 support
At 02:04 PM 5/15/2006, Sean Hefty wrote: This is a slightly older version of the code. There's now a call to bind if the user hadn't previously called it. Ok, and sorry for not checking the top-of-tree. So I like the old code better (requiring the bind). Besides, if the user does bind, then the family argument would be completely redundant. I assume you'd continue to support rdma_bind_addr() letting the system choose a port by binding to port 0... Tom.
Re: [openib-general] ip over ib throughtput
Hi Shirley - I had a chance to try with the tiny blocksizes but I'm afraid the results aren't useful to estimate max throughput. The server I am using runs out of CPU at about 33,600 IOPS for small I/Os (<=4KB), so with 2000 byte reads, all I can get is about 65MB/sec. (I get 33MB/s with 1KB, 120MB/s with 4KB, etc). And recall with NFS-default 32KB reads I get 450MB/s. All these limits are due to this server's CPU at 100%. Time to find a bigger server! The good news is, performance is nice and flat right up until the server hits the CPU wall. In fact, the more directio threads I run in parallel, the lower the client overhead. With 50 threads issuing reads, I see as little as 0.5 interrupts per I/O! Sorry I couldn't push more throughput using only small reads. I could trunk the I/O to multiple servers, but I assume you're only interested in single-stream results. Tom. At 11:11 PM 5/10/2006, Shirley Ma wrote: Talpey, Thomas [EMAIL PROTECTED] wrote on 05/10/2006 03:10:57 PM: Sure, but I wonder why it's interesting. Nobody ever uses NFS in such small blocksizes, and 2044 bytes would mean, say, 1800 bytes of payload. What data are you looking for, throughput and overhead? Direct RDMA, or inline? Tom. Throughput. I am wondering how much room IPoIB performance (throughput) can go. Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638
Re: [openib-general][patch review] srp: fmr implementation,
I certainly won't shoot you - I agree. The other risk of the current FMRs is that people will think the F means Fast. Tom. At 08:32 PM 5/10/2006, Tom Tucker wrote: On Wed, 2006-05-10 at 08:53 -0700, Roland Dreier wrote: Thomas: I am planning to test this some more in the next few weeks, but what I'd really like to see is an IBTA 1.2-compliant implementation, and one that operated on work queue entries (not synchronous verbs). Is that being worked on? No current hardware supports that as far as I know. (Well, ipath could fake it since they already implement all the verbs in software) I'm almost certain I'll be shot for saying this, but isn't there a danger of confusion with real FMRs when the HW shows up? If the benefit isn't there -- why do it if the application outcomes are almost certainly all bad?
Re: [openib-general][patch review] srp: fmr implementation,
At 03:12 PM 5/9/2006, Roland Dreier wrote: BTW, does Mellanox (or anyone else) have any numbers showing that using FMRs makes any difference in performance on a semi-realistic benchmark? Not me. Using the current FMRs to register/deregister windows for each NFS/RDMA operation yields only a slight performance improvement over ib_reg_phys_mr(), and I suspect this is mainly from the fact that FMRs are page-rounded. Additionally, I find that the queue pair (or perhaps the completion queue) seems to hang unpredictably; new events get stuck, only to flush after the upper layer times out and closes the connection. What I really don't like about the current FMRs is that they seem to be optimized only for lazy deregistration; the FMR pools attempt to defer the deregistration somewhat indefinitely. This is an enormous security hole, and pretty much defeats the point of dynamic registration. The NFS/RDMA client has full-physical mode for users that want speed in well-protected environments. And it's a LOT faster. I am planning to test this some more in the next few weeks, but what I'd really like to see is an IBTA 1.2-compliant implementation, and one that operated on work queue entries (not synchronous verbs). Is that being worked on? Tom.
Re: [openib-general] ip over ib throughtput
At 11:13 PM 5/9/2006, Shirley Ma wrote: Have you tried to send payload smaller than 2044? Any difference? You mean MTU or ULP payload? The default NFS reads and writes are 32KB, and in the addressing mode used in these tests they were broken into 8 page-sized RDMA ops. So, there were 9 ops from the server, per NFS read. I used the default MTU so these were probably 19 messages on the wire. I don't expect much difference with smaller MTU, but smaller NFS ops would be noticeable. Tom.
Re: [openib-general] ip over ib throughtput
At 10:05 AM 5/10/2006, Shirley Ma wrote: I meant payload less than or equal to 2044, not IB MTU. IPoIB can only send <=2044 payload per ib_post_send(). NFS/RDMA in this case send 32KB per ib_post_send(). Actually, in the cases I mentioned earlier, the NFS/RDMA server is posting 8 4KB RDMA writes and one ~200 byte send to satisfy the 32KB direct read issued by the client. It's possible for the client to construct many other requests however, so it's possible to result in a 32KB single inline (nonRDMA) message, or if scatter/gather memory registration is available, a single 32KB RDMA followed by the 200 byte reply. Obviously, there are significant resource differences between these. Which one to use can depend on many factors. It would be nice to know the performance difference under same payload for IPoIB over UD and NFS/RDMA. Is that possible? Sure, but I wonder why it's interesting. Nobody ever uses NFS in such small blocksizes, and 2044 bytes would mean, say, 1800 bytes of payload. What data are you looking for, throughput and overhead? Direct RDMA, or inline? Tom.
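The 8-writes-plus-one-send pattern above follows directly from paging the payload. A small sketch of that arithmetic, with the page size and reply size taken from the figures in the message (illustrative, not an NFS/RDMA API):

```python
PAGE = 4096

def server_ops_for_direct_read(length, page=PAGE, reply_bytes=200):
    """Model of the message pattern described above for an NFS/RDMA direct
    read when registration covers one page per segment: the server posts
    one page-sized RDMA Write per page of payload, then a small send
    carrying the RPC reply."""
    rdma_writes = (length + page - 1) // page  # ceiling division
    return rdma_writes, reply_bytes

writes, reply = server_ops_for_direct_read(32 * 1024)
print(writes, reply)  # -> 8 200: eight 4KB RDMA Writes plus a ~200 byte send
```

With scatter/gather registration the same read collapses to a single 32KB RDMA Write followed by the reply send, which is the resource trade-off the message describes.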
Re: [openib-general][patch review] srp: fmr implementation,
At 11:36 AM 5/10/2006, Vu Pham wrote: I can get ~780 MB/s max without FMRs and ~920 MB/s with FMRs (using 256 KB sequential read direct IO request) In the without case, what memory registration strategy? Also, what is the CPU utilization on the initiator in the two runs (i.e. is the 780MB/s run CPU limited)? Do you have performance results with smaller blocksizes? Thanks, Tom.
Re: [openib-general] ip over ib throughtput
Shirley, Hassan - I am *very* interested in these results, and I want to at least mention that I'm doing similar NFS/RDMA testing, and getting some contrasting results. 699040 699040 16384 60.00 3668.07 (458MB/s) cpu utilization was around 95%. On my dual-2.4GHz Xeon, with the relatively untuned NFS/RDMA client on 2.6.16.6, I am able to pull about 450MB/sec of read throughput at 35% total CPU. This is using 16 threads of NFS direct i/o (O_DIRECT) to a midrange NetApp server, I did achieve a similar result with the Linux NFS/RDMA server (but only after hotwiring the ext2 interface because I don't have the spindles). I am using a dedicated filesystem test to generate the load, and also iozone. These NFS/RDMA direct reads use RDMA writes from the server to the client. Also, this was with client hyperthreading disabled and a dual-processor Dell, I could reboot with a single CPU to get more comparable results. But, the throughput was limited by server CPU (100%), the client was actually loafing a little bit. I thought it was interesting that a filesystem achieves the same throughput at better overhead than a dedicated network test. :-) And I haven't played with interrupt affinity at all. Tom. At 07:23 PM 5/8/2006, Shirley Ma wrote: I am testing most of my patches. Under 1. Intel(R) Xeon(TM) CPU 2.80GHz, one cpu, 2. fw-23108-3_4_000-MHXL-CF128-T.bin 3. pci-x without msi_x enabled 4. kernel 2.6.16 5. netperf-2.4.0 6. SVN 68XX+several IPoIB patches The best result I got so far: Testing with the following command line: netperf -l 60 -H 10.1.1.100 -t TCP_STREAM -i 10,2 -I 95,5 -- -m 16384 -s 349520 -S 349520 TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.1.100 (10.1.1.100) port 0 AF_INET : +/-2.5% @ 95% conf.

Recv   Send   Send
Socket Socket Message Elapsed
Size   Size   Size    Time    Throughput
bytes  bytes  bytes   secs.   10^6bits/sec

699040 699040 16384   60.00   3668.07  (458MB/s)

cpu utilization was around 95%. 
Thanks Shirley Ma IBM Linux Technology Center 15300 SW Koll Parkway Beaverton, OR 97006-6063 Phone(Fax): (503) 578-7638 Hassan M. Jafri [EMAIL PROTECTED] wrote on 05/08/2006 03:52 PM: Re: [openib-general] ip over ib throughtput I cant crank out more than 150 MB/sec with my 2.0 GHz xeons. verbs level benchmarks, however give decent numbers for bandwidth. With netperf, the server side CPU usage is 99% which is much higher than other posted bandwidth results on this thread. Any suggestions? Here is the complete configuration for my bandwidth tests Kernel-2.6.15.4 netperf-2.3-3 OpenIB rev 6552 MTLP23108-CF128 Firmware 3.4.0 MSI-X is enabled for the HCA -- Here is the netperf output TCP STREAM TEST to 192.168.2.2

Recv   Send   Send            Utilization  Service Demand
Socket Socket Message Elapsed Send   Recv  Send   Recv
Size   Size   Size    Time    Throughput   local  remote local  remote
bytes  bytes  bytes   secs.   MBytes/s     % T    % T    us/KB  us/KB

262142 262142 32768   10.01   151.32       59.66  99.84  7.700  12.886

--- Here is ib0 config for one of the nodes

ib0  Link encap:UNSPEC HWaddr 00-02-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
     inet addr:192.168.2.1 Bcast:192.168.2.255 Mask:255.255.255.0
     inet6 addr: fe80::202:c902:0:3ce9/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
     RX packets:1724527 errors:0 dropped:0 overruns:0 frame:0
     TX packets:9685456 errors:0 dropped:2 overruns:0 carrier:0
     collisions:0 txqueuelen:128
     RX bytes:89830114 (85.6 MiB) TX bytes:2213308646 (2.0 GiB)

Michael S. Tsirkin wrote: Hi! What kind of performance do people see with ip over ib on gen2? I see about 100Mbyte/sec at 99% CPU utilisation on send, on an express card, Xeon 2.8GHz, SSE doorbells enabled. 
MST
Re: [openib-general] ip over ib throughtput
At 05:47 PM 5/9/2006, Shirley Ma wrote: Thanks for sharing these test results. The netperf/netserver IPoIB over UD mode test spent most of time on copying data from user to kernel + checksum (csum_partial_copy_generic), and it can only send no more than mtu=2044 per ib_post_send() WQE, which definitely limits its performance compared to RDMA read/write. I would expect NFS/RDMA throughput much better than IPoIB over UD. Actually, I got excellent results in regular cached mode too, which results in one data copy from the file page cache to user space. (In NFS O_DIRECT, the RDMA is targeted at the user pages, bypassing the cache and yielding zero-copy zero-touch even though the I/O is kernel mediated by the NFS stack.) Throughput remains as high as in the direct case (because it's still not CPU limited), and utilization rises to a number less than you might expect - 65%. Specifically, the cached i/o test used 79us/32KB, and the direct i/o used 56us/32KB. Of course, the NFS/RDMA copies do not need to compute the checksum, so they are more efficient than the socket atop IPoIB. But I am not sure that the payload per WQE is important. We are nowhere near the op rate of the adapter. I think the more important factor is the interrupt rate. NFS/RDMA allows the client to take a single interrupt (the server reply) after all RDMA has occurred. Also, the client uses unsignalled completion on as many sends as possible. I believe I measured 0.7 interrupts per NFS op in my tests. Well, I have been very pleased with the results so far! We'll have more detail as we go. Tom.
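The unsignalled-completion trick mentioned above amortizes interrupts by only requesting a completion on every Nth send WR. A toy calculation of the effect (the signaling ratio is an illustrative parameter, not the value the NFS/RDMA client actually uses):

```python
def completions_per_op(total_sends, signal_every):
    """Model of unsignalled completions: only every Nth send WR is posted
    signalled, so completion events (and their interrupts) arrive at
    1/N the rate of posted operations. Purely illustrative."""
    signalled = total_sends // signal_every
    return signalled / total_sends

# Signalling 1 in 4 sends cuts send-completion events to 0.25 per op.
print(completions_per_op(1000, 4))  # -> 0.25
```

Combined with taking a single interrupt for the server's reply after all RDMA has landed, this is how an observed rate like 0.7 interrupts per NFS op becomes possible.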
Re: [openib-general] Re: mthca FMR correctness (and memory windows)
At 03:41 AM 3/21/2006, Michael S. Tsirkin wrote: Which applications do register/unregister for each I/O? Storage! Do you have a specific benchmark in mind? Storage! :-) Tom.
Re: [openib-general] mthca FMR correctness (and memory windows)
At 10:14 PM 3/20/2006, Roland Dreier wrote: Thomas Oh yeah, I have to guess the PD too. You can't guess the PD. You would have to trick the victim into putting the remote QP into the same PD. (PDs are not represented on the wire at all) Ok - uncle. Since we're implementing the Linux PD protection in the OpenIB driver, it's moot to discuss what happens if it can be bypassed. My point is merely that the scope of the rkey is quite important, and must not be compromisable. Tom.
RE: [openib-general] mthca FMR correctness (and memory windows)
At 01:01 AM 3/21/2006, Dror Goldenberg wrote: Not sure we managed to convince you of anything about FMRs. Anyway, On the contrary, I feel I know them much better. :-) I'm certainly more aware of the behavior of the fmr pool code, which is not appropriate for storage ULPs, in my opinion. I would suggest, even just for the sake of performance evaluation, to try the following FMR approaches: 0a - this is the allegedly fastest: using FMR to consolidate a list of pages 2a - with fmr map/unmap - same behavior as MWs (this is what you called 4) 3a - with fmr pool - will be like async unbind. Yes, I am considering these. You're reversing my numbering, but I can say that your 0a is definitely desirable, and 2a is what I'm attempting to implement now. I wouldn't be surprised if you end up finding 0a a win-win for both the client and the server. If you end up finding differently, then that may also be interesting. BTW, iSER only works this way, the RFC does not allow passing a chunk list as far as I know... Yes, iSER follows the SCSI transfer mode, which places a single segment on the wire for each operation. RPC/RDMA was designed rather differently. For one thing, NFS is not a block-oriented protocol. This means it is more flexible w.r.t. data segmentation. Also, NFS has a much broader range of message types, with metadata payload. These lead to requirements for a more flexible wire structure. I am hopeful that NFS/RDMA will lend itself well to cluster computing, due to its good sharing semantics, transparent file API, and low overhead from use of the RDMA fabric. The one thing I don't want to build in is some kind of compromise on security or data integrity. No performance gain is worth that. Tom.
RE: [openib-general] mthca FMR correctness (and memory windows)
At 12:25 PM 3/19/2006, Dror Goldenberg wrote: When ib_unmap_fmr() is done, you can be sure that the old FMR is inaccessible. That's why this call blocks... Okay, that's good. But Tom, I think that you should be looking at rdma/ib_fmr_pool.h for a better API to use for FMRs. This way you can allocate a pool and remap FMRs each time you need one. You can look for examples in the ulp/sdp directory. Yeah, I noticed it, but there is already a mechanism in the RPC/RDMA client which supports memory windows, and it is easily adapted to include fmr's. I might use the pool api later. Is there a plan to make fmr's compliant with verbs 1.2? In the future... And it will probably be a different API, one that can go through a WQE-CQE. Yes, it certainly will be different. I would prefer the CQ completion style to the blocking style of the current fmr's. It would allow for better overlap of RPC processing. Currently I must defer the deregistration until processing reaches user context, and then the blocking operation costs a context switch. With the memory window API, I can launch the deregistration early, and it's often polled as complete by the time I'm ready to return from the RPC. So, I would prefer that fmr's used a similar process. Final question is memory windows. The code is there in the NFS/RDMA client, but all it gets is ENOSYS from ib_bind_mw(). (Yes, I know memory windows may not perform well on mthca. However, they're correct and hopefully faster than ib_reg_phys_mr().) FMR's the fastest. MWs are supported by the mthca HW. To my knowledge there has been no demand for MWs so far, and that's why the code to handle them hasn't been implemented in mthca. I want to quantify fastest before I agree with you. But I don't doubt that fmr's will perform better than memory windows, whose performance on mthca hardware is disappointing (apparently) due to fencing and DMA flushes. 
I could not achieve more than 150MB/sec using windows, while I reached full bus bandwidth with a single full-frontal rkey. I am hoping that fmr's come in closer to the latter. I will not agree with your statement that nobody wants memory windows. User space applications that don't wish to expose large amounts of memory will certainly want them. Kernel space has the advantage here, by being able to use fmr's. User space can't do that. Tom.
Re: [openib-general] mthca FMR correctness (and memory windows)
At 02:58 PM 3/20/2006, Roland Dreier wrote: If you want to invalidate each FMR individually, then there's not much point in using FMRs at all. Just register and unregister memory regions. Really? The idea of FMR, I thought, was to preallocate the TPT entry up front (alloc_fmr), and then populate it with the lightweight fmr api (map_phys_fmr). If I use ib_reg_phys_mr(), then I incur all the preallocation overhead each time. Surely the benefit of an FMR isn't merely that the hard work is deferred, opening a vulnerability while it's pending?? No chance. You need to implement MW allocation in mthca. It's not a ton of work but it hasn't reached the top of anyone's list yet. Rats! :-) Well, I'll maybe try to scope that, if you haven't already? Tom.
RE: [openib-general] mthca FMR correctness (and memory windows)
At 05:09 PM 3/20/2006, Dror Goldenberg wrote: It's not exactly the same. The important difference is about scatter/gather. If you use dma_mr, then you have to send a chunk list from the client to the server. Then, for each one of the chunks, the server has to post an RDMA read or write WQE. Also, the typical message size on the wire will be a page (I am assuming large IOs for the purpose of this discussion). Yes, of course that is a consideration. The RPC/RDMA protocol carries many more chunks for NFS_READ and NFS_WRITE RPCs in this mode. But the performance is still excellent, because the server can stream RDMA Writes and/or RDMA Reads to and from the chunklists in response. Since NFS clients typically use 32KB or 64KB sizes, such chunklists are typically 8 or 16 elements, for which the client offers large numbers of rdma read responder resources, along with large numbers of RPC/RDMA operation credits. In a typical read or write burst, I have seen the Linux client have 10 or 20 RPC operations outstanding, each with 8 or 16 RDMA operations and two sends for the request/response. In full transactional workloads, I have seen over a hundred RPCs. It's pretty impressive on an analyzer. Alternatively, if you use FMR, you can take the list of pages the IO is comprised of, collapse them into a virtually contiguous memory region, and use just one chunk for the IO. This: - Reduces the number of WQEs that need to be posted per IO operation * lower CPU utilization - Reduces the number of messages on the wire and increases their sizes * better HCA performance It's all relative! And most definitely not a zero-sum game. Another way of looking at it: if the only way to get fewer messages is to incur more client overhead, it's (probably) a bad trade. Besides, we're nowhere near the op rate of your HCA with most storage workloads. So it's an even better strategy to just put the work on the wire asap. Then, the throughput simply scales (rises) with demand. 
This, by the way, is why the fencing behavior of memory windows is so painful. I would much rather take an interrupt on bind completion than fence the entire send queue. But there isn't a standard way to do that, even in iWARP. Sigh. Tom.
Re: [openib-general] mthca FMR correctness (and memory windows)
At 06:00 PM 3/20/2006, Sean Hefty wrote: Can you provide more details on this statement? When are you fencing the send queue when using memory windows? Infiniband 101, and VI before it. Memory windows fence later operations on the send queue until the bind completes. It's a misguided attempt to make upper layers' job easier because they can post a bind and then immediately post a send carrying the rkey. In reality, it introduces bubbles in the send pipeline and reduces op rates dramatically. I argued against them in iWARP verbs, and lost. If Linux could introduce a way to make the fencing behavior optional, I would lead the parade. I fear most hardware is implemented otherwise. Yes, I know about binding on a separate queue. That doesn't work, because windows are semantically not fungible (for security reasons). Tom.
Re: [openib-general] mthca FMR correctness (and memory windows)
Ok, this is a longer answer. At 06:08 PM 3/20/2006, Fabian Tillier wrote: You pre-alloc the MPT entry, but not the MTT entries. You then populate the MTT by doing posted writes to the HCA memory (or host memory for memfree HCAs). ... I don't know if allocating MTT entries is really expensive. What costs is the fact that you need to do command interface transactions to write the MTT entries, while FMRs support posted writes. I don't know what MPTs and MTTs are (Mellanox implementation?) nor do I know exactly what the overhead difference you refer to really is. It's less about the overhead and more about the resource contention, in my experience. That is, just like with alloc_fmr, you need to reserve and format an MPT for regular memory registrations, which is a command interface transaction. For memory registration, one or more commands precede this to write the MTT. Thus, a memory registration is at a minimum a 2-command interface transaction operation, potentially more depending on the size of the registration. Deregistration and freeing (not unmapping) an FMR should be equivalent, I would think. So, in the RPC/RDMA client, I do ib_alloc_fmr() a bunch of times way up front, when setting up the connection. This provides the windows which are then used to register chunks (RPC/RDMA segments). As each RPC is placed on the wire, I borrow fmr's from the above list and call ib_map_phys_fmr() to establish the mapping for each of its segments. No allocation is performed on this hot path. When the server replies, I call ib_unmap_fmr() to tear down the mappings. No deallocation is performed; the fmr's are returned to a per-mount pool, *after unmapping them*. I just want the fastest possible map and unmap. I guess that means I want fast MTT's. 
I'd spoken with Dror about changing the implementation of memory registration to always use posted writes, and we'd come to the conclusion that this would work, though doing so was not the intended usage and thus not something that was guaranteed to work going forward. One of Dror's main concerns was that a future change in firmware could break this. Such a change would allow memory registration to require only a single command interface transaction (and thus only a single wait operation while that command completes). I'd think that was beneficial, but haven't had a chance to poke around to quantify the gains. Again, it's not registration, it's the map/unmap. Do you believe that would be faster with this interface? I don't think it requires an API change outside the mthca interface, btw. I'd still be interested in seeing regular registration calls improved, as it's clear that an application that is sensitive about its security must either restrict itself to send/recv, buffer the data (data copy overhead), or register/unregister for each I/O. Trust me, storage is sensitive to its security (and its data integrity). As to using FMRs to create virtually contiguous regions, the last data I saw about this related to SRP (not on OpenIB), and resulted in a gain of ~25% in throughput when using FMRs vs the full frontal DMA MR. So there is definitely something to be gained by creating virtually contiguous regions, especially if you're doing a lot of RDMA reads, for which there's a fairly low limit to how many can be in flight (4 comes to mind). 25% throughput over what workload? And I assume this was with the lazy deregistration method implemented with the current fmr pool? What was your analysis of the reason for the improvement - if it was merely reducing the op count on the wire, I think your issue lies elsewhere. Also, see the previous paragraph - if your SRP is fast but not safe, then only fast-but-not-safe applications will want to use it. 
Fibre channel adapters do not introduce this vulnerability, yet they go fast. I can show you NFS running this fast too, by the way. Tom.
Re: [openib-general] mthca FMR correctness (and memory windows)
At 07:50 PM 3/20/2006, Roland Dreier wrote: Thomas Yes, I know about binding on a separate queue. That Thomas doesn't work, because windows are semantically not Thomas fungible (for security reasons). Can you elaborate on the issue of fungibility? If one entity has two QPs, one of which it's using for traffic and one of which it's using for MW binds, I don't see any security issue (beyond the fact that you've now given up ordering of operations between the QPs). If I can snoop or guess rkeys (not a huge challenge with 32 bits), and if I can use them on an arbitrary queuepair, then I can handily peek and poke at memory that does not belong to me. For this reason, iWARP requires its steering tags to be scoped to a single connection. This leverages the IP security model and provides correctness. It is true that IB implementations generally don't do this. They should. Tom.
RE: [openib-general] mthca FMR correctness (and memory windows)
At 08:42 PM 3/20/2006, Diego Crupnicoff wrote: If I can snoop or guess rkeys (not a huge challenge with 32 bits), and if I can use them on an arbitrary queuepair, then I can handily peek and poke at memory that does not belong to me. No. You can't get to the Window from an arbitrary QP. Only from those QPs that belong to the same PD. Oh yeah, I have to guess the PD too. For this reason, iWARP requires its steering tags to be scoped to a single connection. This leverages the IP security model and provides correctness. It is true that IB implementations generally don't do this. They should. IB allows the 2 flavors (PD bound Windows aka type 1, and QP bound Windows aka type 2). Does mthca? I thought it's all type 1. Tom.
RE: [openib-general] mthca FMR correctness (and memory windows)
At 08:24 PM 3/20/2006, Doug O'Neil wrote: From iWarp RDMA Verbs Section 5.2 ... Tom, I read the above as: an STag that represents a MR can be used by any QP with the same PD ID, while STags that represent a MW must be used on the same QP that created them. The iWARP verbs were never made part of the RDDP specification, nor would an API-based security model have passed muster in the IETF. Tom.
[openib-general] mthca FMR correctness (and memory windows)
I'm implementing FMR memory registration mode in the NFS/RDMA client, and I've got it mostly working. However, as I understand it, mthca's existing fmr's do not guarantee that the r_key is completely invalidated when ib_unmap_fmr() returns. This makes using them rather problematic, to say the least. Now, I notice that ib_unmap_fmr() is a blocking operation (at least, the kernel whines about semaphores being waited on in interrupt context, when I experimented with that). Does this mean mthca's ib_unmap_fmr() is waiting for the invalidation now, or plans to in the future? Second comment is that the existing fmr api is (IMO) very inconsistent. Why does ib_map_phys_fmr() take an array of u64 physaddrs and not struct page *'s? And the unmap api mysteriously takes a struct list_head *, not any object returned by ib_alloc_fmr() or ib_map_phys_fmr(). Is there a plan to make fmr's compliant with verbs 1.2? Final question is memory windows. The code is there in the NFS/RDMA client, but all it gets is ENOSYS from ib_bind_mw(). (Yes, I know memory windows may not perform well on mthca. However, they're correct and hopefully faster than ib_reg_phys_mr().) What is the plan to implement mthca memory windows? Thanks, Tom.
Re: [openib-general] Re: Revenge of the sysfs maintainer! (was Re: [PATCH 8 of 20] ipath - sysfs support for core driver)
At 11:58 PM 3/9/2006, Bryan O'Sullivan wrote: I'd like a mechanism that is (a) always there (b) easy for kernel to use and (c) easy for userspace to use. A sysfs file satisfies a, b, and c, but I can't use it; a sysfs bin file satisfies all three (a bit worse on b), but I can't use it; debugfs isn't there, so I can't use it. That leaves me with few options, I think. What do you suggest? (Please don't say netlink.) mmap()? Tom.
Re: [openib-general] [PATCH 0 of 20] [RFC] ipath driver - another round for review
At 07:35 PM 3/9/2006, Bryan O'Sullivan wrote: - We've added an ethernet emulation driver so that if you're not using Infiniband support, you still have a high-performance net device (lower latency and higher bandwidth than IPoIB) for IP traffic. This strikes me as very unwise. In addition to duplicating a standardized IPoIB facility, is the emulation supported by any other implementation? Who will be using this code *without* having enabled the current OpenIB support? What standardization is planned for this new protocol? Tom.
Re: [openib-general] [PATCH 0 of 20] [RFC] ipath driver - another round for review
At 10:59 AM 3/10/2006, Bryan O'Sullivan wrote: On Fri, 2006-03-10 at 09:06 -0500, Talpey, Thomas wrote: This strikes me as very unwise. In addition to duplicating a standardized IPoIB facility, is the emulation supported by any other implementation? No, it's specific to our hardware. Its main purpose is to provide an IP stack that works over the fabric when there are no IB drivers present, so it's not duplicating IPoIB in any meaningful sense. This is not sufficient justification to introduce an incompatible and redundant Ethernet emulation layer into the core. Will it work in a system where IPoIB is enabled? How do you handle IP addressing and discovery? Have you tested it under all upper layers including IPv6? What apps do your users run? Who will be using this code *without* having enabled the current OpenIB support? We already have a pile of customers using it. It happens to have lower latency and higher bandwidth than IPoIB, but I suspect that's in part because we haven't had time to tune IPoIB yet. You need to put your effort into supporting IPoIB. I would like to know what that tuning entails, btw. What standardization is planned for this new protocol? None at present. It's there for people who want it, and people are already using it. For those who need something standards-based, there's IPoIB. That just doesn't cut it. Standard is better than Better. This code is at the moment a proprietary extension, being proposed for global inclusion. At a minimum, you need to document its protocol and quantify its performance advantages. If so, perhaps it can be justified as an experimental upper layer. By the way, what's the name of this component? Tom.
Re: [openib-general] [PATCH 0 of 20] [RFC] ipath driver - another round for review
Hrm. Not sure how the emulation isn't Infiniband-related. But you see the problem, right? If integrated, this becomes a Linux-to-Linux protocol (only). And the first question it has to answer is why isn't this just IPoIB? I haven't seen an answer to that. Tom. At 01:12 PM 3/10/2006, Bryan O'Sullivan wrote: On Fri, 2006-03-10 at 13:02 -0500, Talpey, Thomas wrote: Will it work in a system where IPoIB is enabled? Yes. How do you handle IP addressing and discovery? DHCP and static addressing both work as you'd expect. Have you tested it under all upper layers including IPv6? Yes. What apps do your users run? Whatever they want. NFS, ssh, SMB, etc, etc. At a minimum, you need to document its protocol, and quantify its performance advantages. If so, perhaps it can be justified as an experimental upper layer. It's not Infiniband-related at all, if that's what you're objecting to. By the way, what's the name of this component? ipath_ether. b
Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters
At 02:11 AM 3/7/2006, Michael S. Tsirkin wrote: The default TX_RING_SIZE is 64 and RX_RING_SIZE is 128 in IPoIB. These parameters must be a power of 2, and at least 2, otherwise things break. I'd suggest making the parameter a log2 value and deriving the ring size from it, to exclude the possibility of user error. Surely this isn't true of all hardware. If the underlying hardware requires a power of 2, it should fix it, not make a requirement on the framework setting? Tom.
[openib-general] Re: Re: [PATCH] TX/RX_RING_SIZE as loadable parameters
At 08:00 AM 3/7/2006, Michael S. Tsirkin wrote: Surely this isn't true of all hardware. This is true for all hardware. We have code in ipoib like tx_req = priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)], so it only works for a tx size that is a power of 2. Sorry, but that sounds like a needless restriction. Some hardware doesn't have ring buffers at all. Well, if ipoib does it this way, though, I guess that's what it is. If the underlying hardware requires a power of 2, it should fix it, not make a requirement on the framework setting? Supporting an arbitrary ring size would require integer division or conditional code on the data path; I don't think it's worth it. That's actually not what I suggested. I said the hardware driver should change any unacceptable value to something that is. Or, it can simply reject it. Anyway, I definitely think it should be settable - but isn't code like you quote going to result in changing *all* ipoib interfaces? This kind of thing is usually a driver parameter, not an upper layer. Tom.
Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters
At 08:59 AM 3/7/2006, Michael S. Tsirkin wrote: How about we take the ring buffer size from dev->tx_queue_len? Round it down to the nearest power of 2, for simplicity. IPOIB_TX_RING_SIZE will then be just the default value, so it can stay hardcoded. Make the RX queue say twice that number, and keep it per device. This way I think the user can both view and set these values with the standard ip command. Makes sense to me! And it certainly meets the requirement of per-interface tunables, using standard interfaces (ifconfig). Remember though, tx_queue_len is only somewhat proportional to the hardware tx queue. Technically, it's the send backlog for use when the hardware queue is full. It's often smaller for faster hardware. OTOH, receive rings are usually *larger* for faster hardware. Might be worth thinking these relationships through... Tom.
Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters
At 01:20 PM 3/7/2006, Michael S. Tsirkin wrote: Quoting r. Roland Dreier [EMAIL PROTECTED]: Subject: Re: [PATCH] TX/RX_RING_SIZE as loadable parameters Michael How about we take the ring buffer size from Michael dev->tx_queue_len? But dev->tx_queue_len is a different setting. It's quite reasonable to have the tx_queue_len be set independently of the underlying pseudo-hardware work queues. It kind of makes sense to have them related though, does it not? Again - not necessarily! The tx_queue_len is a software backlog used if the software overdrives the hardware on transmit. It avoids losing packets, any of which can be seen in the netstat per-interface drops statistic. It only needs to be big enough. In fact, a large device ring means you probably only need a small tx queue. But tuning this stuff can be a black art. Generally there is no reason to want a silly-large tx ring. Besides, IPoIB only ever has one message in flight, right? In an earlier message you mentioned scaling the rx ring to the tx ring. I think you should think more about that. The rx ring (hardware) has to be big enough to keep packets while they await hardware interrupt processing. So it's dependent on the arrival rate and the interrupt latency (including any interrupt coalescing), not the transmitter depth. All these numbers should be settable per-interface, and should attempt to adhere to the principle of least surprise - do what other drivers in net/ do. Tom.
Re: [openib-general] Re: Re: RFC: e2e credits
At 04:39 PM 3/7/2006, Michael S. Tsirkin wrote: Anyway, since ULPs don't seem to need it, another approach would be an option to disable these flow controls globally, and probably add a module option to enable them back just in case. That's much simpler, isn't it? Thumbs up! If nothing uses them, why keep them around, enabled, just to cause problems? As an upper layer implementor, I sure don't want them in the way, nor do I want to add special code to turn off something I wasn't even aware was in the provider. Tom.
Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters
At 04:45 PM 3/7/2006, Roland Dreier wrote: No, IPoIB can have arbitrarily many packets in flight. It's just like any other layer 2 net device in that respect. I thought UD had only a single-packet window in the qp context. There isn't much uniformity about this in drivers/net. Unfortunately, making the ring sizes settable per-interface leads to a lot of ugly option handling code. Is it really important, or can we get by with one per-module setting? Well, it's only important if the code that's there works well. I thought Shirley said it doesn't. Has anyone instrumented it for overruns and drops, and watched it under load? That would tell us what to tweak. I dunno, it's probably deferrable (with Shirley's queue changes) as long as there's a way to diagnose it later. Constants are pretty much never correct in networking code. And module parameters are darn close to being constants. Tom.
Re: [openib-general] RFC: move SDP from AF_INET_SDP to IPPROTO_SDP
We're encountering a similar situation in NFS/RDMA protocol naming. The existing NFS client and server understand just IPPROTO_UDP and IPPROTO_TCP. One comment though: IP protocols are just 8 bits, 0-255. No need to go to 64K. I agree with Bryan though that it's not ours to say yes; it's netdev's call. You should maybe stress the getaddrinfo() point more strongly, since sharing of naming interfaces is highly desirable. SDP is all about code compatibility, after all. Tom. At 01:15 PM 3/6/2006, Michael S. Tsirkin wrote: Hi! Would it make sense to move SDP from using a separate address family to a separate protocol under AF_INET and AF_INET6? Something like IPPROTO_SDP? The main advantages are - IPv6 support will come more naturally and without further extending to yet another address family - We could use a protocol number < 64K (e.g. 7) to avoid conflicting with any IP based protocol. There are many more free protocol numbers than free family numbers (which only go up to 32 in linux for now). - I could reuse more code for creating connections from af_inet.c I also have a hunch this might make getaddrinfo work better on sdp but I'm not sure. Comments? Are there disadvantages to this approach that someone can see? -- Michael S. Tsirkin Staff Engineer, Mellanox Technologies
[openib-general] NFS/RDMA client *and server* release for Linux 2.6.15
Following up on the client release of Feb 8, we are releasing a first-functional NFS/RDMA server for Linux 2.6.15, along with client updates based on comments received. These are both licensed under dual BSD/GPL2 terms, and available at the project's Sourceforge site: http://sourceforge.net/projects/nfs-rdma/ http://sourceforge.net/project/showfiles.php?group_id=###package_id=### Both client and server employ the native OpenIB verbs API for RDMA, and work equally for Infiniband and iWARP. The client and server implement the IETF draft protocol and fully support direct (zero-copy, zero-touch) RDMA transfers at the RPC layer. However, the write performance is not yet representative of full RDMA operation, due to a bottleneck in the server's use of RDMA Read, and at least one data copy in its handoff to the filesystem. We will be rectifying the former, and investigating the latter. Both the client and server have been tested with NFSv3 and pass the Connectathon test suite. Due to the additional components, the procedure for applying the patches is substantially more involved, requiring several steps to be followed in a particular sequence. Also, the server patches have been separated into framework and RDMA sections, as has already been done for the client. The package README has details. The RDMA support in this NFS server release was developed by Tom Tucker of Open Grid Computing, and we thank him for his efforts on this. At this time, the changes to the Linux NFS server svc framework are in effect a first proposal for how RDMA support might be added to the code. There are open issues in both how the module linkage should be structured, and also how the linkage to existing code should be done. As before, we look forward to comments and feedback! Thanks for all of it so far. Tom Talpey, for the various NFS/RDMA projects. 
Re: [openib-general] NFS/RDMA client *and server* release for Linux 2.6.15
Please ignore this draft message, which somehow escaped my outbox! The code will be up shortly though. Sorry for the interruption. Tom. At 01:50 PM 3/6/2006, Talpey, Thomas wrote: Following up on the client release of Feb 8, we are releasing a first-functional NFS/RDMA server for Linux 2.6.15, along with client updates based on comments received. These are both licensed under dual BSD/GPL2 terms, and available at the project's Sourceforge site: http://sourceforge.net/projects/nfs-rdma/ http://sourceforge.net/project/showfiles.php?group_id=###package_id=### Both client and server employ the native OpenIB verbs API for RDMA, and work equally for Infiniband and iWARP. The client and server implement the IETF draft protocol and fully support direct (zero-copy, zero-touch) RDMA transfers at the RPC layer. However, the write performance is not yet representative of full RDMA operation, due to a bottleneck in the server's use of RDMA Read, and at least one data copy in its handoff to the filesystem. We will be rectifying the former, and investigating the latter. Both the client and server have been tested with NFSv3 and pass the Connectathon test suite. Due to the additional components, the procedure for applying the patches is substantially more involved, requiring several steps to be followed in a particular sequence. Also, the server patches have been separated into framework and RDMA sections, as already been done for the client. The package README has details. The RDMA support in this NFS server release was developed by Tom Tucker of Open Grid Computing and we thank him for his efforts on this. At this time, the changes to the Linux NFS server svc framework are in effect a first proposal for how RDMA support might be added to the code. There are open issues in both how the module linkage should be structured, and also how the linkage to existing code be done. As before, we look forward to comments and feedback! Thanks for all of it so far. 
Tom Talpey, for the various NFS/RDMA projects.
[openib-general] NFS/RDMA client *and server* release for Linux 2.6.15
Following up on the client release of Feb 8, we are releasing a first-functional NFS/RDMA server for Linux 2.6.15, along with client updates based on comments received. These are both licensed under dual BSD/GPL2 terms, and available at the project's Sourceforge site: http://sourceforge.net/projects/nfs-rdma/ http://sourceforge.net/project/showfiles.php?group_id=97628&package_id=182485&release_id=399220

Both client and server employ the native OpenIB verbs API for RDMA, and work equally for Infiniband and iWARP. The client and server implement the IETF draft protocol (*) and fully support direct (zero-copy, zero-touch) RDMA transfers at the RPC layer. However, the write performance is not yet representative of full RDMA operation, due to a bottleneck in the server's use of RDMA Read, and at least one data copy in its handoff to the filesystem. We will be rectifying the former, and investigating the latter. Both the client and server have been tested with NFSv3 and pass the Connectathon test suite.

Due to the additional components, the procedure for applying the patches is substantially more involved, requiring several steps to be followed in a particular sequence. Also, the server patches have been separated into framework and RDMA sections, as has already been done for the client. The package README has details.

The RDMA support in this NFS server release was developed by Tom Tucker of Open Grid Computing and we thank him for his efforts on this. At this time, the changes to the Linux NFS server svc framework are in effect a first proposal for how RDMA support might be added to the code. There are open issues in both how the module linkage should be structured, and also how the linkage to existing code should be done. As before, we look forward to comments and feedback! Thanks for all of it so far. Tom Talpey, for the various NFS/RDMA projects.
(*) Protocol docs under Internet-Drafts at bottom of page: http://www.ietf.org/html.charters/nfsv4-charter.html
Re: [openib-general] NFSoRDMA
At 06:25 PM 3/1/2006, Brad Dameron wrote: Anyone have the NFS over RDMA working? I have tried getting the CITI patches to compile with no luck. I am using the Voltaire IB cards, which appear to be Mellanox MT23108 cards.

The CITI server will not compile and link on a generic OpenIB-enabled kernel (it needs kDAPL). That code was a prototype and there is no further work on it planned. Instead we are developing a new server which uses native OpenIB and will implement the full protocol including direct RDMA transfers and multiple credits (two major elements not in the CITI code). It will be released next week in first-functional form. I assume you have seen the client announcement from a couple of weeks back? Have you had any issues with that code? http://openib.org/pipermail/openib-general/2006-February/016218.html Anyway, watch here for the followup - it will have client changes based on comments here and from others in the NFS community as well. I'll mail here when we've assembled the patches (next week). Tom.
Re: [openib-general] NFS/RDMA client release for Linux 2.6.15
Thanks for the detailed review! Some replies below. I left the IETF list out of this reply since it's basically porting, not protocol.

At 07:01 AM 2/19/2006, Christoph Hellwig wrote: On Wed, Feb 08, 2006 at 03:58:56PM -0500, Talpey, Thomas wrote: We have released an updated NFS/RDMA client for Linux at the project's Sourceforge site: Thanks, this looks much better than the previous patch. Comments:

- please don't build the rdma transport unconditionally, but make it a user-visible config option

It's an option, but it's located in fs/Kconfig not net/. This is the way SUNRPC is selected, so we simply followed that. BTW, Chuck's transport switch doesn't support dynamically loading modules yet so there is a dependency to work out until that's in place.

- please use the kernel u*/s* types instead of (u)int*_t

We use uint*_t for the user-visible protocol definitions (on the wire) and u32 etc. for kernel stuff. I'll recheck if we got something wrong.

- please include your local headers after the linux/*.h headers, and keep all the includes at the beginning of the files, just after the licence comment block

There are a couple of issues with header include ordering that seem to change pretty often. In a couple of cases we had to rearrange things to avoid forward declarations, I'll recheck this.

- chunktype shouldn't be a typedef but a pure enum, and the names look a bit too generic, please add an rdma_ prefix

Ok on both.

- please kill the XDR_TARGET and pos0 macros, maybe RPC_SEND_SEG0 and RPC_SEND_LEN0, too
- RPC_SEND_VECS should become an inline function and be spelled lowercase
- RPC_SEND_COPY is probably too large to be inlined and should be spelled lowercase
- RPC_RECV_VECS should be an inline and spelled lowercase
- RPC_RECV_SEG0 and RPC_RECV_LEN0 should probably go away.
- RPC_RECV_COPY is probably too large to be inlined and should be spelled lowercase
- RPC_RECV_COPY same comment about highmem and kmap as in RPC_SEND_COPY

These are killable.
They were there to support code sharing for 2.4 kernels and are easy to eliminate now.

- the CONFIG_HIGHMEM ifdef block in RPC_SEND_COPY is wrong. Please always use kmap, it does the right thing for non-highmem as well. The PageHighMem check and using kmap_high directly is always wrong, they are internal implementation details. I'd also suggest evaluating kmap_atomic because it scales much better on SMP systems.

Yes, there are some issues here which we're still working out. In fact, we can't use kunmap() in the context you mention because in 2.6.14 (or is it .15) it started to check for being invoked in interrupt context. There is one configuration in which we do call it in bh context. The call won't block but the kernel BUG_ON's. This is something on our list to address.

- please try to avoid file-scope forward-prototypes but try to order the code in the natural flow where they aren't required

Good point. Will recheck for these.

- structures like rpcrdma_msg that are on the wire should use __be* for endianness annotations, and the cpu_to_be*/be*_to_cpu accessor functions instead of hton?/ntoh?. Please verify that these annotations are correct using sparse -D__CHECK_ENDIAN__=1

Hmm, okay, but the existing RPC and NFS code doesn't do this. I'm reluctant to differ from the style of the governing subsystem. I'll check w/Trond.

- rdma_convert_physiov/rdma_convert_phys are completely broken. page_to_phys can't be used by driver/fs code. RDMA only deals with bus addresses, not physical addresses. You must use the dma mapping API instead. Also coalescing decisions are made by the dma layer, because they are platform dependent and much more complex than what the code in this patch does.

Now that we are moving to OpenIB APIs this is needed. There is some thought necessary w.r.t. our max-performance mode of preregistering memory in DMA mode. That's on our list of course.

- transport.c is missing a GPL license statement

Oops.

- in transport.c please don't use CamelCase variable names.
This is just for module parameters? These are going away but we don't have the new NFS mount API yet. There is a comment to that effect but maybe it doesn't mention the module stuff.

- MODULE_PARM shouldn't be used in new code, but module_param instead.

Ditto.

- please don't use the (void) function() style, it just obfuscates the code without benefit.

Ok.

- try_module_get(THIS_MODULE) is always wrong. Reference counting should happen from the calling module.

This is the same convention used by the other RPC transports. I will pass the comment along.

- please initialize global or file-scope spinlocks with DEFINE_SPINLOCK().

Ok.

- the traditional name for the second argument to spin_lock_irqsave is just flags, not lock_flags. This doesn't really matter, but following such conventions makes it easier to understand
Re: [openib-general] NFS/RDMA client release for Linux 2.6.15
At 05:28 PM 2/19/2006, Roland Dreier wrote:

Christoph> rdma_convert_physiov/rdma_convert_phys are completely broken. page_to_phys can't be used by driver/fs code. RDMA only deals with bus addresses, not physical addresses. You must use the dma mapping API instead. Also coalescing decisions are made by the dma layer, because they are platform dependent and much more complex than what the code in this patch does.

Thomas> Now that we are moving to OpenIB APIs this is needed. There is some thought necessary w.r.t. our max-performance mode of preregistering memory in DMA mode. That's on our list of course.

Again let me echo Christoph's point. If you are passing physical addresses into IB functions, then your code simply won't work on some architectures. Making sure your code actually works on something like a ppc64 box with an IOMMU would be a good test - the low-end IBM POWER machines are cheap enough that you could just buy one if you don't have easy access.

Yep, I get it! To elaborate a little, we're not exactly passing physical addresses. What we're doing is using the physaddr to calculate an offset relative to a base of zero. We register the zero address and advertise RDMA buffers via offsets relative to that r_key. And, this is only one of many memory registration modes. We would use memory windows, if only OpenIB provided them (yes, I know the hardware currently sucks for them). We will add FMR support shortly. In both these modes we perform all addressing by the book via 1-1 OpenIB registration. Tom.
Re: [openib-general] NFS performance and general disk network export advice (Linux-Windows)
At 03:17 PM 2/9/2006, Paul Baxter wrote: I'm looking to export a filesystem from each of four linux 64bit boxes to a single Windows Server 2003 64bit Ed. Has anyone achieved this already using an IB transport? Can I use NFS over IPoIB cross platform? i.e. do both ends support a solution? Is NFS over RDMA compatible with Windows (pretty sure the answer is no to this one but love to be proven wrong). I've attached Tom's announcement of the latest to the bottom of this email. I don't think Windows has the RDMA abstraction (yet)?

Not the code I posted! :-) But sure, it's possible to implement NFS/RDMA on Windows. Let us know when you're ready to test. ;-)

Are windows IB drivers (Openib or Mellanox) compatible with these options? Do I layer Windows services for Unix on top of the Windows IB drivers and IPoIB to achieve a cross platform NFS?

You could do this but your real challenge is the upper layer IFS interface. You would need to implement a Windows filesystem for NFS first. Of course, there are such beasts, Hummingbird's comes to mind. The code I posted uses strictly the OpenIB RDMA interfaces, plus CMA for address resolution and making connections. By the way, it will work over iWARP too.

Has anyone done much in the way of NFS performance comparisons of NFS over IPoIB in cross-platform situations vs say Gigabit ethernet? Does it work :) What is large file throughput and processor loading - I'm aiming for 150-200 MB/s on large files on 4x SDR IB (possibly DDR if we can fit the bigger 144 port switch chassis into our rack layout for 50-ish nodes).

NFS over IPoIB does work, but is nowhere near as low-overhead as native NFS over RDMA. There are several issues with an IPoIB implementation, first of all the fact that an IPoIB solution is quite a bit less optimal than a native 10GbE NIC:

- The UD connection typically has a single message in flight, which negates much of the streaming throughput achievable with RC.
- The IPoIB layer is an emulation, and does not generally perform the hardware checksumming and large segment offload that even 100Mb NICs provide.
- The network stack is still in the loop on both ends, adding computational overhead and latency.
- The data must still be copied.

I have seen native zero-copy zero-touch NFS/RDMA streaming at full PCI-X throughput using only about 20% of a dual-processor 2GHz Xeon. Typically, most network stacks top out at 100% CPU at perhaps half this rate on similar platforms. I'd expect IPoIB to be even less due to the reasons above.

Are there any alternatives to using NFS that may be better and that would 'transparently' receive a performance boost with IB compared with using a simple NFS/gigabit ethernet solution? Must be fairly straightforward, ideally application neutral (configure a drive and load/unload script for Linux and it just happens) and compatible between Win2003 and Linux? Alternatives using perhaps Samba on the Linux side? My lack of knowledge of IB in the windows world has got me concerned over whether this is actually achievable (easily). I hope to be trying this once we get a Windows 2003 machine, but hope someone can encourage me that it's a breeze prior to my coming unstuck in a month or so! Some detail about the bit I do understand: I will be using a patched Linux kernel (realtime preemption patches) but prefer not to apply/track too many kernel patches as the kernel evolves. The NFS patches suggested by Tom in his announcement below make me a little nervous.

The most important patches for integrating the NFS/RDMA client are already in the 2.6.15 kernel, but there is additional work which is still in progress. These are the patches I refer to. One of the major ones is the ability to dynamically load RPC transports, such as the NFS/RDMA module. So you do need some sort of patch to use the client, currently.
The transport switch continues to evolve and become integrated into the kernel, so the need for this particular patch will fall away eventually. FYI, the transport switch is much more general than NFS/RDMA - it's the underpinning of IPv6 support for the NFS client. Your real issue in working with NFS/RDMA in the way you describe is the availability of the server. The Linux NFS/RDMA server is still very much under development, and will take time just to be ready for experimentation. Especially, it will take time to get it to a state where it can perform the way you require (performance). Please feel free to contact me offline if you want to talk about details of actually setting this up. With a stock 2.6.15.2 kernel and a couple of IB cards you could get it going just to get started. Tom. The application will alternate between a real-time mode with (probably) no NFS (or similar network exporting of the disk) and an archiving mode where Linux will load relevant network filesystem modules and let the windows machine read the disks. The reason for this odd load/unload behaviour is because our current
Re: [openib-general] Re: [PATCH] CMA and iWARP
At 06:53 PM 1/23/2006, Roland Dreier wrote: vetoed on netdev and b) trying to get openib and the kernel community to accept code just so a vendor can meet a product marketing deadline. BTW, upon reflection, the best idea for moving this forward might be to push the Ammasso driver along with the rest of the iWARP patches, so that there's some more context for review. Just because a vendor is out of business is no reason for Linux not to have a driver for a piece of hardware. In fact, there are a bunch of Ammasso cards out there, and also, what better proof could you have that there isn't a hidden hardware agenda in the submission! Tom.
Re: [openib-general] iser/uverbs integration
At 08:06 AM 8/31/2005, Gleb Natapov wrote: The question is what is the best way to proceed? Will the changes needed to use a userspace QP from the kernel be accepted? How does NFS/RDMA work now? To answer the second question, both client and server NFS/RDMA create and connect all endpoints completely within the kernel. This is also true of NFS/Sockets btw. Tom.
RE: [openib-general] Re: RDMA Generic Connection Management
At 02:06 PM 8/31/2005, Yaron Haviv wrote: Also note that with Virtual machines this type of event may be more frequent and we may want to decouple the ULPs from the actual hardware s/may want/definitely want/ Tom.
RE: [openib-general] RDMA Generic Connection Management
At 10:55 AM 8/30/2005, Yaron Haviv wrote: The iSCSI discovery may return multiple src dst IP addresses and the iSCSI multipath implementation will open multiple connections. There are many TCP/IP protocols that do that at the upper layers (e.g. GridFTP, ..), not sure how NFS does it. The answer to that question depends on the version of NFS, and also the implementation. For NFSv2/v3, the situation is ad hoc. Some clients support multiple connections which they are able to round-robin. Solaris does this for example. The problem is, to the server each NFSv2/v3 connection appears to be a different client. Therefore the correctness guarantees (such as they are) go out the window. For example, a retry on a different connection is not a retry at all, it's a new op. So, the shotgun (trunked) NFSv3 situation is useful only for a certain class of use. For NFSv4, it's a little better in that there is a clientid which identifies the source. However, NFSv4 does not sufficiently deal with the case of requests on different connections either. With our new NFSv4 sessions proposal, planned to be part of NFSv4.1 (http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-02.txt), trunking is fully supported, by allowing requests to belong to a higher-layer session regardless of what connection they arrive on. This exists in prototype form, the NFSv4.1 spec is still being pulled together. UMich/CITI is developing this btw. With a session, the client gets full consistency guarantees and trunked connections are therefore completely transparent. One thing to stress is that the type of connection (TCP, UDP, RDMA, etc) makes little or no difference in the trunking/multipathing picture. In fact, with an NFSv4.1 session, a mix of such connections is possible, and even a good idea. So it's more than a question of what RDMA capabilities are there, it's really *all* connections. 
To answer the question of how NFS finds out about multiple connections and trunking, the answer is generally that the mount command tells it. Mount can get this information from the command line, or DNS. I believe Solaris uses the command line approach. There may be a way to use the RPC portmapper for it, but the portmapper isn't used by NFSv4. Bottom line? NFS would love to have a way to learn multipathing topology. But it needs to follow existing practice, such as having an IP address / DNS expression. If the only way to find it is to query fabric services, that's not very compelling. Tom.
Re: [openib-general] Re: RDMA Generic Connection Management
At 01:12 PM 8/30/2005, Roland Dreier wrote:

Steve> I thought all ULPs needed to register as an IB client regardless?

Right now they do, because there's no other way to get a struct ib_device pointer. If we add a new API that returns a struct ib_device pointer, then inevitably consumers will use it instead of the current client API, and then hotplug will be hopelessly broken.

Are you telling us that RPC/RDMA (for example) has to handle hotplug events just to use IB? Isn't that the job of a lower layer? NFS/Sockets don't have to deal with these, f'rinstance. Tom.
Re: [openib-general] Re: RDMA Generic Connection Management
At 02:01 PM 8/30/2005, Roland Dreier wrote:

Thomas> Are you telling us that RPC/RDMA (for example) has to handle hotplug events just to use IB? Isn't that the job of a lower layer? NFS/Sockets don't have to deal with these, f'rinstance.

Yes, if you want to talk directly to the device then you have to make sure that the device is still there to talk to.

Verbs don't do that? Tom.
Re: [openib-general] Re: RDMA Generic Connection Management
At 02:15 PM 8/30/2005, Roland Dreier wrote:

Thomas> Verbs don't do that?

Not as they are currently defined. And I don't think we want to add reference counting (aka cache-line pingpong) into every verbs call including the fast path to make sure that a device doesn't go away in the middle of the call.

Well, you're saying somebody has to do it, right? Is it easier to fob this off to upper layers that (frankly) don't care what hardware they're talking to!? This means we have N copies of this, and N ways to do it. Talk about cacheline pingpong. Sorry but it suddenly sounds like we're all writing device drivers, not developing upper layers. This is a mistake. Tom.
Re: [openib-general] Re: RDMA Generic Connection Management
kDAPL does this! :-)

At 02:35 PM 8/30/2005, Roland Dreier wrote:

Thomas> Well, you're saying somebody has to do it, right? Is it easier to fob this off to upper layers that (frankly) don't care what hardware they're talking to!? This means we have N copies of this, and N ways to do it. Talk about cacheline pingpong.

Upper layers have the luxury of being able to do this at a per-connection level, can sleep, etc. If we push it down into the verbs, then we have to do it in every verbs call, including the fast path verbs call. And that means we get into all sorts of crazy code to deal with a device disappearing between a consumer calling ib_post_send() and the core code being entered, etc. Right now we have a very simple set of rules:

- An upper level protocol consumer may begin using an IB device as soon as the add method of its struct ib_client is called for that device.
- A consumer must finish all cleanup and free all resources relating to a device before returning from the remove method.
- A consumer is permitted to sleep in its add and remove methods.

- R.
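The rules Roland states map onto the struct ib_client interface of that era. This is a non-runnable, kernel-style sketch under those assumptions (names like my_ulp are invented; error handling elided), not code from any of the patches discussed:

```c
/* Kernel sketch: a ULP registers an ib_client; add() marks the device
 * usable, and remove() must release every resource before returning,
 * after which the device may be gone (hotplug). */
#include <rdma/ib_verbs.h>

static struct ib_client my_ulp_client;

static void my_ulp_add_one(struct ib_device *device)
{
    /* Device is usable from this point: allocate PD, CQs, etc.
     * Sleeping is permitted here. */
    struct ib_pd *pd = ib_alloc_pd(device);
    ib_set_client_data(device, &my_ulp_client, pd);
}

static void my_ulp_remove_one(struct ib_device *device)
{
    /* Must finish ALL cleanup before returning. */
    struct ib_pd *pd = ib_get_client_data(device, &my_ulp_client);
    if (pd)
        ib_dealloc_pd(pd);
}

static struct ib_client my_ulp_client = {
    .name   = "my_ulp",
    .add    = my_ulp_add_one,
    .remove = my_ulp_remove_one,
};

/* Module init would call ib_register_client(&my_ulp_client),
 * and module exit ib_unregister_client(&my_ulp_client). */
```

This is the per-device, sleepable cleanup path Roland contrasts with reference counting on every fast-path verbs call.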
Re: [openib-general] Re: RDMA Generic Connection Management
At 03:08 PM 8/30/2005, Roland Dreier wrote:

Thomas> kDAPL does this! :-)

Does what? As far as I can tell kDAPL just ignores hotplug and routing and hopes the problems go away ;)

I was referring to kDAPL's architecture, which does in fact address hotplug with async evd upcalls. In the early days of the reference port we implemented it on Solaris this way, for example. Tom.
Re: [openib-general] Re: RDMA Generic Connection Management
At 04:10 PM 8/30/2005, Talpey, Thomas wrote: At 03:08 PM 8/30/2005, Roland Dreier wrote:

Thomas> kDAPL does this! :-)

Does what? As far as I can tell kDAPL just ignores hotplug and routing and hopes the problems go away ;)

I was referring to kDAPL's architecture, which does in fact address hotplug with async evd upcalls. In the early days of the reference port we implemented it on Solaris this way, for example.

And I remember naming the upcall E_NIC_ON_FIRE. There was another one after putting it out, of course. :-) Tom.
Re: [openib-general] RDMA connection and address translation API
At 12:34 PM 8/25/2005, Roland Dreier wrote: All implementations of NFS/RDMA on top of IB had better interoperate, right? Which means that someone has to specify which address translation mechanism is the choice for NFS/RDMA.

Correct. At the moment the existing NFS/RDMA implementations use ATS (Sun's and NetApp's).

NFS/RDMA is being defined on top of an abstract RDMA interface. Someone has to write a spec for how that RDMA abstraction is translated into packets on the wire for each transport that NFS/RDMA will run on top of.

Well, we did. We specify the ULP payload of all the messages in those two IETF documents. What we didn't do is define how each transport handles IP addressing; that is a transport issue. We don't need address translation over iWARP, since that uses IP. Over IB, so far, we have used ATS. I am perfectly fine with a better solution, but ATS has been fine too. I am catching up to this discussion, so this is just one reply. Tom.
RE: [openib-general] RDMA connection and address translation API
At 12:56 PM 8/25/2005, Caitlin Bestler wrote: Generic code MUST support both IPv4 and IPv6 addresses. I've even seen code that actually does this.

Let me jump ahead to the root question. How will the NFS layer know what address to resolve? On IB mounts, it will need to resolve a hostname or numeric string to a GID, in order to provide the address to connect. On TCP/UDP, or iWARP mounts, it must resolve to an IP address. The mount command has little or no context to perform these lookups, since it does not know what interface will be used to form the connection. In exports, the server must inspect the source network of each incoming request, in order to match against /etc/exports. If there are wildcards in the file, a GID-specific algorithm must be applied. Historically, /etc/exports contains hostnames and IPv4 netmasks/addresses. In either case, I think it is a red herring to assume that the GID is actually an IPv6 address. They are not assigned by the sysadmin, they are not subnetted, and they are quite foreign to many users. IPv6 support for Linux NFS isn't even submitted yet, btw. With an IP address service, we don't have to change a line of NFS code. Tom.

So supporting GIDs is not that much of an issue as long as no IB network IDs are assigned with a meaning that conflicts with any reachable IPv6 network ID. (In other words, assign GIDs so that they are in fact valid IPv6 addresses. Something that was always planned to be one option for GIDs.)

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of James Lentini Sent: Thursday, August 25, 2005 9:48 AM To: Tom Tucker Cc: openib-general@openib.org Subject: RE: [openib-general] RDMA connection and address translation API

On Wed, 24 Aug 2005, Tom Tucker wrote: - It's not just preventing connections to the wrong local address. NFS-RDMA wants the remote source address (ie getpeername()) so that it can look it up in the exports list. Agreed.
But you could also get rid of ATS by allowing GIDs to be specified in the exports file and then treating them like IPv6 addresses for the purpose of subnet comparisons. Could generic code use both GIDs and IPv4 addresses?
Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISER initiator
At 07:41 PM 8/18/2005, Grant Grundler wrote: If kDAPL for any reason doesn't get pushed upstream to kernel.org, we effectively don't have iSER or NFS/RDMA in linux. Since I think without them, linux won't be competitive in the commercial market place. Put another way, OpenIB wants storage to use it, and vice versa.

I can speak for NFS/RDMA. If NFS/RDMA doesn't have kDAPL, then it gets thrown backwards due to having to reimplement. That's recoverable (sigh) but there are still missing pieces. By far the largest is the connection and addressing models. There is, as yet, no unified means for an upper layer to connect over any other transport in the OpenIB framework. In fact, there isn't even a way to use IP addressing on the OpenIB framework now, which is an even more fundamental issue. So, yes, without kDAPL at the moment we don't have iSER or NFS/RDMA. We can recode the message handling pieces to OpenIB verbs. For NFS/RDMA, that's not even a ton of work. Then we'll be forced to reimplement or reuse pretty much all of the connect and listen code, and the IP address translation, atop OpenIB. How quickly can OpenIB move to a transport model that supports these missing pieces? I can give a different answer with that information. Tom.
Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
At 11:52 PM 8/11/2005, Tom Duffy wrote: On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote: Can anyone think of another approach to do this and keep backward compatibility? Do we need backward compatibility? How about the stuff that includes if_packet.h gets rebuilt? You are adding to the end of the struct, after all. The size of the struct is less of an issue than the test for ARPHRD_INFINIBAND. David said as much: "it won't work for anything else without adding more special tests to that af_packet.c code". I have to say, SOCKADDR_COMPAT_LL is pretty stinky too. Hal, why *are* you testing for ARPHRD_INFINIBAND anyway? What different action happens in the transport-independent code in this special case? Tom.
Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
At 09:01 AM 8/12/2005, Hal Rosenstock wrote: It is done to preserve length checks that were already there (on struct msghdr in packet_sendmsg and addr_len in packet_bind). I didn't want to weaken that. Are you sure things break if you simply build a message in user space that's got the larger address (without changing the sockaddr_ll at all)? It looks to me as if msg->msg_namelen/msg_name can be any appropriate size which is at least as large as the sockaddr_ll. Tom.
Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
At 10:12 AM 8/12/2005, Hal Rosenstock wrote: If the sockaddr_ll struct is left alone, I think it may be a problem on the receive side where the size of that struct is used. Maybe. The receive side builds the incoming sockaddr_ll in the skb->cb. But that's 48 bytes and it goes off to your device's hard_header_parse to do so... You sure you have hard_header_len and all the appropriate vectors set up properly? (netdevice.h) Tom.
Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
At 12:21 PM 8/12/2005, Tom Duffy wrote: Can we do an audit of what stuff will break with this change? If it is a handful of applications that we all have the source to, maybe it won't be that big of a deal. Right now it looks to me like the app is arping, and it can be fixed by increasing the size of the storage it allocates in its data segment, without changing the sockaddr_ll. Maybe others, haven't bothered to look. That is, change "struct sockaddr_ll me;" to "union { struct sockaddr_ll xx; unsigned char yy[32]; } me;". Note: Hal's change requires arping to be recompiled too! Can't stick 20 bytes into 8 there, either. Tom.
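The size problem in the message above can be made concrete. The sketch below mirrors the classic sockaddr_ll layout (the field names follow linux/if_packet.h; the Python mirror itself is illustrative) and shows the union trick: reserving 32 bytes means a 20-byte IPoIB hardware address written by the kernel cannot overrun the caller's storage, whereas the plain struct only has an 8-byte sll_addr.

```python
import ctypes

# Mirror of the classic struct sockaddr_ll with its 8-byte sll_addr.
class sockaddr_ll(ctypes.Structure):
    _fields_ = [
        ("sll_family",   ctypes.c_ushort),
        ("sll_protocol", ctypes.c_ushort),
        ("sll_ifindex",  ctypes.c_int),
        ("sll_hatype",   ctypes.c_ushort),
        ("sll_pkttype",  ctypes.c_ubyte),
        ("sll_halen",    ctypes.c_ubyte),
        ("sll_addr",     ctypes.c_ubyte * 8),
    ]

# The union trick: pad the caller's storage out to 32 bytes so a
# 20-byte IPoIB hardware address can't land past the end of "me".
class me_t(ctypes.Union):
    _fields_ = [
        ("xx", sockaddr_ll),
        ("yy", ctypes.c_ubyte * 32),
    ]

IPOIB_HALEN = 20  # IPoIB hardware addresses are 20 bytes, not 8
print(ctypes.sizeof(sockaddr_ll), ctypes.sizeof(me_t))
```

The 20-byte address simply does not fit in the 8-byte sll_addr, which is why old arping either corrupts its data segment or (with the bind-time check) gets a clean send failure.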
Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface
At 12:57 PM 8/12/2005, Hal Rosenstock wrote: Using old arping on IPoIB will get the error on the sendto as the hardware type is not available at bind time. Okay, that's a feature then, instead of Bus Error - core dumped when 20 bytes land on top of 8, they'll get a send failure. :-) Tom.
Re: [openib-general] mapping between IP address and device name
At 05:34 PM 6/27/2005, Roland Dreier wrote: I'm not sure I understand this. At best, ATS can give you back a list of IPs. How do you decide which one to check against the exports? Any or all of them. Exports is a fairly simple access list, and membership by the client is all that's required. It supports wildcards as well as single address entries. Here's the example from the Linux manpage:
# sample /etc/exports file
/ master(rw) trusty(rw,no_root_squash)
/projects proj*.local.domain(rw)
/usr *.local.domain(ro) @trusted(rw)
/home/joe pc001(rw,all_squash,anonuid=150,anongid=100)
/pub (ro,insecure,all_squash)
See the wildcards? If any of the machine's IPs matches one, that line yields true. Also of course, even the non-wildcards can expand to a list of addresses; in the first line master is a single host, and any of its IP addresses is eligible for a match. In a pure IP world, every packet from a multihomed client carries a source IP address. So a server can use getpeername() to determine which address a client is connecting from. This is fundamentally different from ATS. I don't understand. ATS allows each incoming connection to map to one or more IP addresses, effectively supporting getpeername() on the IB QP. DAPL passes this address up to the consumer in the connection indication via the cr_param's ia_address_ptr. The consumer doesn't invoke ATS directly, nor would it want to. In the NFS server case, it just needs to run this address down the exports list, the same way it would for a TCP connection or UDP datagram. Tom.
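The "any or all of them" access check described above reduces to a small matching loop. This is an illustrative sketch, not the real nfs-utils parser: `client_matches` is a hypothetical helper, and it uses shell-style wildcard matching in the spirit of the exports patterns shown.

```python
import fnmatch

def client_matches(export_patterns, client_names):
    # A client is allowed if ANY of its names/addresses matches ANY
    # pattern on the export line -- membership is all that's required.
    return any(fnmatch.fnmatch(name, pat)
               for name in client_names
               for pat in export_patterns)

# The /projects line above: proj*.local.domain(rw).  A multihomed client
# matches as long as one of its names fits the wildcard.
print(client_matches(["proj*.local.domain"],
                     ["proj3.local.domain", "10.0.0.7"]))
```

Real exports handling also deals with netgroups (`@trusted`), CIDR entries, and per-match option flags, but the membership test itself is this simple.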
Re: [openib-general] [PATCH][RFC] nfsordma: initial port of nfsrdma to 2.6 and james'sss kdapl
At 06:25 PM 6/27/2005, Tom Duffy wrote: I have done some initial work to port nfsrdma to 2.6 and to James's kDAPL. This builds now inside the kernel. Tom, thanks for starting this and I'll take a look at your approach. In fact we already have a version working on 2.6.11, the main change merging up to 2.6.12 is having to merge with Chuck's new RPC transport switch. We are definitely willing to GPL the code, I am in the process of getting that approved and putting the result in the pipeline. I am not sure about using the OpenIB repository just yet, because of the dependency on the RPC transport. We don't want to inject the changes in multiple places. Let's caucus offline to figure out how to handle the repository question. It shouldn't be a major issue once we figure out the dependencies. I'll get back to you tomorrow (after taking a look at these patches too). Tom. You will need to follow the kDAPL directions first to put that in your kernel tree, then slap this patch over top of that. So you have 2.6.12 + svn drivers/infiniband + kdapl from james's tree + this patch to get it to build. Oh you will also need to patch the rpc header to get it to build. I think it is time to open up a tree in openib repository. Tom, is netapp willing to GPL this code?
Re: [openib-general] [PATCH][RFC] nfsordma: initial port of nfsrdma to 2.6 and james'sss kdapl
At 11:13 AM 6/28/2005, Tom Duffy wrote: I am sure I got the RPC stuff wrong. I just wanted to make it compile against James's kDAPL and inside the drivers/infiniband directory. Where can I find the 2.6.11 version of the patch? Here: http://troy.citi.umich.edu/~cel/linux-2.6/2.6.11/release-notes.html This is kind of involved, you need to apply some prepatches as well as postpatches. The 2.6.12 is significantly cleaner but it implements a new API, due to Trond's requests. In other words, moving target alert. I have a working tarball against 2.6.11. The two files rdma_transport.c and rdma_marshal.c plus what you're doing in rdma_kdapl.c might just work. But it's not GPL yet. Can you wait a couple of days? Tom.
RE: [openib-general] mapping between IP address and device name
At 03:10 AM 6/26/2005, Itamar Rabenstein wrote: But the ATS will not solve the problem of many to one. What will the nfs module do if the result from the ATS is a list of IPs, only one of which has permission to the nfs? ATS can't tell you which is the source IP. The NFS server exports will function just fine in such a case. This is no different from any other multihomed client, and /etc/exports can be configured appropriately. What wouldn't be useful would be to use MAC addresses (GIDs) for mounting, exports, etc. Can you imagine administering a network where hardware addresses were the only naming? No sysadmin would even entertain such an idea. Tom.
Re: [openib-general] mapping between IP address and device name
At 01:31 PM 6/23/2005, Roland Dreier wrote: James wrote: kDAPL uses this feature to provide the passive side of a connection with the IP address of the remote peer. kDAPL consumers can use this information as a weak authentication mechanism. This seems so weak as to be not useful, and rather expensive to boot. To implement this, a system receiving a connection request would have to perform an SA query to map the remote LID back to a GuidInfo record, and then for each GID attached to the remote LID, somehow retrieve the set of IP addresses configured for that GID (assuming that is somehow even possible). Yes, it's weak, but it's needed. A good example is the NFS server's exports function. For the last 20 or so years, NFS servers have had a table which assigns access rights to filesystems by IP address, for example restricting access, making it r/o, etc., for certain classes of client. (man exports for the gory details.) The NFS daemons inspect the peer address of incoming connections and requests to compare them against this list. When the endpoint is a socket, they can simply use getpeername() and a DNS op. When it's an IB endpoint (without IPoIB or SDP), what can they use? The requirement is that there needs to be a way to track a connection back to a traditional hostname and/or address. Today in the Linux NFS/RDMA work we use ATS to provide the getpeername() function. There are stronger authentication techniques NFS can use of course. But the vast majority of NFS users don't bother and just stuff DNS names into their exports. Replacing these with GIDs is not acceptable (just try asking a sysadmin if he or she wants to put mac addresses in this file!). Tom.
Re: [openib-general] mapping between IP address and device name
At 12:19 PM 6/24/2005, Roland Dreier wrote: It seems far preferable to me to just define the wire protocol of NFS/RDMA for IB such that a client passes its IP address as part of the connection request. This scheme was used for SDP to avoid precisely the complications that we're discussing now. But that's totally and completely insecure. The goal of /etc/exports is to place at least part of the client authentication in the network rather than the supplied credentials. NFS has quite enough of a history with AUTH_SYS to prove the issues there. Some of the exports options (e.g. the *_squash ones) are specifically because of this. I don't care about ATS either, by the way. I'm looking for an interoperable alternative. Tom.
Re: [openib-general] mapping between IP address and device name
At 01:02 PM 6/24/2005, Jay Rosser wrote: On the subject of NFS/RDMA, what is the IB ServiceID space that is used? If I recall correctly, I have seen simply the value 2049 (i.e. the standard TCP/UDP port number) used in some implementations (i.e. 00 00 00 00 00 00 20 49). Is there a mapping onto an IB ServiceID defined? We aren't currently using the portmapper to discover the serviceid that the NFS/RDMA server is listening on. Brent Callaghan chose serviceid 2049 as a convenience in Sun's first implementation, and so far it has stuck. Theoretically the server can listen on any endpoint it chooses, this is how NFS/TCP and NFS/UDP work. But typically all servers use the well known port. It's probably a good idea to define a better default mapping. Tom.
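A small sketch makes the encoding ambiguity in the quoted bytes visible. Note that "00 00 00 00 00 00 20 49" is the decimal digits of 2049 written into hex nibbles (i.e. the value 0x2049), which is not the same 64-bit integer 2049; whether any given implementation intended one or the other is an assumption here, illustrated for both readings.

```python
import struct

PORT = 2049  # the well-known NFS port number

# Reading 1: the port as a big-endian 64-bit integer.
as_integer = struct.pack(">Q", PORT)          # 00 00 00 00 00 00 08 01

# Reading 2: the decimal digits "2049" placed into hex nibbles,
# matching the "00 .. 00 20 49" bytes quoted in the message.
as_digit_style = struct.pack(">Q", int(str(PORT), 16))

print(as_integer.hex(), as_digit_style.hex())
```

Defining a proper default mapping would pin down exactly this kind of detail.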
Re: [openib-general] mapping between IP address and device name
At 12:42 PM 6/24/2005, Roland Dreier wrote: Thomas wrote: But that's totally and completely insecure. The goal of /etc/exports is to place at least part of the client authentication in the network rather than the supplied credentials. NFS has quite enough of a history with AUTH_SYS to prove the issues there. Some of the exports options (e.g. the *_squash ones) are specifically because of this. ATS is completely insecure too, right? A client can create any old service record in the subnet administrator's database and claim that its GID has whatever IP address it wants. As I said - I am not attached to ATS. I would welcome an alternative. But in the absence of one, I like what we have. Also, I do not want to saddle the NFS/RDMA transport with carrying an IP address purely for the benefit of a missing transport facility. After all NFS/RDMA works on iWARP too. Tom.
Re: [openib-general] mapping between IP address and device name
At 01:30 PM 6/24/2005, Roland Dreier wrote: Thomas wrote: But in the absence of one, I like what we have. Also, I do not want to saddle the NFS/RDMA transport with carrying an IP address purely for the benefit of a missing transport facility. After all NFS/RDMA works on iWARP too. I'm not sure I understand this objection. We wouldn't be saddling the transport with anything -- simply specifying in the binding of NFS/RDMA to IB that certain information is carried in the private data fields of the CM messages used to establish a connection. Clearly iWARP would use its own mechanism for providing the peer address. This would be exactly analogous to the situation for SDP -- obviously SDP running on iWARP does not use the IB CM to exchange IP address information in the same way that SDP over IB does. Oh - I thought you meant that NFS/RDMA should have a HELLO message carrying an IP address, like SDP/IB. That's a nonstarter for the reason I mentioned, plus the fact that it links this state to the connection, which might break and require reconnect. In fact, NFSv4 and our Sessions proposal address this, but it doesn't help NFSv3, which is the predominant use today. On the other hand, placing mandatory content in the CM exchange brings in a whole different raft of interoperability questions, as James mentioned earlier. For better or for worse, the ATS approach is easily administered and does not impact any protocol layers outside of its own. I think of it as ARP for IB. Tom.
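The SDP-style alternative under discussion -- carry the peer's IP address in the CM REQ private data -- can be sketched as follows. The layout here (a version byte, an address-family byte, and 16 address bytes with IPv4 zero-padded) is invented purely for illustration; it is not the SDP hello format or any specified NFS/RDMA binding.

```python
import ipaddress
import struct

def pack_private_data(ip: str) -> bytes:
    # Hypothetical private-data layout: version, family, 16 address bytes.
    addr = ipaddress.ip_address(ip)
    raw = addr.packed.rjust(16, b"\x00")   # left-pad IPv4 to 16 bytes
    return struct.pack(">BB", 1, addr.version) + raw

pd = pack_private_data("192.0.2.1")
print(len(pd), pd.hex())
```

The interoperability concern in the message is exactly that two peers must agree on such a layout out of band, which is what makes mandatory CM content a protocol question rather than a local implementation detail.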
Re: [openib-general] A new simple ulp (SPTS)
At 12:43 PM 6/22/2005, Jeff Carr wrote: On 06/21/2005 12:50 PM, Roland Dreier wrote: What happens if you try replacing the send_flags line with the one you have commented out? +// send_wr.send_flags = IB_SEND_SIGNALED; Thanks, you are correct. IB_SEND_SIGNALED gives me the behavior I was expecting. By the way, unsignaled sends can work very well indeed, but you have to be sure to poll for completions at regular intervals. Basically, what you're doing is ensuring that software (mthca) gets control from time to time, either via an interrupt (signaled) or poll (unsignaled). It's quite a challenge to get the polling right, but the reduction in interrupts can be a win. The NFS/RDMA module does this, but it takes the approach of occasionally posting a signaled send. The trick is getting the value of occasionally right. :-) Tom.
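The "occasionally post a signaled send" approach described above can be sketched with a simple counter. This is a pure stand-in for illustration (the `SendRing` class is hypothetical): every Nth work request would carry IB_SEND_SIGNALED, so the provider generates a completion often enough to let the sender reap earlier unsignaled sends and reclaim send-queue slots.

```python
class SendRing:
    """Signal every Nth send so completions arrive periodically."""

    def __init__(self, signal_every=16):
        self.signal_every = signal_every
        self.posted = 0
        self.signaled = []  # which posts would set IB_SEND_SIGNALED

    def post_send(self):
        self.posted += 1
        if self.posted % self.signal_every == 0:
            # In a real verb call this would set the SIGNALED flag;
            # its completion also implies the preceding N-1 completed.
            self.signaled.append(self.posted)

ring = SendRing(signal_every=16)
for _ in range(64):
    ring.post_send()
print(ring.signaled)
```

The tuning tension is visible here: a larger N means fewer interrupts but a longer wait before send-queue resources can be recycled, which is why getting "occasionally" right is the hard part.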
Re: [openib-general] A new simple ulp (SPTS)
At 01:11 PM 6/23/2005, Jeff Carr wrote: I didn't know there was an nfs/rdma module? [EMAIL PROTECTED]:/test/gen2# find . |grep -i nfs [EMAIL PROTECTED]:/test/gen2# Brief intro to the NFS/RDMA work: The current client version is on Sourceforge, supporting various flavors of 2.4. I'm preparing a new release for 2.6. http://sourceforge.net/projects/nfs-rdma The client needs a patched version of Sunrpc in order to hook in as an NFS transport. This patch is being used by the NFS/IPv6 project as well, and it's planned for future kernel.org integration. http://troy.citi.umich.edu/~cel/linux-2.6/2.6.12/release-notes.html (There are also patch sets for earlier Linux revs.) The server version for 2.6 is under development at UMich CITI; scroll down to the Documents and Code sections of this page: http://www.citi.umich.edu/projects/rdma/ Finally, NetApp and Sun have demonstrated implementations, and Solaris 10 has support for it. We look forward to running NFS/RDMA over OpenIB, when its kDAPL is ready. Tom.
[openib-general] NFS/RDMA/kDAPL
At 03:11 PM 6/23/2005, Tom Duffy wrote: On Thu, 2005-06-23 at 13:54 -0400, Talpey, Thomas wrote: We look forward to running NFS/RDMA over OpenIB, when its kDAPL is ready. Let's get NFSoRDMA going sooner rather than later on James's kDAPL. I think this will both be a good test case as well as a vehicle to demonstrate the functionality of kDAPL. What can I do to help? Can the Linux OpenIB client connect to Solaris 10? If so, we might consider using Sol10's NFS/RDMA server. If not, we'll have to use a NetApp filer (which is fine by me but maybe hard for you), because the CITI NFS/RDMA server is accepting connections but not yet processing RPCs. Tom.
Re: [openib-general] Re: [PATCH] kDAPL: remove dapl_os_assert()
At 05:09 PM 6/23/2005, Grant Grundler wrote: On Thu, Jun 23, 2005 at 04:55:38PM -0400, James Lentini wrote: My argument in favor of retaining them is that dapl_evd_wc_to_event() will crash if the cookie is NULL. A BUG_ON will detect this situation ... The tombstone from the data page fault panic should make this nearly as obvious as the BUG_ON(). Yes, I agree a BUG_ON() is completely obvious. But it's also burning CPU cycles for something that ... Not to argue one way or the other, but if this cookie is NULL, whose fault would that be? I think that should govern whether it's common enough to warrant BUG_ON or rare enough to warrant a straight crash. I would suggest BUG_ON only if it were possible to trigger this from a loadable module, etc. Tom.