RE: Status of ummunot branch?
Here are a few more clarifications:

1) ODP MRs can cover address ranges that do not have a mapping at registration time. This means that MPI can register in advance, say, the lower GBs of the address space, covering malloc's primary arena. Thus, there is no need to adjust to each increase in sbrk(). Similarly, you can register the stack region up to the maximum stack size. The stack can grow and shrink, and ODP will always use the current mapping.

2) Virtual addresses covered by an ODP MR must have a valid mapping when they are accessed (during send/receive WQE processing or as a target of an RDMA/atomic operation). So, Jeff, the only thing you need to make sure is that you don't free() a buffer that you posted but haven't yet got a completion for - but I guess that this is something that you already do... :)

For example, in the following scenario:
a. reg_mr(first GB of the address space)
b. p = malloc()
c. post_send(p)
d. poll for completion
e. free(p)
f. p = malloc()
g. post_send(p)
h. poll for completion
i. free(p)

(c) may incur a page fault (if not pre-fetched or faulted in by another thread). (e) happens after the completion, so it is guaranteed that (c), when processed by HW, uses the correct application buffer with the current virt-to-phys mapping (at HW access time). The reallocation may or may not change the virtual-to-physical mappings. The memory may or may not be paged out (ODP does not hold a reference on the page). In any case, when (g) is processed, it always uses the current mapping.

--Liran

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Saturday, June 08, 2013 2:58 AM
To: Jeff Squyres (jsquyres)
Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
Subject: Re: Status of ummunot branch?

On Fri, Jun 07, 2013 at 10:59:43PM +, Jeff Squyres (jsquyres) wrote:
> I don't think this covers other memory regions, like those added via
> mmap, right?
> We talked about this at the MPI Forum this week; it doesn't seem like
> ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the problem, and fixes only one problem. You do need to cache registrations, and all the tuning parameters (how much do I cache, how long do I hold it for, etc, etc) still apply. What goes away (is fixed) is the need for intercepts and the need to purge address space from the cache because the backing registration has become non-coherent/invalid. Registrations are always coherent/valid with ODP.

This cache, and the associated optimization problem, can never go away. With a 'register all of memory' semantic the cache can move into the kernel, but the performance implications and overheads are all still present, just migrated.

> 2. MPI still has to intercept (at least) munmap().

Curious to know what for? If you want to prune registrations (ie to reduce memory footprint), this can be done lazily at any time (eg in a background thread or something). Read /proc/self/maps and purge all the registrations pointing to unmapped memory. Similar to garbage collection. There is no harm in keeping a registration for a long period, except for the memory footprint in the kernel.

> 3. Having mmap/malloc/etc. return new memory that may already be
> registered because of a prior memory registration and subsequent
> munmap/free/etc. is just plain weird. Worse, if we re-register it,
> ref counts could go such that the actual registration will never
> actually expire until the process dies (which could lead to processes
> with abnormally large memory footprints, because they never actually
> let go of memory because it's still registered).

This is entirely on the registration cache implementation to sort out; there are lots of performance/memory trade-offs. It is only weird when you think about it in terms of buffers. Memory registration has to do with address space, not buffers.

What MPI wants is: 1.
verbs for ummunotify-like functionality; 2. non-blocking memory registration verbs; poll the cq to know when it has completed.

To me, ODP with an additional 'register all address space' semantic, plus an asynchronous prefetch, does both of these for you:

1. ummunotify functionality and caching is now in the kernel, under ODP. RDMA access to an 'all of memory' registration always does the right thing.

2. asynchronous prefetch (eg as a work request) triggers ODP and kernel actions to ready a subset of memory for RDMA, including all the work that memory registration does today (get_user_pages, COW break, etc).

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to do replication right with SRP or remote storage?
On 08.06.2013 04:31, Bruce McKenzie wrote:
> Hi Bart. Any advice on using this fix with MD RAID-1? A guide or site
> you know of? I've compiled Ubuntu 13.04 to kernel 3.6.11 with OFED 2
> from Mellanox, and it works OK; performance is a little better with
> SRP. Some packages don't seem to work, i.e. srptools and IB-diags;
> some commands fail, which looks like those tools haven't been tested
> with 3.6.11, or updated.
>
> I've tried using DRBD with Pacemaker, STONITH etc. (which also works
> on 3.6.11), but it only works with iSCSI over IPoIB, i.e. a virtual
> NIC with mounted LVM, using SCST to present file I/O, and Pacemaker
> to fail over the VIP to node 2. But OFED 2 doesn't seem to support
> SDP, so I have to replicate via IPoIB, which is slow even over a
> dedicated IPoIB NIC, i.e. DRBD replication is 200 MB/s.
>
> Any help or direction would be appreciated.
>
> Cheers
> Bruce McKenzie

(changed subject into something I think is more appropriate)

Hi Bruce,

thanks for contacting me privately in parallel. I can answer the replication questions. In order to share the experience with others, I reply here again.

Please evaluate the ib_srp fixes from Bart and from me as well, and send us your feedback! We are still negotiating how to do fast IO failing and automatic reconnect right, together with the Mellanox SRP guys Sagi Grimberg, Vu Pham, Oren Duer and others. You need these patches in order to fail IO to the upper layers in the time you want, so that dm-multipath can fail over the path first while ib_srp continuously tries to reconnect the failed path. If the other path also fails, then very likely the storage server is down, so you fail the IO further up to MD RAID-1 so that it can fail that replica.

For replication, the last slide of my talk at LinuxTag this year could be interesting for you:
http://www.slideshare.net/SebastianRiemer/infini-band-rdmaforstoragesrpvsiser-21791250
That slide caused a lot of discussion afterwards.
The thing is that replication of remote storage is best done on the initiator (a single kernel manages all replicas, parallel network paths, symmetric latency, ...). The bad news is that replication of virtual/remote storage with MD RAID-1 is a use case which basically works but has some issues which Neil Brown doesn't want to have fixed in mainline. So you need a kernel developer for some cool features like e.g. safe VM live migration.

Perhaps I should collect all the guys who require MD RAID-1 for remote storage replication in order to put some pressure on Neil. At least some things in this use case are easy to merge with mainline behavior, like e.g. letting MD assembly scale right (mdadm searches the whole of /dev without a need). I was surprised that he will make the data offset settable again, so that you can set it to 4 MiB (1 LV extent). We already have that via custom patches on top of mdadm 3.2.6.

DRBD with iSCSI is already crap. 200 MB/s with IB sounds familiar. I had 250 MB/s in a primary/secondary setup with DRBD during evaluation. That's store-and-forward writes to the secondary, which is slow. Chained network paths! With Ethernet that hurts even more. People report 70 MB/s with that. I've taught them how to use blktrace, and it became obvious that they were trapped in latency.

I can also recommend Vasiliy Tolstov v.tols...@selfip.ru. He also uses SRP with MD RAID-1. He could convince Neil to fix the MD data offset. Open source is all about the right allies.

Cheers,
Sebastian
Re: How to do replication right with SRP or remote storage?
On 10.06.2013 14:44, Bart Van Assche wrote:
> On 06/10/13 14:05, Sebastian Riemer wrote:
>> Perhaps I should collect all the guys who require MD RAID-1 for
>> remote storage replication in order to put some pressure on Neil.
>
> If I remember correctly, one of the things Neil is trying to explain
> to md users is that when md is used without a write-intent bitmap
> there is a risk of triggering a so-called "write hole" after a power
> failure?

I'm not sure. I haven't seen something like this on the mailing list. Do you have a reference from the archives? I think this is handled by superblock writes in the correct order by now.

The main reason for the write-intent bitmap remains, to my knowledge, that without it you need a full resync if a component device is down for a short moment in time. It becomes faulty. If you know that there can't be a hardware issue (e.g. virtual storage), you can remove the faulty device and re-add it to the array. If a device was faulty, then it assembles again. There is an error counter in /sys/block/mdX/md/ sysfs and a maximum read error count (usually 20) after which the faulty device doesn't assemble again:

/sys/block/mdX/md/dev-Y/errors
/sys/block/mdX/md/max_read_errors

Cheers,
Sebastian
Re: Status of ummunot branch?
On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe jguntho...@obsidianresearch.com wrote:

>> We talked about this at the MPI Forum this week; it doesn't seem
>> like ODP fixes any MPI problems.
>
> ODP without 'register all address space' changes the nature of the
> problem, and fixes only one problem.

I agree that pushing all registration issues out of the application and (somewhere) into the verbs stack would be a nice solution.

> You do need to cache registrations, and all the tuning parameters
> (how much do I cache, how long do I hold it for, etc, etc) still
> apply. What goes away (is fixed) is the need for intercepts and the
> need to purge address space from the cache because the backing
> registration has become non-coherent/invalid. Registrations are
> always coherent/valid with ODP.
>
> This cache, and the associated optimization problem, can never go
> away. With a 'register all of memory' semantic the cache can move
> into the kernel, but the performance implications and overheads are
> all still present, just migrated.

Good summary; and you corrected some of my mistakes -- thanks.

That being said, everyone I've talked to about ODP finds it very, very strange that the kernel would keep memory registrations around for memory that is no longer part of a process. Not only does it lead to the "new memory is magically already registered" semantic that I find weird, it's just plain *odd* for the kernel to maintain state for something that doesn't exist any more. It feels dirty.

Sidenote: I was just informed today that the current way MPI implementations implement registration cache coherence (glibc malloc hooks) has been deprecated and will be removed from glibc (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html). This really puts on the pressure to find a new / proper solution.

> What MPI wants is:
> 1. verbs for ummunotify-like functionality
> 2.
> non-blocking memory registration verbs; poll the cq to know when it
> has completed
>
> To me, ODP with an additional 'register all address space' semantic,
> plus an asynchronous prefetch does both of these for you.
>
> 1. ummunotify functionality and caching is now in the kernel, under
> ODP. RDMA access to an 'all of memory' registration always does the
> right thing.

"Register all address space" is the moral equivalent of not having userspace registration, so let's talk about it in those terms. Specifically, there's a subtle difference between:

a) telling verbs to register (0...2^64)
   -- Which is weird because it tells verbs to register memory that
      isn't in my address space

b) telling verbs that the app doesn't want to handle registration
   -- How that gets implemented is not important (from userspace's
      point of view) -- if the kernel chooses to implement that by
      registering non-existent memory, that's the kernel's problem

I guess I'm arguing that registering non-existent memory is not the Right Thing.

Regardless of what solution is devised for registered memory management (ummunotify, ODP, or something else), a non-blocking verb for registering memory would still be a Very Useful Thing.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
RE: Status of ummunot branch?
-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jeff Squyres (jsquyres)
Sent: Monday, June 10, 2013 5:50 PM
To: Jason Gunthorpe
Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
Subject: Re: Status of ummunot branch?

> On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe
> jguntho...@obsidianresearch.com wrote:
>
>>> We talked about this at the MPI Forum this week; it doesn't seem
>>> like ODP fixes any MPI problems.
>>
>> ODP without 'register all address space' changes the nature of the
>> problem, and fixes only one problem.
>
> I agree that pushing all registration issues out of the application
> and (somewhere) into the verbs stack would be a nice solution.
>
>> You do need to cache registrations, and all the tuning parameters
>> (how much do I cache, how long do I hold it for, etc, etc) still
>> apply. What goes away (is fixed) is the need for intercepts and the
>> need to purge address space from the cache because the backing
>> registration has become non-coherent/invalid. Registrations are
>> always coherent/valid with ODP.
>>
>> This cache, and the associated optimization problem, can never go
>> away. With a 'register all of memory' semantic the cache can move
>> into the kernel, but the performance implications and overheads are
>> all still present, just migrated.
>
> Good summary; and you corrected some of my mistakes -- thanks.
>
> That being said, everyone I've talked to about ODP finds it very,
> very strange that the kernel would keep memory registrations around
> for memory that is no longer part of a process. Not only does it lead
> to the "new memory is magically already registered" semantic that I
> find weird, it's just plain *odd* for the kernel to maintain state
> for something that doesn't exist any more. It feels dirty.
> Sidenote: I was just informed today that the current way MPI
> implementations implement registration cache coherence (glibc malloc
> hooks) has been deprecated and will be removed from glibc
> (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html). This
> really puts on the pressure to find a new / proper solution.
>
>> What MPI wants is:
>> 1. verbs for ummunotify-like functionality
>> 2. non-blocking memory registration verbs; poll the cq to know when
>> it has completed
>>
>> To me, ODP with an additional 'register all address space' semantic,
>> plus an asynchronous prefetch does both of these for you.
>>
>> 1. ummunotify functionality and caching is now in the kernel, under
>> ODP. RDMA access to an 'all of memory' registration always does the
>> right thing.
>
> "Register all address space" is the moral equivalent of not having
> userspace registration, so let's talk about it in those terms.
> Specifically, there's a subtle difference between:
>
> a) telling verbs to register (0...2^64)
>    -- Which is weird because it tells verbs to register memory that
>       isn't in my address space

Another way to look at it is "specify IO access permissions for address space ranges". This could be useful to implement a buffer pool to be used for a specific MR only, yet still map/unmap memory within this pool on the fly to optimize physical memory utilization. In this case, you would provide smaller ranges than 2^64...

> b) telling verbs that the app doesn't want to handle registration
>    -- How that gets implemented is not important (from userspace's
>       point of view) -- if the kernel chooses to implement that by
>       registering non-existent memory, that's the kernel's problem
>
> I guess I'm arguing that registering non-existent memory is not the
> Right Thing.
>
> Regardless of what solution is devised for registered memory
> management (ummunotify, ODP, or something else), a non-blocking verb
> for registering memory would still be a Very Useful Thing.
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
Re: Status of ummunot branch?
On Mon, Jun 10, 2013 at 02:49:24PM +, Jeff Squyres (jsquyres) wrote:

> On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe
> jguntho...@obsidianresearch.com wrote:
>
>>> We talked about this at the MPI Forum this week; it doesn't seem
>>> like ODP fixes any MPI problems.
>>
>> ODP without 'register all address space' changes the nature of the
>> problem, and fixes only one problem.
>
> I agree that pushing all registration issues out of the application
> and (somewhere) into the verbs stack would be a nice solution.

Well, it creates a mess in another sense, because now you've lost context. When your MPI goes to do a 1-byte send, the kernel may well prefetch a few megabytes of page tables, whereas an implementation in userspace still has the context and can say "no, I don't need that". Maybe a prefetch WR can restore the lost context, dunno...

> That being said, everyone I've talked to about ODP finds it very,
> very strange that the kernel would keep memory registrations around
> for memory that is no longer part of a process. Not only does it

MRs are badly named. They are not 'memory registrations'. They are 'address registrations'. Don't conflate address === memory in your head, and then it doesn't seem weird :)

The memory the address space points to is flexible. The address space is tied to the lifetime of the process. It doesn't matter if there is no memory mapped to the address space; the address space is still there.

Liran had a good example: you can register address space and then use mmap/munmap/MAP_FIXED to mess around with where it points to. A practical example of using this would be to avoid the need to send scatter buffer pointers to the remote. The remote writes into a memory ring, and the ring is made 'endless' by clever use of remapping.

> "Register all address space" is the moral equivalent of not having
> userspace registration, so let's talk about it in those terms.
> Specifically, there's a subtle difference between:
>
> a) telling verbs to register (0...2^64)
> b) telling verbs that the app doesn't want to handle registration

I agree, a verb to do 'B' is a cleaner choice than trying to cram this kind of API into A...

Jason
RE: [PATCH v5 01/28] rdma/cm: define native IB address
Define AF_IB and sockaddr_ib to allow the rdma_cm to use native IB addressing.

Signed-off-by: Sean Hefty sean.he...@intel.com
---
 include/linux/socket.h |    2 +
 include/rdma/ib.h      |   89
 2 files changed, 91 insertions(+), 0 deletions(-)
 create mode 100644 include/rdma/ib.h

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 2b9f74b..68f7120 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -167,6 +167,7 @@ struct ucred {
 #define AF_PPPOX	24	/* PPPoX sockets		*/
 #define AF_WANPIPE	25	/* Wanpipe API Sockets		*/
 #define AF_LLC		26	/* Linux LLC			*/
+#define AF_IB		27	/* Native InfiniBand address	*/
 #define AF_CAN		29	/* Controller Area Network	*/
 #define AF_TIPC	30	/* TIPC sockets			*/
 #define AF_BLUETOOTH	31	/* Bluetooth sockets		*/
@@ -211,6 +212,7 @@ struct ucred {
 #define PF_PPPOX	AF_PPPOX
 #define PF_WANPIPE	AF_WANPIPE
 #define PF_LLC		AF_LLC
+#define PF_IB		AF_IB
 #define PF_CAN		AF_CAN
 #define PF_TIPC	AF_TIPC
 #define PF_BLUETOOTH	AF_BLUETOOTH

Are there any objections from the network maintainers to adding these definitions? The rest of the changes in this series are restricted to the RDMA subsystem. Currently, the RDMA stack connects using IP addresses, which must be mapped to IB addresses. This change allows the RDMA stack to establish connections using native IB addresses.

- Sean