Re: srp-ha backport
Bart Van Assche writes:
> On 11/20/12 05:04, Vasiliy Tolstov wrote:
> > Thanks for this backport! I have some problem under SLES 11 SP2 (kernel
> > 3.0.42-0.7-xen): when I shut down the SRP target (reboot one SAS server),
> > multipath -ll does not respond. If I give multipath and SRP identical
> > dev_loss_tmo and fast_io_fail_tmo, nothing changes. multipath -ll
> > unblocks only when the server comes back up.
>
> That's strange. After the fast_io_fail_tmo timer has fired, multipath -ll
> should unblock independent of the state of the SRP target.
>
> Bart.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@...
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Hi Bart,

Any advice on using this fix with MD RAID 1? A guide or site you know of?

I've built Ubuntu 13.04 with kernel 3.6.11 and OFED 2 from Mellanox, and it works OK; performance is a little better with SRP. Some packages don't seem to work, e.g. srptools and the IB diagnostics: some commands fail, which suggests those tools haven't been tested against (or updated for) 3.6.11.

I've tried DRBD with Pacemaker, STONITH, etc. (which also works on 3.6.11), but it only works with iSCSI over IPoIB, i.e. a virtual NIC with mounted LVM, using SCST to present file I/O and Pacemaker to fail the VIP over to node 2. But OFED 2 doesn't seem to support SDP, so replication has to go via IPoIB, which is slow even over a dedicated IPoIB NIC: DRBD replication runs at about 200 MB/s.

Any help or direction would be gratefully received.

Cheers,
Bruce McKenzie
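For anyone hitting the same hang: the two timeouts Vasiliy mentions are set in two places. A sketch of the usual knobs follows; the srp_remote_ports sysfs path, the port-1:1 name, and the 5 s / 60 s values are assumptions that depend on the kernel build and HCA, so check your own /sys tree first.

```shell
# Assumed sysfs locations; adjust port-1:1 to match your setup.
# Fail pending I/O 5 s after the rport goes down...
echo 5 > /sys/class/srp_remote_ports/port-1:1/fast_io_fail_tmo
# ...and remove the rport after 60 s.
echo 60 > /sys/class/srp_remote_ports/port-1:1/dev_loss_tmo

# Matching values on the multipath side go in /etc/multipath.conf:
#   defaults {
#       fast_io_fail_tmo 5
#       dev_loss_tmo     60
#   }
```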
Re: Status of "ummunot" branch?
On Fri, Jun 07, 2013 at 10:59:43PM +, Jeff Squyres (jsquyres) wrote:
> > I don't think this covers other memory regions, like those added via
> > mmap, right?
>
> We talked about this at the MPI Forum this week; it doesn't seem
> like ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the problem, and fixes only one problem. You still need to cache registrations, and all the tuning parameters (how much do I cache, how long do I hold it for, etc.) still apply. What goes away (is fixed) is the need for intercepts, and the need to purge address space from the cache because the backing registration has become non-coherent/invalid. Registrations are always coherent/valid with ODP.

This cache, and the associated optimization problem, can never go away. With a 'register all of memory' semantic the cache can move into the kernel, but the performance implications and overheads are all still present, just migrated.

> 2. MPI still has to intercept (at least) munmap().

Curious to know what for? If you want to prune registrations (i.e. to reduce memory footprint), this can be done lazily at any time (e.g. in a background thread or something): read /proc/self/maps and purge all the registrations pointing to unmapped memory, similar to garbage collection. There is no harm in keeping a registration for a long period, except for the memory footprint in the kernel.

> 3. Having mmap/malloc/etc. return "new" memory that may already be
> registered because of a prior memory registration and subsequent
> munmap/free/etc. is just plain weird. Worse, if we re-register it,
> ref counts could go such that the actual registration will never
> actually expire until the process dies (which could lead to
> processes with abnormally large memory footprints, because they
> never actually let go of memory because it's still registered).

This is entirely on the registration cache implementation to sort out; there are lots of performance/memory trade-offs.
It is only weird when you think about it in terms of buffers; memory registration has to do with address space, not buffers.

> What MPI wants is:
>
> 1. verbs for ummunotify-like functionality
> 2. non-blocking memory registration verbs; poll the CQ to know when it
> has completed

To me, ODP with an additional 'register all address space' semantic, plus an asynchronous prefetch, does both of these for you:

1. ummunotify functionality and caching is now in the kernel, under ODP. RDMA access to an 'all of memory' registration always does the right thing.

2. An asynchronous prefetch (e.g. as a work request) triggers ODP and kernel actions to ready a subset of memory for RDMA, including all the work that memory registration does today (get_user_pages, COW break, etc.).

Jason
Re: Status of "ummunot" branch?
On Jun 6, 2013, at 4:33 PM, Jeff Squyres (jsquyres) wrote:

> I don't think this covers other memory regions, like those added via
> mmap, right?

We talked about this at the MPI Forum this week; it doesn't seem like ODP fixes any MPI problems.

1. MPI still has to have a memory registration cache, because ibv_reg_mr(0...sbrk()) doesn't cover the stack or mmap'ed memory, etc.

2. MPI still has to intercept (at least) munmap().

3. Having mmap/malloc/etc. return "new" memory that may already be registered because of a prior memory registration and subsequent munmap/free/etc. is just plain weird. Worse, if we re-register it, ref counts could go such that the actual registration will never actually expire until the process dies (which could lead to processes with abnormally large memory footprints, because they never actually let go of memory because it's still registered).

4. Even if MPI checks the value of sbrk() and re-registers (0...sbrk()) when sbrk() increases, this would seem to create a lot of work for the kernel, which is both slow and synchronous. Example:

       a = malloc(5GB);
       MPI_Send(a, 1, MPI_CHAR, ...); // MPI sends 1 byte

   The MPI_Send of 1 byte will then have to pay the cost of registering 5 GB of new memory.

Unless we understand this wrong (and there's definitely a chance that we do!), it doesn't sound like ODP solves anything for MPI. Especially since HPC applications almost never swap (in fact, swap is usually disabled in HPC environments).

What MPI wants is:

1. verbs for ummunotify-like functionality
2. non-blocking memory registration verbs; poll the CQ to know when it has completed

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[GIT PULL] please pull infiniband.git
Hi Linus,

Please pull from

  git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git tags/rdma-for-linus

InfiniBand fixes for 3.10-rc:
- qib RCU/lockdep fix
- iser device removal fix, plus doc fixes

Mike Marciniszyn (1):
      IB/qib: Fix lockdep splat in qib_alloc_lkey()

Or Gerlitz (2):
      IB/iser: Add Mellanox copyright
      MAINTAINERS: Add entry for iSCSI Extensions for RDMA (iSER) initiator

Roi Dayan (1):
      IB/iser: Fix device removal flow

Roland Dreier (1):
      Merge branches 'iser' and 'qib' into for-next

 MAINTAINERS                                  | 10 ++
 drivers/infiniband/hw/qib/qib_keys.c         |  2 +-
 drivers/infiniband/ulp/iser/iscsi_iser.c     |  1 +
 drivers/infiniband/ulp/iser/iscsi_iser.h     |  1 +
 drivers/infiniband/ulp/iser/iser_initiator.c |  1 +
 drivers/infiniband/ulp/iser/iser_memory.c    |  1 +
 drivers/infiniband/ulp/iser/iser_verbs.c     | 16 +---
 7 files changed, 24 insertions(+), 8 deletions(-)
Re: [PATCH 1/4] RDMA/cma: Add IPv6 support for iWARP.
Please use local variables for the sockaddr_in{,6} pointers instead of casting over and over again. Thanks.
Re: [PATCH] mm: Revert pinned_vm braindamage
On Fri, 7 Jun 2013, Peter Zijlstra wrote:

> However you twist this; your patch leaves an inconsistent mess. If you
> really think they're two different things then you should have
> introduced a second RLIMIT_MEMPIN to go along with your counter.

Well, continuing to repeat myself: I worked based on agreed-upon characteristics of mlocked pages. The patch was there to address brokenness in the mlock accounting, because someone naively assumed that pinning = mlock.

> I'll argue against such a thing; for I think that limiting the total
> amount of pages a user can exempt from paging is the far more
> useful/natural thing to measure/limit.

Pinned pages are exempted by the kernel. A device driver or some other kernel process (reclaim, page migration, I/O, etc.) increases the page count. There is currently no consistent accounting for pinned pages. The vm_pinned counter was introduced to allow the largest pinners to track what they did.

> > I said that the use of a PIN page flag would allow correct accounting
> > if one wanted to interpret the limit the way you do.
>
> You failed to explain how that would help any. With a pin page flag you
> still need to find the mm to unaccount crap from. Also, all user
> controlled address space ops operate on vmas.

Pinning is kernel controlled...

> > Page migration is not a page fault?
>
> It introduces faults; what happens when a process hits the migration
> pte? It gets a random delay and eventually services a minor fault to the
> new page.

Ok, but this is similar to reclaim and other such things that unmap pages.

> At which point the saw will have cut your finger off (going with the
> most popular RT application ever -- that of a bandsaw and a laser beam).

I am pretty confused by your newer notion of RT. I thought RT was about deterministic behavior, even at the cost of higher latency. RT was basically an abused marketing term, referring to the bloating of the kernel with all sorts of fair stuff that slows us down.
What happened to make you work on low-latency stuff? There is some shift that you still need to go through to make that transition. Yes, you would want to avoid reclaim and all sorts of other stuff for low latency. So you disable auto NUMA, defrag, etc. to avoid these things.

> > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > this Linux specific interpretation of mlock() we must introduce new
> > > syscalls that implement the intended mlock() semantics.
> >
> > Intended means Peter's semantics?
>
> No, I don't actually write RT applications. But I've had plenty of
> arguments with RT people when I explained to them what our mlock()
> actually does vs what they expected it to do.

Ok. Guess this is all new to you at this point. I am happy to see that you are willing to abandon your evil ways (although under pressure from your users) and are willing to put the low-latency people now in the RT camp.

> They're not happy. Aside from that; you HPC/HFT minimal latency lot
> should very well appreciate the minimal interference stuff they do
> actually expect.

Sure we do, and we know how to work around the "fair scheduler" and other stuff. But you are breaking the basics of how we do things with your conflation of pinning and mlocking. We do not migrate, and do not allow defragmentation or reclaim, when running low-latency applications. These are non-issues.

> This might well be; and I'm not arguing we remove this. I'm merely
> stating that it doesn't make everybody happy. Also what purpose do HPC
> type applications have for mlock()?

HPC wants to keep pages in memory to avoid eviction. HPC apps are not as sensitive to faults as low-latency apps are; minor faults have traditionally been tolerated there. The lower you get in terms of the latencies required, the more difficult the OS control becomes.
> Here we must disagree I fear; given that mlock() is of RT origin and RT
> people very much want/expect mlock() to do what our proposed mpin() will
> do.

RT is a dirty word for me, given the fairness and bloat issues. Not sure what you mean with that: mlock is a means to keep data in memory, not a magical wand that avoids all OS handling of the page.

> > That cannot be so since mlocked pages need to be migratable.
>
> I'm talking about the proposed mpin() stuff.

Could you write that up in detail? I am not sure how this could work at this point.

> So I proposed most of the machinery that would be required to actually
> implement the syscalls. Except that the IB code stumped me. In
> particular I cannot easily find the userspace address to unpin for
> ipath/qib release paths.
>
> Once we have that we can trivially implement the syscalls.

Why would you need syscalls? Pinning is driver/kernel-subsystem initiated, and therefore the driver can do the pin/unpin calls.

> > Pinning is not initiated by user space but by the kernel. Either
> > temporarily (page count increases are used all over the kernel for
> > this) or for longer time frame (I
Re: [PATCH 0/4] Add IPv6 support for iWARP
Reviewed-by: Steve Wise
Re: [PATCH][TRIVIAL] Add attribute information to SA request error messages
On 6/6/2013 10:35 AM, Line Holen wrote:
> Signed-off-by: Line Holen

Thanks. Applied (fixing one typo).

-- Hal
Re: [PATCH] mm: Revert pinned_vm braindamage
On Thu, Jun 06, 2013 at 06:46:50PM +, Christoph Lameter wrote:
> On Thu, 6 Jun 2013, Peter Zijlstra wrote:
>
> > Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> > process can 'lock' into memory it should very much include pinned pages
> > as well as mlock()ed pages. Neither can be paged.
>
> So we thought that this is the sum of the pages that a process has
> mlocked. Initiated by the process and/or environment explicitly. A user
> space initiated action.

Which we; also, it remains a fact that your changelog didn't mention this change in semantics at all, nor did you CC all affected parties.

However you twist this, your patch leaves an inconsistent mess. If you really think they're two different things then you should have introduced a second RLIMIT_MEMPIN to go along with your counter.

I'll argue against such a thing; for I think that limiting the total amount of pages a user can exempt from paging is the far more useful/natural thing to measure/limit.

> > Since nobody had anything constructive to say about the VM_PINNED
> > approach and the IB code hurts my head too much to make it work I
> > propose we revert said patch.
>
> I said that the use of a PIN page flag would allow correct accounting if
> one wanted to interpret the limit the way you do.

You failed to explain how that would help any. With a pin page flag you still need to find the mm to unaccount crap from. Also, all user-controlled address space ops operate on vmas.

We had VM_LOCKED far before we had the lock page flag, and you cannot replace all VM_LOCKED utility with the page flag either.

> > Once again the rationale; mlock(2) is part of the POSIX Realtime
> > Extension (1003.1b-1993/1003.1i-1995). It states that the specified
> > part of the user address space should stay memory resident until either
> > program exit or a matching munlock() call.
> > This definition basically excludes major faults from happening on the
> > pages -- a major fault being one where IO needs to happen to obtain the
> > page content; the direct implication being that page content must
> > remain in memory.
>
> Exactly, that is the definition.
>
> > Linux has taken this literally and made mlock()ed pages subject to page
> > migration (albeit only for the explicit move_pages() syscall; but it
> > would very much like to make them subject to implicit page migration
> > for the purpose of compaction etc.).
>
> Page migration is not a page fault?

It introduces faults; what happens when a process hits the migration pte? It gets a random delay and eventually services a minor fault to the new page. At which point the saw will have cut your finger off (going with the most popular RT application ever -- that of a bandsaw and a laser beam).

> The ability to move a process completely (including its mlocked
> segments) is important for the manual migration of process memory. That
> is what page migration was made for. If mlocked pages are treated as
> pinned pages then the complete process can no longer be moved from node
> to node.
>
> > This view disregards the intention of the spec; since mlock() is part
> > of the realtime spec the intention is very much that the user address
> > range generate no faults; neither minor nor major -- any delay is
> > unacceptable.
>
> Where does it say that no faults are generated? Don't we generate COW on
> mlocked ranges?

That's under user control. If the user uses fork(), the user can avoid those faults by pre-faulting the pages.

> > This leaves the RT people unhappy -- therefore _if_ we continue with
> > this Linux specific interpretation of mlock() we must introduce new
> > syscalls that implement the intended mlock() semantics.
>
> Intended means Peter's semantics?

No, I don't actually write RT applications.
But I've had plenty of arguments with RT people when I explained to them what our mlock() actually does vs. what they expected it to do. They're not happy. Aside from that; you HPC/HFT minimal-latency lot should very well appreciate the minimal-interference stuff they do actually expect.

> > It was found that there are useful purposes for this weaker mlock(), a
> > rationale to indeed have two sets of syscalls. The weaker mlock() can
> > be used in the context of security -- where we avoid sensitive data
> > being written to disk, and in the context of userspace daemons that are
> > part of the IO path -- which would otherwise form IO deadlocks.
>
> Migratable mlocked pages enable complete process migration between nodes
> of a NUMA system for HPC workloads.

This might well be; and I'm not arguing we remove this. I'm merely stating that it doesn't make everybody happy. Also, what purpose do HPC type applications have for mlock()?

> > The proposed second set of primitives would be mpin() and munpin() and
> > would implement the intended mlock() semantics.
>
> I agree that we need mpin and munpin. But they should not be called
> mlock semantics.

Here we must disagree I fear