Re: srp-ha backport

2013-06-07 Thread Bruce McKenzie



Bart Van Assche  writes:

> 
> On 11/20/12 05:04, Vasiliy Tolstov wrote:
> > Thanks for this backport! I have a problem under SLES 11 SP2 (kernel
> > 3.0.42-0.7-xen): when I shut down the SRP target (reboot one SAS
> > server), multipath -ll does not respond. If I set identical
> > dev_loss_tmo and fast_io_fail_tmo in multipath and SRP, nothing
> > changes. multipath -ll unblocks only when the server comes back up.
> 
> That's strange. After the fast_io_fail_tmo timer has fired, multipath -ll 
> should unblock independently of the state of the SRP target.
> 
> Bart.
> 
Hi Bart.

Any advice on using this fix with MD RAID 1? Is there a guide or site you
know of?

I've compiled kernel 3.6.11 on Ubuntu 13.04 with OFED 2 from Mellanox, and it
works OK; performance is a little better with SRP.  Some packages don't seem
to work, i.e. srptools and IB-diags: some commands fail, which makes it look
like those tools haven't been tested with (or updated for) 3.6.11.

I've tried using DRBD with Pacemaker, STONITH, etc. (which also works on
3.6.11), but it only works with iSCSI over IPoIB: i.e. a virtual NIC with a
mounted LVM volume, SCST presenting file I/O, and Pacemaker failing the VIP
over to node 2.  But OFED 2 doesn't seem to support SDP, so I have to
replicate via IPoIB, which is slow even over a dedicated IPoIB NIC; DRBD
replication is 200 MB/s.

Any help or direction would be greatly appreciated.
Cheers
Bruce McKenzie





Re: Status of "ummunot" branch?

2013-06-07 Thread Jason Gunthorpe
On Fri, Jun 07, 2013 at 10:59:43PM +, Jeff Squyres (jsquyres) wrote:

> > I don't think this covers other memory regions, like those added via mmap, 
> > right?
>  
> We talked about this at the MPI Forum this week; it doesn't seem
> like ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the
problem, and fixes only one problem.

You do need to cache registrations, and all the tuning parameters (how
much do I cache, how long do I hold it, etc.) still apply.

What goes away (is fixed) is the need for intercepts and the need to
purge address space from the cache because the backing registration
has become non-coherent/invalid. Registrations are always
coherent/valid with ODP.

This cache, and the associated optimization problem, can never go
away. With a 'register all of memory' semantic the cache can move into
the kernel, but the performance implication and overheads are all
still present, just migrated.

> 2. MPI still has to intercept (at least) munmap().

Curious to know what for? 

If you want to prune registrations (i.e. to reduce memory footprint),
this can be done lazily at any time (e.g. in a background thread or
something). Read /proc/self/maps and purge all the registrations
pointing to unmapped memory. Similar to garbage collection.

There is no harm in keeping a registration for a long period, except
for the memory footprint in the kernel.
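
For illustration, a minimal sketch of this garbage-collection approach,
assuming a simple array-based registration cache (struct reg_entry,
reg_cache, and reg_cache_len are invented for this example, and each
registration is assumed to fit within a single VMA):

#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

struct reg_entry {
	uintptr_t addr;
	size_t len;
	struct ibv_mr *mr;
	int valid;
};

extern struct reg_entry reg_cache[];
extern size_t reg_cache_len;

/* Return nonzero if [addr, addr+len) lies within one mapped VMA. */
static int range_is_mapped(uintptr_t addr, size_t len)
{
	FILE *f = fopen("/proc/self/maps", "r");
	unsigned long lo, hi;
	int mapped = 0;

	if (!f)
		return 1;	/* be conservative: keep the registration */
	while (fscanf(f, "%lx-%lx%*[^\n]", &lo, &hi) == 2)
		if (lo <= addr && addr + len <= hi) {
			mapped = 1;
			break;
		}
	fclose(f);
	return mapped;
}

/* Run lazily, e.g. from a background thread, like a garbage collector. */
void prune_stale_registrations(void)
{
	size_t i;

	for (i = 0; i < reg_cache_len; i++)
		if (reg_cache[i].valid &&
		    !range_is_mapped(reg_cache[i].addr, reg_cache[i].len))
			if (ibv_dereg_mr(reg_cache[i].mr) == 0)
				reg_cache[i].valid = 0;
}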

> 3. Having mmap/malloc/etc. return "new" memory that may already be
> registered because of a prior memory registration and subsequent
> munmap/free/etc. is just plain weird.  Worse, if we re-register it,
> ref counts could go such that the actual registration will never
> actually expire until the process dies (which could lead to
> processes with abnormally large memory footprints, because they
> never actually let go of memory because it's still registered).

This is entirely on the registration cache implementation to sort
out; there are lots of performance/memory trade-offs.

It is only weird when you think about it in terms of buffers. Memory
registration has to do with address space, not buffers.

> What MPI wants is:
> 
> 1. verbs for ummunotify-like functionality
> 2. non-blocking memory registration verbs; poll the cq to know when it has 
> completed

To me, ODP with an additional 'register all address space' semantic, plus
an asynchronous prefetch, does both of these for you.

1. ummunotify functionality and caching is now in the kernel, under
   ODP. RDMA access to an 'all of memory' registration always does the
   right thing.
2. asynchronous prefetch (eg as a work request) triggers ODP and
   kernel actions to ready a subset of memory for RDMA, including
   all the work that memory registration does today (get_user_pages,
   COW break, etc)
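
As a purely illustrative sketch of those two points: IBV_ACCESS_ON_DEMAND
is the access flag the ODP patches introduce, but the whole-address-space
registration and the prefetch verb below describe the *proposed* semantics,
not upstream verbs as of this thread.

#include <stdint.h>
#include <infiniband/verbs.h>

/* 1. One ODP registration covering all address space.  The HCA faults
 *    pages in on demand, so the registration never becomes stale. */
struct ibv_mr *register_everything(struct ibv_pd *pd)
{
	return ibv_reg_mr(pd, NULL, SIZE_MAX,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_READ |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_ON_DEMAND);
}

/* 2. Hypothetical asynchronous prefetch, posted like a work request:
 *    it would ready [addr, addr+len) for RDMA (get_user_pages, COW
 *    break, etc.) and complete through the CQ.  Invented name. */
int post_prefetch(struct ibv_qp *qp, void *addr, size_t len,
		  uint64_t wr_id);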
   
Jason


Re: Status of "ummunot" branch?

2013-06-07 Thread Jeff Squyres (jsquyres)
On Jun 6, 2013, at 4:33 PM, Jeff Squyres (jsquyres)  wrote:

> I don't think this covers other memory regions, like those added via mmap, 
> right?


We talked about this at the MPI Forum this week; it doesn't seem like ODP fixes 
any MPI problems.

1. MPI still has to have a memory registration cache, because 
ibv_reg_mr(0...sbrk()) doesn't cover the stack or mmap'ed memory, etc. (see 
the sketch after the example below).

2. MPI still has to intercept (at least) munmap().

3. Having mmap/malloc/etc. return "new" memory that may already be registered 
because of a prior memory registration and subsequent munmap/free/etc. is just 
plain weird.  Worse, if we re-register it, ref counts could go such that the 
actual registration will never actually expire until the process dies (which 
could lead to processes with abnormally large memory footprints, because they 
never actually let go of memory because it's still registered).

4. Even if MPI checks the value of sbrk() and re-registers (0...sbrk()) when 
sbrk() increases, this would seem to create a lot of work for the kernel -- 
which is both slow and synchronous.  Example:

a = malloc(5GB);
MPI_Send(a, 1, MPI_CHAR, ...); // MPI sends 1 byte

Then the MPI_Send of 1 byte will have to pay the cost of registering 5GB of new 
memory.
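
To make point 1 concrete, a conceptual sketch (registering from address 0
would be rejected by a real kernel; the function names are invented and
error handling is omitted):

#include <unistd.h>
#include <sys/mman.h>
#include <infiniband/verbs.h>

/* Conceptually register [0, current program break). */
struct ibv_mr *register_heap(struct ibv_pd *pd)
{
	return ibv_reg_mr(pd, NULL, (size_t)sbrk(0),
			  IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}

void demo(void)
{
	/* This buffer lives outside [0, sbrk(0)), so the registration
	 * above does not cover it -- and neither is the stack covered. */
	void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	(void)buf;
}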

-

Unless we understand this wrong (and there's definitely a chance that we do!), 
it doesn't sound like ODP solves anything for MPI.  Especially since HPC 
applications almost never swap (in fact, swap is usually disabled in HPC 
environments).

What MPI wants is:

1. verbs for ummunotify-like functionality
2. non-blocking memory registration verbs; poll the cq to know when it has 
completed
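
A purely hypothetical sketch of what want #2 might look like -- no such
verbs exist today; the name and signature are invented only to illustrate
the shape of the request:

#include <stdint.h>
#include <infiniband/verbs.h>

/* Invented: start a registration without blocking the caller.  A
 * completion with a matching wr_id would later show up on the CQ,
 * carrying the new MR's lkey/rkey. */
int ibv_reg_mr_async(struct ibv_pd *pd, void *addr, size_t length,
		     int access, uint64_t wr_id);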

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[GIT PULL] please pull infiniband.git

2013-06-07 Thread Roland Dreier
Hi Linus,

Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git 
tags/rdma-for-linus



InfiniBand fixes for 3.10-rc:
 - qib RCU/lockdep fix
 - iser device removal fix, plus doc fixes


Mike Marciniszyn (1):
  IB/qib: Fix lockdep splat in qib_alloc_lkey()

Or Gerlitz (2):
  IB/iser: Add Mellanox copyright
  MAINTAINERS: Add entry for iSCSI Extensions for RDMA (iSER) initiator

Roi Dayan (1):
  IB/iser: Fix device removal flow

Roland Dreier (1):
  Merge branches 'iser' and 'qib' into for-next

 MAINTAINERS                                  | 10 ++
 drivers/infiniband/hw/qib/qib_keys.c         |  2 +-
 drivers/infiniband/ulp/iser/iscsi_iser.c     |  1 +
 drivers/infiniband/ulp/iser/iscsi_iser.h     |  1 +
 drivers/infiniband/ulp/iser/iser_initiator.c |  1 +
 drivers/infiniband/ulp/iser/iser_memory.c    |  1 +
 drivers/infiniband/ulp/iser/iser_verbs.c     | 16 +---
 7 files changed, 24 insertions(+), 8 deletions(-)


Re: [PATCH 1/4] RDMA/cma: Add IPv6 support for iWARP.

2013-06-07 Thread David Miller

Please use local variables for the sockaddr_in{,6} pointers instead of
casting over and over and over and over again.
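
For illustration, a hypothetical before/after of the style being asked for
(the helper functions are invented, but the rdma_cm_id layout matches the
kernel's struct rdma_route):

#include <rdma/rdma_cm.h>

/* Before: the cast is repeated at every access. */
static void set_ports_casts(struct rdma_cm_id *id, __be16 port)
{
	((struct sockaddr_in *)&id->route.addr.src_addr)->sin_port = port;
	((struct sockaddr_in *)&id->route.addr.dst_addr)->sin_port = port;
}

/* After: cast once into local variables, then use them. */
static void set_ports_locals(struct rdma_cm_id *id, __be16 port)
{
	struct sockaddr_in *src =
		(struct sockaddr_in *)&id->route.addr.src_addr;
	struct sockaddr_in *dst =
		(struct sockaddr_in *)&id->route.addr.dst_addr;

	src->sin_port = port;
	dst->sin_port = port;
}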

Thanks.


Re: [PATCH] mm: Revert pinned_vm braindamage

2013-06-07 Thread Christoph Lameter
On Fri, 7 Jun 2013, Peter Zijlstra wrote:

> However you twist this; your patch leaves an inconsistent mess. If you
> really think they're two different things then you should have
> introduced a second RLIMIT_MEMPIN to go along with your counter.

Well, continuing to repeat myself: I worked based on the agreed-upon
characteristics of mlocked pages. The patch was there to address a
brokenness in the mlock accounting, because someone naively assumed that
pinning = mlock.

> I'll argue against such a thing; for I think that limiting the total
> amount of pages a user can exempt from paging is the far more
> useful/natural thing to measure/limit.

Pinned pages are exempted from paging by the kernel, not by the user. A
device driver or some other kernel process (reclaim, page migration, I/O,
etc.) increases the page count. There is currently no consistent accounting
for pinned pages. The vm_pinned counter was introduced to allow the largest
pinners to track what they did.
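
As a sketch of what such kernel-initiated pinning looks like (modeled
loosely on ib_umem_get() with the 3.x-era get_user_pages() signature;
simplified, with no error handling):

#include <linux/mm.h>
#include <linux/sched.h>

/* Driver-initiated pin: elevate the page refcounts and account the
 * pages in mm->pinned_vm, the counter under discussion here. */
static long pin_user_range(unsigned long start, unsigned long npages,
			   struct page **pages)
{
	long pinned;

	down_write(&current->mm->mmap_sem);
	pinned = get_user_pages(current, current->mm, start, npages,
				1 /* write */, 0 /* force */, pages, NULL);
	if (pinned > 0)
		current->mm->pinned_vm += pinned;
	up_write(&current->mm->mmap_sem);
	return pinned;
}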

> > I said that the use of a PIN page flag would allow correct accounting if
> > one wanted to interpret the limit the way you do.
>
> You failed to explain how that would help any. With a pin page flag you
> still need to find the mm to unaccount crap from. Also, all user
> controlled address space ops operate on vmas.

Pinning is kernel controlled...

> > Page migration is not a page fault?
>
> It introduces faults; what happens when a process hits the migration
> pte? It gets a random delay and eventually services a minor fault to the
> new page.

OK, but this is similar to reclaim and other such things that unmap
pages.

> At which point the saw will have cut your finger off (going with the
> most popular RT application ever -- that of a bandsaw and a laser beam).

I am pretty confused by your newer notion of RT. RT was about
deterministic behavior, I thought, not low latency as such. RT was
basically an abused marketing term referring to the bloating of the
kernel with all sorts of fairness stuff that slows us down. What happened
to make you work on low latency stuff? There is some shift that you still
need to go through to make that transition. Yes, you would want to avoid
reclaim and all sorts of other stuff for low latency. So you disable auto
NUMA, defrag, etc. to avoid these things.

> > > This leaves the RT people unhappy -- therefore _if_ we continue with
> > > this Linux specific interpretation of mlock() we must introduce new
> > > syscalls that implement the intended mlock() semantics.
> >
> > Intended means Peter's semantics?
>
> No, I don't actually write RT applications. But I've had plenty of
> arguments with RT people when I explained to them what our mlock()
> actually does vs what they expected it to do.

OK, I guess this is all new to you at this point. I am happy to see that
you are willing to abandon your evil ways (albeit under pressure from your
users) and to put the low latency people in the RT camp now.

> They're not happy. Aside from that; you HPC/HFT minimal latency lot
> should very well appreciate the minimal interference stuff they do
> actually expect.

Sure we do and we know how to do things to work around the "fair
scheduler" and other stuff. But you are breaking the basics of how we do
things with your conflation of pinning and mlocking.

We do not migrate and do not allow defragmentation or reclaim when running
low latency applications. These are non-issues.

> This might well be; and I'm not arguing we remove this. I'm merely
> stating that it doesn't make everybody happy. Also what purpose do HPC
> type applications have for mlock()?

HPC wants to keep pages in memory to avoid eviction. HPC apps are not as
sensitive to faults as low latency apps are. Minor faults have
traditionally been tolerated there. The lower you get in terms of the
latencies required, the more difficult OS control becomes.

> Here we must disagree I fear; given that mlock() is of RT origin and RT
> people very much want/expect mlock() to do what our proposed mpin() will
> do.

RT is a dirty word for me given the fairness and bloat issue. Not sure
what you mean by that. mlock is a means to keep data in memory, not a
magical wand that avoids all OS handling of the page.

> > That cannot be so since mlocked pages need to be migratable.
>
> I'm talking about the proposed mpin() stuff.

Could you write that up in detail? I am not sure how this could work at
this point.

> So I proposed most of the machinery that would be required to actually
> implement the syscalls. Except that the IB code stumped me. In
> particular I cannot easily find the userspace address to unpin for
> ipath/qib release paths.
>
> Once we have that we can trivially implement the syscalls.

Why would you need syscalls? Pinning is driver/kernel subsystem initiated
and therefore the driver can do the pin/unpin calls.

> > Pinning is not initiated by user space but by the kernel. Either
> > temporarily (page count increases are used all over the kernel for this)
> > or for longer time frame (I

Re: [PATCH 0/4] Add IPv6 support for iWARP

2013-06-07 Thread Steve Wise


Reviewed-by: Steve Wise 


Re: [PATCH][TRIVIAL] Add attribute information to SA request error messages

2013-06-07 Thread Hal Rosenstock
On 6/6/2013 10:35 AM, Line Holen wrote:
> Signed-off-by: Line Holen 

Thanks. Applied (fixing one typo).

-- Hal


Re: [PATCH] mm: Revert pinned_vm braindamage

2013-06-07 Thread Peter Zijlstra
On Thu, Jun 06, 2013 at 06:46:50PM +, Christoph Lameter wrote:
> On Thu, 6 Jun 2013, Peter Zijlstra wrote:
> 
> > Since RLIMIT_MEMLOCK is very clearly a limit on the amount of pages the
> > process can 'lock' into memory it should very much include pinned pages
> > as well as mlock()ed pages. Neither can be paged.
> 
> So we thought that this is the sum of the pages that a process has
> mlocked. Initiated by the process and/or environment explicitly. A user
> space initiated action.

Which "we"? Also, it remains a fact that your changelog didn't mention this
change in semantics at all. Nor did you CC all affected parties.

However you twist this; your patch leaves an inconsistent mess. If you
really think they're two different things then you should have
introduced a second RLIMIT_MEMPIN to go along with your counter.

I'll argue against such a thing; for I think that limiting the total
amount of pages a user can exempt from paging is the far more
useful/natural thing to measure/limit.

> > Since nobody had anything constructive to say about the VM_PINNED
> > approach and the IB code hurts my head too much to make it work I
> > propose we revert said patch.
> 
> I said that the use of a PIN page flag would allow correct accounting if
> one wanted to interpret the limit the way you do.

You failed to explain how that would help any. With a pin page flag you
still need to find the mm to unaccount crap from. Also, all user
controlled address space ops operate on vmas. 

We had VM_LOCKED far before we had the lock page flag. And you
cannot replace all VM_LOCKED utility with the page flag either.

> > Once again the rationale; MLOCK(2) is part of POSIX Realtime Extension
> > (1003.1b-1993/1003.1i-1995). It states that the specified part of the
> > user address space should stay memory resident until either program exit
> > or a matching munlock() call.
> >
> > This definition basically excludes major faults from happening on the
> > pages -- a major fault being one where IO needs to happen to obtain the
> > page content; the direct implication being that page content must remain
> > in memory.
> 
> Exactly that is the definition.
> 
> > Linux has taken this literally and made mlock()ed pages subject to page
> > migration (albeit only for the explicit move_pages() syscall; but it
> > would very much like to make them subject to implicit page migration for
> > the purpose of compaction etc.).
> 
> Page migration is not a page fault? 

It introduces faults; what happens when a process hits the migration
pte? It gets a random delay and eventually services a minor fault to the
new page.

At which point the saw will have cut your finger off (going with the
most popular RT application ever -- that of a bandsaw and a laser beam).

> The ability to move a process
> completely (including its mlocked segments) is important for the manual
> migration of process memory. That is what page migration was made for. If
> mlocked pages are treated as pinnned pages then the complete process can
> no longer be moved from node to node.
> 
> > This view disregards the intention of the spec; since mlock() is part of
> > the realtime spec the intention is very much that the user address range
> > generate no faults; neither minor nor major -- any delay is
> > unacceptable.
> 
> Where does it say that no faults are generated? Don't we generate COW on
> mlocked ranges?

That's under user control. If the user uses fork() the user can avoid
those faults by pre-faulting the pages.

> > This leaves the RT people unhappy -- therefore _if_ we continue with
> > this Linux specific interpretation of mlock() we must introduce new
> > syscalls that implement the intended mlock() semantics.
> 
> Intended means Peter's semantics?

No, I don't actually write RT applications. But I've had plenty of
arguments with RT people when I explained to them what our mlock()
actually does vs what they expected it to do.

They're not happy. Aside from that; you HPC/HFT minimal latency lot
should very well appreciate the minimal interference stuff they do
actually expect.

> > It was found that there are useful purposes for this weaker mlock(), a
> > rationale to indeed have two sets of syscalls. The weaker mlock() can be
> > used in the context of security -- where we avoid sensitive data being
> > written to disk, and in the context of userspace daemons that are part
> > of the IO path -- which would otherwise form IO deadlocks.
> 
> Migratable mlocked pages enable complete process migration between nodes
> of a NUMA system for HPC workloads.

This might well be; and I'm not arguing we remove this. I'm merely
stating that it doesn't make everybody happy. Also what purpose do HPC
type applications have for mlock()?

> > The proposed second set of primitives would be mpin() and munpin() and
> > would implement the intended mlock() semantics.
> 
> I agree that we need mpin and munpin. But they should not be called mlock
> semantics.

Here we must disagree I fear; given that mlock() is of RT origin and RT
people very much want/expect mlock() to do what our proposed mpin() will
do.