RE: Status of ummunot branch?

2013-06-10 Thread Liran Liss
Here are a few more clarifications:

1) ODP MRs can cover address ranges that do not have a mapping at registration 
time.

This means that MPI can register in advance, say, the lower GBs of the address 
space, covering malloc's primary arena.
Thus, there is no need to adjust to each increase in sbrk().

Similarly, you can register the stack region up to the maximum size of the 
stack.
The stack can grow and shrink, and ODP will always use the current mapping.

2) Virtual addresses covered by an ODP MR must have a valid mapping when they 
are accessed (during send/receive WQE processing or as the target of an 
RDMA/atomic operation).
So, Jeff, the only thing you need to ensure is that you don't free() a buffer 
that you have posted but haven't yet received a completion for - but I guess 
that is something you already do... :)

For example, in the following scenario:
a. reg_mr(first GB of the address space)

b. p = malloc()
c. post_send(p)
d. poll for completion
e. free(p)

f. p = malloc()
g. post_send(p)
h. poll for completion
i. free(p)

(c) may incur a page fault (if not pre-fetched or faulted-in by another thread).
(e) happens after the completion, so it is guaranteed that (c), when processed 
by the HW, uses the correct application buffer with the current virt-to-phys 
mapping (at HW access time).

The reallocation in (f) may or may not change the virtual-to-physical mappings.
The memory holding the message may or may not be paged out (ODP does not hold a 
reference on the pages).
In any case, when (g) is processed, it always uses the current mapping.
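
For illustration, a minimal sketch of the scenario above in libibverbs terms.
It assumes an ODP-capable device and a verbs library exposing an on-demand
access flag (written here as IBV_ACCESS_ON_DEMAND; the exact name in the
experimental API may differ), and that pd/qp/cq are set up elsewhere:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* (a) register the first GB of the address space up front; the pages do
 * not have to be mapped yet -- ODP resolves the mapping at access time. */
static struct ibv_mr *reg_first_gb(struct ibv_pd *pd)
{
    return ibv_reg_mr(pd, (void *)0, 1UL << 30,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_ON_DEMAND);
}

/* (b)-(e) and (f)-(i): allocate, post, poll, and only then free. */
static int send_and_free(struct ibv_qp *qp, struct ibv_cq *cq,
                         struct ibv_mr *mr, size_t len)
{
    void *p = malloc(len);
    struct ibv_sge sge = { .addr = (uintptr_t)p, .length = (uint32_t)len,
                           .lkey = mr->lkey };
    struct ibv_send_wr wr = { .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED };
    struct ibv_send_wr *bad;
    struct ibv_wc wc;

    memset(p, 0, len);
    if (ibv_post_send(qp, &wr, &bad))      /* may incur an ODP page fault */
        return -1;
    while (ibv_poll_cq(cq, 1, &wc) == 0)   /* wait for the completion */
        ;
    free(p);                               /* safe only after completion */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}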

--Liran



-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org 
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Saturday, June 08, 2013 2:58 AM
To: Jeff Squyres (jsquyres)
Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
Subject: Re: Status of ummunot branch?

On Fri, Jun 07, 2013 at 10:59:43PM +, Jeff Squyres (jsquyres) wrote:

  I don't think this covers other memory regions, like those added via mmap, 
  right?
  
 We talked about this at the MPI Forum this week; it doesn't seem like 
 ODP fixes any MPI problems.

ODP without 'register all address space' changes the nature of the problem, and 
fixes only one problem.

You do need to cache registrations, and all the tuning parameters (how much do 
I cache, how long do I hold it for, etc, etc) still apply.

What goes away (is fixed) is the need for intercepts and the need to purge 
address space from the cache because the backing registration has become 
non-coherent/invalid. Registrations are always coherent/valid with ODP.

This cache, and the associated optimization problem, can never go away. With a 
'register all of memory' semantic the cache can move into the kernel, but the 
performance implication and overheads are all still present, just migrated.
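
To make that concrete, the user-space cache is essentially an interval map
from address ranges to MRs. A minimal sketch (purely illustrative names and
structure, not any MPI's actual code):

#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>

struct reg_cache_entry {
    uintptr_t               start, end;    /* cached registered range */
    struct ibv_mr          *mr;
    struct reg_cache_entry *next;
};

/* Return a cached MR covering [addr, addr + len), or NULL on a miss,
 * in which case the caller registers the range and inserts a new entry.
 * Capacity limits, aging and eviction -- the tuning knobs above -- are
 * left out. */
static struct ibv_mr *reg_cache_lookup(struct reg_cache_entry *head,
                                       void *addr, size_t len)
{
    uintptr_t a = (uintptr_t)addr;
    struct reg_cache_entry *e;

    for (e = head; e; e = e->next)
        if (e->start <= a && a + len <= e->end)
            return e->mr;
    return NULL;
}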

 2. MPI still has to intercept (at least) munmap().

Curious to know what for? 

If you want to prune registrations (ie to reduce memory footprint), this can be 
done lazily at any time (eg in a background thread or something). Read 
/proc/self/maps and purge all the registrations pointing to unmapped memory. 
Similar to garbage collection.
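
Roughly along these lines (a sketch, reusing the illustrative reg_cache_entry
list from above):

#include <infiniband/verbs.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>

/* Lazily prune cached registrations whose address ranges no longer
 * intersect any live mapping in /proc/self/maps. */
static void prune_stale_registrations(struct reg_cache_entry **head)
{
    uint64_t lo[1024], hi[1024];
    char line[512];
    size_t n = 0;
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f)
        return;
    while (n < 1024 && fgets(line, sizeof(line), f))
        if (sscanf(line, "%" SCNx64 "-%" SCNx64, &lo[n], &hi[n]) == 2)
            n++;
    fclose(f);

    for (struct reg_cache_entry **pe = head; *pe; ) {
        struct reg_cache_entry *e = *pe;
        int live = 0;

        for (size_t i = 0; i < n && !live; i++)
            live = e->start < hi[i] && lo[i] < e->end;
        if (live) {
            pe = &e->next;
        } else {
            *pe = e->next;          /* unlink the stale entry */
            ibv_dereg_mr(e->mr);    /* registration no longer needed */
            free(e);
        }
    }
}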

There is no harm in keeping a registration for a long period, except for the 
memory footprint in the kernel.

 3. Having mmap/malloc/etc. return new memory that may already be 
 registered because of a prior memory registration and subsequent 
 munmap/free/etc. is just plain weird.  Worse, if we re-register it, 
 ref counts could go such that the actual registration will never 
 actually expire until the process dies (which could lead to processes 
 with abnormally large memory footprints, because they never actually 
 let go of memory because it's still registered).

This is entirely on the registration cache implementation to sort out; there 
are lots of performance/memory trade-offs.

It is only weird when you think about it in terms of buffers. Memory 
registration has to do with address space, not buffers.

 What MPI wants is:
 
 1. verbs for ummunotify-like functionality
 2. non-blocking memory registration verbs; poll the cq to know when it has
 completed

To me, ODP with an additional 'register all address space' semantic, plus an 
asynchronous prefetch does both of these for you.

1. ummunotify functionality and caching is now in the kernel, under
   ODP. RDMA access to an 'all of memory' registration always does the
   right thing.
2. asynchronous prefetch (eg as a work request) triggers ODP and
   kernel actions to ready a subset of memory for RDMA, including
   all the work that memory registration does today (get_user_pages,
   COW break, etc)
   
Jason

Re: How to do replication right with SRP or remote storage?

2013-06-10 Thread Sebastian Riemer
On 08.06.2013 04:31, Bruce McKenzie wrote:
 Hi Bart.
 
 Any advice on using this fix with MD RAID 1? A guide or site you know of?
 
 I've compiled kernel 3.6.11 on Ubuntu 13.04 with OFED 2 from Mellanox, and it
 works OK; performance is a little better with SRP.  Some packages don't seem
 to work, i.e. srptools and IB diags; some commands fail, which looks like those
 tools haven't been tested with 3.6.11, or updated.
 
 I've tried using DRBD with Pacemaker, STONITH etc. (which also works on 3.6.11),
 but it only works with iSCSI over IPoIB, i.e. a virtual NIC with mounted LVM
 using SCST to present file I/O, and Pacemaker to fail over the VIP to node 2.
 But OFED 2 doesn't seem to support SDP, so I have to replicate via IPoIB, which
 is slow even over a dedicated IPoIB NIC: DRBD replication is 200 MB/s.
 
 Any help or direction would be greatly appreciated.
 Cheers
 Bruce McKenzie
 

(changed subject into something I think is more appropriate)

Hi Bruce,

thanks for contacting me privately in parallel. I can answer the
replication questions. I'm replying here again so that others can share
the experience.

Please evaluate the ib_srp fixes from Bart and from me as well and send
us your feedback!

We are still working out, together with the Mellanox SRP guys (Sagi Grimberg,
Vu Pham, Oren Duer and others), how to do fast I/O failing and automatic
reconnect right.

You need these patches in order to fail I/O to the upper layers within the
time you want, so that dm-multipath can fail over the path first while
ib_srp continuously tries to reconnect the failed path. If the other
path also fails, then very likely the storage server is down, so you
fail the I/O further up to MD RAID-1 so that it can fail that replica.

For replication, the last slide of my talk at LinuxTag this year could be
interesting for you:

http://www.slideshare.net/SebastianRiemer/infini-band-rdmaforstoragesrpvsiser-21791250

That slide caused a lot of discussion afterwards. The point is that
replication of remote storage is best done on the initiator (a single kernel
manages all replicas, parallel network paths, symmetric latency, ...).

The bad news is that replication of virtual/remote storage with MD
RAID-1 is a use case which basically works but has some issues which
Neil Brown doesn't want to have fixed in mainline. So you need a kernel
developer for some cool features like e.g. safe VM live migration.

Perhaps I should collect all the people who require MD RAID-1 for remote
storage replication in order to put some pressure on Neil. At least some
parts of this use case are easy to align with mainline behavior, e.g.
letting MD assembly scale right (mdadm searches the whole of /dev
unnecessarily). I was surprised to hear that he will make the data offset
settable again so that you can set it to 4 MiB (1 LV extent). We already
have that through custom patches on top of mdadm 3.2.6.

DRBD is crap already with iSCSI. 200 MB/s with IB sounds familiar; I had
250 MB/s in a primary/secondary setup with DRBD during evaluation. That's
store-and-forward writes to the secondary, which is slow: chained network
paths! With Ethernet that hurts even more; people report 70 MB/s there.
I've taught them how to use blktrace, and it became obvious that they were
trapped in latency.

I can also recommend Vasiliy Tolstov (v.tols...@selfip.ru). He also
uses SRP with MD RAID-1. He managed to convince Neil to fix the MD data
offset. Open source is all about the right allies.

Cheers,
Sebastian



Re: How to do replication right with SRP or remote storage?

2013-06-10 Thread Sebastian Riemer
On 10.06.2013 14:44, Bart Van Assche wrote:
 On 06/10/13 14:05, Sebastian Riemer wrote:
 Perhaps, I should collect all guys who require MD RAID-1 for remote
 storage replication in order to put some pressure on Neil.
 
 If I remember correctly one of the things Neil is trying to explain to
 md users is that when md is used without write-intent bitmap there is a
 risk of triggering a so-called write hole after a power failure ?

I'm not sure; I haven't seen something like this on the mailing list. Do
you have a reference from the archives?

I think this is handled by superblock writes in the correct order by
now. To my knowledge, the main reason for the write-intent bitmap remains
that without it you need a full resync if a component device is down for a
short moment in time; it becomes faulty.
If you know that there can't be a hardware issue (e.g. virtual storage),
you can remove the faulty device and re-add it to the array.

If a device was faulty, it assembles again. There is a per-device error
counter in sysfs under /sys/block/mdX/md/ and a maximum read error count
(usually 20) above which the faulty device doesn't assemble again (see the
sketch after the paths below):

/sys/block/mdX/md/dev-Y/errors
/sys/block/mdX/md/max_read_errors
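
A trivial sketch of checking those two values; "md0" and "dev-sdb1" are
placeholders for the array and the component device:

#include <stdio.h>

static long read_sysfs_long(const char *path)
{
    FILE *f = fopen(path, "r");
    long v = -1;

    if (f) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void)
{
    long errors = read_sysfs_long("/sys/block/md0/md/dev-sdb1/errors");
    long limit  = read_sysfs_long("/sys/block/md0/md/max_read_errors");

    printf("errors=%ld max_read_errors=%ld -> %s\n", errors, limit,
           errors >= 0 && errors < limit ? "device should assemble again"
                                         : "device will be kicked");
    return 0;
}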

Cheers,
Sebastian


Re: Status of ummunot branch?

2013-06-10 Thread Jeff Squyres (jsquyres)
On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
wrote:

 We talked about this at the MPI Forum this week; it doesn't seem
 like ODP fixes any MPI problems.
 
 ODP without 'register all address space' changes the nature of the
 problem, and fixes only one problem.

I agree that pushing all registration issues out of the application and 
(somewhere) into the verbs stack would be a nice solution.

 You do need to cache registrations, and all the tuning parameters (how
 much do I cache, how long do I hold it for, etc, etc) all still apply.
 
 What goes away (is fixed) is the need for intercepts and the need to
 purge address space from the cache because the backing registration
 has become non-coherent/invalid. Registrations are always
 coherent/valid with ODP.

 This cache, and the associated optimization problem, can never go
 away. With a 'register all of memory' semantic the cache can move into
 the kernel, but the performance implication and overheads are all
 still present, just migrated.

Good summary; and you corrected some of my mistakes -- thanks.

That being said, everyone I've talked to about ODP finds it very, very strange 
that the kernel would keep memory registrations around for memory that is no 
longer part of a process.  Not only does it lead to the "new memory is 
magically already registered" semantic that I find weird, it's just plain *odd* 
for the kernel to maintain state for something that doesn't exist any more.  It 
feels dirty.

Sidenote: I was just informed today that the current way MPI implementations 
implement registration cache coherence (glibc malloc hooks) has been deprecated 
and will be removed from glibc 
(http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html).  This really puts 
the pressure on to find a new / proper solution.
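
For reference, the mechanism being deprecated is the __malloc_hook/__free_hook
family; a sketch of the kind of interception MPIs rely on today (illustrative
only, not any MPI's actual code, and not reentrancy- or thread-safe; the
reg_cache_invalidate call is a placeholder for the cache purge):

#include <malloc.h>
#include <stdlib.h>

static void reg_cache_invalidate(void *ptr);    /* illustrative cache hook */

static void (*prev_free_hook)(void *, const void *);

static void my_free_hook(void *ptr, const void *caller)
{
    (void)caller;
    reg_cache_invalidate(ptr);      /* purge any MR covering ptr */
    __free_hook = prev_free_hook;   /* uninstall so free() is the real one */
    free(ptr);
    prev_free_hook = __free_hook;
    __free_hook = my_free_hook;     /* reinstall */
}

static void install_free_hook(void)
{
    prev_free_hook = __free_hook;
    __free_hook = my_free_hook;
}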

 What MPI wants is:
 
 1. verbs for ummunotify-like functionality
 2. non-blocking memory registration verbs; poll the cq to know when it has 
 completed
 
 To me, ODP with an additional 'register all address space' semantic, plus
 an asynchronous prefetch does both of these for you.
 
 1. ummunotify functionality and caching is now in the kernel, under
   ODP. RDMA access to an 'all of memory' registration always does the
   right thing.

"Register all address space" is the moral equivalent of not having userspace 
registration, so let's talk about it in those terms.  Specifically, there's a 
subtle difference between:

a) telling verbs to register (0...2^64)
   -- Which is weird because it tells verbs to register memory that isn't in 
my address space
b) telling verbs that the app doesn't want to handle registration
   -- How that gets implemented is not important (from userspace's point of 
view) -- if the kernel chooses to implement that by registering non-existent 
memory, that's the kernel's problem

I guess I'm arguing that registering non-existent memory is not the Right Thing.

Regardless of what solution is devised for registered memory management 
(ummunotify, ODP, or something else), a non-blocking verb for registering 
memory would still be a Very Useful Thing.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



RE: Status of ummunot branch?

2013-06-10 Thread Liran Liss
 -----Original Message-----
 From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
 ow...@vger.kernel.org] On Behalf Of Jeff Squyres (jsquyres)
 Sent: Monday, June 10, 2013 5:50 PM
 To: Jason Gunthorpe
 Cc: Haggai Eran; Or Gerlitz; linux-rdma@vger.kernel.org; Shachar Raindel
 Subject: Re: Status of ummunot branch?
 
 On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
 
  We talked about this at the MPI Forum this week; it doesn't seem like
  ODP fixes any MPI problems.
 
  ODP without 'register all address space' changes the nature of the
  problem, and fixes only one problem.
 
 I agree that pushing all registration issues out of the application and
 (somewhere) into the verbs stack would be a nice solution.
 
  You do need to cache registrations, and all the tuning parameters (how
  much do I cache, how long do I hold it for, etc, etc) all still apply.
 
  What goes away (is fixed) is the need for intercepts and the need to
  purge address space from the cache because the backing registration
  has become non-coherent/invalid. Registrations are always
  coherent/valid with ODP.
 
  This cache, and the associated optimization problem, can never go
  away. With a 'register all of memory' semantic the cache can move into
  the kernel, but the performance implication and overheads are all
  still present, just migrated.
 
 Good summary; and you corrected some of my mistakes -- thanks.
 
  That being said, everyone I've talked to about ODP finds it very, very strange
  that the kernel would keep memory registrations around for memory that is
  no longer part of a process.  Not only does it lead to the "new memory is
  magically already registered" semantic that I find weird, it's just plain *odd*
  for the kernel to maintain state for something that doesn't exist any more.  It
  feels dirty.
 
 Sidenote: I was just informed today that the current way MPI
 implementations implement registration cache coherence (glibc malloc
 hooks) has been deprecated and will be removed from glibc
 (http://sourceware.org/ml/libc-alpha/2011-05/msg00103.html).  This really
 puts on the pressure to find a new / proper solution.
 
  What MPI wants is:
 
  1. verbs for ummunotify-like functionality
  2. non-blocking memory registration verbs; poll the cq to know when it has
  completed
 
  To me, ODP with an additional 'register all address space' semantic,
  plus an asynchronous prefetch does both of these for you.
 
  1. ummunotify functionality and caching is now in the kernel, under
ODP. RDMA access to an 'all of memory' registration always does the
right thing.
 
  "Register all address space" is the moral equivalent of not having userspace
  registration, so let's talk about it in those terms.  Specifically, there's a
  subtle difference between:
  
  a) telling verbs to register (0...2^64)
     -- Which is weird because it tells verbs to register memory that isn't in
  my address space


Another way to look at it is specifying IO access permissions for address space 
ranges.
This could be useful to implement a buffer pool to be used for a specific MR 
only, yet still map/unmap memory within this pool on the fly to optimize 
physical memory utilization.
In this case, you would provide smaller ranges than 2^64...
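
A hedged sketch of that buffer-pool idea, under the same assumptions as before
(ODP-capable device, an on-demand access flag; odp_pool, pool_init and
pool_commit are illustrative names):

#include <infiniband/verbs.h>
#include <sys/mman.h>

struct odp_pool {
    void          *base;
    struct ibv_mr *mr;
};

/* Register a fixed 64 MB range once; only address space is reserved, no
 * physical pages are committed at registration time. */
static int pool_init(struct odp_pool *pool, struct ibv_pd *pd)
{
    const size_t size = 64UL << 20;

    pool->base = mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool->base == MAP_FAILED)
        return -1;
    pool->mr = ibv_reg_mr(pd, pool->base, size,
                          IBV_ACCESS_LOCAL_WRITE |
                          IBV_ACCESS_REMOTE_WRITE |
                          IBV_ACCESS_ON_DEMAND);
    return pool->mr ? 0 : -1;
}

/* Commit physical memory for part of the pool on the fly; a later munmap()
 * (or PROT_NONE remap) releases it again without touching the MR. */
static void *pool_commit(struct odp_pool *pool, size_t off, size_t len)
{
    return mmap((char *)pool->base + off, len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}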


 b) telling verbs that the app doesn't want to handle registration
-- How that gets implemented is not important (from userspace's point of
 view) -- if the kernel chooses to implement that by registering non-existent
 memory, that's the kernel's problem
 
 I guess I'm arguing that registering non-existent memory is not the Right
 Thing.
 
 Regardless of what solution is devised for registered memory management
 (ummunotify, ODP, or something else), a non-blocking verb for registering
 memory would still be a Very Useful Thing.
 
 --
 Jeff Squyres
 jsquy...@cisco.com
 For corporate legal information go to:
 http://www.cisco.com/web/about/doing_business/legal/cri/
 


Re: Status of ummunot branch?

2013-06-10 Thread Jason Gunthorpe
On Mon, Jun 10, 2013 at 02:49:24PM +, Jeff Squyres (jsquyres) wrote:
 On Jun 7, 2013, at 4:57 PM, Jason Gunthorpe jguntho...@obsidianresearch.com 
 wrote:
 
  We talked about this at the MPI Forum this week; it doesn't seem
  like ODP fixes any MPI problems.
  
  ODP without 'register all address space' changes the nature of the
  problem, and fixes only one problem.
 
 I agree that pushing all registration issues out of the application
 and (somewhere) into the verbs stack would be a nice solution.

Well, it creates a mess in another sense, because now you've lost
context. When your MPI goes to do a 1-byte send, the kernel may well
prefetch a few megabytes of page tables, whereas an implementation in
userspace still has the context and can say, no, I don't need that..

Maybe a prefetch WR can restore the lost context, donno..

 That being said, everyone I've talked to about ODP finds it very,
 very strange that the kernel would keep memory registrations around
 for memory that is no longer part of a process.  Not only does it

MRs are badly named. They are not 'memory registrations'. They are
'address registrations'. Don't conflate address === memory in your
head; that's what makes it seem weird :)

The memory the address space points to is flexible.

The address space is tied to the lifetime of the process.

It doesn't matter if there is no memory mapped to the address space,
the address space is still there.

Liran had a good example. You can register address space and then use
mmap/munmap/MAP_FIXED to mess around with where it points to.

A practical example of using this would be to avoid the need to send
scatter buffer pointers to the remote. The remote writes into a memory
ring and the ring is made 'endless' by clever use of remapping.
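
A sketch of that remapping trick (purely illustrative; error handling and the
verbs side are omitted): the same backing pages are mapped twice, back to
back, inside one reserved range, so a writer that runs past the end of the
first copy lands back at the start of the ring.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* ring_size must be a multiple of the page size; register
 * [base, base + 2 * ring_size) as the MR the remote writes into. */
static void *make_endless_ring(size_t ring_size)
{
    int fd = shm_open("/odp_ring", O_RDWR | O_CREAT, 0600);
    void *base;

    ftruncate(fd, ring_size);

    /* Reserve twice the ring size of contiguous address space. */
    base = mmap(NULL, 2 * ring_size, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Map the same pages into both halves of the reservation. */
    mmap(base, ring_size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    mmap((char *)base + ring_size, ring_size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);

    close(fd);
    return base;
}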

 Register all address space is the moral equivalent of not having
 userspace registration, so let's talk about it in those terms.
 Specifically, there's a subtle difference between:
 
 a) telling verbs to register (0...2^64) 
 b) telling verbs that the app doesn't want to handle registration

I agree, a verb to do 'B' is a cleaner choice than trying to cram this
kind of API into A...

Jason


RE: [PATCH v5 01/28] rdma/cm: define native IB address

2013-06-10 Thread Hefty, Sean
 Define AF_IB and sockaddr_ib to allow the rdma_cm to use native IB
 addressing.
 
 Signed-off-by: Sean Hefty sean.he...@intel.com
 ---
  include/linux/socket.h |2 +
  include/rdma/ib.h  |   89 
 
  2 files changed, 91 insertions(+), 0 deletions(-)
  create mode 100644 include/rdma/ib.h
 
 diff --git a/include/linux/socket.h b/include/linux/socket.h
 index 2b9f74b..68f7120 100644
 --- a/include/linux/socket.h
 +++ b/include/linux/socket.h
 @@ -167,6 +167,7 @@ struct ucred {
  #define AF_PPPOX 24  /* PPPoX sockets*/
  #define AF_WANPIPE   25  /* Wanpipe API Sockets */
  #define AF_LLC   26  /* Linux LLC*/
 +#define AF_IB 27  /* Native InfiniBand address*/
  #define AF_CAN   29  /* Controller Area Network  */
  #define AF_TIPC  30  /* TIPC sockets */
  #define AF_BLUETOOTH 31  /* Bluetooth sockets*/
 @@ -211,6 +212,7 @@ struct ucred {
  #define PF_PPPOX AF_PPPOX
  #define PF_WANPIPE   AF_WANPIPE
  #define PF_LLC   AF_LLC
 +#define PF_IB AF_IB
  #define PF_CAN   AF_CAN
  #define PF_TIPC  AF_TIPC
  #define PF_BLUETOOTH AF_BLUETOOTH

Are there any objections from the network maintainers to adding these 
definitions?

The rest of the changes from this series are restricted to the RDMA subsystem.  
Currently, the RDMA stack connects using IP addresses, which must be mapped to 
IB addresses.  This change allows the RDMA stack to establish connections using 
native IB addresses.
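
For context, a hedged sketch of what a consumer might do once this lands. The
include/rdma/ib.h hunk is not quoted above, so the sockaddr_ib field names
below (sib_family, sib_addr, sib_sid) are assumptions based on the
description, not the actual header:

#include <endian.h>
#include <rdma/rdma_cma.h>
#include <stdint.h>
#include <string.h>

/* Resolve a destination directly by GID + service ID, with no IP-to-IB
 * mapping step in between. */
static int connect_af_ib(struct rdma_cm_id *id, const union ibv_gid *dgid,
                         uint64_t service_id)
{
    struct sockaddr_ib dst;

    memset(&dst, 0, sizeof(dst));
    dst.sib_family = AF_IB;
    memcpy(&dst.sib_addr, dgid, sizeof(*dgid));   /* destination GID */
    dst.sib_sid = htobe64(service_id);            /* IB service ID   */

    return rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000);
}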

- Sean 