Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread David Addison
Greg Lindahl wrote:
 On Fri, Apr 29, 2005 at 12:33:54PM -0700, Grant Grundler wrote:

  Being mostly clueless about Quadrics implementation, I'm probably
  missing something that makes Quadrics a MMU but not the IB variants.
  Can someone clue me in please?

 As far as I can tell it's mostly a marketing distinction. Many
 Quadrics customers run with memory registration, and Mellanox could
 probably alter their firmware to not require registration.  Myricom
 certainly can, and in fact Patrick Geoffrey claimed they were doing so
 in their MX software. The only one I know of that isn't that flexible
 is PathScale's InfiniPath. Ours is a pure hardware mechanism, but it
 requires memory registration and is clearly not an MMU.

Greg,
only a few of our evaluation customers use the patch free (and hence
page-pinning) software release.
Most do apply our simple IOPROC patch and run without requiring page
pinning whilst still achieving the peak bandwidth and low latency
of our hardware.
Cheers
Addy.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread David Addison
Ronald G. Minnich wrote:
 On Fri, 29 Apr 2005, Greg Lindahl wrote:

  It doesn't imply that there's an MMU, either. I know that Myricom uses a
  little lookup routine in software on their nic, which most people
  wouldn't call an MMU. I don't know what Mellanox does for this, they
  don't talk much about what's hardware and what's software on their nic.
  I think Quadrics actually uses the TLB of their risc cpu on their nic
  for this lookup, but that's just a guess.

 but only quadrics rewrites the mm layer code ..

Hi Ron,
as our recent IOPROC patch on lkml shows, it's not that invasive. There
are just 24 hooks added to the Linux VM code paths - which we have been able to
maintain outside the mainline tree for many years now.
As these hooks only need to synchronise the Elan's MMU state with that of the
CPU, the device driver calls don't change the Linux MM behaviour.
We believe the IOPROC patch is generic and powerful and would allow other
RDMA NICs to solve the page registration problems in a different manner.
For NICs which require page registration, new VM hooks can be used to avoid
pages being unloaded whilst DMAs are active. Our latest cut of the IOPROC patch
has such a hook.
Cheers
Addy.


Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread David Addison
Grant Grundler wrote:
 On Fri, Apr 29, 2005 at 08:22:24PM +0200, Brice Goglin wrote:

  For instance, instead of adding PROT_DONT/ALWAYSCOPY, you may use
  an ioproc hook in the fork path. This hook (a function in your driver)
  would be called for each registered page. It will decide whether
  the page should be pre-copied or not and update the registration
  table (or whatever stores address translations in the NIC).
  In addition, the driver would probably pre-copy cow pages when
  registering them.

 This doesn't scale well as more cards are added to the box.
 I think I understand why it's good for single cards though.

With the IOPROC patch the device driver hooks are registered on a per process
or perhaps better still, a per VMA basis. And for processes/VMAs where there
are no registrations the overhead is very low.
With multiple cards in a box, all using different device drivers, I guess there
could end up being multiple registrations per process/VMA. But I'm not sure
this will be a common case for RDMA use in real life.
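
To make the per-VMA point concrete, here is a minimal Python sketch. It is not code from the IOPROC patch: `Vma`, `register_hook` and `notify_unmap` are names invented for illustration. The only assumption is that each VMA carries its own list of driver callbacks, so a VMA with no registrations costs one empty-list check on the VM path, while a VMA registered by two different drivers gets two callbacks.

```python
# Hypothetical model: Vma, register_hook and notify_unmap are illustrative
# names, not functions from the IOPROC patch.

class Vma:
    """A virtual memory area with an optional list of driver hooks."""

    def __init__(self, start, end):
        self.start, self.end = start, end
        self.hooks = []            # stays empty for VMAs with no registrations

    def register_hook(self, callback):
        self.hooks.append(callback)


def notify_unmap(vma, addr):
    """VM path: tell every driver registered on this VMA that addr is going away.

    Returns the number of driver callbacks invoked.
    """
    if not vma.hooks:              # common case: a single cheap check
        return 0
    for callback in vma.hooks:
        callback(addr)
    return len(vma.hooks)


def demo():
    plain = Vma(0x1000, 0x2000)    # never registered with any NIC
    rdma = Vma(0x8000, 0x9000)     # registered by two different drivers
    seen = []
    rdma.register_hook(lambda addr: seen.append(("nic_a", addr)))
    rdma.register_hook(lambda addr: seen.append(("nic_b", addr)))

    calls_plain = notify_unmap(plain, 0x1800)
    calls_rdma = notify_unmap(rdma, 0x8800)
    return calls_plain, calls_rdma, seen
```

Calling `demo()` shows the unregistered VMA triggering no driver calls, while the doubly registered one triggers one call per driver.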
Cheers
Addy.


Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread Grant Grundler
On Tue, May 03, 2005 at 09:42:12AM +0100, David Addison wrote:
 This doesn't scale well as more cards are added to the box.
 I think I understand why it's good for single cards though.

 With the IOPROC patch the device driver hooks are registered on a per 
 process or perhaps better still, a per VMA basis.

I was originally thinking the registrations are global (for all memory)
and not per process. Per process or per VMA seems reasonable to me.

 And for processes/VMAs where there are no registrations the overhead
 is very low.

Yes - thanks. I'm still reading the LKML thread you started:
http://lkml.org/lkml/2005/4/26/198

In particular, the comments from Brice Goglin:
http://lkml.org/lkml/2005/4/26/222

openib.org folks can find the IOPROC patch for 2.6.12-rc3 archived here:
http://lkml.org/lkml/diff/2005/4/26/198/1

 With multiple cards in a box, all using different device drivers,
 I guess there could end up being multiple registrations per process/VMA.
 But I'm not sure this will be a common case for RDMA use in real life.

I agree. Gateways between fabrics are the only case I can think of.
This won't be a problem until someone at a large national lab tries
to connect two legacy fabrics together.

thanks,
grant


Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread Caitlin Bestler
An ex post facto notification of a PTE change would enable the RDMA
Device driver to know when a Memory Region had been invalidated
so that it could probably declare an access violation and tear all 
the connections using it down.

But if the intent is to allow it to migrate the memory region to the
new mapping it would need a more synchronized notice. It needs
to be told of a *pending* change, so that it can indicate when it
has completed any data movements based on the old data.
It can then use the new data. This has generally been discussed
as a two part interface: suspend (to request that the old mapping
no longer be used) and resume (to resume usage of the mapping
with the new values), and it is generally done at a Memory Region
scope rather than on a per PTE basis.

RDMA has strict ordering requirements. In particular, completing
a receive work request represents a guarantee to the consumer
that the prior writes have been updated in its buffer. With an
unsynchronized notice that PTE entry X has been changed
I don't see how it can fulfill those semantics. It cannot know if
portions of an RDMA Write were placed to the old physical
location, and therefore it cannot know that the entire RDMA
Write payload will be in user memory at the anticipated locations
when it generates the work completion. If it cannot make that
guarantee it is obligated to terminate the connection.
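
The two-part suspend/resume interface can be sketched roughly as below. This is a toy Python model under stated assumptions, not an existing RDMA API: `MemoryRegion`, `begin_write`, `end_write`, `suspend` and `resume` are all hypothetical names. The point it illustrates is that the mapping may only change after the device has reported that no placement based on the old translation is still in flight.

```python
# Hypothetical model of the suspend/resume idea; none of these names come
# from a real RDMA stack.

class MemoryRegion:
    def __init__(self, mapping):
        self.mapping = dict(mapping)   # virtual page -> physical frame
        self.suspended = False
        self.in_flight = 0             # placements started but not finished

    def begin_write(self, vpage):
        # The device refuses to start new placements while suspended.
        if self.suspended:
            raise RuntimeError("region suspended")
        self.in_flight += 1
        return self.mapping[vpage]

    def end_write(self):
        self.in_flight -= 1

    def suspend(self):
        # Phase 1: stop new use of the old mapping. The caller may only
        # change the mapping once this reports that nothing is in flight.
        self.suspended = True
        return self.in_flight == 0

    def resume(self, new_mapping):
        # Phase 2: install the new translations and allow placements again.
        if self.in_flight:
            raise RuntimeError("old mapping still in use")
        self.mapping = dict(new_mapping)
        self.suspended = False


def demo():
    mr = MemoryRegion({0: 100})
    old_frame = mr.begin_write(0)      # a write is being placed at frame 100
    quiesced_early = mr.suspend()      # False: must wait for that write
    mr.end_write()                     # placement finished at the old location
    quiesced_late = mr.suspend()       # True: safe to migrate now
    mr.resume({0: 200})
    new_frame = mr.begin_write(0)
    mr.end_write()
    return old_frame, quiesced_early, quiesced_late, new_frame
```

With an unsynchronized, after-the-fact notice there is no equivalent of the `quiesced_early` answer, which is why the device could not tell whether part of a write landed at the old location.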


On 5/3/05, David Addison [EMAIL PROTECTED] wrote:
 Ronald G. Minnich wrote:
 
  On Fri, 29 Apr 2005, Greg Lindahl wrote:
 
 It doesn't imply that there's an MMU, either. I know that Myricom uses a
 little lookup routine in software on their nic, which most people
 wouldn't call an MMU. I don't know what Mellanox does for this, they
 don't talk much about what's hardware and what's software on their nic.
 I think Quadrics actually uses the TLB of their risc cpu on their nic
 for this lookup, but that's just a guess.
 
  but only quadrics rewrites the mm layer code ..
 
 
 Hi Ron,
 as our recent IOPROC patch on lkml shows, it's not that invasive. There
 are just 24 hooks added to the Linux VM code paths - which we have been able to
 maintain outside the mainline tree for many years now.
 As these hooks only need to synchronise the Elan's MMU state with that of the
 CPU, the device driver calls don't change the Linux MM behaviour.
 
 We believe the IOPROC patch is generic and powerful and would allow other
 RDMA NICs to solve the page registration problems in a different manner.
 For NICs which require page registration, new VM hooks can be used to avoid
 pages being unloaded whilst DMAs are active. Our latest cut of the IOPROC patch
 has such a hook.
 
 Cheers
 Addy.



Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread Ronald G. Minnich


 On 5/3/05, David Addison [EMAIL PROTECTED] wrote:
  as our recent IOPROC patch on lkml shows, it's not that invasive. There
  are just 24 hooks added to the Linux VM code paths - which we have been able to
  maintain outside the mainline tree for many years now.
  As these hooks only need to synchronise the Elan's MMU state with that of the
  CPU, the device driver calls don't change the Linux MM behaviour.
  
  We believe the IOPROC patch is generic and powerful and would allow other
  RDMA NICs to solve the page registration problems in a different manner.
  For NICs which require page registration, new VM hooks can be used to avoid
  pages being unloaded whilst DMAs are active. Our latest cut of the IOPROC patch
  has such a hook.
  

david, I just saw this. I'll need to look at that patch, it sounds pretty 
neat. Thanks


ron


Re: [openib-general] Re: RDMA memory registration

2005-05-03 Thread Caitlin Bestler
On 5/3/05, David Addison [EMAIL PROTECTED] wrote:

 We believe the IOPROC patch is generic and powerful and would allow other
 RDMA NICs to solve the page registration problems in a different manner.
 For NICs which require page registration, new VM hooks can be used to avoid
 pages being unloaded whilst DMAs are active. Our latest cut of the IOPROC patch
 has such a hook.
 

The key phrase here is "avoid pages being unloaded whilst DMAs are active."
Correct RDMA behavior requires preventing any loss of the content of those
pages in the period from the end of the DMA until the next completion is
reaped.

If the kernel were to start transferring the pages immediately after the DMA
completed, what would prevent the associated receive completion from being
generated before the migration was completed?

And if a migration is in progress, how is this feedback given to the RDMA device
and when? Explicitly suspending a Memory Registration allows detection of
the problem while the disposition of the packet is still pending. Postponing
determination that the target memory is suspended until the actual DMA
transfer is attempted is problematic.


[openib-general] Re: RDMA memory registration

2005-04-29 Thread Roland Dreier
Brice> Do you plan to work with David Addison from Quadrics?  For
Brice> sure, your hardware have very different capabilities.  But
Brice> ioproc_ops is a really nice solution and might help a lot
Brice> when dealing with deregistration and fork.

I'm following the discussion with interest.  Some hardware (eg
Mellanox HCAs) has the ability to use these hooks to avoid pinning
pages at all, but in general IB and iWARP need to pin pages so the
mapping doesn't change.

Brice> For instance, instead of adding PROT_DONT/ALWAYSCOPY, you
Brice> may use an ioproc hook in the fork path. This hook (a
Brice> function in your driver) would be called for each
Brice> registered page. It will decide whether the page should be
Brice> pre-copied or not and update the registration table (or
Brice> whatever stores address translations in the NIC).  In
Brice> addition, the driver would probably pre-copy cow pages when
Brice> registering them.

This sort of monkeying around with the VM from driver code seems much
more complicated than letting userspace handle it.

 - R.


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Grant Grundler
On Fri, Apr 29, 2005 at 08:22:24PM +0200, Brice Goglin wrote:
 For instance, instead of adding PROT_DONT/ALWAYSCOPY, you may use
 an ioproc hook in the fork path. This hook (a function in your driver)
 would be called for each registered page. It will decide whether
 the page should be pre-copied or not and update the registration
 table (or whatever stores address translations in the NIC).
 In addition, the driver would probably pre-copy cow pages when
 registering them.

This doesn't scale well as more cards are added to the box.
I think I understand why it's good for single cards though.

 It's nice to see these two works coming to LKML at the same time.
 It would be great if we could merge them and get a generic solution
 that's suitable to both registration based cards (IB/Myri/Ammasso)
 and MMU-based cards (Quadrics).

Aren't the mellanox mem-free cards more or less MMU's as well?
I had that impression after attending Dror Goldberg's talk
though I don't think he asserted that.
Openib.org developers conf (Feb 2005) slideset is here:

http://www.openib.org/docs/oib_wkshp_022005/memfree-hca-mellanox-dgoldenberg.pdf

Being mostly clueless about Quadrics implementation, I'm probably
missing something that makes Quadrics a MMU but not the IB variants.
Can someone clue me in please?

thanks,
grant


[openib-general] Re: RDMA memory registration

2005-04-29 Thread Roland Dreier
Bill> Are you suggesting making the partial pages their own VMA,
Bill> or marking the entire buffer with this flag? I originally
Bill> thought the entire buffer should be copy on fork (instead of
Bill> copy on write), and I believe this is the path Mellanox was
Bill> pursuing with the VM_NO_COW flag.  However, if applications
Bill> are registering gigs of ram, it would be very bad to have
Bill> the entire area copied on fork.

It's up to userspace really but I would expect that the partial pages
would be in a vma by themselves.

 - R.



Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Greg Lindahl
On Fri, Apr 29, 2005 at 12:33:54PM -0700, Grant Grundler wrote:

 Being mostly clueless about Quadrics implementation, I'm probably
 missing something that makes Quadrics a MMU but not the IB variants.
 Can someone clue me in please?

As far as I can tell it's mostly a marketing distinction. Many
Quadrics customers run with memory registration, and Mellanox could
probably alter their firmware to not require registration.  Myricom
certainly can, and in fact Patrick Geoffrey claimed they were doing so
in their MX software. The only one I know of that isn't that flexible
is PathScale's InfiniPath. Ours is a pure hardware mechanism, but it
requires memory registration and is clearly not an MMU.

Confused yet?

-- greg


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Roland Dreier
Bill> I'm very confused at this point. Can you briefly explain how
Bill> this works, or point me to a description? I don't see how
Bill> you could do user level I/O without registering the memory
Bill> with the hardware. I'm especially confused by the comment
Bill> (may not have been yours) that the memory doesn't have to be
Bill> pinned.  -- Bill Jordan InfiniCon Systems

You add a hook to the kernel so it tells you if a page is about to be
paged out or otherwise move.  Then you set a bit in the adapter's page
table so that it won't try to access that page without telling you.
If the adapter asks for the page, you get the kernel to fault the page
in and program the new physical mapping in the adapter.
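
As a rough illustration of that sequence, here is a toy Python model. It is only a sketch under stated assumptions: `AdapterPageTable`, `invalidate_hook`, `translate` and the `fault_in` callback are hypothetical names, and a dictionary stands in for the kernel's page tables. It shows the three steps in order: the kernel hook clears a valid bit, the adapter notices the invalid entry on its next access, and the host faults the page in and programs the new physical mapping.

```python
# Toy model only: every name here (AdapterPageTable, invalidate_hook,
# fault_in) is hypothetical and does not correspond to a real driver API.

class AdapterPageTable:
    """Adapter-side translation table: virtual page -> [physical frame, valid bit]."""

    def __init__(self):
        self.entries = {}      # vpage -> [pframe, valid]
        self.faults = 0        # how often the adapter had to ask the host

    def program(self, vpage, pframe):
        # Install or refresh a translation and mark it usable.
        self.entries[vpage] = [pframe, True]

    def invalidate_hook(self, vpage):
        # Called by the (modelled) kernel before it pages out or moves vpage.
        if vpage in self.entries:
            self.entries[vpage][1] = False

    def translate(self, vpage, fault_in):
        # Adapter-side lookup. On a missing or invalid entry, ask the host
        # to fault the page in and program the new physical mapping.
        entry = self.entries.get(vpage)
        if entry is None or not entry[1]:
            self.faults += 1
            self.program(vpage, fault_in(vpage))
        return self.entries[vpage][0]


def demo():
    host_frames = {0x10: 7}               # the kernel's view: vpage -> pframe
    table = AdapterPageTable()
    table.program(0x10, host_frames[0x10])

    first = table.translate(0x10, host_frames.__getitem__)   # resident, no fault

    table.invalidate_hook(0x10)           # kernel is about to move the page
    host_frames[0x10] = 42                # page now lives in a different frame

    second = table.translate(0x10, host_frames.__getitem__)  # one fault, remapped
    return first, second, table.faults
```

In this model the host is only involved when an entry is invalid, so resident pages are translated without any fault.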

 - R.


RE: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Rimmer, Todd
 You add a hook to the kernel so it tells you if a page is about to be
 paged out or otherwise move.  Then you set a bit in the adapter's page
 table so that it won't try to access that page without telling you.
 If the adapter asks for the page, you get the kernel to fault the page
 in and program the new physical mapping in the adapter.

But that implies the hardware has an MMU and it also puts an interrupt in the 
path per page sent.

Wasn't the assertion that there was no MMU in the hardware?

Todd Rimmer



Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Ronald G. Minnich


On Fri, 29 Apr 2005, Bill Jordan wrote:

 I'm very confused at this point. Can you briefly explain how this works,
 or point me to a description? I don't see how you could do user level
 I/O without registering the memory with the hardware. I'm especially
 confused by the comment (may not have been yours) that the memory
 doesn't have to be pinned. 

you modify the mm layer of linux, so that the PTEs on the Quadrics card
are in sync with the PTEs in the mm layer. Then you are in a position to
have a NIC incite page faults for incoming packets.

I think greg got it right -- in practice, it's not done any more. Quadrics 
has a kernel-patch-free source base now, I'm told.

ron


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Greg Lindahl
 Todd> But that implies the hardware has an MMU and it also puts an
 Todd> interrupt in the path per page sent.
 
 Well, there's one interrupt per non-resident page sent.  But nearly
 all of the time the page will be present.

It doesn't imply that there's an MMU, either. I know that Myricom uses
a little lookup routine in software on their nic, which most people
wouldn't call an MMU. I don't know what Mellanox does for this, they
don't talk much about what's hardware and what's software on their
nic. I think Quadrics actually uses the TLB of their risc cpu on their
nic for this lookup, but that's just a guess.

-- greg



RE: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Ronald G. Minnich


On Fri, 29 Apr 2005, Rimmer, Todd wrote:

 But that implies the hardware has an MMU and it also puts an interrupt
 in the path per page sent.

yes. it does. and it doesn't do per page sent, just per page that has no 
pte on the nic when received.

ron


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Ronald G. Minnich


On Fri, 29 Apr 2005, Greg Lindahl wrote:

 It doesn't imply that there's an MMU, either. I know that Myricom uses a
 little lookup routine in software on their nic, which most people
 wouldn't call an MMU. I don't know what Mellanox does for this, they
 don't talk much about what's hardware and what's software on their nic.
 I think Quadrics actually uses the TLB of their risc cpu on their nic
 for this lookup, but that's just a guess.

but only quadrics rewrites the mm layer code ..

ron


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Libor Michalek
On Fri, Apr 29, 2005 at 03:07:40PM -0600, Ronald G. Minnich wrote:
 On Fri, 29 Apr 2005, Greg Lindahl wrote:
 
  It doesn't imply that there's an MMU, either. I know that Myricom uses a
  little lookup routine in software on their nic, which most people
  wouldn't call an MMU. I don't know what Mellanox does for this, they
  don't talk much about what's hardware and what's software on their nic.
  I think Quadrics actually uses the TLB of their risc cpu on their nic
  for this lookup, but that's just a guess.
 
 but only quadrics rewrites the mm layer code ..

  Mellanox, although they have the capability, does not use the feature.
In the existing model the mellanox hardware assumes that the page is
present, hence the entire discussion about how to make sure the page
stays put and that the user mapping to that page stays put.

-Libor


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Caitlin Bestler
On 4/29/05, Roland Dreier [EMAIL PROTECTED] wrote:
 Bill> I'm very confused at this point. Can you briefly explain how
 Bill> this works, or point me to a description? I don't see how
 Bill> you could do user level I/O without registering the memory
 Bill> with the hardware. I'm especially confused by the comment
 Bill> (may not have been yours) that the memory doesn't have to be
 Bill> pinned.  -- Bill Jordan InfiniCon Systems
 
 You add a hook to the kernel so it tells you if a page is about to be
 paged out or otherwise move.  Then you set a bit in the adapter's page
 table so that it won't try to access that page without telling you.
 If the adapter asks for the page, you get the kernel to fault the page
 in and program the new physical mapping in the adapter.
 

Yes, and you could even have a system that was capable of doing
DMA to a user virtual map (in fact some minis back around 1980
had exactly that capability).

But there are *two* issues involved here:

One is that the RDMA hardware, however it is marketed, essentially
needs to act as an MMU. That means that it has to be synchronized
with normal MMU. The traditional sledge-hammer approach to 




Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Ronald G. Minnich


On Fri, 29 Apr 2005, Caitlin Bestler wrote:

 One is that the RDMA hardware, however it is marketed, essentially
 needs to act as an MMU. That means that it has to be synchronized
 with normal MMU. The traditional sledge-hammer approach to 

ah ha! his RDMA mmu just crashed his mm layer. It happens. 

ron


Re: [openib-general] Re: RDMA memory registration

2005-04-29 Thread Caitlin Bestler
oops, hit send too soon. Finishing the response...

On 4/29/05, Caitlin Bestler [EMAIL PROTECTED] wrote:
 On 4/29/05, Roland Dreier [EMAIL PROTECTED] wrote:
  Bill> I'm very confused at this point. Can you briefly explain how
  Bill> this works, or point me to a description? I don't see how
  Bill> you could do user level I/O without registering the memory
  Bill> with the hardware. I'm especially confused by the comment
  Bill> (may not have been yours) that the memory doesn't have to be
  Bill> pinned.  -- Bill Jordan InfiniCon Systems
 
  You add a hook to the kernel so it tells you if a page is about to be
  paged out or otherwise move.  Then you set a bit in the adapter's page
  table so that it won't try to access that page without telling you.
  If the adapter asks for the page, you get the kernel to fault the page
  in and program the new physical mapping in the adapter.
 
 
 Yes, and you could even have a system that was capable of doing
 DMA to a user virtual map (in fact some minis back around 1980
 had exactly that capability).
 
 But there are *two* issues involved here:
 
 One is that the RDMA hardware, however it is marketed, essentially
 needs to act as an MMU. That means that it has to be synchronized
 with normal MMU. The traditional sledge-hammer approach to
 
synchronizing is to require that the mapping be frozen. You *could*
define a method that attempts to be more dynamic in this synchronization,
but since it is an ex post facto mechanism that must work with multiple
hardware cards it needs to be defined recognizing that it is not
instantaneous.
It is virtually the same problem as memory suspend in general: basically,
the RDMA hardware's MMU is not making calculations for each and every
access to the host bus.

Secondly, there is the problem that an advertised buffer is implicitly a
promise to the peer that the buffer is available. Using RNRs (or dropping
TCP segments for iWARP) while paging an image from disk is just not
playing fair. No host should advertise 20 GB of buffers to its peer when it
only has 2 GB of physical memory backing it up. When an application
registers memory it believes it has permission from the OS to advertise
buffers within it. RNRs are appropriate to move memory around, not to
allow a host to overadvertise.
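
One way to picture the "don't overadvertise" rule is a simple budget check, sketched below in Python. This is an assumption-laden illustration, not part of any RDMA stack: `AdvertiseBudget` and its methods are hypothetical, and `backed_bytes` stands for whatever amount of memory the host can really provide.

```python
# Hypothetical admission check; the class and its fields are assumptions
# made for illustration only.

GIB = 1 << 30

class AdvertiseBudget:
    """Tracks how many bytes of receive buffers have been advertised to peers."""

    def __init__(self, backed_bytes):
        self.backed_bytes = backed_bytes   # memory the host can really provide
        self.advertised = 0

    def advertise(self, nbytes):
        # Refuse to promise more buffer space than is backed.
        if self.advertised + nbytes > self.backed_bytes:
            return False
        self.advertised += nbytes
        return True

    def retire(self, nbytes):
        # The peer has consumed the buffer; the promise is discharged.
        self.advertised -= nbytes


def demo():
    budget = AdvertiseBudget(2 * GIB)
    ok_small = budget.advertise(1 * GIB)    # fits within the 2 GB backing
    ok_huge = budget.advertise(20 * GIB)    # would over-advertise, refused
    return ok_small, ok_huge, budget.advertised
```

Here the 20 GB advertisement against 2 GB of backing memory is refused up front instead of being absorbed later through RNRs.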