Re: [openib-general] basic IB doubt

2006-08-28 Thread Talpey, Thomas
At 03:39 AM 8/26/2006, Gleb Natapov wrote:
On Fri, Aug 25, 2006 at 03:53:12PM -0400, Talpey, Thomas wrote:
 Flush (sync for_device) before posting.
 Invalidate (sync for_cpu) before processing.
 
So, before touching the data that was RDMAed into the buffer application
should cache invalidate the buffer, is this even possible from user
space? (Not on x86, but it isn't needed there.)

Interesting you should mention that. :-) There isn't a user verb for
dma_sync, there's only deregister.

The kernel can perform this for receive completions, and signaled
RDMA Reads, but it can't do so for remote RDMA Writes. Only the
upper layer knows where those went.
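
(For reference, the kernel-side pattern being discussed looks roughly like the
sketch below. It is illustrative only; "dev", "buf_dma" and "buf_len" are
hypothetical names for the mapping device, the mapped bus address and the
buffer length.)

  #include <linux/dma-mapping.h>

  /* Flush (sync for_device) before handing the buffer to the HCA. */
  static void example_prepare_for_hca(struct device *dev,
                                      dma_addr_t buf_dma, size_t buf_len)
  {
          dma_sync_single_for_device(dev, buf_dma, buf_len, DMA_FROM_DEVICE);
          /* ... now post the receive WR (or advertise the buffer for a
           * remote RDMA Write) using buf_dma in the SGE ... */
  }

  /* Invalidate (sync for_cpu) before the CPU reads the arrived data. */
  static void example_prepare_for_cpu(struct device *dev,
                                      dma_addr_t buf_dma, size_t buf_len)
  {
          dma_sync_single_for_cpu(dev, buf_dma, buf_len, DMA_FROM_DEVICE);
          /* ... only the upper layer knows when a remote RDMA Write has
           * landed here, which is the point being made above ... */
  }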

There are two practical solutions:

1) (practical solution) user mappings must be fully consistent,
within the capability of the hardware. Still, don't go depending
on any specific ordering here.

2) user must deregister any mapping before inspecting the result. I
doubt any of them do this, for that reason anyway.

My opinion is that this will bite us in the a** some day. If anybody was
running this code on the Sparc architecture it already would have.

Tom.





Re: [openib-general] basic IB doubt

2006-08-28 Thread Talpey, Thomas
At 09:00 AM 8/28/2006, Gleb Natapov wrote:
 2) user must deregister any mapping before inspecting the result. I
 doubt any of them do this, for that reason anyway.
 
This may have big performance impact.

You think? :-)

 My opinion is that this will bite us in the a** some day. If anybody was
 running this code on the Sparc architecture it already would have.
 
AFAIK Sun runs MPI over UDAPL, but they have their own IB
implementation, so maybe they handle all coherency issues in the UDAPL
itself.

The Sparc IOMMU supports consistent mappings, in which the
i/o streaming caches are not used. There is a performance
impact to using this mode however. The best throughput is
achieved using streaming with explicit software consistency.

However, even in consistent mode, the Sparc API requires
that the synchronization calls be made. I have never gotten
a completely satisfactory answer as to why, but on the
high-end server platforms, I think it's possible that the busses
can't always snoop one another and the calls provide a push.
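
(To make the two modes concrete, here is a rough sketch of the Linux DMA API
calls involved; "dev", "buf" and "buf_len" are hypothetical and error handling
is omitted.)

  #include <linux/dma-mapping.h>
  #include <linux/gfp.h>

  /* Consistent (coherent) mapping: the streaming i/o caches are bypassed
   * and no explicit syncs are needed, at some cost in throughput. */
  static void *consistent_example(struct device *dev, size_t buf_len,
                                  dma_addr_t *dma)
  {
          return dma_alloc_coherent(dev, buf_len, dma, GFP_KERNEL);
  }

  /* Streaming mapping: best throughput, but the consumer owns the
   * explicit software consistency (the sync calls discussed above). */
  static dma_addr_t streaming_example(struct device *dev, void *buf,
                                      size_t buf_len)
  {
          dma_addr_t dma = dma_map_single(dev, buf, buf_len, DMA_FROM_DEVICE);
          /* ... i/o happens; then, before the CPU looks at buf: */
          dma_sync_single_for_cpu(dev, dma, buf_len, DMA_FROM_DEVICE);
          return dma;
  }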

Will turning on the Opteron's IOMMU introduce some of these
issues to x86?

Tom.





Re: [openib-general] basic IB doubt

2006-08-28 Thread Talpey, Thomas
At 12:22 PM 8/28/2006, Jason Gunthorpe wrote:
On Mon, Aug 28, 2006 at 10:38:43AM -0400, Talpey, Thomas wrote:

 Will turning on the Opteron's IOMMU introduce some of these
 issues to x86?

No, definitely not. The Opteron IOMMU (the GART) is a pure address
translation mechanism and doesn't change the operation of the caches.

Okay, that's good. However, doesn't it delay reads and writes until the
necessary table walk / mapping is resolved? Because it passes all other
cycles through, it seems to me that an interrupt may pass data, meaning
that ordering (at least) may be somewhat different when it's present.
And, those pending writes are not in the cache's consistency domain
(i.e. they can't be snooped yet, right?).

If Sun has a problem on larger systems, I wonder if SGI Altix also has a
problem? SGI Altix is definitely a real system that people use IB
cards in today, and it would be easy to imagine such a large system
could have coherence issues with memory polling...

I'd be interested in this too.

Tom.





Re: [openib-general] basic IB doubt

2006-08-25 Thread Talpey, Thomas
At 12:40 PM 8/25/2006, Sean Hefty wrote:
Thomas How does an adapter guarantee that no bridges or other
Thomas intervening devices reorder their writes, or for that
Thomas matter flush them to memory at all!?

That's a good point.  The HCA would have to do a read to flush the
posted writes, and I'm sure it's not doing that (since it would add
horrible latency for no good reason).

I guess it's not safe to rely on ordering of RDMA writes after all.

Couldn't the same point then be made that a CQ entry may come before the data
has been posted?

When the CQ entry arrives, the context that polls it off the queue
must use the dma_sync_*() api to finalize any associated data
transactions (known by the upper layer).

This is basic, and it's the reason that a completion is so important.
The completion, in and of itself, isn't what drives the synchronization.
It's the transfer of control to the processor.

Tom.





Re: [openib-general] basic IB doubt

2006-08-25 Thread Talpey, Thomas
At 03:23 PM 8/25/2006, Greg Lindahl wrote:
On Fri, Aug 25, 2006 at 03:21:20PM -0400, [EMAIL PROTECTED] wrote:

 I presume you meant invalidate the cache, not flush it, before accessing DMA'ed data.

Yes, this is what I meant. Sorry!

Flush (sync for_device) before posting.
Invalidate (sync for_cpu) before processing.

On some architectures, these operations flush and/or invalidate
i/o pipeline caches as well. As they should.

Tom.





Re: [openib-general] basic IB doubt

2006-08-24 Thread Talpey, Thomas
At 07:46 PM 8/23/2006, Roland Dreier wrote:
Greg Actually, that leads me to a question: does the vendor of
Greg that adaptor say that this is actually safe? Just because
Greg something behaves one way most of the time doesn't mean it
Greg does it all of the time. So is it really smart to write
Greg non-standard-conforming programs unless the vendor stands
Greg behind that behavior?

Yes, Mellanox documents that it is safe to rely on the last byte of an
RDMA being written last.

How does an adapter guarantee that no bridges or other intervening devices
reorder their writes, or for that matter flush them to memory at all!?

Without signalling the host processor, that is. Isn't that what the dma_sync()
API is all about?

Tom.





Re: [openib-general] ib_get_dma_mr and remote access

2006-08-16 Thread Talpey, Thomas
At 05:53 PM 8/15/2006, Louis Laborde wrote:
Hi there,

I would like to know if any application today uses ib_get_dma_mr verb with
remote access flag(s).

The NFS/RDMA client does this, if configured to do so. Otherwise, it registers
specific byte regions when remote access is required. The client supports
numerous memory registration strategies, to suit user requirements and
HCA/RNIC limitations.

It seems to me that such a dependency could first, create a security hole
and second, make this verb hard to implement for some RNICs.

Yes, and yes.

If only local access is required for this special memory region, can
it be implemented with the Reserved LKey or STag0, whichever way it's
called?

Sure, and I expect many consumers would be fine with this. Note however
that iWARP RDMA Read requires remote write access to be granted on the
destination sge's, unlike IB RDMA Read, which requires only local.
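
(A sketch of the registration being talked about, using the kernel verb; "pd"
is a hypothetical protection domain, and the flag choice is exactly the
security trade-off above.)

  #include <rdma/ib_verbs.h>

  static struct ib_mr *register_all_physical(struct ib_pd *pd, int allow_remote)
  {
          int flags = IB_ACCESS_LOCAL_WRITE;

          if (allow_remote)
                  /* Grants the peer RDMA access to everything reachable
                   * through this lkey/rkey -- the security hole
                   * acknowledged above. */
                  flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;

          /* Note: on iWARP, a buffer used as the data sink of an RDMA Read
           * must carry IB_ACCESS_REMOTE_WRITE; on IB, local access suffices. */
          return ib_get_dma_mr(pd, flags);
  }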

Tom.





[openib-general] NFS/RDMA for Linux: client and server update release 6

2006-08-04 Thread Talpey, Thomas
Network Appliance is pleased to announce release 6 of the NFS/RDMA
client and server for Linux 2.6.17. This update to the May 22 release
fixes known issues, improves usability and server stability, and supports
NFSv4. The code supports both InfiniBand and iWARP transports over
the standard OpenFabrics Linux facility.

http://sourceforge.net/projects/nfs-rdma/

https://sourceforge.net/project/showfiles.php?group_id=97628&package_id=199510

This code is running successfully at multiple user locations. A special
thanks goes to Helen Chen and her team at Sandia Labs for their help
in resolving multiple usability and stability issues. The code in the current
release was used to produce the results reported in their presentation
at the recent Commodity Cluster Computing Symposium in Baltimore.

Tom Talpey, for the NFS/RDMA project.

---

Changes since RC5

  2.6.17.* kernel/transport switch target
(also fixes IPv6 issues)
  NFS-RDMA client:
support NFSv4
  NFS-RDMA server:
kconfig changes
fully uses dma_map()/dma_unmap() api
fix race between connection acceptance and first client request
fix I/O thread not going to sleep
fix two issues in export cache handling
fix data corruption with certain pathological client alignments
  nfsrdmamount command:
support NFSv4
runtime warnings on certain systems addressed





[openib-general] Fwd: WG Action: Conclusion of IP over InfiniBand (ipoib)

2006-07-06 Thread Talpey, Thomas
FYI...

 -- Forwarded Message --
To: ietf-announce@ietf.org
From: IESG Secretary [EMAIL PROTECTED]
Date: Wed, 05 Jul 2006 15:50:01 -0400
Cc: ipoverib@ietf.org, H.K. Jerry Chu [EMAIL PROTECTED],
Bill Strahm [EMAIL PROTECTED]
Subject: WG Action: Conclusion of IP over InfiniBand (ipoib) 
List-Id: ietf-announce.ietf.org
List-Post: mailto:ietf-announce@ietf.org
List-Help: mailto:[EMAIL PROTECTED]
List-Subscribe: https://www1.ietf.org/mailman/listinfo/ietf-announce,
   mailto:[EMAIL PROTECTED]

The IP over InfiniBand WG (ipoib) in the Internet Area has concluded.

The IESG contact persons are Jari Arkko and Mark Townsley.

+++

The IPOIB working group has completed its main task of
defining how to run IP over InfiniBand. It has published
three RFCs and a fourth one is in the RFC Editor's queue,
soon to become an RFC as well.

There are some additional work items in the milestone
plan, a set of MIBs. But after reviewing the status and
activity in the group it seems best to close the WG.
There are a few individuals who are still interested in
pursuing a part of the MIB work, and they are encouraged
to submit their work as an AD sponsored document, when
the work is completed.

The mailing list for the group will remain active.

 -- End of Forwarded Message --





Re: [openib-general] max_send_sge max_sge

2006-06-28 Thread Talpey, Thomas
Yep, you're confirming my comment that the sge size is dependent
on the memory registration strategy (and not the protocol itself).
Because you have a pool approach, you potentially have a lot of
discontiguous regions. Therefore, you need more sge's. (You could
have the same issue with large preregistrations, etc.)

If it's just for RDMA Write, the penalty really isn't that high - you can
easily break the i/o up into separate RDMA Write ops and pump them
out in a sequence. The HCA streams them, and using unsignalled
completion on the WRs means the host overhead can be low.
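
(A sketch of that pattern -- N RDMA Writes with only the last one signalled.
The old-style struct ib_send_wr layout is assumed; "qp", "sge", "raddr" and
"rkey" are hypothetical.)

  #include <rdma/ib_verbs.h>
  #include <linux/string.h>

  static int post_write_chain(struct ib_qp *qp, struct ib_sge *sge, int n,
                              u64 raddr, u32 rkey)
  {
          struct ib_send_wr wr, *bad_wr;
          u64 off = 0;
          int i, ret = 0;

          for (i = 0; i < n && !ret; i++) {
                  memset(&wr, 0, sizeof wr);
                  wr.opcode              = IB_WR_RDMA_WRITE;
                  wr.sg_list             = &sge[i];
                  wr.num_sge             = 1;
                  wr.wr.rdma.remote_addr = raddr + off;
                  wr.wr.rdma.rkey        = rkey;
                  /* Only the final write generates a completion. */
                  wr.send_flags          = (i == n - 1) ? IB_SEND_SIGNALED : 0;
                  ret = ib_post_send(qp, &wr, &bad_wr);
                  off += sge[i].length;
          }
          return ret;
  }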

For sends, it's more painful. You have to pull them up. Do you really
need send inlines to be that big? I guess if you're supporting a writev()
api over inline you don't have much control, but even writev has a
maxiov.

The approach the NFS/RDMA client takes is basically to have a pool
of dedicated buffers for headers, with a certain amount of space for
small sends. This maximum inline size is typically 1K or maybe 4K
(it's configurable), and it copies send data into them if it fits. All
other operations are posted as chunks, which are explicit protocol
objects corresponding to { mr, offset, length } triplets. The protocol
supports an arbitrary number of them, but typically 8 is plenty. Each
chunk results in an RDMA op from the server. If the server is coded
well, the RDMA streams beautifully and there is no bandwidth issue.
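
(A sketch of that marshalling decision; the struct and names are illustrative,
not the actual rpcrdma wire format.)

  #include <linux/types.h>
  #include <linux/string.h>

  struct chunk {                   /* one { mr, offset, length } triplet */
          u32 handle;              /* rkey of the registered segment */
          u32 length;              /* bytes in this segment */
          u64 offset;              /* remote address the peer targets */
  };

  enum { MAX_INLINE = 1024 };      /* typical configurable inline threshold */

  /* Returns 0 if the send fit inline, nonzero if it must go as chunks. */
  static int marshal_send(void *hdr_buf, const void *data, size_t len)
  {
          if (len <= MAX_INLINE) {
                  memcpy(hdr_buf, data, len);  /* pull-up into the header buffer */
                  return 0;
          }
          /* Otherwise: emit one struct chunk per registered segment
           * (typically up to 8) and let the peer move the data with RDMA. */
          return 1;
  }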

Just some ideas. I feel your pain.

Tom.

At 04:34 PM 6/27/2006, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Tue, 27 Jun 2006 09:06 -0400:
 At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote:
 Unless you use it, passing the absolute maximum value supported by hardware does
 not seem, to me, to make sense - it will just slow you down, and waste
 resources.  Is there a protocol out there that actually has a use for 30 sge?
 
 It's not a protocol thing, it's a memory registration thing. But I agree,
 that's a huge number of segments for send and receive. 2-4 is more
 typical. I'd be interested to know what wants 30 as well...

This is the OpenIB port of pvfs2: http://www.pvfs.org/pvfs2/download.html
See pvfs2/src/io/bmi/bmi_ib/openib.c for the bottom of the transport
stack.  The max_sge-1 aspect I'm complaining about isn't checked in yet.

It's a file system application.  The MPI-IO interface provides
datatypes and file views that let a client write complex subsets of
the in-memory data to a file with a single call.  One case that
happens is contiguous-in-file but discontiguous-in-memory, where the
file system client writes data from multiple addresses to a single
region in a file.  The application calls MPI_File_write or a
variant, and this complex buffer description filters all the way
down to the OpenIB transport, which then has to figure out how to
get the data to the server.

These separate data regions may have been allocated all at once
using MPI_Alloc_mem (rarely), or may have been used previously for
file system operations so are already pinned in the registration
cache.  Are you implying there is more memory registration work that
has to happen beyond making sure each of the SGE buffers is pinned
and has a valid lkey?

It would not be a major problem to avoid using more than a couple of
SGEs; however, I didn't see any reason to avoid them.  Please let me
know if you see a problem with this approach.

   -- Pete





Re: [openib-general] max_send_sge max_sge

2006-06-28 Thread Talpey, Thomas
At 08:42 AM 6/28/2006, Michael S. Tsirkin wrote:
Quoting r. Talpey, Thomas [EMAIL PROTECTED]:
 Just some ideas. I feel your pain.

Is there something that would make life easier for you?

A work-request-based IBTA 1.2/iWARP-compliant FMR implementation.

Please. :-)

Tom.





Re: [openib-general] max_send_sge max_sge

2006-06-28 Thread Talpey, Thomas
At 10:51 AM 6/28/2006, Michael S. Tsirkin wrote:
Yep.  We could have an option to have the stack scale the requested values down
to some legal set instead of failing an allocation.  But we couldn't come up
with a clean way to tell the stack e.g.  what should it round down: the SGE or
WR value.  Do you think selecting something arbitrarily might still be a good
idea?

No! Well, not as the default. Otherwise, the consumer has to go back
and check what happened even on success, which is a royal pain and
highly inefficient.

Maybe we should pass in an optional attribute structure, that is returned
with the granted attributes on success, or the would-have-been attributes
on failure?
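
(Until something like that exists, a consumer can at least query the device
limits and clamp its own request before creating the QP. A sketch; the
clamping policy is purely illustrative.)

  #include <rdma/ib_verbs.h>

  static void clamp_qp_caps(struct ib_device *device,
                            struct ib_qp_init_attr *init)
  {
          struct ib_device_attr attr;

          if (ib_query_device(device, &attr))
                  return;                        /* keep the caller's values */

          if (init->cap.max_send_sge > attr.max_sge)
                  init->cap.max_send_sge = attr.max_sge;
          if (init->cap.max_recv_sge > attr.max_sge)
                  init->cap.max_recv_sge = attr.max_sge;
          if (init->cap.max_send_wr > attr.max_qp_wr)
                  init->cap.max_send_wr = attr.max_qp_wr;
          if (init->cap.max_recv_wr > attr.max_qp_wr)
                  init->cap.max_recv_wr = attr.max_qp_wr;
  }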

Tom.





Re: [openib-general] max_send_sge max_sge

2006-06-27 Thread Talpey, Thomas
At 02:42 AM 6/27/2006, Michael S. Tsirkin wrote:
Unless you use it, passing the absolute maximum value supported by hardware does
not seem, to me, to make sense - it will just slow you down, and waste
resources.  Is there a protocol out there that actually has a use for 30 sge?

It's not a protocol thing, it's a memory registration thing. But I agree,
that's a huge number of segments for send and receive. 2-4 is more
typical. I'd be interested to know what wants 30 as well...

Tom.





Re: [openib-general] Local QP operation error

2006-06-27 Thread Talpey, Thomas
At 09:21 AM 6/27/2006, Ramachandra K wrote:
Does this error point to some issue with the DMA address specified 
in the work request SGE ?


Ding Ding Ding Ding! :-)

We recently identified the exact issue in the NFS/RDMA server, which
happened only when running on ia64. If you're not using the dma_map_*
api, that's maybe something to look at. ;-)
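
(The point being: the address placed in an SGE must be a bus address from the
mapping API, not a kernel virtual address. A sketch, error handling omitted;
"lkey" would come from whatever MR is in use.)

  #include <linux/dma-mapping.h>
  #include <rdma/ib_verbs.h>

  static void fill_sge(struct ib_device *dev, struct ib_sge *sge,
                       void *buf, size_t len, u32 lkey)
  {
          /* dma_map_single() returns the DMA (bus) address the HCA must use. */
          sge->addr   = dma_map_single(dev->dma_device, buf, len,
                                       DMA_BIDIRECTIONAL);
          sge->length = len;
          sge->lkey   = lkey;
  }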

Tom. 





Re: [openib-general] Mellanox HCAs: outstanding RDMAs

2006-06-16 Thread Talpey, Thomas
Mike, I am not arguing to change the standard. I am simply
saying I do not want to be a victim of the default. It is my
belief that very few upper layer programmers are aware of
this, btw.

The Linux NFS/RDMA upper layer implementation already deals
with the issue, as I mentioned. It would certainly welcome a
higher available IRD on Mellanox hardware however.

Thanks for your comments.

Tom.

At 01:55 PM 6/15/2006, Michael Krause wrote:

As one of the authors of IB and iWARP, I can say that both Roland and Todd's 
responses are correct and the intent of the specifications.  The number of 
outstanding RDMA Reads are bounded and that is communicated during session 
establishment.  The ULP can choose to be aware of this requirement (certainly 
when we wrote iSER and DA we were well aware of the requirement and we 
documented as such in the ULP specs) and track from above so that it does not 
see a stall or it can stay ignorant and deal with the stall as a result.  This 
is a ULP choice and has been intentionally done that way so that the hardware 
can be kept as simple as possible and as low cost as well while meeting the 
breadth of ULP needs that were used to develop these technologies.   

Tom, you raised this issue during iWARP's definition and the debate was 
conducted at least several times.  The outcome of these debates is reflected 
in iWARP and remains aligned with IB.  So, unless you really want to have the 
IETF and IBTA go and modify their specs, I believe you'll have to deal with 
the issue just as other ULP are doing today and be aware of the constraint and 
write the software accordingly.  The open source community isn't really the 
right forum to change iWARP and IB specifications at the end of the day.  
Build a case in the IETF and IBTA and let those bodies determine whether it is 
appropriate to modify their specs or not.  And yes, it is modification of the 
specs and therefore the hardware implementations as well address any 
interoperability requirements that would result (the change proposed could 
fragment the hardware offerings as there are many thousands of devices in the 
market that would not necessarily support this change).

Mike




At 12:07 PM 6/6/2006, Talpey, Thomas wrote:
Todd, thanks for the set-up. I'm really glad we're having this discussion!

Let me give an NFS/RDMA example to illustrate why this upper layer,
at least, doesn't want the HCA doing its flow control, or resource
management.

NFS/RDMA is a credit-based protocol which allows many operations in
progress at the server. Let's say the client is currently running with
an RPC slot table of 100 requests (a typical value).

Of these requests, some workload-specific percentage will be reads,
writes, or metadata. All NFS operations consist of one send from
client to server, some number of RDMA writes (for NFS reads) or
RDMA reads (for NFS writes), then terminated with one send from
server to client.

The number of RDMA read or write operations per NFS op depends
on the amount of data being read or written, and also the memory
registration strategy in use on the client. The highest-performing
such strategy is an all-physical one, which results in one RDMA-able
segment per physical page. NFS r/w requests are, by default, 32KB,
or 8 pages typical. So, typically 8 RDMA requests (read or write) are
the result.

To illustrate, let's say the client is processing a multi-threaded
workload, with (say) 50% reads, 20% writes, and 30% metadata
such as lookup and getattr. A kernel build, for example. Therefore,
of our 100 active operations, 50 are reads for 32KB each, 20 are
writes of 32KB, and 30 are metadata (non-RDMA). 

To the server, this results in 100 requests, 100 replies, 400 RDMA
writes, and 160 RDMA Reads. Of course, these overlap heavily due
to the widely differing latency of each op and the highly distributed
arrival times. But, for the example this is a snapshot of current load.

The latency of the metadata operations is quite low, because lookup
and getattr are acting on what is effectively cached data. The reads
and writes however, are much longer, because they reference the
filesystem. When disk queues are deep, they can take many ms.

Imagine what happens if the client's IRD is 4 and the server ignores
its local ORD. As soon as a write begins execution, the server posts
8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads
are sent, the fifth stalls, and stalls the send queue! Even when three
RDMA Reads complete, the queue remains stalled, it doesn't unblock
until the fourth is done and all the RDMA Reads have been initiated.

But, what just happened to all the other server send traffic? All those
metadata replies, and other reads which completed? They're stuck,
waiting for that one write request. In my example, these number 99 NFS
ops, i.e. 654 WRs! All for one NFS write! The client operation stream
effectively became single threaded. What good is the rapid initiation
of RDMA Reads you describe in the face of this?

[openib-general] Re: Mellanox HCAs: outstanding RDMAs

2006-06-06 Thread Talpey, Thomas
At 08:44 AM 6/6/2006, Michael S. Tsirkin wrote:
 MST, are you disagreeing that RDMA Reads can stall the queue?

I don't disagree with this of course. I was simply suggesting to ULP designers
to read the chapter 9.5 and become aware of the rules, taking them
into account at early stages of protocol design.

:-) RTFM?

I still think flow control is a wrong and dangerous thing for RDMA Read.
If it never happened, and the connections just failed, we'd never have
the issue. Also, I'm certain we'll see upper layers that work on one
provider, only to fail on another. Sigh.

Tom.





[openib-general] Re: Mellanox HCAs: outstanding RDMAs

2006-06-06 Thread Talpey, Thomas
At 08:56 AM 6/6/2006, Michael S. Tsirkin wrote:
 The core spec does not require it. An implementation *may* enforce it,
 but is not *required* to do so. And as pointed out in the other message,
 there are repercussions of doing so.

Interesting, I wasn't aware of such interpretation of the spec.
When QP is modified to RTS, the initiator depth is passed to it, which
suggests that the provider must obey, not ignore this parameter. No?

This is the difference between may and must. The value is provided,
but I don't see anything in the spec that makes a requirement on its
enforcement. Table 107 says the consumer can query it, that's about
as close as it comes. There's some discussion about CM exchange too.

Don't forget about iWARP, btw.

Tom.





Re: [openib-general] Re: Mellanox HCAs: outstanding RDMAs

2006-06-06 Thread Talpey, Thomas
At 10:40 AM 6/6/2006, Roland Dreier wrote:
Thomas This is the difference between may and must. The value
Thomas is provided, but I don't see anything in the spec that
Thomas makes a requirement on its enforcement. Table 107 says the
Thomas consumer can query it, that's about as close as it
Thomas comes. There's some discussion about CM exchange too.

This seems like a very strained interpretation of the spec.  For

I don't see how strained has anything to do with it. It's not saying anything
either way. So, a legal implementation can make either choice. We're
talking about the spec!

But, it really doesn't matter. The point is, an upper layer should be paying
attention to the number of RDMA Reads it posts, or else suffer either the
queue-stalling or connection-failing consequences. Bad stuff either way.

Tom.


example, there's no explicit language in the IB spec that requires an
HCA to use the destination LID passed via a modify QP operation, but I
don't think anyone would seriously argue that an implementation that
sent messages to some other random destination was compliant.

In the same way, if I pass a limit for the number of outstanding
RDMA/atomic operations in to a modify QP operation, I would expect the
HCA to use that limit.

 - R.





RE: [openib-general] Re: Mellanox HCAs: outstanding RDMAs

2006-06-06 Thread Talpey, Thomas
Todd, thanks for the set-up. I'm really glad we're having this discussion!

Let me give an NFS/RDMA example to illustrate why this upper layer,
at least, doesn't want the HCA doing its flow control, or resource
management.

NFS/RDMA is a credit-based protocol which allows many operations in
progress at the server. Let's say the client is currently running with
an RPC slot table of 100 requests (a typical value).

Of these requests, some workload-specific percentage will be reads,
writes, or metadata. All NFS operations consist of one send from
client to server, some number of RDMA writes (for NFS reads) or
RDMA reads (for NFS writes), then terminated with one send from
server to client.

The number of RDMA read or write operations per NFS op depends
on the amount of data being read or written, and also the memory
registration strategy in use on the client. The highest-performing
such strategy is an all-physical one, which results in one RDMA-able
segment per physical page. NFS r/w requests are, by default, 32KB,
or 8 pages typical. So, typically 8 RDMA requests (read or write) are
the result.

To illustrate, let's say the client is processing a multi-threaded
workload, with (say) 50% reads, 20% writes, and 30% metadata
such as lookup and getattr. A kernel build, for example. Therefore,
of our 100 active operations, 50 are reads for 32KB each, 20 are
writes of 32KB, and 30 are metadata (non-RDMA). 

To the server, this results in 100 requests, 100 replies, 400 RDMA
writes, and 160 RDMA Reads. Of course, these overlap heavily due
to the widely differing latency of each op and the highly distributed
arrival times. But, for the example this is a snapshot of current load.

The latency of the metadata operations is quite low, because lookup
and getattr are acting on what is effectively cached data. The reads
and writes however, are much longer, because they reference the
filesystem. When disk queues are deep, they can take many ms.

Imagine what happens if the client's IRD is 4 and the server ignores
its local ORD. As soon as a write begins execution, the server posts
8 RDMA Reads to fetch the client's write data. The first 4 RDMA Reads
are sent, the fifth stalls, and stalls the send queue! Even when three
RDMA Reads complete, the queue remains stalled, it doesn't unblock
until the fourth is done and all the RDMA Reads have been initiated.

But, what just happened to all the other server send traffic? All those
metadata replies, and other reads which completed? They're stuck,
waiting for that one write request. In my example, these number 99 NFS
ops, i.e. 654 WRs! All for one NFS write! The client operation stream
effectively became single threaded. What good is the rapid initiation
of RDMA Reads you describe in the face of this?

Yes, there are many arcane and resource-intensive ways around it.
But the simplest by far is to count the RDMA Reads outstanding, and
for the *upper layer* to honor ORD, not the HCA. Then, the send queue
never blocks, and the operation streams never loses parallelism. This
is what our NFS server does.
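
(A sketch of that accounting, with illustrative names -- not the actual
NFS/RDMA server code: the upper layer reserves a slot against the negotiated
ORD before posting each RDMA Read, and defers the read rather than letting the
send queue stall.)

  #include <linux/spinlock.h>
  #include <linux/list.h>

  struct read_limiter {
          spinlock_t       lock;
          unsigned int     ord;         /* negotiated outstanding-read limit */
          unsigned int     inflight;    /* RDMA Reads currently posted */
          struct list_head deferred;    /* reads waiting for a free slot */
  };

  /* Call before posting an RDMA Read; if it fails, queue the read instead. */
  static int reserve_read_slot(struct read_limiter *rl)
  {
          int ok;

          spin_lock(&rl->lock);
          ok = rl->inflight < rl->ord;
          if (ok)
                  rl->inflight++;
          spin_unlock(&rl->lock);
          return ok;
  }

  /* Call from the RDMA Read completion handler. */
  static void release_read_slot(struct read_limiter *rl)
  {
          spin_lock(&rl->lock);
          rl->inflight--;
          /* ... pop one deferred read, if any, and post it now ... */
          spin_unlock(&rl->lock);
  }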

As to the depth of IRD, this is a different calculation: it's a delay x bandwidth
product of the RDMA Read stream. 4 is good for local, low-latency connections.
But over a complicated switch infrastructure, or heaven forbid a dark fiber
long link, I guarantee it will cause a bottleneck. This isn't an issue except
for operations that care, but it is certainly detectable. I would like to see
if a pure RDMA Read stream can fully utilize a typical IB fabric, and how
much headroom an IRD of 4 provides. Not much, I predict.

Closing the connection if IRD is insufficient to meet goals isn't a good
answer, IMO. How does that benefit interoperability? 

Thanks for the opportunity to spout off again. Comments welcome!

Tom.

At 12:43 PM 6/6/2006, Rimmer, Todd wrote:


 Talpey, Thomas
 Sent: Tuesday, June 06, 2006 10:49 AM
 
 At 10:40 AM 6/6/2006, Roland Dreier wrote:
 Thomas This is the difference between may and must. The value
 Thomas is provided, but I don't see anything in the spec that
 Thomas makes a requirement on its enforcement. Table 107 says the
 Thomas consumer can query it, that's about as close as it
 Thomas comes. There's some discussion about CM exchange too.
 
 This seems like a very strained interpretation of the spec.  For
 
 I don't see how strained has anything to do with it. It's not saying anything
 either way. So, a legal implementation can make either choice. We're
 talking about the spec!
 
 But, it really doesn't matter. The point is, an upper layer should be paying
 attention to the number of RDMA Reads it posts, or else suffer either the
 queue-stalling or connection-failing consequences. Bad stuff either way.
 
 Tom.

Somewhere beneath this discussion is a bug in the application or IB
stack.  I'm not sure which may in the spec you are referring to, but
the mays I have found all are for cases where the responder might
support only 1 outstanding request

RE: [openib-general] Mellanox HCAs: outstanding RDMAs

2006-06-05 Thread Talpey, Thomas
At 10:03 AM 6/3/2006, Rimmer, Todd wrote: 
 Yes, the limit of outstanding RDMAs is not related to the send queue
 depth.  Of course you can post many more than 4 RDMAs to a send queue
 -- the HCA just won't have more than 4 requests outstanding at a time.

To further clarify, this parameter only affects the number of concurrent
outstanding RDMA Reads which the HCA will process.  Once it hits this
limit, the send Q will stall waiting for issued reads to complete prior
to initiating new reads.

It's worse than that - the send queue must stall for *all* operations.
Otherwise the hardware has to track in-progress operations which are
queued after stalled ones. It really breaks the initiation model.

Semantically, the provider is not required to provide any such flow control
behavior by the way. The Mellanox one apparently does, but it is not
a requirement of the verbs, it's a requirement on the upper layer. If more
RDMA Reads are posted than the remote peer supports, the connection
may break.

The number of outstanding RDMA Reads is negotiated by the CM during
connection establishment and the QP which is sending the RDMA Read must
have a value configured for this parameter which is <= the remote end's
capability.

In other words, we're probably stuck at 4. :-) I don't think there is any
Mellanox-based implementation that has ever supported > 4.

In previous testing by Mellanox on SDR HCAs they indicated values beyond
2-4 did not improve performance (and in fact required more RDMA
resources be allocated for the corresponding QP or HCA).  Hence I
suspect a very large value like 128 would offer no improvement over
values in the 2-8 range.

I am not so sure of that. For one thing, it's dependent on VERY small
latencies. The presence of a switch, or link extenders will make a huge
difference. Second, heavy multi-QP firmware loads will increase the
latencies. Third, constants are pretty much never a good idea in
networking.

The NFS/RDMA client tries to set the maximum IRD value it can obtain.
RDMA Read is used quite heavily by the server to fetch client data
segments for NFS writes.
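
(Concretely, those values travel in the connection parameters. A sketch using
the RDMA CM of that era; "cm_id" is hypothetical and the retry counts are
arbitrary.)

  #include <rdma/rdma_cm.h>
  #include <rdma/ib_verbs.h>
  #include <linux/string.h>
  #include <linux/kernel.h>

  static int connect_with_max_ird(struct rdma_cm_id *cm_id,
                                  struct ib_device_attr *attr)
  {
          struct rdma_conn_param param;

          memset(&param, 0, sizeof param);
          /* Ask for as much IRD/ORD as the local device can support;
           * the CM exchange clamps this against the peer. */
          param.responder_resources = min_t(int, attr->max_qp_rd_atom, 255);
          param.initiator_depth     = min_t(int, attr->max_qp_init_rd_atom, 255);
          param.retry_count         = 7;
          param.rnr_retry_count     = 7;
          return rdma_connect(cm_id, &param);
  }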

Tom.




Re: [openib-general] Question about the IPoIB bandwidth performance ?

2006-06-05 Thread Talpey, Thomas
At 11:38 AM 6/5/2006, hbchen wrote:
Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth utilization 
is still very low.
 IPoIB=420MB/sec  
 bandwidth utilization= 420/1024 = 41.01%


Helen, have you measured the CPU utilizations during these runs?
Perhaps you are out of CPU.

Outrageous opinion follows.

Frankly, an IB HCA running Ethernet emulation is approximately the
world's worst 10GbE adapter (not to put too fine of a point on it :-) )
There is no hardware checksumming, nor large-send offloading, both
of which force overhead onto software. And, as you just discovered,
it isn't even 10Gb!

In general, network emulation layers are always going to perform more
poorly than native implementations. But this is only a generality learned
from years of experience with them.

Tom.  




Re: [openib-general] Question about the IPoIB bandwidth performance ?

2006-06-05 Thread Talpey, Thomas
At 12:11 PM 6/5/2006, hbchen wrote:
Perhaps you are out of CPU.

  
Tom,
I am HB Chen from LANL not the Helen Chen from SNL.

Oops, sorry! I have too many email messages going by. :-)
HB, then.


I didn't run out of CPU.  It is about 70-80 % of CPU utilization.

But, is one CPU at 100%? Interrupt processing, for example.

  

Outrageous opinion follows.

Frankly, an IB HCA running Ethernet emulation is approximately the
world's worst 10GbE adapter (not to put too fine of a point on it :-) )
  
The IP over Myrinet (Ethernet emulation) can reach up to 96%-98% bandwidth
utilization; why not IPoIB?

I am not familiar with the implementation Myrinet uses. In any
case, I am not saying that an emulation can't reach certain goals,
just that they will pretty much always be inferior to native approaches.
Sometimes far inferior.

Tom. 




Re: [openib-general] Question about the IPoIB bandwidth performance ?

2006-06-05 Thread Talpey, Thomas
Who said anything about Ethernet emulation? Hal said he is running
straight Netperf over IB, not Ethernet emulation. I don't think that any IB
HCAs today support offloaded checksum and large send. You are comparing
apples and oranges. 

I consider IPoIB to be Ethernet emulation.

As for apples and oranges, my point exactly.

Tom.


At 12:53 PM 6/5/2006, Bernard King-Smith wrote:
 Thomas Talpey said:
 At 11:38 AM 6/5/2006, hbchen wrote:
 Even with this IB-4X = 8Gb/sec = 1024 MB/sec the IPoIB bandwidth
utilization is still very  low.
  IPoIB=420MB/sec
  bandwidth utilization= 420/1024 = 41.01%


 Helen, have you measured the CPU utilizations during these runs?
 Perhaps you are out of CPU.

 Outrageous opinion follows.

 Frankly, an IB HCA running Ethernet emulation is approximately the
 world's worst 10GbE adapter (not to put too fine of a point on it :-) )
 There is no hardware checksumming, nor large-send offloading, both
 of which force overhead onto software. And, as you just discovered
 it isn't even 10Gb!

 In general, network emulation layers are always going to perform more
 poorly than native implementations. But this is only a generality learned
 from years of experience with them

 Tom.

Hold on here

Who said anything about Ethernet emulation? Hal said he is running
straight Netperf over IB, not Ethernet emulation. I don't think that any IB
HCAs today support offloaded checksum and large send. You are comparing
apples and oranges. The only appropriate comparison is to use the IBM HCA
compared to the mthca adapters. I think Hal's point is actually comparing
any IB adapter against GigE and Myrinet. Both the mthca and IBM HCA's
should get similar IPoIB performance using identical OpenIB stacks.


Bernie King-Smith
IBM Corporation
Server Group
Cluster System Performance
[EMAIL PROTECTED](845)433-8483
Tie. 293-8483 or wombat2 on NOTES

We are not responsible for the world we are born into, only for the world
we leave when we die.
So we have to accept what has gone before us and work to change the only
thing we can,
-- The Future. William Shatner


   

Re: [openib-general] NFS/RDMA for Linux: client and server update release 5

2006-05-24 Thread Talpey, Thomas
[Cutting down the reply list to more relevant parties...]

It's hard to say what is crashing, but I suspect the CM code, due
to the process context being ib_cm. Is there some reason you're
not getting symbols in the stack trace? If you could feed this oops
text to ksymoops it will give us more information.

In any case, it appears the connection is succeeding at the server,
but the client RPC code isn't being signalled that it has done so.
Perhaps this is due to a lost reply, but the NFS code hasn't actually
started to do anything. So, I would look for IB-level issues. Is the
client running the current OpenFabrics svn top-of-tree?

Let's take this offline to diagnose, unless someone has an idea why
the CM would be failing. The ksymoops analysis would help.

Tom.



At 07:19 PM 5/23/2006, helen chen wrote:
Hi Tom,

I have downloaded your release 5 of the NFS/RDMA and am having trouble
mounting the rdma nfs, the 
./nfsrdmamount -o rdma on16-ib:/mnt/rdma /mnt/rdma command never
returned. and the dmesg for client and server are:

-- demsg from client -
RPCRDMA Module Init, register RPC RDMA transport
Defaults:
MaxRequests 50
MaxInlineRead 1024
MaxInlineWrite 1024
Padding 0
Memreg 5
RPC: Registered rdma transport module.
RPC: Registered rdma transport module.
RPC:   xprt_setup_rdma: 140.221.134.221:2049
nfs: server on16-ib not responding, timed out
Unable to handle kernel NULL pointer dereference at 
RIP:
[]
PGD a9f2b067 PUD a8ca2067 PMD 0
Oops: 0010 [1] PREEMPT SMP
CPU 1
Modules linked in: xprtrdma ib_srp iscsi_tcp scsi_transport_iscsi
scsi_mod
Pid: 346, comm: ib_cm/1 Not tainted 2.6.16.16 #4
RIP: 0010:[] []
RSP: 0018:8100af5a1c30  EFLAGS: 00010246
RAX: 8100aeff2400 RBX: 8100aeff2400 RCX: 8100afc9e458
RDX:  RSI: 8100af5a1d48 RDI: 8100aeff2440
RBP: 8100aeff2440 R08:  R09: 
R10: 0003 R11:  R12: 8100aeff2500
R13: ff99 R14: 8100af5a1d48 R15: 8036c72c
FS:  00505ae0() GS:810003ce25c0()
knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2:  CR3: ad587000 CR4: 06a0
Process ib_cm/1 (pid: 346, threadinfo 8100af5a, task
8100afea8100)
Stack: 8802a331 8100aeff2500 0001
8100aeff2440
   804011fd  8802a343
8100afdd6100
   80364ee4 0100
Call Trace: [8802a331] [804011fd]
   [8802a343] [80364ee4] [80364341]
   [8036f85c] [8036fcf2] [8036baeb]
   [8036bdc1] [8036d6fe] [8036c72c]
   [801377b4] [801377fb] [8013a960]
   [80137900] [8012309b] [8013a960]
   [8012309b] [8013a960] [8013a937]
   [8010b8d6] [8013a960] [801160b9]
   [801160b9] [801160b9] [8013a86f]
   [8010b8ce]

Code:  Bad RIP value.
RIP [] RSP 8100af5a1c30
CR2: 

--dmesg from server --
nfsd: request from insecure port 140.221.134.220, port=32768!
svc_rdma_recvfrom: transport 81007e8f2800 is closing
svc_rdma_put: Destroying transport 81007e8f2800,
cm_id=81007e945200, sk_flags=154, sk_inuse=0

Did I forget to configure necessary components into my kernel?

Thanks,
Helen

On Mon, 2006-05-22 at 13:25, Talpey, Thomas wrote:
 Network Appliance is pleased to announce release 5 of the NFS/RDMA
 client and server for Linux 2.6.16.16. This update to the April 19 release
 adds improved server parallel performance and fixes various issues. This
 code supports both Infiniband and iWARP transports.
 
 http://sourceforge.net/projects/nfs-rdma/
 
 
http://sourceforge.net/project/showfiles.php?group_id=97628package_id=191427
 
 Comments and feedback welcome. We're especially interested in
 successful test reports! Thanks.
 
 Tom Talpey, for the various NFS/RDMA projects.
 
 



Re: [openib-general] NFS/RDMA for Linux: client and server update release 5

2006-05-24 Thread Talpey, Thomas
OBTW, I just noticed that your server printed the message:

nfsd: request from insecure port 140.221.134.220, port=32768!

This means the /mnt/rdma export isn't configured with "insecure",
which causes the server to close the connection. Because the IB CM
does not allow the client to use so-called secure ports (< 1024), you
need to set this flag on any RDMA exports; this is mentioned in our
README.

The jury is out on whether it's worth implementing the source port
emulation in the IB CM. The problem is that to do so requires the
CM to interface with the local IP port space, or manage one of its
own. So for now, NFS/RDMA just recommends using the exports
flag. Frankly, it provides no additional security, and is misnamed...

Tom.

At 07:25 AM 5/24/2006, Talpey, Thomas wrote:
[Cutting down the reply list to more relevant parties...]

It's hard to say what is crashing, but I suspect the CM code, due
to the process context being ib_cm. Is there some reason you're
not getting symbols in the stack trace? If you could feed this oops
text to ksymoops it will give us more information.

In any case, it appears the connection is succeeding at the server,
but the client RPC code isn't being signalled that it has done so.
Perhaps this is due to a lost reply, but the NFS code hasn't actually
started to do anything. So, I would look for IB-level issues. Is the
client running the current OpenFabrics svn top-of-tree?

Let's take this offline to diagnose, unless someone has an idea why
the CM would be failing. The ksymoops analysis would help.

Tom.



At 07:19 PM 5/23/2006, helen chen wrote:
Hi Tom,

I have downloaded your release 5 of the NFS/RDMA and am having trouble
mounting the rdma nfs, the 
./nfsrdmamount -o rdma on16-ib:/mnt/rdma /mnt/rdma command never
returned. and the dmesg for client and server are:


--dmesg from server --
nfsd: request from insecure port 140.221.134.220, port=32768!
svc_rdma_recvfrom: transport 81007e8f2800 is closing
svc_rdma_put: Destroying transport 81007e8f2800,
cm_id=81007e945200, sk_flags=154, sk_inuse=0

Did I forget to configure necessary components into my kernel?

Thanks,
Helen

On Mon, 2006-05-22 at 13:25, Talpey, Thomas wrote:
 Network Appliance is pleased to announce release 5 of the NFS/RDMA
 client and server for Linux 2.6.16.16. This update to the April 19 release
 adds improved server parallel performance and fixes various issues. This
 code supports both Infiniband and iWARP transports.
 
 http://sourceforge.net/projects/nfs-rdma/
 
 
http://sourceforge.net/project/showfiles.php?group_id=97628package_id=191427
 
 Comments and feedback welcome. We're especially interested in
 successful test reports! Thanks.
 
 Tom Talpey, for the various NFS/RDMA projects

Re: [openib-general] Re: [PATCH] mthca: fix posting lists of 256 entries for tavor

2006-05-24 Thread Talpey, Thomas
At 10:52 AM 5/24/2006, Roland Dreier wrote:
Michael No idea - the site seems to be down :)

It's working from here -- must be an issue in your network.


I saw the same error, but adding www. to the openib.org url fixes it.

Tom.


Anyway the report is:

*
Host Architecture : x86_64
Linux Distribution: Fedora Core release 4 (Stentz)
Kernel Version: 2.6.11-1.1369_FC4smp
Memory size   : 4071672 kB
Driver Version: OFED-1.0-rc5-pre5
HCA ID(s) : mthca0
HCA model(s)  : 25208
FW version(s) : 4.7.600
Board(s)  : MT_00A0010001
*

posting a list of multiples of 256 WR to SRQ or QP may be corrupted.
The WR list that is being posted may be posted to a different QP than
the QP
number of the QP handle.

test to reproduce it: qp_test
daemon:
qp_test --daemon
client:
qp_test --thread=15 --oust=256 --srq CLIENT SR 1 1
 or
qp_test --thread=15 --oust=256 CLIENT SR 1 1



Re: [openib-general] [PATCH 1/2] mthca support for max_map_per_fmr device attribute

2006-05-23 Thread Talpey, Thomas
Doesn't this change only *increase* the window of vulnerability
from which FMRs suffer? I.e., when you say dirty, you mean still mapped,
right?

Tom.

At 07:11 AM 5/23/2006, Or Gerlitz wrote:
Or Gerlitz wrote:
 The max fmr remaps device attribute is not set by the driver, so the generic
 fmr_pool uses a default of 32. Enlarging this quantity would make the amortized
 cost of remaps lower. With the current mthca default profile on a memfull HCA,
 17 bits are used for MPT addressing, so an FMR can be remapped 2^15 - 1 >> 32 times.

Actually, the bigger (than unmap amortized cost) problem I was facing
with the unmap count being very low is the following: say my app
publishes N credits and serving each credit consumes one FMR, so my app
implementation created the pool with 2N FMRs and set the watermark to N.

When requests come fast enough, there's a window in time when there's
an unmapping of N FMRs running as a batch, but out of the remaining N FMRs
some are already dirty and can't be used to serve a credit. So the app
fails temporarily... So, setting the watermark to 0.5N might solve this,
but since enlarging the number of remaps is trivial, I'd like to do it
first.

The app i am talking about is a SCSI LLD (eg iSER, SRP) where each SCSI 
command consumes one FMR and the LLD posts to the SCSI ML how many 
commands can be issued in parallel.

Or.






[openib-general] NFS/RDMA for Linux: client and server update release 5

2006-05-22 Thread Talpey, Thomas
Network Appliance is pleased to announce release 5 of the NFS/RDMA
client and server for Linux 2.6.16.16. This update to the April 19 release
adds improved server parallel performance and fixes various issues. This
code supports both Infiniband and iWARP transports.

http://sourceforge.net/projects/nfs-rdma/

http://sourceforge.net/project/showfiles.php?group_id=97628&package_id=191427

Comments and feedback welcome. We're especially interested in
successful test reports! Thanks.

Tom Talpey, for the various NFS/RDMA projects.



RE: [openib-general] CMA IPv6 support

2006-05-15 Thread Talpey, Thomas
At 01:05 PM 5/15/2006, Sean Hefty wrote:
I came to the same conclusion a couple of weeks ago.  Rdma_create_id() will
likely need an address family parameter, or the user must explicitly 
bind before calling listen.

Rdma_create_id() already takes a struct sockaddr *, which has an address
family selector (sa_family) to define the contained address format. Why is
that one not sufficient?

Tom.



RE: [openib-general] CMA IPv6 support

2006-05-15 Thread Talpey, Thomas
At 01:26 PM 5/15/2006, Talpey, Thomas wrote:
At 01:05 PM 5/15/2006, Sean Hefty wrote:
I came to the same conclusion a couple of weeks ago.  Rdma_create_id() will
likely need an address family parameter, or the user must explicitly 
bind before calling listen.

Rdma_create_id() already takes a struct sockaddr *, which has an address
family selector (sa_family) to define the contained address format. Why is
that one not sufficient?

Scratch that, I was looking at our usage one layer up in the NFS/RDMA code,
which does have the struct sockaddr *.

Looking at rdma_listen(), the code I see checks for bound state before
proceeding to listen:

int rdma_listen(struct rdma_cm_id *id, int backlog)
{
struct rdma_id_private *id_priv;
int ret;

id_priv = container_of(id, struct rdma_id_private, id);
if (!cma_comp_exch(id_priv, CMA_ADDR_BOUND, CMA_LISTEN))
return -EINVAL;
...

This makes sense, because sockets work this way, and servers generally
want to listen on a port of their own choosing.

So, I think it's already there. Right?
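
(In other words, the usage looks like the sketch below, and the bind supplies
the family -- IPv6 here. Names are illustrative and error handling is minimal.)

  #include <rdma/rdma_cm.h>
  #include <linux/in6.h>
  #include <linux/socket.h>
  #include <linux/string.h>

  static int listen_on_v6(struct rdma_cm_id *id, __be16 port)
  {
          struct sockaddr_in6 addr;

          memset(&addr, 0, sizeof addr);
          addr.sin6_family = AF_INET6;     /* the family comes from the bind */
          addr.sin6_addr   = in6addr_any;
          addr.sin6_port   = port;

          if (rdma_bind_addr(id, (struct sockaddr *)&addr))
                  return -1;
          return rdma_listen(id, 10 /* backlog */);
  }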

Tom.



RE: [openib-general] CMA IPv6 support

2006-05-15 Thread Talpey, Thomas
At 02:04 PM 5/15/2006, Sean Hefty wrote:
This is a slightly older version of the code.  There's now a call to 
bind if the
user hadn't previously called it.

Ok, and sorry for not checking the top-of-tree.

So I like the old code better (requiring the bind). Besides, if the user
does bind, then the family argument would be completely redundant.

I assume you'd continue to support rdma_bind_addr() letting the
system choose a port by binding to port 0...

Tom.



Re: [openib-general] ip over ib throughtput

2006-05-12 Thread Talpey, Thomas
Hi Shirley - I had a chance to try with the tiny blocksizes but I'm afraid
the results aren't useful to estimate max throughput. The server I am
using runs out of CPU at about 33,600 IOPS for small I/Os (=4KB), so
with 2000 byte reads, all I can get is about 65MB/sec. (I get 33MB/s
with 1KB, 120MB/s with 4KB, etc). And recall with NFS-default 32KB reads
I get 450MB/s. All these limits are due to this server's CPU at 100%. Time
to find a bigger server!

The good news is, performance is nice and flat right up until the server
hits the CPU wall. In fact, the more directio threads I run in parallel, the
lower the client overhead. With 50 threads issuing reads, I see as little as
0.5 interrupts per I/O!

Sorry I couldn't push more throughput using only small reads. I could trunk
the I/O to multiple servers, but I assume you're only interested in single-
stream results.

Tom.

At 11:11 PM 5/10/2006, Shirley Ma wrote:

Talpey, Thomas [EMAIL PROTECTED] wrote on 05/10/2006 03:10:57 PM:
 Sure, but I wonder why it's interesting. Nobody ever uses NFS in such
 small blocksizes, and 2044 bytes would mean, say, 1800 bytes of payload.
 What data are you looking for, throughput and overhead? Direct RDMA,
 or inline?
 
 Tom. 

Throughput. I am wondering how much room IPoIB performance (throughput) can 
go. 

Thanks 
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638 



Re: [openib-general][patch review] srp: fmr implementation,

2006-05-11 Thread Talpey, Thomas
I certainly won't shoot you - I agree. The other risk of the
current FMRs is that people will think the F means Fast.

Tom.


At 08:32 PM 5/10/2006, Tom Tucker wrote:
On Wed, 2006-05-10 at 08:53 -0700, Roland Dreier wrote:
 Thomas I am planning to test this some more in the next few
 Thomas weeks, but what I'd really like to see is an IBTA
 Thomas 1.2-compliant implementation, and one that operated on
 Thomas work queue entries (not synchronous verbs). Is that being
 Thomas worked on?
 
 No current hardware supports that as far as I know.  (Well, ipath
 could fake it since they already implement all the verbs in software)
 

I'm almost certain I'll be shot for saying this, but isn't there a
danger of confusion with real FMRs when the HW shows up? If the benefit
isn't there -- why do it if the application outcomes are almost
certainly all bad?



Re: [openib-general][patch review] srp: fmr implementation,

2006-05-10 Thread Talpey, Thomas
At 03:12 PM 5/9/2006, Roland Dreier wrote:
BTW, does Mellanox (or anyone else) have any numbers showing that
using FMRs makes any difference in performance on a semi-realistic benchmark?

Not me. Using the current FMRs to register/deregister windows for
each NFS/RDMA operation yields only a slight performance improvement
over ib_reg_phys_mr(), and I suspect this is mainly from the fact that
FMRs are page-rounded. Additionally, I find that the queuepair (or perhaps
the completion queue) seems to hang unpredictably, new events get
stuck, only to flush after the upper layer times out and closes the
connection.

What I really don't like about the current FMRs is that they seem to
be optimized only for lazy-deregistration, the fmr pools attempt to defer
the deregistration somewhat indefinitely. This is an enormous security
hole, and pretty much defeats the point of dynamic registration. The
NFS/RDMA client has full-physical mode for users that want speed
in well-protected environments. And it's a LOT faster.

I am planning to test this some more in the next few weeks, but what
I'd really like to see is an IBTA 1.2-compliant implementation, and one
that operated on work queue entries (not synchronous verbs). Is that
being worked on?

Tom.



Re: [openib-general] ip over ib throughtput

2006-05-10 Thread Talpey, Thomas
At 11:13 PM 5/9/2006, Shirley Ma wrote:
Have you tried to send payload smaller than 2044? Any difference?


You mean MTU or ULP payload? The default NFS reads and writes are
32KB, and in the addressing mode used in these tests they were
broken into 8 page-sized RDMA ops. So, there were 9 ops from the
server, per NFS read. I used the default MTU so these were probably
19 messages on the wire. I don't expect much difference with smaller
MTU, but smaller NFS ops would be noticeable.

Tom.



Re: [openib-general] ip over ib throughtput

2006-05-10 Thread Talpey, Thomas
At 10:05 AM 5/10/2006, Shirley Ma wrote:
I meant payload less than or equal to 2044, not IB MTU. IPoIB can only 
send <=2044 bytes of payload per ib_post_send(). NFS/RDMA in this case sends 
32KB per ib_post_send(). 

Actually, in the cases I mentioned earlier, the NFS/RDMA server is
posting 8 4KB RDMA writes and one ~200 byte send to satisfy the
32KB direct read issued by the client. It's possible for the client to
construct many other requests however, so it's possible to result in
a 32KB single inline (nonRDMA) message, or if scatter/gather memory
registration is available, a single 32KB RDMA followed by the 200 byte
reply. Obviously, there are significant resource differences between
these. Which one to use can depend on many factors.
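
As a rough sketch of the first pattern (one RDMA Write per chunk,
unsignalled, followed by a small signalled send for the reply), using the
kernel verbs of this era; the chunk structure and function names below are
purely illustrative:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

struct chunk {
        u64 raddr;              /* client-advertised address */
        u32 rkey;               /* client-advertised rkey */
        struct ib_sge sge;      /* local source of this chunk */
};

static int post_reply(struct ib_qp *qp, struct chunk *ch, int nchunks,
                      struct ib_sge *reply_sge)
{
        struct ib_send_wr wr, reply, *bad;
        int i, ret = 0;

        for (i = 0; i < nchunks && !ret; i++) {
                memset(&wr, 0, sizeof wr);
                wr.opcode              = IB_WR_RDMA_WRITE;
                wr.sg_list             = &ch[i].sge;
                wr.num_sge             = 1;
                wr.wr.rdma.remote_addr = ch[i].raddr;
                wr.wr.rdma.rkey        = ch[i].rkey;
                /* no IB_SEND_SIGNALED: these writes complete silently */
                ret = ib_post_send(qp, &wr, &bad);
        }
        if (ret)
                return ret;

        memset(&reply, 0, sizeof reply);
        reply.opcode     = IB_WR_SEND;
        reply.sg_list    = reply_sge;
        reply.num_sge    = 1;
        reply.send_flags = IB_SEND_SIGNALED;    /* one completion per RPC */
        return ib_post_send(qp, &reply, &bad);
}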

It would be nice to know the performance 
difference under same payload for IPoIB over UD and NFS/RDMA. Is that 
possible? 

Sure, but I wonder why it's interesting. Nobody ever uses NFS in such
small blocksizes, and 2044 bytes would mean, say, 1800 bytes of payload.
What data are you looking for, throughput and overhead? Direct RDMA,
or inline?

Tom. 



Re: [openib-general][patch review] srp: fmr implementation,

2006-05-10 Thread Talpey, Thomas
At 11:36 AM 5/10/2006, Vu Pham wrote:
I can get ~780 MB/s max without FMRs and ~920 MB/s with FMRs 
(using 256 KB sequential read direct IO request)

In the without case, what memory registration strategy? Also,
what is the CPU utilization on the initiator in the two runs (i.e. is
the 780MB/s run CPU limited)?

Do you have performance results with smaller blocksizes?

Thanks,
Tom.



Re: [openib-general] ip over ib throughtput

2006-05-09 Thread Talpey, Thomas
Shirley, Hassan - I am *very* interested in these results, and I
want to at least mention that I'm doing similar NFS/RDMA testing,
and getting some contrasting results.

 699040 699040  16384    60.00    3668.07   (458MB/s) 
 cpu utilization was around 95%. 

On my dual-2.4GHz Xeon, with the relatively untuned NFS/RDMA
client on 2.6.16.6, I am able to pull about 450MB/sec of read
throughput at 35% total CPU.

This is using 16 threads of NFS direct i/o (O_DIRECT) to a midrange
NetApp server, I did achieve a similar result with the Linux NFS/RDMA
server (but only after hotwiring the ext2 interface because I don't
have the spindles). I am using a dedicated filesystem test to generate
the load, and also iozone. These NFS/RDMA direct reads use RDMA
writes from the server to the client.
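
The per-thread load generator is essentially just this (path and sizes are
illustrative; run one copy per thread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const size_t bs = 32 * 1024;    /* matches the 32KB NFS rsize */
        void *buf;
        int fd;

        if (posix_memalign(&buf, 4096, bs))     /* O_DIRECT needs alignment */
                return 1;
        fd = open("/mnt/nfsrdma/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0)
                return 1;
        while (read(fd, buf, bs) > 0)
                ;       /* stream sequential reads; RDMA lands in buf */
        close(fd);
        free(buf);
        return 0;
}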

Also, this was with client hyperthreading disabled and a dual-processor
Dell, I could reboot with a single CPU to get more comparable results.
But, the throughput was limited by server CPU (100%), the client was
actually loafing a little bit.

I thought it was interesting that a filesystem achieves the same
throughput at better overhead than a dedicated network test. :-) 
And I haven't played with interrupt affinity at all.

Tom.

At 07:23 PM 5/8/2006, Shirley Ma wrote:

I am testing most of my patches. Under 

1.Intel(R) Xeon(TM) CPU 2.80GHz, one cpu, 
2. fw-23108-3_4_000-MHXL-CF128-T.bin 
3. pci-x without msi_x enabled 
4. kernel 2.6.16 
5. netperf-2.4.0 
6. SVN 68XX+several IPoIB patches 

The best result I got so far: 

Testing with the following command line: 
netperf -l 60 -H 10.1.1.100 -t TCP_STREAM -i 10,2 -I 95,5 -- -m 16384 -s 
349520 -S 349520 

TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.1.100 
(10.1.1.100) port 0 AF_INET : +/-2.5% @ 95% conf. 
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

699040 699040  16384    60.00    3668.07   (458MB/s) 

cpu utilization was around 95%. 

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638




Hassan M. Jafri [EMAIL PROTECTED] 
Sent by: [EMAIL PROTECTED] 

05/08/2006 03:52 PM 
To
openib-general@openib.org 
cc
Subject
Re: [openib-general] ip over ib throughtput 




I can't crank out more than 150 MB/sec with my 2.0 GHz Xeons. Verbs-level 
benchmarks, however, give decent numbers for bandwidth. With netperf, the 
server side CPU usage is 99% which is much higher than other posted 
bandwidth results on this thread. Any suggestions?

Here is the complete configuration for my bandwidth tests

Kernel-2.6.15.4
netperf-2.3-3
OpenIB rev 6552
MTLP23108-CF128
Firmware 3.4.0
MSI-X is enabled for the HCA


--
Here is the netperf output



TCP STREAM TEST to 192.168.2.2
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send      Recv    Send      Recv
Size   Size    Size     Time     Throughput  local     remote  local     remote
bytes  bytes   bytes    secs.    MBytes/s    % T       % T     us/KB     us/KB

262142 262142  32768    10.01    151.32      59.66     99.84   7.700     12.886
---

Here is ib0 config for one of the nodes

ib0   Link encap:UNSPEC  HWaddr 
00-02-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
  inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
  inet6 addr: fe80::202:c902:0:3ce9/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
  RX packets:1724527 errors:0 dropped:0 overruns:0 frame:0
  TX packets:9685456 errors:0 dropped:2 overruns:0 carrier:0
  collisions:0 txqueuelen:128
  RX bytes:89830114 (85.6 MiB)  TX bytes:2213308646 (2.0 GiB)








Michael S. Tsirkin wrote:
 Hi!
 What kind of performance do people see with ip over ib on gen2?
 I see about 100Mbyte/sec at 99% CPU utilisation on send,
 on an express card, Xeon 2.8GHz, SSE doorbells enabled.
 
 MST


Re: [openib-general] ip over ib throughtput

2006-05-09 Thread Talpey, Thomas
At 05:47 PM 5/9/2006, Shirley Ma wrote:

Thanks for sharing these test results. 

The netperf/netserver IPoIB over UD mode test spent most of time on copying 
data from user to kernel + checksum(csum_partial_copy_generic), and it only 
can send no more than the mtu=2044 bytes of payload per ib_post_send(), which definitely 
limits its performance compared to RDMA read/write. I would expect NFS/RDMA 
throughput much better than IPoIB over UD. 


Actually, I got excellent results in regular cached mode too, which
results in one data copy from the file page cache to user space. (In
NFS O_DIRECT, the RDMA is targeted at the user pages, bypassing
the cache and yielding zero-copy zero-touch even though the I/O is
kernel mediated by the NFS stack.)

Throughput remains as high as in the direct case (because it's still
not CPU limited), and utilization rises to a number less than you might
expect - 65%. Specifically, the cached i/o test used 79us/32KB, and
the direct i/o used 56us/32KB.

Of course, the NFS/RDMA copies do not need to compute the checksum,
so they are more efficient than the socket atop IPoIB. But I am not
sure that the payload per WQE is important. We are nowhere near the
op rate of the adapter. I think the more important factor is the interrupt
rate. NFS/RDMA allows the client to take a single interrupt (the server
reply) after all RDMA has occurred. Also, the client uses unsignalled
completion on as many sends as possible. I believe I measured 0.7
interrupts per NFS op in my tests.
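
The poll-then-rearm pattern that keeps the interrupt count that low looks
roughly like the following; this is a generic sketch, not the actual
RPC/RDMA completion handler:

#include <rdma/ib_verbs.h>

static void drain_cq(struct ib_cq *cq)
{
        struct ib_wc wc;

        while (ib_poll_cq(cq, 1, &wc) > 0)
                ;       /* handle wc.wr_id / wc.status here */

        ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);

        /* re-poll after arming to close the race with new completions */
        while (ib_poll_cq(cq, 1, &wc) > 0)
                ;       /* handle late arrivals */
}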

Well, I have been very pleased with the results so far! We'll have more
detail as we go.

Tom. 



Re: [openib-general] Re: mthca FMR correctness (and memory windows)

2006-03-21 Thread Talpey, Thomas
At 03:41 AM 3/21/2006, Michael S. Tsirkin wrote:
Which applications do register/unregister for each I/O?

Storage!

Do you have a specific benchmark in mind?

Storage!

:-)

Tom.



Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-21 Thread Talpey, Thomas
At 10:14 PM 3/20/2006, Roland Dreier wrote:
Thomas Oh yeah, I have to guess the PD too.

You can't guess the PD.  You would have to trick the victim into
putting the remote QP into the same PD.  (PDs are not represented on
the wire at all)

Ok - uncle. Since we're implementing the Linux PD protection in
the OpenIB driver, it's moot to discuss what happens if it can be
bypassed. My point is merely that the scope of the rkey is quite
important, and must not be compromisable.

Tom.



RE: [openib-general] mthca FMR correctness (and memory windows)

2006-03-21 Thread Talpey, Thomas
At 01:01 AM 3/21/2006, Dror Goldenberg wrote:
Not sure we managed to convince you anything about FMRs. Anyway,

On the contrary, I feel I know them much better. :-) I'm certainly
more aware of the behavior of the fmr pool code, which is not
appropriate for storage ULPs, in my opinion.

I would suggest, even just for the sake of performance evaluation to
try the following FMR approaches:
0a - this is the allegedly fastest: using FMR to consolidate list of
pages
2a - with fmr map/unmap - same behavior as MWs (this is what you called
4)
3a - with fmr pool - will be like async unbind.

Yes, I am considering these. You're reversing my numbering, but I can
say that your 0a is definitely desirable, and 2a is what I'm attempting to
implement now.

I wouldn't be surprised if you end up finding 0a a win-win for both the
client
and the server. If you end up finding differently, then that may also be
interesting.
BTW, iSER only works this way, the RFC does not allow passing a 
chunk list as far as I know...

Yes, iSER follows the SCSI transfer mode which places a single segment
on the wire for each operation. RPC/RDMA was designed rather differently.
For one thing, NFS is not a block-oriented protocol. This means it is more
flexible w.r.t. data segmentation. Also, NFS has a much broader range of
message types, with metadata payload. These lead to requirements for a
more flexible wire structure.

I am hopeful that NFS/RDMA will lend itself well to cluster computing, due
to its good sharing semantics, transparent file API, and low overhead from
use of the RDMA fabric. The one thing I don't want to build in is some
kind of compromise on security or data integrity. No performance gain is
worth that.

Tom.



RE: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 12:25 PM 3/19/2006, Dror Goldenberg wrote:
When ib_unmap_fmr() is done, you can be sure that the old FMR is 
inaccessible. That's why this call blocks...

Okay, that's good.

But Tom, I think that you should be looking at rdma/ib_fmr_pool.h for a
better API
to use for FMRs. This way you can allocate a pool and remap FMRs each 
time you need one. You can look for examples in ulp/sdp directory.

Yeah, I noticed it but there is already a mechanism in the RPC/RDMA client
which supports memory windows, and it is easily adapted to include fmr's.
I might use the pool api later.

 Is there a plan to make fmr's compliant with verbs 1.2?

In the future... And it will probably be a different API, such an API
that can go through a WQE-CQE.

Yes, it certainly will be different. I would prefer the CQ completion style
to the blocking style of the current fmr's. It would allow for better overlap
of RPC processing.

Currently I must defer the deregistration until processing reaches user
context, and then the blocking operation costs a context switch. With
the memory window API, I can launch the deregistration early, and it's
often polled as complete by the time I'm ready to return from the RPC. 
So, I would prefer that fmr's used a similar process.
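
The deferral looks roughly like this: push the blocking unmap to a
workqueue so it never runs in bh context. The names are illustrative, and
the INIT_WORK() form shown is the current two-argument one:

#include <linux/slab.h>
#include <linux/workqueue.h>
#include <rdma/ib_verbs.h>

struct fmr_unmap_work {
        struct work_struct work;
        struct list_head   fmr_list;    /* fmrs borrowed for this RPC */
};

static void fmr_unmap_worker(struct work_struct *work)
{
        struct fmr_unmap_work *w =
                container_of(work, struct fmr_unmap_work, work);

        ib_unmap_fmr(&w->fmr_list);     /* may sleep: safe here */
        /* return the fmrs to the per-mount pool here, then free w */
        kfree(w);
}

/* called from the reply handler (bh context): defer, don't block */
static void schedule_fmr_unmap(struct fmr_unmap_work *w)
{
        INIT_WORK(&w->work, fmr_unmap_worker);
        schedule_work(&w->work);
}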

 Final question is memory windows. The code is there in the 
 NFS/RDMA client, but all it gets is ENOSYS from ib_bind_mw(). 
 (Yes, I know memory windows may not perform well on mthca. 
 However, they're correct and hopefully faster than ib_reg_phys_mr()).

FMR's the fastest. MWs are supported by the mthca HW. To my 
knowledge there was no demand for MWs so far and that's why the
code to handle them hasn't been implemented in mthca.

I want to quantify "fastest" before I agree with you. But I don't doubt
that they will perform better than memory windows, whose performance
on mthca hardware is disappointing (apparently) due to fencing and DMA
flushes. I could not achieve more than 150MB/sec using windows, while I
reached full bus bandwidth with a single full-frontal rkey. I am hoping that
fmr's come in closer to the latter.

I will not agree with your statement that nobody wants memory windows.
User space applications that don't wish to expose large amounts of memory
will certainly want them. Kernel space has the advantage here, by being
able to use fmr. User space can't do that.

Tom.



Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 02:58 PM 3/20/2006, Roland Dreier wrote:
If you want to invalidate each FMR individually, then there's not much
point in using FMRs at all.  Just register and unregister memory regions.

Really? The idea of FMR, I thought, was to preallocate the TPT entry
up front (alloc_fmr), and then populate it with the lightweight fmr api
(map_phys_fmr).

If I use ib_reg_phys_mr(), then I incur all the preallocation overhead
each time. Surely the benefit of an FMR isn't merely that the hard work
is deferred, opening a vulnerability while it's pending??

No chance.  You need to implement MW allocation in mthca.  It's not a
ton of work but it hasn't reached the top of anyone's list yet.

Rats! :-) Well, I'll maybe try to scope that, if you haven't already?

Tom.



RE: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 05:09 PM 3/20/2006, Dror Goldenberg wrote:
It's not exactly the same. The important difference is about
scatter/gather.
If you use dma_mr, then you have to send a chunk list from the client to
the server. Then, for each one of the chunks, the server has to post an
RDMA read or write WQE. Also, the typical message size on the wire
will be a page (I am assuming large IOs for the purpose of this
discussion).

Yes, of course that is a consideration. The RPC/RDMA protocol carries
many more chunks for NFS_READ and NFS_WRITE RPCs in this mode.
But, the performance is still excellent, because the server can stream
RDMA Writes and/or RDMA Reads to and from the chunklists in response.

Since NFS clients typically use 32KB or 64KB sizes, such chunklists are
typically 8 or 16 elements, for which the client offers large numbers of
rdma read responder resources. Along with large numbers of RPC/RDMA
operation credits. In a typical read or write burst, I have seen the
Linux client have 10 or 20 RPC operations outstanding, each with
8 or 16 RDMA operations and two sends for the request/response.
In full transactional workloads, I have seen over a hundred RPCs.
It's pretty impressive on an analyzer.

Alternatively, if you use FMR, you can take the list of pages, the IO is
comprised of, collapse them into a virtually contiguous memory region,
and use just one chunk for the IO.
This:
- Reduces the amount of WQEs that need to be posted per IO operation
   * lower CPU utilization
- Reduces the amount of messages on the wire and increases their sizes
   * better HCA performance

It's all relative! And most definitely not a zero-sum game. Another way
of looking at it:

If the only way to get fewer messages is to incur more client overhead,
it's (probably) a bad trade. Besides, we're nowhere near the op rate of
your HCA with most storage workloads. So it's an even better strategy
to just put the work on the wire asap. Then, the throughput simply
scales (rises) with demand.

This, by the way, is why the fencing behavior of memory windows is so
painful. I would much rather take an interrupt on bind completion than
fence the entire send queue. But there isn't a standard way to do that,
even in iWARP. Sigh.

Tom.



Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 06:00 PM 3/20/2006, Sean Hefty wrote:
Can you provide more details on this statement?  When are you fencing the send 
queue when using memory windows?

Infiniband 101, and VI before it. Memory windows fence later operations
on the send queue until the bind completes. It's a misguided attempt to
make upper layers' job easier because they can post a bind and then
immediately post a send carrying the rkey. In reality, it introduces bubbles
in the send pipeline and reduces op rates dramatically.

I argued against them in iWARP verbs, and lost. If Linux could introduce
a way to make the fencing behavior optional, I would lead the parade.
I fear most hardware is implemented otherwise.

Yes, I know about binding on a separate queue. That doesn't work,
because windows are semantically not fungible (for security reasons).

Tom.



Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
Ok, this is a longer answer.

At 06:08 PM 3/20/2006, Fabian Tillier wrote:
You pre-alloc the MPT entry, but not the MTT entries.  You then
populate the MTT by doing posted writes to the HCA memory (or host
memory for memfree HCAs).
...
I don't know if allocating MTT entries is really expensive.  What
costs is the fact that you need to do command interface transactions
to write the MTT entries, while FMRs support posted writes.

I don't know what MPTs and MTTs are (Mellanox implementation?) nor
do I know exactly what the overhead difference you refer to really is.
It's less about the overhead and more about the resource contention,
in my experience. 

That is, just like with alloc_fmr, you need to reserve and format an
MPT for regular memory registrations, which is a command interface
transaction.  For memory registration, one or more commands precede
this to write to the MTT. Thus, a memory registration is at a minimum
a 2 command interface transaction operation, potentially more
depending on the size of the registration.

Deregistration and freeing (not unmapping) an FMR should be
equivalent, I would think.

So, in the RPC/RDMA client, I do ib_alloc_fmr() a bunch of times way up
front, when setting up the connection. This provides the windows which
are then used to register chunks (RPC/RDMA segments).

As each RPC is placed on the wire, I borrow fmr's from the above list and
call ib_map_phys_fmr() to establish the mapping for each of its segments.
No allocation is performed on this hot path.

When the server replies, I call ib_unmap_fmr() to tear down the mappings.
No deallocation is performed, the fmrs are returned to a per-mount pool,
*after unmaping them*.

I just want the fastest possible map and unmap. I guess that means I
want fast MTT's.
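
Concretely, that flow maps onto the kernel FMR verbs roughly as follows;
the attribute values are illustrative, not the ones the client actually
uses:

#include <linux/list.h>
#include <rdma/ib_verbs.h>

/* connection setup: allocate the fmrs up front */
static struct ib_fmr *setup_one_fmr(struct ib_pd *pd)
{
        struct ib_fmr_attr attr = {
                .max_pages  = 16,               /* enough for a 64KB segment */
                .max_maps   = 32,
                .page_shift = PAGE_SHIFT,
        };

        return ib_alloc_fmr(pd, IB_ACCESS_LOCAL_WRITE |
                                IB_ACCESS_REMOTE_READ |
                                IB_ACCESS_REMOTE_WRITE, &attr);
}

/* per-RPC hot path: map the segment's pages, then advertise fmr->rkey */
static int map_segment(struct ib_fmr *fmr, u64 *dma_pages, int npages,
                       u64 iova)
{
        return ib_map_phys_fmr(fmr, dma_pages, npages, iova);
}

/* reply path: tear the mapping down before returning the fmr to the pool */
static int unmap_segment(struct ib_fmr *fmr)
{
        LIST_HEAD(l);

        list_add(&fmr->list, &l);
        return ib_unmap_fmr(&l);        /* blocks until the rkey is invalid */
}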

I'd spoken with Dror about changing the implementation of memory
registration to always use posted writes, and we'd come to the
conclusion that this would work, though doing so was not the intended
usage and thus not something that was garanteed to work going forward.
 One of Dror's main concerns was that a future change in firmware
could break this.

Such a change would allow memory registration to require only a single
command interface transaction (and thus only a single wait operation
while that command completes).  I'd think that was beneficial, but
haven't had a chance to poke around to quatify the gains.

Again, it's not registration, it's the map/unmap. Do you believe that would
be faster with this interface? I don't think it requires an API change outside
the mthca interface, btw.

I'd still be interested in seeing regular registration calls improved,
as it's clear that an application that is sensitive about its security
must either restrict itself to send/recv, buffer the data (data copy
overhead), or register/unregister for each I/O.

Trust me, storage is sensitive to its security (and its data integrity).

As to using FMRs to create virtually contiguous regions, the last data
I saw about this related to SRP (not on OpenIB), and resulted in a
gain of ~25% in throughput when using FMRs vs the full frontal DMA
MR.  So there is definitely something to be gained by creating
virtually contiguous regions, especially if you're doing a lot of RDMA
reads for which there's a fairly low limit to how many can be in
flight (4 comes to mind).

25% throughput over what workload? And I assume, this was with the
lazy deregistration method implemented with the current fmr pool?
What was your analysis of the reason for the improvement - if it was
merely reducing the op count on the wire, I think your issue lies elsewhere.

Also, see previous paragraph - if your SRP is fast but not safe, then only
fast but not safe applications will want to use it. Fibre channel adapters
do not introduce this vulnerability, but they go fast. I can show you NFS
running this fast too, by the way.

Tom.



Re: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 07:50 PM 3/20/2006, Roland Dreier wrote:
Thomas Yes, I know about binding on a separate queue. That
Thomas doesn't work, because windows are semantically not
Thomas fungible (for security reasons).

Can you elaborate on the issue of fungibility?  If one entity has two
QPs, one of which it's using for traffic and one of which it's using
for MW binds, I don't see any security issue (beyond the fact that
you've now given up ordering of operations between the QPs).

If I can snoop or guess rkeys (not a huge challenge with 32 bits), and
if I can use them on an arbitrary queuepair, then I can handily peek and
poke at memory that does not belong to me.

For this reason, iWARP requires its steering tags to be scoped to a single
connection. This leverages the IP security model and provides correctness.

It is true that IB implementations generally don't do this. They should.

Tom.



RE: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 08:42 PM 3/20/2006, Diego Crupnicoff wrote:
 If I can snoop or guess rkeys (not a huge challenge with 32 bits), and 
 if I can use them on an arbitrary queuepair, then I can handily peek and 
 poke at memory that does not belong to me. 

No. You can't get to the Window from an arbitrary QP. Only from those QPs that 
belong to the same PD. 


Oh yeah, I have to guess the PD too.

 For this reason, iWARP requires its steering tags to be scoped to a single 
 connection. This leverages the IP security model and provides correctness. 
 
 It is true that IB implementations generally don't do this. They should. 

IB allows the 2 flavors (PD bound Windows aka type 1, and QP bound Windows aka 
type 2). 

Does mthca? I thought it's all type 1.

Tom.



RE: [openib-general] mthca FMR correctness (and memory windows)

2006-03-20 Thread Talpey, Thomas
At 08:24 PM 3/20/2006, Doug O'Neil wrote:
From iWarp RDMA Verbs Section 5.2
...
Tom, I read the above as an STag that represents a MR can be used by any
QP with the same PD ID. STags that represent a MW must be used on the
same QP that created them.

The iWARP verbs were never made part of the RDDP specification,
nor would an API-based security model have passed muster in the
IETF.

Tom.



[openib-general] mthca FMR correctness (and memory windows)

2006-03-19 Thread Talpey, Thomas
I'm implementing FMR memory registration mode in the NFS/RDMA
client, and I've got it mostly working. However as I understand it,
mthca's existing fmr's do not guarantee that the r_key is completely
invalidated when the ib_unmap_fmr() returns. This makes using them
rather problematic, to say the least.

Now, I notice that ib_unmap_fmr() is a blocking operation (at least,
the kernel whines about semaphores being waited in interrupt context,
when I experimented with that).

Does this mean mthca's ib_unmap_fmr() is waiting for the invalidation
now, or plans to in the future?

Second comment is that the existing fmr api is (IMO) very inconsistent.
Why does ib_map_phys_fmr() take an array of u64 physaddrs and not
struct page *'s? And the unmap api mysteriously takes a struct list_head *,
not any object returned by ib_alloc_fmr() or ib_map_phys_fmr().

Is there a plan to make fmr's compliant with verbs 1.2?

Final question is memory windows. The code is there in the NFS/RDMA
client, but all it gets is ENOSYS from ib_bind_mw(). (Yes, I know memory
windows may not perform well on mthca. However, they're correct and
hopefully faster than ib_reg_phys_mr()).

What is the plan to implement mthca memory windows?

Thanks,
Tom.



Re: [openib-general] Re: Revenge of the sysfs maintainer! (was Re: [PATCH 8 of 20] ipath - sysfs support for core driver)

2006-03-10 Thread Talpey, Thomas
At 11:58 PM 3/9/2006, Bryan O'Sullivan wrote:
I'd like a mechanism that is (a) always there (b) easy for kernel to use
and (c) easy for userspace to use.  A sysfs file satisfies a, b, and c,
but I can't use it; a sysfs bin file satisfies all three (a bit worse on
b), but I can't use it; debugfs isn't there, so I can't use it.

That leaves me with few options, I think.  What do you suggest?  (Please
don't say netlink.)

mmap()?

Tom.



Re: [openib-general] [PATCH 0 of 20] [RFC] ipath driver - another round for review

2006-03-10 Thread Talpey, Thomas
At 07:35 PM 3/9/2006, Bryan O'Sullivan wrote:
  - We've added an ethernet emulation driver so that if you're not
using Infiniband support, you still have a high-performance net
device (lower latency and higher bandwidth than IPoIB) for IP
traffic.

This strikes me as very unwise. In addition to duplicating a standardized
IPoIB facility, is the emulation supported by any other implementation?
Who will be using this code *without* having enabled the current OpenIB
support? What standardization is planned for this new protocol?

Tom.



Re: [openib-general] [PATCH 0 of 20] [RFC] ipath driver - another round for review

2006-03-10 Thread Talpey, Thomas
At 10:59 AM 3/10/2006, Bryan O'Sullivan wrote:
On Fri, 2006-03-10 at 09:06 -0500, Talpey, Thomas wrote:

 This strikes me as very unwise. In addition to duplicating a standardized
 IPoIB facility, is the emulation supported by any other implementation?

No, it's specific to our hardware.  Its main purpose is to provide an IP
stack that works over the fabric when there are no IB drivers present,
so it's not duplicating IPoIB in any meaningful sense.

This is not sufficient justification to introduce an incompatible and redundant
Ethernet emulation layer into the core. Will it work in a system where IPoIB
is enabled? How do you handle IP addressing and discovery? Have you tested
it under all upper layers including IPv6? What apps do your users run?

 Who will be using this code *without* having enabled the current OpenIB
 support?

We already have a pile of customers using it.  It happens to have lower
latency and higher bandwidth than IPoIB, but I suspect that's in part
because we haven't had time to tune IPoIB yet.

You need to put your effort into supporting IPoIB. I would like to know
what tuning it means btw.

  What standardization is planned for this new protocol?

None at present.  It's there for people who want it, and people are
already using it.  For those who need something standards-based, there's
IPoIB.

That just doesn't cut it. Standard is better than Better.

This code is at the moment a proprietary extension, being proposed for
global inclusion.

At a minimum, you need to document its protocol, and quantify its
performance advantages. If so, perhaps it can be justified as an
experimental upper layer.

By the way, what's the name of this component?

Tom.



Re: [openib-general] [PATCH 0 of 20] [RFC] ipath driver - another round for review

2006-03-10 Thread Talpey, Thomas
Hrm. Not sure how the emulation isn't Infiniband-related.

But you see the problem, right? If integrated, this becomes
a  Linux-to-Linux protocol (only). And the first question it has
to answer is why isn't this just IPoIB? I haven't seen an
answer to that.

Tom.

At 01:12 PM 3/10/2006, Bryan O'Sullivan wrote:
On Fri, 2006-03-10 at 13:02 -0500, Talpey, Thomas wrote:

 Will it work in a system where IPoIB
 is enabled?

Yes.

  How do you handle IP addressing and discovery?

DHCP and static addressing both work as you'd expect.

 Have you tested it under all upper layers including IPv6?

Yes.

  What apps do your users run?

Whatever they want.  NFS, ssh, SMB, etc, etc.

 At a minimum, you need to document its protocol, and quantify its
 performance advantages. If so, perhaps it can be justified as an
 experimental upper layer.

It's not Infiniband-related at all, if that's what you're objecting to.

 By the way, what's the name of this component?

ipath_ether.

   b



Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters

2006-03-07 Thread Talpey, Thomas
At 02:11 AM 3/7/2006, Michael S. Tsirkin wrote:
 The default TX_RING_SIZE is 64, RX_RING_SIZE is 128 in IPoIB, which 

These parameters must be a power of 2, and at least 2, otherwise things
break. I'd suggest making these a log and multiplying the result by 2,
to exclude the possibility of user error.

Surely this isn't true of all hardware. If the underlying hardware requires a
power of 2, it should fix it, not make a requirement on the framework setting?

Tom.



[openib-general] Re: Re: [PATCH] TX/RX_RING_SIZE as loadable parameters

2006-03-07 Thread Talpey, Thomas
At 08:00 AM 3/7/2006, Michael S. Tsirkin wrote:
 Surely this isn't true of all hardware.

This is true for all hardware.
We have code in ipoib like
   tx_req = &priv->tx_ring[priv->tx_head & (IPOIB_TX_RING_SIZE - 1)];

So it only works for a tx ring size that is a power of 2.

Sorry but that sounds like a needless restriction. Some hardware
doesn't have ring buffers at all. Well, if ipoib does it this way
though, I guess that's what it is.


 If the underlying hardware requires a
 power of 2, it should fix it, not make a requirement on the 
framework setting?

Supporting arbitrary ring size will require integer division or
conditional code on the data path, I don't think it's worth it.

That's actually not what I suggested. I said the hardware driver
should change any unacceptable value to something that is.
Or, it can simply reject it.

Anyway, I definitely think it should be settable - but isn't code like
you quote going to result in changing *all* ipoib interfaces? This
kind of thing is usually a driver parameter, not an upper layer.
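
The driver-side fixup is only a few lines; something like this sketch,
where the requested value and the upper bound are made up:

static unsigned int fix_ring_size(unsigned int requested)
{
        unsigned int size = 2;          /* minimum the code above allows */

        while (size < requested && size < 4096)
                size <<= 1;             /* round up to a power of two */
        return size;
}

/* indexing then stays the simple mask used in the IPoIB code quoted above */
#define RING_IDX(i, size)       ((i) & ((size) - 1))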

Tom.



Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters

2006-03-07 Thread Talpey, Thomas
At 08:59 AM 3/7/2006, Michael S. Tsirkin wrote:
How about we take the ring buffer size from dev->tx_queue_len?

Round it up to the previous power of 2, for simplicity.
IPOIB_TX_RING_SIZE will then be just the default value, so it can
stay hardcoded. Make the RX queue say twice the number, and keep it per device.

This way I think the user can both view, and set, these values with the
standard ip command.

Makes sense to me! And it certainly meets the requirement of
per-interface tunables, using standard interfaces (ifconfig).

Remember though, tx_queue_len is only somewhat proportional
to the hardware tx queue. Technically, it's the send backlog
for use when the hardware queue is full. It's often smaller for
faster hardware.

OTOH, receive rings are usually *larger* for faster hardware.
Might be worth thinking these relationships through...

Tom.



Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters

2006-03-07 Thread Talpey, Thomas
At 01:20 PM 3/7/2006, Michael S. Tsirkin wrote:
Quoting r. Roland Dreier [EMAIL PROTECTED]:
 Subject: Re: [PATCH] TX/RX_RING_SIZE as loadable parameters
 
 Michael How about we take the ring buffer size from
 Michael dev->tx_queue_len?
 
 But dev->tx_queue_len is a different setting.  It's quite reasonable
 to have the tx_queue_len be set independently of the underlying
 pseudo-hardware work queues.

It kind of makes sense to have them related though, does it not?

Again - not necessarily! The tx_queue_len is a software backlog used
if the software overdrives the hardware on transmit. It avoids dropping
packets; any drops that do occur show up in the netstat per-interface
drops statistic. It only needs to be big enough.

In fact, a large device ring means you probably only need a small tx queue.
But tuning this stuff can be a black art. Generally there is no reason to
want a silly-large tx ring. Besides, IPoIB only ever has one message in
flight, right?

In an earlier message you mentioned scaling the rx ring to the tx ring.
I think you should think more about that. The rx ring (hardware) has to
be big enough to keep packets while they await hardware interrupt processing.
So it's dependent on the arrival rate and the interrupt latency (including
any interrupt coalescing), not the transmitter depth.

All these numbers should be settable per-interface, and should attempt
to adhere to the principle of least surprise - do what other drivers in
net/ do.

Tom.



Re: [openib-general] Re: Re: RFC: e2e credits

2006-03-07 Thread Talpey, Thomas
At 04:39 PM 3/7/2006, Michael S. Tsirkin wrote:
Anyway, since ULPs don't seem to need it, another approach would be an option to
disable these flow controls globally, and probably add a module option 
to enable
them back just in case.  That's much simpler, isn't it?

Thumbs up! If nothing uses them, why hang them around, enabled, just
to cause problems?

As an upper layer implementor, I sure don't want them in the way, nor do
I want to add special code to turn something off I wasn't even aware was
in the provider.

Tom.



Re: [openib-general] Re: [PATCH] TX/RX_RING_SIZE as loadable parameters

2006-03-07 Thread Talpey, Thomas
At 04:45 PM 3/7/2006, Roland Dreier wrote:
No, IPoIB can have arbitrarily many packets in flight.  It's just like
any other layer 2 net device in that respect.

I thought UD has only a single-packet window in the qp context.

There isn't much uniformity about this in drivers/net.

Unfortunately, making the ring sizes settable per-interface leads to a
lot of ugly option handling code.  Is it relly important, or can we
get by with one per-module setting?

Well, it's only important if the code that's there works well. I thought
Shirley said it doesn't. Has anyone instrumented it for overruns and
drops, and watched it under load? That would tell us what to tweak.

I dunno, it's probably deferrable (with Shirley's queue changes) as long
as there's a way to diagnose it later.

Constants are pretty much never correct in networking code. And module
parameters are darn close to being constants.

Tom.



Re: [openib-general] RFC: move SDP from AF_INET_SDP to IPPROTO_SDP

2006-03-06 Thread Talpey, Thomas
We're encountering a similar situation in NFS/RDMA protocol
naming. The existing NFS client and server understand just
IPPROTO_UDP and IPPROTO_TCP.

One comment though, IP protocols are just 8 bits, 0-255. No
need to go to 65K.

I agree with Bryan though that it's not ours to say yes, it's netdev.
You should maybe stress the getaddrinfo() point more strongly, since
sharing of naming interfaces is highly desirable. SDP is all about
code compatibility, after all.
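
The two styles under discussion, side by side; both constants below are
purely illustrative, since neither value is standardized:

#include <sys/socket.h>
#include <netinet/in.h>

#define AF_INET_SDP     27      /* assumed value: separate-family style */
#define IPPROTO_SDP     0x1f    /* assumed value: AF_INET sub-protocol style */

int main(void)
{
        /* today: a private address family, invisible to getaddrinfo() */
        int s1 = socket(AF_INET_SDP, SOCK_STREAM, 0);

        /* proposal: ordinary AF_INET sockets selecting SDP by protocol */
        int s2 = socket(AF_INET, SOCK_STREAM, IPPROTO_SDP);

        return (s1 < 0) && (s2 < 0);
}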

Tom.

At 01:15 PM 3/6/2006, Michael S. Tsirkin wrote:
Hi!
Would it make sense to move SDP from using a separate address family to
a separate protocol under AF_INET and AF_INET6?
Something like IPPROTO_SDP?

The main advantages are
- IPv6 support will come more naturally and without further extending
  to a yet another address family
- We could use a protocol number < 64K (e.g. 7) to avoid conflicting
  with any IP based protocol.
  There are many more free protocol numbers than free family numbers
  (which only go up to 32 in linux for now).
- I could reuse more code for creating connections from af_inet.c

I also have a hunch this might make getaddrinfo work better on sdp but I'm not
sure.

Comments? Are there disadvantages to this approach that someone can see?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] NFS/RDMA client *and server* release for Linux 2.6.15

2006-03-06 Thread Talpey, Thomas
Following up on the client release of Feb 8, we are releasing
a first-functional NFS/RDMA server for Linux 2.6.15, along with
client updates based on comments received.

These are both licensed under dual BSD/GPL2 terms, and available
at the project's Sourceforge site:

http://sourceforge.net/projects/nfs-rdma/

http://sourceforge.net/project/showfiles.php?group_id=###&package_id=###

Both client and server employ the native OpenIB verbs API for
RDMA, and work equally for Infiniband and iWARP.

The client and server implement the IETF draft protocol and fully
support direct (zero-copy, zero-touch) RDMA transfers at the RPC
layer. However, the write performance is not yet representative of
full RDMA operation, due to a bottleneck in the server's use of RDMA
Read, and at least one data copy in its handoff to the filesystem.
We will be rectifying the former, and investigating the latter.

Both the client and server have been tested with NFSv3 and pass
the Connectathon test suite.

Due to the additional components, the procedure for applying the
patches is substantially more involved, requiring several steps to
be followed in a particular sequence. Also, the server patches have
been separated into framework and RDMA sections, as already
been done for the client. The package README has details.

The RDMA support in this NFS server release was developed by
Tom Tucker of Open Grid Computing and we thank him for his efforts
on this.

At this time, the changes to the Linux NFS server svc framework are
in effect a first proposal for how RDMA support might be added to the
code. There are open issues in both how the module linkage should
be structured, and also how the linkage to existing code be done.

As before, we look forward to comments and feedback! Thanks for all
of it so far.

Tom Talpey, for the various NFS/RDMA projects.



Re: [openib-general] NFS/RDMA client *and server* release for Linux 2.6.15

2006-03-06 Thread Talpey, Thomas
Please ignore this draft message, which somehow escaped my
outbox! The code will be up shortly though.

Sorry for the interruption.

Tom.


At 01:50 PM 3/6/2006, Talpey, Thomas wrote:
Following up on the client release of Feb 8, we are releasing
a first-functional NFS/RDMA server for Linux 2.6.15, along with
client updates based on comments received.

These are both licensed under dual BSD/GPL2 terms, and available
at the project's Sourceforge site:

http://sourceforge.net/projects/nfs-rdma/

http://sourceforge.net/project/showfiles.php?group_id=###&package_id=###

Both client and server employ the native OpenIB verbs API for
RDMA, and work equally for Infiniband and iWARP.

The client and server implement the IETF draft protocol and fully
support direct (zero-copy, zero-touch) RDMA transfers at the RPC
layer. However, the write performance is not yet representative of
full RDMA operation, due to a bottleneck in the server's use of RDMA
Read, and at least one data copy in its handoff to the filesystem.
We will be rectifying the former, and investigating the latter.

Both the client and server have been tested with NFSv3 and pass
the Connectathon test suite.

Due to the additional components, the procedure for applying the
patches is substantially more involved, requiring several steps to
be followed in a particular sequence. Also, the server patches have
been separated into framework and RDMA sections, as already
been done for the client. The package README has details.

The RDMA support in this NFS server release was developed by
Tom Tucker of Open Grid Computing and we thank him for his efforts
on this.

At this time, the changes to the Linux NFS server svc framework are
in effect a first proposal for how RDMA support might be added to the
code. There are open issues in both how the module linkage should
be structured, and also how the linkage to existing code be done.

As before, we look forward to comments and feedback! Thanks for all
of it so far.

Tom Talpey, for the various NFS/RDMA projects.



[openib-general] NFS/RDMA client *and server* release for Linux 2.6.15

2006-03-06 Thread Talpey, Thomas
Following up on the client release of Feb 8, we are releasing
a first-functional NFS/RDMA server for Linux 2.6.15, along with
client updates based on comments received.

These are both licensed under dual BSD/GPL2 terms, and available
at the project's Sourceforge site:

http://sourceforge.net/projects/nfs-rdma/

http://sourceforge.net/project/showfiles.php?group_id=97628&package_id=182485&release_id=399220

Both client and server employ the native OpenIB verbs API for
RDMA, and work equally for Infiniband and iWARP.

The client and server implement the IETF draft protocol (*) and fully
support direct (zero-copy, zero-touch) RDMA transfers at the RPC
layer. However, the write performance is not yet representative of
full RDMA operation, due to a bottleneck in the server's use of RDMA
Read, and at least one data copy in its handoff to the filesystem.
We will be rectifying the former, and investigating the latter.

Both the client and server have been tested with NFSv3 and pass
the Connectathon test suite.

Due to the additional components, the procedure for applying the
patches is substantially more involved, requiring several steps to
be followed in a particular sequence. Also, the server patches have
been separated into framework and RDMA sections, as already
been done for the client. The package README has details.

The RDMA support in this NFS server release was developed by
Tom Tucker of Open Grid Computing and we thank him for his efforts
on this.

At this time, the changes to the Linux NFS server svc framework are
in effect a first proposal for how RDMA support might be added to the
code. There are open issues in both how the module linkage should
be structured, and also how the linkage to existing code be done.

As before, we look forward to comments and feedback! Thanks for all
of it so far.

Tom Talpey, for the various NFS/RDMA projects.

(*) Protocol docs under Internet-Drafts at bottom of page:
http://www.ietf.org/html.charters/nfsv4-charter.html



Re: [openib-general] NFSoRDMA

2006-03-01 Thread Talpey, Thomas
At 06:25 PM 3/1/2006, Brad Dameron wrote:
Anyone have the NFS over RDMA working? I have tried getting the CITI
patches to compile with no luck. I am using the Voltaire IB cards, which
appear to be Mellanox MT23108 cards. 

The CITI server will not compile and link on a generic OpenIB-enabled
kernel (it needs kDAPL). That code was a prototype and there is no
further work on it planned.

Instead we are developing a new server which uses native OpenIB and
will implement the full protocol including direct RDMA transfers and
multiple credits (two major elements not in the CITI code). It will be
released next week in first-functional form.

I assume you have seen the client announcement from a couple of
weeks back? Have you had any issues with that code?
http://openib.org/pipermail/openib-general/2006-February/016218.html

Anyway, watch here for the followup - It will have client changes based
on comments here and from others in the NFS community as well.

I'll mail here when we've assembled the patches (next week).

Tom.



Re: [openib-general] NFS/RDMA client release for Linux 2.6.15

2006-02-19 Thread Talpey, Thomas
Thanks for the detailed review! Some replies below. I left the IETF
list out of this reply since it's basically porting, not protocol.

At 07:01 AM 2/19/2006, Christoph Hellwig wrote:
On Wed, Feb 08, 2006 at 03:58:56PM -0500, Talpey, Thomas wrote:
 We have released an updated NFS/RDMA client for Linux at
 the project's Sourceforge site:

Thanks, this looks much better than the previous patch.

Comments:

  - please don't build the rdma transport unconditionally, but make it
a user-visible config option

It's an option, but it's located in fs/Kconfig not net/. This is the way
SUNRPC is selected, so we simply followed that. BTW, Chuck's transport
switch doesn't support dynamically loading modules yet so there is a
dependency to work out until that's in place.

  - please use the kernel u*/s* types instead of (u)int*_t

We use uint*_t for the user-visible protocol definitions (on the wire) and
u32 etc for kernel stuff. I'll recheck if we got something wrong.

  - please include your local headers after the linux/*.h headers,

There are a couple of issues with header include ordering that seem to
change pretty often. In a couple of cases we had to rearrange things
to avoid forward declarations, I'll recheck this.

and keep all the includes at the beginning of the files, just after
the licence comment block
  - chunktype shouldn't be a typedef but a pure enum, and the
names look a bit too generic, please add an rdma_ prefix

Ok on both.

  - please kill the XDR_TARGET and pos0 macros, maybe RPC_SEND_SEG0
and RPC_SEND_LEN0, too
  - RPC_SEND_VECS should become an inline functions and be spelled
lowercase
  - RPC_SEND_COPY is probably too large to be inlined and should be
spelled lowercase
  - RPC_RECV_VECS should be an inline and spelled lowercase
  - RPC_RECV_SEG0 and RPC_RECV_LEN0 should probably go away.
  - RPC_RECV_COPY is probably too large to be inlined and should be
spelled lowercase
  - RPC_RECV_COPY same comment about highmem and kmap as in
RPC_SEND_COPY

These are killable. They were there to support code sharing for 2.4 kernels
and are easy to eliminate now.

  - the CONFIG_HIGHMEM ifdef block in RPC_SEND_COPY is wrong.  Please
always use kmap, it does the right thing for non-highmem aswell.
The PageHighMem check and using kmap_high directly is always
wrong, they are internal implementation details.  I'd also suggest
evaluating kmap_atomic because it scales much better on SMP systems.

Yes, there are some issues here which we're still working out. In fact, we
can't use kunmap() in the context you mention because in 2.6.14 (or is it
.15) it started to check for being invoked in interrupt context. There is
one configuration in which we do call it in bh context. The call won't block
but the kernel BUG_ON's. This is something on our list to address.
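
For reference, the plain form being asked for is just the following;
kmap_atomic() is the more scalable variant, but its extra argument changed
across these kernel versions, so only the sleeping form is sketched:

#include <linux/highmem.h>
#include <linux/string.h>

/* copy len bytes out of a possibly-highmem page at offset off */
static void copy_from_page(void *dst, struct page *page, size_t off,
                           size_t len)
{
        void *va = kmap(page);  /* correct for lowmem pages as well */

        memcpy(dst, va + off, len);
        kunmap(page);           /* must not be called from bh/irq context */
}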

  - please try to avoid file-scope forward-prototypes but try to order the
code in the natural flow where they aren't required

Good point. Will recheck for these.

  - structures like rpcrdma_msg that are on the wire should use __be*
for endianness annotations, and the cpu_to_be*/be*_to_cpu accessor
functions instead of hton?/ntoh?.  Please verify that these annotations
are correct using sparse -D__CHECK_ENDIAN__=1

Hmm, okay but existing RPC and NFS code don't do this. I'm reluctant to
differ from the style of the governing subsystem. I'll check w/Trond.
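
For anyone following along, the annotated style looks like this; the
structure below is illustrative, not the actual rpcrdma_msg layout:

#include <linux/types.h>
#include <asm/byteorder.h>

struct example_wire_hdr {
        __be32 xid;             /* big-endian on the wire */
        __be32 vers;
        __be32 credits;
};

static void fill_hdr(struct example_wire_hdr *h, u32 xid, u32 credits)
{
        h->xid     = cpu_to_be32(xid);
        h->vers    = cpu_to_be32(1);
        h->credits = cpu_to_be32(credits);
}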

  - rdma_convert_physiov/rdma_convert_phys are completely broken.
page_to_phys can't be used by driver/fs code.  RDMA only deals with bus
addresses, not physical addresses.  You must use the dma mapping API
instead. Also coalescing decisions are made by the dma layer, because
they are platform dependent and much more complex then what the code
in this patch does.

Now that we are moving to OpenIB api's this is needed. There is some
thought necessary w.r.t. our max-performance mode of preregistering
memory in DMA mode. That's on our list of course.

  - transport.c is missing a GPL license statement

Oops.

  - in transport.c please don't use CamelCase variable names.

This is just for module parameters? These are going away but we don't have
the new NFS mount API yet. There is a comment to that effect but maybe
it doesn't mention the module stuff.

  - MODULE_PARM shouldn't be used in new code, but module_param instead.

Ditto.
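
For completeness, the module_param() style looks like this (parameter names
and defaults are invented for illustration, not the actual module options):

    #include <linux/module.h>
    #include <linux/moduleparam.h>

    /* Hypothetical parameters: lowercase names, declared with module_param()
     * rather than the obsolete MODULE_PARM() macro. */
    static int xprt_rdma_slot_table_entries = 32;
    module_param(xprt_rdma_slot_table_entries, int, 0444);
    MODULE_PARM_DESC(xprt_rdma_slot_table_entries,
                     "Number of concurrent RPC/RDMA requests (example)");

    static unsigned int xprt_rdma_inline_size = 1024;
    module_param(xprt_rdma_inline_size, uint, 0644);
    MODULE_PARM_DESC(xprt_rdma_inline_size,
                     "Max inline payload in bytes (example)");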

  - please don't use the (void) function() style, it just obfuscates the
code without benefit.

Ok.

  - try_module_get(THIS_MODULE) is always wrong.  Reference counting
should happen from the calling module.

This is the same convention used by the other RPC transports. I will pass
the comment along.

  - please initialize global or file-scope spinlocks with
DEFINE_SPINLOCK().

Ok.

  - the traditional name for the second argument to spin_lock_irqsave is
just flags, not lock_flags.  This doesn't really matter, but
following such conventions makes it easier to understand
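
A quick sketch covering both of those last points, with a made-up request
list just for illustration:

    #include <linux/spinlock.h>
    #include <linux/list.h>

    /* Hedged sketch: a statically initialized file-scope lock via
     * DEFINE_SPINLOCK(), and the conventional name 'flags' for the
     * irqsave cookie. */
    static DEFINE_SPINLOCK(rdma_req_lock);
    static LIST_HEAD(rdma_req_list);

    static void queue_req(struct list_head *entry)
    {
        unsigned long flags;

        spin_lock_irqsave(&rdma_req_lock, flags);
        list_add_tail(entry, &rdma_req_list);
        spin_unlock_irqrestore(&rdma_req_lock, flags);
    }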

Re: [openib-general] NFS/RDMA client release for Linux 2.6.15

2006-02-19 Thread Talpey, Thomas
At 05:28 PM 2/19/2006, Roland Dreier wrote:
Christoph - rdma_convert_physiov/rdma_convert_phys are completely
Christoph broken.  page_to_phys can't be used by driver/fs code.
Christoph RDMA only deals with bus addresses, not physical
Christoph addresses.  You must use the dma mapping API
Christoph instead. Also coalescing decisions are made by the dma
Christoph layer, because they are platform dependent and much
Christoph more complex than what the code in this patch does.

Thomas Now that we are moving to OpenIB api's this is
Thomas needed. There is some thought necessary w.r.t. our
Thomas max-performance mode of preregistering memory in DMA
Thomas mode. That's on our list of course.

Again let me echo Christoph's point.  If you are passing physical
addresses into IB functions, then your code simply won't work on some
architectures.  Making sure your code actually works on something like
a ppc64 box with an IOMMU would be a good test -- the low-end IBM
POWER machines are cheap enough that you could just buy one if you
don't have easy access.

Yep, I get it!

To elaborate a little, we're not exactly passing physical addresses. What
we're doing is using the physaddr to calculate an offset relative to a base
of zero. We register the zero address and advertise RDMA buffers via
offsets relative to that r_key.

And, this is only one of many memory registration modes. We would use
memory windows, if only OpenIB provided them (yes I know the hardware
currently sucks for them). We will add FMR support shortly. In both these
modes we perform all addressing by the book via 1-1 OpenIB registration.
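
For the curious, a hedged sketch of how that preregistered mode maps onto the
OpenIB verbs, expressed with ib_get_dma_mr() and DMA-API bus addresses rather
than raw physaddrs (names and access flags here are illustrative only):

    #include <linux/dma-mapping.h>
    #include <rdma/ib_verbs.h>

    /* An "all memory" MR whose keys cover any bus address returned by the
     * DMA API; a buffer is then advertised as (mr->rkey, dma_addr). */
    static struct ib_mr *get_whole_memory_mr(struct ib_pd *pd)
    {
        return ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE |
                                 IB_ACCESS_REMOTE_READ |
                                 IB_ACCESS_REMOTE_WRITE);
    }

    static void fill_sge(struct ib_sge *sge, struct ib_mr *dma_mr,
                         dma_addr_t dma_addr, u32 len)
    {
        sge->addr   = dma_addr;          /* bus address, not physaddr */
        sge->length = len;
        sge->lkey   = dma_mr->lkey;
    }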

Tom.



Re: [openib-general] NFS performance and general disk network export advice (Linux-Windows)

2006-02-09 Thread Talpey, Thomas
At 03:17 PM 2/9/2006, Paul Baxter wrote:
I'm looking to export a filesystem from each of four linux 64bit boxes to a 
single Windows server 2003 64bit Ed.

Has anyone achieved this already using an IB transport? Can I use NFS over 
IPoIB cross platform? i.e. do both ends support a solution?

Is NFS over RDMA compatible with Windows (pretty sure the answer is no to 
this one but love to be proven wrong). I've attached Tom's announcement of 
the latest to the bottom of this email. I don't think Windows has the RDMA 
abstraction (yet)?

Not the code I posted! :-) But sure, it's possible to implement NFS/RDMA
on Windows. Let us know when you're ready to test. ;-)

Are windows IB drivers (Openib or Mellanox) compatible with these options? 
Do I layer Windows services for Unix on top of the Windows IB drivers and 
IPoIB to achieve a cross platform NFS?

You could do this but your real challenge is the upper layer IFS interface.
You would need to implement a Windows filesystem for NFS first. Of course,
there are such beasts, Hummingbird's comes to mind.

The code I posted uses strictly the OpenIB RDMA interfaces, plus CMA for
address resolution and making connections. By the way, it will work over
iWARP too.

Has anyone done much in the way of NFS performance comparisons of NFS over 
IPoIB in cross-platform situations vs say Gigabit ethernet. Does it work :) 
What is large file throughput and processor loading - I'm aiming for 150-200 
MB/s on large files on 4x SDR IB (possibly DDR if we can fit the bigger 144 
port switch chassis into our rack layout for 50-ish nodes).

NFS over IPoIB does work, but is nowhere near as low-overhead as native
NFS over RDMA. There are several issues with an IPoIB implementation,
first of all the fact that an IPoIB solution is quite a bit less efficient than
a native 10GbE NIC:

- The UD connection typically has a single message in flight, which negates
much of the streaming throughput achievable with RC.
- The IPoIB layer is an emulation, and does not generally perform the hardware
checksumming and large segment offload that even 100Mb NICs provide.
- The network stack is still in the loop on both ends, adding computational
overhead and latency.
- The data must still be copied.

I have seen native zero-copy zero-touch NFS/RDMA streaming at full PCI/X
throughput using only about 20% of a dual-processor 2GHz Xeon. Typically,
most network stacks top out at 100% CPU at perhaps half this rate on similar
platforms. I'd expect IPoIB to be even less due to the reasons above.

Are there any alternatives to using NFS that may be better and that would 
'transparently' receive a performance boost with IB compared with using a 
simple NFS/gigabit ethernet solution. Must be fairly straightforward, 
ideally application neutral (configure a drive and load/unload script for 
Linux and it just happens) and compatible between Win2003 and Linux? 
Alternatives using perhaps Samba on the Linux side?

My lack of knowledge of IB in the windows world has got me concerned over 
whether this is actually achievable (easily).

I hope to be trying this once we get a Windows 2003 machine, but hope 
someone can encourage me that its a breeze prior to my coming unstuck in a 
month or so!

Some detail about the bit I do understand:

I will be using a patched Linux kernel (realtime preemption patches ) but 
prefer not to apply/track too many kernel patches as the kernel evolves. The 
NFS patches suggested by Tom in his announcement below make me a little 
nervous.

The most important patches for integrating the NFS/RDMA client are already
in the 2.6.15 kernel, but there is additional work which is still in progress.
These are the patches I refer to. One of the major ones is the ability to
dynamically load RPC transports, such as the NFS/RDMA module. So you
do need some sort of patch to use the client, currently.

The transport switch continues to evolve and become integrated into the
kernel, so the need for this particular patch will fall away eventually. FYI,
the transport switch is much more general than NFS/RDMA - it's the
underpinning of IPv6 support for the NFS client.

Your real issue in working with NFS/RDMA in the way you describe is the
availability of the server. The Linux NFS/RDMA server is still very much under
development, and will take time just to be ready for experimentation.
In particular, it will take time to get it to a state where it delivers the
performance you require.

Please feel free to contact me offline if you want to talk about details of
actually setting this up. With a stock 2.6.15.2 kernel and a couple of IB
cards you could get it going just to get started.

Tom.





The application will alternate between a real-time mode with (probably) no 
NFS (or similar network exporting of the disk) and an archiving mode where 
Linux will load relevant network filesystem modules and let the windows 
machine read the disks.

The reason for this odd load/unload behaviour is because our current 

Re: [openib-general] Re: [PATCH] CMA and iWARP

2006-01-24 Thread Talpey, Thomas
At 06:53 PM 1/23/2006, Roland Dreier wrote:
vetoed on netdev and b) trying to get openib and the kernel community
to accept code just so a vendor can meet a product marketing deadline.

BTW, upon reflection, the best idea for moving this forward might be
to push the Ammasso driver along with the rest of the iWARP patches,
so that there's some more context for review.  Just because a vendor
is out of business is no reason for Linux not to have a driver for a
piece of hardware.

In fact, there are a bunch of Ammasso cards out there, and also, what
better proof could you have that there isn't a hidden hardware agenda
in the submission!

Tom.



Re: [openib-general] iser/uverbs integration

2005-08-31 Thread Talpey, Thomas
At 08:06 AM 8/31/2005, Gleb Natapov wrote:
The question is what is the best way to proceed? Will the changes needed to
use a userspace QP from the kernel be accepted? How does NFS/RDMA work now?

To answer the second question, both client and server NFS/RDMA
create and connect all endpoints completely within the kernel.
This is also true of NFS/Sockets btw.

Tom.


RE: [openib-general] Re: RDMA Generic Connection Management

2005-08-31 Thread Talpey, Thomas
At 02:06 PM 8/31/2005, Yaron Haviv wrote:
Also note that with Virtual machines this type of event may be more
frequent and we may want to decouple the ULPs from the actual hardware

s/may want/definitely want/

Tom.


RE: [openib-general] RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
At 10:55 AM 8/30/2005, Yaron Haviv wrote:
The iSCSI discovery may return multiple src and dst IP addresses and the
iSCSI multipath implementation will open multiple connections.
There are many TCP/IP protocols that do that at the upper layers (e.g.
GridFTP, ..), not sure how NFS does it.

The answer to that question depends on the version of NFS, and
also the implementation.

For NFSv2/v3, the situation is ad hoc. Some clients support
multiple connections which they are able to round-robin. Solaris
does this for example. The problem is, to the server each NFSv2/v3
connection appears to be a different client. Therefore the
correctness guarantees (such as they are) go out the window.
For example, a retry on a different connection is not a retry at
all, it's a new op. So, the shotgun (trunked) NFSv3 situation is
useful only for a certain class of use.

For NFSv4, it's a little better in that there is a clientid which
identifies the source. However, NFSv4 does not sufficiently deal
with the case of requests on different connections either. 

With our new NFSv4 sessions proposal, planned to be part of
NFSv4.1 (http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-02.txt),
trunking is fully supported, by allowing requests to belong to a
higher-layer session regardless of what connection they arrive
on. This exists in prototype form, the NFSv4.1 spec is still being
pulled together. UMich/CITI is developing this btw. With a session,
the client gets full consistency guarantees and trunked connections
are therefore completely transparent.

One thing to stress is that the type of connection (TCP, UDP,
RDMA, etc) makes little or no difference in the trunking/multipathing
picture. In fact, with an NFSv4.1 session, a mix of such connections
is possible, and even a good idea. So it's more than a question of
what RDMA capabilities are there, it's really *all* connections.

To answer the question of how NFS finds out about multiple
connections and trunking, the answer is generally that the mount
command tells it. Mount can get this information from the command
line, or DNS. I believe Solaris uses the command line approach. There
may be a way to use the RPC portmapper for it, but the portmapper
isn't used by NFSv4.

Bottom line? NFS would love to have a way to learn multipathing
topology. But it needs to follow existing practice, such as having
an IP address / DNS expression. If the only way to find it is to query
fabric services, that's not very compelling.

Tom.


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
At 01:12 PM 8/30/2005, Roland Dreier wrote:
Steve I thought all ULPs needed to register as an IB client
Steve regardless?

Right now they do, because there's no other way to get a struct
ib_device pointer.  If we add a new API that returns a struct
ib_device pointer, then inevitably consumers will use it instead of
the current client API, and then hotplug will be hopelessly broken.

Are you telling us that RPC/RDMA (for example) has to handle hotplug
events just to use IB? Isn't that the job of a lower layer? NFS/Sockets
don't have to deal with these, f'rinstance.

Tom.


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
At 02:01 PM 8/30/2005, Roland Dreier wrote:
Thomas Are you telling us that RPC/RDMA (for example) has to
Thomas handle hotplug events just to use IB? Isn't that the job
Thomas of a lower layer? NFS/Sockets don't have to deal with
Thomas these, f'rinstance.

Yes, if you want to talk directly to the device then you have to make
sure that the device is still there to talk to.

Verbs don't do that?

Tom.


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
At 02:15 PM 8/30/2005, Roland Dreier wrote:
Thomas Verbs don't do that?

Not as they are currently defined.  And I don't think we want to add
reference counting (aka cache-line pingpong) into every verbs call
including the fast path to make sure that a device doesn't go away in
the middle of the call.

Well, you're saying somebody has to do it, right? Is it easier
to fob this off to upper layers that (frankly) don't care what
hardware they're talking to!? This means we have N copies
of this, and N ways to do it. Talk about cacheline pingpong.

Sorry but it suddenly sounds like we're all writing device
drivers, not developing upper layers. This is a mistake.

Tom.


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
kDAPL does this!

:-)


At 02:35 PM 8/30/2005, Roland Dreier wrote:
Thomas Well, you're saying somebody has to do it, right? Is it
Thomas easier to fob this off to upper layers that (frankly)
Thomas don't care what hardware they're talking to!? This means
Thomas we have N copies of this, and N ways to do it. Talk about
Thomas cacheline pingpong.

Upper layers have the luxury of being able to do this at a
per-connection level, can sleep, etc.  If we push it down into the
verbs, then we have to do it in every verbs call, including the fast
path verbs call.  And that means we get into all sorts of crazy code
to deal with a device disappearing between a consumer calling
ib_post_send() and the core code being entered, etc.

Right now we have a very simple set of rules:

  An upper level protocol consumer may begin using an IB device as
  soon as the add method of its struct ib_client is called for that
  device.  A consumer must finish all cleanup and free all resources
  relating to a device before returning from the remove method.

  A consumer is permitted to sleep in its add and remove methods.

 - R.


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
At 03:08 PM 8/30/2005, Roland Dreier wrote:
Thomas kDAPL does this!  :-)

Does what?  As far as I can tell kDAPL just ignores hotplug and
routing and hopes the problems go away ;)

I was referring to kDAPL's architecture, which does in fact address
hotplug with async evd upcalls. In the early days of the reference
port we implemented it on Solaris this way, for example.

Tom.


Re: [openib-general] Re: RDMA Generic Connection Management

2005-08-30 Thread Talpey, Thomas
At 04:10 PM 8/30/2005, Talpey, Thomas wrote:
At 03:08 PM 8/30/2005, Roland Dreier wrote:
Thomas kDAPL does this!  :-)

Does what?  As far as I can tell kDAPL just ignores hotplug and
routing and hopes the problems go away ;)

I was referring to kDAPL's architecture, which does in fact address
hotplug with async evd upcalls. In the early days of the reference
port we implemented it on Solaris this way, for example.

And I remember naming the upcall E_NIC_ON_FIRE.

There was another one after putting it out, of course. :-)

Tom.


Re: [openib-general] RDMA connection and address translation API

2005-08-25 Thread Talpey, Thomas
At 12:34 PM 8/25/2005, Roland Dreier wrote:
All implementation of NFS/RDMA on top of IB had better interoperate,
right?  Which means that someone has to specify which address
translation mechanism is the choice for NFS/RDMA.

Correct. At the moment the existing NFS/RDMA implementations
use ATS (Sun's and NetApp's).

NFS/RDMA is being defined on top of an abstract RDMA interface.
Someone has to write a spec for how that RDMA abstraction is
translated into packets on the wire for each transport that NFS/RDMA
will run on top of.

Well, we did. We specify the ULP payload of all the messages
in those two IETF documents. What we didn't do is define how
each transport handles IP addressing, that is a transport issue.

We don't need address translation over iWARP, since that uses
IP. Over IB, so far, we have used ATS. I am perfectly fine with
a better solution, but ATS has been fine too.

I am catching up to this discussion, so this is just one reply.

Tom.


RE: [openib-general] RDMA connection and address translation API

2005-08-25 Thread Talpey, Thomas
At 12:56 PM 8/25/2005, Caitlin Bestler wrote:
Generic code MUST support both IPv4 and IPv6 addresses.
I've even seen code that actually does this.

Let me jump ahead to the root question. How will the NFS layer know
what address to resolve?

On IB mounts, it will need to resolve a hostname or numeric string to
a GID, in order to provide the address to connect. On TCP/UDP, or
iWARP mounts, it must resolve to IP address. The mount command
has little or no context to perform these lookups, since it does not
know what interface will be used to form the connection.

In exports, the server must inspect the source network of each
incoming request, in order to match against /etc/exports. If there
are wildcards in the file, a GID-specific algorithm must be applied.
Historically, /etc/exports contains hostnames and IPv4 netmasks/
addresses.

In either case, I think it is a red herring to assume that the GID
is actually an IPv6 address. They are not assigned by the sysadmin,
they are not subnetted, and they are quite foreign to many users.
IPv6 support for Linux NFS isn't even submitted yet, btw.

With an IP address service, we don't have to change a line of 
NFS code.

Tom.



So supporting GIDs is not that much of an issue as long
as no IB network IDs are assigned with a meaning that
conflicts with any reachable IPv6 network ID. (In other
words, assign GIDs so that they are in fact valid IPv6
addresses. Something that was always planned to be one
option for GIDs).



 -Original Message-
 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED] On Behalf Of James Lentini
 Sent: Thursday, August 25, 2005 9:48 AM
 To: Tom Tucker
 Cc: openib-general@openib.org
 Subject: RE: [openib-general] RDMA connection and address 
 translation API
 
 
 
 On Wed, 24 Aug 2005, Tom Tucker wrote:
 
   
- It's not just preventing connections to the wrong 
 local address.
  NFS-RDMA wants the remote source address (ie 
 getpeername()) so that
  it can look it up in the exports list.
  
  Agreed. But you could also get rid of ATS by allowing GIDs to be 
  specified in the exports file and then treating them like
  IPv6 addresses for the purpose of subnet comparisons.
 
 Could generic code use both GIDs and IPv4 addresses? 
 
 



Re: [openib-general] [ANNOUNCE] Initial trunk checkin of ISERinitiator

2005-08-19 Thread Talpey, Thomas
At 07:41 PM 8/18/2005, Grant Grundler wrote:
If kDAPL for any reason doesn't get pushed upstream to kernel.org,
we effectively don't have iSER or NFS/RDMA in linux. 
Since I think without them, linux won't be competitive in the
commercial market place.

Put another way, OpenIB want storage to use it, and vice versa.

I can speak for NFS/RDMA. If NFS/RDMA doesn't have kDAPL,
then it gets thrown backwards due to having to reimplement.
That's recoverable (sigh) but there are still missing pieces.

By far the largest is the connection and addressing models.
There is, as yet, no unified means for an upper layer to connect
over any other transport in the OpenIB framework. In fact, there
isn't even a way to use IP addressing on the OpenIB framework
now, which is an even more fundamental issue.

So, yes, without kDAPL at the moment we don't have iSER or
NFS/RDMA. We can recode the message handling pieces to
OpenIB verbs. For NFS/RDMA, that's not even a ton of work.

Then we'll be forced to reimplement or reuse pretty much
all of the connect and listen code, and the IP address translation,
atop OpenIB.

How quickly can OpenIB move to a transport model that supports
these missing pieces? I can give a different answer with that
information.

Tom.


Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface

2005-08-12 Thread Talpey, Thomas
At 11:52 PM 8/11/2005, Tom Duffy wrote:

On Aug 11, 2005, at 2:38 PM, Hal Rosenstock wrote:
 Can anyone think of another approach to do this and keep backward
 compatibility ?

Do we need backward compatibility?  How about the stuff that includes  
if_packet.h gets rebuilt?  You are adding to the end of the struct,  
after all.

The size of the struct is less of an issue than the test for
ARPHRD_INFINIBAND. David said as much:

-- it won't work for anything else without adding
-- more special tests to that af_packet.c code

I have to say, SOCKADDR_COMPAT_LL is pretty stinky too.

Hal, why *are* you testing for ARPHRD_INFINIBAND anyway?
What different action happens in the transport-independent
code in this special case?

Tom.


Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface

2005-08-12 Thread Talpey, Thomas
At 09:01 AM 8/12/2005, Hal Rosenstock wrote:
It is done to preserve length checks that were already there (on struct
msghdr in packet_sendmsg and addr_len in packet_bind). I didn't want to
weaken that.

Are you sure things break if you simply build a message in user space
that's got the larger address (without changing the sockaddr_ll at all)?
It looks to me as if msg->msg_namelen/msg_name can be any appropriate
size which is at least as large as the sockaddr_ll.

Tom.


Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface

2005-08-12 Thread Talpey, Thomas
At 10:12 AM 8/12/2005, Hal Rosenstock wrote:
If sockaddr_ll struct is left alone, I think it may be a problem on the
receive side where the size of that struct is used.

Maybe. The receive side builds the incoming sockaddr_ll in the skb->cb.
But that's 48 bytes and it goes off to your device's hard_header_parse
to do so...

You sure you have hard_header_len and all the appropriate vectors
set up properly? (netdevice.h)

Tom.


Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface

2005-08-12 Thread Talpey, Thomas
At 12:21 PM 8/12/2005, Tom Duffy wrote:
Can we do an audit of what stuff will break with this change?  If it  
is a handful of applications that we all have the source to, maybe it  
won't be that big of a deal.

Right now it looks to me like the app is arping, and it can be fixed
by increasing the size of the storage it allocates in its data segment,
without changing the sockaddr_ll.

Maybe others, haven't bothered to look.

struct sockaddr_ll me -> union { struct sockaddr_ll xx; unsigned char yy[32]; } me;

Note: Hal's change requires arping to be recompiled too!
Can't stick 20 bytes into 8 there, either.
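
A hedged user-space sketch of that fix (names invented), which leaves struct
sockaddr_ll itself untouched but gives the application room for a 20-byte
IPoIB hardware address when the kernel fills the sockaddr in:

    #include <netpacket/packet.h>   /* struct sockaddr_ll */
    #include <sys/socket.h>
    #include <string.h>

    union big_sockaddr_ll {
        struct sockaddr_ll sll;
        unsigned char room[sizeof(struct sockaddr_ll) + 12];
    };

    static ssize_t recv_with_big_addr(int fd, void *buf, size_t len,
                                      union big_sockaddr_ll *from)
    {
        socklen_t alen = sizeof(*from);  /* larger than sizeof(sockaddr_ll) */

        memset(from, 0, sizeof(*from));
        return recvfrom(fd, buf, len, 0,
                        (struct sockaddr *)&from->sll, &alen);
    }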

Tom.


Re: [openib-general] Re: [PATCH] sockaddr_ll change for IPoIB interface

2005-08-12 Thread Talpey, Thomas
At 12:57 PM 8/12/2005, Hal Rosenstock wrote:
Using old arping on IPoIB will get the error on the sendto as the
hardware type is not available at bind time.

Okay, that's a feature then: instead of "Bus Error - core dumped"
when 20 bytes land on top of 8, they'll get a send failure. :-)

Tom.


Re: [openib-general] mapping between IP address and device name

2005-06-28 Thread Talpey, Thomas
At 05:34 PM 6/27/2005, Roland Dreier wrote:
I'm not sure I understand this.  At best, ATS can give you back a list
of IPs.  How do you decide which one to check against the exports?

Any or all of them. Exports is a fairly simple access list, and membership
by the client is all that's required. It supports wildcards as well as single
address entries.

Here's the example from the Linux manpage:
  # sample /etc/exports file
   /            master(rw) trusty(rw,no_root_squash)
   /projects    proj*.local.domain(rw)
   /usr         *.local.domain(ro) @trusted(rw)
   /home/joe    pc001(rw,all_squash,anonuid=150,anongid=100)
   /pub         (ro,insecure,all_squash)

See the wildcards? If any of the machine's IPs matches one, that line
yields true. Also of course, even the non-wildcards can expand to a list
of addresses; in the first line master is a single host, any of its IP
addresses is eligible for a match.
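
The matching rule is simple enough to sketch in a few lines (hypothetical
helper, user-space C, ignoring @netgroup entries):

    #include <stddef.h>
    #include <stdbool.h>
    #include <fnmatch.h>

    /* A client is allowed if any one of its names/addresses matches any
     * entry on the export line. */
    static bool client_matches_export(const char *const *client_names,
                                      size_t n_names,
                                      const char *const *export_entries,
                                      size_t n_entries)
    {
        for (size_t i = 0; i < n_names; i++)
            for (size_t j = 0; j < n_entries; j++)
                if (fnmatch(export_entries[j], client_names[i], 0) == 0)
                    return true;   /* e.g. "proj*.local.domain" */
        return false;
    }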

In a pure IP world, every packet from a multihomed client carries a
source IP address.  So a server can use getpeername() to determine
which address a client is connecting from.  This is fundamentally
different from ATS.

I don't understand. ATS allows each incoming connection to map to
one or more IP addresses, effectively supporting getpeername() on
the IB QP. DAPL passes this address up to the consumer in the
connection indication via the cr_param's ia_address_ptr. The consumer
doesn't invoke ATS directly, nor would it want to. In the NFS server
case, it just needs to run this address down the exports list, same
way it would for a TCP connection or UDP datagram.

Tom. 


Re: [openib-general] [PATCH][RFC] nfsordma: initial port of nfsrdma to 2.6 and james'sss kdapl

2005-06-28 Thread Talpey, Thomas
At 06:25 PM 6/27/2005, Tom Duffy wrote:
I have done some initial work to port nfsrdma to 2.6 and to James's
kDAPL.  This builds now inside the kernel.

Tom, thanks for starting this and I'll take a look at your approach.
In fact we already have a version working on 2.6.11, the main change
merging up to 2.6.12 is having to merge with Chuck's new RPC
transport switch.

We are definitely willing to GPL the code, I am in the process of
getting that approved and putting the result in the pipeline. I am
not sure about using the OpenIB repository just yet, because of
the dependency on the RPC transport. We don't want to inject
the changes in multiple places.

Let's caucus offline to figure out how to handle the repository
question. It shouldn't be a major issue once we figure out the
dependencies. I'll get back to you tomorrow (after taking a look
at these patches too).

Tom.



You will need to follow the kDAPL directions first to put that in your
kernel tree, then slap this patch over top of that.

So you have 2.6.12 + svn drivers/infiniband + kdapl from james's tree +
this patch to get it to build.  Oh you will also need to patch the rpc
header to get it to build.

I think it is time to open up a tree in openib repository.  Tom, is
netapp willing to GPL this code?
 


Re: [openib-general] [PATCH][RFC] nfsordma: initial port of nfsrdma to 2.6 and james'sss kdapl

2005-06-28 Thread Talpey, Thomas
At 11:13 AM 6/28/2005, Tom Duffy wrote:
I am sure I got the RPC stuff wrong.  I just wanted to make it compile
against James's kDAPL and inside the drivers/infiniband directory.

Where can I find the 2.6.11 version of the patch?

Here: http://troy.citi.umich.edu/~cel/linux-2.6/2.6.11/release-notes.html

This is kind of involved, you need to apply some prepatches as well
as postpatches. The 2.6.12 is significantly cleaner but it implements
a new API, due to Trond's requests. In other words, moving target
alert.

I have a working tarball against 2.6.11. The two files rdma_transport.c
and rdma_marshal.c plus what you're doing in rdma_kdapl.c might just
work. But it's not GPL yet. Can you wait a couple of days?

Tom.


RE: [openib-general] mapping between IP address and device name

2005-06-27 Thread Talpey, Thomas
At 03:10 AM 6/26/2005, Itamar Rabenstein wrote:
But the ATS will not solve the problem of many to one.
What will the NFS module do if the result from ATS is
a list of IPs, only one of which has permission to the NFS export?
ATS can't tell you which one is the source IP.

The NFS server exports will function just fine in such a case.
This is no different from any other multihomed client, and
/etc/exports can be configured appropriately.

What wouldn't be useful would be to use MAC addresses (GIDs)
for mounting, exports, etc. Can you imagine administering a network
where hardware addresses were the only naming? No sysadmin
would even entertain such an idea.

Tom.


Re: [openib-general] mapping between IP address and device name

2005-06-24 Thread Talpey, Thomas
At 01:31 PM 6/23/2005, Roland Dreier wrote:
James kDAPL uses this feature to provide the passive side of a
James connection with the IP address of the remote peer. kDAPL
James consumers can use this information as a weak authentication
James mechanism.

This seems so weak as to be not useful, and rather expensive to boot.
To implement this, a system receiving a connection request would have
to perform an SA query to map the remote LID back to a GuidInfo
record, and then for each GID attached to the remote LID, somehow
retrieve the set of IP addresses configured for that GID (assuming
that is somehow even possible).

Yes, it's weak, but it's needed. A good example is the NFS server's
exports function. For the last 20 or so years, NFS servers have a
table which assigns access rights to filesystems by IP address, for
example restricting access, making it r/o, etc to certain classes of
client. (man exports for gory detail).

The NFS daemons inspect the peer address of incoming connections
and requests to compare them against this list. When the endpoint is
a socket, they can simply use getpeername() and a DNS op. When it's
an IB endpoint (without IPoIB or SDP), what can they use?

The requirement is that there needs to be a way to track a connection
back to a traditional hostname and/or address. Today in the Linux
NFS/RDMA work we use ATS to provide the getpeername() function.

There are stronger authentication techniques NFS can use of course.
But the vast majority of NFS users don't bother and just stuff DNS names
into their exports. Replacing these with GIDs is not acceptable (just try
asking a sysadmin if he or she wants to put mac addresses in this file!).

Tom. 


Re: [openib-general] mapping between IP address and device name

2005-06-24 Thread Talpey, Thomas
At 12:19 PM 6/24/2005, Roland Dreier wrote:
It seems far preferable to me to just define the wire protocol of
NFS/RDMA for IB such that a client passes its IP address as part of
the connection request.  This scheme was used for SDP to avoid
precisely the complications that we're discussing now.

But that's totally and completely insecure. The goal of /etc/exports
is to place at least part of the client authentication in the network
rather than the supplied credentials. NFS has quite enough of a
history with AUTH_SYS to prove the issues there. Some of the
exports options (e.g. the *_squash ones) are specifically because
of this.

I don't care about ATS either, by the way. I'm looking for an
interoperable alternative.

Tom.


Re: [openib-general] mapping between IP address and device name

2005-06-24 Thread Talpey, Thomas
At 01:02 PM 6/24/2005, Jay Rosser wrote:
On the subject of NFS/RDMA, what is the IB ServiceID space that is used? 
If I recall correctly, I have seen simply the value 2049 (i.e. the 
standard TCP/UDP port number) used in some implementations (i.e. 00 00 
00 00 00 00 20 49). Is there a mapping onto an IB ServiceID defined?

We aren't currently using the portmapper to discover the serviceid that
the NFS/RDMA server is listening on. Brent Callaghan chose serviceid 2049
as a convenience in Sun's first implementation, and so far it has stuck.

Theoretically the server can listen on any endpoint it chooses, this is
how NFS/TCP and NFS/UDP work. But typically all servers use the well
known port. It's probably a good idea to define a better default mapping.

Tom.


Re: [openib-general] mapping between IP address and device name

2005-06-24 Thread Talpey, Thomas
At 12:42 PM 6/24/2005, Roland Dreier wrote:
Thomas But that's totally and completely insecure. The goal of
Thomas /etc/exports is to place at least part of the client
Thomas authentication in the network rather than the supplied
Thomas credentials. NFS has quite enough of a history with
Thomas AUTH_SYS to prove the issues there. Some of the exports
Thomas options (e.g. the *_squash ones) are specifically because
Thomas of this.

ATS is completely insecure too, right?  A client can create any old
service record in the subnet administrator's database and claim that
its GID has whatever IP address it wants.

As I said - I am not attached to ATS. I would welcome an alternative.

But in the absence of one, I like what we have. Also, I do not want
to saddle the NFS/RDMA transport with carrying an IP address purely
for the benefit of a missing transport facility. After all NFS/RDMA works
on iWARP too.

Tom.


Re: [openib-general] mapping between IP address and device name

2005-06-24 Thread Talpey, Thomas
At 01:30 PM 6/24/2005, Roland Dreier wrote:
Thomas But in the absence of one, I like what we have. Also, I do
Thomas not want to saddle the NFS/RDMA transport with carrying an
Thomas IP address purely for the benefit of a missing transport
Thomas facility. After all NFS/RDMA works on iWARP too.

I'm not sure I understand this objection.  We wouldn't be saddling the
transport with anything -- simply specifying in the binding of
NFS/RDMA to IB that certain information is carried in the private data
fields of the CM messages used to establish a connection.  Clearly
iWARP would use its own mechanism for providing the peer address.

This would be exactly analogous to the situation for SDP -- obviously
SDP running on iWARP does not use the IB CM to exchange IP address
information in the same way the SDP over IB does.

Oh - I thought you meant that NFS/RDMA should have a HELLO message
carrying an IP address, like SDP/IB.

That's a nonstarter for the reason I mentioned, plus the fact that it links
this state to the connection, which might break and require reconnect.
In fact, NFSv4 and our Sessions proposal addresses this, but it doesn't
help NFSv3, which is the predominant use today.

On the other hand, placing a mandatory content in the CM exchange
brings in a whole different raft of interoperability questions, as James
mentioned earlier. For better or for worse, the ATS approach is easily
administered and does not impact any protocol layers outside of its
own. I think of it as ARP for IB.

Tom.


Re: [openib-general] A new simple ulp (SPTS)

2005-06-23 Thread Talpey, Thomas
At 12:43 PM 6/22/2005, Jeff Carr wrote:
On 06/21/2005 12:50 PM, Roland Dreier wrote:

 What happens if you try replacing the send_flags line with the one you
 have commented out?
 
 +// send_wr.send_flags = IB_SEND_SIGNALED;

Thanks, you are correct. IB_SEND_SIGNALED gives me the behavior I was
expecting.

By the way, unsignaled sends can work very well indeed, but you have to be
sure to poll for completions at regular intervals. Basically, what you're doing
is ensuring that software (mthca) gets control from time to time, either via
an interrupt (signaled) or poll (unsignaled).

It's quite a challenge to get the polling right, but the reduction in interrupts
can be a win. The NFS/RDMA module does this, but it takes the approach
of occasionally posting a signaled send. The trick is getting the value of
occasionally right. :-)
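
A hedged sketch of that pattern (the interval and helper names are made up
tuning values, not what the module actually uses):

    #include <rdma/ib_verbs.h>

    #define SIGNAL_INTERVAL 32

    /* Post sends unsignaled, except every SIGNAL_INTERVAL-th one, so the
     * send queue can be reaped by polling. */
    static int post_send_occasionally_signaled(struct ib_qp *qp,
                                               struct ib_send_wr *wr,
                                               unsigned long *send_count)
    {
        struct ib_send_wr *bad_wr;

        wr->send_flags = 0;
        if (++(*send_count) % SIGNAL_INTERVAL == 0)
            wr->send_flags = IB_SEND_SIGNALED;

        return ib_post_send(qp, wr, &bad_wr);
    }

    static void reap_send_completions(struct ib_cq *cq)
    {
        struct ib_wc wc;

        /* Each signaled completion retires the unsignaled sends posted
         * before it on the same send queue. */
        while (ib_poll_cq(cq, 1, &wc) > 0)
            /* check wc.status, release send buffers, etc. */ ;
    }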

Tom.


Re: [openib-general] A new simple ulp (SPTS)

2005-06-23 Thread Talpey, Thomas
At 01:11 PM 6/23/2005, Jeff Carr wrote:
I didn't know there was a nfs/rmda module?

[EMAIL PROTECTED]:/test/gen2# find . |grep -i nfs
[EMAIL PROTECTED]:/test/gen2#

Brief intro to NFS/RDMA work:

Current client version is on Sourceforge, supporting various flavors of
2.4. I'm preparing a new release for 2.6.
http://sourceforge.net/projects/nfs-rdma

The client needs a patched version of Sunrpc in order to hook in as
an NFS transport. This patch is being used by the NFS/IPv6
project as well, and it's planned for future kernel.org integration.
http://troy.citi.umich.edu/~cel/linux-2.6/2.6.12/release-notes.html
(There are also patch sets for earlier Linux revs)

Server version for 2.6 is under development at UMich CITI, scroll
down to Documents and Code sections of this page:
http://www.citi.umich.edu/projects/rdma/

Finally, NetApp and Sun have demonstrated implementations,
Solaris 10 has support for it.

We look forward to running NFS/RDMA over OpenIB, when its kDAPL
is ready.

Tom.


[openib-general] NFS/RDMA/kDAPL

2005-06-23 Thread Talpey, Thomas
At 03:11 PM 6/23/2005, Tom Duffy wrote:
On Thu, 2005-06-23 at 13:54 -0400, Talpey, Thomas wrote:
 We look forward to running NFS/RDMA over OpenIB, when its kDAPL
 is ready.

Let's get NFSoRDMA going sooner rather than later on James's kDAPL.  I
think this will both be a good test case as well as a vehicle to
demonstrate the functionality of kDAPL.

What can I do to help?

Can the Linux OpenIB client connect to Solaris 10? If so, we might
consider using Sol10's NFS/RDMA server. If not, we'll have to use a
NetApp filer (which is fine by me but maybe hard for you), because
the CITI NFS/RDMA server is accepting connections but not yet
processing RPCs.

Tom.


Re: [openib-general] Re: [PATCH] kDAPL: remove dapl_os_assert()

2005-06-23 Thread Talpey, Thomas
At 05:09 PM 6/23/2005, Grant Grundler wrote:
On Thu, Jun 23, 2005 at 04:55:38PM -0400, James Lentini wrote:
 My argument in favor of retaining them is that dapl_evd_wc_to_event() 
 will crash if the cookie is NULL. A BUG_ON will detect this situation 
...
The tombstone from the data page fault panic should make this nearly
as obvious as the BUG_ON(). Yes, I agree a BUG_ON() is completely
obvious. But it's also burning CPU cycles for something that

Not to argue one way or the other, but if this cookie is NULL,
whose fault would that be? I think that should govern whether
it's common enough to warrant BUG_ON or rare enough to
warrant a straight crash. I would suggest BUG_ON only if it
were possible to trigger this from a loadable module, etc.

Tom.

