RE: [RFC NET 00/02]: Secondary unicast address support
[EMAIL PROTECTED] wrote: From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Thu, 21 Jun 2007 13:08:12 -0600 However this just seems to allow a card to decode multiple mac addresses, which in some oddball load-balancing configurations may actually be useful, but it seems fairly limited. Do you have a specific use case you envision for this multiple-MAC functionality? Virtualization. If you can't tell the ethernet card that more than one MAC address is for it, you have to turn the thing into promiscuous mode. Networking on virtualization is typically done by giving each guest a unique MAC address; the guests have a virtual network device that connects to the control node (or dom0 in Xen parlance) and/or other guests. The control node has a switch that routes the packets from the guests either to other guests or out the real ethernet interface. Each guest gets a unique MAC so that the switch can know which guest an incoming packet is for. The same software switch could also throw away the excess frames that promiscuous mode would have admitted. Unless misdirected frames were common, this would not seem to be a major CPU burden. Keep in mind that the only MAC addresses that would have been transmitted are the ones that the input filter would have listed. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
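The demultiplexing described above can be sketched in a few lines. This is an illustrative Python simulation, not kernel or hypervisor code; the class, guest names and frame representation are all hypothetical:

```python
# Minimal sketch of the software switch described above: frames are
# demultiplexed by destination MAC, and frames that only promiscuous
# mode would have admitted are thrown away.

class SoftSwitch:
    def __init__(self):
        self.ports = {}      # dest MAC -> guest name
        self.dropped = 0     # excess frames promiscuous mode would admit

    def register_guest(self, mac, guest):
        """Each guest gets a unique MAC; a multi-MAC NIC filter would list these."""
        self.ports[mac] = guest

    def deliver(self, dest_mac, payload):
        guest = self.ports.get(dest_mac)
        if guest is None:
            self.dropped += 1   # misdirected frame: drop in software
            return None
        return (guest, payload)
```

A NIC that can decode several unicast MACs would perform exactly the `ports` lookup in hardware, so only the misdirected frames ever reach the software drop path.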
RE: [openib-general] [PATCH 1/10] cxgb3 - main header files
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael S. Tsirkin Sent: Tuesday, January 09, 2007 5:57 AM To: Steve Wise Cc: netdev@vger.kernel.org; Roland Dreier; Divy Le Ray; linux-kernel@vger.kernel.org; openib-general Subject: Re: [openib-general] [PATCH 1/10] cxgb3 - main header files We also need to decide on the ib_req_notify_cq() issue. Let's clarify - do you oppose doing copy_from_user from a fixed address passed in during setup? If OK with you, this seems the best way as it is the least controversial and least disruptive one. To clarify my understanding of this issue: A device MAY implement ib_req_notify_cq by updating a location directly from user mode. Any of the techniques that apply to other user-allocated objects, such as the Send Queue, can be applied here. Even though the proposed changes would be about as low-impact and benign as possible, the fact that there are valid solutions without any API changes leans heavily towards using those solutions.
Re: Suppress / delay SYN-ACK
On 10/12/06, Rick Jones [EMAIL PROTECTED] wrote: Martin Schiller wrote: Hi! I'm searching for a solution to suppress / delay the SYN-ACK packet of a listening server (-application) until he has decided (e.g. analysed the requesting ip-address or checked if the corresponding other end of a connection is available) if he wants to accept the connect request of the client. If not, it should be possible to reject the connect request. How often do you expect the incoming call to be rejected? I suspect that would have a significant effect on whether the whole thing is worthwhile. rick jones More to the point, on what basis would the application be rejecting a connection request based solely on the SYN? There are only two pieces of information available: the remote IP address and port, and the total number of pending requests. The latter is already addressed through the backlog size, and netfilter rules can already be used to reject based on IP address. That would seem to limit the usefulness to scenarios where a given remote IP address *might* be accepted based on total traffic load, number of other connections from the same IP address, etc. If *all* requests from that IP address are going to be rejected, why not use netfilter?
RE: [RFC] network namespaces
[EMAIL PROTECTED] wrote: Finally, as I understand it, both network isolation and network virtualization (both level2 and level3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense? If there are no compelling arguments for using both ways of doing it, it is silly to merge both, as it is more maintenance overhead. My reading is that full virtualization (Xen, etc.) calls for implementing L2 switching between the partitions and the physical NIC(s). The tradeoffs between L2 and L3 switching are indeed complex, but there are two implications of doing L2 switching between partitions: 1) Do we really want to ask device drivers to support L2 switching for partitions and something *different* for containers? 2) Do we really want any single packet to traverse an L2 switch (for the partition-style virtualization layer) and then an L3 switch (for the container-style layer)? The full virtualization solution calls for virtual NICs with distinct MAC addresses. Is there any reason why this same solution cannot work for containers (just create more than one VNIC for the partition, and then assign each VNIC to a container)?
RE: RDMA will be reverted
[EMAIL PROTECTED] wrote: From: Steve Wise [EMAIL PROTECTED] Date: Wed, 05 Jul 2006 12:50:34 -0500 However, iWARP devices _could_ integrate with netfilter. For most devices the connection request event (SYN) gets passed up to the host driver. So the driver can enforce filter rules then. This doesn't work. In order to handle things like NAT and connection tracking properly you must even allow ESTABLISHED state packets to pass through netfilter. Netfilter can have rules such as "NAT port 200 to 300, leave the other fields alone" and your suggested scheme cannot handle this. This is totally irrelevant. But it does work. First, an RDMA connection once established associates a TCP connection *as identified external to the box* with an RDMA endpoint (conventionally a QP). Performing a NAT translation on a TCP packet would certainly be within the capabilities of an RNIC, but it would be pointless. The relabeled TCP segment would be associated with the same QP. Once an RDMA connection is established, the individual TCP segments are only of interest to the RDMA endpoint. Payload is delivered through the RDMA interface (the same one already used for InfiniBand). The purpose of integration with netfilter would be to ensure that no RDMA connection could exist, or persist, if netfilter would not allow the TCP connection to be created. That is not a matter of packet filtering, it is a matter of administrative consistency. If someone uses netfilter to block connections from a given IP netmask then they reasonably expect that there will be no connections with any host within that IP netmask. They do not expect exceptions for RDMA, iSCSI or InfiniBand. The existing connection management interfaces in openfabrics, designed to support both InfiniBand and iWARP, could naturally be extended to validate all RDMA connections using an IP address with netfilter. This would be of real value. 
The only real value of a rule such as "NAT port 200 to 300" is to allow a remote peer to establish a connection to port 200 with a local listener using port 300. That *can* be supported without actually manipulating the header in each TCP packet. It is also possible to discuss other netfilter functionality that serves a valid end-user purpose, such as counting packets.
RE: RDMA will be reverted
[EMAIL PROTECTED] wrote: From: Tom Tucker [EMAIL PROTECTED] Date: Wed, 05 Jul 2006 12:09:42 -0500 A TOE net stack is closed source firmware. Linux engineers have no way to fix security issues that arise. As a result, only non-TOE users will receive security updates, leaving random windows of vulnerability for each TOE NIC's users. - A Linux security update may or may not be relevant to a vendor's implementation. - If a vendor's implementation has a security issue then the customer must rely on the vendor to fix it. This is no less true for iWARP than for any adapter. This isn't how things actually work. Users have a computer, and they can rightly expect the community to help them solve problems that occur in the upstream kernel. When a bug is found and the person is using NIC X, we don't necessarily forward the bug report to the vendor of NIC X. Instead we try to fix the bug. Many chip drivers are maintained by people who do not work for the company that makes the chip, and this works just fine. If only the chip vendor can fix a security problem, this makes Linux less agile. Every aspect of a problem on a Linux system that cannot be fixed entirely by the community is a net negative for Linux. - iWARP needs to do protocol processing in order to validate and evaluate TCP payload in advance of direct data placement. This requirement is independent of CPU speed. Yet, RDMA itself is just an optimization meant to deal with limitations of cpu and memory speed. You can rephrase the situation in whatever way suits your argument, but it does not make the core issue go away :) RDMA is a protocol that allows the application to more precisely state the actual ordering requirements. It improves the end-to-end interactions and has value over a protocol with only byte or message stream semantics regardless of local interface efficiencies. 
See http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt In any event, isn't the value of an RDMA interface to applications already settled? The question is how best to integrate the usage of IP addresses with the kernel. The inability to perform the low-level packet validation in open source code is a limitation of *all* RDMA solutions; the transport layer of InfiniBand is just as offloaded as it is for iWARP. The patches proposed are intended to support integrated connection management for RDMA connections using IP addresses, no matter what the underlying transport is. The only difference is that *all* iWARP connections use IP addresses.
RE: Netchannels: first stage has been completed. Further ideas.
[EMAIL PROTECTED] wrote: Evgeniy Polyakov wrote: On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote: Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...) I do not think it is possible without additional logic like what is implemented in softirqs, i.e. per-cpu queues of data, which in turn will be converted into skbs one-by-one. Couldn't you have two NICs being handled by two separate CPUs, with both CPUs trying to write to the same socket queue? The receive path works with RCU locking from what I understand, so a protocol's receive function must be re-entrant. Wouldn't it be easier simply not to have two NICs feed the same ring? What packets end up in which ring is fully controllable. On the rare occasion that a single connection must be fed by two NICs, a software merge of the two rings would be far cheaper than having to co-ordinate between producers all the time.
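The single-consumer merge suggested above can be sketched as follows. This is an illustrative Python simulation, not kernel code; each deque stands in for a per-NIC ring and integer sequence numbers stand in for TCP ordering:

```python
from collections import deque

# Rather than two NIC receive paths contending as producers on one socket
# queue, each NIC fills its own ring and a single consumer merges them,
# preserving per-ring order without any producer-side coordination.

def merge_rings(ring_a, ring_b):
    """Single-consumer merge of two per-NIC rings by sequence number."""
    merged = []
    while ring_a or ring_b:
        if not ring_b or (ring_a and ring_a[0] <= ring_b[0]):
            merged.append(ring_a.popleft())
        else:
            merged.append(ring_b.popleft())
    return merged
```

The merge only runs in the rare multi-path case; in the common case each connection's packets land in exactly one ring and the consumer reads it directly.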
RE: RDMA will be reverted
Andi Kleen wrote: We're focusing on netfilter here. Is breaking netfilter really the only issue with this stuff? Another concern is that it will just not be able to keep up with a high rate of new connections or a high number of them (because the hardware has too limited state) Neither iWARP nor an iSCSI initiator will require extremely high rates of connection establishment. An RNIC only establishes connections when its services have been explicitly requested (via use of a specific service). In any event, the key question here is whether integration with the netdevice improves things or whether the offload device should be totally transparent to the kernel. If the offload device somehow insisted on handling connection requests that the kernel would have been able to handle then this would be an issue. But the kernel is not currently handling RDMA connect requests on its own, and I know of no-one who has suggested that a software-only implementation of RDMA is feasible at 10Gbit. Netfilter integration is definitely something that needs to be addressed, but the L2/L3 integrations need to be in place first. And then there are the other issues I listed like subtle TCP bugs (TSO is already a nightmare in this area and it's still not quite right) etc. Making an RNIC fully transparent to the kernel would require it to handle many L2 and L3 issues in parallel with the host stack. That increases the chance of a bug, or at least a subtle difference between the host and the RNIC which, while being compliant, would be unexpected. The purpose of the proposed patches is to enable the RNIC to be in full compliance with the host stack on IP layer issues. It would need someone who can describe how this new RDMA device avoids all the problems, but so far its advocates don't seem to be interested in doing that and I cannot contribute more. RDMA services are already defined for the kernel. 
The connection management and network notifier patches are about enabling RDMA devices to use IP addresses in a way that is consistent. Obviously doing so is more important for an iWARP device than for an InfiniBand device, but InfiniBand users have also expressed a desire to use IP addressing. Applications do not use RDMA by accident; it is a major design decision. Once an application uses RDMA it is no longer a direct consumer of the transport layer protocol. Indeed, one of the main objectives of the OpenFabrics stack is to enable typical applications to be written that will work over RDMA without caring what the underlying transport is. The options for control will still be there, but just as a sockets programmer does not typically care whether their IP is carried over SLIP, PPP, Ethernet or ATM, most RDMA developers should not have to worry about iWARP or InfiniBand. http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt provides an overview of how RDMA benefits applications, and when applications would benefit from its use as compared to plain TCP.
RE: TOE, etc.
Herbert Xu wrote: Yes, however I think the same argument could be applied to TOE. With their RDMA NIC, we'll have TCP/SCTP connections that bypass netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack while at the same time it is using the same IP address as us and deciding what packets we will or won't see. The whole point of the patches that opengrid has proposed is to allow control of these issues to remain with the kernel. That is where the ownership of the IP address logically resides, and system administrators will expect to be able to use one set of tools to control what is done with a given IP address. The bypassing is already going on with iSCSI devices and with InfiniBand devices that use IP addresses. An RDMA/IP device just makes it harder to ignore this problem, but the problem was already there. SDP over IB is presented to Linux users essentially as a TOE service. Connections are made with IP and socket semantics, and yet there is no co-ordination on routes/netfilter/etc. I'll state right up front that I think stateful offload, when co-ordinated with the OS, is better than stateless offload -- especially at 10G speeds. But for plain TCP connections there are stateless offloads available. As a product architect I am already seeking as many ways as possible to support stateless offload as efficiently as possible, to keep that option viable for Linux users at as high a rate as possible. That is why we are very interested in exploring a hardware-friendly definition of vj_netchannels. But with RDMA things are different. There is no such thing as stateless RDMA. It is not RDMA over TCP that requires stateful offload, it is RDMA itself. RDMA over InfiniBand is just as much of a stateful offload as RDMA over TCP. It is possible to build RDMA over TCP as a service that merely uses memory mapping services in a mysterious way but is not integrated with the network stack at all. That is essentially how RDMA over IB is currently working. 
But I believe that integrating control over the IP address, and the associated netfilter/routing/arp/pmtu/etc issues, is the correct path. This logic should not be duplicated, and its control must not be split.
RE: TOE, etc.
[EMAIL PROTECTED] wrote: From: Steve Wise [EMAIL PROTECTED] Date: Wed, 28 Jun 2006 09:54:57 -0500 Doesn't iSCSI have this same issue? Software iSCSI implementations don't have the issue because they go through the stack using normal sockets and normal device send and receive. But hardware iSCSI implementations, which already exist, do not work through normal sockets.
RE: TOE, etc.
Jeff Garzik wrote: Caitlin Bestler wrote: Jeff Garzik wrote: Caitlin Bestler wrote: But hardware iSCSI implementations, which already exist, do not work through normal sockets. No, they work through normal SCSI stack... Correct. But they then interface to the network using none of the network stack. The normal SCSI stack does not control that in any way. Correct. And the network stack is completely unaware of whatever IP addresses, ARP tables, routing tables, etc. it is using. NFS over RDMA is part of the file system. That doesn't change the fact that its use of IP addresses needs to be co-ordinated with the network stack, and indeed address-based authentication *assumes* that this is the case. (And yes, there are preferable means of authentication, but authenticating based on IP address is already supported.) Sounds quite broken to me. But back on the main point, if implementing SCSI services over a TCP connection is acceptable even though it does not use a kernel socket, why would it not be acceptable to implement RDMA services over a TCP connection without using a kernel socket? Because SCSI doesn't force nasty hooks into the net stack to allow for sharing of resources with a proprietary black box of unknown quality. Jeff RDMA can also solve all of these problems on its own. Complete with giving the network administrator *no* conventional controls over the IP address being used for RDMA services. That means no standard ability to monitor connections, no standard ability to control which connections are made with whom. Is that better? You seem to be practically demanding that RDMA build an entire parallel stack. Worse, that *each* RDMA vendor build an entire parallel stack. Open source being what it is, that is not terribly difficult. But exactly how does this benefit Linux users? The proposed subscriptions are not about sharing *resources*, they share *information* with device drivers. 
The quality of each RDMA device driver will be just as well known as that of a SCSI driver, an InfiniBand HCA driver, a graphics driver or a plain Ethernet driver.
RE: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism
[EMAIL PROTECTED] wrote: From: Steve Wise [EMAIL PROTECTED] Date: Tue, 27 Jun 2006 10:02:19 -0500 For the RDMA kernel subsystem, however, we still need a specific event. We need both the old and new dst_entry struct ptrs to figure out which active connections were using the old dst_entry and should be updated to use the new dst_entry. This change isn't truly atomic from a kernel standpoint either. The new dst won't be selected by the socket until later, when the socket tries to send something, notices the old dst is obsolete, and looks up a new one. Your code could do the same thing. The request to send something is posted directly from user mode to a mapped memory ring that is reaped by the hardware. Having the hardware fault, report that fault, and wait for the host to update it with the new mapping is somewhat clumsy. It also won't work at all for existing hardware. The best you could do is to have the driver invalidate the old entry, then *presume* that the hardware will want the replacement, look that up, and then forward that answer to the hardware.
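The driver-side scheme just described (find the connections using the old dst entry, repoint them, and push the replacement down to the hardware) might look schematically like this. An illustrative Python simulation with hypothetical names, not the actual RDMA driver API:

```python
# Sketch: on a routing change the driver receives both the old and new
# dst entries, updates every affected connection, and forwards the new
# mapping to the (here simulated) hardware.

class RdmaDriver:
    def __init__(self):
        self.conn_dst = {}   # connection id -> dst entry
        self.pushed = []     # (conn_id, new_dst) updates sent to hardware

    def add_connection(self, conn_id, dst):
        self.conn_dst[conn_id] = dst

    def on_dst_change(self, old_dst, new_dst):
        """Needs both ptrs: only connections on old_dst are touched."""
        for conn_id, dst in self.conn_dst.items():
            if dst == old_dst:
                self.conn_dst[conn_id] = new_dst
                self.pushed.append((conn_id, new_dst))
```

This is why the event must carry both the old and the new entry: with only the new one, the driver cannot tell which of its active connections are affected.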
RE: [PATCH 0/2][RFC] Network Event Notifier Mechanism
[EMAIL PROTECTED] wrote: On Thu, 2006-22-06 at 15:40 -0500, Steve Wise wrote: On Thu, 2006-06-22 at 15:43 -0400, jamal wrote: No - what these 2 gents are saying was these events and infrastructure already exist. Notification of the exact events needed does not exist today. Ok, so you can't even make use of anything that already exists? Or is a subset of what you need already there? The key events, again, are: - the neighbour entry mac address has changed. - the next hop ip address (ie the neighbour) for a given dst_entry has changed. I don't see a difference for the above two from an L2 perspective. Are you keeping track of IP addresses? You didn't answer my question in the previous email as to what RDMA needs to keep track of in hardware. The RDMA device is handling L4 or L5 connections that have L3 addresses (IP). Subscribing to the information allows the device to keep its behaviour consistent with the host stack. The common alternative before proposing this integration was to have the RDMA device sniff all incoming packets and attempt to do parallel processing on a large set of lower-layer protocols (ICMP, ARP, routing, ...). Or by simply trusting that the IB network administrator has faithfully replicated all IP-relevant instructions in two forums (traditional IP network administration and IB network administration). These subscriptions are an attempt to cede full control of these issues back to one place, the kernel, and to guarantee that an offload device can never think that the route to X is Y when the kernel says it is Z. Or that it has a different PMTU, etc. I don't have any strong opinion on the best mechanism for implementing these subscriptions, but having correct, consistent networking behaviour depend on a user-mode relay strikes me as odd.
RE: [PATCH 0/2][RFC] Network Event Notifier Mechanism
[EMAIL PROTECTED] wrote: On Thu, 2006-22-06 at 15:58 -0500, Steve Wise wrote: On Thu, 2006-06-22 at 16:36 -0400, jamal wrote: I created a new notifier block in my patch for these network events. I guess I thought I was using the existing infrastructure to provide this notification service. (I thought my patch was lovely :) But I didn't integrate with netlink for user space notification. Mainly cuz I didn't think these events should be propagated up to users unless there was a need. I think they will be useful in user space. Typically you only propagate them if there's a user space program subscribed to listening (there are hooks which will tell you if there's anyone listening). The netdevice events tend to be a lot more usable in a few other blocks because they are lower in the hierarchy (i.e. routing depends on ip addresses which depend on netdevices) within the kernel, unlike in this case where you are the only consumer; so it does sound logical to me to do it in user space; however, not totally unreasonable to do it in the kernel. These services are relevant to any RDMA connection. The user-space consumer of RDMA services is no more interested in tracking the routing of the remote IP address than the consumer of socket services is. Another issue I see with netlink is that the event notifications aren't reliable. Especially the CONFIG_ARPD stuff, because it allocs an sk_buff with ATOMIC. A lost neighbour macaddr change is perhaps fatal for an RDMA connection... This would happen in the cases where you are short on memory; I would suspect you will need to allocate memory in your driver as well to update something in the hardware - so same problem. You can however work around issues like these in netlink. A direct notification call to the driver makes the driver responsible for providing whatever buffering it requires to save the information. And if there is insufficient memory available, at least the driver is aware of the failure. 
Allowing a third component to fail to relay information means that the driver can no longer be responsible for maintaining its own consistency with kernel routing, ARP and neighbour tables. Maintaining that consistency is a matter of correct network behaviour, not of doing status reports. Obviously we cannot have hardware looking at and interpreting these tables directly. So a *reliable* subscription would seem to be the only option. If the only subscribers who require reliable notifications are kernel drivers, does it really make sense to make those changes in code that also supports user space? I am still unclear: You have the destination IP address, the dstMAC of the nexthop to get the packet to this IP address and I suspect some srcMAC address you will use sending out, as well as the pathMTU to get there, correct? Because of the IP address it sounds to me like you are populating an L3 table. How is this info used in hardware? Can you explain how an arriving packet would be used by the RDMA in conjunction with this info once it is in the hardware? Some packets are associated with established RDMA (or iSCSI) connections, and are processed on the RDMA (or iSCSI) device. The device will also pass other packets through to the host stack for processing (non-matched Ethernet frames for IP networks, and IPoIB tunneled frames for IB networks). The device provides L5 services (RDMA and/or iSCSI) in addition to L2 services (as an Ethernet device). The rest of the network rightfully demands that the left hand know what the right hand is doing. So information that is provided to a host, ARP/ICMP, should affect the behaviour of *all* connections from that host. Do you agree that having the device subscribe to the kernel-maintained tables is a better solution than having it attempt to guess the correct values in parallel? 
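The direct, reliable notification argued for above, as opposed to a lossy netlink relay, can be sketched schematically. An illustrative Python simulation with hypothetical names; the point is only that a synchronous callback makes any failure visible to the subscribing driver rather than silently dropping the event:

```python
# Sketch of an in-kernel-style notifier chain: subscribers are invoked
# synchronously, and a subscriber that cannot record the event returns
# failure, so a miss is never invisible.

class NotifierChain:
    def __init__(self):
        self.callbacks = []

    def register(self, cb):
        self.callbacks.append(cb)

    def notify(self, event):
        # Each subscriber either accepts the event or reports failure.
        return [cb(event) for cb in self.callbacks]

class DriverSubscriber:
    """A driver buffering events for its hardware, with finite capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def __call__(self, event):
        if len(self.events) >= self.capacity:
            return False   # the driver itself knows it missed this one
        self.events.append(event)
        return True
```

With a user-mode or netlink relay in the middle, the `False` above would never reach the driver; that is the consistency problem the thread is describing.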
RE: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver.
[EMAIL PROTECTED] wrote: On Thu, 2006-06-15 at 08:41 -0500, Steve Wise wrote: On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote:

+void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) {
+ snip
+	case C2_RES_IND_EP: {
+
+		struct c2wr_ae_connection_request *req =
+			&wr->ae.ae_connection_request;
+		struct iw_cm_id *cm_id =
+			(struct iw_cm_id *)resource_user_context;
+
+		pr_debug("C2_RES_IND_EP event_id=%d\n", event_id);
+		if (event_id != CCAE_CONNECTION_REQUEST) {
+			pr_debug("%s: Invalid event_id: %d\n",
+				 __FUNCTION__, event_id);
+			break;
+		}
+		cm_event.event = IW_CM_EVENT_CONNECT_REQUEST;
+		cm_event.provider_data = (void *)(unsigned long)req->cr_handle;
+		cm_event.local_addr.sin_addr.s_addr = req->laddr;
+		cm_event.remote_addr.sin_addr.s_addr = req->raddr;
+		cm_event.local_addr.sin_port = req->lport;
+		cm_event.remote_addr.sin_port = req->rport;
+		cm_event.private_data_len =
+			be32_to_cpu(req->private_data_length);
+
+		if (cm_event.private_data_len) {

It looks to me as if pdata is leaking here, since it is not tracked and the upper layers do not free it. Also, if pdata is freed after the call to cm_id->event_handler returns, it exposes an issue in user space where the private data is garbage. I suspect the iWARP CM should be copying this data before it returns. Good catch. Yes, I think the IWCM should copy the private data in the upcall. If it does, then the amso driver doesn't need to kmalloc()/copy at all. It can pass a ptr to its MQ entry directly... Now that I've looked more into this, I'm not sure there's a simple way for the IWCM to copy the pdata on the upcall. Currently, the IWCM's event upcall, cm_event_handler(), simply queues the work for processing on a workqueue thread. So there's no per-event logic at all there. Lemme think on this more. Stay tuned. Either way, the amso driver has a memory leak... Having the IWCM copy the pdata during the upcall also leaves the greatest flexibility for the driver on how/where the pdata is captured. 
The IWCM has to deal with user-mode, indefinite delays waiting for a response and user-mode processes that die while holding a connection request. So it makes sense for that layer to do the allocating and copying.
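The copy-in-the-upcall idea can be sketched as follows. An illustrative Python simulation with hypothetical names, not the actual IWCM API; the mutable buffer stands in for the driver's MQ entry:

```python
# Sketch: the CM copies the private data during the event upcall, before
# returning to the driver, so the driver can pass a pointer straight into
# its message-queue entry and recycle that entry immediately afterwards.

class IWCM:
    def __init__(self):
        self.work_queue = []

    def cm_event_handler(self, event, pdata):
        # bytes(pdata) makes the copy now; no kmalloc()/copy in the driver.
        self.work_queue.append((event, bytes(pdata)))

cm = IWCM()
mq_entry = bytearray(b"private-data")   # driver's MQ entry (mutable)
cm.cm_event_handler("CONNECT_REQUEST", mq_entry)
mq_entry[:] = b"reused-entry"           # driver recycles the entry safely
```

Because the copy happens before `cm_event_handler` returns, recycling the MQ entry cannot corrupt the private data the work queue later hands to user space.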
RE: [openib-general] [PATCH v2 1/2] iWARP Connection Manager.
[EMAIL PROTECTED] wrote: On Tue, 2006-06-13 at 16:46 -0500, Steve Wise wrote: On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: Er...no. It will lose this event. Depending on the event...the carnage varies. We'll take a look at this. This behavior is consistent with the Infiniband CM (see drivers/infiniband/core/cm.c function cm_recv_handler()). But I think we should at least log an error because a lost event will usually stall the rdma connection. I believe that there's a difference here. For the Infiniband CM, an allocation error behaves the same as if the received MAD were lost or dropped. Since MADs are unreliable anyway, it's not so much that an IB CM event gets lost, as it doesn't ever occur. A remote CM should retry the send, which hopefully allows the connection to make forward progress. hmm. Ok. I see. I misunderstood the code in cm_recv_handler(). Tom and I have been talking about what we can do to not drop the event. Stay tuned. Here's a simple solution that solves the problem: For any given cm_id, there are a finite (and small) number of outstanding CM events that can be posted. So we just pre-allocate them when the cm_id is created and keep them on a free list hanging off of the cm_id struct. Then the event handler function will pull from this free list. The only case where there is any non-finite issue is on the passive listening cm_id. Each incoming connection request will consume a work struct. So based on client connects, we could run out of work structs. However, the CMA has the concept of a backlog, which is defined as the max number of pending unaccepted connection requests. So we allocate these work structs based on that number (or a computation based on that number), and if we run out, we simply drop the incoming connection request due to backlog overflow (I suggest we log the drop event too). When a MPA connection request is dropped, the (IETF conforming) MPA client will eventually time out the connection and the consumer can retry. 
Comments? If the IWCM cannot accept a Connection Request event from the driver then *someone* should generate a non-peer reject MPA Response frame. Since the IWCM does not have the resources to relay the event, it probably does not have the resources to generate the MPA Response frame either. So simply returning an "I'm busy" error and expecting the driver to handle it makes sense to me.
RE: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager.
There's a difference between trying to handle the user calling disconnect/destroy at the same time a call to accept/connect is active, versus the user calling disconnect/destroy after accept/connect have returned. In the latter case, I think you're fine. In the first case, this is allowing a user to call destroy at the same time that they're calling accept/connect. Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by accept/connect by the time disconnect/destroy tests it. The problem is that we can't synchronously cancel an outstanding connect request. Once we've asked the adapter to connect, we can't tell it to stop; we have to wait for it to fail. During the time period between when we ask to connect and the adapter says yea or nay, the user hits ctrl-C. This is the case where disconnect and/or destroy gets called and we have to block it waiting for the outstanding connect request to complete. One alternative to this approach is to do the kfree of the cm_id in the deref logic. This was the original design; it leaves the object around to handle the completion of the connect and still allows the app to clean up and go away without all this waitin' around. When the adapter finally finishes and releases its reference, the object is kfree'd. Hope this helps. Why couldn't you synchronously put the cm_id in a state of pending delete and do the actual delete when the RNIC provides a response to the request? There could even be an optional method to see if the device is capable of cancelling the request. I know it can't yank a SYN back from the wire, but it could refrain from retransmitting.
RE: [PATCH] TCP Veno module for kernel 2.6.16.13
[EMAIL PROTECTED] wrote: On Thu, 25 May 2006 13:23:50 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: #ZHOU BIN# [EMAIL PROTECTED] Date: Thu, 25 May 2006 16:30:48 +0800 Yes, I agree. Actually the main contribution of TCP Veno is not in this AI phase. No matter whether ABC is added or not, TCP Veno can always improve the performance over wireless networks, according to our tests. It seems to me that the wireless issue is separate from congestion control. The key is to identify true loss due to overflow of intermediate router queues, vs. false loss which is due to temporary radio signal interference. Is it really possible to tell the two apart? Telling loss due to true congestion apart from loss due to radio signal interference (true loss, but falsely inferred to be congestion) is actually very possible, though only at L2 and only if the hop experiencing problems is the first or last hop. There are numerous indicators that the link is experiencing link-related drops (FEC corrections, signal-to-noise ratio, etc.). The *desirability* of using this data is debatable, but it most certainly is possible.
RE: VJ Channel API - driver level (PATCH)
[EMAIL PROTECTED] wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David S. Miller Sent: Tuesday, May 02, 2006 11:48 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org Subject: Re: VJ Channel API - driver level (PATCH) I don't think we should be defining driver APIs when we haven't even figured out how the core of it would even work yet. A key part of this is the netfilter bits, which will require non-trivial flow identification; a hash will simply not be enough, and it will not be allowed to not support the netfilter bits properly since everyone will have netfilter enabled in one way or another. Hi Dave, Do you have suggestions on potential hardware assists/offloads for netfilter? I suppose some of it can be worthwhile, although in general it may be too complex to implement - especially above 1 Gig. I'd expect high end NIC ASICs to implement rx steering based upon some sort of hash (for load balancing), as well as explicit 1:1 steering between a sw channel and a hw channel. Both options for channel configuration are present in the driver interface. If netfilter assists can be done in hardware, I agree the driver interface will need to add support for these - otherwise, netfilter processing will stay above the driver. Even if the hardware cannot fully implement netfilter rules there is still value in having an interface that documents exactly how much filtering a given piece of hardware can do. There is no point in having the kernel repeat packet classifications that have already been done by the NIC.
RE: VJ Channel API - driver level (PATCH)
Are you proposing a mechanism for the consuming end of a tx channel to support a large number of channels, or are you assuming that the number of tx channels will be small enough that simply polling them in priority order is adequate?
RE: VJ Channel API - driver level (PATCH)
Evgeniy Polyakov wrote: On Wed, May 03, 2006 at 08:56:23AM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: I'd expect high end NIC ASICs to implement rx steering based upon some sort of hash (for load balancing), as well as explicit 1:1 steering between a sw channel and a hw channel. Both options for channel configuration are present in the driver interface. If netfilter assists can be done in hardware, I agree the driver interface will need to add support for these - otherwise, netfilter processing will stay above the driver. Even if the hardware cannot fully implement netfilter rules there is still value in having an interface that documents exactly how much filtering a given piece of hardware can do. There is no point in having the kernel repeat packet classifications that have already been done by the NIC. Please do not suppose that the vj channel must rely on underlying hardware. The new interface MUST work better, or at least not worse, than the existing skb queueing for the majority of users, and I doubt many users have netfilter-capable hardware. Hardware can only provide hints to the software, not rules. The best would be ipv4/ipv6 hashing, and I think that is enough. I agree. I was just stating that *if* there is direct hardware support then the software should be enabled to skip redundant checks. What I'm suggesting is really the equivalent of knowing whether the hardware generates or checks CRCs and TCP checksums. Don't mandate the feature, just have the option to avoid redundant work.
RE: VJ Channel API - driver level (PATCH)
David S. Miller wrote: From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Wed, 3 May 2006 22:07:40 +0400 On Wed, May 03, 2006 at 08:56:23AM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: I'd expect high end NIC ASICs to implement rx steering based upon some sort of hash (for load balancing), as well as explicit 1:1 steering between a sw channel and a hw channel. Both options for channel configuration are present in the driver interface. If netfilter assists can be done in hardware, I agree the driver interface will need to add support for these - otherwise, netfilter processing will stay above the driver. Even if the hardware cannot fully implement netfilter rules there is still value in having an interface that documents exactly how much filtering a given piece of hardware can do. There is no point in having the kernel repeat packet classifications that have already been done by the NIC. Please do not suppose that the vj channel must rely on underlying hardware. I am not. I am just saying that it is futile to build hardware that cannot handle netfilter at least to some extent, because this will result in HW net channels being disabled for 99% of real users, which makes the hardware just a waste. Or netfilters being disabled, which would be just as bad or worse. The kernel and hardware need to co-operate so that users are not asked to make artificial choices between performance and security.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Evgeniy Polyakov wrote: On Thu, Apr 27, 2006 at 02:12:09PM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: So the real issue is when there is an intelligent device that uses hardware packet classification to place the packet in the correct ring. We don't want to bypass packet filtering, but it would be terribly wasteful to reclassify the packet. Intelligent NICs will have packet classification capabilities to support RDMA and iSCSI. Those capabilities should be available to benefit SOCK_STREAM and SOCK_DGRAM users as well, without it being a choice of either turning all stack control over to the NIC or ignoring all NIC capabilities beyond pretending to be a dumb Ethernet NIC. Btw, how is it supposed to work without header split capable hardware? Hardware that can classify packets is obviously capable of doing header data separation, but that does not mean that it has to do so. If the host wants header data separation, its real value is that when packets arrive in order, fewer distinct copies are required to move the data to the user buffer (because separated data can be placed back-to-back in a data-only ring). But that's an optimization; it's not needed to make the idea worth doing, or even necessarily in the first implementation.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Evgeniy Polyakov wrote: On Fri, Apr 28, 2006 at 08:59:19AM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: Btw, how is it supposed to work without header split capable hardware? Hardware that can classify packets is obviously capable of doing header data separation, but that does not mean that it has to do so. If the host wants header data separation, its real value is that when packets arrive in order, fewer distinct copies are required to move the data to the user buffer (because separated data can be placed back-to-back in a data-only ring). But that's an optimization; it's not needed to make the idea worth doing, or even necessarily in the first implementation. If there is a dataflow, not a flow of packets or a flow of data with holes, it could be possible to modify recv() to just return the right pointer, so in theory userspace modifications would be minimal. With copying in place it does not differ at all from the current design with copy_to_user() being used, since memcpy() is just slightly faster than copy*user(). If the app is really ready to use a modified interface we might as well just give them a QP/CQ interface. But I suppose receive-by-pointer interfaces don't really stretch the sockets interface all that badly. The key is that you have to decide how the buffer is released: is it the next call? Or a separate call? Does releasing buffer N+2 release buffers N and N+1? What you want to avoid is having to keep a scoreboard of which buffers have been released. But in context, header/data separation would allow in-order packets to have the data placed back to back, which could allow a single recv to report the payload of multiple successive TCP segments. So the benefit of header/data separation remains the same, and I still say it's an optimization that should not be made a requirement. The benefits of vj_channels exist even without it. When the packet classifier runs on the host, header/data separation would not be free.
I want to enable hardware offloads, not make the kernel bend over backwards to emulate how hardware would work. I'm just hoping that we can agree to let hardware do its work without being forced to work the same way the kernel does (i.e., running down a long list of arbitrary packet filter rules on a per-packet basis).
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Evgeniy Polyakov wrote: I see your point, and respectfully disagree. The more complex a userspace interface we create, the fewer users it will have. It is completely inconvenient to read 100 bytes and receive only 80, since 20 were eaten by a header. And what if we need only 20, but the packet contains 100 - introduce a per-packet head pointer? For benchmarking purposes it works perfectly - read the whole packet, one can even touch that data to emulate real work - but for the real world it becomes practically unusable. In a straight-forward user-mode library using existing interfaces the message would be interleaved with the headers in the inbound ring. While the inbound ring is part of user memory, it is not what the user would process from; that would be the buffer they supplied in a call to read() or recvmsg(), and that buffer would have to make no allowances for interleaved headers. Enabling zero-copy when a buffer is pre-posted is possible, but modestly complex. Research on MPI and SDP has generally shown that unless the pinning overhead is eliminated somehow, the buffers have to be quite large before zero-copy reception becomes a benefit. vj_netchannels represent a strategy of minimizing registration/pinning costs even if it means paying for an extra copy. Because the extra copy is closely tied to the activation of the data-sink consumer, its cost is greatly reduced: it places the data in the cache immediately before the application will in fact use the received data. Also keep in mind that once the issues are resolved to allow the netchannel rings to be directly visible to a user-mode client, enhanced/specialized interfaces can easily be added in user-mode libraries. So focusing on supporting existing conventional interfaces is probably the best approach for the initial efforts.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
David S. Miller wrote: From: Rusty Russell [EMAIL PROTECTED] Date: Sat, 29 Apr 2006 08:04:04 +1000 You're still thinking you can bypass classifiers for established sockets, but I really don't think you can. I think the simplest solution is to effectively remove from (or flag) the established listening hashes anything which could be affected by classifiers, so those packets get sent through the default channel. OK, when rules are installed, the socket channel mappings are flushed. This is your idea, right? You mean when new rules are installed that would conflict with an existing mapping, right? Bumping every connection out of vj-channel mode whenever any new rule was installed would be very counter-productive. Ultimately, you only want a direct-to-user vj-channel when all packets assigned to it would be passed by netfilter, or at most increment a single packet counter. Checking a single QoS rate limiter may be possible too, but if there are more complex rules then the channel has to be kept in kernel because it wouldn't make sense to trust user-mode code to apply the netfilter rules reliably.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
[EMAIL PROTECTED] wrote: From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Thu, 27 Apr 2006 15:51:26 +0400 There are some caveats here found while developing the zero-copy sniffer [1]. The project's goal was to remap skbs into userspace in real time. While the absolute numbers (posted to netdev@) were really high, it is only applicable to read-only applications. As was shown in the IOAT thread, data must be warmed in caches, so reading from the mapped area will be as fast as memcpy() (read+write), and copy_to_user() is actually almost equal to memcpy() (benchmarks were posted to netdev@). And we must add the remapping overhead. Yes, all of these issues are related quite strongly. Thanks for making the connection explicit. But, the mapping overhead is zero for this net channel stuff, at least as it is implemented and designed by Kelly. The ring buffer is set up ahead of time in the user's address space, and a ring of buffers in that area is given to the networking card. We remember the translations here, so no get_user_pages() on each transfer and garbage like that. And yes, this all harks back to the issues that are discussed in Chapter 5 of Networking Algorithmics. But the core thing to understand is that by defining a new API and setting up the buffer pool ahead of time, we avoid all of the get_user_pages() overhead while retaining full kernel/user protection. Evgeniy, the difference between this and your work is that you did not have an intelligent piece of hardware that could be told to recognize flows, and only put packets for a specific flow into that flow's buffer pool. If we want to DMA data from the NIC into a premapped userspace area, this will run into message sizes, misalignment, slow reads and so on, so preallocation has even more problems. I do not really think this is an issue; we put the full packet into user space and teach it where the offset is to the actual data. We'll do the same things we do today to try and get the data area aligned.
The user can do whatever is logical and relevant on his end to deal with strange cases. In fact we can specify that the card has to take some care to get the data area of the packet aligned on, say, an 8-byte boundary or something like that. When we don't have hardware assist, we are going to be doing copies. This change also requires significant changes in applications, at least until recv/send are changed, which is not the best thing to do. This is exactly the point: we can only do a good job and receive zero copy if we can change the interfaces, and that's exactly what we're doing here. I do think that the significant win in VJ's tests belongs not to remapping and cache-oriented changes, but to moving all protocol processing into the process' context. I partly disagree. The biggest win is eliminating all of the control overhead (all of softint RX + protocol demux + IP route lookup + socket lookup is turned into a single flow demux), and the SMP-safe data structure which makes it realistic enough to always move the bulk of the packet work to the socket's home cpu. I do not think a userspace protocol implementation buys enough to justify it. We have to do the protection switch in and out of kernel space anyways, so why not still do the protected protocol processing work in the kernel? It is still being done on the user's behalf, contributes to his time slice, and avoids all of the terrible issues of userspace protocol implementations. So in my mind, the optimal situation from both a protection preservation and also a performance perspective is net channels to kernel socket protocol processing, buffers DMA'd directly into userspace if hardware assist is present. Having a ring that is already flow qualified is indeed the most important savings, and worth pursuing even while consensus on how to safely enable user-mode L4 processing is still pending. The latter *can* be a big advantage when the L4 processing can be done based on a user-mode call from an already scheduled process.
But the benefit is not there for a process that needs to be woken up each time it receives a short request. So the real issue is when there is an intelligent device that uses hardware packet classification to place the packet in the correct ring. We don't want to bypass packet filtering, but it would be terribly wasteful to reclassify the packet. Intelligent NICs will have packet classification capabilities to support RDMA and iSCSI. Those capabilities should be available to benefit SOCK_STREAM and SOCK_DGRAM users as well, without it being a choice of either turning all stack control over to the NIC or ignoring all NIC capabilities beyond pretending to be a dumb Ethernet NIC. For example, counting packets within an approved connection is a valid goal that the final solution should support. But would a simple count be sufficient, or do we truly need the full flexibility currently found in netfilter? Obviously all of this does not need to be resolved in the first implementation.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
[EMAIL PROTECTED] wrote: Ok I have comments already just glancing at the initial patch. With the 32-bit descriptors in the channel, you indeed end up with a fixed-size pool with a lot of hard-to-finesse sizing and lookup problems to solve. So what I wanted to do was finesse the entire issue by simply side-stepping it initially. Use a normal buffer with a tail descriptor; when you enqueue, you give a tail descriptor pointer. Yes, it's weirder to handle this in hardware, but it's not impossible, and using real pointers means two things: 1) You can design a simple netif_receive_skb() channel that works today; encapsulation of channel buffers into an SKB is like 15 lines of code and no funny lookups. 2) People can start porting the input path of drivers right now and retain full functionality and test anything they want. This is important for getting the drivers stable as fast as possible. And it also means we can tackle the buffer pool issue of the 32-bit descriptors later, if we actually want to do things that way, I think we probably don't. To be honest, I don't think using a 32-bit descriptor is so critical even from a hardware implementation perspective. Yes, on 64-bit you're dealing with a 64-bit quantity so the number of entries in the channel is halved from what a 32-bit arch uses. Yes I say this for 2 reasons: 1) We have no idea whether it's critical to have ~512 entries in the channel which is about what a u32 queue entry type affords you on x86 with 4096 byte page size. 2) Furthermore, it is sized by page size, and most 64-bit platforms use an 8K base page size anyways, so the number of queue entries ends up being the same. Yes, I know some 64-bit platforms use a 4K page size, please see #1 :-) I really dislike the pools of buffers, partly because they are fixed size (or dynamically sized and even more expensive to implement), but more so because there is all of this absolutely stupid state management you eat just to get at the real data.
That's pointless, we're trying to make this as light as possible. Just use real pointers and describe the packet with a tail descriptor. We can use a u64 or whatever in a hardware implementation. Next, you can't even begin to work on the protocol channels before you do one very important piece of work. Integration of all of the ipv4 and ipv6 protocol hash tables into central code is a total prerequisite. Then you modify things to use a generic inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a protocol number as well as saddr/daddr/sport/dport and searches from a central table. So I think I'll continue working on my implementation, it's more transitional and that's how we have to do this kind of work. - The major element I liked about Kelly's approach is that the ring is clearly designed to allow a NIC to place packets directly into a ring that is directly accessible by the user. Evolutionary steps are good, but isn't direct placement into a user-accessible simple ring buffer the ultimate justification of netchannels? But that doesn't mean that we have to have a very artificial definition of the ring based on presumptions that hardware only understands 512n sized buffers. Hardware today is typically just as smart as the processors that IP networks were first designed on, if not more so. Central integration will also need to be coordinated with packet filtering. In particular, once a flow has been assigned to a netchannel ring, who is responsible for doing the packet filtering? Or is it enough to check the packet filter when the net channel flow is created?
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
David S. Miller wrote: I personally think allowing sockets to trump firewall rules is an acceptable relaxation of the rules in order to simplify the implementation. I agree. I have never seen a set of netfilter rules that would block arbitrary packets *within* an established connection. Technically you can create such rules, but every single set of rules actually deployed that I have ever seen started with a rule to pass all packets for established connections, and then proceeded to control which connections could be initiated or accepted.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Jeff Garzik wrote: Caitlin Bestler wrote: David S. Miller wrote: I personally think allowing sockets to trump firewall rules is an acceptable relaxation of the rules in order to simplify the implementation. I agree. I have never seen a set of netfilter rules that would block arbitrary packets *within* an established connection. Technically you can create such rules, but every single set of rules actually deployed that I have ever seen started with a rule to pass all packets for established connections, and then proceeded to control which connections could be initiated or accepted. Oh, there are plenty of examples of filtering within an established connection: input rules. I've seen drop all packets from these IPs type rules frequently. Victims of DoS use those kinds of rules to stop packets as early as possible. Jeff If you are dropping all packets from IP X, then how was the connection established? Obviously we are only dealing with connections that were established before the rule to drop all packets from IP X was created. That calls for an ability to revoke the assignment of any flow to a vj_netchannel when a new rule is created that would filter any packet that would be classified by the flow. Basically the rule is that a delegation to a vj_netchannel is only allowed for flows where *all* packets assigned to that flow (input or output) would receive a 'pass' from netfilter. That makes sense. What I don't see a need for is examining *each* delegated packet against the entire set of existing rules. Basically, a flow should either be rule-compliant or not. If it is not, then the delegation of the flow should be abandoned. If that requires re-importing TCP state, then perhaps the TCP connection needs to be aborted. In any event, if netfilter is selectively rejecting packets in the middle of a connection then the connection is going to fail anyway.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
David S. Miller wrote: From: Jeff Garzik [EMAIL PROTECTED] Date: Wed, 26 Apr 2006 15:46:58 -0400 Oh, there are plenty of examples of filtering within an established connection: input rules. I've seen drop all packets from these IPs type rules frequently. Victims of DoS use those kinds of rules to stop packets as early as possible. Yes, good point, but this applies to listening connections. We'll need to figure out a way to deal with this. It occurs to me that for established connections, netfilter can simply remove all matching entries from the netchannel lookup tables. But that still leaves the thorny listening socket issue. This may by itself make netfilter netchannel support important, and that brings up a lot of issues about classifier algorithms. All of this I wanted to avoid as we start this work :-) We can think about how to approach these other problems and start with something simple meanwhile. That seems to me to be the best approach moving forward. It's important to start really simple, else we'll just keep getting bogged down in complexity and details and never implement anything. How does this sound? The netchannel qualifiers should only deal with TCP packets for established connections. Listens would continue to be dealt with by the existing stack logic, vj_channelizing only occurring when the connection was accepted. The vj_netchannel qualifiers would conceptually take place before the netfilter rules (to avoid making deployment of netchannels dependent on netfilter) but their creation would have to be approved by netfilter (if netfilter was active). Netfilter could also revoke vj_channel qualifiers. The rule that a vj_netchannel may only exist if netfilter approves of it is actually very easy to implement. During early development you simply tell the testers "hey, don't set up any netchannels that netfilter would reject" and defer implementing enforcement until after the netchannels code actually works.
After all, if it isn't actually successfully transmitting or receiving packets yet, it can't really be acting contrary to netfilter policy.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
[EMAIL PROTECTED] wrote: From: Caitlin Bestler [EMAIL PROTECTED] Date: Wed, 26 Apr 2006 15:53:44 -0700 The netchannel qualifiers should only deal with TCP packets for established connections. Listens would continue to be dealt with by the existing stack logic, vj_channelizing only occurring when the connection was accepted. I consider netchannel support for listening TCP sockets to be absolutely essential. - Meaning that inbound SYNs would be placed in a net channel for processing by a Consumer at the other end of the ring? If so, the rules filtering SYNs would have to be applied either before it went into the ring, or when the consumer end takes them out. The latter makes more sense to me, because the rules about what remote hosts can initiate a connection request to a given TCP port can be fairly complex for a variety of legitimate reasons. Would it be reasonable to state that a net channel carrying SYNs should not be set up when the consumer is a user-mode process?
RE: Van Jacobson's net channels and real-time
[EMAIL PROTECTED] wrote: Subject: Re: Van Jacobson's net channels and real-time On Mon, 24 Apr 2006, Auke Kok wrote: Ingo Oeser wrote: On Saturday, 22. April 2006 15:49, Jörn Engel wrote: That was another main point, yes. And the endpoints should be as little burden on the bottlenecks as possible. One bottleneck is the receive interrupt, which shouldn't wait for cachelines from other CPUs too much. That's right. This will be made a non-issue with early demuxing on the NIC and MSI (or was it MSI-X?), which will select the right CPU based on hardware channels. MSI-X. With MSI you still have only one CPU handling all MSI interrupts, and that doesn't look any different than ordinary interrupts. MSI-X will allow much better interrupt handling across several CPUs. Auke - Message signaled interrupts are just a kludge to save a trace on a PC board (read: make junk cheaper still). They are not faster and may even be slower. They will not be the salvation of any interrupt latency problems. The solution for increasing networking speed, where the bit-rate on the wire gets close to the bit-rate on the bus, is to put more and more of the networking code inside the network board. The CPU gets interrupted after most things (like network handshakes) are complete. The number of hardware interrupts supported is a bit out of scope. Whatever the capacity is, the key is to have as few meaningless interrupts as possible. In the context of netchannels this would mean that an interrupt should only be fired when there is a sufficient number of packets for the user-mode code to process. Fully offloading the protocol to the hardware is certainly one option, which I also think makes sense, but the goal of netchannels is to try to optimize performance while keeping TCP processing on the host. More hardware offload is distinctly possible and relevant in this context. Stateful offloads, such as TSO, are fully relevant.
Going directly from the NIC to the channel is also possible (after the channel is set up by the kernel, of course). If the NIC is aware of the channels directly then interrupts can be limited to packets that cross per-channel thresholds configured directly by the ring consumer.
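The per-channel threshold idea above amounts to simple interrupt mitigation: stay silent until a consumer-configured batch of frames has accumulated. The structure and names below are a hypothetical sketch, not any real NIC's register layout.

```c
/* Sketch of per-channel interrupt mitigation: the NIC raises an
 * interrupt only when the number of undelivered frames on a channel
 * crosses a threshold configured by the ring consumer.
 * Names and layout are hypothetical. */
#include <stdbool.h>

struct channel {
        unsigned int pending;     /* frames queued, not yet signalled */
        unsigned int threshold;   /* consumer-configured batch size */
};

/* Called per delivered frame; returns true when an interrupt
 * should fire, i.e. when the batch is full. */
static bool channel_enqueue(struct channel *ch)
{
        if (++ch->pending < ch->threshold)
                return false;     /* batch not yet full: stay silent */
        ch->pending = 0;          /* signal and start the next batch */
        return true;
}
```

A real design would also need a timeout so a partial batch is never stranded, but the core accounting is this small.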
RE: [PATCH 2/6] IB: match connection requests based on private data
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sean Hefty Sent: Monday, March 06, 2006 11:04 AM To: 'Roland Dreier' Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; openib-general@openib.org Subject: [PATCH 2/6] IB: match connection requests based on private data Extend matching connection requests to listens in the Infiniband CM to include private data checks. This allows applications to listen on the same service identifier, with private data directing the request to the appropriate application. Signed-off-by: Sean Hefty [EMAIL PROTECTED] The term private data is intended to convey the intent that the data is private to the application layer and is opaque to middleware and the network. By what mechanism does the listening application delegate how much of the private data is for use by the CM in sub-dividing a listen? What does an application do if it wishes to retain full ownership of the private data?
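One plausible answer to the delegation question is a compare-under-mask scheme: the listener registers a compare value plus a mask saying which bytes of the REQ's private data it cedes to the CM, and bytes outside the mask stay opaque. The sketch below illustrates that idea; the structure and names are invented for illustration and are not the actual ib_cm interface.

```c
/* Sketch of private-data directed listens via compare-under-mask.
 * A listener that wants full ownership of its private data simply
 * registers an all-zero mask.  Hypothetical types and names. */
#include <stdbool.h>

#define PDATA_LEN 92   /* length of REQ private data in the IB CM */

struct pdata_match {
        unsigned char data[PDATA_LEN];
        unsigned char mask[PDATA_LEN];  /* 0xff = CM may inspect byte */
};

/* Returns true if the incoming REQ's private data matches this
 * listener in every byte the listener delegated to the CM. */
static bool pdata_matches(const struct pdata_match *m,
                          const unsigned char *req_pdata)
{
        for (int i = 0; i < PDATA_LEN; i++)
                if ((req_pdata[i] ^ m->data[i]) & m->mask[i])
                        return false;
        return true;
}
```

With this shape, the application, not the CM, chooses how much of the data becomes demultiplexing key, which directly addresses the two questions in the message.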
RE: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Stevens Sent: Monday, March 06, 2006 11:49 AM To: Michael S. Tsirkin Cc: Linux Kernel Mailing List; netdev@vger.kernel.org Subject: Re: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP I don't know any details about SDP, but if there are no differences at the protocol layer, then neither the address family nor the protocol is appropriate. If it's just an API change, the socket type is the right selector. So, maybe SOCK_DIRECT to go along with SOCK_STREAM, SOCK_DGRAM, etc. +-DLS That wouldn't work either. The whole point of SDP, or TOE, is that the API is either totally unchanged or at least essentially unchanged. Whenever an IP Address is used (SDP/iWARP, TOE and potentially SDP/IB) changing from AF_INET* is wrong. For both SDP/iWARP and SDP/IB you could argue that a different wire protocol is in use so IPPROTO_SDP is acceptable. That's probably the best answer as long as we are stuck under the restriction that the selection of an alternate stack cannot be done in the exact manner that the consumer wants it done (that is transparently to the application). There are even some corner case scenarios where the application might care whether their SOCK_STREAM was carried over SDP or plain TCP. So a protocol based distinction is probably the least misleading of all the explicit selection options.
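From the application side, the protocol-based selection argued for above is one extra argument to socket(2) with a graceful fallback. The sketch below is user-space and runnable today; note that the IPPROTO_SDP value used is a placeholder of my own, not an assigned protocol number.

```c
/* Sketch of protocol-based SDP selection: ask for SDP via a distinct
 * IPPROTO value, and fall back to plain TCP when the kernel has no
 * SDP support.  Everything else about the sockets API is unchanged,
 * which is the whole point of SDP. */
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IPPROTO_SDP
#define IPPROTO_SDP 123   /* placeholder value; not an assigned number */
#endif

/* Returns a stream socket, preferring SDP but degrading to TCP. */
static int stream_socket(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SDP);
        if (fd < 0)   /* kernel lacks SDP: same API, ordinary TCP */
                fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        return fd;
}
```

On a stock kernel the first call fails with EPROTONOSUPPORT and the fallback path runs, which shows why this selector is so low-impact: connect, send, and recv then proceed identically in either case.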
RE: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP
-Original Message- From: David Stevens [mailto:[EMAIL PROTECTED] Sent: Monday, March 06, 2006 12:32 PM To: Caitlin Bestler Cc: Linux Kernel Mailing List; Michael S. Tsirkin; netdev@vger.kernel.org Subject: RE: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP IPPROTO_* should match the protocol field on the wire, which I gather isn't different. And I'm assuming there is no standard API defined already... SDP uses the existing standard sockets API. That was the intent in its design, and it is the sole justification for its use. If you are not using the existing sockets API then your application would be *far* better off coding directly to RDMA. The wire protocol *is* different: it uses RDMA. There is some justification for the application knowing this, albeit a slight one. For example, you need to know if the peer supports SDP, and it might affect how intermediate firewalls need to be configured.
RE: [RFC] Some infrastructure for interrupt-less TX
[EMAIL PROTECTED] wrote: Below patch wasn't even compile tested. I'm not involved with network drivers anymore, so my personal interest is fairly low. But since I firmly believe in the advantages and feasibility of interrupt-less TX, there should at least be an ugly broken patch to flame about. Go for it, tell me how stupid I am! Jörn I am assuming the real goal is avoiding interrupts when transmit completions can be reported without them on a reasonably periodic basis. Wouldn't that goal be achievable by the type of transmit buffer ring implied for net channels?
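The transmit-ring approach alluded to above usually means the NIC writes its consumer index back to host memory and the driver reclaims completed descriptors by polling that index from a timer or from the next transmit, rather than taking an interrupt per completion. The ring layout and names below are an illustrative sketch, not any driver's actual structures.

```c
/* Sketch of interrupt-less TX reclaim: the NIC advances hw_tail in
 * host memory as it finishes sending; the driver catches up to it
 * opportunistically, with no completion interrupt.
 * Hypothetical structure and names. */
#include <stddef.h>

#define TX_RING_SIZE 256

struct tx_ring {
        unsigned int sw_tail;         /* oldest unreclaimed descriptor */
        unsigned int hw_tail;         /* consumer index written by NIC */
        void *skb[TX_RING_SIZE];      /* buffers awaiting completion */
};

/* Reclaim everything the hardware reports as sent.  Called from a
 * periodic timer or the next transmit path, never from an IRQ.
 * Returns the number of descriptors freed. */
static unsigned int tx_reclaim(struct tx_ring *r)
{
        unsigned int freed = 0;
        while (r->sw_tail != r->hw_tail) {
                r->skb[r->sw_tail] = NULL;   /* free the completed buffer */
                r->sw_tail = (r->sw_tail + 1) % TX_RING_SIZE;
                freed++;
        }
        return freed;
}
```

The trade-off is that reclaim latency is now bounded by the polling period rather than the interrupt, so the ring must be sized for the worst-case gap between polls.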
Re: [openib-general] Re: [Rdma-developers] Meeting (07/22)summary:OpenRDMA community development discussion
Generally there are two cases to consider: when the TCP mode is not visible and when it is. When it is not visible it is certainly easy to manage the TCP connection with subset logic within the RDMA stack and never involve the host stack. This is certainly what the initial proposal will rely upon. In the long term it has the problems you cited. Having two stacks accept TCP connections means that *both* must be updated to stay current with the latest DoS attacks. While it is more work for the RDMA device, I think there is general agreement among the hardware vendors that this is something that the OS *should* retain control of. Deciding which connections may be accepted is inherently an OS function. Beyond that there is a distinct programming model, already accepted in IETF specifications, that requires the application to begin work in streaming (i.e., socket) mode, and then only convert to RDMA mode once the two peers have agreed upon that optimization. To support that model you will eventually have to allow the host stack to transfer a TCP connection to the RDMA stack *or* you will require the RDMA stack to provide full TCP/socket functionality. So the real question is not whether to allow the RDMA stack to take a connection from the host stack, but whether to force the RDMA stack to yield control of the connection to the host during critical connection setup so that this step remains firmly under OS control and oversight. On 8/2/05, Tom Tucker [EMAIL PROTECTED] wrote: 'Christoph Hellwig' wrote: Can you provide more details on exactly why you think this is a horrible idea? I agree it will be complex, but it _could_ be scoped such that the complexity is reduced. For instance, the offload function could fail (with EBUSY or something) if there is _any_ data pending on the socket. Thus removing any requirement to pass down pending unacked outgoing data, or pending data that has been received but not yet read by the application.
The idea here is that the applications at the top know they are going into RDMA mode and have effectively quiesced the connection before attempting to move the connection into RDMA mode. We could, in fact, _require_ the connection be quiesced to keep things simpler. I'm quickly sinking into gory details, but I want to know if you have other reasons (other than the complexity) for why this is a bad idea. I think your writeup here is more than explanation enough. The offload can only work for few special cases, and even for those it's rather complicated, especially if you take things such as ipsec or complex tunneling, which get more and more common, into account. I think Steve's point was that it *can* be simplified as necessary to meet the demands/needs of the Linux community. It is certainly technically possible, but agreeably complicated, to offload an active socket. What do you achieve by implementing the offload except trying to make it look more integrated to the user than it actually is? Just offload rdma protocols to the RDMA hardware and keep the IP stack out of that complexity. You get the benefit of things like SYN flood DoS attack avoidance built into the host stack without replicating this functionality in the offloaded adapter. There are other benefits of integration like failover, etc... IMHO, however, the bulk of the benefits are for ULP offload like RDMA where the remote peer may not be capable of HW RDMA acceleration. This kind of thing could be determined in streaming mode using the host stack and then migrated to an adapter for HW acceleration only if the remote peer is capable.
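The scoped offload Steve describes, refusing the handoff with EBUSY unless the stream is fully quiesced, reduces to a two-field check. The sketch below is a toy model with invented names; the real kernel/driver boundary would of course involve far more state.

```c
/* Sketch of quiesced-only connection handoff: moving a connection
 * into RDMA mode fails with EBUSY if any data is pending in either
 * direction, removing any need to migrate in-flight stream state.
 * Hypothetical structure and names. */
#include <errno.h>

struct conn_state {
        unsigned int unacked_bytes;   /* sent but not yet acked */
        unsigned int unread_bytes;    /* received, not yet read by app */
        int rdma_mode;
};

static int rdma_offload(struct conn_state *c)
{
        if (c->unacked_bytes || c->unread_bytes)
                return -EBUSY;   /* caller must quiesce the stream first */
        c->rdma_mode = 1;        /* hand the 4-tuple to the RDMA stack */
        return 0;
}
```

Because the applications have already negotiated the switch to RDMA mode at the ULP level, hitting EBUSY is a protocol error on their part rather than a case the kernel must paper over.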
___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general