RE: [RFC NET 00/02]: Secondary unicast address support
[EMAIL PROTECTED] wrote: From: [EMAIL PROTECTED] (Eric W. Biederman) Date: Thu, 21 Jun 2007 13:08:12 -0600 However this just seems to allow a card to decode multiple mac addresses, which in some oddball load-balancing configurations may actually be useful, but it seems fairly limited. Do you have a specific use case you envision for this multiple-MAC functionality? Virtualization. If you can't tell the ethernet card that more than one MAC address is for it, you have to turn the thing into promiscuous mode. Networking on virtualization is typically done by giving each guest a unique MAC address; the guests have a virtual network device that connects to the control node (or dom0 in Xen parlance) and/or other guests. The control node has a switch that routes the packets from the guests either to other guests or out the real ethernet interface. Each guest gets a unique MAC so that the switch can know which guest an incoming packet is for. The same software switch could also throw away the excess frames that promiscuous mode would have admitted. Unless misdirected frames were common, this would not seem to be a major CPU burden. Keep in mind that the only MAC addresses that would have been transmitted are the ones that the input filter would have listed. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
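The demultiplexing described above can be sketched in a few lines. This is an illustrative Python simulation, not kernel or hypervisor code; the class, guest names and frame representation are all hypothetical:

```python
# Minimal sketch of the software switch described above: frames are
# demultiplexed by destination MAC, and frames that only promiscuous
# mode would have admitted are thrown away.

class SoftSwitch:
    def __init__(self):
        self.ports = {}      # dest MAC -> guest name
        self.dropped = 0     # excess frames promiscuous mode would admit

    def register_guest(self, mac, guest):
        """Each guest gets a unique MAC; a multi-MAC NIC filter would list these."""
        self.ports[mac] = guest

    def deliver(self, dest_mac, payload):
        guest = self.ports.get(dest_mac)
        if guest is None:
            self.dropped += 1   # misdirected frame: drop in software
            return None
        return (guest, payload)
```

A NIC that can decode several unicast MACs would perform exactly the `ports` lookup in hardware, so only the misdirected frames ever reach the software drop path.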
RE: [openib-general] [PATCH 1/10] cxgb3 - main header files
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Michael S. Tsirkin Sent: Tuesday, January 09, 2007 5:57 AM To: Steve Wise Cc: netdev@vger.kernel.org; Roland Dreier; Divy Le Ray; linux-kernel@vger.kernel.org; openib-general Subject: Re: [openib-general] [PATCH 1/10] cxgb3 - main header files We also need to decide on the ib_req_notify_cq() issue. Let's clarify - do you oppose doing copy_from_user from a fixed address passed in during setup? If OK with you, this seems the best way as it is the least controversial and least disruptive one. To clarify my understanding of this issue: A device MAY implement ib_req_notify_cq by updating a location directly from user mode. Any of the techniques that apply to other user-allocated objects, such as the Send Queue, can be applied here. Even though the proposed changes would be about as low-impact and benign as possible, the fact that there are valid solutions without any API changes leans heavily towards using those solutions.
Re: Suppress / delay SYN-ACK
On 10/12/06, Rick Jones [EMAIL PROTECTED] wrote: Martin Schiller wrote: Hi! I'm searching for a solution to suppress / delay the SYN-ACK packet of a listening server (-application) until he has decided (e.g. analysed the requesting ip-address or checked if the corresponding other end of a connection is available) if he wants to accept the connect request of the client. If not, it should be possible to reject the connect request. How often do you expect the incoming call to be rejected? I suspect that would have a significant effect on whether the whole thing is worthwhile. rick jones More to the point, on what basis would the application be rejecting a connection request based solely on the SYN? There are only two pieces of information available: the remote IP address and port, and the total number of pending requests. The latter is already addressed through the backlog size, and netfilter rules can already be used to reject based on IP address. That would seem to limit the usefulness to scenarios where a given remote IP address *might* be accepted based on total traffic load, number of other connections from the same IP address, etc. If *all* requests from that IP address are going to be rejected, why not use netfilter?
RE: [RFC] network namespaces
[EMAIL PROTECTED] wrote: Finally, as I understand it, both network isolation and network virtualization (both level2 and level3) can happily co-exist. We do have several filesystems in the kernel. Let's have several network virtualization approaches, and let the user choose. Does that make sense? If there are no compelling arguments for using both ways of doing it, it is silly to merge both, as it is more maintenance overhead. My reading is that full virtualization (Xen, etc.) calls for implementing L2 switching between the partitions and the physical NIC(s). The tradeoffs between L2 and L3 switching are indeed complex, but there are two implications of doing L2 switching between partitions: 1) Do we really want to ask device drivers to support L2 switching for partitions and something *different* for containers? 2) Do we really want any single packet to traverse an L2 switch (for the partition-style virtualization layer) and then an L3 switch (for the container-style layer)? The full virtualization solution calls for virtual NICs with distinct MAC addresses. Is there any reason why this same solution cannot work for containers (just create more than one VNIC for the partition, and then assign each VNIC to a container)?
RE: RDMA will be reverted
[EMAIL PROTECTED] wrote: From: Steve Wise [EMAIL PROTECTED] Date: Wed, 05 Jul 2006 12:50:34 -0500 However, iWARP devices _could_ integrate with netfilter. For most devices the connection request event (SYN) gets passed up to the host driver. So the driver can enforce filter rules then. This doesn't work. In order to handle things like NAT and connection tracking properly you must even allow ESTABLISHED state packets to pass through netfilter. Netfilter can have rules such as "NAT port 200 to 300, leave the other fields alone" and your suggested scheme cannot handle this. This is totally irrelevant. But it does work. First, an RDMA connection once established associates a TCP connection *as identified external to the box* with an RDMA endpoint (conventionally a QP). Performing a NAT translation on a TCP packet would certainly be within the capabilities of an RNIC, but it would be pointless. The relabeled TCP segment would be associated with the same QP. Once an RDMA connection is established, the individual TCP segments are only of interest to the RDMA endpoint. Payload is delivered through the RDMA interface (the same one already used for InfiniBand). The purpose of integration with netfilter would be to ensure that no RDMA connection could exist, or persist, if netfilter would not allow the TCP connection to be created. That is not a matter of packet filtering, it is a matter of administrative consistency. If someone uses netfilter to block connections from a given IP netmask then they reasonably expect that there will be no connections with any host within that IP netmask. They do not expect exceptions for RDMA, iSCSI or InfiniBand. The existing connection management interfaces in openfabrics, designed to support both InfiniBand and iWARP, could naturally be extended to validate all RDMA connections using an IP address with netfilter. This would be of real value. 
The only real value of a rule such as "NAT port 200 to 300" is to allow a remote peer to establish a connection to port 200 with a local listener using port 300. That *can* be supported without actually manipulating the header in each TCP packet. It is also possible to discuss other netfilter functionality that serves a valid end-user purpose, such as counting packets.
RE: RDMA will be reverted
[EMAIL PROTECTED] wrote: From: Tom Tucker [EMAIL PROTECTED] Date: Wed, 05 Jul 2006 12:09:42 -0500 A TOE net stack is closed source firmware. Linux engineers have no way to fix security issues that arise. As a result, only non-TOE users will receive security updates, leaving random windows of vulnerability for each TOE NIC's users. - A Linux security update may or may not be relevant to a vendor's implementation. - If a vendor's implementation has a security issue then the customer must rely on the vendor to fix it. This is no less true for iWARP than for any adapter. This isn't how things actually work. Users have a computer, and they can rightly expect the community to help them solve problems that occur in the upstream kernel. When a bug is found and the person is using NIC X, we don't necessarily forward the bug report to the vendor of NIC X. Instead we try to fix the bug. Many chip drivers are maintained by people who do not work for the company that makes the chip, and this works just fine. If only the chip vendor can fix a security problem, this makes Linux less agile. Every aspect of a problem on a Linux system that cannot be fixed entirely by the community is a net negative for Linux. - iWARP needs to do protocol processing in order to validate and evaluate TCP payload in advance of direct data placement. This requirement is independent of CPU speed. Yet, RDMA itself is just an optimization meant to deal with limitations of cpu and memory speed. You can rephrase the situation in whatever way suits your argument, but it does not make the core issue go away :) RDMA is a protocol that allows the application to more precisely state the actual ordering requirements. It improves the end-to-end interactions and has value over a protocol with only byte or message stream semantics regardless of local interface efficiencies. 
See http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt In any event, isn't the value of an RDMA interface to applications already settled? The question is how best to integrate the usage of IP addresses with the kernel. The inability to perform the low-level packet validation in open source code is a limitation of *all* RDMA solutions; the transport layer of InfiniBand is just as offloaded as it is for iWARP. The patches proposed are intended to support integrated connection management for RDMA connections using IP addresses, no matter what the underlying transport is. The only difference is that *all* iWARP connections use IP addresses.
RE: Netchannels: first stage has been completed. Further ideas.
[EMAIL PROTECTED] wrote: Evgeniy Polyakov wrote: On Thu, Jul 20, 2006 at 02:21:57PM -0700, Ben Greear ([EMAIL PROTECTED]) wrote: Out of curiosity, is it possible to have the single producer logic if you have two+ ethernet interfaces handling frames for a single TCP connection? (I am assuming some sort of multi-path routing logic...) I do not think it is possible without additional logic like what is implemented in softirqs, i.e. per-cpu queues of data, which in turn will be converted into skbs one-by-one. Couldn't you have two NICs being handled by two separate CPUs, with both CPUs trying to write to the same socket queue? The receive path works with RCU locking from what I understand, so a protocol's receive function must be re-entrant. Wouldn't it be easier simply not to have two NICs feed the same ring? What packets end up in which ring is fully controllable. On the rare occasion that a single connection must be fed by two NICs, a software merge of the two rings would be far cheaper than having to co-ordinate between producers all the time.
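The single-consumer merge suggested above can be sketched as follows. This is an illustrative Python simulation, not kernel code; each deque stands in for a per-NIC ring and integer sequence numbers stand in for TCP ordering:

```python
from collections import deque

# Rather than two NIC receive paths contending as producers on one socket
# queue, each NIC fills its own ring and a single consumer merges them,
# preserving per-ring order without any producer-side coordination.

def merge_rings(ring_a, ring_b):
    """Single-consumer merge of two per-NIC rings by sequence number."""
    merged = []
    while ring_a or ring_b:
        if not ring_b or (ring_a and ring_a[0] <= ring_b[0]):
            merged.append(ring_a.popleft())
        else:
            merged.append(ring_b.popleft())
    return merged
```

The merge only runs in the rare multi-path case; in the common case each connection's packets land in exactly one ring and the consumer reads it directly.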
RE: RDMA will be reverted
Andi Kleen wrote: We're focusing on netfilter here. Is breaking netfilter really the only issue with this stuff? Another concern is that it will just not be able to keep up with a high rate of new connections or a high number of them (because the hardware has too limited state) Neither iWARP nor an iSCSI initiator will require extremely high rates of connection establishment. An RNIC only establishes connections when its services have been explicitly requested (via use of a specific service). In any event, the key question here is whether integration with the netdevice improves things or whether the offload device should be totally transparent to the kernel. If the offload device somehow insisted on handling connection requests that the kernel would have been able to handle then this would be an issue. But the kernel is not currently handling RDMA connect requests on its own, and I know of no-one who has suggested that a software-only implementation of RDMA is feasible at 10Gbit. Netfilter integration is definitely something that needs to be addressed, but the L2/L3 integrations need to be in place first. And then there are the other issues I listed like subtle TCP bugs (TSO is already a nightmare in this area and it's still not quite right) etc. Making an RNIC fully transparent to the kernel would require it to handle many L2 and L3 issues in parallel with the host stack. That increases the chance of a bug, or at least a subtle difference between the host and the RNIC which, while being compliant, would be unexpected. The purpose of the proposed patches is to enable the RNIC to be in full compliance with the host stack on IP layer issues. It would need someone who can describe how this new RDMA device avoids all the problems, but so far its advocates don't seem to be interested in doing that and I cannot contribute more. RDMA services are already defined for the kernel. 
The connection management and network notifier patches are about enabling RDMA devices to use IP addresses in a way that is consistent. Obviously doing so is more important for an iWARP device than for an InfiniBand device, but InfiniBand users have also expressed a desire to use IP addressing. Applications do not use RDMA by accident; it is a major design decision. Once an application uses RDMA it is no longer a direct consumer of the transport layer protocol. Indeed, one of the main objectives of the OpenFabrics stack is to enable typical applications to be written that will work over RDMA without caring what the underlying transport is. The options for control will still be there, but just as a sockets programmer does not typically care whether their IP is carried over SLIP, PPP, Ethernet or ATM, most RDMA developers should not have to worry about iWARP or InfiniBand. http://ietf.org/internet-drafts/draft-ietf-rddp-applicability-08.txt provides an overview of how RDMA benefits applications, and when applications would benefit from its use as compared to plain TCP.
RE: TOE, etc.
Herbert Xu wrote: Yes, however I think the same argument could be applied to TOE. With their RDMA NIC, we'll have TCP/SCTP connections that bypass netfilter, tc, IPsec, AF_PACKET/tcpdump and the rest of our stack while at the same time it is using the same IP address as us and deciding what packets we will or won't see. The whole point of the patches that opengrid has proposed is to allow control of these issues to remain with the kernel. That is where the ownership of the IP address logically resides, and system administrators will expect to be able to use one set of tools to control what is done with a given IP address. The bypassing is already going on with iSCSI devices and with InfiniBand devices that use IP addresses. An RDMA/IP device just makes it harder to ignore this problem, but the problem was already there. SDP over IB is presented to Linux users essentially as a TOE service. Connections are made with IP and socket semantics, and yet there is no co-ordination on routes/netfilter/etc. I'll state right up front that I think stateful offload, when co-ordinated with the OS, is better than stateless offload -- especially at 10G speeds. But for plain TCP connections there are stateless offloads available. As a product architect I am already seeking as many ways as possible to support stateless offload as efficiently as possible, to keep that option viable for Linux users at as high a rate as possible. That is why we are very interested in exploring a hardware-friendly definition of vj_netchannels. But with RDMA things are different. There is no such thing as stateless RDMA. It is not RDMA over TCP that requires stateful offload, it is RDMA itself. RDMA over InfiniBand is just as much of a stateful offload as RDMA over TCP. It is possible to build RDMA over TCP as a service that merely uses memory mapping services in a mysterious way but is not integrated with the network stack at all. That is essentially how RDMA over IB is currently working. 
But I believe that integrating control over the IP address, and the associated netfilter/routing/arp/pmtu/etc issues, is the correct path. This logic should not be duplicated, and its control must not be split.
RE: TOE, etc.
[EMAIL PROTECTED] wrote: From: Steve Wise [EMAIL PROTECTED] Date: Wed, 28 Jun 2006 09:54:57 -0500 Doesn't iSCSI have this same issue? Software iSCSI implementations don't have the issue because they go through the stack using normal sockets and normal device send and receive. But hardware iSCSI implementations, which already exist, do not work through normal sockets.
RE: TOE, etc.
Jeff Garzik wrote: Caitlin Bestler wrote: Jeff Garzik wrote: Caitlin Bestler wrote: But hardware iSCSI implementations, which already exist, do not work through normal sockets. No, they work through normal SCSI stack... Correct. But they then interface to the network using none of the network stack. The normal SCSI stack does not control that in any way. Correct. And the network stack is completely unaware of whatever IP addresses, ARP tables, routing tables, etc. it is using. NFS over RDMA is part of the file system. That doesn't change the fact that its use of IP addresses needs to be co-ordinated with the network stack, and indeed address-based authentication *assumes* that this is the case. (And yes, there are preferable means of authentication, but authenticating based on IP address is already supported.) Sounds quite broken to me. But back on the main point, if implementing SCSI services over a TCP connection is acceptable even though it does not use a kernel socket, why would it not be acceptable to implement RDMA services over a TCP connection without using a kernel socket? Because SCSI doesn't force nasty hooks into the net stack to allow for sharing of resources with a proprietary black box of unknown quality. Jeff RDMA can also solve all of these problems on its own. Complete with giving the network administrator *no* conventional controls over the IP address being used for RDMA services. That means no standard ability to monitor connections, no standard ability to control which connections are made with whom. Is that better? You seem to be practically demanding that RDMA build an entire parallel stack. Worse, that *each* RDMA vendor build an entire parallel stack. Open source being what it is, that is not terribly difficult. But exactly how does this benefit Linux users? The proposed subscriptions are not about sharing *resources*, they share *information* with device drivers. 
The quality of each RDMA device driver will be just as well known as that of a SCSI driver, an InfiniBand HCA driver, a graphics driver or a plain Ethernet driver.
RE: [PATCH Round 2 0/2][RFC] Network Event Notifier Mechanism
[EMAIL PROTECTED] wrote: From: Steve Wise [EMAIL PROTECTED] Date: Tue, 27 Jun 2006 10:02:19 -0500 For the RDMA kernel subsystem, however, we still need a specific event. We need both the old and new dst_entry struct ptrs to figure out which active connections were using the old dst_entry and should be updated to use the new dst_entry. This change isn't truly atomic from a kernel standpoint either. The new dst won't be selected by the socket until later, when the socket tries to send something, notices the old dst is obsolete, and looks up a new one. Your code could do the same thing. The request to send something is posted directly from user mode to a mapped memory ring that is reaped by the hardware. Having the hardware fault, report that fault, and wait for the host to update it with the new mapping is somewhat clumsy. It also won't work at all for existing hardware. The best you could do is to have the driver invalidate the old entry, then *presume* that the hardware will want the replacement, look that up, and then forward that answer to the hardware.
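The driver-side scheme just described (find the connections using the old dst entry, repoint them, and push the replacement down to the hardware) might look schematically like this. An illustrative Python simulation with hypothetical names, not the actual RDMA driver API:

```python
# Sketch: on a routing change the driver receives both the old and new
# dst entries, updates every affected connection, and forwards the new
# mapping to the (here simulated) hardware.

class RdmaDriver:
    def __init__(self):
        self.conn_dst = {}   # connection id -> dst entry
        self.pushed = []     # (conn_id, new_dst) updates sent to hardware

    def add_connection(self, conn_id, dst):
        self.conn_dst[conn_id] = dst

    def on_dst_change(self, old_dst, new_dst):
        """Needs both ptrs: only connections on old_dst are touched."""
        for conn_id, dst in self.conn_dst.items():
            if dst == old_dst:
                self.conn_dst[conn_id] = new_dst
                self.pushed.append((conn_id, new_dst))
```

This is why the event must carry both the old and the new entry: with only the new one, the driver cannot tell which of its active connections are affected.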
RE: [PATCH 0/2][RFC] Network Event Notifier Mechanism
[EMAIL PROTECTED] wrote: On Thu, 2006-22-06 at 15:40 -0500, Steve Wise wrote: On Thu, 2006-06-22 at 15:43 -0400, jamal wrote: No - what these 2 gents are saying was these events and infrastructure already exist. Notification of the exact events needed does not exist today. Ok, so you can't even make use of anything that already exists? Or is a subset of what you need already there? The key events, again, are: - the neighbour entry mac address has changed. - the next hop ip address (ie the neighbour) for a given dst_entry has changed. I don't see a difference for the above two from an L2 perspective. Are you keeping track of IP addresses? You didn't answer my question in the previous email as to what RDMA needs to keep track of in hardware. The RDMA device is handling L4 or L5 connections that have L3 addresses (IP). Subscribing to the information allows the device to keep its behaviour consistent with the host stack. The common alternative before proposing this integration was to have the RDMA device sniff all incoming packets and attempt to do parallel processing on a large set of lower-layer protocols (ICMP, ARP, routing, ...). Or by simply trusting that the IB network administrator has faithfully replicated all IP-relevant instructions in two forums (traditional IP network administration and IB network administration). These subscriptions are an attempt to cede full control of these issues back to one place, the kernel, and to guarantee that an offload device can never think that the route to X is Y when the kernel says it is Z. Or that it has a different PMTU, etc. I don't have any strong opinion on the best mechanism for implementing these subscriptions, but having correct, consistent networking behaviour depend on a user-mode relay strikes me as odd.
RE: [PATCH 0/2][RFC] Network Event Notifier Mechanism
[EMAIL PROTECTED] wrote: On Thu, 2006-22-06 at 15:58 -0500, Steve Wise wrote: On Thu, 2006-06-22 at 16:36 -0400, jamal wrote: I created a new notifier block in my patch for these network events. I guess I thought I was using the existing infrastructure to provide this notification service. (I thought my patch was lovely :) But I didn't integrate with netlink for user space notification. Mainly cuz I didn't think these events should be propagated up to users unless there was a need. I think they will be useful in user space. Typically you only propagate them if there's a user space program subscribed to listening (there are hooks which will tell you if there's anyone listening). The netdevice events tend to be a lot more usable in a few other blocks because they are lower in the hierarchy (i.e. routing depends on ip addresses which depend on netdevices) within the kernel, unlike in this case where you are the only consumer; so it does sound logical to me to do it in user space; however, not totally unreasonable to do it in the kernel. These services are relevant to any RDMA connection. The user-space consumer of RDMA services is no more interested in tracking the routing of the remote IP address than the consumer of socket services is. Another issue I see with netlink is that the event notifications aren't reliable. Especially the CONFIG_ARPD stuff, because it allocs an sk_buff with ATOMIC. A lost neighbour macaddr change is perhaps fatal for an RDMA connection... This would happen in the cases where you are short on memory; I would suspect you will need to allocate memory in your driver as well to update something in the hardware - so same problem. You can however work around issues like these in netlink. A direct notification call to the driver makes the driver responsible for providing whatever buffering it requires to save the information. And if there is insufficient memory available, at least the driver is aware of the failure. 
Allowing a third component to fail to relay information means that the driver can no longer be responsible for maintaining its own consistency with kernel routing, ARP and neighbour tables. Maintaining that consistency is a matter of correct network behaviour, not of doing status reports. Obviously we cannot have hardware looking at and interpreting these tables directly. So a *reliable* subscription would seem to be the only option. If the only subscribers who require reliable notifications are kernel drivers, does it really make sense to make those changes in code that also supports user space? I am still unclear: You have the destination IP address, the dstMAC of the nexthop to get the packet to this IP address and I suspect some srcMAC address you will use sending out, as well as the pathMTU to get there, correct? Because of the IP address it sounds to me like you are populating an L3 table. How is this info used in hardware? Can you explain how an arriving packet would be used by the RDMA in conjunction with this info once it is in the hardware? Some packets are associated with established RDMA (or iSCSI) connections, and are processed on the RDMA (or iSCSI) device. The device will also pass other packets through to the host stack for processing (non-matched Ethernet frames for IP networks, and IPoIB tunneled frames for IB networks). The device provides L5 services (RDMA and/or iSCSI) in addition to L2 services (as an Ethernet device). The rest of the network rightfully demands that the left hand know what the right hand is doing. So information that is provided to a host, ARP/ICMP, should affect the behaviour of *all* connections from that host. Do you agree that having the device subscribe to the kernel-maintained tables is a better solution than having it attempt to guess the correct values in parallel? 
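The direct, reliable notification argued for above, as opposed to a lossy netlink relay, can be sketched schematically. An illustrative Python simulation with hypothetical names; the point is only that a synchronous callback makes any failure visible to the subscribing driver rather than silently dropping the event:

```python
# Sketch of an in-kernel-style notifier chain: subscribers are invoked
# synchronously, and a subscriber that cannot record the event returns
# failure, so a miss is never invisible.

class NotifierChain:
    def __init__(self):
        self.callbacks = []

    def register(self, cb):
        self.callbacks.append(cb)

    def notify(self, event):
        # Each subscriber either accepts the event or reports failure.
        return [cb(event) for cb in self.callbacks]

class DriverSubscriber:
    """A driver buffering events for its hardware, with finite capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def __call__(self, event):
        if len(self.events) >= self.capacity:
            return False   # the driver itself knows it missed this one
        self.events.append(event)
        return True
```

With a user-mode or netlink relay in the middle, the `False` above would never reach the driver; that is the consistency problem the thread is describing.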
RE: [openib-general] [PATCH v2 1/7] AMSO1100 Low Level Driver.
[EMAIL PROTECTED] wrote: On Thu, 2006-06-15 at 08:41 -0500, Steve Wise wrote: On Wed, 2006-06-14 at 20:35 -0500, Bob Sharp wrote:

+void c2_ae_event(struct c2_dev *c2dev, u32 mq_index) {
+ snip
+	case C2_RES_IND_EP: {
+
+		struct c2wr_ae_connection_request *req =
+			&wr->ae.ae_connection_request;
+		struct iw_cm_id *cm_id =
+			(struct iw_cm_id *)resource_user_context;
+
+		pr_debug("C2_RES_IND_EP event_id=%d\n", event_id);
+		if (event_id != CCAE_CONNECTION_REQUEST) {
+			pr_debug("%s: Invalid event_id: %d\n",
+				 __FUNCTION__, event_id);
+			break;
+		}
+		cm_event.event = IW_CM_EVENT_CONNECT_REQUEST;
+		cm_event.provider_data = (void *)(unsigned long)req->cr_handle;
+		cm_event.local_addr.sin_addr.s_addr = req->laddr;
+		cm_event.remote_addr.sin_addr.s_addr = req->raddr;
+		cm_event.local_addr.sin_port = req->lport;
+		cm_event.remote_addr.sin_port = req->rport;
+		cm_event.private_data_len =
+			be32_to_cpu(req->private_data_length);
+
+		if (cm_event.private_data_len) {

It looks to me as if pdata is leaking here, since it is not tracked and the upper layers do not free it. Also, if pdata is freed after the call to cm_id->event_handler returns, it exposes an issue in user space where the private data is garbage. I suspect the iWARP CM should be copying this data before it returns. Good catch. Yes, I think the IWCM should copy the private data in the upcall. If it does, then the amso driver doesn't need to kmalloc()/copy at all. It can pass a ptr to its MQ entry directly... Now that I've looked more into this, I'm not sure there's a simple way for the IWCM to copy the pdata on the upcall. Currently, the IWCM's event upcall, cm_event_handler(), simply queues the work for processing on a workqueue thread. So there's no per-event logic at all there. Lemme think on this more. Stay tuned. Either way, the amso driver has a memory leak... Having the IWCM copy the pdata during the upcall also leaves the greatest flexibility for the driver on how/where the pdata is captured. 
The IWCM has to deal with user-mode, indefinite delays waiting for a response and user-mode processes that die while holding a connection request. So it makes sense for that layer to do the allocating and copying.
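The copy-in-the-upcall idea can be sketched as follows. An illustrative Python simulation with hypothetical names, not the actual IWCM API; the mutable buffer stands in for the driver's MQ entry:

```python
# Sketch: the CM copies the private data during the event upcall, before
# returning to the driver, so the driver can pass a pointer straight into
# its message-queue entry and recycle that entry immediately afterwards.

class IWCM:
    def __init__(self):
        self.work_queue = []

    def cm_event_handler(self, event, pdata):
        # bytes(pdata) makes the copy now; no kmalloc()/copy in the driver.
        self.work_queue.append((event, bytes(pdata)))

cm = IWCM()
mq_entry = bytearray(b"private-data")   # driver's MQ entry (mutable)
cm.cm_event_handler("CONNECT_REQUEST", mq_entry)
mq_entry[:] = b"reused-entry"           # driver recycles the entry safely
```

Because the copy happens before `cm_event_handler` returns, recycling the MQ entry cannot corrupt the private data the work queue later hands to user space.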
RE: [openib-general] [PATCH v2 1/2] iWARP Connection Manager.
[EMAIL PROTECTED] wrote: On Tue, 2006-06-13 at 16:46 -0500, Steve Wise wrote: On Tue, 2006-06-13 at 14:36 -0700, Sean Hefty wrote: Er...no. It will lose this event. Depending on the event...the carnage varies. We'll take a look at this. This behavior is consistent with the Infiniband CM (see drivers/infiniband/core/cm.c function cm_recv_handler()). But I think we should at least log an error because a lost event will usually stall the rdma connection. I believe that there's a difference here. For the Infiniband CM, an allocation error behaves the same as if the received MAD were lost or dropped. Since MADs are unreliable anyway, it's not so much that an IB CM event gets lost, as it doesn't ever occur. A remote CM should retry the send, which hopefully allows the connection to make forward progress. hmm. Ok. I see. I misunderstood the code in cm_recv_handler(). Tom and I have been talking about what we can do to not drop the event. Stay tuned. Here's a simple solution that solves the problem: For any given cm_id, there are a finite (and small) number of outstanding CM events that can be posted. So we just pre-allocate them when the cm_id is created and keep them on a free list hanging off of the cm_id struct. Then the event handler function will pull from this free list. The only case where there is any non-finite issue is on the passive listening cm_id. Each incoming connection request will consume a work struct. So based on client connects, we could run out of work structs. However, the CMA has the concept of a backlog, which is defined as the max number of pending unaccepted connection requests. So we allocate these work structs based on that number (or a computation based on that number), and if we run out, we simply drop the incoming connection request due to backlog overflow (I suggest we log the drop event too). When a MPA connection request is dropped, the (IETF conforming) MPA client will eventually time out the connection and the consumer can retry. 
Comments? If the IWCM cannot accept a Connection Request event from the driver then *someone* should generate a non-peer reject MPA Response frame. Since the IWCM does not have the resources to relay the event, it probably does not have the resources to generate the MPA Response frame either. So simply returning an "I'm busy" error and expecting the driver to handle it makes sense to me.
RE: [openib-general] Re: [PATCH 1/2] iWARP Connection Manager.
There's a difference between trying to handle the user calling disconnect/destroy at the same time a call to accept/connect is active, versus the user calling disconnect/destroy after accept/connect have returned. In the latter case, I think you're fine. In the first case, this is allowing a user to call destroy at the same time that they're calling accept/connect. Additionally, there's no guarantee that the F_CONNECT_WAIT flag has been set by accept/connect by the time disconnect/destroy tests it. The problem is that we can't synchronously cancel an outstanding connect request. Once we've asked the adapter to connect, we can't tell it to stop; we have to wait for it to fail. During the time period between when we ask to connect and the adapter says yea or nay, the user hits ctrl-C. This is the case where disconnect and/or destroy gets called and we have to block it waiting for the outstanding connect request to complete. One alternative to this approach is to do the kfree of the cm_id in the deref logic. This was the original design; it leaves the object around to handle the completion of the connect and still allows the app to clean up and go away without all this waitin' around. When the adapter finally finishes and releases its reference, the object is kfree'd. Hope this helps. Why couldn't you synchronously put the cm_id in a state of pending delete and do the actual delete when the RNIC provides a response to the request? There could even be an optional method to see if the device is capable of cancelling the request. I know it can't yank a SYN back from the wire, but it could refrain from retransmitting.
RE: [PATCH] TCP Veno module for kernel 2.6.16.13
[EMAIL PROTECTED] wrote: On Thu, 25 May 2006 13:23:50 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote: From: #ZHOU BIN# [EMAIL PROTECTED] Date: Thu, 25 May 2006 16:30:48 +0800 Yes, I agree. Actually the main contribution of TCP Veno is not in this AI phase. No matter whether ABC is added or not, TCP Veno can always improve the performance over wireless networks, according to our tests. It seems to me that the wireless issue is separate from congestion control. The key is to identify true loss due to overflow of intermediate router queues, vs. false loss which is due to temporary radio signal interference. Is it really possible to tell the two apart? Telling loss due to true congestion apart from loss due to radio signal interference (true loss, but falsely inferred to be congestion) is actually very possible, though only at L2 and only if the hop experiencing problems is the first or last hop. There are numerous indicators that the link is experiencing link-related drops (FEC corrections, signal-to-noise ratio, etc.). The *desirability* of using this data is debatable, but it most certainly is possible.
RE: VJ Channel API - driver level (PATCH)
[EMAIL PROTECTED] wrote: -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David S. Miller Sent: Tuesday, May 02, 2006 11:48 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org Subject: Re: VJ Channel API - driver level (PATCH) I don't think we should be defining driver APIs when we haven't even figured out how the core of it would even work yet. A key part of this is the netfilter bits, which will require non-trivial flow identification; a hash will simply not be enough, and it will not be allowed to not support the netfilter bits properly since everyone will have netfilter enabled in one way or another. Hi Dave, Do you have suggestions on potential hardware assists/offloads for netfilter? I suppose some of it can be worthwhile, although in general it may be too complex to implement - especially above 1 Gig. I'd expect high end NIC ASICs to implement rx steering based upon some sort of hash (for load balancing), as well as explicit 1:1 steering between a sw channel and a hw channel. Both options for channel configuration are present in the driver interface. If netfilter assists can be done in hardware, I agree the driver interface will need to add support for these - otherwise, netfilter processing will stay above the driver. Even if the hardware cannot fully implement netfilter rules there is still value in having an interface that documents exactly how much filtering a given piece of hardware can do. There is no point in having the kernel repeat packet classifications that have already been done by the NIC.
RE: VJ Channel API - driver level (PATCH)
Are you proposing a mechanism for the consuming end of a tx channel to support a large number of channels, or are you assuming that the number of tx channels will be small enough that simply polling them in priority order is adequate?
RE: VJ Channel API - driver level (PATCH)
Evgeniy Polyakov wrote: On Wed, May 03, 2006 at 08:56:23AM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: I'd expect high end NIC ASICs to implement rx steering based upon some sort of hash (for load balancing), as well as explicit 1:1 steering between a sw channel and a hw channel. Both options for channel configuration are present in the driver interface. If netfilter assists can be done in hardware, I agree the driver interface will need to add support for these - otherwise, netfilter processing will stay above the driver. Even if the hardware cannot fully implement netfilter rules there is still value in having an interface that documents exactly how much filtering a given piece of hardware can do. There is no point in having the kernel repeat packet classifications that have already been done by the NIC. Please do not suppose that the vj channel must rely on underlying hardware. The new interface MUST work better, or at least not worse, than the existing skb queueing for the majority of users, and I doubt many users have netfilter-capable hardware. Hardware can only provide hints to the software, not rules. The best would be ipv4/ipv6 hashing, and I think that is enough. I agree. I was just stating that *if* there is direct hardware support then the software should be enabled to skip redundant checks. What I'm suggesting is really the equivalent of knowing whether the hardware generates or checks CRCs and TCP checksums. Don't mandate the feature, just have the option to avoid redundant work.
RE: VJ Channel API - driver level (PATCH)
David S. Miller wrote: From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Wed, 3 May 2006 22:07:40 +0400 On Wed, May 03, 2006 at 08:56:23AM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: I'd expect high end NIC ASICs to implement rx steering based upon some sort of hash (for load balancing), as well as explicit 1:1 steering between a sw channel and a hw channel. Both options for channel configuration are present in the driver interface. If netfilter assists can be done in hardware, I agree the driver interface will need to add support for these - otherwise, netfilter processing will stay above the driver. Even if the hardware cannot fully implement netfilter rules there is still value in having an interface that documents exactly how much filtering a given piece of hardware can do. There is no point in having the kernel repeat packet classifications that have already been done by the NIC. Please do not suppose that the vj channel must rely on underlying hardware. I am not. I am just saying that it is futile to build hardware that cannot handle netfilter at least to some extent, because this will result in HW net channels being disabled for 99% of real users, which makes the hardware just a waste. Or netfilters being disabled, which would be just as bad or worse. The kernel and hardware need to co-operate so that users are not asked to make artificial choices between performance and security.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Evgeniy Polyakov wrote: On Thu, Apr 27, 2006 at 02:12:09PM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: So the real issue is when there is an intelligent device that uses hardware packet classification to place the packet in the correct ring. We don't want to bypass packet filtering, but it would be terribly wasteful to reclassify the packet. Intelligent NICs will have packet classification capabilities to support RDMA and iSCSI. Those capabilities should be available to benefit SOCK_STREAM and SOCK_DGRAM users as well, without it being a choice of either turning all stack control over to the NIC or ignoring all NIC capabilities beyond pretending to be a dumb Ethernet NIC. Btw, how is it supposed to work without header split capable hardware? Hardware that can classify packets is obviously capable of doing header data separation, but that does not mean that it has to do so. If the host wants header data separation, its real value is that when packets arrive in order, fewer distinct copies are required to move the data to the user buffer (because separated data can be placed back-to-back in a data-only ring). But that's an optimization; it's not needed to make the idea worth doing, or even necessarily in the first implementation.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Evgeniy Polyakov wrote: On Fri, Apr 28, 2006 at 08:59:19AM -0700, Caitlin Bestler ([EMAIL PROTECTED]) wrote: Btw, how is it supposed to work without header split capable hardware? Hardware that can classify packets is obviously capable of doing header data separation, but that does not mean that it has to do so. If the host wants header data separation, its real value is that when packets arrive in order, fewer distinct copies are required to move the data to the user buffer (because separated data can be placed back-to-back in a data-only ring). But that's an optimization; it's not needed to make the idea worth doing, or even necessarily in the first implementation. If there is a dataflow, not a flow of packets or a flow of data with holes, it could be possible to modify recv() to just return the right pointer, so in theory userspace modifications would be minimal. With copying in place it does not differ at all from the current design with copy_to_user() being used, since memcpy() is just slightly faster than copy*user(). If the app is really ready to use a modified interface we might as well just give them a QP/CQ interface. But I suppose receive-by-pointer interfaces don't really stretch the sockets interface all that badly. The key is that you have to decide how the buffer is released: is it the next call? Or a separate call? Does releasing buffer N+2 release buffers N and N+1? What you want to avoid is having to keep a scoreboard of which buffers have been released. But in context, header/data separation would allow in-order packets to have the data placed back to back, which could allow a single recv to report the payload of multiple successive TCP segments. So the benefit of header/data separation remains the same, and I still say it's an optimization that should not be made a requirement. The benefits of vj_channels exist even without it. When the packet classifier runs on the host, header/data separation would not be free.
I want to enable hardware offloads, not make the kernel bend over backwards to emulate how hardware would work. I'm just hoping that we can agree to let hardware do its work without being forced to work the same way the kernel does (i.e., running down a long list of arbitrary packet filter rules on a per-packet basis).
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Evgeniy Polyakov wrote: I see your point, and respectfully disagree. The more complex a userspace interface we create, the fewer users it will have. It is completely inconvenient to read 100 bytes and receive only 80, since 20 were eaten by a header. And what if we need only 20, but the packet contains 100 - introduce a per-packet head pointer? For benchmarking purposes it works perfectly - read the whole packet, one can even touch that data to emulate real work - but for the real world it becomes practically unusable. In a straight-forward user-mode library using existing interfaces the message would be interleaved with the headers in the inbound ring. While the inbound ring is part of user memory, it is not what the user would process from; that would be the buffer they supplied in a call to read() or recvmsg(), and that buffer would have to make no allowances for interleaved headers. Enabling zero-copy when a buffer is pre-posted is possible, but modestly complex. Research on MPI and SDP has generally shown that unless the pinning overhead is eliminated somehow, the buffers have to be quite large before zero-copy reception becomes a benefit. vj_netchannels represent a strategy of minimizing registration/pinning costs even if it means paying for an extra copy. Because the extra copy is closely tied to the activation of the data-sink consumer, its cost is greatly reduced: it places the data in the cache immediately before the application will in fact use the received data. Also keep in mind that once the issues are resolved to allow the netchannel rings to be directly visible to a user-mode client, enhanced/specialized interfaces can easily be added in user-mode libraries. So focusing on supporting existing conventional interfaces is probably the best approach for the initial efforts.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
David S. Miller wrote: From: Rusty Russell [EMAIL PROTECTED] Date: Sat, 29 Apr 2006 08:04:04 +1000 You're still thinking you can bypass classifiers for established sockets, but I really don't think you can. I think the simplest solution is to effectively remove from (or flag) the established listening hashes anything which could be affected by classifiers, so those packets get sent through the default channel. OK, when rules are installed, the socket channel mappings are flushed. This is your idea, right? You mean when new rules are installed that would conflict with an existing mapping, right? Bumping every connection out of vj-channel mode whenever any new rule was installed would be very counter-productive. Ultimately, you only want a direct-to-user vj-channel when all packets assigned to it would be passed by netfilter, or at most increment a single packet counter. Checking a single QoS rate limiter may be possible too, but if there are more complex rules then the channel has to be kept in kernel because it wouldn't make sense to trust user-mode code to apply the netfilter rules reliably.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
[EMAIL PROTECTED] wrote: From: Evgeniy Polyakov [EMAIL PROTECTED] Date: Thu, 27 Apr 2006 15:51:26 +0400 There are some caveats here found while developing the zero-copy sniffer [1]. The project's goal was to remap skbs into userspace in real time. While the absolute numbers (posted to netdev@) were really high, it is only applicable to read-only applications. As was shown in the IOAT thread, data must be warmed in caches, so reading from the mapped area will be as fast as memcpy() (read+write), and copy_to_user() is actually almost equal to memcpy() (benchmarks were posted to netdev@). And we must add the remapping overhead. Yes, all of these issues are related quite strongly. Thanks for making the connection explicit. But, the mapping overhead is zero for this net channel stuff, at least as it is implemented and designed by Kelly. The ring buffer is set up ahead of time in the user's address space, and a ring of buffers in that area is given to the networking card. We remember the translations here, so no get_user_pages() on each transfer and garbage like that. And yes, this all harks back to the issues that are discussed in Chapter 5 of Networking Algorithmics. But the core thing to understand is that by defining a new API and setting up the buffer pool ahead of time, we avoid all of the get_user_pages() overhead while retaining full kernel/user protection. Evgeniy, the difference between this and your work is that you did not have an intelligent piece of hardware that could be told to recognize flows, and only put packets for a specific flow into that flow's buffer pool. If we want to DMA data from the NIC into a premapped userspace area, this will run into message sizes, misalignment, slow reads and so on, so preallocation has even more problems. I do not really think this is an issue; we put the full packet into user space and teach it where the offset is to the actual data. We'll do the same things we do today to try and get the data area aligned.
The user can do whatever is logical and relevant on his end to deal with strange cases. In fact we can specify that the card has to take some care to get the data area of the packet aligned on, say, an 8-byte boundary or something like that. When we don't have hardware assist, we are going to be doing copies. This change also requires significant changes in applications, at least until recv/send are changed, which is not the best thing to do. This is exactly the point: we can only do a good job and receive zero copy if we can change the interfaces, and that's exactly what we're doing here. I do think that the significant win in VJ's tests belongs not to remapping and cache-oriented changes, but to moving all protocol processing into the process' context. I partly disagree. The biggest win is eliminating all of the control overhead (all of softint RX + protocol demux + IP route lookup + socket lookup is turned into a single flow demux), and the SMP-safe data structure which makes it realistic enough to always move the bulk of the packet work to the socket's home cpu. I do not think a userspace protocol implementation buys enough to justify it. We have to do the protection switch in and out of kernel space anyways, so why not still do the protected protocol processing work in the kernel? It is still being done on the user's behalf, contributes to his time slice, and avoids all of the terrible issues of userspace protocol implementations. So in my mind, the optimal situation from both a protection preservation and also a performance perspective is net channels to kernel socket protocol processing, buffers DMA'd directly into userspace if hardware assist is present. Having a ring that is already flow qualified is indeed the most important savings, and worth pursuing even while consensus on how to safely enable user-mode L4 processing is still pending. The latter *can* be a big advantage when the L4 processing can be done based on a user-mode call from an already scheduled process.
But the benefit is not there for a process that needs to be woken up each time it receives a short request. So the real issue is when there is an intelligent device that uses hardware packet classification to place the packet in the correct ring. We don't want to bypass packet filtering, but it would be terribly wasteful to reclassify the packet. Intelligent NICs will have packet classification capabilities to support RDMA and iSCSI. Those capabilities should be available to benefit SOCK_STREAM and SOCK_DGRAM users as well, without it being a choice of either turning all stack control over to the NIC or ignoring all NIC capabilities beyond pretending to be a dumb Ethernet NIC. For example, counting packets within an approved connection is a valid goal that the final solution should support. But would a simple count be sufficient, or do we truly need the full flexibility currently found in netfilter? Obviously all of this does not need to be resolved in the first implementation.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
[EMAIL PROTECTED] wrote: Ok I have comments already just glancing at the initial patch. With the 32-bit descriptors in the channel, you indeed end up with a fixed-size pool with a lot of hard-to-finesse sizing and lookup problems to solve. So what I wanted to do was finesse the entire issue by simply side-stepping it initially. Use a normal buffer with a tail descriptor; when you enqueue, you give a tail descriptor pointer. Yes, it's weirder to handle this in hardware, but it's not impossible, and using real pointers means two things: 1) You can design a simple netif_receive_skb() channel that works today; encapsulation of channel buffers into an SKB is like 15 lines of code and no funny lookups. 2) People can start porting the input path of drivers right now and retain full functionality and test anything they want. This is important for getting the drivers stable as fast as possible. And it also means we can tackle the buffer pool issue of the 32-bit descriptors later, if we actually want to do things that way, I think we probably don't. To be honest, I don't think using a 32-bit descriptor is so critical even from a hardware implementation perspective. Yes, on 64-bit you're dealing with a 64-bit quantity so the number of entries in the channel is halved from what a 32-bit arch uses. Yes I say this for 2 reasons: 1) We have no idea whether it's critical to have ~512 entries in the channel which is about what a u32 queue entry type affords you on x86 with 4096 byte page size. 2) Furthermore, it is sized by page size, and most 64-bit platforms use an 8K base page size anyways, so the number of queue entries ends up being the same. Yes, I know some 64-bit platforms use a 4K page size, please see #1 :-) I really dislike the pools of buffers, partly because they are fixed size (or dynamically sized and even more expensive to implement), but more so because there is all of this absolutely stupid state management you eat just to get at the real data.
That's pointless, we're trying to make this as light as possible. Just use real pointers and describe the packet with a tail descriptor. We can use a u64 or whatever in a hardware implementation. Next, you can't even begin to work on the protocol channels before you do one very important piece of work. Integration of all of the ipv4 and ipv6 protocol hash tables into central code is a total prerequisite. Then you modify things to use a generic inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a protocol number as well as saddr/daddr/sport/dport and searches from a central table. So I think I'll continue working on my implementation, it's more transitional and that's how we have to do this kind of work. - The major element I liked about Kelly's approach is that the ring is clearly designed to allow a NIC to place packets directly into a ring that is directly accessible by the user. Evolutionary steps are good, but isn't direct placement into a user-accessible simple ring buffer the ultimate justification of netchannels? But that doesn't mean that we have to have a very artificial definition of the ring based on presumptions that hardware only understands 512n sized buffers. Hardware today is typically just as smart as the processors that IP networks were first designed on, if not more so. Central integration will also need to be coordinated with packet filtering. In particular, once a flow has been assigned to a netchannel ring, who is responsible for doing the packet filtering? Or is it enough to check the packet filter when the net channel flow is created?
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
David S. Miller wrote: I personally think allowing sockets to trump firewall rules is an acceptable relaxation of the rules in order to simplify the implementation. I agree. I have never seen a set of netfilter rules that would block arbitrary packets *within* an established connection. Technically you can create such rules, but every single set of rules actually deployed that I have ever seen started with a rule to pass all packets for established connections, and then proceeded to control which connections could be initiated or accepted.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
Jeff Garzik wrote: Caitlin Bestler wrote: David S. Miller wrote: I personally think allowing sockets to trump firewall rules is an acceptable relaxation of the rules in order to simplify the implementation. I agree. I have never seen a set of netfilter rules that would block arbitrary packets *within* an established connection. Technically you can create such rules, but every single set of rules actually deployed that I have ever seen started with a rule to pass all packets for established connections, and then proceeded to control which connections could be initiated or accepted. Oh, there are plenty of examples of filtering within an established connection: input rules. I've seen drop all packets from these IPs type rules frequently. Victims of DoS use those kinds of rules to stop packets as early as possible. Jeff If you are dropping all packets from IP X, then how was the connection established? Obviously we are only dealing with connections that were established before the rule to drop all packets from IP X was created. That calls for an ability to revoke the assignment of any flow to a vj_netchannel when a new rule is created that would filter any packet that would be classified by the flow. Basically the rule is that a delegation to a vj_netchannel is only allowed for flows where *all* packets assigned to that flow (input or output) would receive a 'pass' from netfilter. That makes sense. What I don't see a need for is examining *each* delegated packet against the entire set of existing rules. Basically, a flow should either be rule-compliant or not. If it is not, then the delegation of the flow should be abandoned. If that requires re-importing TCP state, then perhaps the TCP connection needs to be aborted. In any event, if netfilter is selectively rejecting packets in the middle of a connection then the connection is going to fail anyway.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
David S. Miller wrote: From: Jeff Garzik [EMAIL PROTECTED] Date: Wed, 26 Apr 2006 15:46:58 -0400 Oh, there are plenty of examples of filtering within an established connection: input rules. I've seen drop all packets from these IPs type rules frequently. Victims of DoS use those kinds of rules to stop packets as early as possible. Yes, good point, but this applies to listening connections. We'll need to figure out a way to deal with this. It occurs to me that for established connections, netfilter can simply remove all matching entries from the netchannel lookup tables. But that still leaves the thorny listening socket issue. This may by itself make netfilter netchannel support important, and that brings up a lot of issues about classifier algorithms. All of this I wanted to avoid as we start this work :-) We can think about how to approach these other problems and start with something simple meanwhile. That seems to me to be the best approach moving forward. It's important to start really simple, else we'll just keep getting bogged down in complexity and details and never implement anything. How does this sound? The netchannel qualifiers should only deal with TCP packets for established connections. Listens would continue to be dealt with by the existing stack logic, vj_channelizing only occurring when the connection was accepted. The vj_netchannel qualifiers would conceptually take place before the netfilter rules (to avoid making deployment of netchannels dependent on netfilter) but their creation would have to be approved by netfilter (if netfilter was active). Netfilter could also revoke vj_channel qualifiers. The rule that a vj_netchannel may only exist if netfilter approves of it is actually very easy to implement. During early development you simply tell the testers "hey, don't set up any netchannels that netfilter would reject" and defer implementing enforcement until after the netchannels code actually works.
After all, if it isn't actually successfully transmitting or receiving packets yet, it can't really be acting contrary to netfilter policy.
RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
[EMAIL PROTECTED] wrote: From: Caitlin Bestler [EMAIL PROTECTED] Date: Wed, 26 Apr 2006 15:53:44 -0700 The netchannel qualifiers should only deal with TCP packets for established connections. Listens would continue to be dealt with by the existing stack logic, vj_channelizing only occurring when the connection was accepted. I consider netchannel support for listening TCP sockets to be absolutely essential. - Meaning that inbound SYNs would be placed in a net channel for processing by a Consumer at the other end of the ring? If so, the rules filtering SYNs would have to be applied either before it went into the ring, or when the consumer end takes them out. The latter makes more sense to me, because the rules about what remote hosts can initiate a connection request to a given TCP port can be fairly complex for a variety of legitimate reasons. Would it be reasonable to state that a net channel carrying SYNs should not be set up when the consumer is a user-mode process?
RE: Van Jacobson's net channels and real-time
[EMAIL PROTECTED] wrote: Subject: Re: Van Jacobson's net channels and real-time On Mon, 24 Apr 2006, Auke Kok wrote: Ingo Oeser wrote: On Saturday, 22. April 2006 15:49, Jörn Engel wrote: That was another main point, yes. And the endpoints should be as little burden on the bottlenecks as possible. One bottleneck is the receive interrupt, which shouldn't wait for cachelines from other CPUs too much. That's right. This will be made a non-issue with early demuxing on the NIC and MSI (or was it MSI-X?), which will select the right CPU based on hardware channels. MSI-X. With MSI you still have only one CPU handling all MSI interrupts, and that doesn't look any different than ordinary interrupts. MSI-X will allow much better interrupt handling across several CPUs. Auke - Message signaled interrupts are just a kludge to save a trace on a PC board (read: make junk cheaper still). They are not faster and may even be slower. They will not be the salvation of any interrupt latency problems. The solution for increasing networking speed, where the bit-rate on the wire gets close to the bit-rate on the bus, is to put more and more of the networking code inside the network board. The CPU gets interrupted after most things (like network handshakes) are complete. The number of hardware interrupts supported is a bit out of scope. Whatever the capacity is, the key is to have as few meaningless interrupts as possible. In the context of netchannels this would mean that an interrupt should only be fired when there is a sufficient number of packets for the user-mode code to process. Fully offloading the protocol to the hardware is certainly one option, which I also think makes sense, but the goal of netchannels is to try to optimize performance while keeping TCP processing on the host. More hardware offload is distinctly possible and relevant in this context. Stateful offloads, such as TSO, are fully relevant.
Going directly from the NIC to the channel is also possible (after the channel is set up by the kernel, of course). If the NIC is aware of the channels directly then interrupts can be limited to packets that cross per-channel thresholds configured directly by the ring consumer.
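The per-channel threshold idea above amounts to simple interrupt mitigation: stay silent until a consumer-configured batch of frames has accumulated. The structure and names below are a hypothetical sketch, not any real NIC's register layout.

```c
/* Sketch of per-channel interrupt mitigation: the NIC raises an
 * interrupt only when the number of undelivered frames on a channel
 * crosses a threshold configured by the ring consumer.
 * Names and layout are hypothetical. */
#include <stdbool.h>

struct channel {
        unsigned int pending;     /* frames queued, not yet signalled */
        unsigned int threshold;   /* consumer-configured batch size */
};

/* Called per delivered frame; returns true when an interrupt
 * should fire, i.e. when the batch is full. */
static bool channel_enqueue(struct channel *ch)
{
        if (++ch->pending < ch->threshold)
                return false;     /* batch not yet full: stay silent */
        ch->pending = 0;          /* signal and start the next batch */
        return true;
}
```

A real design would also need a timeout so a partial batch is never stranded, but the core accounting is this small.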
RE: [PATCH 2/6] IB: match connection requests based on private data
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Sean Hefty Sent: Monday, March 06, 2006 11:04 AM To: 'Roland Dreier' Cc: netdev@vger.kernel.org; linux-kernel@vger.kernel.org; openib-general@openib.org Subject: [PATCH 2/6] IB: match connection requests based on private data Extend matching connection requests to listens in the Infiniband CM to include private data checks. This allows applications to listen on the same service identifier, with private data directing the request to the appropriate application. Signed-off-by: Sean Hefty [EMAIL PROTECTED] The term private data is intended to convey the intent that the data is private to the application layer and is opaque to middleware and the network. By what mechanism does the listening application delegate how much of the private data is for use by the CM in sub-dividing a listen? What does an application do if it wishes to retain full ownership of the private data?
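One plausible answer to the delegation question is a compare-under-mask scheme: the listener registers a compare value plus a mask saying which bytes of the REQ's private data it cedes to the CM, and bytes outside the mask stay opaque. The sketch below illustrates that idea; the structure and names are invented for illustration and are not the actual ib_cm interface.

```c
/* Sketch of private-data directed listens via compare-under-mask.
 * A listener that wants full ownership of its private data simply
 * registers an all-zero mask.  Hypothetical types and names. */
#include <stdbool.h>

#define PDATA_LEN 92   /* length of REQ private data in the IB CM */

struct pdata_match {
        unsigned char data[PDATA_LEN];
        unsigned char mask[PDATA_LEN];  /* 0xff = CM may inspect byte */
};

/* Returns true if the incoming REQ's private data matches this
 * listener in every byte the listener delegated to the CM. */
static bool pdata_matches(const struct pdata_match *m,
                          const unsigned char *req_pdata)
{
        for (int i = 0; i < PDATA_LEN; i++)
                if ((req_pdata[i] ^ m->data[i]) & m->mask[i])
                        return false;
        return true;
}
```

With this shape, the application, not the CM, chooses how much of the data becomes demultiplexing key, which directly addresses the two questions in the message.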
RE: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Stevens Sent: Monday, March 06, 2006 11:49 AM To: Michael S. Tsirkin Cc: Linux Kernel Mailing List; netdev@vger.kernel.org Subject: Re: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP I don't know any details about SDP, but if there are no differences at the protocol layer, then neither the address family nor the protocol is appropriate. If it's just an API change, the socket type is the right selector. So, maybe SOCK_DIRECT to go along with SOCK_STREAM, SOCK_DGRAM, etc. +-DLS That wouldn't work either. The whole point of SDP, or TOE, is that the API is either totally unchanged or at least essentially unchanged. Whenever an IP Address is used (SDP/iWARP, TOE and potentially SDP/IB) changing from AF_INET* is wrong. For both SDP/iWARP and SDP/IB you could argue that a different wire protocol is in use so IPPROTO_SDP is acceptable. That's probably the best answer as long as we are stuck under the restriction that the selection of an alternate stack cannot be done in the exact manner that the consumer wants it done (that is transparently to the application). There are even some corner case scenarios where the application might care whether their SOCK_STREAM was carried over SDP or plain TCP. So a protocol based distinction is probably the least misleading of all the explicit selection options.
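From the application side, the protocol-based selection argued for above is one extra argument to socket(2) with a graceful fallback. The sketch below is user-space and runnable today; note that the IPPROTO_SDP value used is a placeholder of my own, not an assigned protocol number.

```c
/* Sketch of protocol-based SDP selection: ask for SDP via a distinct
 * IPPROTO value, and fall back to plain TCP when the kernel has no
 * SDP support.  Everything else about the sockets API is unchanged,
 * which is the whole point of SDP. */
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef IPPROTO_SDP
#define IPPROTO_SDP 123   /* placeholder value; not an assigned number */
#endif

/* Returns a stream socket, preferring SDP but degrading to TCP. */
static int stream_socket(void)
{
        int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SDP);
        if (fd < 0)   /* kernel lacks SDP: same API, ordinary TCP */
                fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        return fd;
}
```

On a stock kernel the first call fails with EPROTONOSUPPORT and the fallback path runs, which shows why this selector is so low-impact: connect, send, and recv then proceed identically in either case.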
RE: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP
-Original Message- From: David Stevens [mailto:[EMAIL PROTECTED] Sent: Monday, March 06, 2006 12:32 PM To: Caitlin Bestler Cc: Linux Kernel Mailing List; Michael S. Tsirkin; netdev@vger.kernel.org Subject: RE: RFC: move SDP from AF_INET_SDP to IPPROTO_SDP IPPROTO_* should match the protocol field on the wire, which I gather isn't different. And I'm assuming there is no standard API defined already... SDP uses the existing standard sockets API. That was the intent in its design, and it is the sole justification for its use. If you are not using the existing sockets API then your application would be *far* better off coding directly to RDMA. The wire protocol *is* different: it uses RDMA. There is some justification for the application knowing this, albeit a slight one. For example, you need to know if the peer supports SDP, and it might affect how intermediate firewalls need to be configured.
RE: [RFC] Some infrastructure for interrupt-less TX
[EMAIL PROTECTED] wrote: Below patch wasn't even compile tested. I'm not involved with network drivers anymore, so my personal interest is fairly low. But since I firmly believe in the advantages and feasibility of interrupt-less TX, there should at least be an ugly broken patch to flame about. Go for it, tell me how stupid I am! Jörn I am assuming the real goal is avoiding interrupts when transmit completions can be reported without them on a reasonably periodic basis. Wouldn't that goal be achievable by the type of transmit buffer ring implied for net channels?
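The transmit-ring approach alluded to above usually means the NIC writes its consumer index back to host memory and the driver reclaims completed descriptors by polling that index from a timer or from the next transmit, rather than taking an interrupt per completion. The ring layout and names below are an illustrative sketch, not any driver's actual structures.

```c
/* Sketch of interrupt-less TX reclaim: the NIC advances hw_tail in
 * host memory as it finishes sending; the driver catches up to it
 * opportunistically, with no completion interrupt.
 * Hypothetical structure and names. */
#include <stddef.h>

#define TX_RING_SIZE 256

struct tx_ring {
        unsigned int sw_tail;         /* oldest unreclaimed descriptor */
        unsigned int hw_tail;         /* consumer index written by NIC */
        void *skb[TX_RING_SIZE];      /* buffers awaiting completion */
};

/* Reclaim everything the hardware reports as sent.  Called from a
 * periodic timer or the next transmit path, never from an IRQ.
 * Returns the number of descriptors freed. */
static unsigned int tx_reclaim(struct tx_ring *r)
{
        unsigned int freed = 0;
        while (r->sw_tail != r->hw_tail) {
                r->skb[r->sw_tail] = NULL;   /* free the completed buffer */
                r->sw_tail = (r->sw_tail + 1) % TX_RING_SIZE;
                freed++;
        }
        return freed;
}
```

The trade-off is that reclaim latency is now bounded by the polling period rather than the interrupt, so the ring must be sized for the worst-case gap between polls.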
Re: [openib-general] Re: [Rdma-developers] Meeting (07/22)summary:OpenRDMA community development discussion
Generally there are two cases to consider: when the TCP mode is not visible and when it is. When it is not visible it is certainly easy to manage the TCP connection with subset logic within the RDMA stack and never involve the host stack. This is certainly what the initial proposal will rely upon. In the long term it has the problems you cited. Having two stacks accept TCP connections means that *both* must be updated to stay current with the latest DoS attacks. While it is more work for the RDMA device, I think there is general agreement among the hardware vendors that this is something that the OS *should* retain control of. Deciding which connections may be accepted is inherently an OS function. Beyond that there is a distinct programming model, already accepted in IETF specifications, that requires the application to begin work in streaming (i.e., socket) mode, and then only convert to RDMA mode once the two peers have agreed upon that optimization. To support that model you will eventually have to allow the host stack to transfer a TCP connection to the RDMA stack *or* you will require the RDMA stack to provide full TCP/socket functionality. So the real question is not whether to allow the RDMA stack to take a connection from the host stack, but whether to force the RDMA stack to yield control of the connection to the host during critical connection setup so that this step remains firmly under OS control and oversight. On 8/2/05, Tom Tucker [EMAIL PROTECTED] wrote: 'Christoph Hellwig' wrote: Can you provide more details on exactly why you think this is a horrible idea? I agree it will be complex, but it _could_ be scoped such that the complexity is reduced. For instance, the offload function could fail (with EBUSY or something) if there is _any_ data pending on the socket. Thus removing any requirement to pass down pending unacked outgoing data, or pending data that has been received but not yet read by the application.
The idea here is that the applications at the top know they are going into RDMA mode and have effectively quiesced the connection before attempting to move the connection into RDMA mode. We could, in fact, _require_ the connection be quiesced to keep things simpler. I'm quickly sinking into gory details, but I want to know if you have other reasons (other than the complexity) for why this is a bad idea. I think your writeup here is more than explanation enough. The offload can only work for few special cases, and even for those it's rather complicated, especially if you take things such as ipsec or complex tunneling, which get more and more common, into account. I think Steve's point was that it *can* be simplified as necessary to meet the demands/needs of the Linux community. It is certainly technically possible, but agreeably complicated, to offload an active socket. What do you achieve by implementing the offload except trying to make it look more integrated to the user than it actually is? Just offload rdma protocols to the RDMA hardware and keep the IP stack out of that complexity. You get the benefit of things like SYN flood DoS attack avoidance built into the host stack without replicating this functionality in the offloaded adapter. There are other benefits of integration like failover, etc... IMHO, however, the bulk of the benefits are for ULP offload like RDMA where the remote peer may not be capable of HW RDMA acceleration. This kind of thing could be determined in streaming mode using the host stack and then migrated to an adapter for HW acceleration only if the remote peer is capable.
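The scoped offload Steve describes, refusing the handoff with EBUSY unless the stream is fully quiesced, reduces to a two-field check. The sketch below is a toy model with invented names; the real kernel/driver boundary would of course involve far more state.

```c
/* Sketch of quiesced-only connection handoff: moving a connection
 * into RDMA mode fails with EBUSY if any data is pending in either
 * direction, removing any need to migrate in-flight stream state.
 * Hypothetical structure and names. */
#include <errno.h>

struct conn_state {
        unsigned int unacked_bytes;   /* sent but not yet acked */
        unsigned int unread_bytes;    /* received, not yet read by app */
        int rdma_mode;
};

static int rdma_offload(struct conn_state *c)
{
        if (c->unacked_bytes || c->unread_bytes)
                return -EBUSY;   /* caller must quiesce the stream first */
        c->rdma_mode = 1;        /* hand the 4-tuple to the RDMA stack */
        return 0;
}
```

Because the applications have already negotiated the switch to RDMA mode at the ULP level, hitting EBUSY is a protocol error on their part rather than a case the kernel must paper over.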
___ openib-general mailing list openib-general@openib.org http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general