Re: Linux TCP in the presence of delays or drops...
Hi David,

My intention when I wrote the second mail was just to provide some more
examples that further elaborate my first question. But as you noticed, I
couldn't resist the temptation to slip a couple of new questions into the
new post :-(... Sorry, and I will take your advice into consideration in
my future postings. Thanks for the tip!!

Regards,
Oumer

David Miller wrote:
> From: Oumer Teyeb <[EMAIL PROTECTED]>
> Date: Mon, 31 Jul 2006 19:49:28 +0200
>
> > it would be so great if some of you could spare a few minutes and
> > take a look at the traces I provided. See below for the original
> > posting...
>
> If people are too backlogged and busy to reply to your original
> posting, you will only ensure that it will take even longer by
> bombarding the list with even more information and questions on top
> of your original large query.
>
> Just be patient.

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Netchannles: first stage has been completed. Further ideas.
On Mon, 2006-07-31 at 21:47 -0700, David Miller wrote:
> From: Rusty Russell <[EMAIL PROTECTED]>
> Date: Fri, 28 Jul 2006 15:54:04 +1000
>
> > (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
> > holds (some subset of?) flows. A successful lookup immediately after
> > packet comes off NIC gives destiny for packet: what route, (optionally)
> > what socket, what filtering, what connection tracking (& what NAT), etc?
> > I don't know if this should be a general array of fn & data ptrs, or
> > specialized fields for each one, or a mix. Maybe there's a "too hard,
> > do slow path" bit, or maybe hard cases just never get put in the cache.
> > Perhaps we need a separate one for locally-generated packets, a-la
> > ip_route_output(). Anyway, we trade slightly more expensive flow setup
> > for faster packet processing within flows.
>
> So, specifically, one of the methods you are thinking about might
> be implemented by adding:
>
>     void (*input)(struct sk_buff *, void *);
>     void *input_data;
>
> to "struct flow_cache_entry" or whatever replaces it?

Probably needs a return value to indicate stop packet processing, and to
be completely general I think we'd want more than one, eg:

    #define MAX_GUFC_INPUTS 5

    unsigned int num_inputs;
    int (*input[MAX_GUFC_INPUTS])(struct sk_buff *, void *);
    void *input_data[MAX_GUFC_INPUTS];

> This way we don't need some kind of "type" information in
> the flow cache entry, since the input handler knows the type.

Some things may want to jam more than a pointer into the cache entry, so
we might do something clever later, but as a first cut this would seem
to work.

> > One way to do this is to add a "have_interest" callback into the
> > hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
> > destinies this hook cares about. In the case of packet filtering this
> > would do a traversal and append a fn/data ptr to the entry for each rule
> > which could effect it.
>
> Can you give a concrete example of how the GUFC might make use
> of this? Just some small abstract code snippets will do.

OK, I take it back. I was thinking that on a miss, the GUFC called into
each subsystem to populate the new GUFC entry. That would be a radical
departure from the current code, so forget it.

So, on a GUFC miss, we could create a new GUFC entry (on stack?), hang
it off the skb, then each subsystem adds to it as we go through. At
some point (handwave?) we collect the skb->gufc and insert it into the
trie.

For iptables, as a first step we'd simply do (open-coded for now):

    /* FIXME: Do acceleration properly */
    struct gufc *gufc = skb->gufc;

    if (!gufc || gufc->num_inputs == MAX_GUFC_INPUTS) {
        skb->gufc = NULL;
    } else {
        gufc->input[gufc->num_inputs] = traverse_entire_table;
        gufc->input_data[gufc->num_inputs++] = this_table;
    }

Later we'd get funky:

    /* Filtering code here */
    ...
    if (num_rules_applied > 1 || !only_needed_flow_info) {
        gufc->input[gufc->num_inputs] = traverse_entire_table;
        gufc->input_data[gufc->num_inputs++] = this_table;
    } else if (num_rules_applied == 1) {
        gufc->input[gufc->num_inputs] = traverse_one_rule;
        gufc->input_data[gufc->num_inputs++] = last_rule;
    }

Note that this could be cleverer, too:

    if (result == NF_DROP && only_needed_flow_info) {
        // Who cares about other inputs, we're going to drop
        gufc->input[0] = drop_skb;
        gufc->num_inputs = 1;
    }

Two potential performance issues:

1) When we change rules, iptables replaces entire table from userspace.
   We need pkttables (which uses incremental rule updates) to flush
   intelligently.

2) Every iptables rule currently keeps pkt/byte counters, meaning we
   can't bypass rules even though they might have no effect on the
   packet (eg. iptables -A INPUT -i eth0 -j ETH0_RULES). We can address
   this by having pkt/byte counters in the gufc entry and a method of
   pushing them back to iptables when the gufc entry is pruned, and
   manually traversing the trie to flush them when the user asks for
   counters.

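The multi-handler entry discussed above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: `struct sk_buff` here is a one-field stand-in, and `gufc_add_input()`, `gufc_run()`, `count_packet()` and the `GUFC_CONTINUE`/`GUFC_STOP` return codes are hypothetical names that do not appear in the thread. Only `MAX_GUFC_INPUTS`, the three struct fields and the `drop_skb` handler name are taken from the mail.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GUFC_INPUTS 5

/* Stand-in for the kernel's sk_buff; only what the sketch needs. */
struct sk_buff { int len; };

/* Hypothetical return codes: the mail only says a return value is needed
 * "to indicate stop packet processing". */
enum { GUFC_CONTINUE = 0, GUFC_STOP = 1 };

typedef int (*gufc_input_fn)(struct sk_buff *skb, void *data);

struct gufc {
    unsigned int num_inputs;
    gufc_input_fn input[MAX_GUFC_INPUTS];
    void *input_data[MAX_GUFC_INPUTS];
};

/* Append one handler. Returns 0 on success, -1 when the entry is full,
 * in which case a caller would give up on caching this flow. */
int gufc_add_input(struct gufc *g, gufc_input_fn fn, void *data)
{
    assert(g != NULL && fn != NULL);
    if (g->num_inputs == MAX_GUFC_INPUTS)
        return -1;
    g->input[g->num_inputs] = fn;
    g->input_data[g->num_inputs++] = data;
    return 0;
}

/* Run the handlers in order until one asks to stop.
 * Returns how many handlers were actually invoked. */
unsigned int gufc_run(const struct gufc *g, struct sk_buff *skb)
{
    unsigned int i;

    for (i = 0; i < g->num_inputs; i++)
        if (g->input[i](skb, g->input_data[i]) == GUFC_STOP)
            return i + 1;
    return i;
}

/* Example handler: bump a packet counter and keep going. */
int count_packet(struct sk_buff *skb, void *data)
{
    (void)skb;
    ++*(unsigned long *)data;
    return GUFC_CONTINUE;
}

/* Example handler: "drop" the packet and stop processing. */
int drop_skb(struct sk_buff *skb, void *data)
{
    (void)data;
    skb->len = 0;
    return GUFC_STOP;
}
```

A subsystem would call `gufc_add_input()` while the packet walks the slow path, and a later cache hit would call `gufc_run()` on the stored entry. Handlers registered after one that returns `GUFC_STOP` are never reached.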
> I had the idea of a lazy scheme. When we create a GUFC entry, we > tack it onto a DMA'able linked list the card uses. We do not > notify the card, we just entail the update onto the list. > > Then, if the card misses it's on-chip GUFC table on an incoming > packet, it checks the DMA update list by reading it in from memory. > It updates it's GUFC table with whatever entries are found on this > list, then it retries to classify the packet. I had assumed we would simply do full lookup on non-hw-classified packets, so async insertion is a non-issue. Can we assume hardware will cover entire GUFC trie? > This seems like a possible good solution until we try to address GUFC > entry deletion, which unfortunat
Re: [RFC 1/4] kevent: core files.
On Mon, Jul 31, 2006 at 03:00:28PM -0700, David Miller ([EMAIL PROTECTED]) wrote:
> From: Evgeniy Polyakov <[EMAIL PROTECTED]>
> Date: Mon, 31 Jul 2006 23:41:43 +0400
>
> > Since kevents are never generated by kernel, but only marked as ready,
> > length of the main queue performs as flow control, so we can create a
> > mapped buffer which will have space equal to the main queue length
> > multiplied by size of the copied to userspace structure plus 16 bits for
> > the start index of the kernel writing side, i.e. it will store offset
> > where the oldest event was placed.
> >
> > Since queue length is a limited factor and thus no new events can be added
> > when queue is full, that means that buffer is full too and userspace
> > must read events. When syscall is called to add new kevent and provided
> > there offset differs from what kernel stored, that means that all events
> > from kernel to provided index have been read and new events can be added.
> > Thus we can even allow read-only mapping. Kernel's index is incremented
> > modulo queue length. If kevent was removed after it was marked as
> > ready, it's copy stays in the mapped buffer, but special flag can be
> > assigned to show that kevent is no longer valid.
>
> This sounds reasonable.
>
> However we must be mindful that the thread of control trying to
> add a new event might not be in a position to drain the queue
> of pending events when the queue is full. Usually he will be
> trying to add an event in response to handling another event.
>
> So we'd have cases like this, assume we start with a full event
> queue:
>
>     thread A                  thread B
>
>     dequeue event
>     aha, new connection
>     accept()
>                               register new kevent
>                               queue is now full again
>     add kevent on new
>     connection
>
> At this point thread A doesn't have very many options when the kevent
> add fails. You cannot force this thread to read more events, since he
> may not be in a state where he is easily able to do so.

By default, kevents are not removed from the queue, so the accept events
will stay in the queue and thread B will fail to register a new kevent.
To remove a kevent from the queue, the user should either set the
one-shot flag or remove it with a special command.
So if the queue is full and none of the events are one-shot, the control
thread must reconsider what it is doing and remove some of them (and
next time add them with the one-shot flag).

--
Evgeniy Polyakov
Re: FACK and CWND
From: "Ma Lin" <[EMAIL PROTECTED]>
Date: Fri, 28 Jul 2006 18:37:15 +0800

> In FACK, the holes between SACK blocks are considered as loss. To a
> sender, when SACK comes in, loss_out would be non-zero. According to
> linux-2.6.17.7/net/ipv4/tcp_input.c, function tcp_time_to_recover(),
> this non-zero loss_out will send the sender into "Recovery" state,
> in which, CWND could be reduced. In one word, it seems that, FACK
> would allow SACK holes to reduce CWND.

That's right, because when tp->lost_out is set we have some form of
absolute proof that packets were lost.

Note that even when not receiving SACK blocks, ie. pure Reno, we
emulate the SACK information the best we can.

So, if we have real SACK blocks, tcp_update_scoreboard() will
mark all packets in the retransmit queue up to "fackets_out" minus
"reordering" as lost. Else, for non-SACK, only the head packet
in the retransmit queue will be marked as lost.

> However, in the paper "Congestion Control in Linux TCP", Section 3,
> subsection Recovery, it says that Recovery state is triggered by
> "sufficient amount of successive duplicate ACK", to my understand,
> that means 3-dup.

Under Linux it has a more complicated definition. We wait until we see
at least "tp->reordering" packets lost. Dynamically we try to
determine how deeply packets are being reordered on the connection.
Using this value, we use "tp->fackets_out - tp->reordering" as how
many packets we think have been proven as lost.

You will note that any code path that falls through to the end of
tcp_fastretrans_alert() will retransmit one packet using a call to
tcp_xmit_retransmit_queue().

And one such code path is the transition to TCP_CA_Recovery which
is guarded by the tcp_time_to_recover() check, which encapsulates
the two tests we've discussed as:

    if (tp->lost_out)
        return 1;

    if (tcp_fackets_out(tp) > tp->reordering)
        return 1;

The next few checks try to handle some fringe cases, such as the
head packet in the retransmit queue having been sent more than
an RTO ago, and also having so few packets in the retransmit queue
that normal recovery mechanisms cannot function properly:

    if (tcp_head_timedout(sk, tp))
        return 1;

    packets_out = tp->packets_out;
    if (packets_out <= tp->reordering &&
        tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
        !tcp_may_send_now(sk, tp)) {
        return 1;
    }

Hope this helps.
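To experiment with the four checks outside the kernel, they can be collected into one function over a plain struct. This is a simplified model, not the kernel source: `struct tcp_state`, `time_to_recover()` and the boolean fields `head_timedout` and `may_send_now` are assumptions standing in for `struct tcp_sock`, `tcp_head_timedout()` and `tcp_may_send_now()`. The Reno emulation inside `tcp_fackets_out()` is left out, so `fackets_out` is used directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the fields of struct tcp_sock used below. */
struct tcp_state {
    unsigned int lost_out;     /* packets marked lost */
    unsigned int fackets_out;  /* packets up to the highest SACKed one */
    unsigned int sacked_out;   /* packets SACKed by the receiver */
    unsigned int packets_out;  /* packets currently in flight */
    unsigned int reordering;   /* current reordering estimate */
    bool head_timedout;        /* head of retransmit queue older than an RTO */
    bool may_send_now;         /* sender could still transmit new data */
};

/* Returns 1 when the model would enter Recovery, 0 otherwise. */
int time_to_recover(const struct tcp_state *tp, unsigned int sysctl_reordering)
{
    unsigned int threshold;

    assert(tp != NULL);

    /* Some packets are already proven lost. */
    if (tp->lost_out)
        return 1;

    /* More packets sit below the forward-most SACK than reordering explains. */
    if (tp->fackets_out > tp->reordering)
        return 1;

    /* Fringe case 1: the head packet has waited longer than an RTO. */
    if (tp->head_timedout)
        return 1;

    /* Fringe case 2: too few packets in flight for normal recovery to work. */
    threshold = tp->packets_out / 2;
    if (threshold < sysctl_reordering)
        threshold = sysctl_reordering;
    if (tp->packets_out <= tp->reordering &&
        tp->sacked_out >= threshold &&
        !tp->may_send_now)
        return 1;

    return 0;
}
```

With `reordering` at 3, a connection with `fackets_out == 3` and nothing marked lost stays out of Recovery, while `fackets_out == 4` trips the second test. That is the point made above about the trigger being a moving threshold rather than a fixed three duplicate ACKs.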
Re: [RFC] gre: transparent ethernet bridging
On Tue, 01 Aug 2006 11:15:29 +1000
Philip Craig <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger wrote:
> > On Mon, 31 Jul 2006 20:06:41 +1000
> > Philip Craig <[EMAIL PROTECTED]> wrote:
> >
> >> This patch implements transparent ethernet bridging for gre tunnels.
> >> There are a few outstanding issues.
> >
> > Why not use existing bridge code?
>
> It does use the existing bridge code. Perhaps the name is misleading.
> All it does is encapsulate the full ethernet header in a gre packet,
> rather than only layer 3. That is, currently gre uses ARPHRD_IPGRE,
> but bridging requires ARPHRD_ETHER.
>

I am not against making the bridge code smarter to handle other
encapsulation.

> >> Some routers set LLC_SAP_BSPAN in the gre protocol field, and then
> >> give the bpdu packet without any other ethernet/llc header. This patch
> >> currently tries to fake the ethernet/llc header before passing the
> >> packet up, but it is buggy (mac addresses are wrong at least). Maybe a
> >> better approach is to call directly into the bridging code. I didn't try
> >> that at first because it isn't modular, and may break other things that
> >> want to see the packet.
> >
> > Existing bridge code already has spanning tree.
>
> Yes, and I want to use that. But this packet is a bit strange in
> that it does not have the ethernet header on it. So what is the
> best way to pass it to existing code? Either fake the ethernet
> header, or pass it directly?

Likewise if the bridge STP bpdu input code was smarter, it could deal
with it maybe?

> >
> >> +#if 0
> >>      dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
> >> +#else
> >> +    dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup);
> >> +#endif
> >
> > "Do, or do not there is no try"
>
> I am looking for comments as to whether adding a netlink interface
> to control this is appropriate.

If we make bridge code type aware, then the ipgre tunnel wouldn't have
to change.

> >> +__be16 ipgre_type_trans(struct sk_buff *skb, int offset)
> >> +{
> >> +    u8 *h = skb->data;
> >> +    __be16 flags = *(__be16*)h;
> >> +    __be16 proto = *(__be16*)(h + 2);
> >> +
> >> +    /* WCCP version 1 and 2 protocol decoding.
> >> +     * - Change protocol to IP
> >> +     * - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
> >> +     */
> >> +    if (flags == 0 &&
> >> +        proto == __constant_htons(ETH_P_WCCP)) {
> >> +        proto = __constant_htons(ETH_P_IP);
> >> +        if ((*(h + offset) & 0xF0) != 0x40)
> >> +            offset += 4;
> >> +    }
> >
> > Don't use __constant_htons() except in initializers and switch cases
> > (where gcc is too stupid to optimize the macro).
> >
>
> This is a problem in the existing code, which I am simply moving
> around. Should I fix it at the same time?

Usually if a diff touches some code, I try to make it use current
practice.
Re: Netchannles: first stage has been completed. Further ideas.
From: Rusty Russell <[EMAIL PROTECTED]>
Date: Fri, 28 Jul 2006 15:54:04 +1000

> (1) I am imagining some Grand Unified Flow Cache (Olsson trie?) that
> holds (some subset of?) flows. A successful lookup immediately after
> packet comes off NIC gives destiny for packet: what route, (optionally)
> what socket, what filtering, what connection tracking (& what NAT), etc?
> I don't know if this should be a general array of fn & data ptrs, or
> specialized fields for each one, or a mix. Maybe there's a "too hard,
> do slow path" bit, or maybe hard cases just never get put in the cache.
> Perhaps we need a separate one for locally-generated packets, a-la
> ip_route_output(). Anyway, we trade slightly more expensive flow setup
> for faster packet processing within flows.

So, specifically, one of the methods you are thinking about might
be implemented by adding:

    void (*input)(struct sk_buff *, void *);
    void *input_data;

to "struct flow_cache_entry" or whatever replaces it?

This way we don't need some kind of "type" information in
the flow cache entry, since the input handler knows the type.

> One way to do this is to add a "have_interest" callback into the
> hook_ops, which takes each about-to-be-inserted GUFC entry and adds any
> destinies this hook cares about. In the case of packet filtering this
> would do a traversal and append a fn/data ptr to the entry for each rule
> which could effect it.

Can you give a concrete example of how the GUFC might make use
of this? Just some small abstract code snippets will do.

> The other way is to have the hooks register what they are interested in
> into a general data structure which GUFC entry creation then looks up
> itself. This general data structure will need to support wildcards
> though.

My gut reaction is that imposing a global data structure on all object
classes is not prudent. When we take a GUFC miss, it seems better that
we call into the subsystems to resolve things. Each subsystem can
implement whatever slow path lookup algorithm is most appropriate for
its data.

> We also need efficient ways of reflecting rule changes into the GUFC.
> We can be pretty slack with conntrack timeouts, but we either need to
> flush or handle callbacks from GUFC on timed-out entries. Packet
> filtering changes need to be synchronous, definitely.

This, I will remind, is similar to the problem of doing RCU locking
of the TCP hash tables.

> (3) Smart NICs that do some flowid work themselves can accelerate lookup
> implicitly (same flow goes to same CPU/thread) or explicitly (each
> CPU/thread maintains only part of GUFC which it needs, or even NIC
> returns flow cookie which is pointer to GUFC entry or subtree?). AFAICT
> this will magnify the payoff from the GUFC.

I want to warn you about HW issues that I mentioned to Alexey the
other week. If we are not careful, we can run into the same issues
TOE cards run into, performance wise.

Namely, it is important to be careful about how the GUFC table entries
get updated in the card. If you add them synchronously, your
connection rates will deteriorate dramatically.

I had the idea of a lazy scheme. When we create a GUFC entry, we
tack it onto a DMA'able linked list the card uses. We do not
notify the card, we just entail the update onto the list.

Then, if the card misses its on-chip GUFC table on an incoming
packet, it checks the DMA update list by reading it in from memory.
It updates its GUFC table with whatever entries are found on this
list, then it retries to classify the packet.

This seems like a possible good solution until we try to address GUFC
entry deletion, which unfortunately cannot be evaluated in a lazy
fashion. It must be synchronous. This is because if, for example, we
just killed off a TCP socket we must make sure we don't hit the GUFC
entry for the TCP identity of that socket any longer.

Just something to think about, when considering how to translate these
ideas into hardware.
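The lazy-insert, synchronous-delete idea can be illustrated with a small userspace model. Everything here is hypothetical: `struct nic_model`, the fixed-size arrays standing in for the DMA'able update list and the on-chip table, and all function names are assumptions made for the sketch. No real NIC interface is implied, and eviction from a full on-chip table is not modelled.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8

struct flow_table {
    unsigned int ids[TABLE_SIZE];
    size_t count;
};

struct nic_model {
    struct flow_table on_chip;  /* the card's private flow table */
    struct flow_table pending;  /* stands in for the DMA'able update list */
    unsigned int list_reads;    /* how often the card had to read the list */
};

void nic_init(struct nic_model *nic)
{
    assert(nic != NULL);
    nic->on_chip.count = 0;
    nic->pending.count = 0;
    nic->list_reads = 0;
}

/* Returns the index of id in the table, or -1 if absent. */
int table_find(const struct flow_table *t, unsigned int id)
{
    size_t i;

    for (i = 0; i < t->count; i++)
        if (t->ids[i] == id)
            return (int)i;
    return -1;
}

/* Returns 0 on success, -1 if the table is full. */
int table_add(struct flow_table *t, unsigned int id)
{
    if (t->count == TABLE_SIZE)
        return -1;
    t->ids[t->count++] = id;
    return 0;
}

/* Remove id if present by moving the last entry into its slot. */
void table_del(struct flow_table *t, unsigned int id)
{
    int i = table_find(t, id);

    if (i >= 0)
        t->ids[i] = t->ids[--t->count];
}

/* Host side, lazy: queue the entry and do NOT notify the card. */
int host_add_flow(struct nic_model *nic, unsigned int id)
{
    return table_add(&nic->pending, id);
}

/* Card side: on a miss, pull in the update list and retry once.
 * Entries that do not fit on chip are dropped in this model. */
int card_classify(struct nic_model *nic, unsigned int id)
{
    size_t i;

    if (table_find(&nic->on_chip, id) >= 0)
        return 1;

    nic->list_reads++;
    for (i = 0; i < nic->pending.count; i++)
        (void)table_add(&nic->on_chip, nic->pending.ids[i]);
    nic->pending.count = 0;

    return table_find(&nic->on_chip, id) >= 0;
}

/* Host side, synchronous: the flow must be gone from both places
 * before this returns, so a dead socket can never be matched again. */
void host_del_flow(struct nic_model *nic, unsigned int id)
{
    table_del(&nic->pending, id);
    table_del(&nic->on_chip, id);
}
```

The asymmetry is visible in the call counts. Adding a flow costs the card nothing until the first miss, whereas `host_del_flow()` has to touch the card's table immediately.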
Re: [parisc-linux] [git patches] tulip fixes from parisc-linux
On Sun, Jul 30, 2006 at 02:54:56PM -0400, Kyle McMartin wrote:
> On Sun, Jul 30, 2006 at 11:35:32AM -0700, Andrew Morton wrote:
> > hm. A couple of those patches have been futzing around in -mm for over a
> > year and have been nacked by Jeff and are a regular source of grumpygrams.
> > I've been sitting on them in the pathetic hope that someone will one day
> > get down and address the bugs which they fix in an acceptable fashion,
> > whatever that is.
> >
>
> Jeff/Val seemed willing to merge the fixes as they stood. parisc-linux
> merged Francois' tulip workqueue patch some time ago, and have been
> running with it since without issue. This defers the tulip_select_media
> work to process context, and so should be less of an issue.

Hey Kyle,

Thanks for splitting these out. Could you do us a favor and post the
patches themselves? I'm not the only one who doesn't use git, and it
will be a lot less confusing if we can directly ack the patches in
email instead of referring to them third-hand.

Thanks,

-VAL
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
David Miller <[EMAIL PROTECTED]> writes:

>> hdlc_fr: logical PVC devices have no headers (plain IPv4 etc. as seen
>> by tcpdump), but they append FR headers (4 or 10 bytes long) just
>> before passing the skb to physical device.
>
> If you hooked up fr_hard_header into dev->hard_header instead of
> invoking it via pvc_xmit(), everything would be fine.

That would have to be master_device->hard_header(), but the network
stack (IP and friends) has to send packets to pvc_device.

I can't make the headers show up on the pvc device - that would break
the packet interface and Ethernet framing. The headers have to be
visible only on the master (physical) device.

> The complexity of this function arises from the fact that it prepends
> headers of differing lengths depending upon the protocol type
> being encapsulated, and this is the problem you should aim to
> solve.

Actually I don't think there is a problem with different header
lengths. The driver indicates it wants 10 bytes and that's enough
for all cases (except Ethernet framing where it indicates and uses
14 bytes and reallocs before prepending another 10 bytes).

> Alexey, any suggestions on how to handle this kind of thing?

What's wrong with my patch? If it can't be accepted I can just add
an empty pvc->hard_header(). That won't make other drivers work
reliably, though, and it's IMHO hardly their authors' fault. I don't
think we've ever advertised "hard_header_len is valid only with
non-NULL hard_header".
--
Krzysztof Halasa
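The underlying hazard, prepending a header when the buffer has no room in front of the payload, can be shown with a tiny userspace model. This is an analogy only: `struct pkt` and the `pkt_*` functions are hypothetical and merely mimic the idea behind the kernel's headroom accounting (`skb_headroom()`, `skb_push()`), while the reallocation path stands in for what the mail calls "reallocs before prepending".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal packet buffer: [head ... data) is headroom, data holds len bytes. */
struct pkt {
    unsigned char *head;  /* start of the allocation */
    unsigned char *data;  /* first byte of the packet */
    size_t len;           /* number of packet bytes */
};

size_t pkt_headroom(const struct pkt *p)
{
    return (size_t)(p->data - p->head);
}

/* Allocate a packet with the requested headroom in front of the payload.
 * Returns 0 on success, -1 on allocation failure. */
int pkt_alloc(struct pkt *p, size_t headroom, const void *payload, size_t len)
{
    assert(p != NULL && headroom + len > 0);
    p->head = malloc(headroom + len);
    if (!p->head)
        return -1;
    p->data = p->head + headroom;
    p->len = len;
    memcpy(p->data, payload, len);
    return 0;
}

/* Prepend hlen header bytes. If the headroom is too small, move the packet
 * into a larger buffer first instead of writing in front of the allocation.
 * Returns 0 on success, -1 on allocation failure. */
int pkt_push_header(struct pkt *p, const void *hdr, size_t hlen)
{
    if (pkt_headroom(p) < hlen) {
        unsigned char *nhead = malloc(hlen + p->len);

        if (!nhead)
            return -1;
        memcpy(nhead + hlen, p->data, p->len);
        free(p->head);
        p->head = nhead;
        p->data = nhead + hlen;
    }
    p->data -= hlen;
    p->len += hlen;
    memcpy(p->data, hdr, hlen);
    return 0;
}

void pkt_free(struct pkt *p)
{
    free(p->head);
    p->head = NULL;
    p->data = NULL;
    p->len = 0;
}
```

A sender that reserved enough headroom takes the cheap path through `pkt_push_header()`. One that reserved none still ends up with a valid packet, at the cost of a copy. Skipping the check is what turns a missing reservation into memory corruption.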
Re: Regarding offloading IPv6 addrconf and ndisc
On Tue, 2006-01-08 at 11:30 +1000, Herbert Xu wrote:
>
> You can now disable the OOM killer on a per-process basis by
>
> echo -17 > /proc//oom_adj
>

nice to know ;-> At least you can protect some apps if you need to.
Only racoon and quagga are important for me. But what happens then if
you have a beast that just chews memory forever? I suppose other poor
apps will just get shot.
My plan was just to write a simple daemon that uses the genetlink API
that Shailabh (IBM) and company wrote and just restart the app if i see
it disappear.

cheers,
jamal
Re: Linville's L2 rant... -- Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Mon, 2006-31-07 at 08:30 -0400, John W. Linville wrote: > On Mon, Jul 31, 2006 at 10:15:40AM +0200, Christophe Devriese wrote: > > > If you bond 2 vlan subinterfaces, the patch is not necessary at all. In > > that > > case also the source device will be changed from eth0. to bond. So > > that's correct behavior no ? > > > > In the second case, you create vlan subifs on a bonding device, vlan > > subinterfaces will be created on the slave interfaces. In that case the > > vlan > > (This is not directed at Christophe, or anyone in particular...) > > > > Am I the only one that thinks that our handling of LAN L2 stuff > is at best a little "too" flexible (and at worst a collection of > nasty hacks)? > > I mean, do we really need both the ability to bond multiple vlan > interfaces AND the ability to have vlan interfaces on top of a bond? > How many people really appreciate the subtle(?) differences? > > Then throw bridging into the mix! If I'm using VLANs and bonds in > a bridged environment, do I bridge the bonds, or bond the bridges? > Do the VLANs come before the bonds? after the bridges? or somewhere > in-between? Do all these combinations even work together? Who has > the definitive answer (besides the code itself)? > > I have no doubt that there are plenty of opportunities for cleverness > here (and no doubt dragons too). I just doubt that most of them > are worth the complexities introduced by our current collection of > "transparently" stackable pseudo-drivers and strategically placed hacks > (e.g. skb_bond). All that, and it still isn't clear to me how we > can cleanly accomodate 802.1s (which adds VLAN awareness to bridging). > > Do we hold the view that our L2 code is on par with the rest of > our code? Is there an appetite for a clean-up? Or is it just me? > > > > If you made it this far, thanks for listening...I feel better now. :-) Yes, I made it this far and you do make good arguement (or i may be over-dosed ;->). 
I have seen the following setups that are useful:

1) Vlans with bridges; in which one or more vlans exist per ethernet
   port. Broadcast packets within such vlans are restricted to just
   those vlans by the bridge.
2) complicate the above a little by having multiple spanning trees.
3) Add to the above link layer HA (802.1ad or otherwise as presented
   today by Bonding).

To answer your question; i think yes we need all 3.

Unfortunately the 3 above are all done by different people with
different intentions altogether. I think BGrears end goal was VLANs for
an end host. I think Lennert wrote the original Bridge code and for a
while had some VLAN code that worked well with bridging (that code died
as far as i know). Then bonding - there's some pre-historic relation to
it since D Becker days and then the good folks from Intel adding about
1M features to it.

Yes, the fact all 3 need to work together is a mess ;-> (but there are
good pragmatic reasons for them to work together)...

Hope that helps ;->

cheers,
jamal
Re: Regarding offloading IPv6 addrconf and ndisc
On Mon, Jul 31, 2006 at 09:24:27PM -0400, Jamal Hadi Salim wrote:
>
> In regards to reliability: The thing that really fscks people using
> daemons from what i have seen is the oom killer policies and the lack of
> correlation by apps. I just watched quagga die horribly on a 256M
> machine on friday once we hit around 100K routes and a lot of route
> cache hits. So apps like that may need a total rewrite. I am not looking
> forward to trying to get racoon to do 50K SAs and 100K SPDs on the same
> machine ;->

You can now disable the OOM killer on a per-process basis by

echo -17 > /proc//oom_adj

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
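A daemon that wants to protect itself could do the same write from C at startup instead of relying on a shell `echo`. The sketch below is an assumption-laden illustration: `write_oom_adj()` and the `OOM_DISABLE` macro name are hypothetical, and the file path is passed in by the caller because the mail's path has lost its per-process component. Only the value -17 comes from the thread.

```c
#include <assert.h>
#include <stdio.h>

/* The value from the mail that exempts a process from the OOM killer. */
#define OOM_DISABLE (-17)

/* Write an adjustment value to the given oom_adj-style file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int write_oom_adj(const char *path, int value)
{
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fprintf(f, "%d\n", value) < 0) {
        fclose(f);
        return -1;
    }
    /* fclose() flushes, so a failed write can surface here as well. */
    return fclose(f) == 0 ? 0 : -1;
}
```

The process needs sufficient privilege for the write to succeed. On failure the function only reports -1, so a caller that must not run unprotected should treat that as fatal.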
Re: Regarding offloading IPv6 addrconf and ndisc
On Mon, 2006-31-07 at 17:49 -0700, Roland Dreier wrote: > David> Why is this a relevant analogy? Well, you have physical > David> hard-disks in your computer today, but at some point that > David> device becomes largely superfluous. It makes more sense to > David> have just a cpu with a 10-gigabit ethernet interface > David> incorporated onto the cpu die, and the majority if not all > David> of your disk access is remote. > > Isn't most of the iSCSI control plane in userspace right now? I know iscsi is supposed to integrate with ipsec as well (and SLP for discovery) - does that happen in user space as well? Dave (I am under heavy flu dose, so I may be incoherent;->) but heres a devils advocate bit for you: TCP FIN/SYN are just control packets - so move the connection setup/teardown out to user space;->. You can then add all sorts of funky DOS detection/prevention schemes as needed - makes it easy to experiment with. Actually move the slow path as well, SACK processing etc (i know it is in process context today, but thats in the kernel). Just leave VJs fast path in the kernel. Extend the user space bit to be the new VJ (channels stuff but just for control) - asynch notification to carry the control/slow path packets to user space. In regards to ARP/NDISC being in user space: note people are talking about secure DHCP or some form of initial pre-layer2 addressing over EAP or something along those lines; i.e if you are not securely validated at the L2 level you are not even getting an IP address. In regards to reliability: The thing that really fscks people using daemons from what i have seen is the oom killer policies and the lack of correlation by apps. I just watched quagga die horribly on a 256M machine on friday once we hit around 100K routes and a lot of route cache hits. So apps like that may need a total rewrite. 
I am not looking forward to trying to get racoon to do 50K SAs and 100K
SPDs on the same machine ;->

I think I like what Hugo is saying ;-> I just hope he has time and
resources to produce code.

cheers,
jamal
Re: [RFC] gre: transparent ethernet bridging
Stephen Hemminger wrote:
> On Mon, 31 Jul 2006 20:06:41 +1000
> Philip Craig <[EMAIL PROTECTED]> wrote:
>
>> This patch implements transparent ethernet bridging for gre tunnels.
>> There are a few outstanding issues.
>
> Why not use existing bridge code?

It does use the existing bridge code. Perhaps the name is misleading.
All it does is encapsulate the full ethernet header in a gre packet,
rather than only layer 3. That is, currently gre uses ARPHRD_IPGRE,
but bridging requires ARPHRD_ETHER.

>> Some routers set LLC_SAP_BSPAN in the gre protocol field, and then
>> give the bpdu packet without any other ethernet/llc header. This patch
>> currently tries to fake the ethernet/llc header before passing the
>> packet up, but it is buggy (mac addresses are wrong at least). Maybe a
>> better approach is to call directly into the bridging code. I didn't try
>> that at first because it isn't modular, and may break other things that
>> want to see the packet.
>
> Existing bridge code already has spanning tree.

Yes, and I want to use that. But this packet is a bit strange in
that it does not have the ethernet header on it. So what is the
best way to pass it to existing code? Either fake the ethernet
header, or pass it directly?

>> +#if 0
>>      dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup);
>> +#else
>> +    dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup);
>> +#endif
>
> "Do, or do not there is no try"

I am looking for comments as to whether adding a netlink interface
to control this is appropriate.

>> +__be16 ipgre_type_trans(struct sk_buff *skb, int offset)
>> +{
>> +    u8 *h = skb->data;
>> +    __be16 flags = *(__be16*)h;
>> +    __be16 proto = *(__be16*)(h + 2);
>> +
>> +    /* WCCP version 1 and 2 protocol decoding.
>> +     * - Change protocol to IP
>> +     * - When dealing with WCCPv2, Skip extra 4 bytes in GRE header
>> +     */
>> +    if (flags == 0 &&
>> +        proto == __constant_htons(ETH_P_WCCP)) {
>> +        proto = __constant_htons(ETH_P_IP);
>> +        if ((*(h + offset) & 0xF0) != 0x40)
>> +            offset += 4;
>> +    }
>
> Don't use __constant_htons() except in initializers and switch cases
> (where gcc is too stupid to optimize the macro).
>

This is a problem in the existing code, which I am simply moving
around. Should I fix it at the same time?
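The WCCP decoding step from the quoted patch can be tried out in userspace with plain `htons()`, which is what the review comment asks for in ordinary runtime expressions. This is a sketch, not the patch itself: `gre_type_trans()` is a hypothetical name, the offset is passed by pointer so the adjustment is visible to the caller, and the numeric value used for the WCCP protocol (0x883E) is an assumption about what `ETH_P_WCCP` expands to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define GRE_PROTO_IP   0x0800
#define GRE_PROTO_WCCP 0x883E  /* assumed value of ETH_P_WCCP */

/* h points at the GRE header, *offset at the first byte after it.
 * Returns the protocol in network byte order and, for WCCPv2, advances
 * *offset past the extra 4-byte redirect header. */
uint16_t gre_type_trans(const uint8_t *h, size_t *offset)
{
    uint16_t flags, proto;

    assert(h != NULL && offset != NULL);

    /* memcpy avoids unaligned or aliasing-unsafe 16-bit loads. */
    memcpy(&flags, h, sizeof flags);
    memcpy(&proto, h + 2, sizeof proto);

    if (flags == 0 && proto == htons(GRE_PROTO_WCCP)) {
        proto = htons(GRE_PROTO_IP);
        /* WCCPv1 is followed directly by an IPv4 header (version nibble 4);
         * anything else is taken to be WCCPv2 with 4 extra bytes. */
        if ((h[*offset] & 0xF0) != 0x40)
            *offset += 4;
    }
    return proto;
}
```

Because the argument to `htons()` is a compile-time constant, an optimizing compiler can fold the byte swap away. That is the reason given for keeping `__constant_htons()` only where a constant expression is syntactically required.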
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
From: Krzysztof Halasa <[EMAIL PROTECTED]>
Date: Tue, 01 Aug 2006 03:04:28 +0200

> hdlc_fr: logical PVC devices have no headers (plain IPv4 etc. as seen
> by tcpdump), but they append FR headers (4 or 10 bytes long) just
> before passing the skb to physical device.

If you hooked up fr_hard_header into dev->hard_header instead of
invoking it via pvc_xmit(), everything would be fine.

The complexity of this function arises from the fact that it prepends
headers of differing lengths depending upon the protocol type
being encapsulated, and this is the problem you should aim to
solve.

Alexey, any suggestions on how to handle this kind of thing?
Re: [RFC 1/4] kevent: core files.
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Fri, 28 Jul 2006 09:23:12 +0400

> I completely agree that existing kevent interface is not the best, so
> I'm opened for any suggestions.
> Should kevent creation/removing/modification be separated too?

I do not think so, the objects for these 3 operations are the same, so
there are no typing issues.
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
David Miller <[EMAIL PROTECTED]> writes:

> Krzysztof, which device driver exactly creates this problem
> in the first place?

I have a report (not sure but I think it's that) with hdlc_fr
(Frame Relay). Grepping through the tree there might be problems
with:
- net/8021q/vlan.c (probably not with normal Ethernet, but there is
  a code path which could potentially be a problem with
  NETIF_F_HW_VLAN_TX)
- net/atm/clip.c
- net/appletalk/*
- drivers/net/gianfar.c
- drivers/net/wan/lapbether.c
- drivers/s390/net/netiucv.c will not oops but merely drop the packet
  and print a warning.

and possibly others, I haven't checked the whole tree. Some (not all)
of them might be false positives, though.

Fortunately most of the time skb comes with preallocated header
space (that common skb_reserve(2) I think) and thus the reports
aren't frequent (personally I have never seen that).

> If you have headers to prepend for your device, why do you set the
> header building function to NULL? :-)

hdlc_fr: logical PVC devices have no headers (plain IPv4 etc. as seen
by tcpdump), but they append FR headers (4 or 10 bytes long) just
before passing the skb to physical device.
-- 
Krzysztof Halasa
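The failure mode discussed here (a header is prepended although no header space was reserved) can be modelled in userspace. The following is only a sketch: struct ex_buf and the ex_buf_* helpers are hypothetical and merely imitate the idea of skb headroom; they are not the kernel's sk_buff API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_BUF_SIZE 64

/* Toy packet buffer: valid bytes live in storage[data .. data + len). */
struct ex_buf {
    uint8_t storage[EX_BUF_SIZE];
    size_t  data;   /* offset of the first valid byte, i.e. the headroom */
    size_t  len;    /* number of valid bytes */
};

/* Place a payload after `reserve` bytes of headroom. Returns 0 or -1. */
int ex_buf_init(struct ex_buf *b, size_t reserve,
                const void *payload, size_t len)
{
    assert(b != NULL);
    if (reserve > EX_BUF_SIZE || len > EX_BUF_SIZE - reserve)
        return -1;
    b->data = reserve;
    b->len = len;
    if (len)
        memcpy(b->storage + reserve, payload, len);
    return 0;
}

size_t ex_buf_headroom(const struct ex_buf *b)
{
    return b->data;
}

/*
 * Prepend a header. Unlike an unchecked push, this refuses when the
 * headroom is too small instead of writing before the buffer start.
 */
int ex_buf_push(struct ex_buf *b, const void *hdr, size_t hdr_len)
{
    if (ex_buf_headroom(b) < hdr_len)
        return -1;
    b->data -= hdr_len;
    b->len  += hdr_len;
    memcpy(b->storage + b->data, hdr, hdr_len);
    return 0;
}
```

With 4 bytes reserved, a 4-byte header fits and a 10-byte one is rejected, which mirrors the two FR header lengths mentioned in the message.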
Re: [RFC 1/4] kevent: core files.
From: Zach Brown <[EMAIL PROTECTED]>
Date: Thu, 27 Jul 2006 12:18:42 -0700

[ I kept this thread around in my inbox because I wanted to give it
  some deep thought, so sorry for replying to old bits... ]

> So as the kernel generates events in the ring it only produces an event
> if the ownership field says that userspace has consumed it and in doing
> so it sets the ownership field to tell userspace that an event is
> waiting. userspace and the kernel now each follow their index around
> the ring as the ownership field lets them produce or consume the event
> at their index. Can someone tell me if the cache coherence costs of
> this are extreme? I'm hoping they're not.

No need for an owner field, we can use something like a VJ
netchannel datastructure for this. Kernel only writes to producer
index and user only writes to consumer index.

> So, great, glibc can now find pending events very quickly if they're
> waiting in the ring and can fall back to the collection syscall if it
> wants to wait and the ring is empty. If it consumes events via the
> syscall it increases its ring index by the number the syscall returned.

I do not think if we do a ring buffer that events should be obtainable
via a syscall at all. Rather, I think this system call should be
purely "sleep until ring is not empty".

This is actually reasonably simple stuff to implement as Evgeniy
has tried to explain.

Events in kevent live on a ready list when they have triggered.
Existence on a list determined the state, and I think this design
btw invalidates some of the arguments against using netlink that
Ulrich mentions in his paper. If netlink socket queuing fails,
well then kevent stays on ready list and that is all until the
kevent can be successfully published to the user.

I am not advocating netlink at all for this, as the ring buffer idea
is much better.

The ring buffer size, as Evgeniy also tried to describe, is bounded
purely by the number of registered events.
So event loop of application might look something like this:

	struct ukevent cur_event;
	struct timeval timeo;

	setup_timeout(&timeo);
	for (;;) {
		int err;

		while(!(err = ukevent_dequeue(evt_fd, evt_ring,
					      &cur_event, &timeo))) {
			struct my_event_object *o =
				event_to_object(&cur_event);

			o->dispatch(o, &cur_event);
			setup_timeout(&timeo);
		}
		if (err == -ETIMEDOUT)
			timeout_processing();
		else
			event_error_processing(err);
	}

ukevent_dequeue() is perhaps some GLIBC implemented routine which does
something like:

	int err;

	for (;;) {
		if (!evt_ring_empty(evt_ring)) {
			struct ukevent *p = evt_ring_consume(evt_ring);

			memcpy(event_p, p, sizeof(struct ukevent));
			return 0;
		}
		err = kevent_wait(evt_fd, timeo_p);
		if (err < 0)
			break;
	}
	return err;

It's just some stupid ideas... we could also choose to expose the ring
buffer layout directly to the user event loop and let it perform the
dequeue operation and kevent_wait() calls directly. I don't see why
not to allow that.
Re: Regarding offloading IPv6 addrconf and ndisc
    David> Why is this a relevant analogy? Well, you have physical
    David> hard-disks in your computer today, but at some point that
    David> device becomes largely superfluous. It makes more sense to
    David> have just a cpu with a 10-gigabit ethernet interface
    David> incorporated onto the cpu die, and the majority if not all
    David> of your disk access is remote.

Isn't most of the iSCSI control plane in userspace right now?

 - R.
Re: Regarding offloading IPv6 addrconf and ndisc
From: Andi Kleen <[EMAIL PROTECTED]>
Date: Tue, 1 Aug 2006 02:31:58 +0200

> Playing devil's advocate here: if the packets are processed on
> two different CPUs then this could also happen and break the test
> case.
>
> So the test is probably a bit fragile.

Good point.

> I generally agree it's better to keep this in kernel though.

To drive this home even more, I do not believe that the people who
advocate pushing NDISC and ARP policy into userspace would be very
happy if something like the RAID transformations were moved into
userspace and they were not able to access their disks if the RAID
transformer process in userspace died.

Why is this a relevant analogy? Well, you have physical hard-disks in
your computer today, but at some point that device becomes largely
superfluous. It makes more sense to have just a cpu with a 10-gigabit
ethernet interface incorporated onto the cpu die, and the majority if
not all of your disk access is remote.

At that point, network access equals disk access. It would be amusing
to need to restart such an NDISC/ARP daemon if it were to live on a
remote volume. :-)

I understand full well that on special purpose network devices this
control vs. data plane separation into userspace might make a lot of
sense. But for a general purpose operating system, such as Linux, the
greater concern is resiliency to failures and each piece of core
functionality you move to userspace is a new potential point of
failure.
Re: Regarding offloading IPv6 addrconf and ndisc
> If we process these in sequence in software interrupt, everything
> is fine. Processing of "A" will add the address, and the test
> ping packet "B" will respond properly.
>
> If you defer "A", everything breaks and the test packet "B" will
> get processed first and not work.

Playing devil's advocate here: if the packets are processed on
two different CPUs then this could also happen and break the test
case.

So the test is probably a bit fragile.

Currently it is unlikely to happen because of interrupt affinity for
a single device, but in future with MSI-X support it might not.

I generally agree it's better to keep this in kernel though.

-Andi
Re: Regarding offloading IPv6 addrconf and ndisc
Hello Hugo,

Hugo Santos wrote:
> Hi,
>
>> On the other hand, if a ND daemon loose the synchronization, it is
>> unpredicable, I guess.
>
>    What do you mean by synchronization in this context? My idea was to
> keep the ND state machine inside the kernel, and instead have the
> daemon be reactive. That means it would send messages on behalf of the
> kernel, and apply information based on received signalling (besides, ND
> is reseliant to loss of messages). Taking your example, if the kernel
> is using a neighbor entry and you replace it (either changing it's
> state or link-layer address), the kernel will adapt, i believe it is
> predictable. To be honest, i'm only worried about possible lost netlink
> messages; but the daemon may be implemented to handle this, re-sending
> while an ACK isn't receiving, thus minimizing any de-synchronization
> possibilities.
>

The kernel maintains the ND state by itself and the daemon touches
the state. I think the daemon should be aware of the state. That is
what I meant by "synchronization".

Anyway, I do not intend to hold up your work any further. I will
leave the discussion here, since I have not seen the code.

>> BTW, we have a choice which we implement a functionality as a
>> module. I think it can achieve some of what you want.
>
>    Well, exporting the functionality to a module would be a start to
> have one moving it out of the kernel. :-)
>
>    Hugo

-- 
Kazunori Miyazawa
Re: [RFC 1/4] kevent: core files.
> Ok, let's do it in the following way:
> I present new version of kevent with new syscalls and fixed issues mentioned
> before, while people look at it we can end up with mapped buffer design.
> Is it ok?

Yeah, that sounds good. I'm looking forward to seeing the next set of
patches :).

- z
Re: Off by one buglets
From: Ralf Baechle <[EMAIL PROTECTED]>
Date: Fri, 30 Jun 2006 15:29:01 +0100

> Ages ago, changeset
>
> http://www.kernel.org/git/?p=linux/kernel/git/tglx/history.git;a=commit;h=22d864d542a0b92116751186f1794c7d0f1ca1b9
>
> which converted several protocols from using open coded comparisons to
> use the helper function sk_acceptq_is_full() did introduce a bunch of
> off by one errors - sk_acceptq_is_full checks for
> sk_ack_backlog > sk_max_ack_backlog but it replaced >= or == comparisons.
>
> Below patch is really only meant to illustrate the change, not to be
> applied.

I looked at this again, and the change is perfectly fine.

This patch merely shows that previously the protocols were very
inconsistent about what sk_max_ack_backlog really meant.

All Arnaldo's changeset did was enforce a consistent meaning of this
limit across the entire tree.
Re: [RFC 1/4] kevent: core files.
From: Brent Cook <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 17:16:48 -0500

> There has to be some thread that is responsible for reading
> events. Perhaps a reasonable thing for a blocked thread that cannot
> process events to do is to yield to one that can?

The reason one decentralizes event processing into threads is so that
once they are tasked to process some event they need not be concerned
with event state.

They are designed to process their event through to the end, then
return to the top level and say "any more work for me?"
Re: [RFC 1/4] kevent: core files.
On Monday 31 July 2006 17:00, David Miller wrote:
>
> So we'd have cases like this, assume we start with a full event
> queue:
>
> 	thread A		thread B
>
> 	dequeue event
> 	aha, new connection
> 	accept()
> 				register new kevent
> 				queue is now full again
> 	add kevent on new
> 	connection
>
> At this point thread A doesn't have very many options when the kevent
> add fails. You cannot force this thread to read more events, since he
> may not be in a state where he is easily able to do so.

There has to be some thread that is responsible for reading events.
Perhaps a reasonable thing for a blocked thread that cannot process
events to do is to yield to one that can?
Re: [RFC 1/4] kevent: core files.
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 23:41:43 +0400

> Since kevents are never generated by kernel, but only marked as ready,
> length of the main queue performs as flow control, so we can create a
> mapped buffer which will have space equal to the main queue length
> multiplied by size of the copied to userspace structure plus 16 bits for
> the start index of the kernel writing side, i.e. it will store offset
> where the oldest event was placed.
>
> Since queue length is a limited factor and thus no new events can be added
> when queue is full, that means that buffer is full too and userspace
> must read events. When syscall is called to add new kevent and provided
> there offset differs from what kernel stored, that means that all events
> from kernel to provided index have been read and new events can be added.
> Thus we can even allow read-only mapping. Kernel's index is incremented
> modulo queue length. If kevent was removed after it was marked as
> ready, it's copy stays in the mapped buffer, but special flag can be
> assigned to show that kevent is no longer valid.

This sounds reasonable.

However we must be mindful that the thread of control trying to
add a new event might not be in a position to drain the queue
of pending events when the queue is full. Usually he will be
trying to add an event in response to handling another event.

So we'd have cases like this, assume we start with a full event
queue:

	thread A		thread B

	dequeue event
	aha, new connection
	accept()
				register new kevent
				queue is now full again
	add kevent on new
	connection

At this point thread A doesn't have very many options when the kevent
add fails. You cannot force this thread to read more events, since he
may not be in a state where he is easily able to do so.
Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3
From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 23:36:29 +0200

> David Miller wrote:
> > Does this matter?
>
> I don't think it does. Its a huge corner case (unloading of the
> module which issued the QUEUE verdict while queueing the packet),
> and worst case is that we drop some segments or the entire packet.

Ok, that's what I thought.
Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3
David Miller wrote:
> I noticed a subtle semantic change for nf_queue(). Previously, if we
> can't grab the module reference for the matching entry, we'd not free
> the skb, return 0, and the caller tries to iterate to the next hook.
>
> That behavior is preserved for singleton frames, but that's not what
> happens for GSO frames. Instead, the GSO frame is split up and we
> always return "1" even if some of the subsegments cause __nf_queue()
> to return 0 due to the case described in the previous paragraph.

I couldn't think of a better way to handle this except to just
deliver everything we can and drop the rest. Since the caller
doesn't know anything about the individual segments, we can't
simply deliver the remaining ones to the next hook.

> It is, however, mindful to free up the kfree_skb so it doesn't cause a
> leak or anything like that.
>
> Does this matter?

I don't think it does. It's a huge corner case (unloading of the
module which issued the QUEUE verdict while queueing the packet),
and worst case is that we drop some segments or the entire packet.
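The semantic difference described in this exchange can be modelled in a few lines. This is a toy sketch, not the netfilter code: struct ex_seg, ex_queue_one, ex_queue_gso and the ex_refuse_odd stand-in are all hypothetical, and "freed" is just a flag standing in for dropping the segment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy segment: `queued` and `freed` record what happened to it. */
struct ex_seg {
    int id;
    int queued;
    int freed;
};

/* Hypothetical per-packet attempt: 1 if queued, 0 if it could not be. */
typedef int (*ex_queue_fn)(struct ex_seg *seg);

/*
 * Singleton path: on failure the packet is not freed and 0 is returned,
 * so the caller may continue with the next hook.
 */
int ex_queue_one(struct ex_seg *seg, ex_queue_fn try_queue)
{
    if (!try_queue(seg))
        return 0;
    seg->queued = 1;
    return 1;
}

/*
 * Segmented path: the caller cannot see individual segments, so every
 * segment that cannot be queued is dropped here and 1 is returned anyway.
 */
int ex_queue_gso(struct ex_seg *segs, size_t n, ex_queue_fn try_queue)
{
    assert(segs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!ex_queue_one(&segs[i], try_queue))
            segs[i].freed = 1;   /* deliver what we can, drop the rest */
    }
    return 1;
}

/* Stand-in failure condition for the example: refuse odd-numbered segments. */
int ex_refuse_odd(struct ex_seg *seg)
{
    return (seg->id % 2) == 0;
}
```

In the real corner case the failure would presumably hit all segments alike, since it comes from a module being unloaded; the odd/even split here only makes the two outcomes visible side by side.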
skge and sky2 backported driver.
This is a backport of the current 2.6.18 version of skge and sky2 drivers
for use with older kernels. The drivers depend on the CRC32 module.

It has been compiled and tested on RHEL (2.4) and 2.6.8 but should
work on other kernels past that point. It does depend on ethtool_ops,
if_vlan and mii support.

This version is somewhat different than the current 2.6 version:
 * no suspend/resume or wake on LAN
 * no support for reading original MAC address
 * no support for MSI on sky2
 * receive checksumming default off because earlier kernels often
   do not handle it properly with vlan's or PPP.
 * sky2 does not use NAPI because it required changes to netdevice.h
   and/or tweaking internals of netdevice interface to handle
   dual port status.
 * sky2 defaults to TSO off because until 2.6.13 there are issues
   with TCP congestion control and TSO.
 * sky2 doesn't use VLAN acceleration features; this probably doesn't
   make much difference.

THIS IS NOT SUPPORTED, IT IS PROVIDED AS IS.

In other words, go ahead and mail bug reports to me <[EMAIL PROTECTED]>
but don't expect me to be able to fix them.

http://developer.osdl.org/shemminger/releases/skge-sky2-backport.tar.bz2
Re: Regarding offloading IPv6 addrconf and ndisc
So all of you userland control-plane fanatics, how will you handle
things like NFS root with these daemon-required variants of NDISC and
ARP?

I know the devil's advocate responses already, so don't bother with
responses saying things like

1) "do it in the initial ramdisk, we only need the daemon to setup
   the NDISC entries to talk to the NFS server"

or

2) "IPSEC's control plane is in userspace and therefore we can't do
   NFS root over IPSEC, why is that ok and key'd NDISC is not?"

I think we are building systems which gradually are becoming less
and less reliable, with increasing numbers of possible points of
failure.

Flexibility is overrated. There are many crucial optimizations and
simplifications we cannot perform because we've made certain aspects
of network configuration far too flexible.
Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3
From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 20:36:58 +0200

> I'm going to do some more testing now ..

Thanks for all of this work Patrick.

I noticed a subtle semantic change for nf_queue(). Previously, if we
can't grab the module reference for the matching entry, we'd not free
the skb, return 0, and the caller tries to iterate to the next hook.

That behavior is preserved for singleton frames, but that's not what
happens for GSO frames. Instead, the GSO frame is split up and we
always return "1" even if some of the subsegments cause __nf_queue()
to return 0 due to the case described in the previous paragraph.

It is, however, mindful to free up the kfree_skb so it doesn't cause a
leak or anything like that.

Does this matter?
Re: neigh_lookup lockdep bug.
From: Dave Jones <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 16:50:04 -0400

> 2.6.18rc2-git
> Something on my firewall box just triggered this..

Lockdep is perhaps confused.

> [515613.904945] swapper/0 is trying to acquire lock:
> [515613.931489] (&tbl->lock){-+-+}, at: [] neigh_lookup+0x50/0xaf
> [515613.964369]
> [515613.964373] but task is already holding lock:
> [515614.006550] (&skb_queue_lock_key){-+..}, at: []
> neigh_proxy_process+0x20/0xc2

The skb_queue_lock in question is &tbl->proxy_queue.lock

> [515614.103459] the existing dependency chain (in reverse order) is:
> [515614.148752]
> [515614.148755] -> #2 (&skb_queue_lock_key){-+..}:
> [515614.10][] lock_acquire+0x4b/0x6c
> [515614.215554][] _spin_lock_irqsave+0x22/0x32
> [515614.243606][] skb_dequeue+0x12/0x43
> [515614.269657][] skb_queue_purge+0x14/0x1b
> [515614.296565][] neigh_update+0x317/0x353

This is a different queue lock, namely &neigh->arp_queue.lock

Like the ipv6 trace we got yesterday from Matt Domsche, lockdep
is apparently confusing two instances of the skb_queue_lock_key

> [515614.677724] -> #0 (&tbl->lock){-+-+}:
> [515614.707327][] lock_acquire+0x4b/0x6c
> [515614.729897][] _read_lock_bh+0x1e/0x2d
> [515614.752546][] neigh_lookup+0x50/0xaf
> [515614.774754][] neigh_event_ns+0x2c/0x77
> [515614.797271][] arp_process+0x366/0x4e4
> [515614.819349][] parp_redo+0x8/0xa
> [515614.839660][] neigh_proxy_process+0x66/0xc2
> [515614.862931][] run_timer_softirq+0x108/0x167
> [515614.886048][] __do_softirq+0x78/0xf2
> [515614.907136][] do_softirq+0x5a/0xbe
> [515614.927553]

And this path takes &tbl->proxy_queue.lock, then &tbl->lock

I don't see the problem.
Re: [PATCH dscape] d80211: Switch d80211.h to IEEE80211_ style names
On Monday 31 July 2006 13:31, John W. Linville wrote:
> As usual I'll depend on Jiri to merge d80211 stack patches, then
> send me a pull request. If I apply your "Switch drivers to d80211"
> series now, that will undoutedly cause a breakage when Jiri asks me
> to pull this later.
>
Yeah, there needs to be a new (and smaller) set of patches to switch drivers
to the d80211.h header.

> I presume that at least parts of those patches will still be necessary
> or desirable after the d80211 symbol rename gets merged. Are you
> preparing a new patch series for when that happens?
>
I'm not quite sure whether switching d80211_mgmt.h will be worthwhile. If it
isn't, then I can get to preparing a new patch series. At any rate, I think
the most important thing right now is fixing the conflicts w/ wireless-dev in
the patch that updates the d80211 drivers to use the new names so we can get
the rename stuff into wireless-dev.

-Michael Wu
neigh_lookup lockdep bug.
2.6.18rc2-git
Something on my firewall box just triggered this..

		Dave

[515613.791771] ===
[515613.841467] [ INFO: possible circular locking dependency detected ]
[515613.873284] ---
[515613.904945] swapper/0 is trying to acquire lock:
[515613.931489] (&tbl->lock){-+-+}, at: [] neigh_lookup+0x50/0xaf
[515613.964369]
[515613.964373] but task is already holding lock:
[515614.006550] (&skb_queue_lock_key){-+..}, at: [] neigh_proxy_process+0x20/0xc2
[515614.043225]
[515614.043228] which lock already depends on the new lock.
[515614.043234]
[515614.103456]
[515614.103459] the existing dependency chain (in reverse order) is:
[515614.148752]
[515614.148755] -> #2 (&skb_queue_lock_key){-+..}:
[515614.10][] lock_acquire+0x4b/0x6c
[515614.215554][] _spin_lock_irqsave+0x22/0x32
[515614.243606][] skb_dequeue+0x12/0x43
[515614.269657][] skb_queue_purge+0x14/0x1b
[515614.296565][] neigh_update+0x317/0x353
[515614.323004][] arp_process+0x4aa/0x4e4
[515614.349004][] arp_rcv+0xd4/0xf1
[515614.373209][] netif_receive_skb+0x204/0x271
[515614.400405][] process_backlog+0x99/0xfa
[515614.426351][] net_rx_action+0x9d/0x196
[515614.451856][] __do_softirq+0x78/0xf2
[515614.476660][] do_softirq+0x5a/0xbe
[515614.500737]
[515614.500741] -> #1 (&n->lock){-+-+}:
[515614.532763][] lock_acquire+0x4b/0x6c
[515614.556814][] _write_lock+0x19/0x28
[515614.580398][] neigh_periodic_timer+0x98/0x13c
[515614.606447][] run_timer_softirq+0x108/0x167
[515614.631798][] __do_softirq+0x78/0xf2
[515614.655122][] do_softirq+0x5a/0xbe
[515614.677721]
[515614.677724] -> #0 (&tbl->lock){-+-+}:
[515614.707327][] lock_acquire+0x4b/0x6c
[515614.729897][] _read_lock_bh+0x1e/0x2d
[515614.752546][] neigh_lookup+0x50/0xaf
[515614.774754][] neigh_event_ns+0x2c/0x77
[515614.797271][] arp_process+0x366/0x4e4
[515614.819349][] parp_redo+0x8/0xa
[515614.839660][] neigh_proxy_process+0x66/0xc2
[515614.862931][] run_timer_softirq+0x108/0x167
[515614.886048][] __do_softirq+0x78/0xf2
[515614.907136][] do_softirq+0x5a/0xbe
[515614.927553]
[515614.927557] other info that might help us debug this:
[515614.927563]
[515614.966774] 1 lock held by swapper/0:
[515614.982693] #0: (&skb_queue_lock_key){-+..}, at: [] neigh_proxy_process+0x20/0xc2
[515615.013575]
[515615.013578] stack backtrace:
[515615.037414] [] show_trace_log_lvl+0x54/0xfd
[515615.057910] [] show_trace+0xd/0x10
[515615.075934] [] dump_stack+0x19/0x1b
[515615.094167] [] print_circular_bug_tail+0x59/0x64
[515615.116172] [] __lock_acquire+0x808/0x997
[515615.136514] [] lock_acquire+0x4b/0x6c
[515615.155699] [] _read_lock_bh+0x1e/0x2d
[515615.175098] [] neigh_lookup+0x50/0xaf
[515615.197276] [] neigh_event_ns+0x2c/0x77
[515615.220267] [] arp_process+0x366/0x4e4
[515615.243248] [] parp_redo+0x8/0xa
[515615.264645] [] neigh_proxy_process+0x66/0xc2
[515615.288899] [] run_timer_softirq+0x108/0x167
[515615.309972] [] __do_softirq+0x78/0xf2
[515615.328940] [] do_softirq+0x5a/0xbe
[515615.347150] [] irq_exit+0x3d/0x3f
[515615.365067] [] smp_apic_timer_interrupt+0x79/0x7e
[515615.387057] [] apic_timer_interrupt+0x2a/0x30

-- 
http://www.codemonkey.org.uk
Re: [PATCH dscape] d80211: Switch d80211.h to IEEE80211_ style names
On Thu, Jul 27, 2006 at 12:37:14AM -0700, Michael Wu wrote:
> Alright, I've replaced all + lines with spaces with tabs.
>
> I also fixed one long line. The rest of them are nearly impossible to shorten
> well. The "(fc & IEEE80211_FCTL_FTYPE) == IEEE80211_FTYPE_DATA" style is
> really killing us (unless you want to break after ==, which is rather bad). I
> think we should switch back to a macro for that.
>
> I would prefer if we could get this merged soon and put in that line
> shortening macro later (or whatever solution that's best), but it's your
> call.

Michael,

As usual I'll depend on Jiri to merge d80211 stack patches, then
send me a pull request. If I apply your "Switch drivers to d80211"
series now, that will undoubtedly cause a breakage when Jiri asks me
to pull this later.

I presume that at least parts of those patches will still be necessary
or desirable after the d80211 symbol rename gets merged. Are you
preparing a new patch series for when that happens?

Thanks,

John
-- 
John W. Linville
[EMAIL PROTECTED]
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
From: Krzysztof Halasa <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 22:04:33 +0200

> This is non-trivial because hard_header and hard_start_xmit
> functions currently can't return new skb address (hard_header()
> can't use skb_realloc_headroom() at all, xmit() can't use it if
> there is a need to requeue the packet).
>
> Or can you just realloc the data portion of skb without changing skb
> struct address?

The skb may be referenced by other things.

Krzysztof, which device driver exactly creates this problem
in the first place?
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
From: Krzysztof Halasa <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 22:04:33 +0200

> Alexey Kuznetsov <[EMAIL PROTECTED]> writes:
>
> > All the rest of places just check, that there is enough space
> > for their immediate needs. If dev->hard_header() is NULL, it means that
> > stack does not need any space at all, so that it does not need to worry.
>
> Why do you think dev->hard_header == NULL means there is no need for
> header space? Isn't it dev->hard_header_len = 0? Why would a device
> set hard_header_len to non-zero if it doesn't need header space?

If you have headers to prepend for your device, why do you set the
header building function to NULL? :-)
Re: SMSC LAN911x and LAN921x vendor driver
Hi Francois,

Thanks for your feedback, I have a few questions.

> > +	return serviced;
> > +}
> > +
> > +/* Autodetects and initialises external phy for SMSC9115 and SMSC9117 flavors.
> > + * If something goes wrong, returns -ENODEV to revert back to internal phy. */
> > +static int smsc911x_phy_initialise_external(struct smsc911x_data *pdata)
> > +{
> > +	unsigned int address;
> > +	unsigned int hwcfg;
> > +	unsigned int phyid1;
> > +	unsigned int phyid2;
> > +
> > +	hwcfg = smsc911x_reg_read(pdata, HW_CFG);
> > +
> > +	/* External phy is requested, supported, and detected */
> > +	if (hwcfg & HW_CFG_EXT_PHY_DET_) {
> > +
> > +		/* Attempt to switch to external phy for auto-detecting
> > +		 * its address. Assuming tx and rx are stopped because
> > +		 * smsc911x_phy_initialise is called before
> > +		 * smsc911x_rx_initialise and tx_initialise.
> > +		 */
> > +
> > +		/* Disable phy clocks to the MAC */
> > +		hwcfg &= (~HW_CFG_PHY_CLK_SEL_);
> > +		hwcfg |= HW_CFG_PHY_CLK_SEL_CLK_DIS_;
> > +		smsc911x_reg_write(hwcfg, pdata, HW_CFG);
> > +		udelay(10);	/* Enough time for clocks to stop */
>
> I assume that writes are never posted, right ?
>

I don't understand the question, what do you mean?

> > +static void smsc911x_rx_multicast_update(struct smsc911x_data *pdata)
> > +{
> > +	unsigned long flags;
> > +	unsigned int timeout;
> > +	unsigned int mac_cr;
> > +
> > +	/* This function is only called for older LAN911x devices
> > +	 * (revA or revB), where MAC_CR, HASHH and HASHL should not
> > +	 * be modified during Rx - newer devices immediately update the
> > +	 * registers */
> > +
> > +	local_irq_save(flags);
> > +
> > +	/* Stop Rx */
> > +	mac_cr = smsc911x_mac_read(pdata, MAC_CR);
> > +	mac_cr &= ~(MAC_CR_RXEN_);
> > +	smsc911x_mac_write(pdata, MAC_CR, mac_cr);
> > +
> > +	/* Poll until Rx has stopped. If a frame is being recieved, this will
> > +	 * block until the end of this frame. (this may take a long time at
> > +	 * 10Mbps) */
> > +	timeout = 2000;
> > +	while ((timeout--)
> > +	       && (!(smsc911x_reg_read(pdata, INT_STS) & INT_STS_RXSTOP_INT_))) {
> > +		udelay(1);
>
>
> In a completely ideal world the driver would probably race outside of an
> irq disabled section until it grabs the napi poll handler, thus preserving
> the nice low latency property of the kernel.
>
> Nevermind :o}

Agreed, I would like to find a nicer way to do this. It's a nasty
workaround to a nasty hardware issue :o}

There are two problems. First, on older hardware revisions the multicast
hash filters (as well as the promisc flag) cannot be modified while rx
is active (bad things might happen). There is an interrupt which can be
used to indicate RX has stopped, but on early hardware this is not 100%
reliable.

The current solution is the simplest option, and works. A better way
could be to use the RX_STOP interrupt, but also schedule a task to run
later "just in case"?

Best Regards,
-- 
Steve Glendinning
SMSC GmbH
m: +44 777 933 9124
e: [EMAIL PROTECTED]
Re: Linux TCP in the presence of delays or drops...
From: Oumer Teyeb <[EMAIL PROTECTED]> Date: Mon, 31 Jul 2006 19:49:28 +0200 > it would be so great if some of you could spare a few minutes and take a > look at the traces I provided.see below for the original postng... If people are too backlogged and busy to reply to your original posting, you will only ensure that it will take even longer by bombarding the list with even more information and questions on top of your original large query. Just be patient.
Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
From: Thomas Graf <[EMAIL PROTECTED]> Date: Mon, 31 Jul 2006 17:41:42 +0200 > * Herbert Xu <[EMAIL PROTECTED]> 2006-08-01 00:01 > > Actually, if we're adding policy routing, we should seriously consider > > whether living without a routing cache is still viable or not because > > the cost of a route lookup has just gone up. > > Absolutely. This is something I wanted to bring up too.
Re: Please pull 'bcm43xx' branch of wireless-2.6?
On Mon, Jul 31, 2006 at 04:01:43PM -0400, Jeff Garzik wrote: > John W. Linville wrote: > >Jeff, if a 10ms maximum delay is still acceptable to you, then please > >pull from the bcm43xx branch of wireless-2.6 into the upstream branch > >of netdev-2.6. > > Just to be clear, 'upstream' not 'upstream-fixes', correct? Yes, queued for 2.6.19. Thanks, John -- John W. Linville [EMAIL PROTECTED]
Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
* Patrick McHardy <[EMAIL PROTECTED]> 2006-07-31 20:01 > Thomas Graf wrote: > > * Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 17:46 > > > >>Shouldn't all these (struct fib_rule_hdr included) actually be defined > >>in include/linux/rtnetlink.h? > > > > > > We used to stuff everything into rtnetlink.h for no good reason. Having > > independant include/linux/.h to export the interface to > > userspace and include/net/.h to export the kernel interface > > instead of contributing to the ifdef hell seems a lot cleaner to me. > > > I agree, but then we should also split up rtnetlink.h. Having one > special case will just make it harder to find. Already done in the patchset converting things to the new netlink interface, which I'll start submitting in the next few days.
Re: Please pull 'bcm43xx' branch of wireless-2.6?
John W. Linville wrote: > Jeff, if a 10ms maximum delay is still acceptable to you, then please > pull from the bcm43xx branch of wireless-2.6 into the upstream branch > of netdev-2.6. Just to be clear, 'upstream' not 'upstream-fixes', correct? > P.S. FWIW, I'm still not totally happy w/ the (potential for a) long > busy wait. But, this series of patches makes things better by 100x > over what is currently in the tree. So, it seems worthwhile. I'll > keep further reductions as an item on my TODO list, FWIW... :-) Agreed. And there are some existing busy-waits that (1) _obviously_ need to be converted to mdelay(), and (2) eventually need to be converted to msleep(). Overall, the Linux kernel community consensus is that long synchronous delays spinning the CPU should be avoided. Jeff
Re: [RFC 1/4] kevent: core files.
On Mon, Jul 31, 2006 at 02:33:22PM +0400, Evgeniy Polyakov ([EMAIL PROTECTED]) wrote: > Ok, let's do it in the following way: > I present new version of kevent with new syscalls and fixed issues mentioned > before, while people look at it we can end up with mapped buffer design. > Is it ok? Since kevents are never generated by the kernel, but only marked as ready, the length of the main queue acts as flow control. So we can create a mapped buffer whose size equals the main queue length multiplied by the size of the structure copied to userspace, plus 16 bits for the start index of the kernel's writing side, i.e. the offset where the oldest event was placed. Since the queue length is the limiting factor, no new events can be added when the queue is full; at that point the buffer is full too and userspace must read events. When the syscall is called to add a new kevent and the offset provided there differs from the one the kernel stored, that means all events from the kernel's index up to the provided index have been read and new events can be added. Thus we can even allow a read-only mapping. The kernel's index is incremented modulo the queue length. If a kevent is removed after it was marked as ready, its copy stays in the mapped buffer, but a special flag can be set to show that the kevent is no longer valid. -- Evgeniy Polyakov
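The mapped-buffer idea described above can be sketched as a small userspace model. This is a rough sketch and not the kevent implementation: the names `ready_event`, `mapped_ring`, `kring`, `QUEUE_LEN` and `EV_INVALID` are hypothetical, the queue length of 8 is arbitrary, and the unread-slot counter is assumed to be kernel-private state that is not part of the mapping.

```c
#include <stddef.h>
#include <stdint.h>

#define QUEUE_LEN  8     /* hypothetical main-queue length */
#define EV_INVALID 0x1u  /* hypothetical "kevent was removed" flag */

/* Hypothetical structure copied to userspace for each ready kevent. */
struct ready_event {
    uint32_t id;
    uint32_t flags;
};

/* The mapped area: one slot per queue entry plus a 16-bit start index. */
struct mapped_ring {
    struct ready_event ev[QUEUE_LEN];
    uint16_t start;  /* slot holding the oldest unread event */
};

/* Kernel-private bookkeeping, assumed not visible through the mapping. */
struct kring {
    struct mapped_ring *ring;
    uint16_t used;   /* number of slots holding unread events */
};

/* Size from the description: queue length * event size + 16 bits. */
size_t ring_payload_bytes(void)
{
    return QUEUE_LEN * sizeof(struct ready_event) + sizeof(uint16_t);
}

/* Mark a kevent ready: copy it into the next free slot, or fail when the
 * queue (and therefore the buffer) is full. */
int ring_mark_ready(struct kring *k, uint32_t id)
{
    if (k->used == QUEUE_LEN)
        return -1;  /* userspace must read events first */
    uint16_t slot = (uint16_t)((k->ring->start + k->used) % QUEUE_LEN);
    k->ring->ev[slot].id = id;
    k->ring->ev[slot].flags = 0;
    k->used++;
    return 0;
}

/* Userspace reports the offset it has read up to. An offset different
 * from the stored start index frees the slots in between; an equal
 * offset is treated as "nothing read" in this sketch. */
void ring_ack(struct kring *k, uint16_t user_off)
{
    uint16_t n = (uint16_t)((user_off % QUEUE_LEN + QUEUE_LEN
                             - k->ring->start) % QUEUE_LEN);
    if (n > k->used)
        n = k->used;
    k->ring->start = (uint16_t)((k->ring->start + n) % QUEUE_LEN);
    k->used = (uint16_t)(k->used - n);
}

/* A removed kevent keeps its slot but is flagged as no longer valid. */
void ring_invalidate(struct kring *k, uint32_t id)
{
    for (uint16_t i = 0; i < k->used; i++) {
        uint16_t slot = (uint16_t)((k->ring->start + i) % QUEUE_LEN);
        if (k->ring->ev[slot].id == id)
            k->ring->ev[slot].flags |= EV_INVALID;
    }
}
```

One consequence visible in the sketch: because an unchanged offset means "nothing read", a single acknowledgement can release at most `QUEUE_LEN - 1` slots. Whether the real design would need an extra bit to distinguish "all read" from "none read" is not stated in the message.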
Please pull 'bcm43xx' branch of wireless-2.6?
On Thu, Jul 27, 2006 at 08:37:53PM -0400, John W. Linville wrote: > As most of us are painfully aware, there is a blockage in getting > bcm43xx patches upstream.(*) > (*) http://marc.theaimsgroup.com/?l=linux-netdev&m=115137403631920&w=2 After re-reading that thread, I realized that Jeff had indicated that the original maximum delay of 10ms would be acceptable to him. http://marc.theaimsgroup.com/?l=linux-netdev&m=115138026614994&w=2 The current bcm43xx sources had already reduced the maximum delay to 100ms, and the d80211 (aka wireless-dev) driver had already dropped it to 10ms. So, I applied a simple patch to drop the delay to 10ms on this branch as well. Jeff, if a 10ms maximum delay is still acceptable to you, then please pull from the bcm43xx branch of wireless-2.6 into the upstream branch of netdev-2.6. Thanks, John P.S. FWIW, I'm still not totally happy w/ the (potential for a) long busy wait. But, this series of patches makes things better by 100x over what is currently in the tree. So, it seems worthwhile. I'll keep further reductions as an item on my TODO list, FWIW... :-) --- The following changes since commit 8f0f850e240df5bea027caeb1723142c50e37e57: Daniel Drake: softmac: Add MAINTAINERS entry are found in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-2.6.git bcm43xx John W. 
Linville: bcm43xx: fix-up build breakage from merging patches out of order bcm43xx: reduce mac_suspend delay loop count Larry Finger: bcm43xx: improved statistics bcm43xx: add missing mac_suspended initialization Michael Buesch: bcm43xx: suspend MAC while executing long pwork bcm43xx: lower mac_suspend udelay bcm43xx: fix mac_suspend refcount bcm43xx: init routine rewrite drivers/net/wireless/bcm43xx/bcm43xx.h | 52 +- drivers/net/wireless/bcm43xx/bcm43xx_debugfs.c | 46 ++ drivers/net/wireless/bcm43xx/bcm43xx_debugfs.h |1 drivers/net/wireless/bcm43xx/bcm43xx_main.c| 687 ++-- drivers/net/wireless/bcm43xx/bcm43xx_main.h|3 drivers/net/wireless/bcm43xx/bcm43xx_sysfs.c | 70 ++ drivers/net/wireless/bcm43xx/bcm43xx_wx.c | 28 + drivers/net/wireless/bcm43xx/bcm43xx_xmit.c|5 8 files changed, 565 insertions(+), 327 deletions(-) diff --git a/drivers/net/wireless/bcm43xx/bcm43xx.h b/drivers/net/wireless/bcm43xx/bcm43xx.h index ee6571e..c6ee1e9 100644 --- a/drivers/net/wireless/bcm43xx/bcm43xx.h +++ b/drivers/net/wireless/bcm43xx/bcm43xx.h @@ -504,6 +504,12 @@ struct bcm43xx_phyinfo { * This lock is only used by bcm43xx_phy_{un}lock() */ spinlock_t lock; + + /* Firmware. */ + const struct firmware *ucode; + const struct firmware *pcm; + const struct firmware *initvals0; + const struct firmware *initvals1; }; @@ -593,12 +599,14 @@ struct bcm43xx_coreinfo { u8 available:1, enabled:1, initialized:1; - /** core_id ID number */ - u16 id; /** core_rev revision number */ u8 rev; /** Index number for _switch_core() */ u8 index; + /** core_id ID number */ + u16 id; + /** Core-specific data. */ + void *priv; }; /* Additional information for each 80211 core. */ @@ -647,7 +655,10 @@ enum { BCM43xx_STAT_RESTARTING,/* controller_restart() called. 
*/ }; #define bcm43xx_status(bcm)atomic_read(&(bcm)->init_status) -#define bcm43xx_set_status(bcm, stat) atomic_set(&(bcm)->init_status, (stat)) +#define bcm43xx_set_status(bcm, stat) do {\ + atomic_set(&(bcm)->init_status, (stat));\ + smp_wmb(); \ + } while (0) /**** THEORY OF LOCKING *** * @@ -721,10 +732,6 @@ #endif struct bcm43xx_coreinfo core_80211[ BCM43xx_MAX_80211_CORES ]; /* Additional information, specific to the 80211 cores. */ struct bcm43xx_coreinfo_80211 core_80211_ext[ BCM43xx_MAX_80211_CORES ]; - /* Index of the current 80211 core. If current_core is not -* an 80211 core, this is -1. -*/ - int current_80211_core_idx; /* Number of available 80211 cores. */ int nr_80211_available; @@ -737,6 +744,8 @@ #endif u32 irq_savedstate; /* Link Quality calculation context. */ struct bcm43xx_noise_calculation noisecalc; + /* if > 0 MAC is suspended. if == 0 MAC is enabled. */ + int mac_suspended; /* Threshold values. */ //TODO: The RTS thr has to be _used_. Currently, it is only set via WX. @@ -759,12 +768,6 @@ #endif struct bcm43xx_key key[54]; u8 default_key_idx; - /* Firmware. */ - const struct firmware *ucode; - const struct firmware *pcm; - const struct firmware *initvals0; - const struct firmware
[PATCH 2/2] forcedeth: mac address corrected
This patch will correct the mac address and set a flag to indicate that it is already corrected in case nv_probe is called again. For example, when you use kexec to restart the kernel. Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]> --- orig-2.6/drivers/net/forcedeth.c2006-07-06 15:06:27.0 -0400 +++ new-2.6/drivers/net/forcedeth.c 2006-07-06 15:06:58.0 -0400 @@ -109,6 +109,7 @@ * 0.54: 21 Mar 2006: Fix spin locks for multi irqs and cleanup. * 0.55: 22 Mar 2006: Add flow control (pause frame). * 0.56: 22 Mar 2006: Additional ethtool config and moduleparam support. + * 0.57: 14 May 2006: Mac address set in probe/remove and order corrections. * * Known bugs: * We suspect that on some hardware no TX done interrupts are generated. @@ -120,7 +121,7 @@ * DEV_NEED_TIMERIRQ will not harm you on sane hardware, only generating a few * superfluous timer interrupts from the nic. */ -#define FORCEDETH_VERSION "0.56" +#define FORCEDETH_VERSION "0.57" #define DRV_NAME "forcedeth" #include @@ -262,7 +263,8 @@ NvRegRingSizes = 0x108, #define NVREG_RINGSZ_TXSHIFT 0 #define NVREG_RINGSZ_RXSHIFT 16 - NvRegUnknownTransmitterReg = 0x10c, + NvRegTransmitPoll = 0x10c, +#define NVREG_TRANSMITPOLL_MAC_ADDR_REV0x8000 NvRegLinkSpeed = 0x110, #define NVREG_LINKSPEED_FORCE 0x1 #define NVREG_LINKSPEED_10 1000 @@ -1178,7 +1180,7 @@ KERN_INFO "nv_stop_tx: TransmitterStatus remained busy"); udelay(NV_TXSTOP_DELAY2); - writel(0, base + NvRegUnknownTransmitterReg); + writel(readl(base + NvRegTransmitPoll) & NVREG_TRANSMITPOLL_MAC_ADDR_REV, base + NvRegTransmitPoll); } static void nv_txrx_reset(struct net_device *dev) @@ -3917,7 +3919,7 @@ oom = nv_init_ring(dev); writel(0, base + NvRegLinkSpeed); - writel(0, base + NvRegUnknownTransmitterReg); + writel(readl(base + NvRegTransmitPoll) & NVREG_TRANSMITPOLL_MAC_ADDR_REV, base + NvRegTransmitPoll); nv_txrx_reset(dev); writel(0, base + NvRegUnknownSetupReg6); @@ -4082,7 +4084,7 @@ unsigned long addr; u8 __iomem *base; int err, i; - u32 powerstate; + u32 
powerstate, txreg; dev = alloc_etherdev(sizeof(struct fe_priv)); err = -ENOMEM; @@ -4269,12 +4271,30 @@ np->orig_mac[0] = readl(base + NvRegMacAddrA); np->orig_mac[1] = readl(base + NvRegMacAddrB); - dev->dev_addr[0] = (np->orig_mac[1] >> 8) & 0xff; - dev->dev_addr[1] = (np->orig_mac[1] >> 0) & 0xff; - dev->dev_addr[2] = (np->orig_mac[0] >> 24) & 0xff; - dev->dev_addr[3] = (np->orig_mac[0] >> 16) & 0xff; - dev->dev_addr[4] = (np->orig_mac[0] >> 8) & 0xff; - dev->dev_addr[5] = (np->orig_mac[0] >> 0) & 0xff; + /* check the workaround bit for correct mac address order */ + txreg = readl(base + NvRegTransmitPoll); + if (txreg & NVREG_TRANSMITPOLL_MAC_ADDR_REV) { + /* mac address is already in correct order */ + dev->dev_addr[0] = (np->orig_mac[0] >> 0) & 0xff; + dev->dev_addr[1] = (np->orig_mac[0] >> 8) & 0xff; + dev->dev_addr[2] = (np->orig_mac[0] >> 16) & 0xff; + dev->dev_addr[3] = (np->orig_mac[0] >> 24) & 0xff; + dev->dev_addr[4] = (np->orig_mac[1] >> 0) & 0xff; + dev->dev_addr[5] = (np->orig_mac[1] >> 8) & 0xff; + } else { + /* need to reverse mac address to correct order */ + dev->dev_addr[0] = (np->orig_mac[1] >> 8) & 0xff; + dev->dev_addr[1] = (np->orig_mac[1] >> 0) & 0xff; + dev->dev_addr[2] = (np->orig_mac[0] >> 24) & 0xff; + dev->dev_addr[3] = (np->orig_mac[0] >> 16) & 0xff; + dev->dev_addr[4] = (np->orig_mac[0] >> 8) & 0xff; + dev->dev_addr[5] = (np->orig_mac[0] >> 0) & 0xff; + /* set permanent address to be correct aswell */ + np->orig_mac[0] = (dev->dev_addr[0] << 0) + (dev->dev_addr[1] << 8) + + (dev->dev_addr[2] << 16) + (dev->dev_addr[3] << 24); + np->orig_mac[1] = (dev->dev_addr[4] << 0) + (dev->dev_addr[5] << 8); + writel(txreg|NVREG_TRANSMITPOLL_MAC_ADDR_REV, base + NvRegTransmitPoll); + } memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len); if (!is_valid_ether_addr(dev->perm_addr)) {
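The two byte orders handled in the patch above can be tried out in a small userspace sketch. The shifts mirror the ones in the diff; the function names `mac_from_regs` and `mac_to_regs` are hypothetical, and `MAC_ADDR_REV` simply copies the value the patch gives for NVREG_TRANSMITPOLL_MAC_ADDR_REV. No hardware access is modelled.

```c
#include <stdint.h>

#define MAC_ADDR_REV 0x8000u  /* value of NVREG_TRANSMITPOLL_MAC_ADDR_REV in the patch */

/* Unpack a MAC address from the two address registers. When the flag is
 * set in txreg the registers already hold the corrected order; otherwise
 * the bytes are stored reversed and have to be swapped around. */
void mac_from_regs(uint32_t reg_a, uint32_t reg_b, uint32_t txreg,
                   uint8_t mac[6])
{
    if (txreg & MAC_ADDR_REV) {
        mac[0] = (reg_a >> 0) & 0xff;
        mac[1] = (reg_a >> 8) & 0xff;
        mac[2] = (reg_a >> 16) & 0xff;
        mac[3] = (reg_a >> 24) & 0xff;
        mac[4] = (reg_b >> 0) & 0xff;
        mac[5] = (reg_b >> 8) & 0xff;
    } else {
        mac[0] = (reg_b >> 8) & 0xff;
        mac[1] = (reg_b >> 0) & 0xff;
        mac[2] = (reg_a >> 24) & 0xff;
        mac[3] = (reg_a >> 16) & 0xff;
        mac[4] = (reg_a >> 8) & 0xff;
        mac[5] = (reg_a >> 0) & 0xff;
    }
}

/* Pack a MAC address into register values in the corrected order. */
void mac_to_regs(const uint8_t mac[6], uint32_t *reg_a, uint32_t *reg_b)
{
    *reg_a = ((uint32_t)mac[0] << 0) | ((uint32_t)mac[1] << 8) |
             ((uint32_t)mac[2] << 16) | ((uint32_t)mac[3] << 24);
    *reg_b = ((uint32_t)mac[4] << 0) | ((uint32_t)mac[5] << 8);
}
```

For example, register values 0x22334455 and 0x0011 with the flag clear decode to 00:11:22:33:44:55; packing that address gives 0x33221100 and 0x5544, which decode to the same address once the flag is set. The explicit `uint32_t` casts in `mac_to_regs` are a choice made for this sketch so that a byte with its top bit set is never shifted as a signed `int`.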
[PATCH 1/2] forcedeth: move mac address setup/teardown
This patch moves the mac address setup/teardown to the nv_probe/nv_remove functions. This fixes WOL wakeup since on nv_close we would reverse the mac address. Also, bonding driver will reset address after nv_close is called. Signed-Off-By: Ayaz Abdulla <[EMAIL PROTECTED]> --- orig-2.6/drivers/net/forcedeth.c2006-07-06 15:05:31.0 -0400 +++ new-2.6/drivers/net/forcedeth.c 2006-07-06 15:05:54.0 -0400 @@ -3895,10 +3895,9 @@ dprintk(KERN_DEBUG "nv_open: begin\n"); - /* 1) erase previous misconfiguration */ + /* erase previous misconfiguration */ if (np->driver_data & DEV_HAS_POWER_CNTRL) nv_mac_reset(dev); - /* 4.1-1: stop adapter: ignored, 4.3 seems to be overkill */ writel(NVREG_MCASTADDRA_FORCE, base + NvRegMulticastAddrA); writel(0, base + NvRegMulticastAddrB); writel(0, base + NvRegMulticastMaskA); @@ -3913,7 +3912,7 @@ if (np->pause_flags & NV_PAUSEFRAME_TX_CAPABLE) writel(NVREG_TX_PAUSEFRAME_DISABLE, base + NvRegTxPauseFrame); - /* 2) initialize descriptor rings */ + /* initialize descriptor rings */ set_bufsize(dev); oom = nv_init_ring(dev); @@ -3924,15 +3923,11 @@ np->in_shutdown = 0; - /* 3) set mac address */ - nv_copy_mac_to_hw(dev); - - /* 4) give hw rings */ + /* give hw rings */ setup_hw_rings(dev, NV_SETUP_RX_RING | NV_SETUP_TX_RING); writel( ((np->rx_ring_size-1) << NVREG_RINGSZ_RXSHIFT) + ((np->tx_ring_size-1) << NVREG_RINGSZ_TXSHIFT), base + NvRegRingSizes); - /* 5) continue setup */ writel(np->linkspeed, base + NvRegLinkSpeed); if (np->desc_ver == DESC_VER_1) writel(NVREG_TX_WM_DESC1_DEFAULT, base + NvRegTxWatermark); @@ -3950,7 +3945,6 @@ writel(NVREG_IRQSTAT_MASK, base + NvRegIrqStatus); writel(NVREG_MIISTAT_MASK2, base + NvRegMIIStatus); - /* 6) continue setup */ writel(NVREG_MISC1_FORCE | NVREG_MISC1_HD, base + NvRegMisc1); writel(readl(base + NvRegTransmitterStatus), base + NvRegTransmitterStatus); writel(NVREG_PFF_ALWAYS, base + NvRegPacketFilterFlags); @@ -4076,12 +4070,6 @@ if (np->wolenabled) nv_start_rx(dev); - /* special op: write back the 
misordered MAC address - otherwise -* the next nv_probe would see a wrong address. -*/ - writel(np->orig_mac[0], base + NvRegMacAddrA); - writel(np->orig_mac[1], base + NvRegMacAddrB); - /* FIXME: power down nic */ return 0; @@ -4309,6 +4297,9 @@ dev->dev_addr[0], dev->dev_addr[1], dev->dev_addr[2], dev->dev_addr[3], dev->dev_addr[4], dev->dev_addr[5]); + /* set mac address */ + nv_copy_mac_to_hw(dev); + /* disable WOL */ writel(0, base + NvRegWakeUpFlags); np->wolenabled = 0; @@ -4421,9 +4412,17 @@ static void __devexit nv_remove(struct pci_dev *pci_dev) { struct net_device *dev = pci_get_drvdata(pci_dev); + struct fe_priv *np = netdev_priv(dev); + u8 __iomem *base = get_hwbase(dev); unregister_netdev(dev); + /* special op: write back the misordered MAC address - otherwise +* the next nv_probe would see a wrong address. +*/ + writel(np->orig_mac[0], base + NvRegMacAddrA); + writel(np->orig_mac[1], base + NvRegMacAddrB); + /* free all structures */ free_rings(dev); iounmap(get_hwbase(dev));
Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
Thomas Graf wrote: > * Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 17:46 > >>Shouldn't all these (struct fib_rule_hdr included) actually be defined >>in include/linux/rtnetlink.h? > > > We used to stuff everything into rtnetlink.h for no good reason. Having > independant include/linux/.h to export the interface to > userspace and include/net/.h to export the kernel interface > instead of contributing to the ifdef hell seems a lot cleaner to me. I agree, but then we should also split up rtnetlink.h. Having one special case will just make it harder to find.
Re: Linux TCP in the presence of delays or drops...
Hi, it would be great if some of you could spare a few minutes and take a look at the traces I provided (see below for the original posting). I just had a couple of things to add that I noticed in Linux TCP behaviour and have not seen documented anywhere else (or which I might have misread :-)). Below I have also given yet another trace that illustrates one of the Linux TCP behaviours I am having trouble understanding. - If multiple timeouts occur for one packet, then even if we are using the timestamp option or FRTO, Linux TCP is not able to detect spurious retransmissions. Linux TCP is able to detect spurious retransmissions only for a single timeout for one packet, or for fast retransmissions caused by duplicate ACK reception. I have some traces that show this behaviour; let me know if you are interested. - In the cases where TCP timestamps or FRTO are not able to detect spurious retransmissions, the performance degrades even more than when the TCP timestamp or FRTO options are not used. I also have one additional trace that shows the problem of an unexplained pause in the TCP sender during retransmission, which I found really hard to explain. It is similar to case 1), but this time I am doing an upgrade instead, from a 384kbps connection to a 1Mbps connection. The traces can be found at http://kom.aau.dk/~oumer/drop_0_delay_UPGRADE_SERVER.dat http://kom.aau.dk/~oumer/drop_0_delay_UPGRADE_CLIENT.dat and the tcptrace time sequence curve can be found in http://kom.aau.dk/~oumer/drop_0_delay_UPGRADE.ps As you can see from the server side trace (all the packets shown here are retransmissions, because I flushed the sender's buffer at time instant 17:26:24.657): 17:26:26.261972 2267693336:2267694796(1460) ack 3498775069 win 5840 (DF) 17:26:26.319180 . ack 2267694796 win 61320 (DF) [tos 0x8] 17:26:26.321961 2267694796:2267696256(1460) ack 3498775069 win 5840 (DF) 17:26:26.379160 .
ack 2267696256 win 61320 (DF) [tos 0x8] 17:26:26.381940 . 2267696256:2267697716(1460) ack 3498775069 win 5840 (DF) 17:26:26.439138 . ack 2267697716 win 61320 (DF) [tos 0x8] 17:26:26.441925 2267697716:2267699176(1460) ack 3498775069 win 5840 (DF) 17:26:26.499144 ack 2267699176 win 61320 (DF) [tos 0x8] 17:26:28.234327 2267699176:2267700636(1460) ack 3498775069 win 5840 (DF) Even though the server got an ACK with ack number 2267699176 at time instant 17:26:26.49, it waited until 17:26:28.234 to resend the packet, which is around 1.73 seconds. I have checked with other traces where I introduced delay, and for the link the first timeout occurs after 1.73 seconds, which seems to be the RTO at that time; so, for no apparent reason, TCP is waiting for a timeout. Case 1 is quite similar, but there the retransmissions were triggered by a timeout to begin with; here the retransmissions are triggered by duplicate ACKs. In case 1, described below, this abnormal behaviour occurred after only a couple of packets were retransmitted; here it took quite a few retransmissions before the same problem happened. Any insight into this is greatly appreciated! Thanks in advance, Oumer Oumer Teyeb wrote: Hi all, I have some questions regarding Linux TCP in the presence of delays or packet drops. It is a somewhat long mail, but there are only two or three questions; I just wanted to provide detailed information so that the problem is clear. Thanks for your patience! Best regards, Oumer Note that for the traces referred to here, SACK, timestamps, and FRTO are all disabled. 1) packet drops I have a trace where the TCP sender window is flushed and then the connection speed is changed from 1Mbps to 384kbps...
The trace files from both the client and the server side can be found at http://kom.aau.dk/~oumer/drop_0_delay_SERVER.dat http://kom.aau.dk/~oumer/drop_0_delay_CLIENT.dat and the tcptrace time sequence curve can be found in http://kom.aau.dk/~oumer/drop_0_delay.ps As can be seen from the plot and the trace files, at around 17:19:35.705733 the window was flushed (both the sender's and the receiver's), and hence packets with sequence numbers from 1840001135 up to 1840058075 were dropped (39 packets). The ACK for 1840001135 was also dropped (this can be seen from the traces, as it appears in the client trace but not in the server trace). Since there were still packets to be sent, the sender keeps sending a few more packets, and when a few of them are received (from the client side trace): 17:19:35.938017 1840059535:1840060995(1460) ack 3059152863 win 5840 (DF)... 17:19:35.938028 ack 1840001135 win 62780 (DF) [tos 0x8]...first ACK that is going to be received by the sender 17:19:35.969316 1840060995:1840062455(1460) ack 3059152863 win 5840 (DF) 17:19:35.969325 1840001135 win 62780 (DF) [tos 0x8]first duplicate ACK 17:19:36.000519 1840062455:1840063915(1460) ack 3059152863 win 5840 (DF) 17:19:3
[RFC] irqbalance: Mark in-kernel irqbalance as obsolete, set to N by default
We've recently seen a number of user bug reports against e1000 that the in-kernel irqbalance code is detrimental to network latency. The algorithm keeps swapping irq's for NICs from cpu to cpu causing extremely high network latency (>1000ms). Another NIC driver (cxgb) already has severe warnings in their documentation file against using CONFIG_IRQBALANCE, but this is a general problem for all NIC drivers and other subsystems. This is especially so with cpufreq scaling where the system is slowed down and the migrations take much longer. I suggest that the in-kernel irqbalance is phased out, by marking it OBSOLETE first and (perhaps) removing the code later. The userspace irqbalance daemon written by Arjan van de Ven does a wonderful job and should be used instead. Signed-off-by: Auke Kok <[EMAIL PROTECTED]> --- Kconfig | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) --- diff --git a/arch/i386/Kconfig b/arch/i386/Kconfig index daa75ce..5a40cfe 100644 --- a/arch/i386/Kconfig +++ b/arch/i386/Kconfig @@ -690,12 +690,19 @@ config EFI kernel should continue to boot on existing non-EFI platforms. config IRQBALANCE - bool "Enable kernel irq balancing" + bool "Enable kernel irq balancing (obsolete)" depends on SMP && X86_IO_APIC - default y + default n help - The default yes will allow the kernel to do irq load balancing. - Saying no will keep the kernel from doing irq load balancing. + The kernel irq balance will migrate interrupts between cpu's + constantly, which may help reduce load in some cases. It is not + beneficial for latency however, and a user-space daemon is available + that does a much better job. + + The default no will keep the kernel from doing irq load balancing. + Say yes will allow the kernel to do irq load balancing. + + If unsure, say N. # turning this on wastes a bunch of space. # Summit needs it only when NUMA is on
Re: Linville's L2 rant... -- Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Monday 31 July 2006 14:30, you wrote: > (This is not directed at Christophe, or anyone in particular...) > > > > Am I the only one that thinks that our handling of LAN L2 stuff > is at best a little "too" flexible (and at worst a collection of > nasty hacks)? > > I mean, do we really need both the ability to bond multiple vlan > interfaces AND the ability to have vlan interfaces on top of a bond? > How many people really appreciate the subtle(?) differences? > > Then throw bridging into the mix! If I'm using VLANs and bonds in > a bridged environment, do I bridge the bonds, or bond the bridges? In all honesty, you cannot bond bridges :-p > Do the VLANs come before the bonds? after the bridges? or somewhere > in-between? Do all these combinations even work together? Who has > the definitive answer (besides the code itself)? > > I have no doubt that there are plenty of opportunities for cleverness > here (and no doubt dragons too). I just doubt that most of them > are worth the complexities introduced by our current collection of > "transparently" stackable pseudo-drivers and strategically placed hacks > (e.g. skb_bond). All that, and it still isn't clear to me how we > can cleanly accomodate 802.1s (which adds VLAN awareness to bridging). > > Do we hold the view that our L2 code is on par with the rest of > our code? Is there an appetite for a clean-up? Or is it just me? A vlan capable bridge with trunk ports and access ports would be nice :-p I think the current code is nice. You need it to properly support virtualization and I find it very useful where I work to have this option. Regards, Christophe
Re: Runtime power management for network interfaces
Randy.Dunlap wrote: > On Tue, 25 Jul 2006 09:20:06 -0700 Auke Kok wrote: > > Alan Stern wrote: > > > During a Power Management session at the Ottawa Linux Symposium, it was generally agreed that network interface drivers ought to automatically suspend their devices (if possible) whenever: > > > (1) The interface is ifconfig'ed down, or > > > (2) No link is available. > > > Presumably (1) should be easy enough to implement. (2) might or might not be feasible, depending on how much WOL support is available. (It might not be feasible at all for wireless networking.) Still, there can be no question that it would be a Good Thing for laptops to power-down their ethernet controllers when the network cable is unplugged. > > > Has any progress been made in this direction? If not, a natural approach would be to start with a reference implementation in one driver which could then be copied to other drivers. > > Intel's newer e1000's (ich7 onboard e1000 and newer versions for instance) already support this feature partially - the MAC stays on but the PHY can be powered off when no link is present. In order to enable this feature you will need to turn it on explicitly at load time: > > modprobe e1000 SmartPowerDownEnable=1 > Please add that to Documentation/networking/e1000.txt. I'm long overdue with documentation updates ATM, I'll see if I can fix that :) Cheers, Auke
Re: Runtime power management for network interfaces
On Tue, 25 Jul 2006 11:59:52 -0400 (EDT) Alan Stern <[EMAIL PROTECTED]> wrote: > During a Power Management session at the Ottawa Linux Symposium, it was > generally agreed that network interface drivers ought to automatically > suspend their devices (if possible) whenever: > > (1) The interface is ifconfig'ed down, or > > (2) No link is available. This is hard because most of the power may be consumed by the PHY interface and it needs to be alive to see link. > > Presumably (1) should be easy enough to implement. (2) might or might not > be feasible, depending on how much WOL support is available. (It might > not be feasible at all for wireless networking.) Still, there can be no > question that it would be a Good Thing for laptops to power-down their > ethernet controllers when the network cable is unplugged. > > Has any progress been made in this direction? If not, a natural approach > would be to start with a reference implementation in one driver which > could then be copied to other drivers. > The problem is not generic, it really is specific to each device. We have all the necessary infrastructure to do the right thing in the network device driver, but in many cases we don't have the code or the technical information to do proper power management. -- Stephen Hemminger <[EMAIL PROTECTED]> "And in the Packet there writ down that doome"
Re: [RFC] gre: transparent ethernet bridging
On Mon, 31 Jul 2006 20:06:41 +1000 Philip Craig <[EMAIL PROTECTED]> wrote: > This patch implements transparent ethernet bridging for gre tunnels. > There are a few outstanding issues. Why not use existing bridge code? > There is no way for userspace to select the type of gre tunnel. The > #if 0 near the top of the patch forces all gre tunnels to be bridges. > The problem is that userspace uses an IPPROTO_ to select the type of > tunnel, but both types of gre tunnel are IPPROTO_GRE. I can't see > anything else in struct ip_tunnel_parm that could be used to select > this. One approach that I've seen mentioned in the archives is to add > a netlink interface to replace the tunnel ioctls. > > Network loops are bad. See the comments at the top of ip_gre.c for > a description of how gre tunnels handle this normally. But for gre > bridges, we don't want to copy the ttl (it breaks routing protocols), > and we don't want to force DF (we want to bridge 1500 byte packets). > I couldn't think of any solution for this. > > Some routers set LLC_SAP_BSPAN in the gre protocol field, and then > give the bpdu packet without any other ethernet/llc header. This patch > currently tries to fake the ethernet/llc header before passing the > packet up, but it is buggy (mac addresses are wrong at least). Maybe a > better approach is to call directly into the bridging code. I didn't try > that at first because it isn't modular, and may break other things that > want to see the packet. Existing bridge code already has spanning tree. 
> --- linux-2.6.x/net/ipv4/ip_gre.c 18 Jun 2006 23:30:56 - 1.1.1.33 > +++ linux-2.6.x/net/ipv4/ip_gre.c 31 Jul 2006 09:57:41 - > @@ -30,6 +30,8 @@ > #include > #include > #include > +#include > +#include > > #include > #include > @@ -41,6 +43,8 @@ > #include > #include > #include > +#include > +#include > > #ifdef CONFIG_IPV6 > #include > @@ -119,6 +123,7 @@ > > static int ipgre_tunnel_init(struct net_device *dev); > static void ipgre_tunnel_setup(struct net_device *dev); > +static void ipgre_ether_tunnel_setup(struct net_device *dev); > > /* Fallback tunnel: no source, no destination, no key, no options */ > > @@ -274,7 +279,11 @@ static struct ip_tunnel * ipgre_tunnel_l > goto failed; > } > > +#if 0 > dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup); > +#else > + dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup); > +#endif "Do, or do not there is no try" > +__be16 ipgre_type_trans(struct sk_buff *skb, int offset) > +{ > + u8 *h = skb->data; > + __be16 flags = *(__be16*)h; > + __be16 proto = *(__be16*)(h + 2); > + > + /* WCCP version 1 and 2 protocol decoding. > + * - Change protocol to IP > + * - When dealing with WCCPv2, Skip extra 4 bytes in GRE header > + */ > + if (flags == 0 && > + proto == __constant_htons(ETH_P_WCCP)) { > + proto = __constant_htons(ETH_P_IP); > + if ((*(h + offset) & 0xF0) != 0x40) > + offset += 4; > + } Don't use __constant_htons() except in initializers and switch cases (where gcc is too stupid to optimize the macro). -- Stephen Hemminger <[EMAIL PROTECTED]> "And in the Packet there writ down that doome"
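The WCCP decoding quoted above, rewritten along the lines of the review comment with plain htons() in ordinary expressions, might look roughly like this in userspace. It is a sketch under assumptions: `gre_type_trans` is a hypothetical name, it works on a raw byte buffer instead of an sk_buff, it passes the offset by pointer so the caller can see the adjustment, and the EtherType values are redefined locally to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_P_IP   0x0800  /* IPv4 EtherType */
#define ETH_P_WCCP 0x883E  /* Web-cache coordination protocol EtherType */

/* Inspect the first four bytes of a GRE header (flags, then protocol,
 * both in network byte order) and return the protocol to hand upward.
 * WCCP is rewritten to IP; when the byte at *offset does not start an
 * IPv4 header, the extra 4-byte WCCPv2 header is skipped as well. */
uint16_t gre_type_trans(const uint8_t *h, size_t *offset)
{
    uint16_t flags, proto;

    /* memcpy avoids unaligned 16-bit loads from the packet buffer. */
    memcpy(&flags, h, sizeof(flags));
    memcpy(&proto, h + 2, sizeof(proto));

    if (flags == 0 && proto == htons(ETH_P_WCCP)) {
        proto = htons(ETH_P_IP);
        if ((h[*offset] & 0xF0) != 0x40)
            *offset += 4;
    }
    return proto;
}
```

A header whose payload starts with 0x45 at offset 4 leaves the offset unchanged, while a header with four extra bytes before the IPv4 header moves the offset to 8. Other protocol values, such as 0x6558, are returned untouched.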
Re: IBM (Lenovo) T60: e1000 driver high latency
Hello Auke, > > CONFIG_IRQBALANCE=y thanks for the feedback. The behaviour improoved. In my first tests it wasn't so good. But now it seems perfect: (thinkpad) [~] ping 131.188.30.102 PING 131.188.30.102 (131.188.30.102) 56(84) bytes of data. 64 bytes from 131.188.30.102: icmp_seq=1 ttl=64 time=419 ms 64 bytes from 131.188.30.102: icmp_seq=2 ttl=64 time=0.264 ms 64 bytes from 131.188.30.102: icmp_seq=3 ttl=64 time=0.701 ms 64 bytes from 131.188.30.102: icmp_seq=4 ttl=64 time=0.630 ms 64 bytes from 131.188.30.102: icmp_seq=5 ttl=64 time=0.710 ms 64 bytes from 131.188.30.102: icmp_seq=6 ttl=64 time=0.638 ms 64 bytes from 131.188.30.102: icmp_seq=7 ttl=64 time=0.588 ms 64 bytes from 131.188.30.102: icmp_seq=8 ttl=64 time=0.517 ms 64 bytes from 131.188.30.102: icmp_seq=9 ttl=64 time=0.445 ms 64 bytes from 131.188.30.102: icmp_seq=10 ttl=64 time=0.374 ms --- 131.188.30.102 ping statistics --- 10 packets transmitted, 10 received, 0% packet loss, time 8996ms rtt min/avg/max/mdev = 0.264/42.447/419.606/125.719 ms (thinkpad) [~] ping 131.188.30.102 PING 131.188.30.102 (131.188.30.102) 56(84) bytes of data. 64 bytes from 131.188.30.102: icmp_seq=1 ttl=64 time=0.547 ms 64 bytes from 131.188.30.102: icmp_seq=2 ttl=64 time=0.502 ms 64 bytes from 131.188.30.102: icmp_seq=3 ttl=64 time=0.402 ms 64 bytes from 131.188.30.102: icmp_seq=4 ttl=64 time=0.329 ms --- 131.188.30.102 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 2998ms rtt min/avg/max/mdev = 0.329/0.445/0.547/0.085 ms (thinkpad) [~] ping 131.188.30.102 PING 131.188.30.102 (131.188.30.102) 56(84) bytes of data. 
64 bytes from 131.188.30.102: icmp_seq=1 ttl=64 time=0.301 ms 64 bytes from 131.188.30.102: icmp_seq=2 ttl=64 time=0.753 ms 64 bytes from 131.188.30.102: icmp_seq=3 ttl=64 time=0.681 ms 64 bytes from 131.188.30.102: icmp_seq=4 ttl=64 time=0.609 ms 64 bytes from 131.188.30.102: icmp_seq=5 ttl=64 time=0.538 ms 64 bytes from 131.188.30.102: icmp_seq=6 ttl=64 time=0.466 ms 64 bytes from 131.188.30.102: icmp_seq=7 ttl=64 time=0.374 ms 64 bytes from 131.188.30.102: icmp_seq=8 ttl=64 time=0.308 ms --- 131.188.30.102 ping statistics --- 8 packets transmitted, 8 received, 0% packet loss, time 6993ms rtt min/avg/max/mdev = 0.301/0.503/0.753/0.161 ms (thinkpad) [~] ping www.heise.de PING www.heise.de (193.99.144.85) 56(84) bytes of data. 64 bytes from www.heise.de (193.99.144.85): icmp_seq=1 ttl=246 time=1019 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=2 ttl=246 time=15.8 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=3 ttl=246 time=1000 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=4 ttl=246 time=360 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=5 ttl=246 time=39.1 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=6 ttl=246 time=360 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=7 ttl=246 time=1000 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=8 ttl=246 time=360 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=9 ttl=246 time=360 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=10 ttl=246 time=360 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=11 ttl=246 time=1000 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=12 ttl=246 time=319 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=13 ttl=246 time=20.5 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=14 ttl=246 time=350 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=15 ttl=246 time=792 ms 64 bytes from www.heise.de (193.99.144.85): icmp_seq=16 ttl=246 time=350 ms 64 bytes from www.heise.de (193.99.144.85): 
icmp_seq=17 ttl=246 time=1000 ms --- www.heise.de ping statistics --- 18 packets transmitted, 17 received, 5% packet loss, time 17013ms rtt min/avg/max/mdev = 15.835/512.385/1019.753/360.201 ms, pipe 2 (thinkpad) [~] ping www.google.de PING www.l.google.com (66.249.85.104) 56(84) bytes of data. 64 bytes from 66.249.85.104: icmp_seq=1 ttl=246 time=732 ms 64 bytes from 66.249.85.104: icmp_seq=2 ttl=246 time=999 ms 64 bytes from 66.249.85.104: icmp_seq=3 ttl=246 time=1000 ms 64 bytes from 66.249.85.104: icmp_seq=4 ttl=246 time=400 ms --- www.l.google.com ping statistics --- 5 packets transmitted, 4 received, 20% packet loss, time 4388ms rtt min/avg/max/mdev = 400.939/783.424/1000.436/246.402 ms, pipe 2 (thinkpad) [~] uname -a Linux thinkpad 2.6.17.7 #3 SMP Mon Jul 31 17:44:21 CEST 2006 i686 GNU/Linux (thinkpad) [~] date Mon Jul 31 17:47:39 CEST 2006 (thinkpad) [~] ping www.google.de PING www.l.google.com (66.249.85.99) 56(84) bytes of data. 64 bytes from 66.249.85.99: icmp_seq=1 ttl=246 time=13.3 ms 64 bytes from 66.249.85.99: icmp_seq=2 ttl=246 time=13.4 ms 64 bytes from 66.249.85.99: icmp_seq=3 ttl=246 time=13.4 ms 64 bytes from 66.249.85.99: icmp_seq=4 ttl=246 time=13.3 ms 64 bytes from 66.249.85.99: icmp_seq=5 ttl=246 time=13.2 ms 64 bytes from 66.249.85.99: icmp_seq=6 ttl=246 time=13.6 ms 64 bytes from 66.249.85.99: icmp_seq=7 ttl=246 time=13.6 ms 64 bytes from 6
Re: eth2.100: received packet with own address as source address
On Mon, Jul 31, 2006 at 12:27:19AM -0400, David Coulson wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > I have a machine running 2.6.18-rc3 with a bridge config that looks like > this: > > cr1:~# brctl show > bridge name bridge id STP enabled interfaces > vlan100 36b0.0007e90f40c1 yes eth0.100 > eth2.100 > vlan101 5dc0.0007e90f40c1 yes eth0.101 > eth2.101 > vlan102 5dc0.00e08163c33f yes eth3 > vlan200 5dc0.0007e90f40c1 yes eth0.200 > eth2.200 > vlan201 5dc0.0007e90f40c1 yes eth0.201 > eth2.201 > vlan300 5dc0.0007e90f40c1 yes eth0.300 > eth2.300 > vlan301 5dc0.0007e90f40c1 yes eth0.301 > eth2.301 > vlan302 5dc0.0007e90f40c1 yes eth0.302 > eth2.302 > vlan303 5dc0.0007e90f40c1 yes eth0.303 > eth2.303 > > All bridges, except for vlan102, are running STP and appear to have > elected themselves the root bridge. > > I see this on the console: > > printk: 18 messages suppressed. > eth2.100: received packet with own address as source address > printk: 20 messages suppressed. > eth2.200: received packet with own address as source address > This usually means that you have a loop somewhere, I can't say specifically that this is the case with your current setup, but it would be interesting to capture traffic on that interface and determine what it looks like in more detail. > This repeats continuously, only indicating an issue on the two specific > ports mentioned above. This confuses me: > > 1) All VLANs (except for 102) are in an identical configuration > 2) No other VLANs exhibit the same kernel message > 3) I have another machine, running 2.6.18-rc2 with the same config and > no kernel message > > Is this message specifically related to BPDU frames, or is it pertaining > to any Ethernet frame on the port? > > Here is the STP config for one of the bridges. What's the next step to > troubleshoot this? Capture some of the suspicious traffic. 
> > vlan200 > bridge id 5dc0.0007e90f40c1 > designated root5dc0.0007e90f40c1 > root port 0path cost 0 > max age 20.00 bridge max age > 20.00 > hello time2.00 bridge hello time >2.00 > forward delay 5.00 bridge forward delay >5.00 > ageing time 300.01 > hello timer 1.67 tcn timer >0.00 > topology change timer 0.00 gc timer >0.03 > flags > > > eth0.200 (1) > port id8001state > forwarding > designated root5dc0.0007e90f40c1 path cost 19 > designated bridge 5dc0.0007e90f40c1 message age timer >0.00 > designated port8001forward delay timer >0.00 > designated cost 0hold timer >0.67 > flags > > eth2.200 (2) > port id8002state > blocking > designated root5dc0.0007e90f40c1 path cost 4 > designated bridge 5dc0.0007e90f40c1 message age timer > 19.67 > designated port8001forward delay timer >0.00 > designated cost 0hold timer >0.00 > flags > > > David > > - -- > David J. Coulson > email: [EMAIL PROTECTED] > web: http://www.davidcoulson.net/ > -BEGIN PGP SIGNATURE- > Version: GnuPG v1.4.3 (GNU/Linux) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org > > iD8DBQFEzYanTIgPQWnLowkRApT1AJ9yXl/O+rzacF+mpM7hhNtsEh/ufACfQCHk > mBGBxl5Iscj7vbFlM0IzY/Y= > =ZIs7 > -END PGP SIGNATURE- > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
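On the question of whether the message is specific to BPDUs: as far as I understand it, the warning is about the source MAC of any received frame matching one of the bridge's own addresses, which is why a loop is the usual suspect. Below is a minimal userspace sketch of that comparison, for use on captured frames. The function name and the _SKETCH constants are hypothetical, and this is not the bridge code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN_SKETCH 6   /* bytes in a MAC address */
#define ETH_HLEN_SKETCH 14  /* dst MAC + src MAC + ethertype */

/*
 * Returns 1 if the Ethernet source address of the frame equals the
 * local address, 0 otherwise (including frames too short to hold a header).
 * The source address sits right after the 6-byte destination address.
 */
static int frame_has_own_source(const uint8_t *frame, size_t len,
				const uint8_t own[ETH_ALEN_SKETCH])
{
	assert(frame != NULL && own != NULL);

	if (len < ETH_HLEN_SKETCH)
		return 0;
	return memcmp(frame + ETH_ALEN_SKETCH, own, ETH_ALEN_SKETCH) == 0;
}
```

Running something like this over a capture from eth2.100, with the bridge's address 00:07:e9:0f:40:c1 as the local address, would show which frames trigger the message and whether they are BPDUs or ordinary traffic.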
Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
* Herbert Xu <[EMAIL PROTECTED]> 2006-08-01 00:01 > Without a route cache, I think our only choice is to search through > all tables. The same thing applies to PMTU updates as well. I think PMTU etc. should be moved out of the route into some form of flow cache. It's currently using rt6_lookup(), which even goes through the rules. After Patrick's changes, doing a few thousand trie lookups in the worst case for every redirect might be acceptable, but doing so for every PMTU update could become an issue. > Actually, if we're adding policy routing, we should seriously consider > whether living without a routing cache is still viable or not because > the cost of a route lookup has just gone up. Absolutely.
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
Hello! > It does seem weird that IP output won't pay attention to Not so weird, actually. The logic was: only the initial skb allocation tries to reserve all the space, to avoid copies in the future. All other places just check that there is enough space for their immediate needs. If dev->hard_header() is NULL, it means that the stack does not need any space at all, so it does not need to worry. The right logic for reallocation would be: if (skb_headroom(skb) < space_which_I_need_now) { skb2 = skb_realloc_headroom(skb, space_for_future); } That logic was not followed exactly only because of laziness: each time some device is found that forgets to check for space, a reallocation gets added, so reallocation is now done in absolutely inappropriate places. For example, ip_forward() does not need to reallocate the skb when skb_headroom() < dev->hard_header_len. It does, and that is not good. A good example is the ipip tunnel. It sets: dev->hard_header_len = sizeof(iphdr) + LL_MAX_HEADER because it does not know what device will be used. That is a lot of space, and most likely it will not be used. So the initial allocation reserves lots of space, but the rest of the stack should not reallocate; the tunnel will take care of this itself. Alexey
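To make the "check for what I need now, reallocate for the future" rule concrete, here is a small userspace model of it. This is only a sketch: struct buf and the buf_* helpers are invented stand-ins for an skb and its headroom helpers, not kernel API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an skb: one allocation, data starts after the headroom. */
struct buf {
	unsigned char *head;  /* start of the allocation */
	unsigned char *data;  /* start of the packet bytes */
	size_t len;           /* number of packet bytes */
};

static int buf_init(struct buf *b, size_t headroom, const void *payload, size_t len)
{
	b->head = malloc(headroom + len + 1);
	if (!b->head)
		return -1;
	b->data = b->head + headroom;
	b->len = len;
	memcpy(b->data, payload, len);
	return 0;
}

static size_t buf_headroom(const struct buf *b)
{
	return (size_t)(b->data - b->head);
}

/*
 * Reallocate only if the immediate need is not met; when we do have to
 * copy, reserve the larger "future" amount so later layers need not copy again.
 * Returns 0 on success, -1 if the allocation fails (buffer left untouched).
 */
static int buf_ensure_headroom(struct buf *b, size_t need_now, size_t want_future)
{
	size_t room;
	unsigned char *nh;

	if (buf_headroom(b) >= need_now)
		return 0;

	room = want_future > need_now ? want_future : need_now;
	nh = malloc(room + b->len);
	if (!nh)
		return -1;
	memcpy(nh + room, b->data, b->len);
	free(b->head);
	b->head = nh;
	b->data = nh + room;
	return 0;
}

/* Prepend n bytes of header space; the caller must have ensured headroom first. */
static unsigned char *buf_push(struct buf *b, size_t n)
{
	assert(buf_headroom(b) >= n);
	b->data -= n;
	b->len += n;
	return b->data;
}
```

In this model a forwarding path would call buf_ensure_headroom() with only the bytes it is about to push, so a device that advertises a very large header length (as the ipip tunnel does) does not force a copy on every packet.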
Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
* Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 16:55 > > When locating routes for redirects only the main table is > > searched for now. Since policy rules will not be reversible > > it is unclear whether it makes sense to change this. > > This is a good point. You are absolutely correct about the policy rules. > > IIRC, I initially looked through all the tables, but skipped this > behavior when I rewrote the code for 2.6.11. Currently I'm once again > in favor of looping through them all. This is IMO at least closer to the > spirit of RFC 2461 section 8.3. where a host SHOULD update its > destination cache upon receiving a redirect. If we don't look through > all tables, we can't ensure this happens. I agree, it will depend on what way is being followed regarding a flow cache or route cache. > > +#define RT6_TABLE_UNSPEC RT_TABLE_UNSPEC > > +#define RT6_TABLE_MAIN RT_TABLE_MAIN > > +#define RT6_TABLE_LOCALRT6_TABLE_MAIN > > +#define RT6_TABLE_DFLT RT6_TABLE_MAIN > > +#define RT6_TABLE_INFO RT6_TABLE_MAIN > > IMO it's a bit inconsistent to define a separate table entry for Route > Information generated routes, but not Prefix Information based ones. > What do you say about adding a RT6_TABLE_PRFX? Sounds good. > > @@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in > > struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct > > net_device *dev) > > { > > struct rt6_info *rt; > > - struct fib6_node *fn; > > + struct fib6_table *table; > > > > - fn = &ip6_routing_table; > > + /* TODO: It might be better to search all tables */ > > + table = fib6_get_table(RT6_TABLE_DFLT); > > As long as the table for default routes is RT6_TABLE_DFLT and can't be > configured by the user, I think the correct behavior is just to search > RT6_TABLE_DFLT. I agree, I intended to remove that comment but missed it. 
Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
* Ville Nuorvala <[EMAIL PROTECTED]> 2006-07-31 17:46 > > Derived from net/ipv6/fib_rules.c > > do you mean net/ipv4/fib_rules.c or net/ipv6/fib6_rules.c? :-) Hehe, I meant net/ipv4/fib_rules.c :-) > > +struct fib_rule_hdr > > +{ > > + __u8 family; > > + __u8 dst_len; > > + __u8 src_len; > > + __u8 tos; > > + > > + __u8 table; > > + __u8 res1; /* reserved */ > > + __u8 res2; /* reserved */ > > + __u8 action; > > + > > + __u32 flags; > > +}; > > I'm wondering if this is guaranteed to be equivalent to struct rtmsg? > > struct rtmsg > { > unsigned char rtm_family; > unsigned char rtm_dst_len; > unsigned char rtm_src_len; > unsigned char rtm_tos; > > unsigned char rtm_table; /* Routing table id */ > unsigned char rtm_protocol; /* Routing protocol; see below */ > unsigned char rtm_scope; /* See below */ > unsigned char rtm_type; /* See below */ > > unsigned rtm_flags; > }; > > Won't we otherwise be breaking the existing userland interface? It is equivalent, but you're right: it would break userland interfaces otherwise. I've defined this new header to add implicit names and stop the confusion with unused fields. 
> > +enum > > +{ > > + FRA_UNSPEC, > > + FRA_DST,/* destination address */ > > + FRA_SRC,/* source address */ > > + FRA_IFNAME, /* interface name */ > > + FRA_UNUSED1, > > + FRA_UNUSED2, > > + FRA_PRIORITY, /* priority/preference */ > > + FRA_UNUSED3, > > + FRA_UNUSED4, > > + FRA_UNUSED5, > > + FRA_FWMARK, /* netfilter mark (IPv4) */ > > + FRA_FLOW, /* flow/class id */ > > + __FRA_MAX > > +}; > > + > > +#define FRA_MAX (__FRA_MAX - 1) > > + > > +enum > > +{ > > + FR_ACT_UNSPEC, > > + FR_ACT_TO_TBL, /* Pass to fixed table */ > > + FR_ACT_RES1, > > + FR_ACT_RES2, > > + FR_ACT_RES3, > > + FR_ACT_RES4, > > + FR_ACT_BLACKHOLE, /* Drop without notification */ > > + FR_ACT_UNREACHABLE, /* Drop with ENETUNREACH */ > > + FR_ACT_PROHIBIT,/* Drop with EACCES */ > > + __FR_ACT_MAX, > > +}; > > + > > +#define FR_ACT_MAX (__FR_ACT_MAX - 1) > > + > > +#endif > > Shouldn't all these (struct fib_rule_hdr included) actually be defined > in include/linux/rtnetlink.h? We used to stuff everything into rtnetlink.h for no good reason. Having independant include/linux/.h to export the interface to userspace and include/net/.h to export the kernel interface instead of contributing to the ifdef hell seems a lot cleaner to me. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
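The equivalence between the two headers discussed above can be checked at compile time. The following is a sketch under stated assumptions: the two structs are userspace copies of the layouts quoted in this thread (renamed with a _sketch suffix, with uint8_t/uint32_t standing in for __u8/__u32), SAME_SLOT and action_from_rtmsg() are made-up helpers, and "unsigned" is assumed to be 32 bits wide as on the usual Linux ABIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy of the proposed header layout from the patch (illustrative name). */
struct fib_rule_hdr_sketch {
	uint8_t  family;
	uint8_t  dst_len;
	uint8_t  src_len;
	uint8_t  tos;
	uint8_t  table;
	uint8_t  res1;    /* reserved */
	uint8_t  res2;    /* reserved */
	uint8_t  action;
	uint32_t flags;
};

/* Copy of the struct rtmsg layout quoted in the review (illustrative name). */
struct rtmsg_sketch {
	unsigned char rtm_family;
	unsigned char rtm_dst_len;
	unsigned char rtm_src_len;
	unsigned char rtm_tos;
	unsigned char rtm_table;
	unsigned char rtm_protocol;
	unsigned char rtm_scope;
	unsigned char rtm_type;
	unsigned      rtm_flags;
};

/* Fail the build if a field of one struct does not sit where its counterpart does. */
#define SAME_SLOT(f, r) \
	static_assert(offsetof(struct fib_rule_hdr_sketch, f) == \
		      offsetof(struct rtmsg_sketch, r), "field offset differs")

SAME_SLOT(family, rtm_family);
SAME_SLOT(dst_len, rtm_dst_len);
SAME_SLOT(src_len, rtm_src_len);
SAME_SLOT(tos, rtm_tos);
SAME_SLOT(table, rtm_table);
SAME_SLOT(res1, rtm_protocol);
SAME_SLOT(res2, rtm_scope);
SAME_SLOT(action, rtm_type);
SAME_SLOT(flags, rtm_flags);
static_assert(sizeof(struct fib_rule_hdr_sketch) == sizeof(struct rtmsg_sketch),
	      "header sizes differ");

/* Reinterpret an rtmsg-shaped message as the new header and read the action. */
static uint8_t action_from_rtmsg(const struct rtmsg_sketch *rtm)
{
	struct fib_rule_hdr_sketch hdr;

	memcpy(&hdr, rtm, sizeof(hdr));
	return hdr.action;
}
```

With checks like these in place, a later change that moved or resized a field in either struct would fail to compile instead of silently changing what existing userland sends.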
Re: IBM (Lenovo) T60: e1000 driver high latency
Thomas Glanzmann wrote: Hello, [ resend because .config and the used kernel version were missing ] Linux Kernel Version: Linus Vanilla Tree; .config attached. I recently acquired a Lenovo (IBM) T60 with an e1000 network card. I experience high latency with this network card: pings take up to 1 second where they should take around 25 ms. I googled a bit and found the following: - Enable NAPI, which didn't work for me. 64 bytes from 192.168.0.223: icmp_seq=30 ttl=64 time=1004 ms 64 bytes from 192.168.0.223: icmp_seq=31 ttl=64 time=0.444 ms 64 bytes from 192.168.0.223: icmp_seq=32 ttl=64 time=1006 ms 64 bytes from 192.168.0.223: icmp_seq=33 ttl=64 time=0.739 ms Someone reported this problem on the e1000 bug tracker at e1000.sf.net. He also reported that the behaviour goes away completely if he disables the in-kernel irq balancer: : If I disable in kernel config Irq Balancing pings are : much better but not the best :-) : : 64 bytes from 192.168.3.74: icmp_seq=29 ttl=64 time=12.7 ms : 64 bytes from 192.168.3.74: icmp_seq=30 ttl=64 time=10.0 ms : 64 bytes from 192.168.3.74: icmp_seq=31 ttl=64 time=7.3 ms : 64 bytes from 192.168.3.74: icmp_seq=32 ttl=64 time=4.5 ms That's a large difference from >> 1000 ms, and I can only suspect that the kernel irqbalance is wreaking havoc in your system, trying to swap the entire context between the cores (the T60 is a Core Duo) every second or so. I've never believed much in the kernel irq balancer; the userspace daemon written by Arjan van de Ven just does a much better job, so can you try to disable the kernel irqbalancer? > CONFIG_IRQBALANCE=y turn that off ;) Cheers, Auke
Re: [RESEND 4/5] [IPV6]: Policy Routing Rules
Thomas Graf wrote: > Adds support for policy routing rules including a new > local table for routes with a local destination. Looks good! > Signed-off-by: Thomas Graf <[EMAIL PROTECTED]> Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RESEND 3/5] [NET]: Protocol Independant Policy Routing Rules Framework
Thomas Graf wrote: Hi Thomas, > Derived from net/ipv6/fib_rules.c do you mean net/ipv4/fib_rules.c or net/ipv6/fib6_rules.c? :-) A couple of comments below. > Signed-off-by: Thomas Graf <[EMAIL PROTECTED]> > > Index: net-2.6.git/include/linux/fib_rules.h > === > --- /dev/null > +++ net-2.6.git/include/linux/fib_rules.h > @@ -0,0 +1,60 @@ > +#ifndef __LINUX_FIB_RULES_H > +#define __LINUX_FIB_RULES_H > + > +#include > +#include > + > +/* rule is permanent, and cannot be deleted */ > +#define FIB_RULE_PERMANENT 1 > + > +struct fib_rule_hdr > +{ > + __u8 family; > + __u8 dst_len; > + __u8 src_len; > + __u8 tos; > + > + __u8 table; > + __u8 res1; /* reserved */ > + __u8 res2; /* reserved */ > + __u8 action; > + > + __u32 flags; > +}; I'm wondering if this is guaranteed to be equivalent to struct rtmsg? struct rtmsg { unsigned char rtm_family; unsigned char rtm_dst_len; unsigned char rtm_src_len; unsigned char rtm_tos; unsigned char rtm_table; /* Routing table id */ unsigned char rtm_protocol; /* Routing protocol; see below */ unsigned char rtm_scope; /* See below */ unsigned char rtm_type; /* See below */ unsigned rtm_flags; }; Won't we otherwise be breaking the existing userland interface? 
> +enum > +{ > + FRA_UNSPEC, > + FRA_DST,/* destination address */ > + FRA_SRC,/* source address */ > + FRA_IFNAME, /* interface name */ > + FRA_UNUSED1, > + FRA_UNUSED2, > + FRA_PRIORITY, /* priority/preference */ > + FRA_UNUSED3, > + FRA_UNUSED4, > + FRA_UNUSED5, > + FRA_FWMARK, /* netfilter mark (IPv4) */ > + FRA_FLOW, /* flow/class id */ > + __FRA_MAX > +}; > + > +#define FRA_MAX (__FRA_MAX - 1) > + > +enum > +{ > + FR_ACT_UNSPEC, > + FR_ACT_TO_TBL, /* Pass to fixed table */ > + FR_ACT_RES1, > + FR_ACT_RES2, > + FR_ACT_RES3, > + FR_ACT_RES4, > + FR_ACT_BLACKHOLE, /* Drop without notification */ > + FR_ACT_UNREACHABLE, /* Drop with ENETUNREACH */ > + FR_ACT_PROHIBIT,/* Drop with EACCES */ > + __FR_ACT_MAX, > +}; > + > +#define FR_ACT_MAX (__FR_ACT_MAX - 1) > + > +#endif Shouldn't all these (struct fib_rule_hdr included) actually be defined in include/linux/rtnetlink.h? Otherwise, looks good. Regards, Ville - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging kernel lockups during network activity
On 28-07-2006 16:39, Jarek Poplawski wrote: ... It has some great patch to queue scheduler by Hubert Xu. I think it is I'm immensely sorry for getting Mr Herbert Xu's name wrong. And I thought it was an easy name, just like that of a certain famous conductor (but not so famous). You won't believe it, but when I wrote this I checked my mails so as not to misspell his surname! Jarek P.
Re: [PATCH 2/7] NetLabel: core network changes
On Monday 31 July 2006 8:43 am, Venkat Yekkirala wrote: > > The NetLabel patch allows administrators to assign specific a CIPSO > > DOI/configuration to each LSM "domain". Blindly using the > > CIPSO tag that the > > remote host sends could violate the administrator's NetLabel > > configuration. > > > > The current patch reads the CIPSO tag off the child socket, > > translating the > > tag according to the CIPSO DOI configuration to arrive at the > > correct/desired > > LSM security attributes. These LSM security attributes and > > the "domain" are > > then used to set the NetLabel on the socket. In the case > > where everyone is > > well behaved this should have no effect on the socket IP > > options and the > > packets sent across the wire. However, in the case of a > > not-nice remote host > > the outgoing CIPSO tag may change to match the administrators desired > > settings. > > I wonder if waiting till accept isn't too late though. Perhaps this > should be done when the openreq is created so the syn-ack and such > will go out with the right tag? Stephen Smalley and I had several long discussions about this and my opinion, which seemed to be at least acceptable to Stephen, was that it was okay since there was no actual data being sent only TCP control messages. However, like I said earlier, the exact details of this are going to change as I am going to port the code to use the new accept() LSM hooks so this is really a not much of a concern anymore ... -- paul moore linux security @ hp - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
On Tue, Aug 01, 2006 at 12:01:03AM +1000, Herbert Xu wrote: > Ville Nuorvala <[EMAIL PROTECTED]> wrote: > > > >> When locating routes for redirects only the main table is > >> searched for now. Since policy rules will not be reversible > >> it is unclear whether it makes sense to change this. > > > > This is a good point. You are absolutely correct about the policy rules. > > > > IIRC, I initially looked through all the tables, but skipped this > > behavior when I rewrote the code for 2.6.11. Currently I'm once again > > in favor of looping through them all. This is IMO at least closer to the > > spirit of RFC 2461 section 8.3. where a host SHOULD update its > > destination cache upon receiving a redirect. If we don't look through > > all tables, we can't ensure this happens. > > Without a route cache, I think our only choice is to search through > all tables. The same thing applies to PMTU updates as well. Actually, if we're adding policy routing, we should seriously consider whether living without a routing cache is still viable or not because the cost of a route lookup has just gone up. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
Ville Nuorvala <[EMAIL PROTECTED]> wrote: > >> When locating routes for redirects only the main table is >> searched for now. Since policy rules will not be reversible >> it is unclear whether it makes sense to change this. > > This is a good point. You are absolutely correct about the policy rules. > > IIRC, I initially looked through all the tables, but skipped this > behavior when I rewrote the code for 2.6.11. Currently I'm once again > in favor of looping through them all. This is IMO at least closer to the > spirit of RFC 2461 section 8.3. where a host SHOULD update its > destination cache upon receiving a redirect. If we don't look through > all tables, we can't ensure this happens. Without a route cache, I think our only choice is to search through all tables. The same thing applies to PMTU updates as well. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] [IPV6]: Multiple Routing Tables
Thomas Graf wrote: > Adds the framework to support multiple IPv6 routing tables. > Currently all automatically generated routes are put into the > same table. This could be changed at a later point after > considering the produced locking overhead. Hi Thomas, some minor comments below. > When locating routes for redirects only the main table is > searched for now. Since policy rules will not be reversible > it is unclear whether it makes sense to change this. This is a good point. You are absolutely correct about the policy rules. IIRC, I initially looked through all the tables, but dropped this behavior when I rewrote the code for 2.6.11. Currently I'm once again in favor of looping through them all. This is IMO at least closer to the spirit of RFC 2461 section 8.3, where a host SHOULD update its destination cache upon receiving a redirect. If we don't look through all tables, we can't ensure this happens. > Index: net-2.6.git/include/net/ip6_fib.h > === > --- net-2.6.git.orig/include/net/ip6_fib.h > +++ net-2.6.git/include/net/ip6_fib.h > @@ -143,12 +146,41 @@ struct rt6_statistics { > > typedef void (*f_pnode)(struct fib6_node *fn, void *); > > -extern struct fib6_node ip6_routing_table; > +struct fib6_table { > + struct hlist_node tb6_hlist; > + u32 tb6_id; > + rwlock_t tb6_lock; > + struct fib6_node tb6_root; > +}; > + > +#define RT6_TABLE_UNSPEC RT_TABLE_UNSPEC > +#define RT6_TABLE_MAIN RT_TABLE_MAIN > +#define RT6_TABLE_LOCAL RT6_TABLE_MAIN > +#define RT6_TABLE_DFLT RT6_TABLE_MAIN > +#define RT6_TABLE_INFO RT6_TABLE_MAIN IMO it's a bit inconsistent to define a separate table entry for Route Information generated routes, but not Prefix Information based ones. What do you say about adding a RT6_TABLE_PRFX? 
> Index: net-2.6.git/net/ipv6/route.c > === > --- net-2.6.git.orig/net/ipv6/route.c > +++ net-2.6.git/net/ipv6/route.c > @@ -1435,12 +1523,15 @@ static struct rt6_info *rt6_add_route_in > struct rt6_info *rt6_get_dflt_router(struct in6_addr *addr, struct > net_device *dev) > { > struct rt6_info *rt; > - struct fib6_node *fn; > + struct fib6_table *table; > > - fn = &ip6_routing_table; > + /* TODO: It might be better to search all tables */ > + table = fib6_get_table(RT6_TABLE_DFLT); As long as the table for default routes is RT6_TABLE_DFLT and can't be configured by the user, I think the correct behavior is just to search RT6_TABLE_DFLT. Otherwise it looks very good! Regards, Ville - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 2/7] NetLabel: core network changes
> The NetLabel patch allows administrators to assign specific a CIPSO > DOI/configuration to each LSM "domain". Blindly using the > CIPSO tag that the > remote host sends could violate the administrator's NetLabel > configuration. > > The current patch reads the CIPSO tag off the child socket, > translating the > tag according to the CIPSO DOI configuration to arrive at the > correct/desired > LSM security attributes. These LSM security attributes and > the "domain" are > then used to set the NetLabel on the socket. In the case > where everyone is > well behaved this should have no effect on the socket IP > options and the > packets sent across the wire. However, in the case of a > not-nice remote host > the outgoing CIPSO tag may change to match the administrators desired > settings. I wonder if waiting till accept isn't too late though. Perhaps this should be done when the openreq is created so the syn-ack and such will go out with the right tag? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Linville's L2 rant... -- Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Mon, Jul 31, 2006 at 10:15:40AM +0200, Christophe Devriese wrote: > If you bond 2 vlan subinterfaces, the patch is not necessary at all. In that > case also the source device will be changed from eth0. to bond. So > that's correct behavior no ? > > In the second case, you create vlan subifs on a bonding device, vlan > subinterfaces will be created on the slave interfaces. In that case the vlan (This is not directed at Christophe, or anyone in particular...) Am I the only one that thinks that our handling of LAN L2 stuff is at best a little "too" flexible (and at worst a collection of nasty hacks)? I mean, do we really need both the ability to bond multiple vlan interfaces AND the ability to have vlan interfaces on top of a bond? How many people really appreciate the subtle(?) differences? Then throw bridging into the mix! If I'm using VLANs and bonds in a bridged environment, do I bridge the bonds, or bond the bridges? Do the VLANs come before the bonds? after the bridges? or somewhere in-between? Do all these combinations even work together? Who has the definitive answer (besides the code itself)? I have no doubt that there are plenty of opportunities for cleverness here (and no doubt dragons too). I just doubt that most of them are worth the complexities introduced by our current collection of "transparently" stackable pseudo-drivers and strategically placed hacks (e.g. skb_bond). All that, and it still isn't clear to me how we can cleanly accommodate 802.1s (which adds VLAN awareness to bridging). Do we hold the view that our L2 code is on par with the rest of our code? Is there an appetite for a clean-up? Or is it just me? If you made it this far, thanks for listening...I feel better now. :-) John -- John W. Linville [EMAIL PROTECTED]
Re: [PATCH] Fix UDP filter condition when do checksum
From: Wei Yongjun <[EMAIL PROTECTED]> Date: Mon, 31 Jul 2006 06:33:42 -0400 > In udp_queue_rcv_skb(), checksum condition is error. When UDP filter is > set, checksum is be done, but if UDP filter is not set, checksum will > not be done. So I think this is a BUG. It is not a bug, we defer the checksum, when we can, to sys_recvmsg() where we can combine the copy into userspace and the checksum calculation into one operation. We cannot do this deferral when there is a filter attached, and that is why the check is the way it is. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
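The deferral described above can be modelled in a few lines of userspace C. This is only a sketch of the idea, not the kernel's code: the enum, must_verify_at_queue_time() and copy_and_sum() are hypothetical names, and the sum is the plain 16-bit one's-complement sum (without the UDP pseudo-header) so that the single-pass copy-plus-checksum is easy to see.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the skb checksum states. */
enum csum_state { CSUM_NONE_SKETCH, CSUM_UNNECESSARY_SKETCH };

/*
 * Verify at queue time only when a filter is attached and nothing has
 * already validated the checksum; otherwise defer to the receive path.
 */
static int must_verify_at_queue_time(int has_filter, enum csum_state st)
{
	return has_filter && st != CSUM_UNNECESSARY_SKETCH;
}

/*
 * The deferred path: copy the payload and accumulate the 16-bit
 * one's-complement sum in the same pass over the data.
 * Returns the folded (not inverted) sum.
 */
static uint16_t copy_and_sum(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(len <= 65535);  /* keeps the 32-bit accumulator from overflowing */

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		sum += ((uint32_t)src[i] << 8) | src[i + 1];
	}
	if (i < len) {  /* odd trailing byte is padded with zero */
		dst[i] = src[i];
		sum += (uint32_t)src[i] << 8;
	}
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)sum;
}
```

The point of the condition is visible in the first helper: a filter has to look at the packet contents before the copy to userspace happens, so in that case the checksum cannot wait for the combined pass.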
Re: [PATCH] Fix UDP filter condition when do checksum
Wei Yongjun <[EMAIL PROTECTED]> wrote:
> In udp_queue_rcv_skb(), checksum condition is error. When UDP filter is
> set, checksum is be done, but if UDP filter is not set, checksum will
> not be done. So I think this is a BUG. Following is my patch:
>
> --- a/net/ipv4/udp.c 2006-07-31 09:33:45.392479344 -0400
> +++ b/net/ipv4/udp.c 2006-07-31 17:10:41.271632200 -0400
> @@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
>                  /* FALLTHROUGH -- it's a UDP Packet */
>          }
>
> -        if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> +        if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
>                  if (__udp_checksum_complete(skb)) {
>                          UDP_INC_STATS_BH(UDP_MIB_INERRORS);
>                          kfree_skb(skb);

For the record, this isn't correct, since the only reason we compute a checksum here rather than at recv(2) time is that we have a filter attached.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3
On Mon, Jul 31, 2006 at 12:39:51PM +0200, Patrick McHardy wrote:
>
> These are the patches (some variantions tested, but not all) on
> top of Herbert's CHECKSUM_PARTIAL patch. The first one fixes up
> the CHECKSUM_PARTIAL patch for 2.6.18-rc3, the second one fixes
> checksumming in all of netfilter besides ip_queue, the third one
> fixes ip_queue.

Thank you very much for working on this Patrick!

> Its actually not that much, if Herbert is fine with putting the
> CHECKSUM_PARTIAL patch in 2.6.18 I'll do some more testing and
> then I think these can go in as well.

You guys know I'm a coward when it comes to pushing things into rc :)

So I'd rather see a patch to disable the warnings for 2.6.18 so that the proper fix can be tested more thoroughly.

We should remember that 2.6.18 minus the warning is still going to be heaps better in this regard compared to 2.6.17, where all TSO packets were essentially discarded due to the incorrect checksum (when the NAT module is loaded).

> [NET]: Fix up CHECKSUM_PARTIAL patch for 2.6.18-rc3
>
> Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

Please merge this with my earlier patch. I'm not that fussed about having my changeset go in :)

> diff --git a/net/ipv4/netfilter/ip_nat_core.c b/net/ipv4/netfilter/ip_nat_core.c
> index 1741d55..731efbb 100644
> --- a/net/ipv4/netfilter/ip_nat_core.c
> +++ b/net/ipv4/netfilter/ip_nat_core.c
> @@ -443,7 +443,9 @@ int ip_nat_icmp_reply_translation(struct
>
>          /* We're actually going to mangle it beyond trivial checksum
>             adjustment, so make sure the current checksum is correct. */
> -        if ((*pskb)->ip_summed != CHECKSUM_UNNECESSARY) {
> +
> +        if ((*pskb)->ip_summed != CHECKSUM_UNNECESSARY &&
> +            (*pskb)->ip_summed != CHECKSUM_PARTIAL) {
>                  hdrlen = (*pskb)->nh.iph->ihl * 4;
>                  if ((u16)csum_fold(skb_checksum(*pskb, hdrlen,
>                                                  (*pskb)->len - hdrlen, 0)))

Call me picky, but I'd prefer it to actually look like

        switch ((*pskb)->ip_summed) {
        case CHECKSUM_COMPLETE:
                if (!(u16)csum_fold((*pskb)->csum))
                        break;
                /* fall through */
        case CHECKSUM_NONE:
                hdrlen = (*pskb)->nh.iph->ihl * 4;
                if ((u16)csum_fold(skb_checksum(*pskb, hdrlen,
                                                (*pskb)->len - hdrlen, 0)))
                        return 0;
        }

just because we probably won't revisit this code path for another million years to add this optimisation :)

> diff --git a/net/ipv4/netfilter/ip_nat_helper.c b/net/ipv4/netfilter/ip_nat_helper.c
> index cbcaa45..dd0ddd4 100644
> --- a/net/ipv4/netfilter/ip_nat_helper.c
> +++ b/net/ipv4/netfilter/ip_nat_helper.c
> @@ -165,7 +165,7 @@ ip_nat_mangle_tcp_packet(struct sk_buff
>  {
>          struct iphdr *iph;
>          struct tcphdr *tcph;
> -        int datalen;
> +        int oldlen, datalen;
>
>          if (!skb_make_writable(pskb, (*pskb)->len))
>                  return 0;
> @@ -180,13 +180,22 @@ ip_nat_mangle_tcp_packet(struct sk_buff
>          iph = (*pskb)->nh.iph;
>          tcph = (void *)iph + iph->ihl*4;
>
> +        oldlen = (*pskb)->len - iph->ihl*4;
>          mangle_contents(*pskb, iph->ihl*4 + tcph->doff*4,
>                          match_offset, match_len, rep_buffer, rep_len);
>
>          datalen = (*pskb)->len - iph->ihl*4;
> -        tcph->check = 0;
> -        tcph->check = tcp_v4_check(tcph, datalen, iph->saddr, iph->daddr,
> -                                   csum_partial((char *)tcph, datalen, 0));
> +        if ((*pskb)->ip_summed != CHECKSUM_PARTIAL) {
> +                tcph->check = 0;
> +                tcph->check = tcp_v4_check(tcph, datalen,
> +                                           iph->saddr, iph->daddr,
> +                                           csum_partial((char *)tcph,
> +                                                        datalen, 0));
> +        } else
> +                tcph->check = nf_proto_csum_update(*pskb,
> +                                                   htons(oldlen) ^ 0xFFFF,
> +                                                   htons(datalen),
> +                                                   tcph->check, 1);

OK, this is so incredibly clever that I probably won't understand it until tomorrow :)

> @@ -238,22 +248,30 @@ ip_nat_mangle_udp_packet(struct sk_buff
>
>          iph = (*pskb)->nh.iph;
>          udph = (void *)iph + iph->ihl*4;
> +
> +        oldlen = (*pskb)->len - iph->ihl*4;
>          mangle_contents(*pskb, iph->ihl*4 + sizeof(*udph),
>                          match_offset, match_len, rep_buffer, rep_len);
>
>          /* update the length of the UDP packet */
> -        udph->len = htons((*pskb)->len - iph->ihl*4);
> +        datalen = (*pskb)->len - iph->ihl*4;
> +        udph->len = htons(datalen);
>
>          /* fix udp checksum if udp checksum was previously calculated */
> -        if (udph->check) {
> -                int datalen = (*pskb)->len - iph->ihl * 4;
> +        if (!
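For readers puzzling over the same line: the trick in the CHECKSUM_PARTIAL branch looks like a standard incremental ones-complement update, where the old length is fed in complemented (XOR with 0xFFFF) and the new length is added. A minimal user-space sketch of that arithmetic follows, assuming plain 16-bit words in host order. The helper names csum_fold32(), csum_full() and csum_update() are hypothetical and are not the kernel's nf_proto_csum_update().

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit ones-complement accumulator down to 16 bits. */
static uint16_t csum_fold32(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full Internet checksum over n 16-bit words (hypothetical helper). */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += words[i];
    return (uint16_t)~csum_fold32(sum);
}

/*
 * Incremental update when one word changes from old_word to new_word:
 * check' = ~(~check + ~old_word + new_word), all in ones-complement
 * arithmetic. ~old_word is the same value as old_word ^ 0xFFFF.
 */
static uint16_t csum_update(uint16_t check, uint16_t old_word,
                            uint16_t new_word)
{
    uint32_t sum = (uint16_t)~check;

    sum += (uint16_t)~old_word;
    sum += new_word;
    return (uint16_t)~csum_fold32(sum);
}
```

Updating the checksum this way costs a few additions, regardless of packet size, which is why it is attractive when only a length field changes.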
Re: [RFC 1/4] kevent: core files.
On Mon, Jul 31, 2006 at 03:57:16AM -0700, David Miller wrote:
>
> So I would say for up to 4 or 5 events, system call overhead alone
> touches as many cache lines as the events themselves.

Absolutely.

The other thing to consider is that events don't come from the hardware. Events are written by the kernel. So if user-space is just reading the events that we've written, then there are no cache misses at all.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 1/5] [IPV6]: Remove ndiscs rt6_lock dependency
David Miller wrote:
> From: Thomas Graf <[EMAIL PROTECTED]>
> Date: Thu, 27 Jul 2006 00:00:01 +0200
>
>> (Ab)using rt6_lock wouldn't work anymore if rt6_lock is
>> converted into a per table lock.
>>
>> Signed-off-by: Thomas Graf <[EMAIL PROTECTED]>
>
> This one looks great.
>
> Signed-off-by: David S. Miller <[EMAIL PROTECTED]>

Ditto.

Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
Re: [RFC] Multiple IPV6 Routing Tables & Policy Routing
Thomas Graf wrote:
> Hello,

Hi Thomas!

> Thought it might be time to go through a round of comments
> on this work. Even though I've almost rewritten all the code
> the patches are based on the work found on www.mobile-ipv6.org.
> I have no idea which code was written by whom so just email me
> to get the credits right.

The policy routing stuff (multiple tables and source address based routing) was almost entirely written by me. Therefore you can apply my name as you see fit ;-)

Tushar Gohad at MontaVista, Benjamin Thery at Bull and of course USAGI have also worked on the code.

> Main differences to the version found on mobile-ipv6.org is
> that I removed table refcnt and defined that tables cannot
> disappear once created to simplify things and avoid too many
> atomic operations when looking up routes.

Yes, that sounds good. As the ipv6 module doesn't really seem to become unloadable anytime soon, there isn't really any good reason to refcount the tables.

> I've replaced the
> table array with a hash table to prepare it for > 255 tables
> and made things aware of the new default router selection
> code and experimental route info stuff added recently.

Good! I never had the time to merge our changes with 2.6.17.

> It's not final but somewhat working, I'm eager to see comments
> or patches.

I'll try to comment on them the best I can.

> I apologize if I've tramped onto anybody's foot
> by taking this up and submitting it, this isn't meant as an
> attempt to steal credits but rather to pick up good code and
> finally get it upstream after a very long while.

No offense taken! It's great that someone wants to push these things upstream as I personally have neither had the time nor the energy to do so lately.

Regards,
Ville
Re: [RFC 1/4] kevent: core files.
From: Evgeniy Polyakov <[EMAIL PROTECTED]>
Date: Mon, 31 Jul 2006 14:50:37 +0400

> In syscall time kevents copy 40bytes for each event + 12 bytes of header
> (number of events, timeout and command number). That's likely two cache
> lines if only one event is reported.

Do you know how many cachelines are dirtied by system call entry and exit on a typical system?

On sparc64 it is a minimum of 3 64-byte cachelines just to save and restore the system call time cpu register state. If the application is deep in a call chain, register windows might spill, and each such register window will dirty 2 more cachelines as it is dumped to the stack.

I am not even talking about the other basic necessities of doing a system call, such as touching various task_struct and thread_info state to check for pending signals etc.

System call overhead is non-trivial, especially when you are using it to move only a few small objects into and out of the kernel.

So I would say for up to 4 or 5 events, system call overhead alone touches as many cache lines as the events themselves.
Re: [RFC 1/4] kevent: core files.
On Mon, Jul 31, 2006 at 08:35:55PM +1000, Herbert Xu ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
> >
> >> - if there is space, report it in the ring buffer. Yes, the buffer
> >> can be optional, then all events are reported by the system call.
> >
> > That requires a copy, which can neglect syscall overhead.
> > Do we really want it to be done?
>
> Please note that we're talking about events here, not actual data. So
> only the event is being copied, which is presumably rather small compared
> to the data.

At syscall time kevent copies 40 bytes for each event + 12 bytes of header (number of events, timeout and command number). That's likely two cache lines if only one event is reported.

--
Evgeniy Polyakov
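The cache line estimate above can be checked with a little arithmetic. The sketch below assumes only the sizes quoted in the mail (12-byte header, 40 bytes per event). The function names and the idea of passing the buffer's starting offset are illustrative assumptions, not part of kevent.

```c
#include <stddef.h>

/* Sizes quoted in the thread; the actual struct layout is not shown there. */
#define KEVENT_HDR_BYTES   12u
#define KEVENT_EVENT_BYTES 40u

/* Bytes copied by one syscall that reports nr_events events. */
size_t kevent_copy_bytes(size_t nr_events)
{
    return KEVENT_HDR_BYTES + KEVENT_EVENT_BYTES * nr_events;
}

/*
 * Number of cache lines covered by len bytes that begin at byte offset
 * start (relative to any line-aligned address).
 */
size_t cache_lines_touched(size_t start, size_t len, size_t line_size)
{
    if (len == 0)
        return 0;
    return (start + len - 1) / line_size - start / line_size + 1;
}
```

With these numbers one event is 52 bytes: a single line if the copy starts on a 64-byte boundary, two lines if it straddles a boundary or if lines are 32 bytes. Five events come to 212 bytes, or four aligned 64-byte lines.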
Re: [PATCH] SNMPv2 udpInDatagrams counter error
Wei Yongjun <[EMAIL PROTECTED]> wrote:
>
> I also send the same mail several ago, and get no response. You
> patch is fine. But I think following code has no effect:
>
> if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
>
> It just let UDP datagrams with checksum error be added into UDP receive
> queue, and then discard it. I think this can be used to capture a UDP

We normally postpone the checksum computation until the user does a recv(2). However, if there is a socket filter attached then we need to verify the checksum right now because the socket filter will be applied at the very next step.

So yes, it does let UDP datagrams with checksum errors onto the UDP rcv queue if there are no socket filters attached, but this is intentional since we want to postpone the cost of checksum computation until the point when we have to copy the data to user-space, where it becomes much cheaper.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
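The rule described above can be written down as a small decision function. This is only a model of the policy as stated in the mail, not the code in net/ipv4/udp.c. The enum and function names are hypothetical stand-ins (CSUM_VERIFIED_BY_HW plays the role of CHECKSUM_UNNECESSARY).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the skb->ip_summed states that matter here. */
enum csum_state {
    CSUM_NOT_VERIFIED,    /* software still has to check the datagram */
    CSUM_VERIFIED_BY_HW   /* plays the role of CHECKSUM_UNNECESSARY */
};

enum csum_when {
    CSUM_SKIP,            /* nothing left to verify */
    CSUM_AT_QUEUE_TIME,   /* verify before the socket filter runs */
    CSUM_AT_RECV_TIME     /* fold the check into the copy to user-space */
};

/* Model of the policy: only an attached filter forces an early check. */
enum csum_when udp_csum_policy(bool has_filter, enum csum_state state)
{
    if (state == CSUM_VERIFIED_BY_HW)
        return CSUM_SKIP;
    return has_filter ? CSUM_AT_QUEUE_TIME : CSUM_AT_RECV_TIME;
}
```

Inverting the filter test, as the proposed patch does, would swap the last two outcomes: filtered sockets would run their filter on unverified data, and unfiltered sockets would pay for a separate checksum pass.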
Re: BUG: warning at net/core/dev.c:1171/skb_checksum_help() 2.6.18-rc3
Patrick McHardy wrote: > David Miller wrote: > >>I would like to see this fixed for 2.6.18, no later. >> >>Either that or disable the bug trap, but taking this route >>is severely discouraged. :) > > > I'm actually updateing my patch for this on top of Herbert's > CHECKSUM_PARTIAL patch right now. Unfortunately I targeted 2.6.19, > so the fixes are on top of a few cleanups (which unconvered a few > unrelated bugs as well). I'll post it when I'm done so we can > decide how to proceed. These are the patches (some variantions tested, but not all) on top of Herbert's CHECKSUM_PARTIAL patch. The first one fixes up the CHECKSUM_PARTIAL patch for 2.6.18-rc3, the second one fixes checksumming in all of netfilter besides ip_queue, the third one fixes ip_queue. Its actually not that much, if Herbert is fine with putting the CHECKSUM_PARTIAL patch in 2.6.18 I'll do some more testing and then I think these can go in as well. [NET]: Fix up CHECKSUM_PARTIAL patch for 2.6.18-rc3 Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]> --- commit 17a40f32fc339e9f6feeb042db58d30c8caf2fad tree 479e926c12606667a91d483223b4416da56227d5 parent 296b866d72ee7a8a577908323f2a7e8e92f4001f author Patrick McHardy <[EMAIL PROTECTED]> Mon, 31 Jul 2006 09:23:27 +0200 committer Patrick McHardy <[EMAIL PROTECTED]> Mon, 31 Jul 2006 09:23:27 +0200 include/linux/netdevice.h |4 ++-- net/core/dev.c|8 net/ipv4/tcp.c|4 ++-- net/ipv4/tcp_ipv4.c |2 +- net/ipv6/tcp_ipv6.c |2 +- 5 files changed, 10 insertions(+), 10 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 75f02d8..b5b9a33 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -973,7 +973,7 @@ extern void dev_mcast_init(void); extern int netdev_max_backlog; extern int weight_p; extern int netdev_set_master(struct net_device *dev, struct net_device *master); -extern int skb_checksum_help(struct sk_buff *skb, int inward); +extern int skb_checksum_help(struct sk_buff *skb); extern struct sk_buff 
*skb_gso_segment(struct sk_buff *skb, int features); #ifdef CONFIG_BUG extern void netdev_rx_csum_fault(struct net_device *dev); @@ -1009,7 +1009,7 @@ static inline int netif_needs_gso(struct { return skb_is_gso(skb) && (!skb_gso_ok(skb, dev->features) || - unlikely(skb->ip_summed != CHECKSUM_HW)); + unlikely(skb->ip_summed != CHECKSUM_PARTIAL)); } #endif /* __KERNEL__ */ diff --git a/net/core/dev.c b/net/core/dev.c index 90fb267..528c5f3 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1157,12 +1157,12 @@ EXPORT_SYMBOL(netif_device_attach); * Invalidate hardware checksum when packet is to be mangled, and * complete checksum manually on outgoing path. */ -int skb_checksum_help(struct sk_buff *skb, int inward) +int skb_checksum_help(struct sk_buff *skb) { unsigned int csum; int ret = 0, offset = skb->h.raw - skb->data; - if (inward) + if (skb->ip_summed == CHECKSUM_COMPLETE) goto out_set_summed; if (unlikely(skb_shinfo(skb)->gso_size)) { @@ -1219,7 +1219,7 @@ struct sk_buff *skb_gso_segment(struct s skb->mac_len = skb->nh.raw - skb->data; __skb_pull(skb, skb->mac_len); - if (unlikely(skb->ip_summed != CHECKSUM_HW)) { + if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) { static int warned; WARN_ON(!warned); @@ -1233,7 +1233,7 @@ struct sk_buff *skb_gso_segment(struct s rcu_read_lock(); list_for_each_entry_rcu(ptype, &ptype_base[ntohs(type) & 15], list) { if (ptype->type == type && !ptype->dev && ptype->gso_segment) { - if (unlikely(skb->ip_summed != CHECKSUM_HW)) { + if (unlikely(skb->ip_summed != CHECKSUM_PARTIAL)) { err = ptype->gso_send_check(skb); segs = ERR_PTR(err); if (err || skb_gso_ok(skb, features)) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 40ada0b..c452373 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2204,7 +2204,7 @@ struct sk_buff *tcp_tso_segment(struct s th->fin = th->psh = 0; th->check = ~csum_fold(th->check + delta); - if (skb->ip_summed != CHECKSUM_HW) + if (skb->ip_summed != CHECKSUM_PARTIAL) th->check = 
csum_fold(csum_partial(skb->h.raw, thlen, skb->csum)); @@ -2218,7 +2218,7 @@ struct sk_buff *tcp_tso_segment(struct s delta = htonl(oldlen + (skb->tail - skb->h.raw) + skb->data_len); th->check = ~csum_fold(th->check + delta); - if (skb->ip_summed != CHECKSUM_HW) + if (skb->ip_summed != CHECKSUM_PARTIAL) th->check = csum_fold(csum_partial(skb->h.raw, thlen, skb->csum)); diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 114830f..be056d1 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -510,7 +510,7 @@ int tcp_v4_gso_send_check(struct sk_buff th->check = 0; th->check = ~tcp_v4_check(th, skb->len, iph->saddr, iph->daddr, 0); skb->csum = offsetof(struct tcphdr, check); - skb->ip_summed = CHECKSUM_HW; + skb->ip_summed = CHECKSUM_PARTIAL; return 0; } diff --git a/net/ipv6/tcp
Re: [RFC 1/4] kevent: core files.
Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:
>
>> - if there is space, report it in the ring buffer. Yes, the buffer
>> can be optional, then all events are reported by the system call.
>
> That requires a copy, which can neglect syscall overhead.
> Do we really want it to be done?

Please note that we're talking about events here, not actual data. So only the event is being copied, which is presumably rather small compared to the data.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [RFC 1/4] kevent: core files.
On Sat, Jul 29, 2006 at 09:18:47AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov wrote:
> > Btw, why do we want mapped ring of ready events?
> > If user requestd some event, he definitely wants to get them back when
> > they are ready, and not to check and then get them?
> > Could you please explain more on this issue?
>
> If of course makes no sense to enter the kernel to actually get the
> event. This should be done by storing the event in the ring buffer.
> I.e., there are two ways to get an event:
>
> - with a syscall. This can report as many events at once as the caller
>   provides space for. And no event which is reported in the run buffer
>   should be reported this way
>
> - if there is space, report it in the ring buffer. Yes, the buffer
>   can be optional, then all events are reported by the system call.

That requires a copy, which can negate the syscall overhead savings. Do we really want it to be done?

> So the use case would be like this:
>
> wait_and_get_event:
>
>   is buffer empty ?
>
>     yes -> make syscall
>
>     no -> get event from buffer
>
> To avoid races, the syscall needs to take a parameter indicating the
> last event checked out from the buffer. If in the meantime the kernel
> put another event in the buffer the syscall immediately returns.
> Similar to what we do in the futex syscall.

And how is "misordering" between queue and buffer going to be managed? I.e. when the buffer is full, events are placed into the queue so that a syscall can get them; when the syscall is then called to get events from the queue but not from the buffer, we can end up taking events from the buffer while older ones are still sitting in the queue.

And how will waiting be done without syscalls? Will glibc take care of it?

> The question is how to best represent the ring buffer. Zach and some
> others had some ready responses in Ottawa. The important thing is to
> avoid cache line ping pong when possible.
>
> Is the ring buffer absolutely necessary? Probably not. But it has the
> potential to help quite a bit. Don't look at the problem to solve in
> the context of heavy I/O operations when another syscall here and there
> doesn't matter. With this single event mechanism for every possible
> event the kernel can generate programming can look quite different.
> E.g., every read() call can implicitly we changed into an async read
> call followed by a user-level reschedule. This rescheduling allows
> another thread of execution to run while the read request is processed.
> I.e., it's basically a setjmp() followed by a goto into the inner loop
> to get the next event. And now suddenly the event notification
> mechanism really should be as fast as possible. If we submit basically
> every request asynchronously and are not creating dedicated threads for
> specific tasks anymore we
>
> a) have a lot more event notifications
>
> b) the probability of an event being reported when we want the receive
>    the next one if higher (i.e., the case where no syscall vs syscall
>    makes a difference)
>
> Yes, all this will require changes in the way programs a written but we
> shouldn't limit the way we can write programs unnecessarily. I think
> that given increasing discrepancies in relative speed/latency of the
> peripherals and the CPU this is one possible solution to keep the CPUs
> busy without resorting to a gazillion separate threads in each program.

Ok, let's do it in the following way: I will present a new version of kevent with new syscalls and with the issues mentioned before fixed; while people look at it we can end up with a mapped buffer design. Is it ok?

> --
> ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
>

--
Evgeniy Polyakov
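The wait_and_get_event flow quoted above can be modelled in a few lines of single-threaded user-space C. This is a sketch under loud assumptions: the ring layout, RING_SLOTS, and every function name here are made up for illustration, and demo_wait() merely stands in for the proposed blocking syscall that takes the last index the caller has seen. A real shared ring would also need memory barriers and an answer to the ordering question raised above.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 8u   /* hypothetical size, power of two */

struct event {
    int id;
};

struct event_ring {
    struct event slot[RING_SLOTS];
    unsigned int head;  /* advanced by the producer (the kernel) */
    unsigned int tail;  /* advanced by the consumer (user-space) */
};

/* Slow path: blocks in a real design, returns false on timeout. */
typedef bool (*wait_fn)(const struct event_ring *ring, unsigned int last_seen);

/* Producer side: refuse the event when the ring is full. */
bool ring_push(struct event_ring *r, struct event ev)
{
    if (r->head - r->tail == RING_SLOTS)
        return false;
    r->slot[r->head % RING_SLOTS] = ev;
    r->head++;
    return true;
}

/* Consumer side: take the oldest event without entering the kernel. */
bool ring_pop(struct event_ring *r, struct event *out)
{
    if (r->tail == r->head)
        return false;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return true;
}

/*
 * Stand-in for the syscall: returns at once if the producer has already
 * moved past last_seen (the futex-like check), otherwise reports a timeout.
 */
bool demo_wait(const struct event_ring *r, unsigned int last_seen)
{
    return r->head != last_seen;
}

/* "is buffer empty? yes -> make syscall, no -> get event from buffer" */
bool wait_and_get_event(struct event_ring *r, wait_fn slow_path,
                        struct event *out)
{
    while (!ring_pop(r, out)) {
        if (!slow_path(r, r->tail))
            return false;
    }
    return true;
}
```

Passing the consumer's own index to the slow path is what closes the race in this model: if an event arrived between the emptiness check and the wait, the wait returns immediately instead of sleeping.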
Re: [PATCH] SNMPv2 udpInDatagrams counter error
This change does not affect tcpdump; it only prevents a socket with a UDP filter from receiving UDP datagrams with checksum errors. It is not a good idea, but I think it is the best way to resolve this problem. If you want to capture bad UDP packets, you can use tcpdump.

On Monday 31 July 2006 04:57, Gerrit Renker wrote:
> Hi,
>
> | if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> |
> | IPv6 doesn't do this, so I think delete condition 'sk->sk_filter' is
> | better. Do you think so?
>
> I think the sk->sk_filter is there for a good reason. If you delete it,
> that routine is forced to always compute UDP checksums, even if the only
> receiving application is a tcpdump process. I may be wrong here, but I
> think that deleting the sk_filter statement is not at a good idea.
>
> The other alternatives discussed (afaik) so far were:
>
> 1) Move the increment of UDP_MIB_INDATAGRAMS from udp_queue_rcv_skb() to
> udp_recvmsg() (first patch uploaded to
> http://bugzilla.kernel.org/show_bug.cgi?id=6660). This was discussed: not a
> good idea, since in-kernel applications may use the data_ready handler
> rather than udp_recvmsg().
>
> 2) Decrement UDP_MIB_INDATAGRAMS in udp_recvmsg() when the checksum turns
> out to be wrong (second patch uploaded to above address). This would be a
> fix to the problem you are stating, it also solves the problem of missing
> out the data_ready handlers in (1), and was suggested earlier on this
> mailing list.
Re: [3/4] kevent: AIO, aio_sendfile() implementation.
On Thu, Jul 27, 2006 at 11:44:23AM -0700, Ulrich Drepper wrote:
> Badari Pulavarty wrote:
> > Before we spend too much time cleaning up and merging into mainline -
> > I would like an agreement that what we add is good enough for glibc
> > POSIX AIO.
>
> I haven't seen a description of the interface so far. Would be good if

Did Sébastien's mail with the description help ?

> it existed. But I briefly mentioned one quirk in the interface about
> which Suparna wasn't sure whether it's implemented/implementable in the
> current interface.
>
> If a lio_listio call is made the individual requests are handle just as
> if they'd be issue separately. I.e., the notification specified in the
> individual aiocb is performed when the specific request is done. Then,
> once all requests are done, another notification is made, this time
> controlled by the sigevent parameter if lio_listio.

Looking at the code in lio kernel patch, this should be already covered:

        if (iocb->ki_signo)
                __aio_send_signal(iocb);

+       if (iocb->ki_lio)
+               lio_check(iocb->ki_lio);

That is, it first checks the notification in the individual iocb, and then the one for the LIO.

>
> Another feature which I always wanted: the current lio_listio call
> returns in blocking mode only if all requests are done. In non-blocking
> mode it returns immediately and the program needs to poll the aiocbs.
> What is needed is something in the middle. For instance, if multiple
> read requests are issued the program might be able to start working as
> soon as one request is satisfied. I.e., a call similar to lio_listio
> would be nice which also takes another parameter specifying how many of
> the NENT aiocbs have to finish before the call returns.

I imagine the kernel could enable this by incorporating this additional parameter for IOCB_CMD_GROUP in the ABI (in the default case this should be the same as the total number of iocbs submitted to lio_listio). Now should the at least NENT check apply only to LIO_WAIT or also to the LIO_NOWAIT notification case ?

BTW, the native io_getevents does support a min_nr wakeup already, except that it applies to any iocb on the io_context, and not just a given lio_listio call.

Regards
Suparna

--
Suparna Bhattacharya ([EMAIL PROTECTED])
Linux Technology Center
IBM Software Lab, India
[RFC] gre: transparent ethernet bridging
This patch implements transparent ethernet bridging for gre tunnels. There are a few outstanding issues.

There is no way for userspace to select the type of gre tunnel. The #if 0 near the top of the patch forces all gre tunnels to be bridges. The problem is that userspace uses an IPPROTO_ to select the type of tunnel, but both types of gre tunnel are IPPROTO_GRE. I can't see anything else in struct ip_tunnel_parm that could be used to select this. One approach that I've seen mentioned in the archives is to add a netlink interface to replace the tunnel ioctls.

Network loops are bad. See the comments at the top of ip_gre.c for a description of how gre tunnels handle this normally. But for gre bridges, we don't want to copy the ttl (it breaks routing protocols), and we don't want to force DF (we want to bridge 1500 byte packets). I couldn't think of any solution for this.

Some routers set LLC_SAP_BSPAN in the gre protocol field, and then give the bpdu packet without any other ethernet/llc header. This patch currently tries to fake the ethernet/llc header before passing the packet up, but it is buggy (mac addresses are wrong at least). Maybe a better approach is to call directly into the bridging code. I didn't try that at first because it isn't modular, and may break other things that want to see the packet.
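To make the protocol-field cases concrete, here is a small user-space sketch of how a receiver might size the GRE header and classify the payload. It is not taken from the patch. The flag bits follow the usual GRE header layout (checksum, routing, key, sequence), 0x6558 is the well-known EtherType for transparent ethernet bridging and is assumed to be what the patch calls ETH_P_BRIDGE, and 0x42 is the spanning tree SAP. All names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits in the first 16 bits of the GRE header (host order). */
#define GRE_F_CSUM    0x8000u
#define GRE_F_ROUTING 0x4000u
#define GRE_F_KEY     0x2000u
#define GRE_F_SEQ     0x1000u

/* Protocol field values; 0x6558 is assumed to match ETH_P_BRIDGE. */
#define GRE_PROTO_IPV4 0x0800u
#define GRE_PROTO_TEB  0x6558u
#define GRE_PROTO_BPDU 0x0042u  /* LLC_SAP_BSPAN placed directly in the field */

enum gre_payload {
    GRE_PAYLOAD_IPV4,
    GRE_PAYLOAD_ETHERNET,  /* full ethernet frame follows */
    GRE_PAYLOAD_BPDU,      /* bare BPDU, no ethernet/llc header */
    GRE_PAYLOAD_UNKNOWN
};

/* Read a big-endian 16-bit value. */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Length of the GRE header, including the optional fields. */
static size_t gre_header_len(const uint8_t *hdr)
{
    uint16_t flags = be16(hdr);
    size_t len = 4;                      /* flags + protocol */

    if (flags & (GRE_F_CSUM | GRE_F_ROUTING))
        len += 4;                        /* checksum + offset */
    if (flags & GRE_F_KEY)
        len += 4;
    if (flags & GRE_F_SEQ)
        len += 4;
    return len;
}

/* Decide how the payload after the GRE header should be interpreted. */
static enum gre_payload gre_classify(const uint8_t *hdr)
{
    switch (be16(hdr + 2)) {
    case GRE_PROTO_IPV4:
        return GRE_PAYLOAD_IPV4;
    case GRE_PROTO_TEB:
        return GRE_PAYLOAD_ETHERNET;
    case GRE_PROTO_BPDU:
        return GRE_PAYLOAD_BPDU;
    default:
        return GRE_PAYLOAD_UNKNOWN;
    }
}
```

The BPDU case is the awkward one described above: the payload arrives without its ethernet/llc framing, so the receiver has to either rebuild a header or hand the data to the bridge code some other way.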
--- linux-2.6.x/net/ipv4/ip_gre.c 18 Jun 2006 23:30:56 - 1.1.1.33 +++ linux-2.6.x/net/ipv4/ip_gre.c 31 Jul 2006 09:57:41 - @@ -30,6 +30,8 @@ #include #include #include +#include +#include #include #include @@ -41,6 +43,8 @@ #include #include #include +#include +#include #ifdef CONFIG_IPV6 #include @@ -119,6 +123,7 @@ static int ipgre_tunnel_init(struct net_device *dev); static void ipgre_tunnel_setup(struct net_device *dev); +static void ipgre_ether_tunnel_setup(struct net_device *dev); /* Fallback tunnel: no source, no destination, no key, no options */ @@ -274,7 +279,11 @@ static struct ip_tunnel * ipgre_tunnel_l goto failed; } +#if 0 dev = alloc_netdev(sizeof(*t), name, ipgre_tunnel_setup); +#else + dev = alloc_netdev(sizeof(*t), name, ipgre_ether_tunnel_setup); +#endif if (!dev) return NULL; @@ -550,6 +559,68 @@ ipgre_ecn_encapsulate(u8 tos, struct iph return INET_ECN_encapsulate(tos, inner); } +__be16 ipgre_type_trans(struct sk_buff *skb, int offset) +{ + u8 *h = skb->data; + __be16 flags = *(__be16*)h; + __be16 proto = *(__be16*)(h + 2); + + /* WCCP version 1 and 2 protocol decoding. +* - Change protocol to IP +* - When dealing with WCCPv2, Skip extra 4 bytes in GRE header +*/ + if (flags == 0 && + proto == __constant_htons(ETH_P_WCCP)) { + proto = __constant_htons(ETH_P_IP); + if ((*(h + offset) & 0xF0) != 0x40) + offset += 4; + } + + skb->mac.raw = skb->nh.raw; + skb->nh.raw = __pskb_pull(skb, offset); + skb_postpull_rcsum(skb, skb->h.raw, offset); +#ifdef CONFIG_NET_IPGRE_BROADCAST + if (MULTICAST(iph->daddr)) { + /* Looped back packet, drop it! 
*/ + if (((struct rtable*)skb->dst)->fl.iif == 0) + return 0; + /* tunnel->stat.multicast++; */ + skb->pkt_type = PACKET_BROADCAST; + } +#endif + + return proto; +} + +extern const u8 br_group_address[ETH_ALEN]; + +__be16 ipgre_ether_type_trans(struct sk_buff *skb, struct net_device *dev, + int offset) +{ + u8 *h = skb->data; + __be16 proto = *(__be16*)(h + 2); + + if (proto == htons(ETH_P_BRIDGE)) { + if (!pskb_may_pull(skb, offset + ETH_HLEN)) + return 0; + skb_pull_rcsum(skb, offset); + return eth_type_trans(skb, dev); + } else if (proto == htons(LLC_SAP_BSPAN)) { + skb_pull_rcsum(skb, offset); + + llc_pdu_header_init(skb, LLC_PDU_TYPE_U, LLC_SAP_BSPAN, + LLC_SAP_BSPAN, LLC_PDU_CMD); + llc_pdu_init_as_ui_cmd(skb); + + llc_mac_hdr_init(skb, dev->dev_addr, dev->dev_addr); + skb_pull(skb, ETH_HLEN); + + return htons(ETH_P_802_2); + } + + return 0; +} + static int ipgre_rcv(struct sk_buff *skb) { struct iphdr *iph; @@ -603,32 +674,8 @@ static int ipgre_rcv(struct sk_buff *skb if ((tunnel = ipgre_tunnel_lookup(iph->saddr, iph->daddr, key)) != NULL) { secpath_reset(skb); - skb->protocol = *(u16*)(h + 2); - /* WCCP version 1 and 2 protocol decoding. -* - Change protocol to IP -* - When dealing with WCCPv2, Skip extra 4 bytes in GRE header -*/ - if (flags == 0 && - skb->protocol == __constant_htons(ETH_P_W
Re: [patch 1/1]SNMPv2 "ipv6IfStatsOutFragCreates" counter error
Hello.

The patch seems sane to me.

In article <[EMAIL PROTECTED]> (at Tue, 01 Aug 2006 05:45:39 -0400), weidong <[EMAIL PROTECTED]> says:

> signed-off-by: Wei Dong <[EMAIL PROTECTED]>

Acked-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

--yoshfuji
Re: [patch 0/1]SNMPv2 "ipv6IfStatsInHdrErrors" counter error
Hello.

Next time, please put your "Signed-off-by" line before the patch. Thank you.

In article <[EMAIL PROTECTED]> (at Tue, 01 Aug 2006 05:45:33 -0400), weidong <[EMAIL PROTECTED]> says:

> signed-off-by:Wei Dong <[EMAIL PROTECTED]>

Acked-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

--yoshfuji
IBM (Lenovo) T60: e1000 driver high latency
Hello,

[ resend because .config and the used kernel version were missing ]

Linux Kernel Version: Linus Vanilla Tree; .config attached.

I recently acquired a Lenovo (IBM) T60 with an e1000 network card. I experience high latency with this network card: pings take up to 1 second where the ping should be around 25 ms.

I googled a bit and found the following:

- Enable NAPI, which didn't work for me.

64 bytes from 192.168.0.223: icmp_seq=30 ttl=64 time=1004 ms
64 bytes from 192.168.0.223: icmp_seq=31 ttl=64 time=0.444 ms
64 bytes from 192.168.0.223: icmp_seq=32 ttl=64 time=1006 ms
64 bytes from 192.168.0.223: icmp_seq=33 ttl=64 time=0.739 ms
64 bytes from 192.168.0.223: icmp_seq=34 ttl=64 time=1006 ms
64 bytes from 192.168.0.223: icmp_seq=35 ttl=64 time=0.603 ms
64 bytes from 192.168.0.223: icmp_seq=36 ttl=64 time=1001 ms
64 bytes from 192.168.0.223: icmp_seq=37 ttl=64 time=0.736 ms

02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
Subsystem: Lenovo Unknown device 2001
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- SERR-

# # Automatically generated make config: don't edit # Linux kernel version: 2.6.18-rc3 # Mon Jul 31 17:53:27 2006 # CONFIG_X86_32=y CONFIG_GENERIC_TIME=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_SEMAPHORE_SLEEPERS=y CONFIG_X86=y CONFIG_MMU=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_HWEIGHT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y CONFIG_DMI=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # Code maturity level options # CONFIG_EXPERIMENTAL=y CONFIG_LOCK_KERNEL=y CONFIG_INIT_ENV_ARG_LIMIT=32 # # General setup # CONFIG_LOCALVERSION="" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_POSIX_MQUEUE=y CONFIG_BSD_PROCESS_ACCT=y # CONFIG_BSD_PROCESS_ACCT_V3 is not set # CONFIG_TASKSTATS is not set CONFIG_SYSCTL=y # CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y # CONFIG_CPUSETS is not set # CONFIG_RELAY is not set CONFIG_INITRAMFS_SOURCE="" CONFIG_UID16=y # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set # CONFIG_EMBEDDED is not set CONFIG_KALLSYMS=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_BASE_FULL=y CONFIG_RT_MUTEXES=y CONFIG_FUTEX=y CONFIG_EPOLL=y CONFIG_SHMEM=y CONFIG_SLAB=y CONFIG_VM_EVENT_COUNTERS=y # CONFIG_TINY_SHMEM is not set CONFIG_BASE_SMALL=0 # CONFIG_SLOB is not set # # Loadable module support # CONFIG_MODULES=y CONFIG_MODULE_UNLOAD=y CONFIG_MODULE_FORCE_UNLOAD=y CONFIG_MODVERSIONS=y # CONFIG_MODULE_SRCVERSION_ALL is not set CONFIG_KMOD=y CONFIG_STOP_MACHINE=y # # Block layer # CONFIG_LBD=y # CONFIG_BLK_DEV_IO_TRACE is not set # CONFIG_LSF is not set # # IO Schedulers # CONFIG_IOSCHED_NOOP=y CONFIG_IOSCHED_AS=y CONFIG_IOSCHED_DEADLINE=y CONFIG_IOSCHED_CFQ=y CONFIG_DEFAULT_AS=y # CONFIG_DEFAULT_DEADLINE is not set # CONFIG_DEFAULT_CFQ is not set # CONFIG_DEFAULT_NOOP is not set CONFIG_DEFAULT_IOSCHED="anticipatory" # # Processor type and features # CONFIG_SMP=y CONFIG_X86_PC=y # CONFIG_X86_ELAN is not set # CONFIG_X86_VOYAGER is not set # CONFIG_X86_NUMAQ is not set # CONFIG_X86_SUMMIT is not set # CONFIG_X86_BIGSMP is not set # CONFIG_X86_VISWS is not set # CONFIG_X86_GENERICARCH is not set # CONFIG_X86_ES7000 is not set # CONFIG_M386 is not set # CONFIG_M486 is not set # CONFIG_M586 is not set # CONFIG_M586TSC is not set # CONFIG_M586MMX is not set # CONFIG_M686 is not set # CONFIG_MPENTIUMII is not set # CONFIG_MPENTIUMIII is not set CONFIG_MPENTIUMM=y # CONFIG_MPENTIUM4 is not set # CONFIG_MK6 is not set # CONFIG_MK7 is not set # CONFIG_MK8 is not set # CONFIG_MCRUSOE is not set # CONFIG_MEFFICEON is not set # CONFIG_MWINCHIPC6 is not set # CONFIG_MWINCHIP2 is not set # CONFIG_MWINCHIP3D is not set # CONFIG_MGEODEGX1 is not set # CONFIG_MGEODE_LX is not set # CONFIG_MCYRIXIII is not set # CONFIG_MVIAC3_2 
is not set # CONFIG_X86_GENERIC is not set CONFIG_X86_CMPXCHG=y CONFIG_X86_XADD=y CONFIG_X86_L1_CACHE_SHIFT=6 CONFIG_RWSEM_XCHGADD_ALGORITHM=y CONFIG_GENERIC_CALIBRATE_DELAY=y CONFIG_X86_WP_WORKS_OK=y CONFIG_X86_INVLPG=y CONFIG_X86_BSWAP=y CONFIG_X86_POPAD_OK=y CONFIG_X86_CMPXCHG64=y CONFIG_X86_GOOD_APIC=y CONFIG_X86_INTEL_USERCOPY=y CONFIG_X86_USE_PPRO_CHECKSUM=y CONFIG_X86_TSC=y CONFIG_HPET_TIMER=y CONFIG_HPET_EMULATE_RTC=y CONFIG_NR_CPUS=8 # CONFIG_SCHED_SMT is not set CONFIG_SCHED_MC=y CONFIG_PREEMPT_NONE=y # CONFIG_PREEMPT_VOLUNTARY is not set # CONFIG_PREEMPT is not set # CONFIG_PREEMPT_BKL is not set CONFIG_X86_LOCAL_APIC=y CONFIG_X86_IO_APIC=y CONFIG_X86_MCE=y CONFIG_X86_MCE_NONFATAL=y CONFIG_X86_MCE_P4THERMAL=y CONFIG_VM86=y # CONFIG_TOSHIBA is not set # CONFIG_I8K is not set # CONFIG_X86_REBOOTFIXUPS is not set CONFIG_MICROCODE=y CONFIG_X86_MSR=y CONFIG_X86_CPUID=y # # Firmware Drivers
[PATCH] Fix UDP filter condition when do checksum
In udp_queue_rcv_skb(), the checksum condition is wrong. When a UDP filter is set, the checksum is verified, but if no UDP filter is set, the checksum is not verified. So I think this is a bug.

Following is my patch:

--- a/net/ipv4/udp.c	2006-07-31 09:33:45.392479344 -0400
+++ b/net/ipv4/udp.c	2006-07-31 17:10:41.271632200 -0400
@@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
 		/* FALLTHROUGH -- it's a UDP Packet */
 	}
 
-	if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
+	if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
 		if (__udp_checksum_complete(skb)) {
 			UDP_INC_STATS_BH(UDP_MIB_INERRORS);
 			kfree_skb(skb);

Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 0/1]SNMPv2 "ipv6IfStatsInHdrErrors" counter error
Hi, All

When I tested Linux kernel 2.6.17.7 for the "ipv6IfStatsInHdrErrors" statistic, I found that this counter does not increase correctly. The reference is RFC 2465:

   ipv6IfStatsInHdrErrors OBJECT-TYPE
       SYNTAX      Counter32
       MAX-ACCESS  read-only
       STATUS      current
       DESCRIPTION
               "The number of input datagrams discarded due to
               errors in their IPv6 headers, including version
               number mismatch, other format errors, hop count
               exceeded, errors discovered in processing their
               IPv6 options, etc."
       ::= { ipv6IfStatsEntry 2 }

When I send a packet with TTL=0 or TTL=1 to a router that needs to forward it, the router just sends an ICMPv6 message to tell the sender TIME_EXCEED and HOPLIMIT, but it does not increment this counter (in the function ip6_forward).

The following is the patch for this issue.

diff -ruN old/net/ipv6/ip6_output.c new/net/ipv6/ip6_output.c
--- old/net/ipv6/ip6_output.c	2006-07-25 11:36:01.0 +0800
+++ new/net/ipv6/ip6_output.c	2006-07-31 16:16:13.0 +0800
@@ -356,6 +356,7 @@
 		skb->dev = dst->dev;
 		icmpv6_send(skb, ICMPV6_TIME_EXCEED, ICMPV6_EXC_HOPLIMIT,
 			    0, skb->dev);
+		IP6_INC_STATS_BH(IPSTATS_MIB_INHDRERRORS);
 
 		kfree_skb(skb);
 		return -ETIMEDOUT;

Signed-off-by: Wei Dong <[EMAIL PROTECTED]>

Regards
Wei Dong
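The effect of the one-line change can be modelled in user space. The following is only a sketch under stated assumptions: `toy_pkt`, `toy_mib` and `toy_forward()` are hypothetical stand-ins, not kernel structures or functions, and the `hop_limit <= 1` test simply mirrors the TTL=0 and TTL=1 cases described in the report.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel's packet and MIB structures. */
struct toy_pkt {
	unsigned char hop_limit;
};

struct toy_mib {
	unsigned long in_hdr_errors;	/* models ipv6IfStatsInHdrErrors */
};

/*
 * Model of the hop-limit check on the forwarding path.
 * Returns 0 if the packet may be forwarded, -ETIMEDOUT if it is dropped.
 */
int toy_forward(struct toy_pkt *pkt, struct toy_mib *mib)
{
	if (pkt->hop_limit <= 1) {
		/* The real code sends an ICMPv6 time-exceeded message here. */
		mib->in_hdr_errors++;	/* the increment the patch adds */
		return -ETIMEDOUT;
	}
	pkt->hop_limit--;
	return 0;
}
```

With this model, a packet arriving with hop limit 1 is dropped and counted once, while a packet with hop limit 64 is forwarded with hop limit 63 and leaves the counter untouched.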
[patch 1/1]SNMPv2 "ipv6IfStatsOutFragCreates" counter error
Hi, All

When I tested Linux kernel 2.6.17.7 for the "ipv6IfStatsOutFragCreates" statistic, I found that it does not increase correctly. The reference is RFC 2465:

   ipv6IfStatsOutFragCreates OBJECT-TYPE
       SYNTAX      Counter32
       MAX-ACCESS  read-only
       STATUS      current
       DESCRIPTION
               "The number of output datagram fragments that have
               been generated as a result of fragmentation at
               this output interface."
       ::= { ipv6IfStatsEntry 15 }

I think there are two issues in the Linux kernel.

1st: RFC 2465 specifies that the counter is "The number of output datagram fragments...". I think it is better to increase this counter after a fragment has been output successfully. It should not be increased when a fragment is created but fails to be output.

2nd: If we send a big ICMP/ICMPv6 echo request to a host, we receive an ICMP/ICMPv6 echo reply consisting of several fragments. As we know, in the Linux kernel the first fragmentation occurs in the ICMP layer (maybe saying transport layer is better), but this is not the "real" fragmentation; it just does some "pre-fragmentation": allocating space for data, forming a frag_list, etc. The "real" fragmentation happens in the IP layer: setting the offset and the MF flag and so on. So I think that in the "fast path" of ip_fragment/ip6_fragment, if we send a fragment that was "pre-fragmented" by the upper layer, we should also increase "ipv6IfStatsOutFragCreates".

The following is the patch for the issues mentioned above:

diff -ruN old/net/ipv4/ip_output.c new/net/ipv4/ip_output.c
--- old/net/ipv4/ip_output.c	2006-07-25 11:36:01.0 +0800
+++ new/net/ipv4/ip_output.c	2006-07-31 16:24:57.0 +0800
@@ -527,6 +527,8 @@
 
 			err = output(skb);
 
+			if (!err)
+				IP_INC_STATS(IPSTATS_MIB_FRAGCREATES);
 			if (err || !frag)
 				break;
 
@@ -650,9 +652,6 @@
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-
-		IP_INC_STATS(IPSTATS_MIB_FRAGCREATES);
-
 		iph->tot_len = htons(len + hlen);
 
 		ip_send_check(iph);
@@ -660,6 +659,8 @@
 		err = output(skb2);
 		if (err)
 			goto fail;
+
+		IP_INC_STATS(IPSTATS_MIB_FRAGCREATES);
 	}
 	kfree_skb(skb);
 	IP_INC_STATS(IPSTATS_MIB_FRAGOKS);
diff -ruN old/net/ipv6/ip6_output.c new/net/ipv6/ip6_output.c
--- old/net/ipv6/ip6_output.c	2006-07-25 11:36:01.0 +0800
+++ new/net/ipv6/ip6_output.c	2006-07-31 16:24:21.0 +0800
@@ -593,6 +593,9 @@
 			}
 
 			err = output(skb);
+			if(!err)
+				IP6_INC_STATS(IPSTATS_MIB_FRAGCREATES);
+
 			if (err || !frag)
 				break;
 
@@ -704,12 +707,11 @@
 		/*
 		 *	Put this fragment into the sending queue.
 		 */
-
-		IP6_INC_STATS(IPSTATS_MIB_FRAGCREATES);
-
 		err = output(frag);
 		if (err)
 			goto fail;
+
+		IP6_INC_STATS(IPSTATS_MIB_FRAGCREATES);
 	}
 	kfree_skb(skb);
 	IP6_INC_STATS(IPSTATS_MIB_FRAGOKS);

Signed-off-by: Wei Dong <[EMAIL PROTECTED]>

Regards
Wei Dong
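The counting rule proposed in the first point (count a fragment only once it has been output successfully) can be illustrated with a small user-space model. This is a sketch, not the kernel code: `frag_stats`, `send_fragments()` and the `output_fail_at()` transmit stub are hypothetical names, and the failure accounting in `frag_fails` is an assumption of the model.

```c
#include <stddef.h>

/* Hypothetical counters modelling FRAGCREATES, FRAGOKS and FRAGFAILS. */
struct frag_stats {
	unsigned long frag_creates;
	unsigned long frag_oks;
	unsigned long frag_fails;
};

/* Transmit callback: returns 0 on success, non-zero on error. */
typedef int (*output_fn)(size_t index, void *ctx);

/*
 * Send nfrags fragments. A fragment is counted in frag_creates only
 * after output() has succeeded for it, as the patch proposes.
 */
int send_fragments(size_t nfrags, output_fn output, void *ctx,
		   struct frag_stats *st)
{
	size_t i;

	for (i = 0; i < nfrags; i++) {
		int err = output(i, ctx);

		if (err) {
			st->frag_fails++;
			return err;
		}
		st->frag_creates++;
	}
	st->frag_oks++;
	return 0;
}

/* Hypothetical stub: fails at the index stored in ctx, if ctx is set. */
int output_fail_at(size_t index, void *ctx)
{
	const size_t *bad = ctx;

	return (bad && index == *bad) ? -1 : 0;
}
```

If the third of four fragments fails to go out, the model counts two created fragments and one failure, whereas counting before the call to output() would have reported three.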
Re: [IPV6]: Audit all ip6_dst_lookup/ip6_dst_store calls
On Sun, Jul 30, 2006 at 10:32:10PM -0500, Matt Domsch wrote:
>
> I applied this on 2.6.18-rc3, and it panics immediately as the first
> IPv6 TCP (ssh) session is initiated to the system.

Executive summary:

1) We resolved one lockdep warning only to stumble onto another lockdep validator bug.

2) There is something broken in the x86_64 unwind code which is causing it to panic just about every time somebody calls dump_stack().

Andi, this is the second time I've seen a report where an otherwise harmless dump_stack call (the other one was caused by a WARN_ON) gets turned into a panic by the stack unwind code on x86_64.

This particular report is with 2.6.18-rc3 so it looks like whatever bug is causing it hasn't been fixed yet. Could you please have a look at it? Thanks.

> =============================================
> [ INFO: possible recursive locking detected ]
> ---------------------------------------------
> swapper/0 is trying to acquire lock:
>  (slock-AF_INET6){-+..}, at: [] sk_clone+0xd2/0x3a8
>
> but task is already holding lock:
>  (slock-AF_INET6){-+..}, at: [] tcp_v6_rcv+0x30e/0x76e [ipv6]
>
> other info that might help us debug this:
> 1 lock held by swapper/0:
>  #0:  (slock-AF_INET6){-+..}, at: [] tcp_v6_rcv+0x30e/0x76e [ipv6]
>
> stack backtrace:
>
> Call Trace:
>  [] show_trace+0xae/0x30e
>  [] dump_stack+0x15/0x17
>  [] __lock_acquire+0x12e/0xa18
>  [] lock_acquire+0x4b/0x69
>  [] _spin_lock+0x25/0x31
>  [] sk_clone+0xd2/0x3a8
>  [] inet_csk_clone+0x11/0x6f
>  [] tcp_create_openreq_child+0x24/0x49c
>  [] :ipv6:tcp_v6_syn_recv_sock+0x2c5/0x6be
>  [] tcp_check_req+0x1d1/0x326
>  [] :ipv6:tcp_v6_do_rcv+0x15d/0x372
>  [] :ipv6:tcp_v6_rcv+0x71f/0x76e
>  [] :ipv6:ip6_input+0x223/0x315
>  [] :ipv6:ipv6_rcv+0x254/0x2af
>  [] netif_receive_skb+0x260/0x2dd
>  [] :e1000:e1000_clean_rx_irq+0x423/0x4c2
>  [] :e1000:e1000_clean+0x88/0x17d
>  [] net_rx_action+0xac/0x1d1
>  [] __do_softirq+0x68/0xf5
>  [] call_softirq+0x1e/0x28

Now let's move onto the lockdep validator bug :)

Ingo/Arjen, I thought we've already fixed this before but somehow I can't find anything in the email archives so perhaps I'm mixing it up with another recursive lock false-positive.

The problem here is really quite simple: when we accept a TCP connection there are two sockets involved. The listening socket which existed before the connection came in and the socket we construct for the newly arrived connection.

The code works something like this:

* Take slock on listening socket.
* Construct child socket.
* Take slock on child socket.

As we never do the locking in the opposite direction (child followed by listening socket) this is safe.

So perhaps we need to add some extra annotation in sk_clone?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
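Why a class-based validator reports this sequence, and why an extra annotation silences it, can be shown with a toy model. This is a rough sketch and not lockdep itself: `toy_lock`, `toy_task` and `toy_acquire()` are hypothetical, and the model assumes that locks are compared by class plus a caller-supplied subclass, in the spirit of the kernel's nesting annotations such as spin_lock_nested().

```c
#include <stddef.h>

#define TOY_MAX_HELD 8

enum { TOY_OK = 0, TOY_RECURSIVE = -1, TOY_OVERFLOW = -2 };

struct toy_lock {
	int class_id;	/* every slock-AF_INET6 lock shares one class */
};

struct toy_held {
	int class_id;
	int subclass;
};

struct toy_task {
	struct toy_held held[TOY_MAX_HELD];
	size_t depth;
};

/*
 * Record an acquisition. Taking a lock whose (class, subclass) pair is
 * already held is reported as possible recursive locking, even when
 * the two lock instances are different objects.
 */
int toy_acquire(struct toy_task *t, const struct toy_lock *l, int subclass)
{
	size_t i;

	for (i = 0; i < t->depth; i++)
		if (t->held[i].class_id == l->class_id &&
		    t->held[i].subclass == subclass)
			return TOY_RECURSIVE;
	if (t->depth == TOY_MAX_HELD)
		return TOY_OVERFLOW;
	t->held[t->depth].class_id = l->class_id;
	t->held[t->depth].subclass = subclass;
	t->depth++;
	return TOY_OK;
}
```

In this model, taking the listening socket's lock and then the child's lock with the same subclass yields TOY_RECURSIVE, although the ordering is safe. Taking the child's lock with a distinct subclass is accepted, which is the kind of annotation the message suggests for sk_clone.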
Re: [PATCH] SNMPv2 udpInDatagrams counter error
Yes, you are right. I also sent the same mail some time ago and got no response. Your patch is fine. But I think the following code has no effect:

	if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {

It just lets UDP datagrams with checksum errors be added to the UDP receive queue and then discards them. I think this could be used to capture UDP datagrams with a filter. But if a filter is used to capture UDP datagrams, the code should look like this:

	if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {

IPv6 doesn't do this, so I think deleting the condition 'sk->sk_filter' is better. Do you think so?

In particular, did you try sending UDP datagrams with a checksum error to echo-udp (port 7)? Maybe your patch will let neither udpInDatagrams nor udpInErrors be increased, because in my test such datagrams were delivered to echo-udp and got an echo reply.

----- Original Message -----
From: "Gerrit Renker" <[EMAIL PROTECTED]>
To:
Sent: Monday, July 31, 2006 4:19 PM
Subject: Re: [PATCH] SNMPv2 udpInDatagrams counter error

> This has been raised earlier, cf. http://bugzilla.kernel.org/show_bug.cgi?id=6660
>
> Wei Yongjun wrote:
> | When I send a UDP datagram with a checksum error to the target, I find
> | that under IPv6 the counter udpInErrors is increased, but under IPv4
> | the counter udpInDatagrams is increased. I looked into the source code
> | and found that under IPv4 a UDP datagram with a checksum error is
> | delivered to the UDP receive queue, but under IPv6 it is not. IPv4
> | delivers it to the UDP receive queue, increases udpInDatagrams, and
> | then discards it before it is delivered to the UDP user. The RFC says
> | udpInDatagrams is the total number of UDP datagrams delivered to UDP
> | users, so udpInDatagrams should not be increased when a UDP datagram
> | with a checksum error is received.
> |
> | Refer to RFC 2013:
> |    udpInDatagrams OBJECT-TYPE
> |        SYNTAX      Counter32
> |        MAX-ACCESS  read-only
> |        STATUS      current
> |        DESCRIPTION
> |                "The total number of UDP datagrams delivered to UDP
> |                users."
> |        ::= { udp 1 }
> |
> | Following is my patch:
> |
> | --- a/net/ipv4/udp.c	2006-07-31 09:33:45.392479344 -0400
> | +++ b/net/ipv4/udp.c	2006-07-31 09:34:26.430240656 -0400
> | @@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
> |  		/* FALLTHROUGH -- it's a UDP Packet */
> |  	}
> |  
> | -	if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
> | +	if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
> |  		if (__udp_checksum_complete(skb)) {
> |  			UDP_INC_STATS_BH(UDP_MIB_INERRORS);
> |  			kfree_skb(skb);
> |
> | Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>
Re: [PATCH] SNMPv2 udpInDatagrams counter error
Hi,

| if (!sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
|
| IPv6 doesn't do this, so I think deleting the condition 'sk->sk_filter' is better.
| Do you think so?

I think the sk->sk_filter test is there for a good reason. If you delete it, that routine is forced to always compute UDP checksums, even if the only receiving application is a tcpdump process. I may be wrong here, but I think that deleting the sk_filter test is not a good idea.

The other alternatives discussed (afaik) so far were:

1) Move the increment of UDP_MIB_INDATAGRAMS from udp_queue_rcv_skb() to udp_recvmsg() (first patch uploaded to http://bugzilla.kernel.org/show_bug.cgi?id=6660). This was discussed: not a good idea, since in-kernel applications may use the data_ready handler rather than udp_recvmsg().

2) Decrement UDP_MIB_INDATAGRAMS in udp_recvmsg() when the checksum turns out to be wrong (second patch uploaded to the above address). This would fix the problem you are describing. It also avoids the problem in (1) of missing the data_ready handlers, and it was suggested earlier on this mailing list.

-- Gerrit
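Alternative (2) can be pictured with a small user-space model of the two counters. It is only a sketch under stated assumptions: `toy_udp_mib`, `toy_queue_rcv()` and `toy_recvmsg()` are hypothetical names, the model covers just the case where no filter is attached and the checksum is deferred to receive time, and counting the bad datagram in `in_errors` is an assumption of the model.

```c
#include <stdbool.h>

/* Hypothetical counters modelling udpInDatagrams and udpInErrors. */
struct toy_udp_mib {
	unsigned long in_datagrams;
	unsigned long in_errors;
};

struct toy_dgram {
	bool csum_ok;	/* result the deferred checksum will give */
};

/* Enqueue path: the checksum has not been verified yet. */
void toy_queue_rcv(struct toy_udp_mib *mib)
{
	mib->in_datagrams++;
}

/*
 * Receive path: the checksum is verified when the user reads the
 * datagram. Returns true if the datagram is delivered to the user.
 */
bool toy_recvmsg(struct toy_udp_mib *mib, const struct toy_dgram *d)
{
	if (!d->csum_ok) {
		mib->in_errors++;
		mib->in_datagrams--;	/* undo the count taken at enqueue */
		return false;
	}
	return true;
}
```

After one good and one corrupted datagram have been queued and read, the model reports one datagram delivered and one error, which matches the RFC 2013 wording that udpInDatagrams counts only datagrams delivered to UDP users.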
Re: [PATCH] SNMPv2 udpInDatagrams counter error
This has been raised earlier, cf. http://bugzilla.kernel.org/show_bug.cgi?id=6660

Wei Yongjun wrote:
| When I send a UDP datagram with a checksum error to the target, I find
| that under IPv6 the counter udpInErrors is increased, but under IPv4
| the counter udpInDatagrams is increased. I looked into the source code
| and found that under IPv4 a UDP datagram with a checksum error is
| delivered to the UDP receive queue, but under IPv6 it is not. IPv4
| delivers it to the UDP receive queue, increases udpInDatagrams, and
| then discards it before it is delivered to the UDP user. The RFC says
| udpInDatagrams is the total number of UDP datagrams delivered to UDP
| users, so udpInDatagrams should not be increased when a UDP datagram
| with a checksum error is received.
|
| Refer to RFC 2013:
|    udpInDatagrams OBJECT-TYPE
|        SYNTAX      Counter32
|        MAX-ACCESS  read-only
|        STATUS      current
|        DESCRIPTION
|                "The total number of UDP datagrams delivered to UDP
|                users."
|        ::= { udp 1 }
|
| Following is my patch:
|
| --- a/net/ipv4/udp.c	2006-07-31 09:33:45.392479344 -0400
| +++ b/net/ipv4/udp.c	2006-07-31 09:34:26.430240656 -0400
| @@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
|  		/* FALLTHROUGH -- it's a UDP Packet */
|  	}
|  
| -	if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
| +	if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
|  		if (__udp_checksum_complete(skb)) {
|  			UDP_INC_STATS_BH(UDP_MIB_INERRORS);
|  			kfree_skb(skb);
|
| Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>
Re: PATCH Fix bonding active-backup behavior for VLAN interfaces
On Monday 31 July 2006 05:50, you wrote:
> From: Ben Greear <[EMAIL PROTECTED]>
> Date: Fri, 28 Jul 2006 14:55:17 -0700
>
> > The skb_bond method assigns skb->dev when it does the 'keep',
> > but the VLAN code immediately over-writes the skb->dev when
> > searching for the vlan device.
> >
> > What is the purpose of assigning skb->dev to the master device?
>
> This makes me consider this patch highly dubious, at best.
>
> The whole intention of bonding on input is to make all packets
> incoming on the individual bond slaves to look like they come in via
> the master device.
>
> Therefore, even when the bond slaves are VLAN devices, in the end the
> skb->dev should be the bond master device _not_ the VLAN device.
>
> I'm not applying this patch, it doesn't look correct at all.

That code is not introduced by this patch; it is already in the kernel. This patch is about having the same behavior for the vlan accelerated input path and the normal input path.

If you bond 2 vlan subinterfaces, the patch is not necessary at all. In that case the source device will also be changed from eth0. to bond. So that's correct behavior, no?

In the second case, you create vlan subifs on a bonding device, and vlan subinterfaces will be created on the slave interfaces. In that case the vlan code will reassign skb->dev, but skb_bond needs to know the actual input device in order to make an informed drop decision before this code is passed (active-backup mode needs to drop packets from the backup slave interface; if you don't do that you get big problems with broadcasts). The same struct vlan_group is assigned to all slave devices, so the only vlan subinterfaces that exist in this case are the bond. subinterfaces, and the vlan path for both slaves will assign the bond. interface to skb->dev, thereby erasing the information about where the packet came from.

I have tested the patch, and it works correctly in both cases on my test system (where I join vlan subifs on a bonding device into a bridge and have xen guests' vifX.Y interfaces connected to those bridges, which is a configuration we imho really want to support). Without this patch, as explained earlier in this thread, this config does not work.

Regards,
Christophe
[PATCH] SNMPv2 udpInDatagrams counter error
When I send a UDP datagram with a checksum error to the target, I find that under IPv6 the counter udpInErrors is increased, but under IPv4 the counter udpInDatagrams is increased. I looked into the source code and found that under IPv4 a UDP datagram with a checksum error is delivered to the UDP receive queue, but under IPv6 it is not. IPv4 delivers it to the UDP receive queue, increases udpInDatagrams, and then discards it before it is delivered to the UDP user. The RFC says udpInDatagrams is the total number of UDP datagrams delivered to UDP users, so udpInDatagrams should not be increased when a UDP datagram with a checksum error is received.

Refer to RFC 2013:
   udpInDatagrams OBJECT-TYPE
       SYNTAX      Counter32
       MAX-ACCESS  read-only
       STATUS      current
       DESCRIPTION
               "The total number of UDP datagrams delivered to UDP
               users."
       ::= { udp 1 }

Following is my patch:

--- a/net/ipv4/udp.c	2006-07-31 09:33:45.392479344 -0400
+++ b/net/ipv4/udp.c	2006-07-31 09:34:26.430240656 -0400
@@ -1018,7 +1018,7 @@ static int udp_queue_rcv_skb(struct sock
 		/* FALLTHROUGH -- it's a UDP Packet */
 	}
 
-	if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
+	if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
 		if (__udp_checksum_complete(skb)) {
 			UDP_INC_STATS_BH(UDP_MIB_INERRORS);
 			kfree_skb(skb);

Signed-off-by: Wei Yongjun <[EMAIL PROTECTED]>