Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.
Hello! I can tell why it has not been done initially. Main problem was in IP options, which can be present in raw packet. They have to be properly fragmented, some options are to be deleted on fragments. Not that it is too complicated, it is just boring and ugly and inconsistent with IP_HDRINCL logic. So, it was done in logically consistent way: did you order IP_HDRINCL? Please, then deal with fragments yourself. Alexey
Re: [PATCH net-2.6 0/3]: Three TCP fixes
Hello! My theory is that it could relate to tcp_cwnd_restart and tcp_cwnd_application_limited using it, and the others were then just accidentally changed as well. Perhaps I'll have to dig once again through the changelog history to see if there's some clue (unless Alexey sheds some light on this)... Yes, it is RFC2861. There is some rationale in the preamble to section 3. But even for this case it can be set to full cwnd instead. Also, I used it to calculate prior_ssthresh to restore ssthresh in the case when we detect a false retransmission. It is not an accident. I really mean that tcp_current_ssthresh() is a correct estimate for the current ssthresh, and this should be the same as in the restart situation. Of course, the rationale of RFC2861 does not apply to this case _directly_. But it was at the end of a chain of syllogisms. Look: if we do not have infinite ssthresh and cwnd > ssthresh, we are already in congestion avoidance mode and we can set prior_ssthresh to any value in the range ssthresh...cwnd without significant changes in behaviour wrt congestion avoidance. ssthresh would be an obvious underestimate. cwnd can be an overestimate, because the path is surely congested (cwnd > ssthresh) and cwnd is not a measure of current capacity. Behavioural changes happen when the final cwnd is too low, because the pipe was drained during recovery. We allow slow start up to ssthresh, but the value of ssthresh should be the same as we use for the restart case. Alexey -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] [UDP6]: Counter increment on BH mode
On Mon, Dec 03, 2007 at 10:39:35PM +1100, Herbert Xu wrote: So we need to fix this, and whatever the fix is will probably render the BH/USER distinction obsolete. Hmm, I would think the opposite. USER (or generic) is the expensive variant, BH is lite. No? Alexey
Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)
Hello! Is there a reason that the target hardware address isn't the target hardware address? It is bound only to the fact that linux uses the protocol address of the machine which responds. It would be highly confusing (more than confusing :-)) if we used our protocol address and the hardware address of the requestor. But if you use a zero protocol address as source, you really can use any hw address. The dhcp clients I examined, and the implementation of the arpcheck that I use, will compare the target hardware field of the arp-reply and match it against its own mac, to verify the reply. And this fails with the current implementation in the kernel. 1. Do not do this. Mainly, because you already know that this does not work with linux. :-) Logically, the target hw address in an arp reply is just a nonsensical redundancy; it should not be checked or even looked at. 2. As to your suggestion, I thought about this and I am going to agree. Arguments which convinced me are: - arping still works. - any piece of reasonable software should work. - if Windows understands DaD (is it really true? I cannot believe) and it is unhappy about our response and does not block use of a duplicate address only due to this, we _must_ accommodate ASAP. - if we do, we have to use a 0 protocol address, no choice. Alexey
Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)
Hello! Send a correct arp reply instead of one with sender ip and sender hardware address in target fields. I do not see anything more legal in setting the target address to 0. Actually, the semantics of the target address in an ARP reply are ambiguous. If it is a reply to some real request, it is set to the address of the requestor, and the protocol requires the recipient of this arp reply to test that the address matches its own address before creating a new entry triggered by an unsolicited arp reply. That's all. In the case of duplicate address detection, the requestor does not have any address, so it is absolutely not essential what we use as the target address. The only place which could depend on this is the tool which tests for a duplicate address. At least arping, written by me, should work with either variant. So, please, could you explain what forced you to think that use of 0 is better? Alexey
Re: [2.6 patch] remove Documentation/networking/routing.txt
Hello! This file is so outdated that I can't see any value in keeping it. Absolutely agree. Alexey
Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address
Hello! I was able to set up a nbma gre tunnel, add routes to it and it worked perfectly ok. Link-level next hop worked: ip route add <route> via <link-level-address> dev <tunnel-dev> onlink This can work if you use gre0. By plain luck it has an all-zero dev_addr. It will break on nbma devices set up with: ip tunnel add XXX mode gre local a.b.c.d [key whatever] ... Alexey
Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address
Hello! When a GRE tunnel is in NBMA mode, this patch allows an application to use a PF_PACKET socket to: - send a packet to a specific NBMA address with sendto() - use recvfrom() to receive a packet and check which NBMA address it came from This is required to properly implement NHRP over a GRE tunnel. Ack. This is a good idea. Frankly, I was sure ip_gre worked this way all these years. I do not remember any reasons why it was crippled. The only dubious case is when the next hop is set using routing tables. But the code in ipgre_tunnel_xmit() is ready to accept this situation: it checks for a zero destination address and fixes it when it is able to. Alexey
Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address
Hello! Me wrote: Ack. This is a good idea. Frankly, I was sure ip_gre worked this way all these years. I do not remember any reasons why it was crippled. The only dubious case is when the next hop is set using routing tables. But the code in ipgre_tunnel_xmit() is ready to accept this situation: it checks for a zero destination address and fixes it when it is able to. Nevertheless, it does not work. The reason is that NOARP arp entries on a device with an initialized hard_header are initialized not to all zeros, but to dev->dev_addr. So the nexthop from routing tables is ignored and all gre packets are lost in loopback. Not good. The problem can be ignored. I am not even sure that someone uses this feature. Actually, it was not recommended in the documentation. Alternatively, arp.c can be changed to generate 0 addresses instead of dev->dev_addr. Normally it is equally good, but I am not sure about possible side effects. Any other thoughts? Alexey
Re: [PATCH 5/10] [NET]: Avoid unnecessary cloning for ingress filtering
Hello! If it is causing trouble, then one idea would be to move the resetting to a wrapper function which calls clone first and then resets the other fields. All actions currently cloning would need to be mod-ed to use that call. I see not so many places inside net/sched/act* where skb_clone is used. Exactly one, in act_mirred.c, to be more precise. Did I miss something? If you mean that place, your suggestion is correct and everything is OK. But if there is at least one place where an external skb_clone() should do this... this cannot be correct. Tell me that it does not matter what some random skb_clone() does to tc_verd, please. :-) Alexey
Re: SFQ qdisc crashes with limit of 2 packets
Hello! Remove the artificial limitation on the sfq queue limit. This is a followup to Patrick's patch. A little optimization to the enqueue routine allows us to remove the artificial limitation on queue length. Plus, testing showed that the hash function used by SFQ is bad, or even worse: it does not even sweep the whole range of hash values. Switched to Jenkins' hash.

Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 3a23e30..b542c87 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -19,6 +19,7 @@
 #include <linux/init.h>
 #include <linux/ipv6.h>
 #include <linux/skbuff.h>
+#include <linux/jhash.h>
 #include <net/ip.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
@@ -95,7 +96,7 @@ struct sfq_sched_data
 /* Variables */
 	struct timer_list perturb_timer;
-	int		perturbation;
+	u32		perturbation;
 	sfq_index	tail;		/* Index of current slot in round */
 	sfq_index	max_depth;	/* Maximal depth */
@@ -109,12 +110,7 @@ struct sfq_sched_data
 static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1)
 {
-	int pert = q->perturbation;
-
-	/* Have we any rotation primitives? If not, WHY? */
-	h ^= (h1<<pert) ^ (h1>>(0x1F - pert));
-	h ^= h1>>10;
-	return h & 0x3FF;
+	return jhash_2words(h, h1, q->perturbation) & (SFQ_HASH_DIVISOR - 1);
 }

 static unsigned sfq_hash(struct sfq_sched_data *q, struct sk_buff *skb)
@@ -256,6 +252,13 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 		q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
 		q->hash[x] = hash;
 	}
+	/* If selected queue has length q->limit, this means that
+	 * all another queues are empty and that we do simple tail drop,
+	 * i.e. drop _this_ packet.
+	 */
+	if (q->qs[x].qlen >= q->limit)
+		return qdisc_drop(skb, sch);
+
 	sch->qstats.backlog += skb->len;
 	__skb_queue_tail(&q->qs[x], skb);
 	sfq_inc(q, x);
@@ -294,6 +297,19 @@ sfq_requeue(struct sk_buff *skb, struct Qdisc* sch)
 	}
 	sch->qstats.backlog += skb->len;
 	__skb_queue_head(&q->qs[x], skb);
+	/* If selected queue has length q->limit+1, this means that
+	 * all another queues are empty and we do simple tail drop.
+	 * This packet is still requeued at head of queue, tail packet
+	 * is dropped.
+	 */
+	if (q->qs[x].qlen > q->limit) {
+		skb = q->qs[x].prev;
+		__skb_unlink(skb, &q->qs[x]);
+		sch->qstats.drops++;
+		sch->qstats.backlog -= skb->len;
+		kfree_skb(skb);
+		return NET_XMIT_CN;
+	}
 	sfq_inc(q, x);
 	if (q->qs[x].qlen == 1) {		/* The flow is new */
 		if (q->tail == SFQ_DEPTH) {	/* It is the first flow */
@@ -370,12 +386,10 @@ static void sfq_perturbation(unsigned long arg)
 	struct Qdisc *sch = (struct Qdisc*)arg;
 	struct sfq_sched_data *q = qdisc_priv(sch);

-	q->perturbation = net_random()&0x1F;
+	get_random_bytes(&q->perturbation, 4);

-	if (q->perturb_period) {
-		q->perturb_timer.expires = jiffies + q->perturb_period;
-		add_timer(&q->perturb_timer);
-	}
+	if (q->perturb_period)
+		mod_timer(&q->perturb_timer, jiffies + q->perturb_period);
 }

 static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
@@ -391,7 +405,7 @@ static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
 	q->quantum = ctl->quantum ? : psched_mtu(sch->dev);
 	q->perturb_period = ctl->perturb_period*HZ;
 	if (ctl->limit)
-		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 2);
+		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 1);

 	qlen = sch->q.qlen;
 	while (sch->q.qlen > q->limit)
@@ -400,8 +414,8 @@ static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
 	del_timer(&q->perturb_timer);
 	if (q->perturb_period) {
-		q->perturb_timer.expires = jiffies + q->perturb_period;
-		add_timer(&q->perturb_timer);
+		mod_timer(&q->perturb_timer, jiffies + q->perturb_period);
+		get_random_bytes(&q->perturbation, 4);
 	}
 	sch_tree_unlock(sch);
 	return 0;
@@ -423,12 +437,13 @@ static int sfq_init(struct Qdisc *sch, struct rtattr *opt)
 		q->dep[i+SFQ_DEPTH].next = i+SFQ_DEPTH;
 		q->dep[i+SFQ_DEPTH].prev = i+SFQ_DEPTH;
 	}
-	q->limit = SFQ_DEPTH - 2;
+	q->limit = SFQ_DEPTH - 1;
 	q->max_depth = 0;
 	q->tail = SFQ_DEPTH;
 	if (opt == NULL) {
 		q->quantum = psched_mtu(sch->dev);
 		q->perturb_period = 0;
+		get_random_bytes(&q->perturbation, 4);
 	} else {
 		int err = sfq_change(sch, opt);
 		if (err)
Re: SFQ qdisc crashes with limit of 2 packets
Hello! OK, the off-by-one prevents an out-of-bounds array access. Yes, this is not off-by-one (off-by-two, to be more exact :-)). Maximal queue length is really limited by SFQ_DEPTH-2, because: 1. SFQ keeps the list of queue lengths in an array of length SFQ_DEPTH. This means the length of a queue must be in the range 0..SFQ_DEPTH-1. 2. SFQ enqueues the incoming packet first, and drops something after this. This means it always needs a free slot in the queue. So, SFQ_DEPTH-2. CCed Alexey just to be safe, but I think the patch should be fine. Yes. But what about limit == 1? tc prohibits this case, but it is still not nice to have the bug in the kernel. Plus, the code remains creepy; the check

+	if (++sch->q.qlen < q->limit) {

still looks like a scary off-by-one. I would go all the way: replace this with the natural

	if (++sch->q.qlen <= q->limit) {

and max q->limit at SFQ_DEPTH-2. Patch enclosed. Seems it is easy to relax the limit to SFQ_DEPTH-1; item #2 is an artificial limitation. If the current queue already has SFQ_DEPTH-1 packets, we can drop from the tail of this queue and skip the update of all the state. It is even an optimization for the case when we have a single flow. I am not 100% sure about this, I'll try and report.

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 9579573..cbf8089 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -270,7 +270,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 			q->tail = x;
 		}
 	}
-	if (++sch->q.qlen < q->limit-1) {
+	if (++sch->q.qlen <= q->limit) {
 		sch->bstats.bytes += skb->len;
 		sch->bstats.packets++;
 		return 0;
@@ -306,7 +306,7 @@ sfq_requeue(struct sk_buff *skb, struct Qdisc* sch)
 			q->tail = x;
 		}
 	}
-	if (++sch->q.qlen < q->limit - 1) {
+	if (++sch->q.qlen <= q->limit) {
 		sch->qstats.requeues++;
 		return 0;
 	}
@@ -391,10 +391,10 @@ static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
 	q->quantum = ctl->quantum ? : psched_mtu(sch->dev);
 	q->perturb_period = ctl->perturb_period*HZ;
 	if (ctl->limit)
-		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH);
+		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 2);

 	qlen = sch->q.qlen;
-	while (sch->q.qlen >= q->limit-1)
+	while (sch->q.qlen > q->limit)
 		sfq_drop(sch);
 	qdisc_tree_decrease_qlen(sch, qlen - sch->q.qlen);
@@ -423,7 +423,7 @@ static int sfq_init(struct Qdisc *sch, struct rtattr *opt)
 		q->dep[i+SFQ_DEPTH].next = i+SFQ_DEPTH;
 		q->dep[i+SFQ_DEPTH].prev = i+SFQ_DEPTH;
 	}
-	q->limit = SFQ_DEPTH;
+	q->limit = SFQ_DEPTH - 2;
 	q->max_depth = 0;
 	q->tail = SFQ_DEPTH;
 	if (opt == NULL) {
Re: Problem with implementation of TCP_DEFER_ACCEPT?
Hello! At present, with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake as invalid, dropping the ACK from the client without replying, so the client doesn't know the server has in fact set its internal ACKed flag. If the client doesn't send a packet containing data before the SYN_ACK time-outs finally expire, the connection will be dropped. I brought this up a long, long time ago, and I seem to remember Alexey Kuznetsov explained to me at the time that this was intentional. Obviously, I said something like: it is exactly what TCP_DEFER_ACCEPT does. There is no protocol violation here. The ACK from the client is considered as lost; this is quite normal and happens all the time. The handshake is not complete, the server remains in SYN-RECV state and continues to retransmit the SYN-ACK. If the client tried to cheat and is not going to send its request, the connection will time out. Alexey
Re: [RFC RTNETLINK 00/09]: Netlink link creation API
Hello! I just suggested to Pavel to create only a single device per newlink operation and bind them later. I see some logical inconsistency here. Look, the second end is supposed to be in another namespace. It will have an identity which cannot be expressed in any way in the namespace which is allowed to create the pair: name, ifindex - nothing is shared between namespaces. Moreover, do not forget we have two netlink spaces as well. Events happening in one namespace are reported only inside that namespace. Actually, the only self-consistent solution which I see right now (sorry, did not think that much) is to create the whole pair as one operation; the required parameters (name of partner, identity of namespace) can be passed as attributes. I guess the IFLA_PARTNER approach suggests the same thing, right? As a response to this action two replies are generated: one RTM_NEWLINK for one end of the device with the whole description of the partner is broadcast inside this namespace; another RTM_NEWLINK with the index/name of the partner device is broadcast inside the second namespace (and, probably, some attributes which must be hidden inside the namespace, f.e. the identity of the main device, are suppressed). Alexey
Re: [RFC RTNETLINK 00/09]: Netlink link creation API
Hello! Good point, I didn't think of that. Is there a version of this patch that already uses different namespaces so I can look at it? Pavel does not like the idea. It looks not exactly pretty, like you said. :-) The alternative is to create the pair in the main namespace and then move one end to another namespace, renaming it and changing its index. Why do I not like it? Because this makes RTM_NEWLINK just a useless step: all its work is undone and the real work is remade when the device moves, with all the unprettiness moved to another place. On the other hand, some move operation is required in any case. Right now in openvz the problem is solved in a tricky but quite interesting way: all the devices in the main namespace are assigned an even index, child devices get an odd index. So, when a device moves from the main namespace to a child, openvz does not need to change the ifindex; a conflict is impossible. Well, it is a working approach. But it is not pretty either. Are network namespaces completely separated, or is there some hierarchy with all lower namespaces visible above, or something like that? Right now they are completely separate. It is possible to make child devices visible in the parent namespace, like it is done for process pids: i.e. there is an abstract identity which is seen under different names and indices in different namespaces. Sounds cool, but this adds a lot of complexity which has no meaning outside the context of device creation; I do not think it is worth doing. The identity of the main device has no meaning within a different namespace, but are there other reasons for hiding it? Perhaps security. It is not a good idea to leak information about the parent namespace to a child namespace. Also, people will want to see emulated ethernet inside a namespace looking exactly like ethernet. No freaking additional attributes.
Alexey
Re: [PATCH] [IPV4] nl_fib_lookup: Initialise res.r before fib_res_put(res)
Hello! When CONFIG_IP_MULTIPLE_TABLES is enabled, the code in nl_fib_lookup() needs to initialize the res.r field before fib_res_put(&res) - unlike fib_lookup(), a direct call to ->tb_lookup does not set this field. Indeed, I am sorry. Alexey
[PATCH] infinite recursion in netlink
Hello! Replies to NETLINK_FIB_LOOKUP messages were misrouted back to the kernel, which resulted in infinite recursion and stack overflow. The bug is present in all kernel versions since the feature appeared. The patch also makes some minimal cleanup: 1. Return something consistent (-ENOENT) when the fib table is missing 2. Do not crash when the queue is empty (does not happen, but yet) 3. Put the result of the lookup

Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index fc920f6..cac06c4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -776,6 +776,8 @@ static void nl_fib_lookup(struct fib_res
 		.nl_u = { .ip4_u = { .daddr = frn->fl_addr, .tos = frn->fl_tos,
 				     .scope = frn->fl_scope } } };
+
+	frn->err = -ENOENT;

 	if (tb) {
 		local_bh_disable();
@@ -787,6 +789,7 @@ static void nl_fib_lookup(struct fib_res
 			frn->nh_sel = res.nh_sel;
 			frn->type = res.type;
 			frn->scope = res.scope;
+			fib_res_put(&res);
 		}
 		local_bh_enable();
 	}
@@ -801,6 +804,9 @@ static void nl_fib_input(struct sock *sk
 	struct fib_table *tb;

 	skb = skb_dequeue(&sk->sk_receive_queue);
+	if (skb == NULL)
+		return;
+
 	nlh = (struct nlmsghdr *)skb->data;
 	if (skb->len < NLMSG_SPACE(0) || skb->len < nlh->nlmsg_len ||
 	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*frn))) {
@@ -813,7 +819,7 @@ static void nl_fib_input(struct sock *sk
 	nl_fib_lookup(frn, tb);

-	pid = nlh->nlmsg_pid;		/* pid of sending process */
+	pid = NETLINK_CB(skb).pid;	/* pid of sending process */
 	NETLINK_CB(skb).pid = 0;	/* from kernel */
 	NETLINK_CB(skb).dst_group = 0;	/* unicast */
 	netlink_unicast(sk, skb, pid, MSG_DONTWAIT);
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Hello! Well, I don't think the loopback device is currently unregistered, but as soon as we get network namespace support we will have multiple loopback devices and they will get unregistered when we remove the network namespace. There is no logical difference. At the moment when the namespace is gone, there is nobody who can hold unrevokable references to this loopback. Alexey
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Hello! Does this look sane (untested)? It does not, unfortunately. Instead of the regular crash in infiniband you will get numerous random NULL pointer dereferences, both due to dst->neighbour and due to dst->dev. Alexey
Re: [ofa-general] Re: dst_ifdown breaks infiniband?
Hello! I think the thing to do is to just leave the loopback references in place, try to unregister the per-namespace loopback device, and that will safely wait for all the references to go away. Yes, it is exactly how it works in openvz. All the sockets are killed, queues are cleared, nobody holds references and the virtual loopback can be unregistered just like any other device. Alexey
Re: dst_ifdown breaks infiniband?
Hello! If a device driver sets neigh_destructor in neigh_params, this could get called after the device has been unregistered and the driver module removed. It is the same problem: if dst->neighbour holds a neighbour, it should not hold the device. parms->dev is not supposed to be used after neigh_parms_release(). F.e. set parms->dev to NULL to catch bad references. Do you search for a way to find the real infiniband device in ipoib_neigh_destructor()? I guess you will not be able to. The problem is logical: if the destructor needs the device, the neighbour entry _somehow_ has to hold a reference to the device (via neigh->dev, neigh->parms, whatever). Hence, if we hold a neighbour entry, unregister cannot be completed. Therefore, the destructor cannot refer to the device. Q.E.D. :-) Seems, releasing dst->neighbour is inevitable. Alexey
Re: dst_ifdown breaks infiniband?
Hello! infiniband sets parm->neigh_destructor, and I search for a way to prevent this destructor from being called after the module has been unloaded. Ideas? It must be called in any case to update/release internal ipoib structures. The idea is to move the call of parm->neigh_destructor from the neighbour destructor to the moment when the entry is unhashed, right after n->dead is set. infiniband is the only user (atm clip uses it too, but that use is obviously dummy), so nobody will be harmed. But ipoib will have to check the validity of skb->dst->neighbour before attempting to reinitialize private data on a dead (n->dead != 0) neighbour. Alexey
Re: dst_ifdown breaks infiniband?
Hello! This might work. Could you post a patch to better show what you mean to do? Here it is. ->neigh_destructor() is killed (not used), replaced with ->neigh_cleanup(), which is called when a neighbour entry goes to dead state. At this point everything is still valid: neigh->dev, neigh->parms etc. The device should guarantee that dead neighbour entries (neigh->dead != 0) do not get their private part initialized, otherwise nobody will clean them up. I think this is enough for ipoib, which is the only user of this thing. Initialization of the private part of neighbour entries happens in the ipoib start_xmit routine, which is not reached when the device is down. But it would be better to add an explicit test for neigh->dead in any case.

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index f9dbc6f..2b5c297 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct
 	queue_work(ipoib_workqueue, &priv->restart_task);
 }

-static void ipoib_neigh_destructor(struct neighbour *n)
+static void ipoib_neigh_cleanup(struct neighbour *n)
 {
 	struct ipoib_neigh *neigh;
 	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
@@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struc
 	struct ipoib_ah *ah = NULL;

 	ipoib_dbg(priv,
-		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
+		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));
@@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device

 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
 {
-	parms->neigh_destructor = ipoib_neigh_destructor;
+	parms->neigh_cleanup = ipoib_neigh_cleanup;

 	return 0;
 }
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 3725b93..ad7fe11 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -36,7 +36,7 @@ struct neigh_parms
 	struct net_device *dev;
 	struct neigh_parms *next;
 	int	(*neigh_setup)(struct neighbour *);
-	void	(*neigh_destructor)(struct neighbour *);
+	void	(*neigh_cleanup)(struct neighbour *);
 	struct neigh_table *tbl;

 	void	*sysctl_table;
diff --git a/net/atm/clip.c b/net/atm/clip.c
index ebb5d0c..8c38258 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc
 	spin_unlock_irqrestore(&PRIV(dev)->xoff_lock, flags);
 }

-static void clip_neigh_destroy(struct neighbour *neigh)
-{
-	DPRINTK("clip_neigh_destroy (neigh %p)\n", neigh);
-	if (NEIGH2ENTRY(neigh)->vccs)
-		printk(KERN_CRIT "clip_neigh_destroy: vccs != NULL !!!\n");
-	NEIGH2ENTRY(neigh)->vccs = (void *) NEIGHBOR_DEAD;
-}
-
 static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	DPRINTK("clip_neigh_solicit (neigh %p, skb %p)\n", neigh, skb);
@@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = {
 	/* parameters are copied from ARP ... */
 	.parms = {
 		.tbl			= &clip_tbl,
-		.neigh_destructor	= clip_neigh_destroy,
 		.base_reachable_time	= 30 * HZ,
 		.retrans_time		= 1 * HZ,
 		.gc_staletime		= 60 * HZ,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cfc6001 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_
 				n->dead = 1;
 				shrunk = 1;
 				write_unlock(&n->lock);
+				if (n->parms->neigh_cleanup)
+					n->parms->neigh_cleanup(n);
 				neigh_release(n);
 				continue;
 			}
@@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh
 				NEIGH_PRINTK2("neigh %p is stray.\n", n);
 			}
 			write_unlock(&n->lock);
+			if (n->parms->neigh_cleanup)
+				n->parms->neigh_cleanup(n);
 			neigh_release(n);
 		}
 	}
@@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *nei
 		kfree(hh);
 	}

-	if (neigh->parms->neigh_destructor)
-		(neigh->parms->neigh_destructor)(neigh);
-
 	skb_queue_purge(&neigh->arp_queue);

 	dev_put(neigh->dev);
@@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigne
 			*np = n->next;
 			n->dead = 1;
 			write_unlock(&n->lock);
+			if (n->parms->neigh_cleanup)
+				n->parms->neigh_cleanup(n);
Re: dst_ifdown breaks infiniband?
Hello! This is not new code, and should have triggered a long time ago, so I am not sure how come we are triggering this only now; somehow this did not lead to crashes in 2.6.20. I see. I guess this was plain luck. Why is neighbour->dev changed here? It holds a reference to the device and prevents its destruction. If the dst is held somewhere, we cannot destroy the device and we deadlock during unregister. We could not invalidate dst->neighbour, but it looked safe to invalidate neigh->dev after a quiescent state. Obviously, it is not and it never was safe. It was supposed to be repaired asap, but this did not happen. :-( Can dst->neighbour be changed to point to NULL instead, and the neighbour released? It should be cleared, and we should be sure it will not be destroyed before a quiescent state. Seems, this is the only correct solution, but to do this we have to audit all the places where dst->neighbour is dereferenced for RCU safety. Actually, it is very good you caught this eventually; the bug was so _disgusting_ that it was forgotten all the time, waiting for someone to point out that the king is naked. :-) Alexey
Re: dst_ifdown breaks infiniband?
Hello! Hmm. Something I don't understand: does the code in question not run on *each* device unregister? It does. Why do I only see this under stress? You should have some referenced destination entries to trigger the bad path. This should happen not only under stress. F.e. just try to ssh to something via this device. And unregister it. Seems, the crash is inevitable. If you do not see a crash, I will be puzzled. Alexey
Re: dst_ifdown breaks infiniband?
Hello! It should be cleared and we should be sure it will not be destroyed before quiescent state. I'm confused. didn't you say dst_ifdown is called after quiescent state? The quiescent state should happen after dst->neighbour is invalidated. And this implies that all the users of dst->neighbour check validity after dereference and do not use it after the quiescent state. This does not sound like something that's likely to be accepted in 2.6.21, right? Any simpler ideas? Well, if the infiniband destructor really needs to take that lock... no. Right now I do not see one. Alexey
Re: [PATCH] Copy mac_len in skb_clone() as well
Hello! What bug triggered that helped you discover this? Or is it merely from a code audit? I asked the same question. :-) openvz added some other fields to the skbuff, and when it was found that they are lost on clone, he tried to figure out how all this works and looked for other examples of this kind. As I understand it, the problem can be seen only in xfrmX_tunnel_input. If an uninitialized mac_len obtained from slab is larger than the current headroom, it could corrupt memory. Also, it looks like the fix is incomplete. copy_skb_header() also does not copy this field. But it will be initialized to 0 by alloc_skb in this case, and xfrmX_tunnel_input() just will not copy the mac header. Alexey
Re: [PATCH] TCP: Replace __kfree_skb() with kfree_skb()
Hello! do you know of any place where __kfree_skb is used to free an skb whose ref count is greater than 1? No. Actually, since kfree_skb is not inline, __kfree_skb could be made static and the remaining places still using it switched to kfree_skb.
Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19
Hello! So this whole idea to make run_filter() return signed integers and fail on negative is entirely flawed, it simply cannot work and retain the expected semantics which have been there forever. Actually, it can. The return value was used only as a sign of error, so the mistake was to return the original unsigned result casted to int. An alternative fix is enclosed. To be honest, it is not better than yours: duplication of a couple of lines of code against passing the return value by pointer. Alexey

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index da73e8a..51e5537 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -437,11 +437,13 @@ static inline int run_filter(struct sk_b
 	rcu_read_lock_bh();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL) {
-		err = sk_run_filter(skb, filter->insns, filter->len);
-		if (!err)
+		unsigned int res;
+
+		res = sk_run_filter(skb, filter->insns, filter->len);
+		if (!res)
 			err = -EPERM;
-		else if (*snaplen > err)
-			*snaplen = err;
+		else if (*snaplen > res)
+			*snaplen = res;
 	}
 	rcu_read_unlock_bh();
Re: RFC: consistent disable_xfrm behaviour
Hello! Alexey, do you remember what the original intent of this was? disable_policy was supposed to skip policy checks on input. It makes sense only on the input device. disable_xfrm was supposed to skip transformations on output. It makes sense only on the output device. If it does not work, it was done wrong. :-) As I see it, the root of the problem is that the DST_NOXFRM flag is calculated using the wrong device. out_dev should be used in __mkroute_input(). It looks like a cut-and-paste error; the code was taken from the output path, where it is correct.
Re: RFC: consistent disable_xfrm behaviour
Hello! Here's the patch again properly signed off. I think it is correct. Alexey
Re: 2.6.19-rc1: Volanomark slowdown
Hello! reduced Volanomark benchmark throughput by 10%. The irony of it is that the java vm used to be one of the victims of over-delayed acks. I will look; there is a little chance that it is possible to detect the situation and to stretch ACKs. There is one little question though. If you see a visible difference in performance, does it mean that Volanomark used to show much better numbers compared to other OSes? :-) Alexey
Re: [PATCH][RFC] Re: high latency with TCP connections
Hello! transactions to data segments is fubar. That issue is also why I wonder about the setting of tcp_abc. Yes, switching ABC on/off has a visible impact on the number of segments. When ABC is off, the number of segments is almost the same as the number of transactions. When it is on, ~1.5% are merged. But this is invisible in the throughput/cpu usage numbers. Those are the numbers: 1Gig link. The first column is b. "-" separates runs of netperf in the backward direction.

Run #1. One host is slower.

	old,abc=0 / new,abc=0 / new,abc=1 / old,abc=1 (one result line per kernel, in this order)

2	23652.00  6.31  21.11  10.665  8.924
	23622.16  6.47  21.01  10.951  8.893
	23625.05  6.21  21.01  10.512  8.891
	23725.12  6.46  20.31  10.898  8.559
-
	23594.87 21.90   6.44   9.283 10.912
	23631.52 20.30   6.36   8.592 10.766
	23609.55 21.00   6.26   8.896 10.599
	23633.75 21.10   5.44   8.929  9.206
4	36349.11  8.71  31.21   9.584  8.585
	36461.37  8.65  30.81   9.492  8.449
	36723.72  8.22  31.31   8.949  8.526
	35801.24  8.58  30.51   9.589  8.521
-
	35127.34 33.80   8.43   9.621  9.605
	36165.50 30.90   8.48   8.545  9.381
	36201.45 31.10   8.31   8.592  9.185
	35269.76 30.00   8.58   8.507  9.732
8	41148.23 10.39  42.30  10.101 10.281
	41270.06 11.04  31.31  10.698  7.585
	41181.56  5.66  48.61   5.496 11.803
	40372.37  9.68  56.50   9.591 13.996
-
	40392.14 47.00  11.89  11.637 11.775
	40613.80 36.90   9.16   9.086  9.019
	40504.66 53.60   7.73  13.234  7.639
	40388.99 48.70  11.93  12.058 11.814
16	67952.27 16.27  43.70   9.576  6.432
	68031.40 10.56  53.70   6.206  7.894
	6.95     12.81  46.90   7.559  6.920
	67814.41 16.13  46.50   9.517  6.857
-
	68031.46 51.30  11.53   7.541  6.781
	68044.57 40.70   8.48   5.982  4.986
	67808.13 39.60  15.86   5.840  9.355
	67818.32 52.90  11.51   7.801  6.791
32	90445.09 15.41  99.90   6.817 11.045
	90210.34 16.11 100.00   7.143 11.085
	90221.84 17.31  98.90   7.676 10.962
	90712.78 18.41  99.40   8.120 10.958
-
	89155.51 99.90  12.89  11.205  5.782
	90058.54 99.90  16.16  11.093  7.179
	90092.31 98.60  15.41  10.944  6.840
	88688.96 99.00  17.59  11.163  7.933
64	89983.76 13.66 100.00   6.071 11.113
	90504.24 17.54 100.00   7.750 11.049
	92043.36 17.44  99.70   7.580 10.832
	90979.29 16.01  99.90   7.038 10.981
-
	88615.27 99.90  14.91  11.273  6.729
	89316.13 99.90  17.28  11.185  7.740
	90622.85 99.90  16.81  11.024  7.420
	89084.85 99.90  17.51  11.214  7.861

Run #2. Slower host is replaced with a better one. ABC=0. No runs in backward direction.

	new / old

2	24009.73   8.80  6.49 3.667 10.806
	24008.43   8.00  6.32 3.334 10.524
4	40012.53  18.30  8.79 4.574  8.783
	3.84      19.40  8.86 4.851  8.857
8	60500.29  26.30 12.78 4.348  8.452
	60397.79  26.30 11.73 4.355  7.769
16	69619.95  39.80 14.03 5.717  8.063
	70528.72  24.90 14.43 3.531  8.184
32	132522.01 53.20 21.28 4.015  6.424
	132602.93 57.70 22.59 4.351  6.813
64	145738.83 60.30 25.01 4.138  6.865
	143129.55 73.20 24.19 5.114  6.759
128	148184.21 69.70 24.96 4.704  6.739
	148143.47 71.00 25.01 4.793  6.753
256	144798.91 69.40 25.01 4.793  6.908
	144086.01 73.00 24.61 5.067  6.832

Frankly, I do not see any statistically valid correlations. that linux didn't seem to be doing the same thing. Hence my tweaking when seeing this patch come along...] netperf does not catch this. :-) Even with this patch linux does not ack each second segment dumbly, it waits for some conditions, mostly read() emptying the receive queue. To model this it is necessary to insert some gaps between bursted segments, or to use a slow network. I have no doubt it is easy to model a situation when we send lots of useless ACKs. F.e. inserting 20ms gaps between requests. To see the effect on throughput/cpu, we could start enough connections doing the same thing. Alexey
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
Hello! I can't even find a reference to SIOCGSTAMP in the dhcp-2.0pl5 or dhcp3-3.0.3 sources shipped in Ubuntu. But I will note that tpacket_rcv() expects to always get valid timestamps in the SKB, it does a: It is equally unlikely it uses a mmapped packet socket (tpacket_rcv). I even installed that dhcp on x86_64. And I do not see anything; netstamp_needed remains zero when running both server and client. It looks like dhcp was defamed without guilt. :-) Seems, Andi saw some leakage of netstamp_needed (value of 7), but I do not see this either. In any case, the issue is obviously more serious than just the behaviour of some applications. On my notebook one gettimeofday() takes:

	0.2 us with tsc
	4.6 us with pm (AND THIS CRAP IS DEFAULT!!)
	9.4 us with pit (kinda expected)

It is ridiculous. Obviously, nobody (not only tcpdump, but everything else) needs such a clock. Taking a timestamp takes time comparable with processing the whole tcp frame. :-) I have no idea what is possible to do without breaking everything, but it is not something to ignore. This timer must be shot. :-) Alexey
Re: [PATCH][RFC] Re: high latency with TCP connections
Hello! It looks perfectly fine to me, would you like me to apply it Alexey? Yes, I think it is safe. Theoretically, there is one place where it can be not so good. A well-nagling tcp connection, which makes lots of small write()s, will send MSS-sized frames due to delayed ACKs. But if we ACK every other segment, more segments will come out incomplete, which could result in some decrease of throughput. But the trap for this case was set 6 years ago. For unidirectional sessions ACKs were sent not even for each second segment, but for each small segment. :-) This did not show any problems for those 6 years. I guess it means that the problem does not exist. Alexey
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
Hello! For netdev: I'm more and more thinking we should just avoid the problem completely and switch to true end2end timestamps. This means don't time stamp when a packet is received, but only when it is delivered to a socket. This will work. From the viewpoint of existing uses of the timestamp by the packet socket, this time is no worse. The only danger is violation of causality (when a forwarded packet or a reply packet gets a timestamp earlier than the original packet). This pathology was the main reason why the timestamp is recorded early, before the packet is demultiplexed in netif_receive_skb(). But it is not a practical problem: delivery to packet/raw sockets just happens to be placed _before_ delivery to real protocol handlers. handler runs. Then the problem above would completely disappear. Well, not completely. A too slow clock source remains a too slow clock source. If it is so slow that it results in performance degradation, it just should not be used at all; even such a pariah as tcpdump wants to be fast. Actually, I have a question. Why is the subject Network performance degradation from 2.6.11.12 to 2.6.16.20? I do not see the beginning of the thread and cannot guess why the clock source degraded. :-) Alexey
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
Hello! Hmm, not sure how that could happen. Also is it a real problem even if it could? As I said, the problem happens to be theoretical. It would arise f.e. if the packet socket handler were installed after the IP handler. Then tcpdump would get the packet after it is processed (acked/replied/forwarded). This would be disastrous; the results are unparsable. I recall the issue was discussed, and at that time it looked more reasonable to solve problems of this kind by taking the timestamp once, before the packet is seen by all the rest of the stack. Who could expect that the PIT nightmare was going to return? :-) Then it has to use the ACPI pmtmr which is really really slow. The overhead of that thing is so large that you can clearly see it in the network benchmark. I see. Thank you. Alexey
Re: [PATCH][RFC] Re: high latency with TCP connections
Hello! Of course, the number of ACKs increases. It is the goal. :-) unpleasant increase in service demands on something like a burst enabled (./configure --enable-burst) netperf TCP_RR test: netperf -t TCP_RR -H foo -- -b N # N > 1 foo=localhost

b	patched		orig
2	105874.83	105143.71
3	114208.53	114023.07
4	120493.99	120851.27
5	128087.48	128573.33
10	151328.48	151056.00

Probably, the test is done wrong. But I see no difference. to increase as a result. Pipelined HTTP would be like that, some NFS over TCP stuff too, maybe X traffic; X will be excited about better latency. As for protocols not interested in latency, they will be a little happier if transactions are processed asynchronously. But actually, it is not about increasing/decreasing the number of ACKs. It is about killing that pain in the ass which we used to have because we pretended to be too smart. Alexey
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
Hello! But that never happens right? Right. Well, not right. It happens. Simply because you get a packet with a newer timestamp after a previous handler saw this packet and did some actions. I just do not see any bad consequences. And do you have some other preferred way to solve this? Even if the timer was fast it would still be good to avoid it in the fast path when DHCPD is running. No. The way which you suggested seems to be the best.

1. It even does not disable the possibility to record the timestamp inside the driver, which Alan was afraid of. The sequence is:

	if (!skb->tstamp.off_sec)
		net_timestamp(skb);

2. Maybe, netif_rx() should continue to get a timestamp in netif_rx().

3. NAPI already introduced almost the same inaccuracy. And it is really silly to waste time getting a timestamp in netif_receive_skb() a few moments before the packet is delivered to a socket.

4. ...but the clock source, which takes one of the top lines in profiles, must be repaired yet. :-)

Alexey
Re: Network performance degradation from 2.6.11.12 to 2.6.16.20
Hello! Please think about it this way: suppose you have a heavily loaded router and some network problem is to be diagnosed. You run tcpdump and suddenly the router becomes overloaded (by switching to timestamp-it-all mode I am sorry. I cannot think that way. :-) Instead of attempts to scare, better resend the original report where you said how much performance degraded; I cannot find it.

* I do see get_offset_pmtmr() in the top lines of the profile. That's scary enough.
* I do not understand what the hell dhcp needs timestamps for.
* I will not listen to any suggestions to screw up tcpdump with a sysctl. The kernel already implements a much better thing than a sysctl. Do not want timestamps? Fix tcpdump, add an option, submit the patch to the tcpdump maintainers. Not a big deal.

Alexey
Re: [PATCH][RFC] Re: high latency with TCP connections
Hello! There isn't any sort of clever short-circuiting in loopback is there? No, not that I know of. I do like the convenience of testing things over loopback, but always fret about not including drivers and actual hardware interrupts etc. Well, if the test is right, it should show the cost of the redundant ACKs. Regardless, kudos for running the test. The only thing missing is the -c and -C options to enable the CPU utilization measurements which will then give the service demand on a CPU time per transaction basis. Or was this a UP system that was taken to CPU saturation? It is my notebook. :-) Of course, cpu consumption is 100%. (Actually, netperf shows 100.10 :-)) I will redo the test on a real network. What range of -b should I test? What I'm thinking about isn't so much about the latency I understand. Actually, I did those tests ages ago for a pure throughput case, when nothing goes in the opposite direction. I did not find a difference that time. And nobody even noticed that Linux sends ACKs for _each_ small segment on unidirectional connections, for all those years. :-) Alexey
Re: 2.6.18-rc6 memory mapped pcap truncates outgoing TCP packets, but not icmp
Hello! [PACKET]: Don't truncate non-linear skbs with mmaped IO. Non-linear skbs are truncated to their linear part with mmaped IO. Fix by using skb_copy_bits instead of memcpy. Ack. I remember this trick. The idea was that I needed only the TCP header in any case, and it was a perfect cutoff. This hack was not supposed to survive. :-) Alexey
Re: [PATCH] make ipv4 multicast packets only get delivered to sockets that are joined to group
Hello! No, it returns 1 (allow) if there are no filters to explicitly filter it. I wrote that code. :-) I see. It did not behave this way in the old times. From your mails I understood that the current behaviour matches other implementations (BSD, whatever), is that true? Alexey
Re: [PATCH] make ipv4 multicast packets only get delivered to sockets that are joined to group
Hello! IPv6 behaves the same way. Actually, Linux IPv6 filters received multicasts; inet6_mc_check() does this. IPv4 does not. I remember that attempts to do this were made in the past and failed, because some applications related to multicast routing did expect to receive all the multicasts even though they had not joined any multicast addresses. So, it was left intact. Alexey
Re: [PATCH][RFC] Re: high latency with TCP connections
Hello! Is this really necessary? No, of course. We lived for ages without this, we would live for another age. I thought that the problems with ABC were in trying to apply byte-based heuristics from the RFC(s) to a packet-oriented cwnd in the stack? It was just the last drop. Even with disabled ABC, that test shows some gaps in latency summing up to ~300 msec. Almost invisible, but not good. Too aggressive delack has many other issues. Even without ABC we have a quadratically suppressed cwnd on TCP_NODELAY connections compared to BSD: at the sender side we suppress it by counting cwnd in packets, at the receiver side by ACKing by byte counter. Each time another victim sees artificial latencies introduced by aggressive delayed acks, even though he requested TCP_NODELAY, our best argument is Stupid, you do it all wrong, how could you get decent performance? :-). Probably, we stand for a feature which really is not worth standing for and causes nothing but permanent pain in the ass. Alexey
Re: high latency with TCP connections
Hello! At least for slow start it is safe, but experiments with atcp for netchannels showed that it is better not to send an excessive number of acks when slow start is over. If this thing is done from tcp_cleanup_rbuf(), it should not affect performance too much. Note that with ABC and other pathological cases which do not allow sending more than a fixed amount of segments [ we have lots of them, f.e. sending tiny segments, we can hit the sndbuf limit ], we deal with the case when slow start is _never_ over. instead we can introduce some tricky ack avoidance scheme and ack at least 2-3-4 packets or a full MSS instead of two mss-sized frames. One smart scheme was used at some stage (2000, probably never merged in this form to mainstream): tcp counted the amount of unacked small segments in ack.rcv_small and kept a threshold in ack.rcv_thresh.

+
+	/* If we ever saw N>1 small segments from peer, it has
+	 * enough of send buffer to send N packets and does not nagle.
+	 * Hence, we may delay acks more aggressively.
+	 */
+	if (tp->ack.rcv_small > tp->ack.rcv_thresh+1)
+		tp->ack.rcv_thresh = tp->ack.rcv_small-1;
+	tp->ack.rcv_small = 0;

That was too much trouble for such a simple thing. So, eventually it was replaced with a much dumber scheme. Look at current tcp_cleanup_rbuf(). It forces an ACK each time it sees that some small segment was received. It survived for 6 years, so I guess it did not hurt anybody. :-) What I would suggest to do now is to replace:

	    (copied > 0 &&
	     (icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
	     !icsk->icsk_ack.pingpong &&
	     !atomic_read(&sk->sk_rmem_alloc)))
		time_to_ack = 1;

with:

	    (copied > 0 &&
	     (icsk->icsk_ack.unacked > 1 ||
	      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
	       !icsk->icsk_ack.pingpong)) &&
	     !atomic_read(&sk->sk_rmem_alloc)))
		time_to_ack = 1;

I would not hesitate even a minute, if the variable unacked could be calculated using some existing state variables.
Alexey
Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM
Hello! This path obviously breaks assumption 1) and therefore can lead to ABBA dead-locks. Yes... I've looked at the history and there seems to be no reason for the lock to be held at all in dev_watchdog_up. The lock appeared on day one and even then it was unnecessary. Seems, it serializes mod_timer and the timer handler to keep the timer in a predictable state. Maybe this is not necessary. A priori, it is required. Note that in dev_watchdog_down() queue_lock is released before taking xmit_lock. Probably, this is the thing which was supposed to be done in dev_watchdog_up() too. Alexey
[PATCH][RFC] Re: high latency with TCP connections
Hello! Some people reported that this program runs in 9.997 sec when run on FreeBSD. Try the enclosed patch. I have no idea why 9.997 sec is so magic, but I get exactly this number on my notebook. :-) Alexey

=

This patch enables sending ACKs each 2nd received segment. It does not affect either mss-sized connections (obviously) or connections controlled by Nagle (because there is only one small segment in flight). The idea is to record the fact that a small segment arrives on a connection, where one small segment has already been received and still not ACKed. In this case an ACK is forced after tcp_recvmsg() drains the receive buffer. In other words, it is a soft each-2nd-segment ACK, which is enough to preserve the ACK clock even when ABC is enabled.

Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 9bf73fe..de4e83b 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -147,7 +147,8 @@ extern struct sock *inet_csk_clone(struc
 enum inet_csk_ack_state_t {
 	ICSK_ACK_SCHED	= 1,
 	ICSK_ACK_TIMER	= 2,
-	ICSK_ACK_PUSHED	= 4
+	ICSK_ACK_PUSHED	= 4,
+	ICSK_ACK_PUSHED2 = 8
 };
 
 extern void inet_csk_init_xmit_timers(struct sock *sk,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..4f3b76f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -955,8 +955,11 @@ #endif
 		 * receive buffer and there was a small segment
 		 * in queue.
 		 */
-		    (copied > 0 &&
-		     (icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
-		     !icsk->icsk_ack.pingpong &&
-		     !atomic_read(&sk->sk_rmem_alloc)))
+		    (copied > 0 &&
+		     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
+		      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
+		       !icsk->icsk_ack.pingpong)) &&
+		     !atomic_read(&sk->sk_rmem_alloc)))
 			time_to_ack = 1;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 111ff39..5877920 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -156,6 +156,8 @@ static void tcp_measure_rcv_mss(struct s
 			return;
 		}
 	}
+	if (icsk->icsk_ack.pending & ICSK_ACK_PUSHED)
+		icsk->icsk_ack.pending |= ICSK_ACK_PUSHED2;
 	icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
 }
Re: ProxyARP and IPSec
Hello! <sarcasm> What a great idea. Now I just have to get every host I want to interoperate with to support a nonstandard configuration. The scary part is that if I motivate it with Linux is too stupid to handle standard tunnel-mode IPsec I might actually get away with it. </sarcasm> Sarcasm mode is not accepted. Linux does exactly standard tunnel-mode IPsec. It does not give you a device to make you totally happy. Probably, you are not aware that a standard IPsec tunnel device, if it is created:

1. Probably, will not accept fragmented frames, because IPsec cannot handle them.
2. Probably, will have undefined MTU (65536), because of 1.
3. Probably, will screw up TCP, because of 2.

etc. Actually, this is the reason why it is not implemented. It is a dirty business. :-) And the person who implements this has to be really... unscrupulous. :-) Alexey
Re: high latency with TCP connections
Hello! 2) a way to take delayed ACKs into account for cwnd growth This part is OK now, right? 1) protection against ACK division But Linux never had this problem... The congestion window was increased only when a whole skb was ACKed, flag FLAG_DATA_ACKED. (TSO could break this, but should not). Otherwise, such an ACK just advanced snd_una and nothing more. This aspect of ABC is crucial for BSD. TCP_NODELAY sockets did not obey congestion control there. From the very beginning, before slow start, they could send thousands of 1-byte segments. The only problem of the "too aggressive" kind with Linux was that we could develop a large cwnd sending small segments, and then switch to sending mss-sized segments. It does not look scary, to be honest. :-) Linux had troubles with slow start even before ABC. Actually, some applications can suffer from the same syndrome even with ABC disabled. With ABC it becomes TROUBLE, cwnd has no chance to develop at all. Probably, aspect 1 of ABC just should be disabled. And my first suggestion looks workable too. Alexey
Re: NAPI: netif_rx_reschedule() ??
Hello! However I'm confused about a couple of things, and there are only two uses of netif_rx_reschedule() in the kernel, so I'm a little stuck. First, do not believe even a single bit of code or docs about netif_rx_reschedule(). It was used once in the first version of NAPI for a 3com driver (which did not go to mainstream) and was left to rot. :-) 1. What is the intent of the second, 'undo' parameter? For example, ibmveth.c does

	if (ibmveth_rxq_pending_buffer(adapter) &&
	    netif_rx_reschedule(netdev, frames_processed)) {
		lpar_rc = h_vio_signal(adapter->vdev->unit_address,
				       VIO_IRQ_DISABLE);
		ibmveth_assert(lpar_rc == H_SUCCESS);
		more_work = 1;
		goto restart_poll;
	}

but it only does netdev->quota -= frames_processed; _after_ that block (and the jump back to restart_poll). So the whole thing seems fishy: netdev->quota goes up by the number of frames processed?? It is broken. netdev->quota MUST not be touched after netif_rx_complete(). The authors improved coding style, moving it closer to the update of *budget, and it is wrong. First, because it is changed in an absolutely unserialized context; second... you noticed. It's not clear to me why the driver would want to do something different depending on whether the NAPI poll was already scheduled or not. netif_rx_complete() released control. ->poll can be reentered on another CPU after this. If netif_rx_reschedule() fails, it means that ->poll must exit, because poll was already rescheduled. If it succeeds, it means the current thread got control back. Changes made before netif_rx_complete(), because we were going to exit (netdev->quota etc.), are undone and the loop can be restarted. To be honest, I do not recollect everything. F.e. scheduling a softirq in netif_rx_reschedule() looks like a cut-n-paste bug. Alexey
Re: [PATCH] fix sk-sk_filter field access
Hello! Function sk_filter() is called from the tcp_v{4,6}_rcv() functions with argument needlock = 0, while the socket is not locked at that moment. In order to avoid this and similar issues in the future, use rcu for sk->sk_filter field read protection. Patch is for net-2.6.19. What bug are you fixing? The current code looks correct to me. Why is it important to avoid the needlock stuff in the future? This patch description is very incomplete, you should list the precise reason why you are doing something instead of using vague language. Yes, the explanation could be better. Current code in tcp_v4_rcv() calls sk_filter() _before_ it takes the socket lock. This happened when the LSM patches were applied. Apparently, LSM does not want to see the socket locked in security_sock_rcv_skb(). The obvious solution is to change the third argument of sk_filter, needlock, to 1. Then we see that sk_filter() is not used with needlock=0 anymore, therefore it can be completely eliminated. That was the original fix. I suggested to remove the ugly misuse of bh_lock_sock() (introduced by me, just because there was no better lock to use) and replace it with RCU, which is logical and clean. The patch looks decent. I had one doubt about the misuse of rcu_read_lock_bh() in sk_attach_filter(). Probably, it should be plain local_bh_disable(), I do not know. But because rcu_read_lock_bh() actually is local_bh_disable(), it seems not to be a serious issue. Alexey
Re: [PATCH] fix sk-sk_filter field access
Hello!

> Really? It is used with needlock=0 by DCCP ipv6, for example. This case
> seems correct too. What about sk_receive_skb()? dn_queue_skb()? In fact,
> there seem to be numerous uses still with needlock=0, all legitimate.

Well, not quite legitimate. sk_receive_skb() has the same bug as
tcp_v4_rcv(). The call in backlog processing of ipv6/dccp is an occasionally
forgotten duplicate of the one done in sk_receive_skb(). Yes, those places
really added a bit to the desire to get rid of this misuse of
bh_lock_sock(). :-)

> Let us fix bugs first, and then consider rewriting the locking.

To be honest, I think switching to RCU is the simplest bug fix. At least, it
does not require searching for where bh_lock_sock() is taken for
dn_queue_skb(), if it is not forgotten at all. :-)

Alexey
Re: [PATCH 4/6] net neighbour: convert to RCU
Hello!

> @@ -346,8 +354,8 @@ struct neighbour *neigh_lookup(struct ne
>  	NEIGH_CACHE_STAT_INC(tbl, lookups);
>
> -	read_lock_bh(&tbl->lock);
> -	hlist_for_each_entry(n, tmp, &tbl->hash_buckets[hash_val], hlist) {
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(n, tmp, &tbl->hash_buckets[hash_val], hlist) {
>  		if (dev == n->dev && !memcmp(n->primary_key, pkey, key_len)) {
>  			neigh_hold(n);
>  			NEIGH_CACHE_STAT_INC(tbl, hits);

Sure? Seems, you cannot grab the refcnt here, the entry can be already
released. Probably, you should do atomic_inc_and_test() here and restart the
lookup if it fails.

Alexey
Re: [RFC IPv6] Disabling IPv6 autoconf
Hello!

> Yes, it is logical because without multicast IPV6 cannot work correctly.

This is not quite true. IFF_BROADCAST is enough, it will work just like
IPv4. Real troubles start only when an interface is not IFF_BROADCAST and
not IFF_POINTOPOINT.

> The IFF_MULTICAST flag seems potentially problematic. How many other
> things break over such a device?

Nothing should break. IFF_MULTICAST is an advisory flag, saying mostly "You
do not want to stream high bandwidth multicast video here." So, it can be
used to block autoconfiguration.

It does not change the fact that the Xen device does something profoundly
wrong. IPv6 autoconfiguration is _auto_configuration. It is triggered only
for a few media types, for which autoconfiguration is prescribed by the
corresponding RFCs. Ethernet is one of them. If Xen does not support the
things which are required of each ethernet device, it should not be
ARPHRD_ETHER. If it wants to pretend to be ARPHRD_ETHER, it must support
basic ethernet functions, which IMHO is so _easy_ that the question does not
even make sense.

Alexey
Re: [PATCH 4/6] net neighbour: convert to RCU
Hello!

> atomic_inc_and_test is true iff the result is zero, so that won't work.

I meant atomic_inc_not_zero(), as Martin noticed.

> But the following should work:
>
>	hlist_for_each_entry_rcu(n, tmp, &tbl->hash_buckets[hash_val], hlist) {
>		if (dev == n->dev && !memcmp(n->primary_key, pkey, key_len)) {
>			if (unlikely(atomic_inc_return(&n->refcnt) == 1)) {
>				neigh_release(n);

I do not think it will work. It has exactly the same race condition.

Yes, atomic_inc_not_zero() is expensive. But it looks like it is the
cheapest variant which works correctly without more work.

Another variant would be to rework the use of refcnt. It can be done like
the rt cache: where release of the last reference does not mean anything.

Also, probably, it makes sense to add neigh_lookup_light(), which does not
take a refcnt, but requires calling neigh_release_light() (which is just
rcu_read_unlock_bh()).

Alexey
Re: [PATCH 4/6] net neighbour: convert to RCU
Hello!

> > Also, probably, it makes sense to add neigh_lookup_light(), which does
> > not take a refcnt, but requires calling neigh_release_light() (which is
> > just rcu_read_unlock_bh()).
>
> Which code paths would that make sense on?
>   fib_detect_death (ok)
>   infiniband (ok)
>   wireless/strip (ok) -- hey, this code is crap and it has a refcount
>                          leak already!
>   arp_req_get (ok)
>   ndisc (ok)
>
> Perhaps killing the refcount altogether, and just changing everybody to
> neigh_lookup_rcu(). Nobody holds a long term reference to the entries.

The only real user of refcnt is the destination cache. It uses
__neigh_lookup()/__neigh_lookup_errno(), which could also use
neigh_lookup_rcu(), then do atomic_inc_not_zero() and, if it fails, fall
back to neigh_create().

Alexey
Re: [PATCH 4/6] net neighbour: convert to RCU
Hello!

> This should not be any more racy than the existing code.

Existing code is not racy. The critical place is the interpretation of
refcnt==1. Current code assumes that when refcnt==1 and the entry is in the
hash table, nobody can take this entry (the table is locked). So, it can be
unlinked from the table. See?

__neigh_lookup()/__neigh_lookup_errno() _MUST_ return an alive and hashed
entry. And it will stay hashed as long as the reference is held. Or until
the neigh entry is forced to destruction by a device going down; in that
case the referring dst entries will die as well.

If the dst cache grabs an entry which is purged from the table because for
some time it had refcnt==1, you get a valid dst entry referring to a dead
neighbour entry.

Alexey
Re: [PATCH 4/6] net neighbour: convert to RCU
Hello!

Yes, I forgot to say I take back my suggestion about atomic_inc_not_zero().
It would not work.

Seems, it is possible to add some barriers around setting n->dead and
testing it in neigh_lookup_rcu(), but it would be scary and ugly. To be
honest, I just do not know how to do this. :-)
Re: [PATCH 4/6] net neighbour: convert to RCU
Hello!

> Race 1: w/o RCU
>   Cpu 0: is in neigh_lookup
>     gets read_lock()
>     finds entry
>     ++refcount to 2
>     updates it
>   Cpu 1: is in forced_gc()
>     waits at write_lock()
>   Cpu 0:
>     releases read_lock()
>     drops ref count to 1.
>   Cpu 1:
>     sees ref count is 1
>     deletes it

Do you mean it is purged, though it is actually fresh? That is a harmless
race condition.

> Why must it be hashed, it could always get zapped just after the update.

Because otherwise we have to check its validity at the level of the dst
cache and to rebind. We do not want this. Actually dst->neighbour is
supposed to be immutable. It means that if some valid dst refers to a
neighbour, it must remain hashed.

> Hmm.. Since this is a slow path,

All the neighbour lookups are in the slow path. Actually, lookup of a
neighbour entry for use in the dst cache is the most loaded path; the rest
of them are pure maintenance paths.

Yes, this would work. RCU is a little spoiled, but at least the global
tbl->lock is replaced with a per-neighbour lock.

Alexey
Re: ProxyARP and IPSec
Hello!

> I'm thinking that David definitely has a point about having a usability
> problem, though. All other kinds of tunnels have endpoint devices
> associated with them, and that would make all these kinds of problems go
> away,

Yes, when you deal with sane practical setups, this approach is the only
reasonable one. Unfortunately, IPsec is not something totally sane and
practical :-), the security gateway case is a small part of it, and the
routing viewpoint clashes fatally with other requirements. The pure result
is that we use the approach where it is possible to do everything with some
effort, rather than the approach which is simple and intuitive but does not
allow to do many things.

It is possible to simulate simple life, creating ipsecX devices with
disabled xfrm and routing all the tunnels there.

> That would be handy.

I would just advise to rename one of the dummy devices to ipsec0 and route
all the IPsec tunnels there. It is also simple.

As for iptables, I am sorry, it is too flexible to control IPsec. :-) One
day, someone with enough energy and stamina will make the flow cache unify
all the kinds of policy rules. Until that day, you have to tune all three
policy sets (routing, ipsec and iptables) separately and take care of the
cases when one set has to cheat another. :-)

Alexey
Re: ProxyARP and IPSec
Hello!

> What he's trying to accomplish doesn't sound all that weird,

Absolutely sane.

> does anyone have any other ideas?

The question is where this host really is. If it is far, far away and
connected only via an IPsec tunnel, with the destination of the tunnel
different from the host address,

	ip ro add THEHOST dev dummy0

should be enough. It asserts that THEHOST is not on eth0. IPsec policy will
figure out the correct route, unless something is broken.

If the tunnel endpoint is THEHOST, then it is necessary to make a
prescription for how to reach it bypassing IPsec. This can be made with a
rule telling that THEHOST is reachable from the router and only from the
router:

	ip ru add from OUR_TUNNEL_ENDPOINT to THEHOST table XXX
	ip ro add THEHOST via THAT_ROUTE_WHICH_IS_SUPPOSED_TO_KNOW table XXX

Alexey
Re: Get rid of /proc/sys/net/unix/max_dgram_qlen
Hello!

> Either this, or it should be implemented correctly, which means poll needs
> to be fixed to also check for max_dgram_qlen,

Feel free to do this correctly. :-) Deleting wrong code rarely helps. It is
the only protection against committing an infinite amount of memory to a
socket.

Alexey
Re: Get rid of /proc/sys/net/unix/max_dgram_qlen
Hello!

> > It is the only protection against committing an infinite amount of
> > memory to a socket.
>
> Doesn't the
>
>	if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
>
> check in sock_alloc_send_pskb() limit things already?

Unfortunately, it does not. You can open a socket, send something to a
selected victim, close it, and repeat this until the receiver accumulates
enough skbs to kill the system.

Alexey
Re: [take12 0/3] kevent: Generic event handling mechanism.
Hello!

> > No way - timespec uses long.
>
> I must have missed that discussion. Please enlighten me in what regard
> using an opaque type with lower resolution is preferable to a type defined
> in POSIX for this sort of purpose.

Let me explain, as a person who made this mistake and deeply regrets it.

F.e. in this case you just cannot use kevents in a 32bit application on
x86_64, unless you add a whole translation layer inside the kevent core.
Even when you deal with a plain syscall, translation is a big pain, but when
you use an mmapped buffer, it can be simply impossible.

F.e. my mistake was unsigned long in struct tpacket_hdr in
linux/if_packet.h. It makes use of the mmapped packet socket essentially
impossible for 32bit applications on 64bit archs.

Alexey
Re: Get rid of /proc/sys/net/unix/max_dgram_qlen
Hello!

> Isn't a socket freed until all skbs are handled? In which case the limit
> on the number of open files limits the total memory usage? (Same as with
> streaming sockets?)

Alas. The number of closed sockets is not limited. Actually, it is limited
by sk_max_ack_backlog*max_files, which is a lot.

The problem is specific to unconnected datagram sockets (predicate
unix_peer(other) != sk).
[PATCH] locking bug in fib_semantics.c
Hello!

[IPV4]: severe locking bug in fib_semantics.c

The patch is for net-2.6.19, but the bug is present in all the kernels
since yore. Found in 2.4 by Yixin Pan [EMAIL PROTECTED]. Why do we need
lockdep, when sharp-sighted eyes are available? :-)

> When I read fib_semantics.c of Linux-2.4.32, write_lock(&fib_info_lock) is
> used in fib_release_info() instead of write_lock_bh(&fib_info_lock). Is
> the following case possible: a BH interrupts fib_release_info() while
> holding the write lock, and calls ip_check_fib_default() which calls
> read_lock(&fib_info_lock), and spins forever.

Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]

---

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 4ea6c68..5dfdad5 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -159,7 +159,7 @@ void free_fib_info(struct fib_info *fi)
 
 void fib_release_info(struct fib_info *fi)
 {
-	spin_lock(&fib_info_lock);
+	spin_lock_bh(&fib_info_lock);
 	if (fi && --fi->fib_treeref == 0) {
 		hlist_del(&fi->fib_hash);
 		if (fi->fib_prefsrc)
@@ -172,7 +172,7 @@ void fib_release_info(struct fib_info *f
 		fi->fib_dead = 1;
 		fib_info_put(fi);
 	}
-	spin_unlock(&fib_info_lock);
+	spin_unlock_bh(&fib_info_lock);
 }
 
 static __inline__ int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
@@ -598,7 +598,7 @@ static void fib_hash_move(struct hlist_h
 	unsigned int old_size = fib_hash_size;
 	unsigned int i, bytes;
 
-	spin_lock(&fib_info_lock);
+	spin_lock_bh(&fib_info_lock);
 	old_info_hash = fib_info_hash;
 	old_laddrhash = fib_info_laddrhash;
 	fib_hash_size = new_size;
@@ -639,7 +639,7 @@ static void fib_hash_move(struct hlist_h
 	}
 	fib_info_laddrhash = new_laddrhash;
 
-	spin_unlock(&fib_info_lock);
+	spin_unlock_bh(&fib_info_lock);
 
 	bytes = old_size * sizeof(struct hlist_head *);
 	fib_hash_free(old_info_hash, bytes);
@@ -820,7 +820,7 @@ link_it:
 	fi->fib_treeref++;
 	atomic_inc(&fi->fib_clntref);
-	spin_lock(&fib_info_lock);
+	spin_lock_bh(&fib_info_lock);
 	hlist_add_head(&fi->fib_hash,
 		       &fib_info_hash[fib_info_hashfn(fi)]);
 	if (fi->fib_prefsrc) {
@@ -839,7 +839,7 @@ link_it:
 		head = &fib_info_devhash[hash];
 		hlist_add_head(&nh->nh_hash, head);
 	} endfor_nexthops(fi)
-	spin_unlock(&fib_info_lock);
+	spin_unlock_bh(&fib_info_lock);
 	return fi;
 
 err_inval:
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
Hello!

> The netlink header pid is really akin to sadb_msg_pid from RFC 2367. IMHO
> it should always be zero if the kernel is the originator of the message.

No. The analogue of sadb_msg_pid is nladdr.nl_pid. The netlink header pid is
not the originator of the message, but the author of the change. The notion
is ambiguous by definition, and the field is a little ambiguous. If the
message is a plain ack or part of a dump, it is obviously the pid of the
requestor. But if it is a notification about a change, it can be the nl_pid
of the socket which requested the operation, but it may be 0.

Of course, all the 0s sent only because I was lazy to track authorship
should be eliminated.

Alexey
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
Hello!

> In one conversation with Alexey he told me there was some inspiration from
> pfkey in the semantics of it, i.e. processid.

Inspiration, but not a copy. :-) Unlike pfkeyv2, it uses addressing usual
for networking, i.e. struct sockaddr_nl.

Alexey
Re: [RFC] network namespaces
Hello!

> (application) containers. Performance aside, are there any reasons why
> this approach would be problematic for c/r?

This approach is just perfect for c/r. Probably, this is the only approach
where migration can be done in a clean and self-consistent way.

Alexey
Re: [PATCH?] tcp and delayed acks
Hello!

> send out any delayed ACKs when it is clear that the receiving process is
> waiting for more data?

It has just been done in tcp_cleanup_rbuf() a few lines before your chunk.
There is some complex condition to be satisfied there and it is impossible
to relax it any further. I do not know what is wrong in your case; check
which of the conditions in tcp_cleanup_rbuf() was not satisfied. Or just
tcpdump a little.

BTW what buffer do you mean? SO_SNDBUF? SO_RCVBUF? Something else? TCP
tries to tune itself to weird buffer sizes to allow at least 2 segments in
flight, exactly to address the problem with delayed ACKs, but it is quite
easy to confuse those heuristics.

Alexey
Re: skb_shared_info()
Hello!

> I still like the existing way - it is much simpler (I hope :) to convince
> e1000 developers to fix the driver's memory usage

e1000 is not a problem at all. It just has to use pages. If it is going to
use high order allocations, it will suck, be it order 3 or 2.

> area (is MAX_TCP_HEADER enough for all, what about ipv6 options?)...

This does not matter. All the cases with pathologically extended headers go
the slow path with a non-trivial pskb_pull(). This is what all those options
are invented for.

> unix socket use that.

Totally not worth it, unless it is a part of something bigger.

> global changes like you suggested, we can break things more
> fundamentally :)

_You_ have to. :-) Sender zerocopy stalled only because of this.

Alexey
Re: skb_shared_info()
Hello!

> e1000 will setup head/data/tail pointers to point to the area in the first
> sg page.

Maybe. But I still hope this is not necessary; the driver should be able to
do at least primitive header splitting, in which case the header could be
inlined into the skb.

Alternatively, the header can be copied with skb_pull(). Not good, but it is
at least a regular procedure, which f.e. allows to keep the pages in
highmem.

> What if we will store the array of pages in shared info just like we have
> right now. So there will be only one destructor.

Seems, this is not enough. Actually, I was asking about your experience with
aio. The simplest situation which confuses me, translated to aio language:
two senders send two pages. Those pages are to be recycled when we are done.
We have to notify both. And we have to merge both those pages into one
segment. The second half of that mail suggested three different solutions,
all of them creepy. :-)

Alexey
Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()
Hello!

> Some of these removals of current->pid will affect users such as quagga,
> zebra, vrrpd etc.

If they survived the cleanup in IPv4, they definitely will not feel the
cleanup in IPv6. Thomas does great work, Jamal, do not worry. :-)

> IMO, I believe there is a strong case that can be made for events that
> were caused by non-netlink users such as ioctls that could at least be
> multicast with current->pid.

Highly not desired.

> It would also be acceptable if the quagga etc folks could meet their goals
> some other way.

To all that I remember, that past discussion essentially closed the
question.

Alexey
Re: [RFC 5/7] neighbour: convert lookup to sequence lock
Hello!

> That wouldn't work if hard_header() ever expands the head. Fortunately
> hard_header() returns the length added even in case of an error so we can
> undo the absolute value returned.

Yes. Or probably it is safer to undo to skb->nh. Even if hard_header
expands the skb, skb->nh still remains valid.

Alexey
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> So we do something like this:

Yes, exactly. Actually, there was a function with similar functionality:
rtnetlink_send(). net/sched/* used it; older net/ipv4/ still did this
directly.

Alexey
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> Makes sense, especially for auto generated handles. I've been listening to
> the notifications on a separate socket for this purpose.

That's... complicated. But cool. :-)

> It does make sense, the way it has been implemented if at all is creepy.
> Even worse, IPv6 is using current->pid, some other code has been using the
> pid from NETLINK_CREDS() :-)

Yeah, some time ago I sent a patch replacing those with 0, but apparently I
forgot to grep IPv6. And did not even search for NETLINK_CREDS().

> when receiving multicasts. Also because some of the multicast code is
> buggy and provides the pid of the requestor's socket to
> netlink_broadcast() leading to excluding that socket.

Actually, it was the idea. If the requestor asked for NLM_F_ECHO and
subscribed to multicasts, it suppresses double notifications. If it did not
ask for NLM_F_ECHO, it is not interested in the results, it knows what's
going on without this. F.e. it was used by my implementation in gated: it
did not set either NLM_F_ECHO or even NLM_F_ACK. And when making massive
batch updates, it received nothing back: only errors and updates made by
someone else.

> I'm not sure I understand this correctly, if rcvbuf space was eaten by
> multicasts subsequent recvmsg() will follow invoking netlink_dump() again
> and the dump continues.

Indeed. I forgot how it works. :-) I understood what is happening there.
sock_rmalloc() fails not due to rcvbuf, but because of global resource
limiting. Grr... it does not look solvable. Luckily, in a normal kernel this
can happen only if a 0-order GFP_KERNEL allocation fails. Not impossible
with the oom_killer, but it definitely moves the status of the problem to
marginal.

Alexey
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> Actually I think the only safe solution is to allocate a separate socket
> for multicast messages. In other words, if you want reliable unicast
> reception on a socket, don't bind it to a multicast group.

Yes, it was the point of my advocacy of NLM_F_ECHO. :-)

Alexey
Re: the mystery that is sock_fasync
Hello!

> Did I miss some way that multiple file objects can point to the same
> socket inode?

Absolutely prohibited. Always was. Apparently, sock_fasync() was cloned
from tty_fasync(); that's the only reason why it is so creepy.

Alexey
Re: skb_shared_info()
Hello!

> management schemes and to just wrap SKB's around arbitrary pieces of data.

+ and something clever like a special page_offset encoding meaning "use
data, not page". But for what purpose do you plan to use it?

> The e1000 issue is just one example of this, another

What is this issue? As for the aggregated tcp queue, I can guess you did
not find a place where to add protocol headers, but I cannot figure out how
adding non-pagecache references could help. You would rather want more than
one skb_shared_info(): at least two, one immutable, another for headers.

I think Evgeniy's idea about inlining skb_shared_info into the skb head is
promising and simple enough. The whole point of a shared skb_shared_info
was to make cloning fast. But it makes lots of sense to inline some short
vector into the skb head (and, probably, even a MAX_HEADER space _instead_
of the space for fclone).

With an aggregated tcp send queue, when transmitting a segment, you could
allocate a new skb head with space for the header and either take the
existing skb_shared_info from the queue, attach it to the head and set
offset/length; or, alternatively, set one or two of the page pointers in
the array inlined in the head. (F.e. in the case of the AF_UNIX socket
mentioned by Evgeniy, we would keep data in pages and attach it directly to
the skb head.) Cloning becomes more expensive, but who needs it cheap, if
tcp does not?

Returning to "arbitrary pieces of data". Page cache references in
skb_shared_info are a unique thing: get_page()/page_cache_release() are
enough to clone data. But this is not enough even for such a simple thing
as splice(). It wants us to remember some strange pipe_buffer, where each
page is wrapped together with some ops, flags and even pipe_inode_info :-),
and to call some destructor when it is released. The first thought is that
it is insane: it does not respect page cache logic, requires us to
implement an additional level of refcounting, and abuses the amount of
information which has to be stored in the skb beyond all limits of sanity.

But the second thought is that something like this is required in any case.
At least we must report to someone when a page is not in use and can be
recycled. I think Evgeniy knows more about this; AIO has the same issue.
But that is simpler, because the release callback can be done not per
fragment or even per-skb, but actually per-transaction.

One idea is to declare (some) skb_shared_info completely immutable, and to
force each layer who needs to add a header or to fragment to refer to the
original skb_shared_info as a whole, using for modifications another
skb_shared_info or an area inlined in the skb head. And if someone is not
able to, he must reallocate all the pages. In this case the
destructor/notification can be done not per fragment, but for the whole
aggregated skb_shared_info. Seems, it will work both with an aggregated tcp
queue and with udp.

Alexey
Re: sender throttling for unreliable protocols not garuanteed? (different units in sock-wmem_alloc and net_devive-tx_queue_len)
Hello!

> I'd be interested in any opinions on the above mentioned effect.

Everything is right, it is exactly how it works. Well, use another qdisc
which counts in bytes rather than in frames (f.e. bfifo). Set sndbuf small
enough. And if sndbuf*#senders is still too large, you have to use fair
queueing; sfq is quite good for this purpose.

Alexey
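As an editorial aside, the two suggested qdisc changes look roughly like the
following config fragment (the device name and the limits are placeholders
to adapt, not values from the thread):

```shell
# Byte-counting FIFO instead of the default packet-counting pfifo:
# tx_queue_len no longer limits in frames, bfifo limits in bytes.
tc qdisc add dev eth0 root bfifo limit 64kb

# Or stochastic fair queueing, so sndbuf*#senders cannot let one
# blasting sender monopolize the device queue.
tc qdisc add dev eth0 root sfq perturb 10
```

Shrinking SO_SNDBUF per socket (setsockopt with SO_SNDBUF) is the
complementary knob on the application side.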
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> I get your point and I see the value. Unfortunately, probably due to lack
> of documentation, this feature isn't used by any applications I know of.

Well, tc was supposed to use it, but this did not happen and it remained
deficient.

> We even put in the hacks to make identification of own caused
> notifications easier by storing the netlink pid of the originator in the
> notification message.

Actually, it was supposed to be done everywhere, but the originator info
did not propagate deep enough in many cases, especially in IPv6. So, this
is not a hack, it is good work. :-)

BTW I have just remembered why it was especially important; this should be
documented as well. Each socket which subscribes to multicasts becomes
sensitive to rcvbuf overflows. F.e. when you do control operations on a
socket which is subscribed to multicasts, the response can be lost in the
stream of events and -ENOBUFS generated instead. If it is a daemon, it can
resync the state, but if it is a simple utility, it cannot recover.
Probably, unicasts sent due to NLM_F_ECHO should somehow override rcvbuf
limits.

This reminded me about a capital problem, found by the openvz people.
Frankly speaking, I still have no idea how to repair this; probably you
will find a solution. Look: during a dump, skb allocation can fail (for
many reasons, the most obvious being that rcvbuf space was eaten by
multicasts). But the error is not reported! Oops. The worst thing is that
even if an error were reported, iproute would ignore it.

Alexey
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> This patch handles NLM_F_ECHO in netlink_rcv_skb() to handle it in a
> central point. Most subsystems currently interpret NLM_F_ECHO as to just
> unicast events to the originator of the change while the real meaning of
> the flag is to echo the request.

Do you not think it is useless to echo something back to the originator,
who just sent it?

Actually, the sense of NLM_F_ECHO was to tell the user what happened due to
his request. The answer is not the original request, which can contain some
incomplete fields etc., but full information about the object
deleted/added/changed. Moreover, the feedback can contain several messages
(though accurately it is done only in net/sched/), f.e. when the request
triggered deletion of one object and addition of another. Obviously, it
cannot be done in a central place.

Normally, it is not needed; "ip route add" does not tell the user what
actually was done, so it suppresses the echo. But for a multistage
operation it is absolutely necessary: the answer contains f.e.
auto-allocated handles, which should be given in subsequent requests.

Alexey
Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()
Hello!

> What's wrong with listening to the notification for that purpose?

Nothing! NLM_F_ECHO _is_ listening for notifications, without subscription
to multicast groups and the need to figure out which messages are yours.
But beyond this NLM_F_ECHO is totally a subset of this. Which still makes
much more sense than echoing of a known thing, does it not?

Alexey
Re: [PATCH] llc: SOCK_DGRAM interface fixes
Hello! > This fix goes against the old historical comments about UNIX98 semantics but without this fix SOCK_DGRAM is broken and useless. So either ANK's interpretation was incorrect or UNIX98 standard was wrong. Just found this reference to me. :-) The comment migrated from tcp.c. It is only about connected SOCK_STREAM sockets; I do not see how it can make SOCK_DGRAM broken or useless. That UNIX98 statement allowed avoiding an expensive callback into protocol specific setup of the address in tcp_recvmsg(). Alexey
Re: [PATCH] limit rt cache size
Hello! > During OpenVZ stress testing we found that UDP traffic with random src can generate excessive rt hash growth, leading finally to OOM and kernel panics. It was found that for a 4GB i686 system (having 1048576 total pages and 225280 normal zone pages) the kernel allocates the following route hash: > syslog: IP route cache hash table entries: 262144 (order: 8, 1048576 bytes) => ip_rt_max_size = 4194304 entries, i.e. max rt size is 4194304 * 256b = 1GB of RAM > normal_zone Grrr... Indeed. > Attached is the patch which removes the HASH_HIGHMEM flag from the alloc_large_system_hash() call. However, I'm not sure whether it should be removed as well for TCP tcp_hashinfo.ehash and tcp_hashinfo.bhash (as those are probably limited by number of files?). The patch looks OK. But I am not sure too. To be honest, I do not understand the sense of the HASH_HIGHMEM flag. At first sight, the hash table eats low memory, and the objects hashed in this table also eat low memory. Why is its size calculated from total memory? But taking into account that this flag is used only by tcp.c and route.c, both of which feed on low memory, I miss something important. Let's ask people on netdev. What's about routing cache size, it looks like another bug. route.c should not force rt_max_size = 16*rt_hash_size. I think it should consult available memory and limit rt_max_size to some reasonable value, even if the hash size is too high.

--- ./net/ipv4/route.c.xrt	2006-07-14 19:08:33.0 +0400
+++ ./net/ipv4/route.c	2006-08-07 18:25:37.0 +0400
@@ -3149,7 +3149,7 @@ int __init ip_rt_init(void)
 					rhash_entries,
 					(num_physpages >= 128 * 1024) ?
 						15 : 17,
-					HASH_HIGHMEM,
+					0,
 					&rt_hash_log,
 					&rt_hash_mask,
 					0);
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
Hello! > Alexey, any suggestions on how to handle this kind of thing? A device which adds something at head must check for space. Anyone who adds something at head must check. Otherwise, it will remain buggy forever. > What's wrong with my patch? As I already said, there is nothing wrong with the first chunk. Except that it hides the real problem. > hardly their author's fault. I don't think we've ever advertised hard_header_len is valid only with non-NULL hard_header. Do not get it wrong. dev->hard_header_len is _NEVER_ guaranteed. The places which allocate the skb take it as a hint to avoid reallocation. But each place which stuffs something at head still must check the space. The only difference in the situation with dev->hard_header is that when dev->hard_header != NULL, the header is added by IP itself. That's why IP checks it. Alexey
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
Hello! > Do the semantics (I'm not talking about bugs) allow skb passed to dev->hard_header() (if defined) No. dev->hard_header() should get enough space, which is dev->hard_header_len. Actually, it is a historical hole in the design, inherited from ancient times. The calling conventions of dev->hard_header() just did not allow reallocation. BTW in 2.6 it can, if it uses pskb_expand_head(). > and then to dev->hard_start_xmit() to have less link layer header space than dev->hard_header_len? Absolutely. It used to happen all the time. All those devices which occasionally forget to check for space must be fixed. > I.e., is dev->hard_header_len only advisory? For the initial allocator it is an advice. For layers which add something at head, it is just nothing, if there is enough space. And it is again an advice when the skb is reallocated. > Anyway, the issue with kernel panic is real so I think we better fix it before 2.6.18, and propagate to stable series as well. :-) Know what? This problem has followed us since prehistoric times. It happened in 2.4-stablest, 2.2-stable, 2.0... The same devices, the same problem: no matter how much space is given to them, they manage to find a hole and crash. :-) Alexey
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
Hello! > It does seem weird that IP output won't pay attention to Not so weird, actually. The logic was: only the initial skb allocation tries to reserve all the space, to avoid copies in the future. All the rest of the places just check that there is enough space for their immediate needs. If dev->hard_header() is NULL, it means that the stack does not need any space at all, so it does not need to worry. The right logic for reallocation would be: if (skb_headroom(skb) < space_which_I_need_now) { skb2 = skb_realloc_headroom(skb, space_for_future); } That logic was not followed exactly only because of laziness: each time some device is found which forgets to check for space, reallocation is made in absolutely inappropriate places. F.e. ip_forward() does not need to reallocate the skb when skb_headroom() < dev->hard_header_len. It does, and it is not good. A good example is the ipip tunnel. It sets: dev->hard_header_len = sizeof(struct iphdr) + LL_MAX_HEADER because it does not know what device will be used. It is lots of space and most likely it will not use it. So, the initial allocation reserves lots of space, but all the rest of the stack should not reallocate; the tunnel will take care of this itself. Alexey
Re: Netchannles: first stage has been completed. Further ideas.
Hello! On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote: > Of course, it means rewriting all the userspace tools, documentation, and creating a complete new infrastructure for connection tracking and NAT, but if that's what's required, then so be it. That's what I love to hear. Not a joke. :-) Could I only suggest not to relate this to netchannels? :-) In the past we used to call this thing a (grand-unified) flow cache. > I don't think they are equivalent. In channels, I understand this. Actually, it was what I said in the next paragraph, which you even cited. I really do not like to repeat myself, it is nothing but idle talk, but if the questions are questioned... First, it was stated that the suggested implementation performs better, and even much better. I am asking why we see such improvement. I am absolutely not satisfied with the statement It is better. Period. From all that I see, this particular implementation does not implement the optimizations suggested by VJ; it implements only the things which are not supposed to affect performance, or to affect it negatively. Idle talk? I am sure that if that improvement happened not due to a severe protocol violation, we can easily fix the existing stack. > userspace), no dequeue lock is required. And that was a part of the second question. I do not see how single threaded TCP is possible. In the receiver path it has to ack within quite strict time bounds, to delack etc.; in the sender path it has to slow start; I am not even speaking of slow path things: retransmit, probing window, lingering without process context etc. It looks like VJ implies the protocol must be changed. We can't, we mustn't. After we deidealize this idealization and recognize that some slow path should exist and some part of this slow path has to be executed with higher priority than the fast one, where do we arrive? Is it not exactly what we have right now? Clean fast path, separate slow path. Not good enough? Where? Let's find and fix this.
Alexey
Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space
Hello! > ip_output() ignores dev->hard_header_len ip_output() worries about the space which it needs. If some place needs more, it is its problem to check. By the moment it is used, hard_header_len can even change. It can be applied, but it does not change the fact that those places which fail now must check the condition as well. > A similar problem may be present in psched_mtu(). Nothing similar. The result of psched_mtu() is compared with skb->len as it is seen by qdiscs. If hard_header is NULL, it sees skbs without header. Alexey
Re: Netchannles: first stage has been completed. Further ideas.
Hello! > kernel thread takes 100% cpu (with preemption Preemption, you tell... :-) I begged you to spend 1 minute of your time to press ^Z. Did you? Alexey
Re: [PATCH] ip multicast route bug fix
Hello! > I like this. However, since the cloned skb is either discarded in case of error, or queued in which case the caller discards its reference right away, wouldn't it be simpler to just do this? Well, if we wanted just to cheat those checking tools, it is nice. But if we want clarity, it does not look so sweet. I _loved_ the tricks with skb refcnt a lot (look into netlink.c :-)) until the day when pskb_expand_head() and Co appeared. It is so much more handy, and it so hates refcnt. I would seriously review all the uses of refcnt, leave it internal to skbuff.c and use it externally only in a few places where it is obviously safe (MSG_PEEK, mostly). Alexey
Re: [PATCH] ip multicast route bug fix
Hello! > Code was reusing an skb which could lead to use after free or double free. No, this does not help. The bug is not here. I was so ashamed of this that I could not touch the thing. :-) It startled me a lot: how is it possible that the thing was in production for several years and such a bad bug was never noticed? Now it is clear. skbs leaked sometimes before BK changeset 1.889.26.53, subj: [IPV4]: Fix skb leak in inet_rtm_getroute. But after this fix, which introduced a new bug (goto out_free in the enclosed patch), the bug showed on the surface. Please, review this. - ipmr_cache_unresolved() does not free the skb. The caller does. - the goto out_free in route.c, in the case when the skb is enqueued, is returned to goto out. - some cosmetic cleanup in rt_fill_info() to make it more readable.

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9ccacf5..1788c04 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -640,8 +640,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 	if (atomic_read(&cache_resolve_queue_len) >= 10 ||
 	    (c = ipmr_cache_alloc_unres()) == NULL) {
 		spin_unlock_bh(&mfc_unres_lock);
-
-		kfree_skb(skb);
 		return -ENOBUFS;
 	}
@@ -662,7 +660,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 			spin_unlock_bh(&mfc_unres_lock);
 
 			kmem_cache_free(mrt_cachep, c);
-			kfree_skb(skb);
 			return err;
 		}
@@ -677,7 +674,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 		 *	See if we can append the packet
 		 */
 		if (c->mfc_un.unres.unresolved.qlen > 3) {
-			kfree_skb(skb);
 			err = -ENOBUFS;
 		} else {
 			skb_queue_tail(&c->mfc_un.unres.unresolved, skb);
@@ -1375,6 +1371,7 @@ int ip_mr_input(struct sk_buff *skb)
 	 */
 	if (cache == NULL) {
 		int vif;
+		int err;
 
 		if (local) {
 			struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
@@ -1387,15 +1384,13 @@ int ip_mr_input(struct sk_buff *skb)
 		}
 
 		vif = ipmr_find_vif(skb->dev);
-		if (vif >= 0) {
-			int err = ipmr_cache_unresolved(vif, skb);
-			read_unlock(&mrt_lock);
-
-			return err;
-		}
+		err = -ENODEV;
+		if (vif >= 0)
+			err = ipmr_cache_unresolved(vif, skb);
 		read_unlock(&mrt_lock);
-		kfree_skb(skb);
-		return -ENODEV;
+		if (err)
+			kfree_skb(skb);
+		return err;
 	}
 
 	ip_mr_forward(skb, cache, local);
@@ -1568,6 +1563,17 @@ rtattr_failure:
 	return -EMSGSIZE;
 }
 
+/**
+ * ipmr_get_route - return mrouting information or queue for resolution
+ * @skb - netlink skb with partially complete rtnetlink information
+ * @rtm - pointer to head of rtmsg
+ * @nowait - if mroute is not in cache, do not resolve, fail with EAGAIN
+ *
+ * Return:
+ *	1 - if mrouting information is successfully recorded in skb
+ *	0 - if information is not available, but skb is queued for resolution
+ *	<0 - error
+ */
 int ipmr_get_route(struct sk_buff *skb, struct rtmsg *rtm, int nowait)
 {
 	int err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 2dc6dbb..62f53e9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2703,16 +2703,12 @@ static int rt_fill_info(struct sk_buff *
 		if (MULTICAST(dst) && !LOCAL_MCAST(dst) &&
 		    ipv4_devconf.mc_forwarding) {
 			int err = ipmr_get_route(skb, r, nowait);
-			if (err <= 0) {
-				if (!nowait) {
-					if (err == 0)
-						return 0;
+			if (err == 0)
+				return 0;
+			if (err < 0) {
+				if (err == -EMSGSIZE)
 					goto nlmsg_failure;
-			} else {
-				if (err == -EMSGSIZE)
-					goto nlmsg_failure;
-				((struct rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err;
-			}
+				((struct rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err;
 			}
 		} else
 #endif
@@ -2794,7 +2790,7 @@ int inet_rtm_getroute(struct sk_buff *in
 	err = rt_fill_info(skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (!err)
-		goto out_free;
+		goto out;
 	if (err < 0) {
 		err = -EMSGSIZE;
 		goto out_free;
Re: [PATCH] ip multicast route bug fix
Hello! > checking tools because the skb lifetime depends on the return value. Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? But the skb is not always freed in any case. Normally it is submitted to netlink_unicast(). It is freed only in the error case. So, following this logic, should not you clone the skb to feed to netlink_unicast()? Alexey
Re: [PATCH] ip multicast route bug fix
Hello! > Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? I am sorry, I misunderstood you. I absolutely agree. It is much better; the variant which I suggested is a good sample of bad programming. :-) Alexey
Re: [PATCH] ip multicast route bug fix
Hello! > Wouldn't it be better to have a consistent interface (skb always freed), and clone the skb if needed for deferred processing? I think you mean this. Note, it is a real skb_clone(), not alloc_skb(). The enqueued skb contains the whole half-prepared netlink message plus room for the rest. It could also be skb_copy(), if we want to be puristic about mangling cloned data, but the original copy is really not going to be used. Alexey

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9ccacf5..85893ee 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1578,6 +1578,7 @@ int ipmr_get_route(struct sk_buff *skb,
 	cache = ipmr_cache_find(rt->rt_src, rt->rt_dst);
 	if (cache == NULL) {
+		struct sk_buff *skb2;
 		struct net_device *dev;
 		int vif;
@@ -1591,12 +1592,18 @@ int ipmr_get_route(struct sk_buff *skb,
 			read_unlock(&mrt_lock);
 			return -ENODEV;
 		}
-		skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
-		skb->nh.iph->ihl = sizeof(struct iphdr) >> 2;
-		skb->nh.iph->saddr = rt->rt_src;
-		skb->nh.iph->daddr = rt->rt_dst;
-		skb->nh.iph->version = 0;
-		err = ipmr_cache_unresolved(vif, skb);
+		skb2 = skb_clone(skb, GFP_ATOMIC);
+		if (!skb2) {
+			read_unlock(&mrt_lock);
+			return -ENOMEM;
+		}
+
+		skb2->nh.raw = skb_push(skb2, sizeof(struct iphdr));
+		skb2->nh.iph->ihl = sizeof(struct iphdr) >> 2;
+		skb2->nh.iph->saddr = rt->rt_src;
+		skb2->nh.iph->daddr = rt->rt_dst;
+		skb2->nh.iph->version = 0;
+		err = ipmr_cache_unresolved(vif, skb2);
 		read_unlock(&mrt_lock);
 		return err;
 	}
Re: Netchannles: first stage has been completed. Further ideas.
Hello! > Also, there is some code for refcnt's in it that looks wrong. Yes, it is disgusting. rcu does not allow increasing the socket refcnt in the lookup routine. Ben's version looks cleaner here, it does not touch refcnt in rcu lookups. But it is dubious too:

do_time_wait:
+		sock_hold(sk);

is obviously in violation of the rule. Probably, the rcu lookup should do something like:

	if (!atomic_inc_not_zero(&sk->sk_refcnt))
		pretend_it_is_not_found;

It is clear Ben did not look into the IBM patch, because one known place of trouble is missed: when a socket moves from established to timewait, the timewait bucket must be inserted before the established socket is removed. Alexey
Re: Netchannles: first stage has been completed. Further ideas.
Hello! Small question first: > userspace, but also there are big problems, like one syscall per ack, I do not see redundant syscalls. Is it not expected to send ACKs only after receiving data, as you said? What is the problem? Now boring things: There is no BH protocol processing at all, so there is no need to protect against someone who will add data while you are processing your own chunk. The essential part of the socket user lock is the same mutex. Backlog is actually not a protection, but a thing equivalent to a netchannel. The difference is only that it tries to process something immediately, when it is safe. You can omit this and push everything to the backlog (= netchannel), which is processed only by syscalls, if you do not care about latency. > How many hacks just to be a bit closer to userspace processing, implemented in netchannels! Moving processing closer to userspace is not a goal, it is a tool. Which is sometimes useful, but generally quite useless. F.e. in your tests it should not affect performance at all, the end user is just a sink. What's about prequeueing, it is a bright example. Guess why is it useful? What does it save? Nothing, like netchannel. The answer is: it is just a tool to generate coarse ACKs in a controlled manner without essential violation of the protocol. (Well, and to combine checksumming and copy, if you do not like how your card does this.) > If userspace is scheduled away for too much time, it is bloody wrong to ack the data, that is impossible to read due to the fact that system is being busy. It is just postponing the work from one end to another - ack now and stop when queue is full, or postpone the ack generation when segment is really being read. ... when you get all the segments nicely aligned, blah-blah-blah. If you do not care about losses-congestion-delays-delacks-whatever, you have a totally different protocol. Sending window feedback is only a minor part of tcp.
But even these boring tcp intrinsics are not so important; look at an ideal lossless network: think what happens f.e. during a plain file transfer to your notebook. You get 110MB/sec for a few seconds, then writeback is fired and the disk io subsystem discovers that the disk holds only 50MB/sec. If you are unlucky and some other application starts, the disk is so congested that it will take lots of seconds to make progress with io. For this time the other side will retransmit, because the poor thing thought rtt is 100 usecs, and you will never return to 50MB/sec. You have to _CLOSE_ the window in the case of a long delay, rather than forget to ack. See the difference? It is just because the actual end user is still far far away. And this happens all the time: when you relay the results to another application via pipe, when... Well, the only case where the real end user is the user of the netchannel is when you receive to a sink. But I said not this. I said it looks _worse_. A bit, but worse. At least for 80 bytes it does not matter at all. Hello-o, do you hear me? :-) I am asking: it looks not much better, but a bit worse, so what is the real reason for better performance, unless it is due to castration of the protocol? Simplify the protocol, move all the processing (even memory copies) to softirq, leave to user space only feeding pages to copy, and you will have unbeatable performance. Been there, done that, not with TCP of course, but if you do not care about losses and ACK clocking and send an ACK once per window, I do not see how it can spoil the situation. And actually I never understood nanooptimisation behind more serious problems (i.e. one cache line vs. 50MB/sec speed). You deal with 80 byte packets, to all that I understand. If you lose one cacheline per packet, it is a big problem. All that we can change is protocol overhead. Handling the data part is invariant anyway. You are scared of the complexity of tcp, but you obviously forget one thing: the cpu is fast.
The code can look very complicated: some crazy hash functions, damn hairy protocol processing; but if you take care about caches etc., all this is dominated by the first look into the packet in eth_type_trans() or ip_rcv(). BTW, when you deal with normal data flow, the cache can be not dirtied by data at all, it can be bypassed. > works perfectly ok, but it is possible to have better performance by changing architecture, and it was done. It is exactly the point of trouble. From all that I see and you said, better performance is obtained not due to the change of architecture, but despite it. A proof that we can perform better by changing the protocol is not required, it is kinda obvious. The question is how to make the existing protocol perform better. I have no idea why your tcp performs better. It can be everything: absence of slow start, more coarse ACKs, whatever. I believe you were careful to check those reasons and to do a fair comparison, but then the only guess remaining is that you saved lots of i-cache by getting rid of a long code path. And none of those guesses can be attributed to
Re: Netchannles: first stage has been completed. Further ideas.
Hello! > Moving protocol (no matter if it is TCP or not) closer to user allows naturally control the dataflow - when user can read that data (and _this_ is the main goal), user acks, when it can not - it does not generate ack. In theory From all that I remember, in theory the absence of feedback leads to loss of control. The same is in practice, unfortunately. You must say that the window is closed, otherwise the sender is totally confused. > There is one problem in your logic. RTT will not be so small, since acks are not sent when user does not read data. It is arithmetics: rtt = window/rate. And rto stays rounded up to 200 msec, unless you have messed the connection so hard that it is not alive. Check. > Simplify protocol, move all the processing (even memory copies) to softirq, leave to user space only feeding pages to copy and you will have unbeatable performance. > Been there, done that, not with TCP of course, but if you do not care about losses and ACK clocking and send an ACK once per window, I do not see how it can spoil the situation. Do you live in a perfect world, where the user does not want what was requested? All the time I am trying to bring your attention to the fact that you read to a sink. :-) At least, read to disk to move it a little closer to reality. Or at least do it from a terminal and press ^Z sometimes. > You deal with 80 byte packets, to all that I understand. If you lose one cacheline per packet, it is a big problem. > So actual netchannels speed is even better? :) atcp. If you get rid of netchannels and leave only atcp, the speed will be at least not worse. No doubts. > tell me, why we should keep (enabled) that redundant functionality? Because it can work better in some other places, and that is correct, but why should it be enabled then in the majority of cases? Did I not tell you something like that? :-) Optimize the real thing, even trying to detect the situations when retransmissions are redundant, and eliminate the code. Let's draw the line. ... That was my opinion on the topic.
> It looks like neither you nor me will change our point of view about that right now :) I agree. :) Alexey