Re: [PATCH] net: Fragment large datagrams even when IP_HDRINCL is set.

2016-07-08 Thread Alexey Kuznetsov
Hello!

I can tell why it has not been done initially.

The main problem was IP options, which can be present in a raw packet.
They have to be properly fragmented, and some options must be deleted
from fragments. Not that it is too complicated; it is just boring, ugly
and inconsistent with the IP_HDRINCL logic.

So, it was done in a logically consistent way: you ordered IP_HDRINCL?
Then please deal with fragments yourself.

Alexey


Re: [PATCH net-2.6 0/3]: Three TCP fixes

2007-12-05 Thread Alexey Kuznetsov
Hello!

> My theory is that it could relate to tcp_cwnd_restart and
> tcp_cwnd_application_limited using it and the others are just then
> accidently changed as well. Perhaps I'll have to dig once again to
> changelog history to see if there's some clue (unless Alexey shed
> some light to this)...

Yes, it is RFC2861. There is some rationale in preamble to section 3.
But even for this case it can be set to full cwnd instead.

Also, I used it to calculate prior_ssthresh to restore ssthresh
in the case when we detect a false retransmission. It is not an accident.
I really mean that tcp_current_ssthresh() is a correct estimate
of the current ssthresh, and it should be the same as in the restart
situation. Of course, the rationale of RFC2861 does not apply
to this case _directly_. But it came at the end of a chain of syllogisms.

Look:

If ssthresh is not infinite and cwnd > ssthresh, we are already
in congestion avoidance mode, and we can set prior_ssthresh to any value
in the range ssthresh...cwnd without significant changes in behaviour
wrt congestion avoidance.

ssthresh would be an obvious underestimate.

cwnd can be an overestimate, because the path is surely congested
(cwnd > ssthresh) and cwnd is not a measure of current capacity.

Behavioural changes happen when the final cwnd is too low, because the pipe
was drained during recovery. We allow slow start up to ssthresh,
but the value of ssthresh should be the same as the one we use in the
restart case.

Alexey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/3] [UDP6]: Counter increment on BH mode

2007-12-03 Thread Alexey Kuznetsov
On Mon, Dec 03, 2007 at 10:39:35PM +1100, Herbert Xu wrote:
> So we need to fix this, and whatever the fix is will probably render
> the BH/USER distinction obsolete.

Hmm, I would think the opposite. USER (or generic) is the expensive
variant, BH is the lite one. No?

Alexey


Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)

2007-11-19 Thread Alexey Kuznetsov
Hello!

> Is there a reason that the target hardware address isn't the target
> hardware address?

It is bound only to the fact that Linux uses the protocol address
of the machine which responds. It would be highly confusing
(more than confusing :-)) if we used our protocol address and the hardware
address of the requestor.

But if you use zero protocol address as source, you really can use
any hw address.

> The dhcp clients I examined, and the implementation of the arpcheck
> that I use will compare the target hardware field of the arp-reply and
> match it against its own mac, to verify the reply. And this fails with
> the current implementation in the kernel.

1. Do not do this. Mainly because you already know that this does not work
   with Linux. :-) Logically, the target hw address in an arp reply is just
   nonsensical redundancy; it should not be checked or even looked at.

2. As for your suggestion, I thought about this and I am going to agree.

   The arguments which convinced me are:

   - arping still works.
   - any piece of reasonable software should work.
   - if Windows understands DaD (is it really true? I cannot believe it)
     and it is unhappy about our response and does not block use
     of a duplicate address only due to this, we _must_ accommodate ASAP.
   - if we do, we have to use a 0 protocol address, no choice.

Alexey


Re: [PATCH] net/ipv4/arp.c: Fix arp reply when sender ip 0 (was: Strange behavior in arp probe reply, bug or feature?)

2007-11-15 Thread Alexey Kuznetsov
Hello!

> Send a correct arp reply instead of one with sender ip and sender
> hardware adress in target fields.

I do not see anything more legal in setting target address to 0.


Actually, the semantics of the target address in an ARP reply are ambiguous.
If it is a reply to some real request, it is set to the address of the
requestor, and the protocol requires the recipient of the arp reply to test
that the address matches its own address before creating a new entry
triggered by an unsolicited arp reply. That's all.

In the case of duplicate address detection, the requestor does not have
any address, so it is absolutely not essential what we use as the target
address. The only place which could depend on this is the tool which
tests for a duplicate address. At least arping, written by me, should
work with either variant.

So, please, could you explain what made you think that the use of 0
is better?

Alexey


Re: [2.6 patch] remove Documentation/networking/routing.txt

2007-11-05 Thread Alexey Kuznetsov
Hello!

> This file is so outdated that I can't see any value in keeping it.

Absolutely agree.

Alexey


Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address

2007-10-24 Thread Alexey Kuznetsov
Hello!

> I was able to set a nbma gre tunnel, add routes to it and it worked
> perfectly ok.
>
> Link-level next hop worked:
> ip route add route via link-level-address dev tunnel-dev onlink

This can work if you use gre0. By plain luck it has all-zero dev_addr.

It will break on nbma devices set with:

ip tunnel add XXX mode gre local a.b.c.d [key whatever] ...

Alexey


Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address

2007-10-23 Thread Alexey Kuznetsov
Hello!

> When GRE tunnel is in NBMA mode, this patch allows an application to use
> a PF_PACKET socket to:
> - send a packet to specific NBMA address with sendto()
> - use recvfrom() to receive packet and check which NBMA address it came from
>
> This is required to implement properly NHRP over GRE tunnel.

Ack. This is a good idea.

Frankly, I was sure ip_gre had worked this way all these years.
I do not remember any reason why it was crippled.

The only dubious case is when the next hop is set using routing tables.
But the code in ipgre_tunnel_xmit() is ready to accept this situation:
it checks for a zero destination address and fixes it when it is able to.

Alexey


Re: [PATCH RESEND] ip_gre: sendto/recvfrom NBMA address

2007-10-23 Thread Alexey Kuznetsov
Hello!

Me wrote:

> Ack. This is good idea.
>
> Frankly, I was sure ip_gre worked in this way all these years.
> I do not remember any reasons why it was crippled.
>
> The only dubious case is when next hop is set using routing tables.
> But code in ipgre_tunnel_xmit() is ready to accept this situation,
> it checks for zero destination address and fixes it when it is able to.

Nevertheless, it does not work.

The reason is that NOARP arp entries on a device with an initialized
hard_header are initialized not to all zeros, but to dev->dev_addr.
So, the nexthop from the routing tables is ignored and all gre packets
are lost in loopback. Not good.

The problem can be ignored. I am not even sure that anyone uses
this feature. Actually, it was not recommended in the documentation.

Alternatively, arp.c can be changed to generate 0 addresses instead
of dev->dev_addr. Normally that is equally good, but I am not sure
about possible side effects.

Any other thoughts?

Alexey


Re: [PATCH 5/10] [NET]: Avoid unnecessary cloning for ingress filtering

2007-10-15 Thread Alexey Kuznetsov
Hello!

> If it is causing trouble, then one idea would be to move the resetting
> to a wrapper function which calls clone first and then resets the other
> fields. All actions currently cloning would need to be mod-ed to use
> that call.

I see not so many places inside net/sched/act* where skb_clone is used.
Exactly one in act_mirred.c, to be more exact. Did I miss something?

If you mean that place, your suggestion is correct and everything is OK.

But if there is at least one place where an external skb_clone()
should do this... then this cannot be correct. Please tell me that it does
not matter what some random skb_clone() does with tc_verd. :-)

Alexey


Re: SFQ qdisc crashes with limit of 2 packets

2007-09-21 Thread Alexey Kuznetsov
Hello!

Remove artificial limitation for sfq queue limit.

This is a followup to Patrick's patch. A little optimization to the enqueue
routine allows us to remove the artificial limitation on queue length.

Plus, testing showed that the hash function used by SFQ is bad, or even
worse. It does not even sweep the whole range of hash values.
Switched to Jenkins' hash.


Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]


diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 3a23e30..b542c87 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -19,6 +19,7 @@
 #include <linux/init.h>
 #include <linux/ipv6.h>
 #include <linux/skbuff.h>
+#include <linux/jhash.h>
 #include <net/ip.h>
 #include <net/netlink.h>
 #include <net/pkt_sched.h>
@@ -95,7 +96,7 @@ struct sfq_sched_data

 /* Variables */
 	struct timer_list perturb_timer;
-	int		perturbation;
+	u32		perturbation;
 	sfq_index	tail;		/* Index of current slot in round */
 	sfq_index	max_depth;	/* Maximal depth */

@@ -109,12 +110,7 @@ struct sfq_sched_data

 static __inline__ unsigned sfq_fold_hash(struct sfq_sched_data *q, u32 h, u32 h1)
 {
-	int pert = q->perturbation;
-
-	/* Have we any rotation primitives? If not, WHY? */
-	h ^= (h1 << pert) ^ (h1 >> (0x1F - pert));
-	h ^= h1 >> 10;
-	return h & 0x3FF;
+	return jhash_2words(h, h1, q->perturbation) & (SFQ_HASH_DIVISOR - 1);
 }

 static unsigned sfq_hash(struct sfq_sched_data *q, struct sk_buff *skb)
@@ -256,6 +252,13 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 		q->ht[hash] = x = q->dep[SFQ_DEPTH].next;
 		q->hash[x] = hash;
 	}
+	/* If the selected queue already has length q->limit, this means that
+	 * all the other queues are empty and that we do simple tail drop,
+	 * i.e. drop _this_ packet.
+	 */
+	if (q->qs[x].qlen >= q->limit)
+		return qdisc_drop(skb, sch);
+
 	sch->qstats.backlog += skb->len;
 	__skb_queue_tail(&q->qs[x], skb);
 	sfq_inc(q, x);
@@ -294,6 +297,19 @@ sfq_requeue(struct sk_buff *skb, struct Qdisc* sch)
 	}
 	sch->qstats.backlog += skb->len;
 	__skb_queue_head(&q->qs[x], skb);
+	/* If the selected queue has length q->limit+1, this means that
+	 * all the other queues are empty and we do simple tail drop.
+	 * This packet is still requeued at the head of the queue, the
+	 * tail packet is dropped.
+	 */
+	if (q->qs[x].qlen > q->limit) {
+		skb = q->qs[x].prev;
+		__skb_unlink(skb, &q->qs[x]);
+		sch->qstats.drops++;
+		sch->qstats.backlog -= skb->len;
+		kfree_skb(skb);
+		return NET_XMIT_CN;
+	}
 	sfq_inc(q, x);
 	if (q->qs[x].qlen == 1) {		/* The flow is new */
 		if (q->tail == SFQ_DEPTH) {	/* It is the first flow */
@@ -370,12 +386,10 @@ static void sfq_perturbation(unsigned long arg)
 	struct Qdisc *sch = (struct Qdisc*)arg;
 	struct sfq_sched_data *q = qdisc_priv(sch);

-	q->perturbation = net_random()&0x1F;
+	get_random_bytes(&q->perturbation, 4);

-	if (q->perturb_period) {
-		q->perturb_timer.expires = jiffies + q->perturb_period;
-		add_timer(&q->perturb_timer);
-	}
+	if (q->perturb_period)
+		mod_timer(&q->perturb_timer, jiffies + q->perturb_period);
 }

 static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
@@ -391,7 +405,7 @@ static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
 	q->quantum = ctl->quantum ? : psched_mtu(sch->dev);
 	q->perturb_period = ctl->perturb_period*HZ;
 	if (ctl->limit)
-		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 2);
+		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 1);

 	qlen = sch->q.qlen;
 	while (sch->q.qlen > q->limit)
@@ -400,8 +414,8 @@ static int sfq_change(struct Qdisc *sch, struct rtattr *opt)

 	del_timer(&q->perturb_timer);
 	if (q->perturb_period) {
-		q->perturb_timer.expires = jiffies + q->perturb_period;
-		add_timer(&q->perturb_timer);
+		mod_timer(&q->perturb_timer, jiffies + q->perturb_period);
+		get_random_bytes(&q->perturbation, 4);
 	}
 	sch_tree_unlock(sch);
 	return 0;
@@ -423,12 +437,13 @@ static int sfq_init(struct Qdisc *sch, struct rtattr *opt)
 		q->dep[i+SFQ_DEPTH].next = i+SFQ_DEPTH;
 		q->dep[i+SFQ_DEPTH].prev = i+SFQ_DEPTH;
 	}
-	q->limit = SFQ_DEPTH - 2;
+	q->limit = SFQ_DEPTH - 1;
 	q->max_depth = 0;
 	q->tail = SFQ_DEPTH;
 	if (opt == NULL) {
 		q->quantum = psched_mtu(sch->dev);
 		q->perturb_period = 0;
+		get_random_bytes(&q->perturbation, 4);
 	} else {
 		int err = sfq_change(sch, opt);
 		if (err)

Re: SFQ qdisc crashes with limit of 2 packets

2007-09-19 Thread Alexey Kuznetsov
Hello!

> OK the off-by-one prevents an out-of-bounds array access,

Yes, this is not off-by-one (off-by-two, to be more exact :-)).

Maximal queue length is really limited by SFQ_DEPTH-2, because:

1. SFQ keeps the list of queue lengths in an array of length SFQ_DEPTH.
   This means the length of a queue must be in the range 0..SFQ_DEPTH-1.

2. SFQ enqueues the incoming packet first, and drops something after this.
   This means it always needs a free slot in the queue. So, SFQ_DEPTH-2.


> CCed Alexey just to be safe, but I think the patch should be fine.

Yes. But what about limit == 1? tc prohibits this case, but it is still
not nice to have the bug in the kernel. Plus, the code remains creepy; the check

+   if (++sch->q.qlen < q->limit) {

still looks like a scary off-by-one. I would go all the way: replace it
with the natural:

if (++sch->q.qlen <= q->limit) {

and cap q->limit at SFQ_DEPTH-2. Patch enclosed.


It seems easy to relax the limit to SFQ_DEPTH-1; item #2 is an artificial
limitation. If the current queue already has SFQ_DEPTH-1 packets, we can
drop from the tail of this queue and skip the update of all the state. It is
even an optimization for the case when we have a single flow. I am not 100%
sure about this; I'll try and report.



diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 9579573..cbf8089 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -270,7 +270,7 @@ sfq_enqueue(struct sk_buff *skb, struct Qdisc* sch)
 			q->tail = x;
 		}
 	}
-	if (++sch->q.qlen < q->limit-1) {
+	if (++sch->q.qlen <= q->limit) {
 		sch->bstats.bytes += skb->len;
 		sch->bstats.packets++;
 		return 0;
@@ -306,7 +306,7 @@ sfq_requeue(struct sk_buff *skb, struct Qdisc* sch)
 			q->tail = x;
 		}
 	}
-	if (++sch->q.qlen < q->limit - 1) {
+	if (++sch->q.qlen <= q->limit) {
 		sch->qstats.requeues++;
 		return 0;
 	}
@@ -391,10 +391,10 @@ static int sfq_change(struct Qdisc *sch, struct rtattr *opt)
 	q->quantum = ctl->quantum ? : psched_mtu(sch->dev);
 	q->perturb_period = ctl->perturb_period*HZ;
 	if (ctl->limit)
-		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH);
+		q->limit = min_t(u32, ctl->limit, SFQ_DEPTH - 2);

 	qlen = sch->q.qlen;
-	while (sch->q.qlen >= q->limit-1)
+	while (sch->q.qlen > q->limit)
 		sfq_drop(sch);
 	qdisc_tree_decrease_qlen(sch, qlen - sch->q.qlen);

@@ -423,7 +423,7 @@ static int sfq_init(struct Qdisc *sch, struct rtattr *opt)
 		q->dep[i+SFQ_DEPTH].next = i+SFQ_DEPTH;
 		q->dep[i+SFQ_DEPTH].prev = i+SFQ_DEPTH;
 	}
-	q->limit = SFQ_DEPTH;
+	q->limit = SFQ_DEPTH - 2;
 	q->max_depth = 0;
 	q->tail = SFQ_DEPTH;
 	if (opt == NULL) {


Re: Problem with implementation of TCP_DEFER_ACCEPT?

2007-08-24 Thread Alexey Kuznetsov
Hello!

>> At present with TCP_DEFER_ACCEPT the kernel treats the RFC 793 handshake
>> as invalid; dropping the ACK from the client without replying so the
>> client doesn't know the server has in fact set it's internal ACKed flag.
>>
>> If the client doesn't send a packet containing data before the SYN_ACK
>> time-outs finally expire the connection will be dropped.
>
> A brought this up a long, long time ago, and I seem to remember
> Alexey Kuznetsov explained me at the time that this was intentional.

Obviously, I said something like: it is exactly what TCP_DEFER_ACCEPT does.


There is no protocol violation here; the ACK from the client is considered
lost, which is quite normal and happens all the time. The handshake is not
complete, so the server remains in SYN-RECV state and continues to retransmit
the SYN-ACK. If the client tried to cheat and is not going to send its
request, the connection will time out.

Alexey



Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Alexey Kuznetsov
Hello!

> I just suggested to
> Pavel to create only a single device per newlink operation and binding
> them later,

I see some logical inconsistency here.

Look, the second end is supposed to be in another namespace.
It will have an identity which cannot be expressed in any way in the
namespace which is allowed to create the pair: neither name nor ifindex;
nothing is shared between namespaces.

Moreover, do not forget we have two netlink spaces as well.
Events happening in one namespace are reported only inside that namespace.

Actually, the only self-consistent solution which I see right now
(sorry, I did not think that much) is to create the whole pair as
one operation; the required parameters (name of partner, identity of the
namespace) can be passed as attributes. I guess the IFLA_PARTNER approach
suggests the same thing, right?

In response to this action two replies are generated: one RTM_NEWLINK,
for one end of the device with the whole description of its partner,
is broadcast inside this namespace; another RTM_NEWLINK with the index/name
of the partner device is broadcast inside the second namespace
(and, probably, some attributes which must be hidden inside a namespace,
e.g. the identity of the main device, are suppressed).
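For what it is worth, the veth API that iproute2 eventually grew takes roughly this shape: both ends of the pair are created in a single operation, and one end is then moved into another namespace (device and namespace names below are examples, and the commands require root):

```shell
# Create both ends of the pair in one netlink operation.
ip link add veth0 type veth peer name veth1

# Create a namespace and move one end into it; the kernel renumbers
# the ifindex in the new namespace as needed.
ip netns add guest
ip link set veth1 netns guest
```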

Alexey




Re: [RFC RTNETLINK 00/09]: Netlink link creation API

2007-06-06 Thread Alexey Kuznetsov
Hello!

> Good point, I didn't think of that. Is there a version of this patch
> that already uses different namespaces so I can look at it?

Pavel does not like the idea. It looks not exactly pretty, like you said. :-)

The alternative is to create the pair in the main namespace and then move
one end to another namespace, renaming it and changing its index.
Why do I not like this? Because it makes RTM_NEWLINK just a useless step:
all its work is undone, and the real work is redone when the device moves,
with all the unprettiness moved to another place.

On the other hand, some move operation is required in any case.
Right now in openvz the problem is solved in a tricky but quite
interesting way: devices in the main namespace are assigned indices of one
parity and child devices get the other, so that when a device moves from
the main namespace to a child, openvz does not need to change the ifindex;
a conflict is impossible. Well, it is a working approach.
But it is not pretty either.


> Are network namespace completely seperated or is there some hierarchy
> with all lower namespaces visible above or something like that?

Right now they are completely separate.

It is possible to make child devices visible in the parent namespace,
like it is done for process pids: i.e. there is an abstract identity
which is seen under different names and indices in different namespaces.
Sounds cool, but it adds a lot of complexity which has no meaning
outside the context of device creation; I do not think it is worth doing.




> The identity of the main device has no meaning within a different
> namespace, but are there other reasons for hiding it?

Perhaps security. It is not a good idea to leak information
about the parent namespace to a child namespace.

Also, people will want to see emulated ethernet inside namespace
looking exactly like ethernet. No freaking additional attributes.

Alexey


Re: [PATCH] [IPV4] nl_fib_lookup: Initialise res.r before fib_res_put(res)

2007-04-26 Thread Alexey Kuznetsov
Hello!

> When CONFIG_IP_MULTIPLE_TABLES is enabled, the code in nl_fib_lookup()
> needs to initialize the res.r field before fib_res_put(&res) - unlike
> fib_lookup(), a direct call to ->tb_lookup does not set this field.

Indeed, I am sorry.

Alexey


[PATCH] infinite recursion in netlink

2007-04-25 Thread Alexey Kuznetsov
Hello!

Replies to NETLINK_FIB_LOOKUP messages were misrouted back to the kernel,
which resulted in infinite recursion and stack overflow.

The bug is present in all kernel versions since the feature appeared.

The patch also makes some minimal cleanup:

1. Return something consistent (-ENOENT) when the fib table is missing
2. Do not crash when the queue is empty (does not happen, but still)
3. Put the result of the lookup

Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]


diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index fc920f6..cac06c4 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -776,6 +776,8 @@ static void nl_fib_lookup(struct fib_res
 		       .nl_u = { .ip4_u = { .daddr = frn->fl_addr,
					    .tos = frn->fl_tos,
					    .scope = frn->fl_scope } } };
+
+	frn->err = -ENOENT;
 	if (tb) {
 		local_bh_disable();

@@ -787,6 +789,7 @@ static void nl_fib_lookup(struct fib_res
 			frn->nh_sel = res.nh_sel;
 			frn->type = res.type;
 			frn->scope = res.scope;
+			fib_res_put(&res);
 		}
 		local_bh_enable();
 	}
@@ -801,6 +804,9 @@ static void nl_fib_input(struct sock *sk
 	struct fib_table *tb;

 	skb = skb_dequeue(&sk->sk_receive_queue);
+	if (skb == NULL)
+		return;
+
 	nlh = (struct nlmsghdr *)skb->data;
 	if (skb->len < NLMSG_SPACE(0) || skb->len < nlh->nlmsg_len ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*frn))) {
@@ -813,7 +819,7 @@ static void nl_fib_input(struct sock *sk

 	nl_fib_lookup(frn, tb);

-	pid = nlh->nlmsg_pid;		/* pid of sending process */
+	pid = NETLINK_CB(skb).pid;	/* pid of sending process */
 	NETLINK_CB(skb).pid = 0;	/* from kernel */
 	NETLINK_CB(skb).dst_group = 0;	/* unicast */
 	netlink_unicast(sk, skb, pid, MSG_DONTWAIT);


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

> Well I don't think the loopback device is currently but as soon
> as we get network namespace support we will have multiple loopback
> devices and they will get unregistered when we remove the network
> namespace.

There is no logical difference. At the moment when namespace is gone
there is nobody who can hold unrevokable references to this loopback.

Alexey


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

> Does this look sane (untested)?

It does not, unfortunately.

Instead of a regular crash in infiniband you will get numerous
random NULL pointer dereferences, both due to dst->neighbour
and due to dst->dev.

Alexey


Re: [ofa-general] Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

> I think the thing to do is to just leave the loopback references
> in place, try to unregister the per-namespace loopback device,
> and that will safely wait for all the references to go away.

Yes, it is exactly how it works in openvz. All the sockets are killed,
queues are cleared, nobody holds references and virtual loopback
can be unregistered just like another device.

Alexey


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

> If a device driver sets neigh_destructor in neigh_params, this could
> get called after the device has been unregistered and the driver module
> removed.

It is the same problem: if dst->neighbour holds the neighbour, it should
not hold the device. parms->dev is not supposed to be used after
neigh_parms_release(). E.g. set parms->dev to NULL to catch bad references.


Are you searching for a way to find the real infiniband device in
ipoib_neigh_destructor()? I guess you will not be able to.
The problem is logical: if the destructor needs the device, the neighbour
entry _somehow_ has to hold a reference to the device (via neigh->dev,
neigh->parms, whatever). Hence, if we hold a neighbour entry, unregister
cannot complete. Therefore, the destructor cannot refer to the device.
Q.E.D. :-)

Seems releasing dst->neighbour is inevitable.

Alexey


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

> infiniband sets parms->neigh_destructor, and I search for a way to prevent
> this destructor from being called after the module has been unloaded.
> Ideas?

It must be called in any case, to update/release internal ipoib structures.

The idea is to move the call of parms->neigh_destructor from the neighbour
destructor to the moment when the entry is unhashed, right after n->dead
is set.

infiniband is the only user (atm clip uses it too, but that use is obviously
dummy), so nobody will be harmed.

But ipoib will have to check the validity of skb->dst->neighbour before
attempting to reinitialize private data on a dead (n->dead != 0) neighbour.

Alexey


Re: dst_ifdown breaks infiniband?

2007-03-19 Thread Alexey Kuznetsov
Hello!

> This might work. Could you post a patch to better show what you mean to do?

Here it is.

->neigh_destructor() is killed (it was not used) and replaced with
->neigh_cleanup(), which is called when a neighbour entry goes to the dead
state. At this point everything is still valid: neigh->dev, neigh->parms etc.

The device should guarantee that dead neighbour entries (neigh->dead != 0)
do not get their private part initialized, otherwise nobody will clean it up.

I think this is enough for ipoib, which is the only user of this thing.
Initialization of the private part of neighbour entries happens in the ipoib
start_xmit routine, which is not reached when the device is down.
But it would be better to add an explicit test for neigh->dead
in any case.


diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index f9dbc6f..2b5c297 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -814,7 +814,7 @@ static void ipoib_set_mcast_list(struct 
 	queue_work(ipoib_workqueue, &priv->restart_task);
 }

-static void ipoib_neigh_destructor(struct neighbour *n)
+static void ipoib_neigh_cleanup(struct neighbour *n)
 {
 	struct ipoib_neigh *neigh;
 	struct ipoib_dev_priv *priv = netdev_priv(n->dev);
@@ -822,7 +822,7 @@ static void ipoib_neigh_destructor(struc
 	struct ipoib_ah *ah = NULL;

 	ipoib_dbg(priv,
-		  "neigh_destructor for %06x " IPOIB_GID_FMT "\n",
+		  "neigh_cleanup for %06x " IPOIB_GID_FMT "\n",
 		  IPOIB_QPN(n->ha),
 		  IPOIB_GID_RAW_ARG(n->ha + 4));

@@ -874,7 +874,7 @@ void ipoib_neigh_free(struct net_device 

 static int ipoib_neigh_setup_dev(struct net_device *dev, struct neigh_parms *parms)
 {
-	parms->neigh_destructor = ipoib_neigh_destructor;
+	parms->neigh_cleanup = ipoib_neigh_cleanup;

 	return 0;
 }
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 3725b93..ad7fe11 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -36,7 +36,7 @@ struct neigh_parms
 	struct net_device *dev;
 	struct neigh_parms *next;
 	int	(*neigh_setup)(struct neighbour *);
-	void	(*neigh_destructor)(struct neighbour *);
+	void	(*neigh_cleanup)(struct neighbour *);
 	struct neigh_table *tbl;

 	void	*sysctl_table;
diff --git a/net/atm/clip.c b/net/atm/clip.c
index ebb5d0c..8c38258 100644
--- a/net/atm/clip.c
+++ b/net/atm/clip.c
@@ -261,14 +261,6 @@ static void clip_pop(struct atm_vcc *vcc
 	spin_unlock_irqrestore(&PRIV(dev)->xoff_lock, flags);
 }

-static void clip_neigh_destroy(struct neighbour *neigh)
-{
-	DPRINTK("clip_neigh_destroy (neigh %p)\n", neigh);
-	if (NEIGH2ENTRY(neigh)->vccs)
-		printk(KERN_CRIT "clip_neigh_destroy: vccs != NULL !!!\n");
-	NEIGH2ENTRY(neigh)->vccs = (void *) NEIGHBOR_DEAD;
-}
-
 static void clip_neigh_solicit(struct neighbour *neigh, struct sk_buff *skb)
 {
 	DPRINTK("clip_neigh_solicit (neigh %p, skb %p)\n", neigh, skb);
@@ -342,7 +334,6 @@ static struct neigh_table clip_tbl = {
 	/* parameters are copied from ARP ... */
 	.parms = {
 		.tbl			= &clip_tbl,
-		.neigh_destructor	= clip_neigh_destroy,
 		.base_reachable_time	= 30 * HZ,
 		.retrans_time		= 1 * HZ,
 		.gc_staletime		= 60 * HZ,
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 3183142..cfc6001 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -140,6 +140,8 @@ static int neigh_forced_gc(struct neigh_
 				n->dead = 1;
 				shrunk	= 1;
 				write_unlock(&n->lock);
+				if (n->parms->neigh_cleanup)
+					n->parms->neigh_cleanup(n);
 				neigh_release(n);
 				continue;
 			}
@@ -211,6 +213,8 @@ static void neigh_flush_dev(struct neigh
 				NEIGH_PRINTK2("neigh %p is stray.\n", n);
 			}
 			write_unlock(&n->lock);
+			if (n->parms->neigh_cleanup)
+				n->parms->neigh_cleanup(n);
 			neigh_release(n);
 		}
 	}
@@ -582,9 +586,6 @@ void neigh_destroy(struct neighbour *nei
 		kfree(hh);
 	}

-	if (neigh->parms->neigh_destructor)
-		(neigh->parms->neigh_destructor)(neigh);
-
 	skb_queue_purge(&neigh->arp_queue);

 	dev_put(neigh->dev);
@@ -675,6 +676,8 @@ static void neigh_periodic_timer(unsigne
 			*np = n->next;
 			n->dead = 1;
 			write_unlock(&n->lock);
+			if (n->parms->neigh_cleanup)
+				n->parms->neigh_cleanup(n);

Re: dst_ifdown breaks infiniband?

2007-03-18 Thread Alexey Kuznetsov
Hello!

> This is not new code, and should have triggered long time ago,
> so I am not sure how come we are triggering this only now,
> but somehow this did not lead to crashes in 2.6.20

I see. I guess this was plain luck.


 Why is neighbour-dev changed here?

It holds a reference to the device and prevents its destruction.
If the dst is held somewhere, we cannot destroy the device, and we
deadlock during unregister.

We could not invalidate dst->neighbour, but it looked safe to invalidate
neigh->dev after a quiescent state. Obviously, it is not and it never was safe.
It was supposed to be repaired asap, but this did not happen. :-(


 Can dst->neighbour be changed to point to NULL instead, and the neighbour
 released?

It should be cleared and we should be sure it will not be destroyed
before the quiescent state.
 
Seems, this is the only correct solution, but to do this we have
to audit all the places where dst->neighbour is dereferenced for
RCU safety.

Actually, it is very good you caught this eventually, the bug was
so _disgusting_ that it was forgotten all the time, waiting for
someone who will point out that the king is naked. :-)

Alexey
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dst_ifdown breaks infiniband?

2007-03-18 Thread Alexey Kuznetsov
Hello!

 Hmm. Something I don't understand: does the code
 in question not run on *each* device unregister?

It does.


 Why do I only see this under stress?

You should have some referenced destination entries to trigger the bad path.
This should happen not only under stress.
 
F.e. just try to ssh to something via this device, and unregister it.
Seems, the crash is inevitable. If you do not see a crash, I will be puzzled.

Alexey


Re: dst_ifdown breaks infiniband?

2007-03-18 Thread Alexey Kuznetsov
Hello!

  It should be cleared and we should be sure it will not be destroyed
  before quiescent state.
 
 I'm confused. didn't you say dst_ifdown is called after quiescent state?

Quiescent state should happen after dst->neighbour is invalidated.
And this implies that all the users of dst->neighbour check validity
after dereference and do not use it after the quiescent state.


 This does not sound like something that's likely to be accepted in 2.6.21, 
 right?
 
 Any simpler ideas?

Well, if the infiniband destructor really needs to take that lock... no.
Right now I do not see one.

Alexey


Re: [PATCH] Copy mac_len in skb_clone() as well

2007-03-15 Thread Alexey Kuznetsov
Hello!

 What bug triggered that helped you discover this?  Or is it
 merely from a code audit?

I asked the same question. :-)

openvz added some other fields to skbuff, and when it was found
that they are lost on clone, he tried to figure out how all this works
and looked for other examples of this kind.

As I understand, the problem can be seen only in xfrmX_tunnel_input.
If an uninitialized mac_len obtained from slab is larger than the current
headroom, it could corrupt memory.
 
Also, it looks like the fix is incomplete. copy_skb_header() also does not
copy this field. But in that case it will be initialized to 0 by alloc_skb,
and xfrmX_tunnel_input() just will not copy the mac header.

Alexey





Re: [PATCH] TCP: Replace __kfree_skb() with kfree_skb()

2007-01-26 Thread Alexey Kuznetsov
Hello!

 do you know of any place where __kfree_skb is used to free an skb
 whose ref count is greater than 1?

No.

Actually, since kfree_skb is not inline, __kfree_skb could be made static
and remaining places still using it switched to kfree_skb.








Re: [BUG] problem with BPF in PF_PACKET sockets, introduced in linux-2.6.19

2007-01-25 Thread Alexey Kuznetsov
Hello!

 So this whole idea to make run_filter() return signed integers
 and fail on negative is entirely flawed, it simply cannot work
 and retain the expected semantics which have been there forever.

Actually, it can. The return value was used only as a sign of error,
so the mistake was to return the original unsigned result cast to int.
 
An alternative fix is enclosed. To be honest, it is not better than
yours: duplication of a couple of lines of code versus passing the return
value by pointer.

Alexey


diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index da73e8a..51e5537 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -437,11 +437,13 @@ static inline int run_filter(struct sk_b
 	rcu_read_lock_bh();
 	filter = rcu_dereference(sk->sk_filter);
 	if (filter != NULL) {
-		err = sk_run_filter(skb, filter->insns, filter->len);
-		if (!err)
+		unsigned int res;
+
+		res = sk_run_filter(skb, filter->insns, filter->len);
+		if (!res)
 			err = -EPERM;
-		else if (*snaplen > err)
-			*snaplen = err;
+		else if (*snaplen > res)
+			*snaplen = res;
 	}
 	rcu_read_unlock_bh();
 


Re: RFC: consistent disable_xfrm behaviour

2006-12-04 Thread Alexey Kuznetsov
Hello!

 Alexey, do you remember what the original intent of this was?

disable_policy was supposed to skip policy checks on input.
It makes sense only on input device.

disable_xfrm was supposed to skip transformations on output.
It makes sense only on output device.

If it does not work, it was done wrong. :-)

As I see it, root of the problem is that DST_NOXFRM flag
is calculated using wrong device. out_dev should be used
in __mkroute_input(). It looks as a cut-n-paste error, the code
was taken from output path, where it is correct.


Re: RFC: consistent disable_xfrm behaviour

2006-12-04 Thread Alexey Kuznetsov
Hello!

 Here's the patch again properly signed off.

I think it is correct.

Alexey


Re: 2.6.19-rc1: Volanomark slowdown

2006-11-08 Thread Alexey Kuznetsov
Hello!

  reduced Volanomark benchmark throughput by 10%.  

The irony of it is that the java vm used to be one of the victims
of over-delayed acks.
 
I will look; there is a little chance that it is possible
to detect the situation and to stretch ACKs.

There is one little question though. If you see a visible
difference in performance, does it mean that Volanomark
used to show much better numbers compared to other OSes? :-)

Alexey


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-22 Thread Alexey Kuznetsov
Hello!

 transactions to data segments is fubar.  That issue is also why I wonder 
 about the setting of tcp_abc.

Yes, switching ABC on/off has a visible impact on the number of segments.
When ABC is off, the number of segments is almost the same as the number of
transactions. When it is on, ~1.5% are merged. But this is invisible
in the throughput/cpu-usage numbers.

The numbers:
 
1Gig link. The first column is b. "-" separates runs of netperf
in the backward direction.

Run #1. One host is slower.

old,abc=0
 new,abc=0
  new,abc=1
   old,abc=1

2   23652.00  6.31   21.11  10.665  8.924
 23622.16  6.47   21.01  10.951  8.893
  23625.05  6.21   21.01  10.512  8.891
   23725.12  6.46   20.31  10.898  8.559
-
23594.87  21.90  6.44   9.283   10.912
 23631.52  20.30  6.36   8.592   10.766
  23609.55  21.00  6.26   8.896   10.599
   23633.75  21.10  5.44   8.929   9.206

4   36349.11  8.71   31.21  9.584   8.585
 36461.37  8.65   30.81  9.492   8.449
  36723.72  8.22   31.31  8.949   8.526
   35801.24  8.58   30.51  9.589   8.521
-
35127.34  33.80  8.43   9.621   9.605
 36165.50  30.90  8.48   8.545   9.381
  36201.45  31.10  8.31   8.592   9.185
   35269.76  30.00  8.58   8.507   9.732

8   41148.23  10.39  42.30  10.101  10.281
 41270.06  11.04  31.31  10.698  7.585
  41181.56  5.66   48.61  5.496   11.803
   40372.37  9.68   56.50  9.591   13.996
-
40392.14  47.00  11.89  11.637  11.775
 40613.80  36.90  9.16   9.086   9.019
  40504.66  53.60  7.73   13.234  7.639
   40388.99  48.70  11.93  12.058  11.814

16  67952.27  16.27  43.70  9.576   6.432
 68031.40  10.56  53.70  6.206   7.894
  6.95  12.81  46.90  7.559   6.920
   67814.41  16.13  46.50  9.517   6.857
-
68031.46  51.30  11.53  7.541   6.781
 68044.57  40.70  8.48   5.982   4.986
  67808.13  39.60  15.86  5.840   9.355
   67818.32  52.90  11.51  7.801   6.791

32  90445.09  15.41  99.90  6.817   11.045
 90210.34  16.11  100.00 7.143   11.085
  90221.84  17.31  98.90  7.676   10.962
   90712.78  18.41  99.40  8.120   10.958
-
89155.51  99.90  12.89  11.205  5.782
 90058.54  99.90  16.16  11.093  7.179
  90092.31  98.60  15.41  10.944  6.840
   88688.96  99.00  17.59  11.163  7.933

64  89983.76  13.66  100.00 6.071   11.113
 90504.24  17.54  100.00 7.750   11.049
  92043.36  17.44  99.70  7.580   10.832
   90979.29  16.01  99.90  7.038   10.981
-
88615.27  99.90  14.91  11.273  6.729
 89316.13  99.90  17.28  11.185  7.740
  90622.85  99.90  16.81  11.024  7.420
   89084.85  99.90  17.51  11.214  7.861

Run #2. Slower host is replaced with better one. ABC=0.
No runs in backward directions.

new
 old

2   24009.73  8.80   6.49   3.667   10.806
 24008.43  8.00   6.32   3.334   10.524
4   40012.53  18.30  8.79   4.574   8.783
 3.84  19.40  8.86   4.851   8.857
8   60500.29  26.30  12.78  4.348   8.452
 60397.79  26.30  11.73  4.355   7.769
16  69619.95  39.80  14.03  5.717   8.063
 70528.72  24.90  14.43  3.531   8.184
32  132522.01  53.20  21.28  4.015   6.424
 132602.93  57.70  22.59  4.351   6.813
64  145738.83  60.30  25.01  4.138   6.865
 143129.55  73.20  24.19  5.114   6.759
128 148184.21  69.70  24.96  4.704   6.739
 148143.47  71.00  25.01  4.793   6.753
256 144798.91  69.40  25.01  4.793   6.908
 144086.01  73.00  24.61  5.067   6.832

Frankly, I do not see any statistically valid correlations.



 that linux didn't seem to be doing the same thing. Hence my tweaking 
 when seeing this patch come along...]

netperf does not catch this. :-)

Even with this patch Linux does not ack every second segment dumbly;
it waits for some conditions, mostly read() emptying the receive queue.
To model this it is necessary to insert some gaps between
the bursted segments or to use a slow network.

I have no doubt it is easy to model a situation when we send
lots of useless ACKs, f.e. by inserting 20ms gaps between requests.
To see the effect on throughput/cpu, we could start enough connections
doing the same thing.

Alexey


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-22 Thread Alexey Kuznetsov
Hello!

 I can't even find a reference to SIOCGSTAMP in the
 dhcp-2.0pl5 or dhcp3-3.0.3 sources shipped in Ubuntu.
 
 But I will note that tpacket_rcv() expects to always get
 valid timestamps in the SKB, it does a:

It is equally unlikely it uses the mmapped packet socket (tpacket_rcv).
 
I even installed that dhcp on x86_64. And I do not see anything;
netstamp_needed remains zero when running both server and client.
It looks like dhcp was defamed without guilt. :-)

Seems, Andi saw some leakage in netstamp_needed (value of 7),
but I do not see this either.


In any case, the issue is obviously more serious than just the behaviour
of some applications. On my notebook one gettimeofday() takes:

0.2 us with tsc
4.6 us with pm  (AND THIS CRAP IS DEFAULT!!)
9.4 us with pit (kinda expected)

It is ridiculous. Obviously, nobody (not only tcpdump, but everything
else) needs such a clock. Taking a timestamp takes time comparable
to processing the whole tcp frame. :-) I have no idea what is possible
to do without breaking everything, but it is not something to ignore.
This timer must be shot. :-)

Alexey


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Alexey Kuznetsov
Hello!

 It looks perfectly fine to me, would you like me to apply it
 Alexey?

Yes, I think it is safe.


Theoretically, there is one place where it may not be so good.
A well-nagling tcp connection, which makes lots of small write()s,
will send MSS-sized frames due to delayed ACKs. But if we ACK
every second segment, more segments will come out incomplete,
which could result in some decrease of throughput.

But the trap for this case was set 6 years ago. For unidirectional sessions
ACKs were sent not even for every second segment, but for every small
segment. :-) This did not show any problems for those 6 years. I guess it
means that the problem does not exist.

Alexey


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello!

 For netdev: I'm more and more thinking we should just avoid the problem
 completely and switch to true end2end timestamps. This means don't
 time stamp when a packet is received, but only when it is delivered
 to a socket.

This will work.

From the viewpoint of existing uses of timestamps by packet sockets
this time is no worse. The only danger is a violation of causality
(when a forwarded packet or a reply packet gets a timestamp earlier than
the original packet). This pathology was the main reason why the timestamp
is recorded early, before the packet is demultiplexed in netif_receive_skb().
But it is not a practical problem: delivery to packet/raw sockets
is occasionally placed _before_ delivery to real protocol handlers.


 handler runs. Then the problem above would completely disappear. 

Well, not completely. A too-slow clock source remains a too-slow clock source.
If it is so slow that it results in performance degradation, it just
should not be used at all; even such a pariah as tcpdump wants to be fast.

Actually, I have a question. Why is the subject
"Network performance degradation from 2.6.11.12 to 2.6.16.20"?
I do not see the beginning of the thread and cannot guess
why the clock source degraded. :-)

Alexey


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello!

 Hmm, not sure how that could happen. Also is it a real problem
 even if it could?

As I said, the problem is _occasionally_ theoretical.

This would happen f.e. if the packet socket handler were installed
after the IP handler. Then tcpdump would get the packet after it is processed
(acked/replied/forwarded). That would be disastrous; the results
would be unparsable.

I recall, the issue was discussed, and at that time it looked more
reasonable to solve problems of this kind by taking the timestamp once,
before the packet is seen by all the rest of the stack. Who could expect
that the PIT nightmare was going to return? :-)


 Then it has to use the ACPI pmtmr which is really really slow.
 The overhead of that thing is so large that you can clearly see it in
 the network benchmark.

I see. Thank you.

Alexey


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Alexey Kuznetsov
Hello!

Of course, the number of ACKs increases. It is the goal. :-)


 unpleasant increase in service demands on something like a burst 
 enabled (./configure --enable-burst) netperf TCP_RR test:
 
 netperf -t TCP_RR -H foo -- -b N   # N > 1

foo=localhost

b   patched orig
2   105874.83   105143.71
3   114208.53   114023.07
4   120493.99   120851.27
5   128087.48   128573.33
10  151328.48   151056.00

Probably, the test is done wrong. But I see no difference.


 to increase as a result.   Pipelined HTTP would be like that, some NFS 
 over TCP stuff too, maybe X traffic,

X will be excited about better latency.

As for protocols not interested in latency, they will be a little
happier if transactions are processed asynchronously.
 
But actually, it is not about increasing or decreasing the number of ACKs.
It is about killing that pain in the ass which we used to have because
we pretended to be too smart.

Alexey


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello!

 But that never happens right? 

Right.

Well, not right. It happens, simply because you get the packet
with a newer timestamp after a previous handler saw this packet
and did some actions. I just do not see any bad consequences.


 And do you have some other prefered way to solve this? Even if the timer
 was fast it would be still good to avoid it in the fast path when DHCPD
 is running.

No. The way, which you suggested, seems to be the best.


1. It even does not disable the possibility to record the timestamp inside
   the driver, which Alan was afraid of. The sequence is:
 
	if (!skb->tstamp.off_sec)
		net_timestamp(skb);
 
2. Maybe, netif_rx() should continue to get the timestamp in netif_rx().
 
3. NAPI already introduced almost the same inaccuracy. And it is really
   silly to waste time getting the timestamp in netif_receive_skb() a few
   moments before the packet is delivered to a socket.
 
4. ...but the clock source, which takes one of the top lines in profiles,
   must be repaired yet. :-)

Alexey


Re: Network performance degradation from 2.6.11.12 to 2.6.16.20

2006-09-18 Thread Alexey Kuznetsov
Hello!

 Please think about it this way:
 suppose you haave a heavily loaded router and some network problem is to
 be diagnosed. You run tcpdump and suddenly router becomes overloaded (by
 switching to timestamp-it-all mode

I am sorry. I cannot think that way. :-)

Instead of attempts to scare, better resend the original report,
where you said how much performance degraded; I cannot find it.
 
* I do see get_offset_pmtmr() in the top lines of the profile. That's scary enough.
* I do not understand what the hell dhcp needs timestamps for.
* I do not listen to any suggestions to screw up tcpdump with a sysctl.
  The kernel already implements a much better thing than a sysctl.
  Do not want timestamps? Fix tcpdump, add an option, submit the
  patch to the tcpdump maintainers. Not a big deal.

Alexey


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-18 Thread Alexey Kuznetsov
Hello!

 There isn't any sort of clever short-circuiting in loopback is there?

No, from all that I know.


   I 
 do like the convenience of testing things over loopback, but always fret 
 about not including drivers and actual hardware interrupts etc.

Well, if the test is right, it should show the cost of redundant ACKs.


 Regardless, kudos for running the test.  The only thing missing is the 
 -c and -C options to enable the CPU utilization measurements which will 
 then give the service demand on a CPU time per transaction basis.  Or 
 was this a UP system that was taken to CPU saturation?

It is my notebook. :-) Of course, cpu consumption is 100%.
(Actually, netperf shows 100.10 :-))

I will redo test on a real network. What range of -b should I test?


 What i'm thinking about isn't so much about the latency

I understand.

Actually, I did those tests ages ago for a pure throughput case,
when nothing goes in the opposite direction. I did not find a difference
that time. And nobody even noticed that Linux sent ACKs for _each_ small
segment on unidirectional connections for all those years. :-)

Alexey


Re: 2.6.18-rc6 memory mapped pcap truncates outgoing TCP packets, but not icmp

2006-09-14 Thread Alexey Kuznetsov
Hello!

 [PACKET]: Don't truncate non-linear skbs with mmaped IO
 
 Non-linear skbs are truncated to their linear part with mmaped IO.
 Fix by using skb_copy_bits instead of memcpy.

Ack.

I remember this trick. The idea was that I needed only the TCP header in any
case, and it was a perfect cutoff. This hack was not supposed to survive. :-)

Alexey


Re: [PATCH] make ipv4 multicast packets only get delivered to sockets that are joined to group

2006-09-14 Thread Alexey Kuznetsov
Hello!

 No, it returns 1 (allow) if there are no filters to explicitly
 filter it. I wrote that code. :-)

I see. It did not behave this way in the old times.
 
From your mails I understood that the current behaviour matches other
implementations (BSD, whatever); is that true?

Alexey


Re: [PATCH] make ipv4 multicast packets only get delivered to sockets that are joined to group

2006-09-13 Thread Alexey Kuznetsov
Hello!

 IPv6 behaves the same way.

Actually, Linux IPv6 filters received multicasts, inet6_mc_check() does
this.

IPv4 does not. I remember that attempts to do this were made in the past
and failed, because some applications related to multicast routing
expected to receive all multicasts even though they had not joined
any multicast groups. So, it was left intact.

Alexey


Re: [PATCH][RFC] Re: high latency with TCP connections

2006-09-05 Thread Alexey Kuznetsov
Hello!

 Is this really necessary?

No, of course. We lived for ages without this, and would live for another age.



   I thought that the problems with ABC were in 
 trying to apply byte-based heuristics from the RFC(s) to a 
 packet-oriented cwnd in the stack?

It was just the last drop.

Even with ABC disabled, that test shows some gaps in latency summing
up to ~300 msec. Almost invisible, but not good.
 
A too-aggressive delack has many other issues. Even without ABC
we have a quadratically suppressed cwnd on TCP_NODELAY connections
compared to BSD: at the sender side we suppress it by counting
cwnd in packets, at the receiver side by ACKing by byte counter.
 
Each time another victim sees the artificial latencies introduced
by aggressive delayed acks, even though he requested TCP_NODELAY,
our best argument is "Stupid, you do it all wrong, how could you get
decent performance?" :-)
 
Probably, we stand for a feature which really is not worth
standing for and causes nothing but permanent pain in the ass.

Alexey


Re: high latency with TCP connections

2006-09-04 Thread Alexey Kuznetsov
Hello!

 At least for slow start it is safe, but experiments with atcp for
 netchannels showed that it is better not to send excessive number of
 acks when slow start is over,

If this thing is done from tcp_cleanup_rbuf(), it should not affect
performance too much.
 
Note that with ABC and other pathological cases which do not allow
sending more than a fixed number of segments [we have lots of them,
f.e. sending tiny segments, we can hit the sndbuf limit], we deal with the
case when slow start is _never_ over.


instead we can introduce some tricky
 ack avoidance scheme and ack at least 2-3-4 packets or full MSS instead
 of two mss-sized frames.

One smart scheme was used at some stage (2000, probably never merged
in this form to mainstream): tcp counted the number of unacked small segments
in ack.rcv_small and kept a threshold in ack.rcv_thresh.

+
+	/* If we ever saw N+1 small segments from peer, it has
+	 * enough of send buffer to send N packets and does not nagle.
+	 * Hence, we may delay acks more aggressively.
+	 */
+	if (tp->ack.rcv_small > tp->ack.rcv_thresh+1)
+		tp->ack.rcv_thresh = tp->ack.rcv_small-1;
+	tp->ack.rcv_small = 0;


That was too much trouble for such a simple thing. So, eventually
it was replaced with a much dumber scheme. Look at the current
tcp_cleanup_rbuf(). It forces an ACK each time it sees that some small
segment was received. It has survived for 6 years, so I guess it did not
hurt anybody. :-)

What I would suggest to do now, is to replace:

	(copied > 0 &&
	 (icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
	 !icsk->icsk_ack.pingpong &&
	 !atomic_read(&sk->sk_rmem_alloc)))
		time_to_ack = 1;

with:

	(copied > 0 &&
	 (icsk->icsk_ack.unacked > 1 ||
	  ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) && !icsk->icsk_ack.pingpong)) &&
	 !atomic_read(&sk->sk_rmem_alloc)))
		time_to_ack = 1;

I would not hesitate even a minute if the variable unacked could be
calculated using some existing state variables.

Alexey



Re: 2.6.18-rc5 with GRE, iptables and Speedtouch ADSL, PPP over ATM

2006-09-04 Thread Alexey Kuznetsov
Hello!

 This path obviously breaks assumption 1) and therefore can lead to ABBA
 dead-locks.

Yes...


 I've looked at the history and there seems to be no reason for the lock
 to be held at all in dev_watchdog_up.  The lock appeared in day one and
 even there it was unnecessary.

Seems, it serializes mod_timer and the timer handler to keep the timer
in a predictable state. Maybe this is not necessary; a priori, it is required.
 
Note that in dev_watchdog_down() queue_lock is released before
taking xmit_lock. Probably, this is what was supposed
to be done in dev_watchdog_up() too.

Alexey



[PATCH][RFC] Re: high latency with TCP connections

2006-09-04 Thread Alexey Kuznetsov
Hello!

 Some people reported that this program runs in 9.997 sec when run on
 FreeBSD.

Try enclosed patch. I have no idea why 9.997 sec is so magic, but I
get exactly this number on my notebook. :-)

Alexey

=

This patch enables sending ACKs for every 2nd received segment.
It does not affect either mss-sized connections (obviously) or connections
controlled by Nagle (because there is only one small segment in flight).
 
The idea is to record the fact that a small segment arrives
on a connection where one small segment has already been received
and still not ACKed. In this case an ACK is forced after tcp_recvmsg()
drains the receive buffer.
 
In other words, it is a soft every-2nd-segment ACK, which is enough
to preserve the ACK clock even when ABC is enabled.

Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]



diff --git a/include/net/inet_connection_sock.h 
b/include/net/inet_connection_sock.h
index 9bf73fe..de4e83b 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -147,7 +147,8 @@ extern struct sock *inet_csk_clone(struc
 enum inet_csk_ack_state_t {
ICSK_ACK_SCHED  = 1,
ICSK_ACK_TIMER  = 2,
-   ICSK_ACK_PUSHED = 4
+   ICSK_ACK_PUSHED = 4,
+   ICSK_ACK_PUSHED2 = 8
 };
 
 extern void inet_csk_init_xmit_timers(struct sock *sk,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 934396b..4f3b76f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -955,8 +955,11 @@ #endif
 * receive buffer and there was a small segment
 * in queue.
 */
-	    (copied > 0 && (icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
-	     !icsk->icsk_ack.pingpong && !atomic_read(&sk->sk_rmem_alloc)))
+	    (copied > 0 &&
+	     ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED2) ||
+	      ((icsk->icsk_ack.pending & ICSK_ACK_PUSHED) &&
+	       !icsk->icsk_ack.pingpong)) &&
+	     !atomic_read(&sk->sk_rmem_alloc)))
 			time_to_ack = 1;
}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 111ff39..5877920 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -156,6 +156,8 @@ static void tcp_measure_rcv_mss(struct s
return;
}
}
+	if (icsk->icsk_ack.pending & ICSK_ACK_PUSHED)
+		icsk->icsk_ack.pending |= ICSK_ACK_PUSHED2;
 	icsk->icsk_ack.pending |= ICSK_ACK_PUSHED;
}
 }


Re: ProxyARP and IPSec

2006-09-04 Thread Alexey Kuznetsov
Hello!

 <sarcasm>
 
 What a great idea.  Now I just have to get every host I want to
 interoperate with to support a nonstandard configuration.  The scary
 part is that if I motivate it with "Linux is too stupid to handle
 standard tunnel-mode IPsec" I might actually get away with it.

sarcasm mode is not accepted. Linux does exactly standard tunnel-mode IPsec.
It does not give you a device to make you totally happy.
 
Probably, you are not aware that a standard IPsec tunnel device,
if it is created:
 
1. Probably, will not accept fragmented frames, because IPsec cannot
   handle them
2. Probably, will have undefined MTU (65536), because of 1
3. Probably, will screw up TCP because of 2
   etc.
 
Actually, this is the reason why it is not implemented.
It is dirty business. :-) And the person who implements this
has to be really... unscrupulous. :-)

Alexey


Re: high latency with TCP connections

2006-08-31 Thread Alexey Kuznetsov
Hello!

 2) a way to take delayed ACKs into account for cwnd growth

This part is OK now, right?


 1) protection against ACK division

But Linux never had this problem... The congestion window was increased
only when a whole skb was ACKed, flag FLAG_DATA_ACKED. (TSO could
break this, but should not.) Otherwise, such an ACK just advanced snd_una
and nothing more.

This aspect of ABC is crucial for BSD. TCP_NODELAY sockets did not obey
congestion control there. From the very beginning, before slow start,
they could send thousands of 1-byte segments.

The only problem of the "too aggressive" kind with Linux was that we could
develop a large cwnd sending small segments, and then switch to sending
mss-sized segments. It does not look scary, to be honest. :-)

Linux had troubles with slow start even before ABC. Actually,
some applications can suffer from the same syndrome even with ABC disabled.
With ABC it becomes real TROUBLE: cwnd has no chance to develop at all.

Probably, aspect 1 of ABC should just be disabled. And my first suggestion
looks workable too.

Alexey


Re: NAPI: netif_rx_reschedule() ??

2006-08-31 Thread Alexey Kuznetsov
Hello!

 However I'm confused about a couple of things, and there are only two
 uses of netif_rx_reschedule() in the kernel, so I'm a little stuck.

First, do not believe even a single bit of code or docs about
netif_rx_reschedule(). It was used once in the first version of NAPI
for a 3com driver (which did not go to mainstream) and was left to rot. :-)


 1. What is the intent of the second, 'undo' parameter?  For example,
ibmveth.c does
 
   if (ibmveth_rxq_pending_buffer(adapter) &&
       netif_rx_reschedule(netdev, frames_processed))
   {
   lpar_rc = h_vio_signal(adapter->vdev->unit_address,
  VIO_IRQ_DISABLE);
   ibmveth_assert(lpar_rc == H_SUCCESS);
   more_work = 1;
   goto restart_poll;
   }
 
but it only does
 
   netdev->quota -= frames_processed;
 
_after_ that block (and the jump back to restart_poll).  So the
whole thing seems fishy: netdev->quota goes up by the number of
frames processed??

It is broken. netdev->quota MUST NOT be touched after netif_rx_complete().
The authors improved coding style, moving it closer to the update of *budget,
and it is wrong. First, because it is changed in an absolutely unserialized
context; second... you noticed.


It's not clear to me why the driver would want to do something
different depending on whether the NAPI poll was already scheduled
or not.

netif_rx_complete() released control. ->poll can be reentered on another
CPU after this.

If netif_rx_reschedule() fails, it means that ->poll must exit, because
the poll was already rescheduled.

If it succeeds, it means the current thread got control back. Changes made
before netif_rx_complete(), because we were going to exit (netdev->quota etc.),
are undone and the loop can be restarted.

To be honest, I do not recollect everything. F.e. scheduling a softirq
in netif_rx_reschedule() looks like a cut-n-paste bug.

Alexey


Re: [PATCH] fix sk->sk_filter field access

2006-08-30 Thread Alexey Kuznetsov
Hello!

  Function sk_filter() is called from the tcp_v{4,6}_rcv() functions with
  argument needlock = 0, while the socket is not locked at that moment. In
  order to avoid this and similar issues in the future, use RCU for
  sk->sk_filter field read protection.
  
  Patch is for net-2.6.19
 
 What bug are you fixing?  The current code looks correct to me.
 
 Why is it important to avoid the needlock stuff in the future?
 
 This patch description is very incomplete, you should list the
 precise reason why you are doing something instead of using
 vague language.

Yes, explanation could be better.

Current code in tcp_v4_rcv() calls sk_filter() _before_ it takes the socket
lock. This happened when the LSM patches were applied. Apparently, LSM does
not want to see the socket locked in security_sock_rcv_skb().

The obvious solution is to change the third argument of sk_filter, needlock,
to 1. Then we see that sk_filter() is not used with needlock=0 anymore,
therefore it can be completely eliminated. That was the original fix.

I suggested removing the ugly misuse of bh_lock_sock() (introduced by me,
just because there was no better lock to use) and replacing it with RCU,
which is logical and clean.

The patch looks decent. I had one doubt about the misuse of rcu_read_lock_bh()
in sk_attach_filter(). Probably, it should be plain local_bh_disable(),
I do not know. But because rcu_read_lock_bh() actually is local_bh_disable(),
it seems not to be a serious issue.

Alexey


Re: [PATCH] fix sk->sk_filter field access

2006-08-30 Thread Alexey Kuznetsov
Hello!

 Really?

 It is used with needlock=0 by DCCP ipv6, for example.  This case seems
 correct too.  What about sk_receive_skb()?  dn_queue_skb()?  In fact,
 there seems to be numerous uses still with needlock=0, all legitimate.

Well, not quite legitimate.

sk_receive_skb() has the same bug as tcp_v4_rcv().

The call in the backlog processing of ipv6/dccp is an accidentally forgotten
duplicate of the one done in sk_receive_skb().

Yes, those places really added a bit to the desire to get rid of
this misuse of bh_lock_sock(). :-)


 Let us to fix bugs first, and then consider rewriting the locking.

To be honest, I think switching to RCU is the simplest bug fix.
At least, it does not require searching for where bh_lock_sock() is taken
in dn_queue_skb(), if it is not forgotten there at all. :-)

Alexey


Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello!

 @@ -346,8 +354,8 @@ struct neighbour *neigh_lookup(struct ne
   
   NEIGH_CACHE_STAT_INC(tbl, lookups);
  
  - read_lock_bh(&tbl->lock);
  - hlist_for_each_entry(n, tmp, &tbl->hash_buckets[hash_val], hlist) {
  + rcu_read_lock();
  + hlist_for_each_entry_rcu(n, tmp, &tbl->hash_buckets[hash_val], hlist) {
   if (dev == n->dev && !memcmp(n->primary_key, pkey, key_len)) {
   neigh_hold(n);
   NEIGH_CACHE_STAT_INC(tbl, hits);

Sure? Seems you cannot grab the refcnt here; the entry can already be
released.

Probably, you should do atomic_inc_and_test() here and restart the lookup
if it fails.

Alexey


Re: [RFC IPv6] Disabling IPv6 autoconf

2006-08-29 Thread Alexey Kuznetsov
Hello!

 Yes, it is logical because without multicast IPV6 cannot
 work correctly.

This is not quite true. IFF_BROADCAST is enough, it will work just
like IPv4.

Real troubles start only when interface is not IFF_BROADCAST and not
IFF_POINTOPOINT.


 IFF_MULTICAST flag seems potentially problematic.  How many other
 things break over such a device?

Nothing should break. IFF_MULTICAST is an advisory flag, saying
mostly "You do not want to stream high-bandwidth multicast video here."
So it can be used to block autoconfiguration.

It does not change the fact that the Xen device does something profoundly
wrong. IPv6 autoconfiguration is _auto_configuration. It is triggered
only for a few media types, for which autoconfiguration is prescribed
by the corresponding RFCs. Ethernet is one of them.

If Xen does not support the things which are required of each
ethernet device, it should not be ARPHRD_ETHER. If it wants to pretend
to be ARPHRD_ETHER, it must support basic ethernet functions, which IMHO
is so _easy_ that the question does not even make sense.

Alexey


Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello!

 atomic_inc_and_test is true iff result is zero, so that won't work.

I meant atomic_inc_not_zero(), as Martin noticed.


 But the following should work:
 
    hlist_for_each_entry_rcu(n, tmp, &tbl->hash_buckets[hash_val], hlist) {
    if (dev == n->dev && !memcmp(n->primary_key, pkey, key_len)) {
    if (unlikely(atomic_inc_return(&n->refcnt) == 1)) {
    neigh_release(n);

I do not think it will work. It has exactly the same race condition.

Yes, atomic_inc_not_zero() is expensive. But it looks like the cheapest
variant which works correctly without more work.

Another variant would be to rework the use of refcnt. It can be done like
the rt cache, where release of the last reference does not mean anything.

Also, probably, it makes sense to add neigh_lookup_light(), which does
not take a refcnt, but requires calling
neigh_release_light() (which is just rcu_read_unlock_bh()).

Alexey


Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello!

  Also, probably, it makes sense to add neigh_lookup_light(), which does
  not take refcnt, but required to call
  neigh_release_light() (which is just rcu_read_unlock_bh()).
 
 Which code paths would that make sense on?
   fib_detect_death (ok)
   infiniband (ok)
   wireless/strip (ok) -- hey, this code is crap it has
   a refcount leak already!
   arp_req_get (ok)
   ndisc (ok)
 
 Perhaps killing the refcount all together, and just changing
 everybody to neigh_lookup_rcu(). Nobody holds a long term reference
 to the entries.

The only real user of the refcnt is the destination cache.

It uses __neigh_lookup/__neigh_lookup_errno, which could also
use neigh_lookup_rcu(), then do atomic_inc_not_zero() and, if that fails,
fall back to neigh_create().

Alexey


Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello!

 This should not be any more racy than the existing code.

Existing code is not racy.

The critical place is the interpretation of refcnt==1. Current code assumes
that when refcnt==1 and the entry is in the hash table, nobody can take this
entry (the table is locked). So, it can be unlinked from the table.

See? __neigh_lookup()/__neigh_lookup_errno() _MUST_ return an alive
and hashed entry. And it will stay hashed as long as the reference is held.
Or until the neigh entry is forced to destruction by a device going down;
in this case the referring dst entries will die as well.

If the dst cache grabs an entry which is purged from the table because for
some time it had refcnt==1, you get a valid dst entry referring to a dead
neighbour entry.

Alexey


Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello!

Yes, I forgot to say I take back my suggestion about atomic_inc_not_zero().
It would not work.

Seems it is possible to add some barriers around setting n->dead
and testing it in neigh_lookup_rcu(), but it would be scary and ugly.
To be honest, I just do not know how to do this. :-)


Re: [PATCH 4/6] net neighbour: convert to RCU

2006-08-29 Thread Alexey Kuznetsov
Hello!

 Race 1: w/o RCU
   Cpu 0: is in neigh_lookup
   gets read_lock()
   finds entry
   ++refcount to 2
   updates it
   Cpu 1: is in forced_gc()
   waits at write_lock()
   releases read_lock()
   drops ref count to 1.
   sees ref count is 1
   deletes it

Do you mean it is purged, though it is actually fresh?
That is a harmless race condition.


 Why must it be hashed, it could always get zapped just after the update.

Because otherwise we would have to check its validity at the level of the
dst cache and rebind. We do not want this. Actually, dst->neighbour is
supposed to be immutable.

It means that as long as some valid dst refers to a neighbour, it must remain
hashed.


 Hmm.. Since this is a slow path,

All the neighbour lookups are in slow paths. Actually, lookup
of a neighbour entry for use in the dst cache is the most loaded path;
the rest of them are pure maintenance paths.

Yes, this would work. RCU is a little spoiled, but at least the global
tbl->lock is replaced with a per-neighbour lock.

Alexey


Re: ProxyARP and IPSec

2006-08-24 Thread Alexey Kuznetsov
Hello!

 I'm thinking that David definitely has a point about having a usability 
 problem, though.  All other kind of tunnels have endpoint devices 
 associated with them, and that would make all these kinds of problems go 
 away, 

Yes, when you deal with sane practical setups, this approach is the only
reasonable one.

Unfortunately, IPsec is not something totally sane and practical :-),
the security gateway case is a small part of it, and the routing viewpoint
clashes fatally with other requirements. The net result is that we use an
approach where it is possible to do everything with some effort, rather than
an approach which is simple and intuitive but does not allow doing many
things.

It is possible to simulate simple life by creating ipsecX devices
with xfrm disabled and routing all the tunnels there. That would be handy.

I would just advise renaming one of the dummy devices to ipsec0
and routing all the IPsec tunnels there. It is also simple.

As for iptables, I am sorry, it is too flexible to control IPsec. :-)
One day, someone with enough energy and stamina will make the flow cache
unify all the kinds of policy rules. Until that day, you have to tune
all three policy sets (routing, ipsec and iptables) separately and take
care of the cases when one set has to cheat another. :-)

Alexey


Re: ProxyARP and IPSec

2006-08-23 Thread Alexey Kuznetsov
Hello!

 What he's trying to accomplish doesn't sound all that weird,

Absolutely sane.


 does anyone have any other ideas?

The question is: where is this host really?

If it is far, far away and connected only via an IPsec tunnel, with the
destination of the tunnel different from the host address, then

ip ro add THEHOST dev dummy0

should be enough. It asserts that THEHOST is not on eth0.
IPsec policy will figure out the correct route, unless something is broken.

If the tunnel endpoint is THEHOST, then it is necessary to make a
prescription for how to reach it bypassing IPsec. This can be done with a
rule telling that THEHOST is reachable from the router and only from the
router:

ip ru add from OUR_TUNNEL_ENDPOINT to THEHOST table XXX
ip ro add THEHOST via THAT_ROUTE_WHICH_IS_SUPPOSED_TO_KNOW table XXX

Alexey


Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Alexey Kuznetsov
Hello!

 Either this, or it should be implemented correctly, which means poll needs
 to be fixed to also check for max_dgram_qlen,

Feel free to do this correctly. :-)
Deleting wrong code rarely helps.

It is the only protection against committing an infinite amount of memory to
a socket.

Alexey


Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Alexey Kuznetsov
Hello!

  It is the only protection of commiting infinite amount of memory to a 
  socket.
 
 Doesn't the if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) check in 
 sock_alloc_send_pskb()
 limit things already?

Unfortunately, it does not. You can open a socket, send
something to a selected victim, close it, and repeat this
until the receiver accumulates enough skbs to kill the system.

Alexey


Re: [take12 0/3] kevent: Generic event handling mechanism.

2006-08-22 Thread Alexey Kuznetsov
Hello!

 No way - timespec uses long.
 
 I must have missed that discussion. Please enlighten me in what regard
 using an opaque type with lower resolution is preferable to a type
 defined in POSIX for this sort of purpose.

Let me explain, as a person who made this mistake and deeply
regrets it.

F.e. in this case you just cannot use kevents in a 32-bit application
on x86_64, unless you add a whole translation layer inside the kevent core.
Even when you deal with a plain syscall, translation is a big pain,
but when you use an mmapped buffer, it can be simply impossible.

F.e. my mistake was the unsigned long in struct tpacket_hdr in
linux/if_packet.h. It makes use of the mmapped packet socket by 32-bit
applications on 64-bit archs essentially impossible.

Alexey


Re: Get rid of /proc/sys/net/unix/max_dgram_qlen

2006-08-22 Thread Alexey Kuznetsov
Hello!

 Isn't a socket freed until all skb are handled? In which case the limit on 
 the number of open
 files limits the total memory usage? (Same as with streaming sockets?)

Alas. The number of closed sockets is not limited. Actually, it is limited
by sk_max_ack_backlog*max_files, which is a lot.

The problem is specific to unconnected datagram sockets
(predicate unix_peer(other) != sk).


[PATCH] locking bug in fib_semantics.c

2006-08-17 Thread Alexey Kuznetsov
Hello!

[IPV4]: severe locking bug in fib_semantics.c

The patch is for net-2.6.19, but the bug is present in all the kernels
since yore.

Found in 2.4 by Yixin Pan [EMAIL PROTECTED]. Why do we need lockdep,
when sharp-sighted eyes are available? :-)

 When I read fib_semantics.c of Linux-2.4.32, write_lock(&fib_info_lock)
 is used in fib_release_info() instead of write_lock_bh(&fib_info_lock).
 Is the following case possible: a BH interrupts fib_release_info() while
 holding the write lock, and calls ip_check_fib_default() which calls
 read_lock(&fib_info_lock), and spins forever?


Signed-off-by: Alexey Kuznetsov [EMAIL PROTECTED]
---

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 4ea6c68..5dfdad5 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -159,7 +159,7 @@ void free_fib_info(struct fib_info *fi)
 
 void fib_release_info(struct fib_info *fi)
 {
-	spin_lock(&fib_info_lock);
+	spin_lock_bh(&fib_info_lock);
 	if (fi && --fi->fib_treeref == 0) {
 		hlist_del(&fi->fib_hash);
 		if (fi->fib_prefsrc)
@@ -172,7 +172,7 @@ void fib_release_info(struct fib_info *f
 		fi->fib_dead = 1;
 		fib_info_put(fi);
 	}
-	spin_unlock(&fib_info_lock);
+	spin_unlock_bh(&fib_info_lock);
 }
 
 static __inline__ int nh_comp(const struct fib_info *fi, const struct fib_info *ofi)
@@ -598,7 +598,7 @@ static void fib_hash_move(struct hlist_h
 	unsigned int old_size = fib_hash_size;
 	unsigned int i, bytes;
 
-	spin_lock(&fib_info_lock);
+	spin_lock_bh(&fib_info_lock);
 	old_info_hash = fib_info_hash;
 	old_laddrhash = fib_info_laddrhash;
 	fib_hash_size = new_size;
@@ -639,7 +639,7 @@ static void fib_hash_move(struct hlist_h
 	}
 	fib_info_laddrhash = new_laddrhash;
 
-	spin_unlock(&fib_info_lock);
+	spin_unlock_bh(&fib_info_lock);
 
 	bytes = old_size * sizeof(struct hlist_head *);
 	fib_hash_free(old_info_hash, bytes);
@@ -820,7 +820,7 @@ link_it:
 
 	fi->fib_treeref++;
 	atomic_inc(&fi->fib_clntref);
-	spin_lock(&fib_info_lock);
+	spin_lock_bh(&fib_info_lock);
 	hlist_add_head(&fi->fib_hash,
 		       &fib_info_hash[fib_info_hashfn(fi)]);
 	if (fi->fib_prefsrc) {
@@ -839,7 +839,7 @@ link_it:
 		head = &fib_info_devhash[hash];
 		hlist_add_head(&nh->nh_hash, head);
 	} endfor_nexthops(fi)
-	spin_unlock(&fib_info_lock);
+	spin_unlock_bh(&fib_info_lock);
 	return fi;
 
 err_inval:



Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Alexey Kuznetsov
Hello!

 The netlink header pid is really akin to sadb_msg_pid from RFC 2367.
 IMHO it should always be zero if the kernel is the originator of the
 message.

No. Analogue of sadb_msg_pid is nladdr.nl_pid.


The netlink header pid is not the originator of the message, but the author
of the change. The notion is ambiguous by definition, and the field
is a little ambiguous.

If the message is a plain ack or part of a dump, it is obviously
the pid of the requestor. But if it is a notification about a change, it can
be the nl_pid of the socket which requested the operation, but it may be 0.
Of course, all the 0s sent only because I was too lazy to track authorship
should be eliminated.

Alexey


Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-16 Thread Alexey Kuznetsov
Hello!

 In one conversation with Alexey he told me there was some inspiration
 from pfkey in the semantics of it i.e processid.

Inspiration, but not a copy. :-)

Unlike pfkeyv2, it uses the addressing usual for networking, i.e.
struct sockaddr_nl.
Alexey


Re: [RFC] network namespaces

2006-08-16 Thread Alexey Kuznetsov
Hello!

 (application) containers.  Performance aside, are there any reasons why
 this approach would be problematic for c/r?

This approach is just perfect for c/r.

Probably, this is the only approach when migration can be done
in a clean and self-consistent way.

Alexey


Re: [PATCH?] tcp and delayed acks

2006-08-16 Thread Alexey Kuznetsov
Hello!

 send out any delayed ACKs when it is clear that the receiving process is 
 waiting for more data?

It is already done in tcp_cleanup_rbuf(), a few lines before your chunk.
There is some complex condition to be satisfied there, and it is
impossible to relax it any further.

I do not know what is wrong in your case; check which of the conditions
in tcp_cleanup_rbuf() was not satisfied. Or just tcpdump a little.

BTW which buffer do you mean? SO_SNDBUF? SO_RCVBUF? Something else?
TCP tries to tune itself to weird buffer sizes to allow at least 2 segments
in flight, exactly to address the problem with delayed
ACKs, but it is quite easy to confuse those heuristics.

Alexey


Re: skb_shared_info()

2006-08-15 Thread Alexey Kuznetsov
Hello!

 I still like existing way - it is much simpler (I hope :) to convince
 e1000 developers to fix driver's memory usage

e1000 is not a problem at all. It just has to use pages.
If it is going to use high order allocations, it will suck,
be it order 3 or 2.


 area (does MAX_TCP_HEADER enough for all, what about ipv6 options?)...

This does not matter. All the cases with pathologically extended
headers go through the slow path with a non-trivial pskb_pull(). This is
what all those options are invented for.


 unix socket use that.

Totally not worth it, unless it is a part of something bigger.


 global changes like you suggested, we can break things more
 fundamentally :)

_You_ have to. :-) Sender zerocopy stalled only because of this. 

Alexey


Re: skb_shared_info()

2006-08-14 Thread Alexey Kuznetsov
Hello!

 e1000 will setup head/data/tail pointers to point to the area in the
 first sg page.

Maybe.

But I still hope this is not necessary; the driver should be able to do
at least primitive header splitting, in which case the header could
be inlined into the skb.

Alternatively, the header can be copied with skb_pull(). Not good, but
it is at least a regular procedure, which f.e. allows keeping the pages
in highmem.


 What if we will store array of pages and in shared info
 just like we have right now. So there will only one destructor.

Seems this is not enough. Actually, I was asking about your experience
with aio.

The simplest situation which confuses me, translated to aio language:
two senders send two pages. Those pages are to be recycled when we
are done. We have to notify both. And we have to merge both those pages
into one segment.

The second half of that mail suggested three different solutions,
all of them creepy. :-)

Alexey


Re: [PATCH 09/16] [IPv6] address: Convert address notification to use rtnl_notify()

2006-08-14 Thread Alexey Kuznetsov
Hello!

 Some of these removals of current-pid will affect users such as quagga,
 zebra, vrrpd etc.

If they survived cleanup in IPv4, they definitely will not feel cleanup
in IPv6.

Thomas does great work, Jamal, do not worry. :-)


 IMO, I believe there is a strong case that can be made for events that
 were caused by non-netlink users such as ioctls that could at least be
 multicast with current-pid.

Highly not desired.


 It would also be acceptable if the quagga etc folks could meet their
 goals some other way.

As far as I remember, that past discussion essentially closed
the question.
Alexey


Re: [RFC 5/7] neighbour: convert lookup to sequence lock

2006-08-14 Thread Alexey Kuznetsov
Hello!

 That wouldn't work if hard_header() ever expands the head. Fortunately
 hard_header() returns the length added even in case of an error so we
 can undo the absolute value returned.

Yes.

Or probably it is safer to undo to skb->nh. Even if hard_header
expands the skb, skb->nh still remains valid.
Alexey


Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-13 Thread Alexey Kuznetsov
Hello!

 So we do something like this:

Yes, exactly.

Actually, there was a function with similar functionality: rtnetlink_send().
net/sched/* used it, older net/ipv4/ still did this directly.

Alexey


Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-12 Thread Alexey Kuznetsov
Hello!

 Makes sense, especially for auto generated handles. I've been listening
 to the notifications on a separate socket for this purpose.

That's... complicated. But cool. :-)


 It does make sense, the way it has been implemented if at all is
 creepy. Even worse, IPv6 is using current-pid, some other code
 has been using the pid from NETLINK_CREDS() :-)

Yeah, some time ago I sent a patch replacing those with 0, but
apparently I forgot to grep IPv6. And I did not even search for
NETLINK_CREDS().


 when receiving multicasts. Also because some of the multicast
 code is buggy and provides the pid of the requestor's socket to
 netlink_broadcast() leading to excluding that socket.

Actually, that was the idea. If the requestor asked for NLM_F_ECHO and
subscribed to multicasts, it suppresses double notifications. If it did not
ask for NLM_F_ECHO, it is not interested in the results; it knows what's
going on without this.

F.e. it was used by my implementation in gated: it did not set either
NLM_F_ECHO or even NLM_F_ACK. And when making massive batch updates,
it received nothing back: only errors and updates made by someone else.


 I'm not sure I understand this correctly, if rcvbuf space was eaten
 by multicasts subsequent recvmsg() will follow invoking netlink_dump()
 again and the dump continues.

Indeed. I forgot how it works. :-)

I understood what is happening there. sock_rmalloc() fails not due
to rcvbuf, but because of global resource limiting. Grr... it does not
look solvable. Luckily, in a normal kernel this can happen only if a 0-order
GFP_KERNEL allocation fails. Not impossible with the oom_killer, but it
definitely moves the status of the problem to marginal.

Alexey



Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-12 Thread Alexey Kuznetsov
Hello!

 Actually I think the only safe solution is to allocate a separate
 socket for multicast messages.  In other words, if you want reliable
 unicast reception on a socket, don't bind it to a multicast group.

Yes, it was the point of my advocacy of NLM_F_ECHO. :-)

Alexey


Re: the mystery that is sock_fasync

2006-08-11 Thread Alexey Kuznetsov
Hello!

 Did I miss some way that multiple file objects can point to the
 same socket inode?

Absolutely prohibited. Always was.

Apparently, sock_fasync() was cloned from tty_fasync(), that's the only
reason why it is so creepy.

Alexey


Re: skb_shared_info()

2006-08-11 Thread Alexey Kuznetsov
Hello!

 management schemes and to just wrap SKB's around
 arbitrary pieces of data.

 and something clever like a special page_offset encoding
 means "use data, not page".

But for what purpose do you plan to use it?


 The e1000 issue is just one example of this, another

What is this issue?


As for the aggregated tcp queue, I can guess you did not find a place
to add protocol headers, but I cannot figure out how adding non-pagecache
references could help.

You would rather want more than one skb_shared_info(): at least two,
one immutable, another for headers.

I think Evgeniy's idea about inlining skb_shared_info into the skb head
is promising and simple enough. The whole point of a shared skb_shared_info
was to make cloning fast. But it makes lots of sense to inline some short
vector into the skb head (and, probably, even a MAX_HEADER space _instead_
of the space for fclone).

With an aggregated tcp send queue, when transmitting a segment, you could
allocate a new skb head with space for the header and either take the
existing skb_shared_info from the queue, attach it to the head and set
offset/length, or, alternatively, set one or two of the page pointers in an
array inlined in the head. (F.e. in the case of the AF_UNIX socket mentioned
by Evgeniy, we would keep data in pages and attach it directly to the skb
head.)

Cloning becomes more expensive, but who needs it cheap if tcp does not?



Returning to arbitrary pieces of data.

Page cache references in skb_shared_info are unique thing,
get_page()/page_cache_release() are enough to clone data.

But it is not enough even for such a simple thing as splice().
It wants us to remember some strange pipe_buffer, where each page
is wrapped together with some ops, flags and even a pipe_inode_info :-),
and to call some destructor when it is released. The first thought is that
it is insane: it does not respect page cache logic, requires us to implement
an additional level of refcounting, and abuses the amount of information
that has to be stored in the skb beyond all limits of sanity.

But the second thought is that something like this is required in any case.
At least we must report to someone when a page is not in use and
can be recycled. I think Evgeniy knows more about this, AIO has
the same issue. But this is simpler, because release callback can be done
not per fragment or even per-skb, but actually per-transaction.

One idea is to declare (some) skb_shared_info completely immutable,
and force each layer that needs to add a header or to fragment to refer
to the original skb_shared_info as a whole, using for modifications
another skb_shared_info() or an area inlined in the skb head.
And if some layer is not able to, it must reallocate all the pages.
In this case destructor/notification can be done not for fragment,
but for whole aggregated skb_shared_info. Seems, it will work both
with aggregated tcp queue and with udp.

Alexey


Re: sender throttling for unreliable protocols not guaranteed? (different units in sock->wmem_alloc and net_device->tx_queue_len)

2006-08-11 Thread Alexey Kuznetsov
Hello!

 I'd be interested in any opinions on the above mentioned effect.

Everything is right, it is exactly how it works.

Well, use another qdisc, which counts in bytes rather than in frames
(f.e. bfifo)

Set sndbuf small enough.

And if sndbuf*#senders is still too large, you have to use fair queueing,
sfq is quite good for this purpose.

Alexey


Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-11 Thread Alexey Kuznetsov
Hello!

 I get your point and I see the value. Unfortunately, probably due to
 lack of documentation, this feature isn't used by any applications I
 know of.

Well, tc was supposed to use it, but this did not happen and
it remained deficient.


 We even put in the hacks to make identification of own caused
 notifications easier by storing the netlink pid of the originator in
 the notification message.

Actually, it was supposed to be done everywhere, but originator info
did not propagate deep enough in many cases, especially in IPv6.
So, this is not a hack, it is a good work. :-)


BTW I have just remembered why it was especially important,
this should be documented as well.

Each socket which subscribes to multicasts becomes sensitive
to rcvbuf overflows. F.e. when you do control operations on a socket
which is subscribed to multicasts, the response can be lost in the stream
of events and -ENOBUFS generated instead. If it is a daemon, it can resync
the state, but if it is a simple utility, it cannot recover.

Probably, unicasts sent due to NLM_F_ECHO should somehow override
rcvbuf limits.

This reminded me about a capital problem, found by openvz people.
Frankly speaking, I still have no idea how to repair this, probably you
will find a solution.

Look: during a dump, skb allocation can fail (for many reasons;
the most obvious is that rcvbuf space was eaten by multicasts).
But the error is not reported! Oops. The worst thing is that even if the
error were reported, iproute would ignore it.

Alexey




Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-10 Thread Alexey Kuznetsov
Hello!

 This patch handles NLM_F_ECHO in netlink_rcv_skb() to
 handle it in a central point. Most subsystems currently
 interpret NLM_F_ECHO as to just unicast events to the
 originator of the change while the real meaning of the
 flag is to echo the request.

Do you not think it is useless to echo something back to the originator,
who just sent it?

Actually, the sense of NLM_F_ECHO was to tell the user what happened due to
his request. The answer is not the original request, which can contain
some incomplete fields etc., but full information about the object
deleted/added/changed. Moreover, the feedback can contain several
messages (though it is done accurately only in net/sched/), f.e. when
the request triggered deletion of one object and addition of another.

Obviously, it cannot be done in a central place.

Normally it is not needed: ip route add does not tell the user what
actually was done, so it suppresses the echo. But for a multistage
operation it is absolutely necessary: the answer contains f.e. auto-allocated
handles, which must be given in subsequent requests.

Alexey


Re: [PATCH 1/4] [NETLINK]: Handle NLM_F_ECHO in netlink_rcv_skb()

2006-08-10 Thread Alexey Kuznetsov
Hello!

 What's wrong with listening to the notification for that purpose?

Nothing! NLM_F_ECHO _is_ listening for notifications, just without
subscribing to multicast groups and without needing to figure out which
messages are yours. But beyond this, NLM_F_ECHO is strictly a subset of it.
Which still makes much more sense than echoing a known thing, does it not?

Alexey




Re: [PATCH] llc: SOCK_DGRAM interface fixes

2006-08-08 Thread Alexey Kuznetsov
Hello!

 This fix goes against the old historical comments about UNIX98 semantics
 but without this fix SOCK_DGRAM is broken and useless. So either ANK's
 interpretation was incorrect or the UNIX98 standard was wrong.

Just found this reference to me. :-)

The comment migrated from tcp.c. It is only about connected SOCK_STREAM
sockets; I do not see how it can make SOCK_DGRAM broken or useless.

That UNIX98 statement allowed us to avoid an expensive callback into
protocol-specific setup of the address in tcp_recvmsg().

Alexey


Re: [PATCH] limit rt cache size

2006-08-07 Thread Alexey Kuznetsov
Hello!

 During OpenVZ stress testing we found that UDP traffic with
 random src can generate too much excessive rt hash growing
 leading finally to OOM and kernel panics.
 
 It was found that for 4GB i686 system (having 1048576 total pages and
  225280 normal zone pages) kernel allocates the following route hash:
 syslog: IP route cache hash table entries: 262144 (order: 8, 1048576 bytes)
 => ip_rt_max_size = 4194304 entries, i.e.
 max rt size is 4194304 * 256b = 1Gb of RAM >> normal_zone

Grrr... Indeed.


 Attached the patch which removes HASH_HIGHMEM flag from
 alloc_large_system_hash() call. However, I'm not sure whether
 it should be removed as well for TCP tcp_hashinfo.ehash and
 tcp_hashinfo.bhash (as those are probably limited by number of files?).

The patch looks OK. But I am not sure too.

To be honest, I do not understand the sense of the HASH_HIGHMEM flag.
At first sight, the hash table eats low memory, and the objects hashed
in this table also eat low memory. Why is its size calculated from total
memory? But taking into account that this flag is used only by tcp.c and
route.c, both of which feed on low memory, I must be missing something
important.

Let's ask people on netdev.


As for the routing cache size, it looks like another bug.
route.c should not force rt_max_size = 16*rt_hash_size.
I think it should consult available memory and limit rt_max_size
to some reasonable value, even if the hash size is too high.



 --- ./net/ipv4/route.c.xrt	2006-07-14 19:08:33.000000000 +0400
 +++ ./net/ipv4/route.c	2006-08-07 18:25:37.000000000 +0400
 @@ -3149,7 +3149,7 @@ int __init ip_rt_init(void)
  					rhash_entries,
  					(num_physpages >= 128 * 1024) ?
  					15 : 17,
 -					HASH_HIGHMEM,
 +					0,
  					&rt_hash_log,
  					&rt_hash_mask,
  					0);



Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-08-01 Thread Alexey Kuznetsov
Hello!

  Alexey, any suggestions on how to handle this kind of thing?

A device which adds something at the head must check for space.
Anyone who adds something at the head must check.
Otherwise, it will remain buggy forever.


 What's wrong with my patch?

As I already said there is nothing wrong with the first chunk.
Except that it hides the real problem.


 hardly their author's fault. I don't think we've ever advertised
 hard_header_len is valid only with non-NULL hard_header.

Do not get it wrong: dev->hard_header_len is _NEVER_ guaranteed.
The places which allocate the skb take it as a hint to avoid reallocation.
But each place which stuffs something at the head still must check the space.

The only difference in the situation with dev->hard_header
is that when dev->hard_header != NULL, the header is added by IP itself.
That's why IP checks it.

Alexey


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-08-01 Thread Alexey Kuznetsov
Hello!

 Do the semantics (I'm not talking about bugs) allow skb passed
 to dev->hard_header() (if defined)

No. dev->hard_header() should get enough space, which is
dev->hard_header_len.

Actually, it is a historical hole in the design, inherited from ancient
times. The calling conventions of dev->hard_header() just did not allow
reallocation. BTW in 2.6 it can, if it uses pskb_expand_head().


 and then to dev->hard_start_xmit()
 to have less link layer header space than dev->hard_header_len?

Absolutely. It used to happen all the time. All those devices
which occasionally forget to check for space must be fixed.


 I.e., is dev->hard_header_len only advisory?

For the initial allocator it is an advice. For layers which add something
at the head, it is just nothing, if there is enough space. And it is again
an advice when the skb is reallocated.


 Anyway, the issue with kernel panic is real so I think we better
 fix it before 2.6.18, and propagate to stable series as well.

:-) Know what? This problem followed us since prehistoric times.
It happened in 2.4-stablest, 2.2-stable, 2.0... The same devices,
the same problem, no matter how much of space it is given to them,
they managed to find a hole and to crash. :-)

Alexey


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-31 Thread Alexey Kuznetsov
Hello!

 It does seem weird that IP output won't pay attention to

Not so weird, actually.

The logic was:

Only initial skb allocation tries to reserve all the space
to avoid copies in the future.

All the rest of places just check, that there is enough space
for their immediate needs. If dev-hard_header() is NULL, it means that
stack does not need any space at all, so that it does not need to worry.

The right logic for reallocation would be:

	if (skb_headroom(skb) < space_which_I_need_now)
		skb2 = skb_realloc_headroom(skb, space_for_future);

That logic was not followed exactly only because of laziness:
each time some device is found which forgets to check for space,
the reallocation is made in absolutely inappropriate places.
F.e. ip_forward() does not need to reallocate the skb when
skb_headroom() < dev->hard_header_len. It does, and that is not good.

A good example is the ipip tunnel. It sets:

	dev->hard_header_len = sizeof(struct iphdr) + LL_MAX_HEADER

because it does not know what device will be used.
It is lots of space and most likely it will not all be used.
So, the initial allocation reserves lots of space, but the rest of the
stack should not reallocate; the tunnel will take care of this itself.

Alexey




Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Alexey Kuznetsov
Hello!

On Thu, Jul 27, 2006 at 03:46:12PM +1000, Rusty Russell wrote:
 Of course, it means rewriting all the userspace tools, documentation,
 and creating a complete new infrastructure for connection tracking and
 NAT, but if that's what's required, then so be it.

That's what I love to hear. Not a joke. :-)

Could I only suggest not to relate this to netchannels? :-)
In the past we used to call this thing (grand-unified) flow cache.


 I don't think they are equivalent.  In channels,

I understand this. Actually, it was what I said in the next paragraph,
which you even cited.

I really do not like to repeat myself, it is nothing but idle talk,
but if the questions are questioned...

First, it was stated that the suggested implementation performs better, and
even much better. I am asking why we see such an improvement.
I am absolutely not satisfied with the statement "It is better. Period."
From all that I see, this particular implementation does not implement
the optimizations suggested by VJ; it implements only the things
which are not supposed to affect performance, or to affect it negatively.

Idle talk? I am sure that if that improvement happened not due
to a severe protocol violation, we can easily fix the existing stack.


 userspace), no dequeue lock is required.

And that was a part of the second question.

I do not see how a single-threaded TCP is possible. In the receiver path
it has to ack within quite strict time bounds, to delack etc.; in the sender
path it has to slow start. I am not even speaking about slow path things:
retransmit, probing window, lingering without process context etc.
It looks like VJ implies the protocol must be changed. We can't, we mustn't.

After we deidealize this idealization and recognize that some slow path
must exist and some part of this slow path has to be executed
with higher priority than the fast one, where do we arrive?
Is it not exactly what we have right now? A clean fast path, a separate
slow path. Not good enough? Where? Let's find and fix this.

Alexey


Re: [PATCH] NET: fix kernel panic from no dev->hard_header_len space

2006-07-27 Thread Alexey Kuznetsov
Hello!

 ip_output() ignores dev->hard_header_len

ip_output() worries about the space which it needs.

If some place needs more, it is its problem to check.
By the moment it is used, hard_header_len can even change.

The patch can be applied, but it does not change the fact that those
places which fail now must check the condition as well.


 A similar problem may be present in psched_mtu().

Nothing similar. The result of psched_mtu() is compared with skb->len,
as it is seen by the qdiscs. If hard_header is NULL, they see skbs
without the header.

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-27 Thread Alexey Kuznetsov
Hello!

 kernel thread takes 100% cpu (with preemption

Preemption, you tell... :-)

I begged you to spend 1 minute of your time to press ^Z. Did you?

Alexey


Re: [PATCH] ip multicast route bug fix

2006-07-26 Thread Alexey Kuznetsov
Hello!

 I like this.  However, since the cloned skb is either discarded in case
 of error, or queued in which case the caller discards its reference right
 away, wouldn't it be simpler to just do this?

Well, if we wanted just to cheat those checking tools, it is nice.
But if we want clarity, it does not look so sweet.

I _loved_ the tricks with the skb refcnt a lot (look into netlink.c :-))
until the day when pskb_expand_head() and Co appeared. It is so much
more handy, and it hates the refcnt so much.

I would seriously review all the uses of the refcnt, leave it internal
to skbuff.c, and use it externally only in a few places
where it is obviously safe (MSG_PEEK, mostly).

Alexey


Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

 Code was reusing an skb which could lead to use after free or double free.

No, this does not help. The bug is not here.

I was so ashamed of this that I could not touch the thing. :-)
It startled me a lot: how is it possible that the thing was in production
for several years and such a bad bug was never noticed?

Now it is clear. skbs sometimes leaked before BK changeset 1.889.26.53,
subj: [IPV4]: Fix skb leak in inet_rtm_getroute. But after this fix,
which introduced a new bug (the goto out_free in the enclosed patch),
the bug showed on the surface.

Please, review this.

- ipmr_cache_unresolved() does not free skb. Caller does.
- the goto out_free in route.c, in the case when the skb is enqueued,
  is turned back into goto out.
- some cosmetic cleanup in rt_fill_info() to make it more readable.

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9ccacf5..1788c04 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -640,8 +640,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 	if (atomic_read(&cache_resolve_queue_len) >= 10 ||
 	    (c = ipmr_cache_alloc_unres()) == NULL) {
 		spin_unlock_bh(&mfc_unres_lock);
-
-		kfree_skb(skb);
 		return -ENOBUFS;
 	}
 
@@ -662,7 +660,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 		spin_unlock_bh(&mfc_unres_lock);
 
 		kmem_cache_free(mrt_cachep, c);
-		kfree_skb(skb);
 		return err;
 	}
 
@@ -677,7 +674,6 @@ ipmr_cache_unresolved(vifi_t vifi, struc
 	 *	See if we can append the packet
 	 */
 	if (c->mfc_un.unres.unresolved.qlen > 3) {
-		kfree_skb(skb);
 		err = -ENOBUFS;
 	} else {
 		skb_queue_tail(&c->mfc_un.unres.unresolved, skb);
@@ -1375,6 +1371,7 @@ int ip_mr_input(struct sk_buff *skb)
 	 */
 	if (cache == NULL) {
 		int vif;
+		int err;
 
 		if (local) {
 			struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
@@ -1387,15 +1384,13 @@ int ip_mr_input(struct sk_buff *skb)
 		}
 
 		vif = ipmr_find_vif(skb->dev);
-		if (vif >= 0) {
-			int err = ipmr_cache_unresolved(vif, skb);
-			read_unlock(&mrt_lock);
-
-			return err;
-		}
+		err = -ENODEV;
+		if (vif >= 0)
+			err = ipmr_cache_unresolved(vif, skb);
 		read_unlock(&mrt_lock);
-		kfree_skb(skb);
-		return -ENODEV;
+		if (err)
+			kfree_skb(skb);
+		return err;
 	}
 
 	ip_mr_forward(skb, cache, local);
@@ -1568,6 +1563,17 @@ rtattr_failure:
 	return -EMSGSIZE;
 }
 
+/**
+ * ipmr_get_route - return mrouting information or queue for resolution
+ * @skb - netlink skb with partially complete rtnetlink information
+ * @rtm - pointer to head of rtmsg
+ * @nowait - if mroute is not in cache, do not resolve, fail with EAGAIN
+ *
+ * Return:
+ *	1  - if mrouting information is successfully recorded in skb
+ *	0  - if information is not available, but skb is queued for resolution
+ *	<0 - error
+ */
 int ipmr_get_route(struct sk_buff *skb, struct rtmsg *rtm, int nowait)
 {
 	int err;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 2dc6dbb..62f53e9 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2703,16 +2703,12 @@ static int rt_fill_info(struct sk_buff *
 	if (MULTICAST(dst) && !LOCAL_MCAST(dst) &&
 	    ipv4_devconf.mc_forwarding) {
 		int err = ipmr_get_route(skb, r, nowait);
-		if (err <= 0) {
-			if (!nowait) {
-				if (err == 0)
-					return 0;
+		if (err == 0)
+			return 0;
+		if (err < 0) {
+			if (err == -EMSGSIZE)
 				goto nlmsg_failure;
-			} else {
-				if (err == -EMSGSIZE)
-					goto nlmsg_failure;
-				((struct rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err;
-			}
+			((struct rta_cacheinfo*)RTA_DATA(eptr))->rta_error = err;
 		}
 	} else
 #endif
@@ -2794,7 +2790,7 @@ int inet_rtm_getroute(struct sk_buff *in
 	err = rt_fill_info(skb, NETLINK_CB(in_skb).pid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (!err)
-		goto out_free;
+		goto out;
 	if (err < 0) {
 		err = -EMSGSIZE;
 		goto out_free;


 

Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

 checking tools because the skb lifetime depends on the return value.
 Wouldn't it be better to have a consistent interface (skb always freed),
 and clone the skb if needed for deferred processing?

But the skb is not always freed in any case.
Normally it is submitted to netlink_unicast(). It is freed only in the
error case. So, following this logic, should you not clone the skb to
feed to netlink_unicast()?

Alexey



Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

 Wouldn't it be better to have a consistent interface (skb always freed),
 and clone the skb if needed for deferred processing?

I am sorry, I misunderstood you. I absolutely agree. It is much better,
the variant which I suggested is a good sample of bad programming. :-)

Alexey



Re: [PATCH] ip multicast route bug fix

2006-07-25 Thread Alexey Kuznetsov
Hello!

 Wouldn't it be better to have a consistent interface (skb always freed),
 and clone the skb if needed for deferred processing?

I think you mean this.

Note, it is a real skb_clone(), not alloc_skb(). The enqueued skb contains
the whole half-prepared netlink message plus room for the rest.
It could also be skb_copy(), if we want to be purist about mangling
cloned data, but the original copy is really not going to be used.

Alexey


diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 9ccacf5..85893ee 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1578,6 +1578,7 @@ int ipmr_get_route(struct sk_buff *skb, 
 	cache = ipmr_cache_find(rt->rt_src, rt->rt_dst);
 
 	if (cache == NULL) {
+		struct sk_buff *skb2;
 		struct net_device *dev;
 		int vif;
 
@@ -1591,12 +1592,18 @@ int ipmr_get_route(struct sk_buff *skb, 
 			read_unlock(&mrt_lock);
 			return -ENODEV;
 		}
-		skb->nh.raw = skb_push(skb, sizeof(struct iphdr));
-		skb->nh.iph->ihl = sizeof(struct iphdr)>>2;
-		skb->nh.iph->saddr = rt->rt_src;
-		skb->nh.iph->daddr = rt->rt_dst;
-		skb->nh.iph->version = 0;
-		err = ipmr_cache_unresolved(vif, skb);
+		skb2 = skb_clone(skb, GFP_ATOMIC);
+		if (!skb2) {
+			read_unlock(&mrt_lock);
+			return -ENOMEM;
+		}
+
+		skb2->nh.raw = skb_push(skb2, sizeof(struct iphdr));
+		skb2->nh.iph->ihl = sizeof(struct iphdr)>>2;
+		skb2->nh.iph->saddr = rt->rt_src;
+		skb2->nh.iph->daddr = rt->rt_dst;
+		skb2->nh.iph->version = 0;
+		err = ipmr_cache_unresolved(vif, skb2);
 		read_unlock(&mrt_lock);
 		return err;
 	}


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-24 Thread Alexey Kuznetsov
Hello!

 Also, there is some code for refcnt's in it that looks wrong.

Yes, it is disgusting. RCU does not allow increasing the socket refcnt
in the lookup routine.

Ben's version looks cleaner here; it does not touch the refcnt
in rcu lookups. But it is dubious too:

 do_time_wait:
+   sock_hold(sk);

is obviously in violation of the rule. Probably, the rcu lookup should do
something like:

	if (!atomic_inc_not_zero(&sk->sk_refcnt))
		pretend_it_is_not_found;

It is clear Ben did not look into the IBM patch, because one known place
of trouble is missed: when a socket moves from established to timewait,
the timewait bucket must be inserted before the established socket is removed.

Alexey


Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Alexey Kuznetsov
Hello!

Small question first:

 userspace, but also there are big problems, like one syscall per ack,

I do not see redundant syscalls. Is it not expected to send ACKs only
after receiving data, as you said? What is the problem?


Now boring things:

 There is no BH protocol processing at all, so there is no need to
 protect against someone who will add data while you are processing your
 own chunk.

The essential part of the socket user lock is the same mutex.

Backlog is actually not a protection, but a thing equivalent to a netchannel.
The difference is only that it tries to process something immediately,
when it is safe. You can omit this and push everything to the backlog
(=netchannel), which is processed only by syscalls, if you do not care
about latency.


 How many hacks just to be a bit closer to userspace processing,
 implemented in netchannels!

Moving processing closer to userspace is not a goal, it is a tool.
Sometimes useful, but generally quite useless.

F.e. in your tests it should not affect performance at all;
the end user is just a sink.

As for prequeueing, it is a bright example. Guess why it is useful?
What does it save? Nothing, like a netchannel. The answer is: it is just
a tool to generate coarse ACKs in a controlled manner without essential
violation of the protocol. (Well, and to combine checksumming and copying,
if you do not like how your card does this.)


 If userspace is scheduled away for too much time, it is bloody wrong to
 ack the data, that is impossible to read due to the fact that system is
 being busy. It is just postponing the work from one end to another - ack
 now and stop when queue is full, or postpone the ack generation when
 segment is realy being read.

... when you get all the segments nicely aligned, blah-blah-blah.

If you do not care about losses-congestion-delays-delacks-whatever,
you have a totally different protocol. The sending window feedback
is only a minor part of TCP. But even these boring TCP intrinsics
are not so important; look at an ideal lossless network:

Think what happens f.e. during a plain file transfer to your notebook.
You get 110MB/sec for a few seconds, then writeback fires and
the disk IO subsystem discovers that the disk holds only 50MB/sec.
If you are unlucky and another application starts, the disk is so congested
that it will take many seconds to make progress with the IO.
During this time the other side will retransmit, because the poor thing
thought the rtt was 100 usecs, and you will never return to 50MB/sec.

You have to _CLOSE_ the window in the case of a long delay, rather than
forget to ack. See the difference?

It is just because the actual end user is still far, far away.
And this happens all the time: when you relay the results to another
application via a pipe, when... Well, the only case where the real end user
is the user of the netchannel is when you receive to a sink.


 But I said not this. I said it looks _worse_. A bit, but worse.
 
 At least for 80 bytes it does not matter at all.

Hello-o, do you hear me? :-)

I am asking: it looks not much better, but a bit worse;
so what is the real reason for the better performance, unless it is
due to castration of the protocol?

Simplify the protocol, move all the processing (even memory copies) to
softirq, leave to user space only feeding pages to copy, and you will have
unbeatable performance. Been there, done that (not with TCP, of course),
but if you do not care about losses and ACK clocking and send an ACK once
per window, I do not see how it can spoil the situation.


 And actually I never understood nanooptimisation behind more serious
 problems (i.e. one cache line vs. 50MB/sec speed).

You deal with 80-byte packets, as far as I understand.
If you lose one cacheline per packet, it is a big problem.

All that we can change is protocol overhead. Handling the data part
is invariant anyway. You are scared of the complexity of TCP, but
you obviously forget one thing: the cpu is fast.
The code can look very complicated: some crazy hash functions,
damn hairy protocol processing; but if you take care about caches etc.,
all this is dominated by the first look into the packet in eth_type_trans()
or ip_rcv().

BTW, when you deal with a normal data flow, the cache need not be dirtied
by the data at all; it can be bypassed.


 works perfectly ok, but it is possible to have better performance by
 changing architecture, and it was done.

It is exactly the point of trouble. From all that I see and from what you
said, the better performance is achieved not due to the change of
architecture, but despite it.

A proof that we can perform better by changing the protocol is not required;
it is kind of obvious. The question is how to make the existing protocol
perform better.

I have no idea why your tcp performs better. It can be anything:
absence of slow start, coarser ACKs, whatever. I believe you were careful
to check those reasons and to do a fair comparison, but then the only guess
remaining is that you saved lots of i-cache by getting rid of the long
code path.

And none of those guesses can be attributed to 

Re: Netchannles: first stage has been completed. Further ideas.

2006-07-20 Thread Alexey Kuznetsov
Hello!

 Moving protocol (no matter if it is TCP or not) closer to user allows
 naturally control the dataflow - when user can read that data(and _this_
 is the main goal), user acks, when it can not - it does not generate
 ack. In theory

As far as I remember, in theory the absence of feedback leads
to loss of control as well. The same in practice, unfortunately.
You must say that the window is closed, otherwise the sender is totally
confused.


 There is one problem in your logic.
 RTT will not be so small, since acks are not sent when user does not
 read data.

It is arithmetic: rtt = window/rate.

And the rto stays rounded up to 200 msec, unless you have messed the
connection up so hard that it is not alive. Check.


  Simplify protocol, move all the processing (even memory copies) to softirq,
  leave to user space only feeding pages to copy and you will have unbeatable
  performance. Been there, done that, not with TCP of course, but if you do 
  not
  care about losses and ACK clocking and send an ACK once per window,
  I do not see how it can spoil the situation.
 
 Do you live in a perfect world, where user does not want what was
 requested?

All the time I am trying to bring to your attention that you read to
a sink. :-)
At least read to disk, to move it a little closer to reality.
Or at least do it from a terminal and press ^Z sometimes.


  You deal with 80 byte packets, to all that I understand.
  If you lose one cacheline per packet, it is a big problem.
 
 So actual netchannels speed is even better? :)

atcp. If you get rid of netchannels and leave only atcp, the speed will
be at least no worse. No doubts.


 tell me, why we should keep (enabled) that redundant functionality?
 Because it can work better in some other places, and that is correct,
 but why it should be enabled then in majority of the cases?

Did I not tell you something like that? :-) Optimize the real thing,
even try to detect the situations when retransmissions are redundant
and eliminate the code.


 Let's draw the line.
...
 That was my opinion on the topic. It looks like neither you, nor me will
 not change our point of view about that right now :)

I agree. :)

Alexey

