On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
> 
> 
> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> > 
> > 
> > On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> 
> >> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >> Please realize that existing qdiscs already doing this.
> >> The patchset allows bpf-cgroup to do the same.
> > 
> > Not the same thing I am afraid.
> 
> To be clear Alexei :
> 
> Existing qdisc set CE mark on a packet, exactly like a router would do.
> Simple and universal.
> This can be stacked, and done far away from the sender.
> 
> We do not _call_ back local TCP to propagate cn.

How do you classify NET_XMIT_CN ?
It's exactly local call back to indicate CN into tcp from layers below tcp.
tc-bpf prog returning 'drop' code means drop+cn
whereas cgroup-bpf prog returning 'drop' means drop only.
This patch set is fixing this discrepancy.

> We simply rely on the fact that incoming ACK will carry the needed 
> information,
> and TCP stack already handles the case just fine.
> 
> Larry cover letter does not really explain why we need to handle a corner case
> (local drops) with such intrusive changes.

Only after so many rounds of back and forth I think I'm starting
to understand your 'intrusive change' comment.
I think you're referring to:
-       return ip_finish_output2(net, sk, skb);
+       ret = ip_finish_output2(net, sk, skb);
+       return ret ? : ret_bpf;

right?
And your concern that this change slows down ip stack?
Please spell out your concerns in more verbose way to avoid this guessing game.

I've looked at assembler and indeed this change pessimizes tailcall
optimization. What kind of benchmark do you want to see ?
As an alternative we can do it under static_key that cgroup-bpf is under.
It will be larger number of lines changed, but tailcall
optimization can be preserved.

> TCP Small Queues already should make sure local drops are non existent.

tsq don't help.
Here is the comment from patch 5:
"When a packet is dropped when calling queue_xmit in  __tcp_transmit_skb
and packets_out is 0, it is beneficial to set a small probe timer.
Otherwise, the throughput for the flow can suffer because it may need to
depend on the probe timer to start sending again. "

In other words when we're clamping aggregated cgroup bandwidth with this 
facility
we may need to drop the only in flight packet for this flow
and the flow restarts after default 200ms probe timer.
In such case it's well under tsq limit.

Thinking about tsq... I think it would be very interesting to add bpf hook
there as well and have programmable and dynamic tsq limit, but that is
orthogonal to this patch set. We will explore this idea separately.

Reply via email to