On Mon, Mar 25, 2019 at 01:48:27AM -0700, Eric Dumazet wrote:
>
> On 03/25/2019 01:33 AM, Eric Dumazet wrote:
> >
> > On 03/24/2019 09:19 AM, Alexei Starovoitov wrote:
> >
> >> Cover letter also explains why bpf_skb_ecn_set_ce is not enough.
> >> Please realize that existing qdiscs already doing this.
> >> The patchset allows bpf-cgroup to do the same.
> >
> > Not the same thing I am afraid.
>
> To be clear Alexei :
>
> Existing qdisc set CE mark on a packet, exactly like a router would do.
> Simple and universal.
> This can be stacked, and done far away from the sender.
>
> We do not _call_ back local TCP to propagate cn.

How do you classify NET_XMIT_CN?
It is exactly a local callback that indicates CN to tcp from the layers below tcp.
A tc-bpf prog returning the 'drop' code means drop+cn,
whereas a cgroup-bpf prog returning 'drop' means drop only.
This patch set fixes that discrepancy.

> We simply rely on the fact that incoming ACK will carry the needed
> information,
> and TCP stack already handles the case just fine.
>
> Larry cover letter does not really explain why we need to handle a corner case
> (local drops) with such intrusive changes.

Only after so many rounds of back and forth do I think I'm starting
to understand your 'intrusive change' comment.
I think you're referring to:
-	return ip_finish_output2(net, sk, skb);
+	ret = ip_finish_output2(net, sk, skb);
+	return ret ? : ret_bpf;
right?
And your concern is that this change slows down the ip stack?
Please spell out your concerns in more detail to avoid this guessing game.

I've looked at the assembler and indeed this change pessimizes the tailcall
optimization. What kind of benchmark do you want to see?
As an alternative we can do it under the static_key that cgroup-bpf is under.
It will change a larger number of lines, but the tailcall
optimization can be preserved.

> TCP Small Queues already should make sure local drops are non existent.

tsq doesn't help.
Here is the comment from patch 5:
"When a packet is dropped when calling queue_xmit in __tcp_transmit_skb
and packets_out is 0, it is beneficial to set a small probe timer.
Otherwise, the throughput for the flow can suffer because it may need to
depend on the probe timer to start sending again. "

In other words, when we're clamping aggregated cgroup bandwidth with this facility
we may need to drop the only in-flight packet for this flow,
and the flow then restarts only after the default 200ms probe timer.
In such a case the flow is well under the tsq limit.

Thinking about tsq...
I think it would be very interesting to add a bpf hook there as well and have
a programmable and dynamic tsq limit, but that is orthogonal to this patch set.
We will explore this idea separately.
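
For readers following the return-code discussion above, here is a minimal user-space sketch of how the changed return path combines the two statuses. The NET_XMIT_* values and net_xmit_eval() mirror the definitions in the kernel's include/linux/netdevice.h. The function names model_finish_output2() and model_finish_output() are hypothetical stand-ins, not kernel functions, and the sketch assumes ret_bpf carries the cgroup-bpf verdict (for example NET_XMIT_CN). The patch writes the expression with the GNU shorthand `ret ? : ret_bpf`; the portable form is used here.

```c
#include <assert.h>  /* for the checks in the usage example */

/* Values as defined in the kernel's include/linux/netdevice.h. */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01  /* skb dropped */
#define NET_XMIT_CN      0x02  /* congestion notification */

/* Mirrors the kernel's net_xmit_eval(): CN is not an error for the caller. */
#define net_xmit_eval(e) ((e) == NET_XMIT_CN ? 0 : (e))

/* Hypothetical stand-in for ip_finish_output2(): returns a preset status. */
int model_finish_output2(int lower_status)
{
    return lower_status;
}

/*
 * Model of the changed return path. A non-zero status from the lower
 * layer wins; otherwise the cgroup-bpf verdict (ret_bpf, an assumption
 * of this sketch) is propagated to the caller.
 */
int model_finish_output(int lower_status, int ret_bpf)
{
    int ret = model_finish_output2(lower_status);

    return ret ? ret : ret_bpf;
}
```

With these definitions, model_finish_output(NET_XMIT_SUCCESS, NET_XMIT_CN) yields NET_XMIT_CN, so the congestion signal reaches the caller even though the lower layer reported success, while model_finish_output(NET_XMIT_DROP, NET_XMIT_CN) yields NET_XMIT_DROP. Because the function must inspect ret before returning, the call can no longer be a tail call, which is the pessimization mentioned above.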