On Sat, Mar 23, 2019 at 10:36:24PM -0700, Eric Dumazet wrote:
>
>
> On 03/23/2019 08:41 AM, Alexei Starovoitov wrote:
> > On Sat, Mar 23, 2019 at 02:12:39AM -0700, Eric Dumazet wrote:
> >>
> >>
> >> On 03/23/2019 01:05 AM, brakmo wrote:
> >>> This patchset adds support for propagating congestion notifications (cn)
> >>> to TCP from cgroup inet skb egress BPF programs.
> >>>
> >>> Current cgroup skb BPF programs cannot trigger TCP congestion window
> >>> reductions, even when they drop a packet. This patch-set adds support
> >>> for cgroup skb BPF programs to send congestion notifications in the
> >>> return value when the packets are TCP packets. Rather than the
> >>> current 1 for keeping the packet and 0 for dropping it, they can
> >>> now return:
> >>>   NET_XMIT_SUCCESS (0) - continue with packet output
> >>>   NET_XMIT_DROP (1)    - drop packet and do cn
> >>>   NET_XMIT_CN (2)      - continue with packet output and do cn
> >>>   -EPERM               - drop packet
> >>>
> >>
> >> I believe I already mentioned this model is broken, if you have any virtual
> >> device before the cgroup BPF program.
> >>
> >> Please think about offloading the pacing/throttling in the NIC,
> >> there is no way we will report back to tcp stack instant notifications.
> >
> > I don't think 'offload to google proprietary nic' is a suggestion
> > that folks can practically follow.
> > Very few NICs can offload pacing to hw and there are plenty of limitations.
> > This patch set represents a pure sw solution that works and scales to
> > millions of flows.
> >
> >> This patch series is going way too far for my taste.
> >
> > I would really appreciate if you can do a technical review of the patches.
> > Our previous approach didn't quite work due to complexity around
> > locked/non-locked socket.
> > This is a cleaner approach.
> > Either we go with this one or will add a bpf hook into __tcp_transmit_skb.
> > This approach is better since it works for other protocols and can be
> > used by qdiscs w/o any bpf.
> >
> >> This idea is not new, you were at Google when it was experimented by
> >> Nandita and others, and we know it is not worth the pain.
> >
> > google networking needs are different from the rest of the world.
>
> This has nothing to do with Google against Facebook really, it is a bit sad
> you react like this Alexei.
>
> We just upstreamed bpf_skb_ecn_set_ce(), so I doubt you already have numbers
> to show that this strategy is not enough.
>
> All recent discussions about ECN (TCP Prague and SCE) do not _require_
> instant feedback to the sender.
>
> Please show us experimental results before we have to carry these huge hacks.
I have a feeling we're looking at different patches. All of your comments
seem to be about something else; I have a hard time connecting them to this
patch set. Have you read the cover letter and the patches of _this_ set?
The results are in the cover letter, and there is no change in the behavior
of the networking stack.

Patches 1, 2, 3 are simple refactoring of the bpf-cgroup return codes.
Patch 4 is the only one that touches net/ipv4/ip_output.c, and only to pass
the return code up to the tcp/udp layer. The concept works for both,
despite your claim that it is tcp-only. The cover letter also explains why
bpf_skb_ecn_set_ce is not enough (see the sketches at the end of this mail).
Please realize that existing qdiscs are already doing this; the patch set
just allows bpf-cgroup to do the same.

If you were arguing about the sysctl knob in patch 5, that I would
understand. Instead of a knob we can hard-code the value for now, but
please explain what issues you see with that knob.

Also, to be fair, I applied your commit f11216b24219 ("bpf: add skb->tstamp
r/w access from tc clsact and cg skb progs") without demanding
'experimental results', because the feature makes sense. Yet you folks
haven't produced a _single_ example or test result since then. This is not
cool.
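To make the new semantics concrete, here is roughly what a program using
the proposed return codes could look like. This is an untested sketch, not
code from the series: the bytes_out map, the BYTE_BUDGET threshold and the
program name are made up for illustration; only the return codes and their
meaning come from the cover letter. It is written like the selftests
programs (bpf_helpers.h from tools/testing/selftests/bpf):

/* Untested sketch: over-simplified per-cgroup byte budget. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

#define NET_XMIT_SUCCESS	0	/* continue with packet output */
#define NET_XMIT_CN		2	/* continue with packet output and do cn */

#define BYTE_BUDGET		(1 << 20)	/* arbitrary example budget */

struct bpf_map_def SEC("maps") bytes_out = {
	.type		= BPF_MAP_TYPE_ARRAY,
	.key_size	= sizeof(__u32),
	.value_size	= sizeof(__u64),
	.max_entries	= 1,
};

SEC("cgroup_skb/egress")
int egress_cn(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *sent = bpf_map_lookup_elem(&bytes_out, &key);

	if (!sent)
		return NET_XMIT_SUCCESS;

	/* Not atomic; a real program would use a per-cpu map or an
	 * atomic add. This is only to have a budget to test against.
	 */
	*sent += skb->len;

	/* Over budget: let the packet out anyway, but have the stack
	 * reduce cwnd now rather than after a drop is detected.
	 */
	if (*sent > BYTE_BUDGET)
		return NET_XMIT_CN;

	return NET_XMIT_SUCCESS;
}

char _license[] SEC("license") = "GPL";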
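For comparison, here is about the most a program can do today with
bpf_skb_ecn_set_ce() alone, under the current 1 = keep / 0 = drop
convention (reusing the made-up bytes_out map and BYTE_BUDGET from the
sketch above):

SEC("cgroup_skb/egress")
int egress_ce_only(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *sent = bpf_map_lookup_elem(&bytes_out, &key);

	if (!sent)
		return 1;

	*sent += skb->len;
	if (*sent > BYTE_BUDGET) {
		/* bpf_skb_ecn_set_ce() returns 1 only when the packet
		 * is ECN-capable and CE could be set; for non-ECT
		 * traffic the only remaining signal is a drop, and even
		 * a CE mark is not seen by the local sender until the
		 * receiver echoes it back a round trip later.
		 */
		if (!bpf_skb_ecn_set_ce(skb))
			return 0;	/* drop */
	}
	return 1;			/* keep */
}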
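And since f11216b24219 came up: what that commit enables is EDT-style
pacing from a cg skb program, along the lines of the sketch below
(DELAY_NS is a made-up constant, and again untested):

#define DELAY_NS	(1000 * 1000)	/* made-up 1ms pacing offset */

SEC("cgroup_skb/egress")
int egress_pace(struct __sk_buff *skb)
{
	/* Earliest-departure-time pacing: ask a downstream fq qdisc
	 * to hold the packet until this timestamp.
	 */
	skb->tstamp = bpf_ktime_get_ns() + DELAY_NS;
	return 1;	/* keep */
}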