On Thu, Apr 4, 2024 at 6:41 PM Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote: > > On Thu, Apr 4, 2024 at 8:27 AM Benjamin Tissoires > <benjamin.tissoi...@redhat.com> wrote: > > > > > > > > > > So we need something like: > > > > > > struct bpf_hrtimer { > > > union { > > > struct hrtimer timer; > > > + struct work_struct work; > > > }; > > > struct bpf_map *map; > > > struct bpf_prog *prog; > > > void __rcu *callback_fn; > > > void *value; > > > union { > > > > Are you sure we need an union here? If we get to call kfree_rcu() we > > need to have both struct rcu_head and sync_work, not one or the other. > > why? with an extra flag it's one or the other. > In bpf_timer_cancel_and_free() > if (flag & SLEEPABLE) { > schedule_work() to cancel_work_sync + kfree_rcu > } else { > hrtimer_cancel > kfree_rcu > }
I thought kfree_rcu required struct rcu_head, and given that we need to initialize sync_work it will be poisoned... > > > > struct rcu_head rcu; > > > + struct work_struct sync_work; > > > }; > > > + u64 flags; // bpf_timer_init() will require BPF_F_TIMER_SLEEPABLE > > > > If I understand, you want BPF_F_TIMER_SLEEPABLE in bpf_timer_init() > > (like in my v2 or v3 IIRC). But that means that once a timer is > > initialized it needs to be of one or the other type (especially true > > with the first union in this struct). > > yes. That's an idea. > The code to support wq vs timer seems to be diverging more > than what we expected initially. > It seems cleaner to set it as init time and enforce in > other helpers. OK, works for me. > > Also with two work_struct-s we're pushing the sizeof(bpf_hrtimer) > too far. > It's already at 112 bytes and some people use bpf_timer per flow. > So potentially millions of such timers. > Adding extra sizeof(struct work_struct)=32 * 2 that won't be > used is too much. > Note that sizeof(struct hrtimer)=64, so unions make everything > fit nicely. Maybe we should do union { struct hrtimer timer; struct { struct work_struct work; struct work_struct sync_work; } } (not nice to read but at least we don't change the size at the beginning) > > > So should we reject during run time bpf_timer_set_callback() for > > sleepable timers and only allow bpf_timer_set_sleepable_cb() for > > those? (and the invert in the other case). > > yes. > > > This version of the patch allows for one timer to be used as softIRQ > > or WQ, depending on the timer_set_callback that is used. But it might > > be simpler for the kfree_rcu race to define the bpf_timer to be of one > > kind, so we are sure to call the correct kfree method. > > I think one or another simplifies the code and makes it easier > to think through combinations. > > I'm still contemplating adding new "struct bpf_wq" and new kfuncs > to completely separate wq vs timer. > The code reuse seems to be relatively small. There is some code reuse in the verifier, but it can be factored out I think. Though the biggest reuse might be in the map portion of bpf_timer, which I haven't looked much TBH. > We can potentially factor out internals of bpf_timer_* into smaller > helpers and use them from bpf_timer_* and from new bpf_wq_* kfuncs. Yeah, also, given that we are going to enforce delay == 0 for sleepable timers (wq), the user api would be much cleaner if we can have a dedicated bpf_wq (and it would make the flags of bpf_timer_init easier to deal with). > > One more thing. > bpf_timer_cancel() api turned out to be troublesome. > Not only it cancels the timer, but drops callback too. > It was a surprising behavior for people familiar with > kernel timer api-s. > We should not repeat this mistake with wq. > > We can try to fix bpf_timer_cancel() too. > If we drop drop_prog_refcnt() from it it shouldn't affect > existing bpf_timer users who are forced to do: > bpf_timer_cancel() > bpf_timer_set_callback() > bpf_timer_start() > all the time. > If/when bpf_timer_cancel() stops dropping the callback > such bpf prog won't be affected. So low chance of breaking any prog. > wdyt? > How would a program know set_callback() is not required after a cancel() because the kernel kept it around? It seems that it's going to be hard for them to know that (unless by trying first a start()), and it will add more code. timer_cancel() would be hard to change but we can always do the change and add a new kfunc timer_cancel_no_drop() which would clearly allow for new programs to know that set_callback() is not required to be called. In a few kernel releases we could remove it and say that timer_cancel() is the same (and replaced by a #define) Anyway, the more I think of it, the more I think the best API would be a dedicated wq API. It's probably going to need a little bit more work, but it'll be more or less this work plus the new bpf_wq type in the map. Cheers, Benjamin