On Fri, 24 Oct 2025 07:05:50 -0700 Dave Hansen <[email protected]> wrote:
> On 10/23/25 22:16, Kuniyuki Iwashima wrote:
> >> This makes me nervous. The access_ok() check is quite a distance away.
> >> I'd kinda want to see some performance numbers before doing this. Is
> >> removing a single access_ok() even measurable?
> > I noticed I made a typo in commit message, s/tcp_rr/udp_rr/.
> >
> > epoll_put_uevent() can be called multiple times in a single
> > epoll_wait(), and we can see 1.7% more pps on UDP even when
> > 1 thread has 1000 sockets only:
> >
> > server: $ udp_rr --nolog -6 -F 1000 -T 1 -l 3600
> > client: $ udp_rr --nolog -6 -F 1000 -T 256 -l 3600 -c -H $SERVER
> > server: $ nstat > /dev/null; sleep 10; nstat | grep -i udp
> >
> > Without patch (2 stac/clac):
> > Udp6InDatagrams                 2205209            0.0
> >
> > With patch (1 stac/clac):
> > Udp6InDatagrams                 2242602            0.0
>
> I'm totally with you about removing a stac/clac:
>
> > https://lore.kernel.org/lkml/[email protected]/
>
> The thing I'm worried about is having the access_ok() so distant
> from the unsafe_put_user(). I'm wondering if this:
>
> -	__user_write_access_begin(uevent, sizeof(*uevent));
> +	if (!user_write_access_begin(uevent, sizeof(*uevent)))
> +		return NULL;
>  	unsafe_put_user(revents, &uevent->events, efault);
>  	unsafe_put_user(data, &uevent->data, efault);
>  	user_access_end();
>
> is measurably slower than what was in your series. If it is
> not measurably slower, then the series gets simpler because it
> does not need to refactor user_write_access_begin(). It also ends
> up more obviously correct because the access check is closer to
> the unsafe_put_user() calls.
>
> Also, the extra access_ok() is *much* cheaper than stac/clac.

access_ok() does contain a conditional branch - just waiting for the
misprediction penalty (say 20 clocks).
OTOH you shouldn't get that more than twice for the loop.

I'm pretty sure access_ok() itself contains an lfence - needed for reads.
But that ought to be absent from user_write_access_begin().

The 'masked' version uses ALU operations (on x86-64), doesn't need lfence
(or anything else), and doesn't contain a mispredictable branch.
It should be faster than the above - unless the code has serious register
pressure and too much gets spilled to the stack.

The timings may also depend on the cpu you are using.
I'm sure I remember some of the very recent ones having much faster
stac/clac and/or lfence.

	David
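
For context, a minimal sketch of how the two variants discussed above could
look in epoll_put_uevent(): the access check sits right next to the
unsafe_put_user() calls, with a masked-address fast path along the lines
David describes when the CPU supports it. The function shape and the use of
can_do_masked_user_access()/masked_user_access_begin() here are illustrative
assumptions, not code from the actual series:

/*
 * Sketch only, not the patch under discussion: check-next-to-the-put
 * variant with an optional ALU-masked fast path (helpers assumed to be
 * the ones declared in <linux/uaccess.h>).
 */
static inline struct epoll_event __user *
epoll_put_uevent(__poll_t revents, __u64 data,
		 struct epoll_event __user *uevent)
{
	if (can_do_masked_user_access()) {
		/* Address masking: no mispredictable branch, no lfence. */
		uevent = masked_user_access_begin(uevent);
	} else if (!user_write_access_begin(uevent, sizeof(*uevent))) {
		return NULL;
	}

	unsafe_put_user(revents, &uevent->events, efault);
	unsafe_put_user(data, &uevent->data, efault);
	user_write_access_end();

	return uevent + 1;

efault:
	user_write_access_end();
	return NULL;
}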
