Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, Mar 21, 2017 at 7:03 PM, Eric Dumazet wrote:
> On Tue, 2017-03-21 at 16:51 -0700, Kees Cook wrote:
>
>> Am I understanding you correctly that you'd want something like:
>>
>> refcount.h:
>> #ifdef UNPROTECTED_REFCOUNT
>> #define refcount_inc(x) atomic_inc(x)
>> ...
>> #else
>> void refcount_inc(...
>> ...
>> #endif
>>
>> some/net.c:
>> #define UNPROTECTED_REFCOUNT
>> #include
>>
>> or similar?
>
> At first, it could be something simple like that yes.
>
> Note that we might define two refcount_inc() : One that does whole
> tests, and refcount_inc_relaxed() that might translate to atomic_inc()
> on non debug kernels.
>
> Then later, maybe provide a dynamic infrastructure so that we can
> dynamically force the full checks even for refcount_inc_relaxed() on say
> 1% of the hosts, to get better debug coverage ?

Well, this isn't about finding bugs in normal workflows. This is about
catching bugs that attackers have found and start exploiting to gain a
use-after-free primitive. The intention is for it to be always enabled.

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Wed, Mar 22, 2017 at 07:54:04AM -0700, Eric Dumazet wrote:
>
> I guess someone could code a lib/test_refcount.c launching X threads
> using either atomic_inc or refcount_inc() in a loop.
>
> That would give a rough estimate of the refcount_t overhead among
> various platforms.

Cycles spent on uncontended ops:

                                    SKL    SNB    IVB-EP
  atomic:      lock incl            ~15    ~13    ~10
  atomic-ref:  call refcount_inc    ~31    ~37    ~31
  atomic-ref2: $inlined             ~23    ~22    ~21

Contended numbers (E3-1245 v5):

  root@skl:~/spinlocks# LOCK=./atomic ./test1.sh
  1: 14.797240
  2: 87.451230
  4: 100.747790
  8: 118.234010

  root@skl:~/spinlocks# LOCK=./atomic-ref ./test1.sh
  1: 30.627320
  2: 91.866730
  4: 111.029560
  8: 141.922420

  root@skl:~/spinlocks# LOCK=./atomic-ref2 ./test1.sh
  1: 23.243930
  2: 98.620250
  4: 119.604240
  8: 124.864380

The code includes the patches found here:

  https://lkml.kernel.org/r/20170317211918.393791...@infradead.org

and effectively does:

  #define REFCOUNT_WARN(cond, str) WARN_ON_ONCE(cond)
  s/WARN_ONCE/REFCOUNT_WARN/ on lib/refcount.c

Find the tarball of the userspace code used attached (its a bit of a
mess; its grown over time and needs a cleanup). I used:

  gcc (Debian 6.3.0-6) 6.3.0 20170205

So while its about ~20 cycles worse, reducing contention is far more
effective than removing straight line instruction count (which too is
entirely possible, because GCC generates absolute shite in places).

spinlocks.tar.bz2
Description: Binary data
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Wed, 2017-03-22 at 16:08 +0100, Peter Zijlstra wrote:
> On Wed, Mar 22, 2017 at 07:54:04AM -0700, Eric Dumazet wrote:
> > On Wed, 2017-03-22 at 15:33 +0100, Peter Zijlstra wrote:
> > >
> > > But I would feel a whole lot better about the entire thing if we could
> > > measure their impact. It would also give us good precedent to whack
> > > other potential users of _nocheck over the head with -- show numbers.
> >
> > I wont be able to measure the impact on real workloads, our productions
> > kernels are based on 4.3 at this moment.
>
> Is there really no micro bench that exercises the relevant network
> paths? Do you really fully rely on Google production workloads?

You could run a synflood test, with ~10 Mpps. sock_hold() is definitely
used in SYN handling.

Last upstream kernels do not work on my lab hosts, for whatever reason.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Wed, Mar 22, 2017 at 07:54:04AM -0700, Eric Dumazet wrote:
> On Wed, 2017-03-22 at 15:33 +0100, Peter Zijlstra wrote:
> >
> > But I would feel a whole lot better about the entire thing if we could
> > measure their impact. It would also give us good precedent to whack
> > other potential users of _nocheck over the head with -- show numbers.
>
> I wont be able to measure the impact on real workloads, our productions
> kernels are based on 4.3 at this moment.

Is there really no micro bench that exercises the relevant network
paths? Do you really fully rely on Google production workloads?

> I guess someone could code a lib/test_refcount.c launching X threads
> using either atomic_inc or refcount_inc() in a loop.
>
> That would give a rough estimate of the refcount_t overhead among
> various platforms.

Its also a fairly meaningless number. It doesn't include any of the
other work the network path does.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Wed, 2017-03-22 at 15:33 +0100, Peter Zijlstra wrote:
>
> But I would feel a whole lot better about the entire thing if we could
> measure their impact. It would also give us good precedent to whack
> other potential users of _nocheck over the head with -- show numbers.

I wont be able to measure the impact on real workloads, our productions
kernels are based on 4.3 at this moment.

I guess someone could code a lib/test_refcount.c launching X threads
using either atomic_inc or refcount_inc() in a loop.

That would give a rough estimate of the refcount_t overhead among
various platforms.
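[Editor's sketch: a userspace version of the test Eric suggests here, with N threads hammering either a plain atomic increment or a refcount-style checked increment. All names are hypothetical; no lib/test_refcount.c existed at the time of this thread.]

```c
/* N threads incrementing a shared counter either the atomic_inc way
 * or the refcount_inc way (read, test, cmpxchg loop). A sketch of
 * the suggested lib/test_refcount.c, not actual kernel code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct worker_arg {
    atomic_uint *counter;
    unsigned long iters;
    bool checked;   /* false: atomic_inc-style, true: refcount-style */
};

/* Mimic the shape of refcount_inc(): load, test, compare-exchange loop.
 * On CAS failure, 'val' is reloaded and the tests run again. */
static void checked_inc(atomic_uint *r)
{
    unsigned int val = atomic_load_explicit(r, memory_order_relaxed);

    while (val != 0 && val != UINT32_MAX &&
           !atomic_compare_exchange_weak_explicit(r, &val, val + 1,
                    memory_order_relaxed, memory_order_relaxed))
        ;
}

static void *worker(void *p)
{
    struct worker_arg *a = p;

    for (unsigned long i = 0; i < a->iters; i++) {
        if (a->checked)
            checked_inc(a->counter);
        else
            atomic_fetch_add_explicit(a->counter, 1,
                                      memory_order_relaxed);
    }
    return NULL;
}

/* Runs the workers; returns final count = 1 + nthreads * iters.
 * Timing the two variants against each other is left to the caller. */
static unsigned int run_threads(int nthreads, unsigned long iters,
                                bool checked)
{
    pthread_t tid[64];
    atomic_uint counter = 1;
    struct worker_arg a = { &counter, iters, checked };

    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &a);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&counter);
}
```

As Peter notes in his reply, the resulting number only isolates the refcount op itself; it says nothing about the op's share of a real network path.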
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Wed, Mar 22, 2017 at 06:22:16AM -0700, Eric Dumazet wrote:
> But admittedly we can replace all these by standard refcount_inc() and
> simply provide a CONFIG option to turn off the checks, and let brave
> people enable this option.

Still brings us back to lacking a real reason to provide that CONFIG
option. Not to mention that this CONFIG knob will kill the warnings for
everything, even the code that might not be as heavily audited as
network and which doesn't really care much about the performance of
refcount operations.

So I'm actually in favour of _nocheck variants, if we can show the need
for them. And I like your idea of being able to dynamically switch them
back to full debug as well.

But I would feel a whole lot better about the entire thing if we could
measure their impact. It would also give us good precedent to whack
other potential users of _nocheck over the head with -- show numbers.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, Mar 21, 2017 at 04:51:13PM -0700, Kees Cook wrote:
> On Tue, Mar 21, 2017 at 2:23 PM, Eric Dumazet wrote:
> > Unfortunately there is no good test simulating real-world workloads,
> > which are mostly using TCP flows.
>
> Sure, but there has to be _something_ that can be used to test to
> measure the effects. Without a meaningful test, it's weird to reject a
> change for performance reasons.

This. How can you optimize if there's no way to actually measure
something?

> > Most synthetic tools you can find are not using epoll(), and very often
> > hit bottlenecks in other layers.
> >
> >
> > It looks like our suggestion to get kernel builds with atomic_inc()
> > being exactly an atomic_inc() is not even discussed or implemented.
>
> So, FWIW, I originally tried to make this a CONFIG in the first couple
> passes at getting a refcount defense. I would be fine with this, but I
> was not able to convince Peter. :) However, things have evolved a lot
> since then, so perhaps there are things to be done here.

Well, the argument was that unless there's a benchmark that shows it
cares, its all premature optimization. Similarly, you wanted this
enabled at all times because hardening.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Wed, 2017-03-22 at 13:25 +0100, Peter Zijlstra wrote:
> On Tue, Mar 21, 2017 at 07:03:19PM -0700, Eric Dumazet wrote:
>
> > Note that we might define two refcount_inc() : One that does whole
> > tests, and refcount_inc_relaxed() that might translate to atomic_inc()
> > on non debug kernels.
>
> So you'd want a duplicate interface, such that most code, which doesn't
> care about refcount performance much, can still have all the tests
> enabled.
>
> But the code that cares about it (and preferably can prove it with
> numbers) can use the other.
>
> I'm also somewhat hesitant to use _relaxed for this distinction, as it
> has a clear meaning in atomics, maybe _nocheck?
>
> Also; what operations do you want _nocheck variants of, only
> refcount_inc() ?

I was mostly thinking of points where we were already checking the value
either before or after the atomic_inc(), using some lazy check (a la
WARN_ON(atomic_read(p) == 0) or something like that).

But admittedly we can replace all these by standard refcount_inc() and
simply provide a CONFIG option to turn off the checks, and let brave
people enable this option.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, Mar 21, 2017 at 07:03:19PM -0700, Eric Dumazet wrote:
> Note that we might define two refcount_inc() : One that does whole
> tests, and refcount_inc_relaxed() that might translate to atomic_inc()
> on non debug kernels.

So you'd want a duplicate interface, such that most code, which doesn't
care about refcount performance much, can still have all the tests
enabled.

But the code that cares about it (and preferably can prove it with
numbers) can use the other.

I'm also somewhat hesitant to use _relaxed for this distinction, as it
has a clear meaning in atomics, maybe _nocheck?

Also; what operations do you want _nocheck variants of, only
refcount_inc() ?

That said; I'm really loath to provide these without actual measurements
that prove they make a difference.

> Then later, maybe provide a dynamic infrastructure so that we can
> dynamically force the full checks even for refcount_inc_relaxed() on say
> 1% of the hosts, to get better debug coverage ?

Shouldn't be too hard to do in arch specific code using alternative
stuff. Generic code could use jump labels I suppose, but that would
result in bigger code.
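[Editor's sketch: what the duplicate interface under discussion could look like, transliterated to userspace C11. The `_nocheck` name was only a proposal at this point in the thread; the structure below is an illustration, not the kernel's refcount.h.]

```c
/* Sketch of the proposed pair: a checked refcount_inc() that saturates
 * instead of wrapping, and a refcount_inc_nocheck() that compiles down
 * to exactly one atomic add. Hypothetical userspace code. */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { atomic_uint refs; } refcount_t;

/* Checked variant: read, test, compare-exchange loop. */
static void refcount_inc(refcount_t *r)
{
    unsigned int val = atomic_load_explicit(&r->refs,
                                            memory_order_relaxed);

    for (;;) {
        if (val == 0) {             /* inc-from-zero: likely UAF */
            fprintf(stderr, "refcount_t: increment on 0\n");
            return;
        }
        if (val == UINT32_MAX)      /* saturated: stay pinned */
            return;
        if (atomic_compare_exchange_weak_explicit(&r->refs, &val,
                    val + 1, memory_order_relaxed,
                    memory_order_relaxed))
            return;
        /* val was reloaded by the failed compare-exchange; retry */
    }
}

/* Unchecked variant: one atomic add, no tests, can wrap. */
static void refcount_inc_nocheck(refcount_t *r)
{
    atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}
```

The cost difference Peter wants measured is the gap between these two bodies: one branch-free locked add versus a load, two tests, and a retry loop.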
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, 2017-03-21 at 16:51 -0700, Kees Cook wrote:
> Am I understanding you correctly that you'd want something like:
>
> refcount.h:
> #ifdef UNPROTECTED_REFCOUNT
> #define refcount_inc(x) atomic_inc(x)
> ...
> #else
> void refcount_inc(...
> ...
> #endif
>
> some/net.c:
> #define UNPROTECTED_REFCOUNT
> #include
>
> or similar?

At first, it could be something simple like that yes.

Note that we might define two refcount_inc() : One that does whole
tests, and refcount_inc_relaxed() that might translate to atomic_inc()
on non debug kernels.

Then later, maybe provide a dynamic infrastructure so that we can
dynamically force the full checks even for refcount_inc_relaxed() on say
1% of the hosts, to get better debug coverage ?
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, Mar 21, 2017 at 2:23 PM, Eric Dumazet wrote:
> On Tue, 2017-03-21 at 13:49 -0700, Kees Cook wrote:
>
>> Yeah, this is exactly what I'd like to find as well. Just comparing
>> cycles between refcount implementations, while interesting, doesn't
>> show us real-world performance changes, which is what we need to
>> measure.
>>
>> Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
>> elsewhere in this email thread) real-world meaningful enough?
>
> Not at all ;)
>
> This was targeting the specific change I had in mind for
> ip_idents_reserve(), which is not used by TCP flows.

Okay, I just wanted to check. I didn't think so, but it was the only
example in the thread.

> Unfortunately there is no good test simulating real-world workloads,
> which are mostly using TCP flows.

Sure, but there has to be _something_ that can be used to test to
measure the effects. Without a meaningful test, it's weird to reject a
change for performance reasons.

> Most synthetic tools you can find are not using epoll(), and very often
> hit bottlenecks in other layers.
>
>
> It looks like our suggestion to get kernel builds with atomic_inc()
> being exactly an atomic_inc() is not even discussed or implemented.

So, FWIW, I originally tried to make this a CONFIG in the first couple
passes at getting a refcount defense. I would be fine with this, but I
was not able to convince Peter. :) However, things have evolved a lot
since then, so perhaps there are things to be done here.

> Coding this would require less time than running a typical Google kernel
> qualification (roughly one month, thousands of hosts..., days of SWE).

It wasn't the issue of coding time; just that it had been specifically
not wanted. :)

Am I understanding you correctly that you'd want something like:

refcount.h:
#ifdef UNPROTECTED_REFCOUNT
#define refcount_inc(x) atomic_inc(x)
...
#else
void refcount_inc(...
...
#endif

some/net.c:
#define UNPROTECTED_REFCOUNT
#include

or similar?
-Kees

--
Kees Cook
Pixel Security
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
From: Eric Dumazet
Date: Tue, 21 Mar 2017 14:23:09 -0700

> It looks like our suggestion to get kernel builds with atomic_inc()
> being exactly an atomic_inc() is not even discussed or implemented.
>
> Coding this would require less time than running a typical Google kernel
> qualification (roughly one month, thousands of hosts..., days of SWE).

+1
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Tue, 2017-03-21 at 13:49 -0700, Kees Cook wrote:

> Yeah, this is exactly what I'd like to find as well. Just comparing
> cycles between refcount implementations, while interesting, doesn't
> show us real-world performance changes, which is what we need to
> measure.
>
> Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
> elsewhere in this email thread) real-world meaningful enough?

Not at all ;)

This was targeting the specific change I had in mind for
ip_idents_reserve(), which is not used by TCP flows.

Unfortunately there is no good test simulating real-world workloads,
which are mostly using TCP flows.

Most synthetic tools you can find are not using epoll(), and very often
hit bottlenecks in other layers.

It looks like our suggestion to get kernel builds with atomic_inc()
being exactly an atomic_inc() is not even discussed or implemented.

Coding this would require less time than running a typical Google kernel
qualification (roughly one month, thousands of hosts..., days of SWE).
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 6:40 AM, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
>> On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
>> >
>> > So what bench/setup do you want ran?
>>
>> You can start by counting how many cycles an atomic op takes
>> vs. how many cycles this new code takes.
>
> On what uarch?
>
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
>
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

Yeah, this is exactly what I'd like to find as well. Just comparing
cycles between refcount implementations, while interesting, doesn't
show us real-world performance changes, which is what we need to
measure.

Is Eric's "20 concurrent 'netperf -t UDP_STREAM'" example (from
elsewhere in this email thread) real-world meaningful enough?

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, 2017-03-20 at 09:18 -0700, Eric Dumazet wrote:
> Interesting.
>
> UDP ipv4 xmit path gets a ~25 % improvement on PPC with this patch.
>
> ( 20 concurrent netperf -t UDP_STREAM : 2.45 Mpps -> 3.07 Mpps )

Well, there _is_ a difference, but not 25 % (this was probably caused by
different queues on TX or RX between my reboots).

I added a sysctl hack to be able to dynamically change on a given
workload, and we hit other bottlenecks (mainly qdisc locks and driver tx
locks) anyway.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, 2017-03-20 at 07:59 -0700, Eric Dumazet wrote:
> On Mon, 2017-03-20 at 07:51 -0700, Eric Dumazet wrote:
>
> > atomic_cmpxchg() on PowerPC is horribly more expensive because of the
> > added two SYNC instructions.
>
> Although I just saw that refcount was using atomic_cmpxchg_relaxed()
>
> Time to find some documentation (probably missing) or get some specs for
> this thing.

Interesting.

UDP ipv4 xmit path gets a ~25 % improvement on PPC with this patch.

( 20 concurrent netperf -t UDP_STREAM : 2.45 Mpps -> 3.07 Mpps )

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8471dd116771462d149e1da2807e446b69b74bcc..9f14aebf0ae1f5f366cfff0fbf58c48603916bc7 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -497,14 +497,14 @@ u32 ip_idents_reserve(u32 hash, int segs)
 	u32 now = (u32)jiffies;
 	u32 new, delta = 0;
 
-	if (old != now && cmpxchg(p_tstamp, old, now) == old)
+	if (old != now && cmpxchg_relaxed(p_tstamp, old, now) == old)
 		delta = prandom_u32_max(now - old);
 
 	/* Do not use atomic_add_return() as it makes UBSAN unhappy */
 	do {
 		old = (u32)atomic_read(p_id);
 		new = old + delta + segs;
-	} while (atomic_cmpxchg(p_id, old, new) != old);
+	} while (atomic_cmpxchg_relaxed(p_id, old, new) != old);
 
 	return new - segs;
 }
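[Editor's sketch: the reservation loop in the patch above, rendered as userspace C11 with the same relaxed ordering the cmpxchg_relaxed() conversion uses. The jiffies/delta handling is omitted and the function name is illustrative.]

```c
/* The do { read; add; } while (cmpxchg != old) pattern from
 * ip_idents_reserve(), with relaxed memory ordering. On PPC, dropping
 * the full-barrier cmpxchg avoids the two SYNC instructions the thread
 * discusses. Simplified: no timestamp or random delta. */
#include <stdatomic.h>
#include <stdint.h>

/* Reserve 'segs' consecutive IDs; returns the first one. */
static uint32_t idents_reserve(atomic_uint *p_id, uint32_t segs)
{
    uint32_t old, new;

    do {
        old = atomic_load_explicit(p_id, memory_order_relaxed);
        new = old + segs;
        /* on failure, 'old' is also refreshed by the CAS itself,
         * but we reload at the top of the loop regardless */
    } while (!atomic_compare_exchange_weak_explicit(p_id, &old, new,
                     memory_order_relaxed, memory_order_relaxed));
    return new - segs;
}
```

Relaxed ordering is safe here because the ID counter carries no dependent data: callers only need each reservation to be unique, not ordered against other memory accesses.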
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 07:51:01AM -0700, Eric Dumazet wrote:
> PowerPC has no efficient atomic_inc() and this definitely shows on
> network intensive workloads involving concurrent cores/threads.

Correct, PPC LL/SC are dreadfully expensive.

> atomic_cmpxchg() on PowerPC is horribly more expensive because of the
> added two SYNC instructions.

Note that refcount_t uses atomic_cmpxchg_release() and
atomic_cmpxchg_relaxed() which avoid most of the painful barriers.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, 2017-03-20 at 14:40 +0100, Peter Zijlstra wrote:
> On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> > On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> > >
> > > So what bench/setup do you want ran?
> >
> > You can start by counting how many cycles an atomic op takes
> > vs. how many cycles this new code takes.
>
> On what uarch?
>
> I think I tested hand coded asm version and it ended up about double the
> cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
> the memory bus saturated, at which point they took the same). Newer
> parts will of course have different numbers,
>
> Can't we run some iperf on a 40gbe fiber loop or something? It would be
> very useful to have an actual workload we can run.

If atomic ops are converted one by one, it is likely that results will
be noise.

We can not start a global conversion without having a way to have
selective debugging? Then, adopting this fine infra would really not be
a problem.

Some arches have efficient atomic_inc() ( no full barriers ) while
"load + test + atomic_cmpxchg() + test + loop" is more expensive.

PowerPC has no efficient atomic_inc() and this definitely shows on
network intensive workloads involving concurrent cores/threads.

atomic_cmpxchg() on PowerPC is horribly more expensive because of the
added two SYNC instructions.

networking performance is quite poor on PowerPC as of today.
RE: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
From: Peter Zijlstra
> Sent: 20 March 2017 14:28
> On Mon, Mar 20, 2017 at 02:10:24PM +, David Laight wrote:
> > On x86 the cpu flags from the 'lock inc/dec' could be used to reasonably
> > cheaply detect errors - provided you actually generate a forwards branch.
>
> Note that currently there is no arch specific implementation. We could
> of course cure this.
>
> But note that the thing you propose; using the overflow flag, can only
> reasonably be done on PREEMPT=n kernels, otherwise we have an incredible
> number of contexts that can nest.
>
> Sure; getting all starts aligned to double overflow is incredibly rare,
> but I don't want to be the one to have to debug that.

One overflow would set the overflow flag, you don't need both to fail.
In any case you can use the sign flag.

Say valid count values are -64k to -256 and 0 to MAXINT. The count will
normally be +ve unless the 'main free path' has released the 64k
references it holds. If the sign bit is set after inc/dec the value is
checked; might be valid, an error, or require the item be freed.

Ok assuming the items have reasonable lifetimes and have a nominal
'delete' function.

	David
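[Editor's sketch: a simplified simulation of the biased-count idea David describes, where the owner holds a block of 64k references so stray extra decrements eat into the bias instead of freeing the object, and a negative value is the cheap "needs a closer look" signal. Entirely hypothetical; nothing like this was merged, and the exact valid-range encoding above differs from this simplification.]

```c
/* Biased reference count: the owner ('main free path') holds 64k
 * references, so an over-decrement pulls the count toward the bias
 * rather than hitting zero, and the sign serves as the cheap check. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define OWNER_BIAS 65536        /* references held by the owner */

typedef struct { atomic_int refs; } biased_ref_t;

static void biased_init(biased_ref_t *r)
{
    atomic_store(&r->refs, OWNER_BIAS);   /* owner's block */
}

static void biased_get(biased_ref_t *r)
{
    atomic_fetch_add_explicit(&r->refs, 1, memory_order_relaxed);
}

/* Returns true if the count still looks sane after the decrement.
 * A negative result means more puts than gets have eaten through the
 * owner's bias: the value must be inspected, per the scheme above. */
static bool biased_put(biased_ref_t *r)
{
    int v = atomic_fetch_sub_explicit(&r->refs, 1,
                                      memory_order_relaxed) - 1;
    return v >= 0;
}

/* Owner drops its bias last; a clean object lands exactly on zero. */
static bool biased_release_owner(biased_ref_t *r)
{
    return atomic_fetch_sub_explicit(&r->refs, OWNER_BIAS,
                                     memory_order_relaxed) == OWNER_BIAS;
}
```

The point of the bias, as David says in the other mail, is leeway: an extra decrement no longer frees the object immediately, and the common path only pays a sign test.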
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, 2017-03-20 at 07:51 -0700, Eric Dumazet wrote:
> atomic_cmpxchg() on PowerPC is horribly more expensive because of the
> added two SYNC instructions.

Although I just saw that refcount was using atomic_cmpxchg_relaxed()

Time to find some documentation (probably missing) or get some specs for
this thing.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
>
> So what bench/setup do you want ran?

You can start by counting how many cycles an atomic op takes
vs. how many cycles this new code takes.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 02:10:24PM +, David Laight wrote:
> On x86 the cpu flags from the 'lock inc/dec' could be used to reasonably
> cheaply detect errors - provided you actually generate a forwards branch.

Note that currently there is no arch specific implementation. We could
of course cure this.

But note that the thing you propose; using the overflow flag, can only
reasonably be done on PREEMPT=n kernels, otherwise we have an incredible
number of contexts that can nest.

Sure; getting all starts aligned to double overflow is incredibly rare,
but I don't want to be the one to have to debug that.
RE: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
From: Herbert Xu
> Sent: 20 March 2017 13:16
> On Mon, Mar 20, 2017 at 11:39:37AM +0100, Peter Zijlstra wrote:
> >
> > Can we at least give a benchmark and have someone run numbers? We should
> > be able to quantify these things.
>
> Do you realise how many times this thing gets hit at 10Gb/s or
> higher? Anyway, since you're proposing this change you should
> demonstrate that it does not cause a performance regression.

What checks does refcnt_t actually do?

An extra decrement is hard to detect since the item gets freed early.
I guess making the main 'allocate/free' code hold (say) 64k references
would give some leeway for extra decrements.

An extra increment will be detected when the count eventually wraps.
Unless the error is in a very common path that won't happen for a long
time.

On x86 the cpu flags from the 'lock inc/dec' could be used to reasonably
cheaply detect errors - provided you actually generate a forwards branch.

Otherwise having a common, but not every packet, code path verify that
the reference count is 'sane' would give reasonable coverage.

	David
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 11:39:37AM +0100, Peter Zijlstra wrote:
>
> Can we at least give a benchmark and have someone run numbers? We should
> be able to quantify these things.

Do you realise how many times this thing gets hit at 10Gb/s or
higher? Anyway, since you're proposing this change you should
demonstrate that it does not cause a performance regression.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 09:27:13PM +0800, Herbert Xu wrote:
> On Mon, Mar 20, 2017 at 02:23:57PM +0100, Peter Zijlstra wrote:
> >
> > So what bench/setup do you want ran?
>
> You can start by counting how many cycles an atomic op takes
> vs. how many cycles this new code takes.

On what uarch?

I think I tested hand coded asm version and it ended up about double the
cycles for a cmpxchg loop vs the direct instruction on an IVB-EX (until
the memory bus saturated, at which point they took the same). Newer
parts will of course have different numbers,

Can't we run some iperf on a 40gbe fiber loop or something? It would be
very useful to have an actual workload we can run.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Mon, Mar 20, 2017 at 09:16:29PM +0800, Herbert Xu wrote:
> On Mon, Mar 20, 2017 at 11:39:37AM +0100, Peter Zijlstra wrote:
> >
> > Can we at least give a benchmark and have someone run numbers? We should
> > be able to quantify these things.
>
> Do you realise how many times this thing gets hit at 10Gb/s or
> higher? Anyway, since you're proposing this change you should
> demonstrate that it does not cause a performance regression.

So what bench/setup do you want ran?
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Sat, Mar 18, 2017 at 06:21:21PM -0700, David Miller wrote:
> From: Herbert Xu
> Date: Sun, 19 Mar 2017 00:47:59 +0800
>
> > Eric Dumazet wrote:
> >> On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
> >>
> >>> Should we then first measure the actual numbers to understand what we
> >>> are talking here about?
> >>> I would be glad to do it if you suggest what is the correct way to do
> >>> measurements here to actually reflect the real life use cases.
> >>
> >> How have these patches been tested in real life exactly ?
> >>
> >> Can you quantify number of added cycles per TCP packet, where I expect
> >> we have maybe 20 atomic operations in all layers ...
> >
> > I completely agree. I think this thing needs to default to the
> > existing atomic_t behaviour.
>
> I totally agree as well, the refcount_t facility as-is is unacceptable
> for networking.

Can we at least give a benchmark and have someone run numbers? We should
be able to quantify these things.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
From: Herbert Xu
Date: Sun, 19 Mar 2017 00:47:59 +0800

> Eric Dumazet wrote:
>> On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
>>
>>> Should we then first measure the actual numbers to understand what we
>>> are talking here about?
>>> I would be glad to do it if you suggest what is the correct way to do
>>> measurements here to actually reflect the real life use cases.
>>
>> How have these patches been tested in real life exactly ?
>>
>> Can you quantify number of added cycles per TCP packet, where I expect
>> we have maybe 20 atomic operations in all layers ...
>
> I completely agree. I think this thing needs to default to the
> existing atomic_t behaviour.

I totally agree as well, the refcount_t facility as-is is unacceptable
for networking.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
Eric Dumazet wrote:
> On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
>
>> Should we then first measure the actual numbers to understand what we
>> are talking here about?
>> I would be glad to do it if you suggest what is the correct way to do
>> measurements here to actually reflect the real life use cases.
>
> How have these patches been tested in real life exactly ?
>
> Can you quantify number of added cycles per TCP packet, where I expect
> we have maybe 20 atomic operations in all layers ...

I completely agree. I think this thing needs to default to the
existing atomic_t behaviour.

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Fri, 2017-03-17 at 07:42 +, Reshetova, Elena wrote:
> Should we then first measure the actual numbers to understand what we
> are talking here about?
> I would be glad to do it if you suggest what is the correct way to do
> measurements here to actually reflect the real life use cases.

How have these patches been tested in real life exactly ?

Can you quantify number of added cycles per TCP packet, where I expect
we have maybe 20 atomic operations in all layers ...

(sk refcnt, skb->users, page refcounts, sk->sk_wmem_alloc,
sk->sk_rmem_alloc, qdisc ...)

Once we 'protect' all of them, cost will be quite high.

This translates to more fossil fuel being burnt.

one atomic_inc() used to be a single x86 instruction.

Rough estimate of refcount_inc() :

0140 <refcount_inc>:
 140:	55                   	push   %rbp
 141:	48 89 e5             	mov    %rsp,%rbp
 144:	e8 00 00 00 00       	callq  refcount_inc_not_zero
 149:	84 c0                	test   %al,%al
 14b:	74 02                	je     14f
 14d:	5d                   	pop    %rbp
 14e:	c3                   	retq

00e0 <refcount_inc_not_zero>:
  e0:	8b 17                	mov    (%rdi),%edx
  e2:	eb 10                	jmp    f4
  e4:	85 c9                	test   %ecx,%ecx
  e6:	74 1b                	je     103
  e8:	89 d0                	mov    %edx,%eax
  ea:	f0 0f b1 0f          	lock cmpxchg %ecx,(%rdi)
  ee:	39 d0                	cmp    %edx,%eax
  f0:	74 0c                	je     fe
  f2:	89 c2                	mov    %eax,%edx
  f4:	85 d2                	test   %edx,%edx
  f6:	8d 4a 01             	lea    0x1(%rdx),%ecx
  f9:	75 e9                	jne    e4
  fb:	31 c0                	xor    %eax,%eax
  fd:	c3                   	retq
  fe:	83 f9 ff             	cmp    $0xffffffff,%ecx
 101:	74 06                	je     109
 103:	b8 01 00 00 00       	mov    $0x1,%eax
 108:	c3                   	retq

This is simply bloat for most cases.

Again, I believe this infrastructure makes sense for debugging kernels.

If some vendors are willing to run fully enabled debug kernels, that is
their choice. Probably many devices wont show any difference.

Have we forced KASAN being enabled in linux kernel, just because it
found ~400 bugs so far ?

I believe refcount_t infra is not mature enough to be widely used right
now. Maybe in few months when we have more flexibility, like existing
debugging facilities (CONFIG_DEBUG_PAGEALLOC, CONFIG_DEBUG_PAGE_REF,
LOCKDEP, KMEMLEAK, KASAN, ...)
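[Editor's sketch: the C logic behind the refcount_inc_not_zero disassembly above, transliterated to userspace C11. This is an illustration of the read/test/lea/lock-cmpxchg/saturation-check loop, not a copy of lib/refcount.c.]

```c
/* The cmpxchg loop that compiles to the listing above: load the
 * count, refuse to move off zero, pin at saturation, otherwise try
 * a locked compare-exchange and retry on contention. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { atomic_uint refs; } refcount_t;

static bool refcount_inc_not_zero(refcount_t *r)
{
    unsigned int val = atomic_load_explicit(&r->refs,
                                            memory_order_relaxed);
    unsigned int new;

    for (;;) {
        new = val + 1;          /* the 'lea 0x1(%rdx),%ecx' */
        if (val == 0)           /* refuse to resurrect a dead object */
            return false;
        if (new == 0)           /* was saturated at UINT32_MAX: stay */
            return true;
        /* the 'lock cmpxchg'; on failure val is reloaded, retry */
        if (atomic_compare_exchange_weak_explicit(&r->refs, &val, new,
                    memory_order_relaxed, memory_order_relaxed))
            return true;
    }
}
```

Compare this with the single `lock incl` that atomic_inc() compiles to, and the bloat complaint above is concrete: a call, a load, two tests, and a potentially-retried locked compare-exchange per increment.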
RE: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
> From: Kees Cook
> Date: Thu, 16 Mar 2017 11:38:25 -0600
>
> > I am, of course, biased, but I think the evidence of actual
> > refcounting attacks outweighs the theoretical performance cost of
> > these changes.
>
> This is not theoretical at all.
>
> We count the nanoseconds that every packet takes to get processed and
> you are adding quite a bit.
>
> I understand your point of view, but this is knowingly going to add
> performance regressions to the networking code.

Should we then first measure the actual numbers to understand what we
are talking here about?

I would be glad to do it if you suggest what is the correct way to do
measurements here to actually reflect the real life use cases.

Best Regards,
Elena.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
From: Kees Cook
Date: Thu, 16 Mar 2017 11:38:25 -0600

> I am, of course, biased, but I think the evidence of actual
> refcounting attacks outweighs the theoretical performance cost of
> these changes.

This is not theoretical at all.

We count the nanoseconds that every packet takes to get processed and
you are adding quite a bit.

I understand your point of view, but this is knowingly going to add
performance regressions to the networking code.
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Thu, Mar 16, 2017 at 10:58 AM, Eric Dumazet wrote:
> On Thu, 2017-03-16 at 17:28 +0200, Elena Reshetova wrote:
>> refcount_t type and corresponding API should be
>> used instead of atomic_t when the variable is used as
>> a reference counter. This allows to avoid accidental
>> refcounter overflows that might lead to use-after-free
>> situations.
>
> ...
>
>> static __always_inline void sock_hold(struct sock *sk)
>> {
>> -	atomic_inc(&sk->sk_refcnt);
>> +	refcount_inc(&sk->sk_refcnt);
>> }
>
> While I certainly see the value of these refcount_t, we have a very
> different behavior on these atomic_inc() which were doing a single
> inlined LOCK RMW on x86.

I think we can certainly investigate arch-specific ways to improve the
performance, but the consensus seemed to be that getting the
infrastructure in and doing the migration was the first set of steps.

> We now call an external function performing a
> atomic_read(), various ops/tests, then atomic_cmpxchg_relaxed(), in a
> loop, loosing the nice ability for x86 of preventing live locks.
>
> Looks a lot of bloat, just to be able to chase hypothetical bugs in the
> kernel.
>
> I would love to have a way to enable extra debugging when I want a debug
> kernel, like LOCKDEP or KASAN.
>
> By adding all this bloat, we assert linux kernel is terminally buggy and
> every atomic_inc() we did was suspicious, and need to be always
> instrumented/validated.

This IS the assertion, unfortunately. With average 5 year lifetimes on
security flaws[1], and many of the last couple years' public exploits
being refcount flaws[2], this is something we have to get done. We need
the default kernel to be much more self-protective, and this is one of
many places to make it happen.

I am, of course, biased, but I think the evidence of actual refcounting
attacks outweighs the theoretical performance cost of these changes.

If there is a realistic workflow that shows actual problems, let's
examine it and find a solution for that as a separate part of this work
without blocking this migration.

-Kees

[1] https://outflux.net/blog/archives/2016/10/18/security-bug-lifetime/
[2] http://kernsec.org/wiki/index.php/Bug_Classes/Integer_overflow

--
Kees Cook
Pixel Security
Re: [PATCH 07/17] net: convert sock.sk_refcnt from atomic_t to refcount_t
On Thu, 2017-03-16 at 17:28 +0200, Elena Reshetova wrote:
> refcount_t type and corresponding API should be
> used instead of atomic_t when the variable is used as
> a reference counter. This allows to avoid accidental
> refcounter overflows that might lead to use-after-free
> situations.

...

> static __always_inline void sock_hold(struct sock *sk)
> {
> -	atomic_inc(&sk->sk_refcnt);
> +	refcount_inc(&sk->sk_refcnt);
> }

While I certainly see the value of these refcount_t, we have a very
different behavior on these atomic_inc() which were doing a single
inlined LOCK RMW on x86.

We now call an external function performing a atomic_read(), various
ops/tests, then atomic_cmpxchg_relaxed(), in a loop, loosing the nice
ability for x86 of preventing live locks.

Looks a lot of bloat, just to be able to chase hypothetical bugs in the
kernel.

I would love to have a way to enable extra debugging when I want a debug
kernel, like LOCKDEP or KASAN.

By adding all this bloat, we assert linux kernel is terminally buggy and
every atomic_inc() we did was suspicious, and need to be always
instrumented/validated.