Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
On Mon, Jul 24, 2017 at 10:09:32PM +1000, Michael Ellerman wrote:
> Peter Zijlstra writes:
> > anyway, and the fact that your LL/SC is horrendously slow in any case.
>
> Boo :/

:-)

> Just kidding. I suspect you're right that we can probably pack a
> reasonable amount of tests in the body of the LL/SC and not notice.
>
> > Also, I still haven't seen an actual benchmark where our cmpxchg loop
> > actually regresses anything, just a lot of yelling about potential
> > regressions :/
>
> Heh yeah. Though I have looked at the code it generates on PPC and it's
> not sleek, though I guess that's not a benchmark is it :)

Oh for sure, GCC still can't sanely convert a cmpxchg loop (esp. if the
cmpxchg is implemented using asm) into a native LL/SC sequence, so the
generic code will end up looking pretty horrendous. A native
implementation of the same semantics should look loads better.

One thing that might help you is that refcount_dec_and_test() is weaker
than atomic_dec_and_test() wrt ordering, so that might help some
(RELEASE vs fully ordered).
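[Editor's note: Peter's ordering point (RELEASE for the decrement, with an
acquire only on the final drop) can be sketched with C11 atomics. This is an
illustrative model, not the kernel's implementation; the function name is
invented for the sketch.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative C11 model of the weaker ordering: the decrement itself
 * only needs RELEASE ordering; only the thread that drops the last
 * reference (and will go on to free the object) pairs it with an
 * ACQUIRE fence.  A fully ordered atomic_dec_and_test() would instead
 * use memory_order_seq_cst on every decrement. */
static bool refcount_dec_and_test_sketch(atomic_int *r)
{
	/* fetch_sub returns the previous value; 1 means we took it to 0 */
	if (atomic_fetch_sub_explicit(r, 1, memory_order_release) == 1) {
		/* order the upcoming free after all prior RELEASE decrements */
		atomic_thread_fence(memory_order_acquire);
		return true;
	}
	return false;
}
```

The reason this matters on PPC is that, roughly, RELEASE/ACQUIRE can be
implemented with the cheaper lwsync-style barriers rather than a full sync.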
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
Peter Zijlstra writes:
> On Mon, Jul 24, 2017 at 04:38:06PM +1000, Michael Ellerman wrote:
>
>> What I'm not entirely clear on is what the best trade off is in terms of
>> overhead vs checks. The summary of behaviour between the fast and full
>> versions you promised Ingo will help there I think.
>
> That's something that's probably completely different for PPC than it is
> for x86.

Yeah definitely.

I guess I see the x86 version as a lower bound on the semantics we'd
need to implement and still claim to implement the refcount stuff.

> Both because your primitive is LL/SC and thus the saturation
> semantics we need a cmpxchg loop for are more natural in your case

Yay!

> anyway, and the fact that your LL/SC is horrendously slow in any case.

Boo :/

Just kidding. I suspect you're right that we can probably pack a
reasonable amount of tests in the body of the LL/SC and not notice.

> Also, I still haven't seen an actual benchmark where our cmpxchg loop
> actually regresses anything, just a lot of yelling about potential
> regressions :/

Heh yeah. Though I have looked at the code it generates on PPC and it's
not sleek, though I guess that's not a benchmark is it :)

cheers
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
On Mon, Jul 24, 2017 at 04:38:06PM +1000, Michael Ellerman wrote:
> What I'm not entirely clear on is what the best trade off is in terms of
> overhead vs checks. The summary of behaviour between the fast and full
> versions you promised Ingo will help there I think.

That's something that's probably completely different for PPC than it is
for x86. Both because your primitive is LL/SC and thus the saturation
semantics we need a cmpxchg loop for are more natural in your case
anyway, and the fact that your LL/SC is horrendously slow in any case.

Also, I still haven't seen an actual benchmark where our cmpxchg loop
actually regresses anything, just a lot of yelling about potential
regressions :/
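[Editor's note: the "cmpxchg loop for saturation semantics" mentioned above
looks roughly like the following C11 sketch. Names and checks here are
simplified for illustration; the real generic refcount code differs in
detail.]

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

#define REFCOUNT_SATURATED (INT_MIN / 2)

/* Sketch of a cmpxchg-style saturating increment: read the old value,
 * test it, and retry the compare-and-swap until it wins.  On an LL/SC
 * architecture the same tests could sit directly between the
 * load-and-reserve and the store-conditional instead. */
static void refcount_inc_sketch(atomic_int *r)
{
	int old = atomic_load_explicit(r, memory_order_relaxed);

	do {
		if (old < 0 || old == INT_MAX) {
			/* already saturated, or about to overflow: pin it */
			atomic_store_explicit(r, REFCOUNT_SATURATED,
					      memory_order_relaxed);
			return;
		}
		/* on CAS failure, old is refreshed with the current value */
	} while (!atomic_compare_exchange_weak_explicit(r, &old, old + 1,
							memory_order_relaxed,
							memory_order_relaxed));
}
```

Each retry is a full cmpxchg on x86, which is why the loop is less natural
there than inside a native LL/SC sequence.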
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
Kees Cook writes:
> On Fri, Jul 21, 2017 at 2:22 PM, Andrew Morton wrote:
>>
>> Do we have a feeling for how feasible/difficult it will be for other
>> architectures to implement such a thing?
>
> The PaX atomic_t overflow protection this is heavily based on was
> ported to a number of architectures (arm, powerpc, mips, sparc), so I
> suspect it shouldn't be too hard to adapt those for the more narrow
> refcount_t protection:
> https://forums.grsecurity.net/viewtopic.php?f=7&t=4173

The ppc code there appears to be 32-bit only and has other issues so
probably isn't something we'd use.

I don't think there should be any fundamental problem implementing it.

What I'm not entirely clear on is what the best trade off is in terms of
overhead vs checks. The summary of behaviour between the fast and full
versions you promised Ingo will help there I think.

cheers
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
On Fri, Jul 21, 2017 at 2:22 PM, Andrew Morton wrote:
> On Thu, 20 Jul 2017 11:11:06 +0200 Ingo Molnar wrote:
>
>> * Kees Cook wrote:
>>
>> > This implements refcount_t overflow protection on x86 without a
>> > noticeable performance impact, though without the fuller checking of
>> > REFCOUNT_FULL. This is done by duplicating the existing atomic_t
>> > refcount implementation but with normally a single instruction added
>> > to detect if the refcount has gone negative (i.e. wrapped past
>> > INT_MAX or below zero). When detected, the handler saturates the
>> > refcount_t to INT_MIN / 2. With this overflow protection, the
>> > erroneous reference release that would follow a wrap back to zero is
>> > blocked from happening, avoiding the class of refcount-over-increment
>> > use-after-free vulnerabilities entirely.
>> >
>> > Only the overflow case of refcounting can be perfectly protected,
>> > since it can be detected and stopped before the reference is freed
>> > and left to be abused by an attacker. This implementation also
>> > notices some of the "dec to 0 without test", and "below 0" cases.
>> > However, these only indicate that a use-after-free may have already
>> > happened. Such notifications are likely avoidable by an attacker that
>> > has already exploited a use-after-free vulnerability, but it's better
>> > to have them than allow such conditions to remain universally silent.
>> >
>> > On first overflow detection, the refcount value is reset to
>> > INT_MIN / 2 (which serves as a saturation value), the offending
>> > process is killed, and a report and stack trace are produced. When
>> > operations detect only negative value results (such as changing an
>> > already saturated value), saturation still happens but no
>> > notification is performed (since the value was already saturated).
>> >
>> > On the matter of races, since the entire range beyond INT_MAX but
>> > before 0 is negative, every operation at INT_MIN / 2 will trap,
>> > leaving no overflow-only race condition.
>> >
>> > As for performance, this implementation adds a single "js"
>> > instruction to the regular execution flow of a copy of the standard
>> > atomic_t refcount operations. (The non-"and_test" refcount_dec()
>> > function, which is uncommon in regular refcount design patterns, has
>> > an additional "jz" instruction to detect reaching exactly zero.)
>> > Since this is a forward jump, it is by default the non-predicted
>> > path, which will be reinforced by dynamic branch prediction. The
>> > result is this protection having virtually no measurable change in
>> > performance over standard atomic_t operations. The error path,
>> > located in .text.unlikely, saves the refcount location and then uses
>> > UD0 to fire a refcount exception handler, which resets the refcount,
>> > handles reporting, and returns to regular execution. This keeps the
>> > changes to .text size minimal, avoiding return jumps and open-coded
>> > calls to the error reporting routine.
>>
>> Pretty nice!
>
> Yes, this is a relief.
>
> Do we have a feeling for how feasible/difficult it will be for other
> architectures to implement such a thing?

The PaX atomic_t overflow protection this is heavily based on was
ported to a number of architectures (arm, powerpc, mips, sparc), so I
suspect it shouldn't be too hard to adapt those for the more narrow
refcount_t protection:
https://forums.grsecurity.net/viewtopic.php?f=7&t=4173

And an arm64 port of the fast refcount_t protection is already
happening too.

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
On Thu, 20 Jul 2017 11:11:06 +0200 Ingo Molnar wrote:

> * Kees Cook wrote:
>
> > This implements refcount_t overflow protection on x86 without a
> > noticeable performance impact, though without the fuller checking of
> > REFCOUNT_FULL. This is done by duplicating the existing atomic_t
> > refcount implementation but with normally a single instruction added
> > to detect if the refcount has gone negative (i.e. wrapped past
> > INT_MAX or below zero). When detected, the handler saturates the
> > refcount_t to INT_MIN / 2. With this overflow protection, the
> > erroneous reference release that would follow a wrap back to zero is
> > blocked from happening, avoiding the class of refcount-over-increment
> > use-after-free vulnerabilities entirely.
> >
> > Only the overflow case of refcounting can be perfectly protected,
> > since it can be detected and stopped before the reference is freed
> > and left to be abused by an attacker. This implementation also
> > notices some of the "dec to 0 without test", and "below 0" cases.
> > However, these only indicate that a use-after-free may have already
> > happened. Such notifications are likely avoidable by an attacker that
> > has already exploited a use-after-free vulnerability, but it's better
> > to have them than allow such conditions to remain universally silent.
> >
> > On first overflow detection, the refcount value is reset to
> > INT_MIN / 2 (which serves as a saturation value), the offending
> > process is killed, and a report and stack trace are produced. When
> > operations detect only negative value results (such as changing an
> > already saturated value), saturation still happens but no
> > notification is performed (since the value was already saturated).
> >
> > On the matter of races, since the entire range beyond INT_MAX but
> > before 0 is negative, every operation at INT_MIN / 2 will trap,
> > leaving no overflow-only race condition.
> >
> > As for performance, this implementation adds a single "js"
> > instruction to the regular execution flow of a copy of the standard
> > atomic_t refcount operations. (The non-"and_test" refcount_dec()
> > function, which is uncommon in regular refcount design patterns, has
> > an additional "jz" instruction to detect reaching exactly zero.)
> > Since this is a forward jump, it is by default the non-predicted
> > path, which will be reinforced by dynamic branch prediction. The
> > result is this protection having virtually no measurable change in
> > performance over standard atomic_t operations. The error path,
> > located in .text.unlikely, saves the refcount location and then uses
> > UD0 to fire a refcount exception handler, which resets the refcount,
> > handles reporting, and returns to regular execution. This keeps the
> > changes to .text size minimal, avoiding return jumps and open-coded
> > calls to the error reporting routine.
>
> Pretty nice!

Yes, this is a relief.

Do we have a feeling for how feasible/difficult it will be for other
architectures to implement such a thing?
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
* Kees Cook wrote:

> On Thu, Jul 20, 2017 at 10:15 AM, Kees Cook wrote:
> > On Thu, Jul 20, 2017 at 2:11 AM, Ingo Molnar wrote:
> >> Could you please also create a tabulated quick-comparison of the
> >> three variants, of all key properties, about behavior, feature and
> >> tradeoff differences?
> >>
> >> Something like:
> >>
> >>                              !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y
> >>
> >>  avg fast path instructions:  5                   3                    10
> >>  behavior on overflow:        unsafe, silent      safe, verbose        safe, verbose
> >>  behavior on underflow:       unsafe, silent      unsafe, verbose      unsafe, verbose
> >>  ...
> >>
> >> etc. - note that this table is just a quick mockup with wild guesses.
> >> (Please add more comparisons of other aspects as well.)
> >>
> >> Such a comparison would make it easier for arch, subsystem and
> >> distribution maintainers to decide on which variant to use/enable.
> >
> > Sure, I can write this up. I'm not sure "safe"/"unsafe" is quite that
> > clean. The differences between -full and -fast are pretty subtle, but
> > I think I can describe it using the updated LKDTM tests I've written
> > to compare the two. There are conditions that -fast doesn't catch, but
> > those cases aren't actually useful for the overflow defense.
> >
> > As for "avg fast path instructions", do you mean the resulting
> > assembly for each refcount API function? I think it's going to look
> > something like "1 2 45", but I'll write it up.
>
> So, doing a worst-case timing of a loop of inc() to INT_MAX and then
> dec_and_test() back to zero, I see this out of perf:
>
> atomic
>     25255.114805  task-clock (msec)
>      82249267387  cycles
>      11208720041  instructions
>
> refcount-fast
>     25259.577583  task-clock (msec)
>      82211446892  cycles
>      15486246572  instructions
>
> refcount-full
>     44625.923432  task-clock (msec)
>     144814735193  cycles
>     105937495952  instructions
>
> I'll still summarize all this in the v7 series, but I think that
> really clarifies the differences: 1.5x more instructions in -fast, but
> nearly identical cycles and clock. Using -full sees a large change (as
> expected).

Ok, that's pretty convincing - I'd suggest including a cycles row in the
table instead of an instructions row: the number of instructions is
indeed slightly misleading in this case.

Thanks,

	Ingo
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
On Thu, Jul 20, 2017 at 10:15 AM, Kees Cook wrote:
> On Thu, Jul 20, 2017 at 2:11 AM, Ingo Molnar wrote:
>> Could you please also create a tabulated quick-comparison of the three
>> variants, of all key properties, about behavior, feature and tradeoff
>> differences?
>>
>> Something like:
>>
>>                              !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y
>>
>>  avg fast path instructions:  5                   3                    10
>>  behavior on overflow:        unsafe, silent      safe, verbose        safe, verbose
>>  behavior on underflow:       unsafe, silent      unsafe, verbose      unsafe, verbose
>>  ...
>>
>> etc. - note that this table is just a quick mockup with wild guesses.
>> (Please add more comparisons of other aspects as well.)
>>
>> Such a comparison would make it easier for arch, subsystem and
>> distribution maintainers to decide on which variant to use/enable.
>
> Sure, I can write this up. I'm not sure "safe"/"unsafe" is quite that
> clean. The differences between -full and -fast are pretty subtle, but
> I think I can describe it using the updated LKDTM tests I've written
> to compare the two. There are conditions that -fast doesn't catch, but
> those cases aren't actually useful for the overflow defense.
>
> As for "avg fast path instructions", do you mean the resulting
> assembly for each refcount API function? I think it's going to look
> something like "1 2 45", but I'll write it up.

So, doing a worst-case timing of a loop of inc() to INT_MAX and then
dec_and_test() back to zero, I see this out of perf:

atomic
    25255.114805  task-clock (msec)
     82249267387  cycles
     11208720041  instructions

refcount-fast
    25259.577583  task-clock (msec)
     82211446892  cycles
     15486246572  instructions

refcount-full
    44625.923432  task-clock (msec)
    144814735193  cycles
    105937495952  instructions

I'll still summarize all this in the v7 series, but I think that
really clarifies the differences: 1.5x more instructions in -fast, but
nearly identical cycles and clock. Using -full sees a large change (as
expected).

-Kees

--
Kees Cook
Pixel Security
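[Editor's note: the worst-case workload Kees describes can be sketched in
plain C. Everything here is illustrative: ITERS stands in for INT_MAX so the
demo finishes quickly, run_workload() is a made-up name, and the numbers above
came from perf stat over the kernel's own refcount operations.]

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in iteration count so the sketch runs in milliseconds; the
 * real measurement hammered the counter all the way to INT_MAX. */
enum { ITERS = 1 << 20 };

/* Phase 1: increment ITERS times.  Phase 2: decrement-and-test back
 * down to zero.  Returns the total number of refcount operations. */
static long run_workload(atomic_int *r)
{
	long ops = 0;

	for (int i = 0; i < ITERS; i++) {	/* inc() phase */
		atomic_fetch_add_explicit(r, 1, memory_order_relaxed);
		ops++;
	}
	/* dec_and_test() phase: fetch_sub returns the previous value,
	 * so seeing 1 means this decrement took the counter to zero */
	while (atomic_fetch_sub_explicit(r, 1, memory_order_release) != 1)
		ops++;
	return ops + 1;			/* count the final decrement too */
}
```

Wrapped in perf stat, a loop like this exercises nothing but the refcount
fast path, which is what makes the cycles comparison meaningful.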
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
On Thu, Jul 20, 2017 at 2:11 AM, Ingo Molnar wrote:
> Could you please also create a tabulated quick-comparison of the three
> variants, of all key properties, about behavior, feature and tradeoff
> differences?
>
> Something like:
>
>                              !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y
>
>  avg fast path instructions:  5                   3                    10
>  behavior on overflow:        unsafe, silent      safe, verbose        safe, verbose
>  behavior on underflow:       unsafe, silent      unsafe, verbose      unsafe, verbose
>  ...
>
> etc. - note that this table is just a quick mockup with wild guesses.
> (Please add more comparisons of other aspects as well.)
>
> Such a comparison would make it easier for arch, subsystem and
> distribution maintainers to decide on which variant to use/enable.

Sure, I can write this up. I'm not sure "safe"/"unsafe" is quite that
clean. The differences between -full and -fast are pretty subtle, but
I think I can describe it using the updated LKDTM tests I've written
to compare the two. There are conditions that -fast doesn't catch, but
those cases aren't actually useful for the overflow defense.

As for "avg fast path instructions", do you mean the resulting
assembly for each refcount API function? I think it's going to look
something like "1 2 45", but I'll write it up.

-Kees

--
Kees Cook
Pixel Security
Re: [PATCH v6 0/2] x86: Implement fast refcount overflow protection
* Kees Cook wrote:

> This implements refcount_t overflow protection on x86 without a
> noticeable performance impact, though without the fuller checking of
> REFCOUNT_FULL. This is done by duplicating the existing atomic_t
> refcount implementation but with normally a single instruction added
> to detect if the refcount has gone negative (i.e. wrapped past
> INT_MAX or below zero). When detected, the handler saturates the
> refcount_t to INT_MIN / 2. With this overflow protection, the
> erroneous reference release that would follow a wrap back to zero is
> blocked from happening, avoiding the class of refcount-over-increment
> use-after-free vulnerabilities entirely.
>
> Only the overflow case of refcounting can be perfectly protected,
> since it can be detected and stopped before the reference is freed
> and left to be abused by an attacker. This implementation also
> notices some of the "dec to 0 without test", and "below 0" cases.
> However, these only indicate that a use-after-free may have already
> happened. Such notifications are likely avoidable by an attacker that
> has already exploited a use-after-free vulnerability, but it's better
> to have them than allow such conditions to remain universally silent.
>
> On first overflow detection, the refcount value is reset to
> INT_MIN / 2 (which serves as a saturation value), the offending
> process is killed, and a report and stack trace are produced. When
> operations detect only negative value results (such as changing an
> already saturated value), saturation still happens but no
> notification is performed (since the value was already saturated).
>
> On the matter of races, since the entire range beyond INT_MAX but
> before 0 is negative, every operation at INT_MIN / 2 will trap,
> leaving no overflow-only race condition.
>
> As for performance, this implementation adds a single "js"
> instruction to the regular execution flow of a copy of the standard
> atomic_t refcount operations. (The non-"and_test" refcount_dec()
> function, which is uncommon in regular refcount design patterns, has
> an additional "jz" instruction to detect reaching exactly zero.)
> Since this is a forward jump, it is by default the non-predicted
> path, which will be reinforced by dynamic branch prediction. The
> result is this protection having virtually no measurable change in
> performance over standard atomic_t operations. The error path,
> located in .text.unlikely, saves the refcount location and then uses
> UD0 to fire a refcount exception handler, which resets the refcount,
> handles reporting, and returns to regular execution. This keeps the
> changes to .text size minimal, avoiding return jumps and open-coded
> calls to the error reporting routine.

Pretty nice!

Could you please also create a tabulated quick-comparison of the three
variants, of all key properties, about behavior, feature and tradeoff
differences?

Something like:

                             !ARCH_HAS_REFCOUNT  ARCH_HAS_REFCOUNT=y  REFCOUNT_FULL=y

 avg fast path instructions:  5                   3                    10
 behavior on overflow:        unsafe, silent      safe, verbose        safe, verbose
 behavior on underflow:       unsafe, silent      unsafe, verbose      unsafe, verbose
 ...

etc. - note that this table is just a quick mockup with wild guesses.
(Please add more comparisons of other aspects as well.)

Such a comparison would make it easier for arch, subsystem and
distribution maintainers to decide on which variant to use/enable.

Thanks,

	Ingo
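[Editor's note: the detect-and-saturate semantics in the quoted commit log can
be modelled in plain, non-atomic C. All names here are invented for
illustration; the real implementation is inline asm plus an out-of-line UD0
exception handler, not this helper.]

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define REFCOUNT_SATURATED (INT_MIN / 2)

/* Model state: the counter plus a flag standing in for "kill the
 * offending process and print a stack trace". */
struct refcount_model {
	int val;
	bool reported;
};

static void refcount_inc_model(struct refcount_model *r)
{
	int old = r->val;

	/* the normal fast-path increment; wraparound is modelled via
	 * unsigned arithmetic to keep the sketch well defined */
	r->val = (int)((unsigned int)old + 1u);
	if (r->val < 0) {			/* the single added sign test ("js") */
		if (old == INT_MAX)		/* first detection: a genuine wrap */
			r->reported = true;	/* stands in for kill + report */
		r->val = REFCOUNT_SATURATED;	/* pin at the saturation value */
	}
}
```

Note how an already-saturated counter stays pinned at INT_MIN / 2 without a
second report, matching the "no notification when the value was already
saturated" behaviour described above.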
[PATCH v6 0/2] x86: Implement fast refcount overflow protection
This series implements a fast refcount overflow protection for x86,
which is needed to provide coverage for the several refcount-overflow
use-after-free flaws the kernel has seen over the last many years. For
example, here are two from 2016:

http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/
http://cyseclabs.com/page?n=02012016

Patch 1 provides support for adding additional assembly to the
GEN_*_RMWcc macros, and patch 2 does the real work. (I left Josh's
Reviewed-by since the exception handler code has remained the same.)
Here is patch 2's full commit log:

This implements refcount_t overflow protection on x86 without a
noticeable performance impact, though without the fuller checking of
REFCOUNT_FULL. This is done by duplicating the existing atomic_t
refcount implementation but with normally a single instruction added
to detect if the refcount has gone negative (i.e. wrapped past
INT_MAX or below zero). When detected, the handler saturates the
refcount_t to INT_MIN / 2. With this overflow protection, the
erroneous reference release that would follow a wrap back to zero is
blocked from happening, avoiding the class of refcount-over-increment
use-after-free vulnerabilities entirely.

Only the overflow case of refcounting can be perfectly protected,
since it can be detected and stopped before the reference is freed
and left to be abused by an attacker. This implementation also
notices some of the "dec to 0 without test", and "below 0" cases.
However, these only indicate that a use-after-free may have already
happened. Such notifications are likely avoidable by an attacker that
has already exploited a use-after-free vulnerability, but it's better
to have them than allow such conditions to remain universally silent.

On first overflow detection, the refcount value is reset to
INT_MIN / 2 (which serves as a saturation value), the offending
process is killed, and a report and stack trace are produced. When
operations detect only negative value results (such as changing an
already saturated value), saturation still happens but no
notification is performed (since the value was already saturated).

On the matter of races, since the entire range beyond INT_MAX but
before 0 is negative, every operation at INT_MIN / 2 will trap,
leaving no overflow-only race condition.

As for performance, this implementation adds a single "js"
instruction to the regular execution flow of a copy of the standard
atomic_t refcount operations. (The non-"and_test" refcount_dec()
function, which is uncommon in regular refcount design patterns, has
an additional "jz" instruction to detect reaching exactly zero.)
Since this is a forward jump, it is by default the non-predicted
path, which will be reinforced by dynamic branch prediction. The
result is this protection having virtually no measurable change in
performance over standard atomic_t operations. The error path,
located in .text.unlikely, saves the refcount location and then uses
UD0 to fire a refcount exception handler, which resets the refcount,
handles reporting, and returns to regular execution. This keeps the
changes to .text size minimal, avoiding return jumps and open-coded
calls to the error reporting routine.

Example assembly comparison:

refcount_inc before:
.text:
 81546149:	f0 ff 45 f4        	lock incl -0xc(%rbp)

refcount_inc after:
.text:
 81546149:	f0 ff 45 f4        	lock incl -0xc(%rbp)
 8154614d:	0f 88 80 d5 17 00  	js     816c36d3
 ...
.text.unlikely:
 816c36d3:	48 8d 4d f4        	lea    -0xc(%rbp),%rcx
 816c36d7:	0f ff              	(bad)

This protection is a modified version of the x86 PAX_REFCOUNT defense
from the last public patch of PaX/grsecurity, based on my understanding
of the code. Changes or omissions from the original code are mine and
don't reflect the original grsecurity/PaX code. Thanks to PaX Team for
various suggestions for improvement.

Thanks!
-Kees

v6:
- use single saturation value (INT_MIN / 2)
- detect refcount_dec() to zero and saturate

v5:
- add unchecked atomic_t implementation when !CONFIG_REFCOUNT_FULL
- use "leal" again, as in v3 for more flexible reset handling
- provide better underflow detection, with saturation

v4:
- switch to js from jns to gain static branch prediction benefits
- use .text.unlikely for js target, effectively making handler __cold
- use UD0 with refcount exception handler instead of int 0x81
- Kconfig defaults on when arch has support

v3:
- drop named text sections until we need to distinguish sizes/directions
- reset value immediately instead of passing back to handler
- drop needless export; josh

v2:
- fix instruction pointer decrement bug; thejh
- switch to js; pax-team
- improve commit log