Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Sat, Dec 01, 2018 at 06:52:45AM, Nadav Amit wrote:

> On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf wrote:
>
>> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>>
>>> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski wrote:
>>>
>>>> On Nov 28, 2018, at 6:06 PM, Nadav Amit wrote:
>>>>
>>>>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote:
>>>>>
>>>>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf wrote:
>>>>>>
>>>>>>> On Wed, Nov 28, 2018 at 07:34:52PM, Nadav Amit wrote:
>>>>>>>
>>>>>>>> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote:
>>>>>>>>
>>>>>>>>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>>>>>>>>>
>>>>>>>>>> This RFC introduces indirect call promotion at runtime, which for
>>>>>>>>>> the sake of simplicity (and branding) will be called here
>>>>>>>>>> "relpolines" (relative call + trampoline). Relpolines are mainly
>>>>>>>>>> intended as a way of reducing retpoline overheads due to
>>>>>>>>>> Spectre v2.
>>>>>>>>>>
>>>>>>>>>> Unlike indirect call promotion through profile-guided
>>>>>>>>>> optimization, the proposed approach does not require a profiling
>>>>>>>>>> stage, works well with modules whose load address is unknown, and
>>>>>>>>>> can adapt to changing workloads.
>>>>>>>>>>
>>>>>>>>>> The main idea is simple: for every indirect call, we inject a
>>>>>>>>>> piece of code with fast- and slow-path calls. The fast path is
>>>>>>>>>> used if the target matches the expected (hot) target. The slow
>>>>>>>>>> path uses a retpoline. During training, the slow path is set to
>>>>>>>>>> call a function that saves the call source and target in a hash
>>>>>>>>>> table and keeps a count of call frequency. The most common target
>>>>>>>>>> is then patched into the hot path.
>>>>>>>>>>
>>>>>>>>>> The patching is done on the fly by patching the conditional
>>>>>>>>>> branch (opcode and offset) that is used to compare the target to
>>>>>>>>>> the hot target. This makes it possible to direct all cores to the
>>>>>>>>>> fast path while the slow path is patched, and vice versa.
>>>>>>>>>> Patching follows two more rules: (1) only patch a single byte
>>>>>>>>>> when the code might be executed by any core; (2) when patching
>>>>>>>>>> more than one byte, ensure that no core runs the to-be-patched
>>>>>>>>>> code, by preventing this code from being preempted and by using
>>>>>>>>>> synchronize_sched() after patching the branch that jumps over
>>>>>>>>>> this code.
>>>>>>>>>>
>>>>>>>>>> Changing all the indirect calls to use relpolines is done using
>>>>>>>>>> assembly macro magic. There are alternative solutions, but this
>>>>>>>>>> one is relatively simple and transparent. There is also logic to
>>>>>>>>>> retrain the software predictor, but the policy it uses may need
>>>>>>>>>> to be refined.
>>>>>>>>>>
>>>>>>>>>> Eventually the results are not bad (2-VCPU VM, throughput
>>>>>>>>>> reported):
>>>>>>>>>>
>>>>>>>>>>               base    relpoline
>>>>>>>>>>               ----    ---------
>>>>>>>>>>   nginx       22898   25178 (+10%)
>>>>>>>>>>   redis-ycsb  24523   25486 (+4%)
>>>>>>>>>>   dbench      2144    2103  (+2%)
>>>>>>>>>>
>>>>>>>>>> When retpolines are disabled, and if retraining is off, the
>>>>>>>>>> performance benefit is up to 2% (nginx), but much less
>>>>>>>>>> impressive.
>>>>>>>>>
>>>>>>>>> Hi Nadav,
>>>>>>>>>
>>>>>>>>> Peter pointed me to these patches during a discussion about
>>>>>>>>> retpoline profiling. Personally, I think this is brilliant. This
>>>>>>>>> could help networking- and filesystem-intensive workloads a lot.
>>>>>>>>
>>>>>>>> Thanks! I was a bit held back by the relatively limited number of
>>>>>>>> responses.
>>>>>>>
>>>>>>> It is a rather, erm, ambitious idea; maybe they were speechless :-)
>>>>>>>
>>>>>>>> I finished another version two weeks ago, and every day I think
>>>>>>>> "should it be RFCv2 or v1", ending up not sending it...
>>>>>>>>
>>>>>>>> There is one issue that I realized while working on the new
>>>>>>>> version: I'm not sure it is well defined what an outline retpoline
>>>>>>>> is allowed to do. The indirect branch promotion code can change
>>>>>>>> rflags, which might cause correctness issues. In practice, using
>>>>>>>> gcc, it is not a problem.
>>>>>>>
>>>>>>> Callees can clobber flags, so it seems fine to me.
>>>>>>
>>>>>> Just to check I understand your approach right: you made a macro
>>>>>> called "call", and you're therefore causing all instances of "call"
>>>>>> to become magic? This is... terrifying. It's even plausibly worse
>>>>>> than "#define if" :) The scariest bit is that it will impact inline
>>>>>> asm as well. Maybe a gcc plugin would be less alarming?
>>>>>
>>>>> It is likely to look less alarming. When I looked at the inline
>>>>> retpoline implementation of gcc, it didn't look much better than what
>>>>> I did - it basically just emits assembly instructions.
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf wrote:

> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>
>> [...]
>>
>>>>> Just to check I understand your approach right: you made a macro
>>>>> called "call", and you're therefore causing all instances of "call"
>>>>> to become magic? This is... terrifying. It's even plausibly worse
>>>>> than "#define if" :) The scariest bit is that it will impact inline
>>>>> asm as well. Maybe a gcc plugin would be less alarming?
>>>>
>>>> It is likely to look less alarming. When I looked at the inline
>>>> retpoline implementation of gcc, it didn't look much better than what
>>>> I did - it basically just emits assembly instructions.
>>>
>>> To be clear, that wasn't a NAK. It was merely a "this is alarming."
>>
>> Although... how do you avoid matching on things that really don't want
>> this treatment? paravirt ops come to mind.

Paravirt ops don't use
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote: > On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski wrote: > > > > > > On Nov 28, 2018, at 6:06 PM, Nadav Amit wrote: > > > > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote: > > >> > > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf > > >>> wrote: > > >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: > > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf > > > wrote: > > > > > >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: > > >> This RFC introduces indirect call promotion in runtime, which for the > > >> matter of simplification (and branding) will be called here > > >> "relpolines" > > >> (relative call + trampoline). Relpolines are mainly intended as a way > > >> of reducing retpoline overheads due to Spectre v2. > > >> > > >> Unlike indirect call promotion through profile guided optimization, > > >> the > > >> proposed approach does not require a profiling stage, works well with > > >> modules whose address is unknown and can adapt to changing workloads. > > >> > > >> The main idea is simple: for every indirect call, we inject a piece > > >> of > > >> code with fast- and slow-path calls. The fast path is used if the > > >> target > > >> matches the expected (hot) target. The slow-path uses a retpoline. > > >> During training, the slow-path is set to call a function that saves > > >> the > > >> call source and target in a hash-table and keep count for call > > >> frequency. The most common target is then patched into the hot path. > > >> > > >> The patching is done on-the-fly by patching the conditional branch > > >> (opcode and offset) that is used to compare the target to the hot > > >> target. This allows to direct all cores to the fast-path, while > > >> patching > > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only > > >> patch a single byte when the code might be executed by any core. 
(2) > > >> When patching more than one byte, ensure that all cores do not run > > >> the > > >> to-be-patched-code by preventing this code from being preempted, and > > >> using synchronize_sched() after patching the branch that jumps over > > >> this > > >> code. > > >> > > >> Changing all the indirect calls to use relpolines is done using > > >> assembly > > >> macro magic. There are alternative solutions, but this one is > > >> relatively simple and transparent. There is also logic to retrain the > > >> software predictor, but the policy it uses may need to be refined. > > >> > > >> Eventually the results are not bad (2 VCPU VM, throughput reported): > > >> > > >> baserelpoline > > >> - > > >> nginx 22898 25178 (+10%) > > >> redis-ycsb 24523 25486 (+4%) > > >> dbench 21442103 (+2%) > > >> > > >> When retpolines are disabled, and if retraining is off, performance > > >> benefits are up to 2% (nginx), but are much less impressive. > > > > > > Hi Nadav, > > > > > > Peter pointed me to these patches during a discussion about retpoline > > > profiling. Personally, I think this is brilliant. This could help > > > networking and filesystem intensive workloads a lot. > > > > Thanks! I was a bit held-back by the relatively limited number of > > responses. > > >>> > > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-) > > >>> > > I finished another version two weeks ago, and every day I think: > > "should it > > be RFCv2 or v1”, ending up not sending it… > > > > There is one issue that I realized while working on the new version: > > I’m not > > sure it is well-defined what an outline retpoline is allowed to do. The > > indirect branch promotion code can change rflags, which might cause > > correction issues. In practice, using gcc, it is not a problem. > > >>> > > >>> Callees can clobber flags, so it seems fine to me. 
> > >> > > >> Just to check I understand your approach right: you made a macro > > >> called "call", and you're therefore causing all instances of "call" to > > >> become magic? This is... terrifying. It's even plausibly worse than > > >> "#define if" :) The scariest bit is that it will impact inline asm as > > >> well. Maybe a gcc plugin would be less alarming? > > > > > > It is likely to look less alarming. When I looked at the inline retpoline > > > implementation of gcc, it didn’t look much better than what I did - it > > > basically just emits assembly instructions. > > > > To be clear, that wasn’t a NAK. It was merely a “this is alarming.” > > Although... how do you avoid matching on things that really don't want > this treatment? paravirt ops come
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote: > On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski wrote: > > > > > > On Nov 28, 2018, at 6:06 PM, Nadav Amit wrote: > > > > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote: > > >> > > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf > > >>> wrote: > > >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: > > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf > > > wrote: > > > > > >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: > > >> This RFC introduces indirect call promotion in runtime, which for the > > >> matter of simplification (and branding) will be called here > > >> "relpolines" > > >> (relative call + trampoline). Relpolines are mainly intended as a way > > >> of reducing retpoline overheads due to Spectre v2. > > >> > > >> Unlike indirect call promotion through profile guided optimization, > > >> the > > >> proposed approach does not require a profiling stage, works well with > > >> modules whose address is unknown and can adapt to changing workloads. > > >> > > >> The main idea is simple: for every indirect call, we inject a piece > > >> of > > >> code with fast- and slow-path calls. The fast path is used if the > > >> target > > >> matches the expected (hot) target. The slow-path uses a retpoline. > > >> During training, the slow-path is set to call a function that saves > > >> the > > >> call source and target in a hash-table and keep count for call > > >> frequency. The most common target is then patched into the hot path. > > >> > > >> The patching is done on-the-fly by patching the conditional branch > > >> (opcode and offset) that is used to compare the target to the hot > > >> target. This allows to direct all cores to the fast-path, while > > >> patching > > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only > > >> patch a single byte when the code might be executed by any core. 
(2) > > >> When patching more than one byte, ensure that all cores do not run > > >> the > > >> to-be-patched-code by preventing this code from being preempted, and > > >> using synchronize_sched() after patching the branch that jumps over > > >> this > > >> code. > > >> > > >> Changing all the indirect calls to use relpolines is done using > > >> assembly > > >> macro magic. There are alternative solutions, but this one is > > >> relatively simple and transparent. There is also logic to retrain the > > >> software predictor, but the policy it uses may need to be refined. > > >> > > >> Eventually the results are not bad (2 VCPU VM, throughput reported): > > >> > > >> baserelpoline > > >> - > > >> nginx 22898 25178 (+10%) > > >> redis-ycsb 24523 25486 (+4%) > > >> dbench 21442103 (+2%) > > >> > > >> When retpolines are disabled, and if retraining is off, performance > > >> benefits are up to 2% (nginx), but are much less impressive. > > > > > > Hi Nadav, > > > > > > Peter pointed me to these patches during a discussion about retpoline > > > profiling. Personally, I think this is brilliant. This could help > > > networking and filesystem intensive workloads a lot. > > > > Thanks! I was a bit held-back by the relatively limited number of > > responses. > > >>> > > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-) > > >>> > > I finished another version two weeks ago, and every day I think: > > "should it > > be RFCv2 or v1”, ending up not sending it… > > > > There is one issue that I realized while working on the new version: > > I’m not > > sure it is well-defined what an outline retpoline is allowed to do. The > > indirect branch promotion code can change rflags, which might cause > > correction issues. In practice, using gcc, it is not a problem. > > >>> > > >>> Callees can clobber flags, so it seems fine to me. 
> > >> > > >> Just to check I understand your approach right: you made a macro > > >> called "call", and you're therefore causing all instances of "call" to > > >> become magic? This is... terrifying. It's even plausibly worse than > > >> "#define if" :) The scariest bit is that it will impact inline asm as > > >> well. Maybe a gcc plugin would be less alarming? > > > > > > It is likely to look less alarming. When I looked at the inline retpoline > > > implementation of gcc, it didn’t look much better than what I did - it > > > basically just emits assembly instructions. > > > > To be clear, that wasn’t a NAK. It was merely a “this is alarming.” > > Although... how do you avoid matching on things that really don't want > this treatment? paravirt ops come
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski wrote: > > > On Nov 28, 2018, at 6:06 PM, Nadav Amit wrote: > > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote: > >> > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf > >>> wrote: > >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote: > > > >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: > >> This RFC introduces indirect call promotion in runtime, which for the > >> matter of simplification (and branding) will be called here > >> "relpolines" > >> (relative call + trampoline). Relpolines are mainly intended as a way > >> of reducing retpoline overheads due to Spectre v2. > >> > >> Unlike indirect call promotion through profile guided optimization, the > >> proposed approach does not require a profiling stage, works well with > >> modules whose address is unknown and can adapt to changing workloads. > >> > >> The main idea is simple: for every indirect call, we inject a piece of > >> code with fast- and slow-path calls. The fast path is used if the > >> target > >> matches the expected (hot) target. The slow-path uses a retpoline. > >> During training, the slow-path is set to call a function that saves the > >> call source and target in a hash-table and keep count for call > >> frequency. The most common target is then patched into the hot path. > >> > >> The patching is done on-the-fly by patching the conditional branch > >> (opcode and offset) that is used to compare the target to the hot > >> target. This allows to direct all cores to the fast-path, while > >> patching > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only > >> patch a single byte when the code might be executed by any core. 
(2) > >> When patching more than one byte, ensure that all cores do not run the > >> to-be-patched-code by preventing this code from being preempted, and > >> using synchronize_sched() after patching the branch that jumps over > >> this > >> code. > >> > >> Changing all the indirect calls to use relpolines is done using > >> assembly > >> macro magic. There are alternative solutions, but this one is > >> relatively simple and transparent. There is also logic to retrain the > >> software predictor, but the policy it uses may need to be refined. > >> > >> Eventually the results are not bad (2 VCPU VM, throughput reported): > >> > >> baserelpoline > >> - > >> nginx 22898 25178 (+10%) > >> redis-ycsb 24523 25486 (+4%) > >> dbench 21442103 (+2%) > >> > >> When retpolines are disabled, and if retraining is off, performance > >> benefits are up to 2% (nginx), but are much less impressive. > > > > Hi Nadav, > > > > Peter pointed me to these patches during a discussion about retpoline > > profiling. Personally, I think this is brilliant. This could help > > networking and filesystem intensive workloads a lot. > > Thanks! I was a bit held-back by the relatively limited number of > responses. > >>> > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-) > >>> > I finished another version two weeks ago, and every day I think: "should > it > be RFCv2 or v1”, ending up not sending it… > > There is one issue that I realized while working on the new version: I’m > not > sure it is well-defined what an outline retpoline is allowed to do. The > indirect branch promotion code can change rflags, which might cause > correction issues. In practice, using gcc, it is not a problem. > >>> > >>> Callees can clobber flags, so it seems fine to me. > >> > >> Just to check I understand your approach right: you made a macro > >> called "call", and you're therefore causing all instances of "call" to > >> become magic? This is... terrifying. 
It's even plausibly worse than > >> "#define if" :) The scariest bit is that it will impact inline asm as > >> well. Maybe a gcc plugin would be less alarming? > > > > It is likely to look less alarming. When I looked at the inline retpoline > > implementation of gcc, it didn’t look much better than what I did - it > > basically just emits assembly instructions. > > To be clear, that wasn’t a NAK. It was merely a “this is alarming.” Although... how do you avoid matching on things that really don't want this treatment? paravirt ops come to mind.
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski wrote: > > > On Nov 28, 2018, at 6:06 PM, Nadav Amit wrote: > > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote: > >> > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf > >>> wrote: > >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote: > > > >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: > >> This RFC introduces indirect call promotion in runtime, which for the > >> matter of simplification (and branding) will be called here > >> "relpolines" > >> (relative call + trampoline). Relpolines are mainly intended as a way > >> of reducing retpoline overheads due to Spectre v2. > >> > >> Unlike indirect call promotion through profile guided optimization, the > >> proposed approach does not require a profiling stage, works well with > >> modules whose address is unknown and can adapt to changing workloads. > >> > >> The main idea is simple: for every indirect call, we inject a piece of > >> code with fast- and slow-path calls. The fast path is used if the > >> target > >> matches the expected (hot) target. The slow-path uses a retpoline. > >> During training, the slow-path is set to call a function that saves the > >> call source and target in a hash-table and keep count for call > >> frequency. The most common target is then patched into the hot path. > >> > >> The patching is done on-the-fly by patching the conditional branch > >> (opcode and offset) that is used to compare the target to the hot > >> target. This allows to direct all cores to the fast-path, while > >> patching > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only > >> patch a single byte when the code might be executed by any core. 
(2) > >> When patching more than one byte, ensure that all cores do not run the > >> to-be-patched-code by preventing this code from being preempted, and > >> using synchronize_sched() after patching the branch that jumps over > >> this > >> code. > >> > >> Changing all the indirect calls to use relpolines is done using > >> assembly > >> macro magic. There are alternative solutions, but this one is > >> relatively simple and transparent. There is also logic to retrain the > >> software predictor, but the policy it uses may need to be refined. > >> > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>
> >>              base    relpoline
> >>              ----    ---------
> >> nginx       22898    25178 (+10%)
> >> redis-ycsb  24523    25486 (+4%)
> >> dbench       2144     2103 (+2%)
> >>
> >> When retpolines are disabled, and if retraining is off, performance > >> benefits are up to 2% (nginx), but are much less impressive. > > > > Hi Nadav, > > > > Peter pointed me to these patches during a discussion about retpoline > > profiling. Personally, I think this is brilliant. This could help > > networking and filesystem intensive workloads a lot. > > Thanks! I was a bit held-back by the relatively limited number of > responses. > >>> > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-) > >>> > I finished another version two weeks ago, and every day I think: "should > it > be RFCv2 or v1”, ending up not sending it… > > There is one issue that I realized while working on the new version: I’m > not > sure it is well-defined what an outline retpoline is allowed to do. The > indirect branch promotion code can change rflags, which might cause > correction issues. In practice, using gcc, it is not a problem. > >>> > >>> Callees can clobber flags, so it seems fine to me. > >> > >> Just to check I understand your approach right: you made a macro > >> called "call", and you're therefore causing all instances of "call" to > >> become magic? This is... terrifying. 
It's even plausibly worse than > >> "#define if" :) The scariest bit is that it will impact inline asm as > >> well. Maybe a gcc plugin would be less alarming? > > > > It is likely to look less alarming. When I looked at the inline retpoline > > implementation of gcc, it didn’t look much better than what I did - it > > basically just emits assembly instructions. > > To be clear, that wasn’t a NAK. It was merely a “this is alarming.” Although... how do you avoid matching on things that really don't want this treatment? paravirt ops come to mind.
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 07:24:08PM -0800, Andy Lutomirski wrote: > To be clear, that wasn’t a NAK. It was merely a “this is alarming.” > > Hey Josh - you could potentially do the same hack to generate the > static call tables. Take that, objtool. Ha, after witnessing Nadav's glorious hack, I considered that. But I didn't see a way to pull it off, because asm macro conditionals don't seem to have a way to test for a regex (or at least a named prefix) for the call target. I'd need a way to detect "call __static_call_tramp_". -- Josh
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Nov 28, 2018, at 6:06 PM, Nadav Amit wrote: >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote: >> >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf wrote: >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote: > >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: >> This RFC introduces indirect call promotion in runtime, which for the >> matter of simplification (and branding) will be called here "relpolines" >> (relative call + trampoline). Relpolines are mainly intended as a way >> of reducing retpoline overheads due to Spectre v2. >> >> Unlike indirect call promotion through profile guided optimization, the >> proposed approach does not require a profiling stage, works well with >> modules whose address is unknown and can adapt to changing workloads. >> >> The main idea is simple: for every indirect call, we inject a piece of >> code with fast- and slow-path calls. The fast path is used if the target >> matches the expected (hot) target. The slow-path uses a retpoline. >> During training, the slow-path is set to call a function that saves the >> call source and target in a hash-table and keep count for call >> frequency. The most common target is then patched into the hot path. >> >> The patching is done on-the-fly by patching the conditional branch >> (opcode and offset) that is used to compare the target to the hot >> target. This allows to direct all cores to the fast-path, while patching >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only >> patch a single byte when the code might be executed by any core. (2) >> When patching more than one byte, ensure that all cores do not run the >> to-be-patched-code by preventing this code from being preempted, and >> using synchronize_sched() after patching the branch that jumps over this >> code. >> >> Changing all the indirect calls to use relpolines is done using assembly >> macro magic. 
There are alternative solutions, but this one is >> relatively simple and transparent. There is also logic to retrain the >> software predictor, but the policy it uses may need to be refined. >> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>
>>              base    relpoline
>>              ----    ---------
>> nginx       22898    25178 (+10%)
>> redis-ycsb  24523    25486 (+4%)
>> dbench       2144     2103 (+2%)
>>
>> When retpolines are disabled, and if retraining is off, performance >> benefits are up to 2% (nginx), but are much less impressive. > > Hi Nadav, > > Peter pointed me to these patches during a discussion about retpoline > profiling. Personally, I think this is brilliant. This could help > networking and filesystem intensive workloads a lot. Thanks! I was a bit held-back by the relatively limited number of responses. >>> >>> It is a rather, erm, ambitious idea, maybe they were speechless :-) >>> I finished another version two weeks ago, and every day I think: "should it be RFCv2 or v1”, ending up not sending it… There is one issue that I realized while working on the new version: I’m not sure it is well-defined what an outline retpoline is allowed to do. The indirect branch promotion code can change rflags, which might cause correction issues. In practice, using gcc, it is not a problem. >>> >>> Callees can clobber flags, so it seems fine to me. >> >> Just to check I understand your approach right: you made a macro >> called "call", and you're therefore causing all instances of "call" to >> become magic? This is... terrifying. It's even plausibly worse than >> "#define if" :) The scariest bit is that it will impact inline asm as >> well. Maybe a gcc plugin would be less alarming? > > It is likely to look less alarming. When I looked at the inline retpoline > implementation of gcc, it didn’t look much better than what I did - it > basically just emits assembly instructions. To be clear, that wasn’t a NAK. 
It was merely a “this is alarming.” Hey Josh - you could potentially do the same hack to generate the static call tables. Take that, objtool. > > Anyhow, I'll look (again) into using gcc-plugins. > 1. An indirect branch inside the BP handler might be the one we patch >>> >>> I _think_ nested INT3s should be doable, because they don't use IST. >>> Maybe Andy can clarify. >> >> int3 should survive recursion these days. Although I admit I'm >> currently wondering what happens if one thread puts a kprobe on an >> address that another thread tries to text_poke. > > The issue I was referring to is having an indirect call *inside* the handler. > For example, you try to patch the call to bp_int3_handler and then get an >
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski wrote: > > On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf wrote: >> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote: On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: > This RFC introduces indirect call promotion in runtime, which for the > matter of simplification (and branding) will be called here "relpolines" > (relative call + trampoline). Relpolines are mainly intended as a way > of reducing retpoline overheads due to Spectre v2. > > Unlike indirect call promotion through profile guided optimization, the > proposed approach does not require a profiling stage, works well with > modules whose address is unknown and can adapt to changing workloads. > > The main idea is simple: for every indirect call, we inject a piece of > code with fast- and slow-path calls. The fast path is used if the target > matches the expected (hot) target. The slow-path uses a retpoline. > During training, the slow-path is set to call a function that saves the > call source and target in a hash-table and keep count for call > frequency. The most common target is then patched into the hot path. > > The patching is done on-the-fly by patching the conditional branch > (opcode and offset) that is used to compare the target to the hot > target. This allows to direct all cores to the fast-path, while patching > the slow-path and vice-versa. Patching follows 2 more rules: (1) Only > patch a single byte when the code might be executed by any core. (2) > When patching more than one byte, ensure that all cores do not run the > to-be-patched-code by preventing this code from being preempted, and > using synchronize_sched() after patching the branch that jumps over this > code. > > Changing all the indirect calls to use relpolines is done using assembly > macro magic. There are alternative solutions, but this one is > relatively simple and transparent. 
There is also logic to retrain the > software predictor, but the policy it uses may need to be refined. > > Eventually the results are not bad (2 VCPU VM, throughput reported):
>
>              base    relpoline
>              ----    ---------
> nginx       22898    25178 (+10%)
> redis-ycsb  24523    25486 (+4%)
> dbench       2144     2103 (+2%)
>
> When retpolines are disabled, and if retraining is off, performance > benefits are up to 2% (nginx), but are much less impressive. Hi Nadav, Peter pointed me to these patches during a discussion about retpoline profiling. Personally, I think this is brilliant. This could help networking and filesystem intensive workloads a lot. >>> >>> Thanks! I was a bit held-back by the relatively limited number of responses. >> >> It is a rather, erm, ambitious idea, maybe they were speechless :-) >> >>> I finished another version two weeks ago, and every day I think: "should it >>> be RFCv2 or v1”, ending up not sending it… >>> >>> There is one issue that I realized while working on the new version: I’m not >>> sure it is well-defined what an outline retpoline is allowed to do. The >>> indirect branch promotion code can change rflags, which might cause >>> correction issues. In practice, using gcc, it is not a problem. >> >> Callees can clobber flags, so it seems fine to me. > > Just to check I understand your approach right: you made a macro > called "call", and you're therefore causing all instances of "call" to > become magic? This is... terrifying. It's even plausibly worse than > "#define if" :) The scariest bit is that it will impact inline asm as > well. Maybe a gcc plugin would be less alarming? It is likely to look less alarming. When I looked at the inline retpoline implementation of gcc, it didn’t look much better than what I did - it basically just emits assembly instructions. Anyhow, I'll look (again) into using gcc-plugins. >>> 1. An indirect branch inside the BP handler might be the one we patch >> >> I _think_ nested INT3s should be doable, because they don't use IST. 
>> Maybe Andy can clarify. > > int3 should survive recursion these days. Although I admit I'm > currently wondering what happens if one thread puts a kprobe on an > address that another thread tries to text_poke. The issue I was referring to is having an indirect call *inside* the handler. For example, you try to patch the call to bp_int3_handler and then get an int3. They can be annotated to prevent them from being patched. Then again, I need to see how gcc plugins can get these annotations. > > Also, this relpoline magic is likely to start patching text at runtime > on a semi-regular basis. This type of patching is *slow*. Is it a > problem? It didn’t appear so. Although there are >1 indirect
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf wrote: > > On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote: > > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote: > > > > > > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote: > > >> This RFC introduces indirect call promotion in runtime, which for the > > >> matter of simplification (and branding) will be called here "relpolines" > > >> (relative call + trampoline). Relpolines are mainly intended as a way > > >> of reducing retpoline overheads due to Spectre v2. > > >> > > >> Unlike indirect call promotion through profile guided optimization, the > > >> proposed approach does not require a profiling stage, works well with > > >> modules whose address is unknown and can adapt to changing workloads. > > >> > > >> The main idea is simple: for every indirect call, we inject a piece of > > >> code with fast- and slow-path calls. The fast path is used if the target > > >> matches the expected (hot) target. The slow-path uses a retpoline. > > >> During training, the slow-path is set to call a function that saves the > > >> call source and target in a hash-table and keep count for call > > >> frequency. The most common target is then patched into the hot path. > > >> > > >> The patching is done on-the-fly by patching the conditional branch > > >> (opcode and offset) that is used to compare the target to the hot > > >> target. This allows to direct all cores to the fast-path, while patching > > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only > > >> patch a single byte when the code might be executed by any core. (2) > > >> When patching more than one byte, ensure that all cores do not run the > > >> to-be-patched-code by preventing this code from being preempted, and > > >> using synchronize_sched() after patching the branch that jumps over this > > >> code. > > >> > > >> Changing all the indirect calls to use relpolines is done using assembly > > >> macro magic. 
There are alternative solutions, but this one is > > >> relatively simple and transparent. There is also logic to retrain the > > >> software predictor, but the policy it uses may need to be refined. > > >> > > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>
> > >>              base    relpoline
> > >>              ----    ---------
> > >> nginx       22898    25178 (+10%)
> > >> redis-ycsb  24523    25486 (+4%)
> > >> dbench       2144     2103 (+2%)
> > >>
> > >> When retpolines are disabled, and if retraining is off, performance > > >> benefits are up to 2% (nginx), but are much less impressive. > > > > > > Hi Nadav, > > > > > > Peter pointed me to these patches during a discussion about retpoline > > > profiling. Personally, I think this is brilliant. This could help > > > networking and filesystem intensive workloads a lot. > > > > Thanks! I was a bit held-back by the relatively limited number of responses. > > It is a rather, erm, ambitious idea, maybe they were speechless :-) > > > I finished another version two weeks ago, and every day I think: "should it > > be RFCv2 or v1”, ending up not sending it… > > > > There is one issue that I realized while working on the new version: I’m not > > sure it is well-defined what an outline retpoline is allowed to do. The > > indirect branch promotion code can change rflags, which might cause > > correction issues. In practice, using gcc, it is not a problem. > > Callees can clobber flags, so it seems fine to me. Just to check I understand your approach right: you made a macro called "call", and you're therefore causing all instances of "call" to become magic? This is... terrifying. It's even plausibly worse than "#define if" :) The scariest bit is that it will impact inline asm as well. Maybe a gcc plugin would be less alarming? > > > > 1. An indirect branch inside the BP handler might be the one we patch > > I _think_ nested INT3s should be doable, because they don't use IST. > Maybe Andy can clarify. int3 should survive recursion these days. 
Although I admit I'm currently wondering what happens if one thread puts a kprobe on an address that another thread tries to text_poke. Also, this relpoline magic is likely to start patching text at runtime on a semi-regular basis. This type of patching is *slow*. Is it a problem? > > 2. An indirect branch inside an interrupt or NMI handler might be the > >one we patch > > But INT3s just use the existing stack, and NMIs support nesting, so I'm > thinking that should also be doable. Andy? > In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we should be fine, right? I'd be a little nervous if we get an int3 in the C code that handles the early part of an NMI from user mode. It's *probably* okay, but one of the alarming issues is that the int3 return path will implicitly unmask NMI, which isn't fantastic. Maybe we finally need to dust off my old "return using RET" code to get rid
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
>
> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote:
> >
> > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >> This RFC introduces indirect call promotion in runtime, which for the
> >> matter of simplification (and branding) will be called here "relpolines"
> >> (relative call + trampoline). Relpolines are mainly intended as a way
> >> of reducing retpoline overheads due to Spectre v2.
> >>
> >> Unlike indirect call promotion through profile guided optimization, the
> >> proposed approach does not require a profiling stage, works well with
> >> modules whose address is unknown and can adapt to changing workloads.
> >>
> >> The main idea is simple: for every indirect call, we inject a piece of
> >> code with fast- and slow-path calls. The fast path is used if the target
> >> matches the expected (hot) target. The slow-path uses a retpoline.
> >> During training, the slow-path is set to call a function that saves the
> >> call source and target in a hash-table and keep count for call
> >> frequency. The most common target is then patched into the hot path.
> >>
> >> The patching is done on-the-fly by patching the conditional branch
> >> (opcode and offset) that is used to compare the target to the hot
> >> target. This allows to direct all cores to the fast-path, while patching
> >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >> patch a single byte when the code might be executed by any core. (2)
> >> When patching more than one byte, ensure that all cores do not run the
> >> to-be-patched-code by preventing this code from being preempted, and
> >> using synchronize_sched() after patching the branch that jumps over this
> >> code.
> >>
> >> Changing all the indirect calls to use relpolines is done using assembly
> >> macro magic. There are alternative solutions, but this one is
> >> relatively simple and transparent. There is also logic to retrain the
> >> software predictor, but the policy it uses may need to be refined.
> >>
> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>
> >>              base    relpoline
> >>              ----    ---------
> >> nginx        22898   25178 (+10%)
> >> redis-ycsb   24523   25486 (+4%)
> >> dbench       2144    2103 (+2%)
> >>
> >> When retpolines are disabled, and if retraining is off, performance
> >> benefits are up to 2% (nginx), but are much less impressive.
> >
> > Hi Nadav,
> >
> > Peter pointed me to these patches during a discussion about retpoline
> > profiling. Personally, I think this is brilliant. This could help
> > networking and filesystem intensive workloads a lot.
>
> Thanks! I was a bit held-back by the relatively limited number of responses.

It is a rather, erm, ambitious idea, maybe they were speechless :-)

> I finished another version two weeks ago, and every day I think: "should it
> be RFCv2 or v1", ending up not sending it…
>
> There is one issue that I realized while working on the new version: I'm not
> sure it is well-defined what an outline retpoline is allowed to do. The
> indirect branch promotion code can change rflags, which might cause
> correctness issues. In practice, using gcc, it is not a problem.

Callees can clobber flags, so it seems fine to me.

> > Some high-level comments:
> >
> > - "Relpoline" looks confusingly a lot like "retpoline". How about
> >   "optpoline"? To avoid confusing myself I will hereafter refer to it
> >   as such :-)
>
> Sure. For the academic paper we submitted, we used a much worse name that my
> colleague came up with. I'm ok with anything other than that name (not
> mentioned to prevent double-blinding violations). I'll go with your name.
>
> > - Instead of patching one byte at a time, is there a reason why
> >   text_poke_bp() can't be used? That would greatly simplify the
> >   patching process, as everything could be patched in a single step.
>
> I thought of it and maybe it is somehow possible, but there are several
> problems, for which I didn't find a simple solution:
>
> 1. An indirect branch inside the BP handler might be the one we patch

I _think_ nested INT3s should be doable, because they don't use IST.
Maybe Andy can clarify.

To avoid recursion we'd have to make sure not to use any function
pointers in do_int3() before or during the call to poke_int3_handler().

Incidentally I'm working on a patch which adds an indirect call to
poke_int3_handler(). We'd have to disable optpolines for that code.

> 2. An indirect branch inside an interrupt or NMI handler might be the
>    one we patch

But INT3s just use the existing stack, and NMIs support nesting, so I'm
thinking that should also be doable. Andy?

> 3. Overall, we need to patch more than a single instruction, and
>    text_poke_bp() is not suitable

Hm, I suppose that's true.

> > - In many cases, a single direct call may not be sufficient, as there
> >
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf wrote:
>
> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>> This RFC introduces indirect call promotion in runtime, which for the
>> matter of simplification (and branding) will be called here "relpolines"
>> (relative call + trampoline). Relpolines are mainly intended as a way
>> of reducing retpoline overheads due to Spectre v2.
>>
>> Unlike indirect call promotion through profile guided optimization, the
>> proposed approach does not require a profiling stage, works well with
>> modules whose address is unknown and can adapt to changing workloads.
>>
>> The main idea is simple: for every indirect call, we inject a piece of
>> code with fast- and slow-path calls. The fast path is used if the target
>> matches the expected (hot) target. The slow-path uses a retpoline.
>> During training, the slow-path is set to call a function that saves the
>> call source and target in a hash-table and keep count for call
>> frequency. The most common target is then patched into the hot path.
>>
>> The patching is done on-the-fly by patching the conditional branch
>> (opcode and offset) that is used to compare the target to the hot
>> target. This allows to direct all cores to the fast-path, while patching
>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>> patch a single byte when the code might be executed by any core. (2)
>> When patching more than one byte, ensure that all cores do not run the
>> to-be-patched-code by preventing this code from being preempted, and
>> using synchronize_sched() after patching the branch that jumps over this
>> code.
>>
>> Changing all the indirect calls to use relpolines is done using assembly
>> macro magic. There are alternative solutions, but this one is
>> relatively simple and transparent. There is also logic to retrain the
>> software predictor, but the policy it uses may need to be refined.
>>
>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>>
>>              base    relpoline
>>              ----    ---------
>> nginx        22898   25178 (+10%)
>> redis-ycsb   24523   25486 (+4%)
>> dbench       2144    2103 (+2%)
>>
>> When retpolines are disabled, and if retraining is off, performance
>> benefits are up to 2% (nginx), but are much less impressive.
>
> Hi Nadav,
>
> Peter pointed me to these patches during a discussion about retpoline
> profiling. Personally, I think this is brilliant. This could help
> networking and filesystem intensive workloads a lot.

Thanks! I was a bit held-back by the relatively limited number of responses.

I finished another version two weeks ago, and every day I think: "should it
be RFCv2 or v1", ending up not sending it…

There is one issue that I realized while working on the new version: I'm not
sure it is well-defined what an outline retpoline is allowed to do. The
indirect branch promotion code can change rflags, which might cause
correctness issues. In practice, using gcc, it is not a problem.

> Some high-level comments:
>
> - "Relpoline" looks confusingly a lot like "retpoline". How about
>   "optpoline"? To avoid confusing myself I will hereafter refer to it
>   as such :-)

Sure. For the academic paper we submitted, we used a much worse name that my
colleague came up with. I'm ok with anything other than that name (not
mentioned to prevent double-blinding violations). I'll go with your name.

> - Instead of patching one byte at a time, is there a reason why
>   text_poke_bp() can't be used? That would greatly simplify the
>   patching process, as everything could be patched in a single step.

I thought of it and maybe it is somehow possible, but there are several
problems, for which I didn't find a simple solution:

1. An indirect branch inside the BP handler might be the one we patch

2. An indirect branch inside an interrupt or NMI handler might be the
   one we patch

3. Overall, we need to patch more than a single instruction, and
   text_poke_bp() is not suitable

> - In many cases, a single direct call may not be sufficient, as there
>   could be for example multiple tasks using different network protocols
>   which need different callbacks for the same call site.

We want to know during compilation how many targets to use. It is not
super-simple to support multiple inlined targets, but it is feasible if you
are willing to annotate when multiple targets are needed. We have a version
which uses outlined indirect branch promotion when there are multiple
targets, but it's not ready for prime time, and the code-cache misses can
induce some overheads.

> - I'm not sure about the periodic retraining logic, it seems a bit
>   nondeterministic and bursty.

I agree. It can be limited to cases in which modules are loaded/removed, or
when the user explicitly asks for it to take place.

> So I'd propose the following changes:
>
> - In the optpoline, reserve space for multiple (5 or so)
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion in runtime, which for the
> matter of simplification (and branding) will be called here "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
>
> Unlike indirect call promotion through profile guided optimization, the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
>
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves the
> call source and target in a hash-table and keep count for call
> frequency. The most common target is then patched into the hot path.
>
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows to direct all cores to the fast-path, while patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over this
> code.
>
> Changing all the indirect calls to use relpolines is done using assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
>
> Eventually the results are not bad (2 VCPU VM, throughput reported):
>
>              base    relpoline
>              ----    ---------
> nginx        22898   25178 (+10%)
> redis-ycsb   24523   25486 (+4%)
> dbench       2144    2103 (+2%)
>
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.

Hi Nadav,

Peter pointed me to these patches during a discussion about retpoline
profiling. Personally, I think this is brilliant. This could help
networking and filesystem intensive workloads a lot.

Some high-level comments:

- "Relpoline" looks confusingly a lot like "retpoline". How about
  "optpoline"? To avoid confusing myself I will hereafter refer to it
  as such :-)

- Instead of patching one byte at a time, is there a reason why
  text_poke_bp() can't be used? That would greatly simplify the
  patching process, as everything could be patched in a single step.

- In many cases, a single direct call may not be sufficient, as there
  could be for example multiple tasks using different network protocols
  which need different callbacks for the same call site.

- I'm not sure about the periodic retraining logic, it seems a bit
  nondeterministic and bursty.

So I'd propose the following changes:

- In the optpoline, reserve space for multiple (5 or so) comparisons and
  direct calls. Maybe the number of reserved cmp/jne/call slots can be
  tweaked by the caller somehow. Or maybe it could grow as needed.
  Starting out, they would just be NOPs.

- Instead of the temporary learning mode, add permanent tracking to
  detect a direct call "miss" -- i.e., when none of the existing direct
  calls are applicable and the retpoline will be used.

- In the case of a miss (or N misses), it could trigger a direct call
  patching operation to be run later (workqueue or syscall exit). If all
  the direct call slots are full, it could patch the least recently
  modified one. If this causes thrashing (>x changes over y time), it
  could increase the number of direct call slots using a trampoline.
Even if there were several slots, CPU branch prediction would presumably
help make it much faster than a basic retpoline.

Thoughts?

-- 
Josh
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On 10/23/18 1:32 PM, Nadav Amit wrote:
>> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>>             base    relpoline
>>>             ----    ---------
>>> nginx       22898   25178 (+10%)
>>> redis-ycsb  24523   25486 (+4%)
>>> dbench      2144    2103 (+2%)
>>
>> Just out of curiosity, which indirect branches are the culprits here for
>> causing the slowdowns?
>
> So I didn't try to measure exactly which one. There are roughly 500 that
> actually "run" in my tests.

OK, cool, that's pretty much all I wanted to know, just that there
aren't 3 of them or something for which we need all this infrastructure.
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
at 11:36 AM, Dave Hansen wrote:
> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>              base    relpoline
>>              ----    ---------
>> nginx        22898   25178 (+10%)
>> redis-ycsb   24523   25486 (+4%)
>> dbench       2144    2103 (+2%)
>
> Just out of curiosity, which indirect branches are the culprits here for
> causing the slowdowns?

So I didn't try to measure exactly which one. There are roughly 500 that
actually "run" in my tests.

Initially, I took the silly approach of trying to patch the C source-code
using semi automatically-generated Coccinelle scripts, so I can tell you it
is not just few branches but many. The network stack is full of function
pointers (e.g., tcp_congestion_ops, tcp_sock_af_ops, dst_ops). The
file-system also uses many function pointers (file_operations
specifically). Compound-pages have d'tor and so on.

If you want, you can rebuild the kernel without retpolines and run

  perf record -e br_inst_exec.taken_indirect_near_call:k (your workload)

For some reason I didn't manage to use PEBS (:ppp) from either the guest or
the host, so my results are a bit skewed (i.e., the sampled location is
usually after the call was taken).

Running dbench in the VM gives me the following "hot-spots":

# Samples: 304  of event 'br_inst_exec.taken_indirect_near_call'
# Event count (approx.): 60800912
#
# Overhead  Command  Shared Object            Symbol
# ........  .......  .......................  ......................
#
   5.26%  :197970  [guest.kernel.kallsyms]  [g] __fget_light
   4.28%  :197969  [guest.kernel.kallsyms]  [g] __fget_light
   3.95%  :197969  [guest.kernel.kallsyms]  [g] dcache_readdir
   3.29%  :197970  [guest.kernel.kallsyms]  [g] next_positive.isra.14
   2.96%  :197970  [guest.kernel.kallsyms]  [g] __do_sys_kill
   2.30%  :197970  [guest.kernel.kallsyms]  [g] apparmor_file_open
   1.97%  :197969  [guest.kernel.kallsyms]  [g] __do_sys_kill
   1.97%  :197969  [guest.kernel.kallsyms]  [g] next_positive.isra.14
   1.97%  :197970  [guest.kernel.kallsyms]  [g] _raw_spin_lock
   1.64%  :197969  [guest.kernel.kallsyms]  [g] __alloc_file
   1.64%  :197969  [guest.kernel.kallsyms]  [g] common_file_perm
   1.64%  :197969  [guest.kernel.kallsyms]  [g] filldir
   1.64%  :197970  [guest.kernel.kallsyms]  [g] do_dentry_open
   1.64%  :197970  [guest.kernel.kallsyms]  [g] kmem_cache_free
   1.32%  :197969  [guest.kernel.kallsyms]  [g] __raw_callee_save___pv_queued_spin_unlock
   1.32%  :197969  [guest.kernel.kallsyms]  [g] __slab_free

Regards,
Nadav
Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion
On 10/17/18 5:54 PM, Nadav Amit wrote:
>             base    relpoline
>             ----    ---------
> nginx       22898   25178 (+10%)
> redis-ycsb  24523   25486 (+4%)
> dbench      2144    2103 (+2%)

Just out of curiosity, which indirect branches are the culprits here for
causing the slowdowns?