Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-12-01 Thread Josh Poimboeuf
On Sat, Dec 01, 2018 at 06:52:45AM +, Nadav Amit wrote:
> > On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf  wrote:
> > 
> > On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> >> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski  
> >> wrote:
> >>> On Nov 28, 2018, at 6:06 PM, Nadav Amit  wrote:
> >>> 
> > On Nov 28, 2018, at 5:40 PM, Andy Lutomirski  wrote:
> > 
> >> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  
> >> wrote:
> >> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
>  On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  
>  wrote:
>  
> > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > This RFC introduces indirect call promotion at runtime, which for 
> > the
> > matter of simplification (and branding) will be called here 
> > "relpolines"
> > (relative call + trampoline). Relpolines are mainly intended as a 
> > way
> > of reducing retpoline overheads due to Spectre v2.
> > 
> > Unlike indirect call promotion through profile guided optimization, 
> > the
> > proposed approach does not require a profiling stage, works well 
> > with
> > modules whose address is unknown and can adapt to changing 
> > workloads.
> > 
> > The main idea is simple: for every indirect call, we inject a piece 
> > of
> > code with fast- and slow-path calls. The fast path is used if the 
> > target
> > matches the expected (hot) target. The slow-path uses a retpoline.
> > During training, the slow-path is set to call a function that saves 
> > the
> > call source and target in a hash-table and keeps a count of call
> > frequency. The most common target is then patched into the hot path.
> > 
> > The patching is done on-the-fly by patching the conditional branch
> > (opcode and offset) that is used to compare the target to the hot
> > target. This allows directing all cores to the fast-path, while 
> > patching
> > the slow-path and vice-versa. Patching follows 2 more rules: (1) 
> > Only
> > patch a single byte when the code might be executed by any core. (2)
> > When patching more than one byte, ensure that all cores do not run 
> > the
> > to-be-patched-code by preventing this code from being preempted, and
> > using synchronize_sched() after patching the branch that jumps over 
> > this
> > code.
> > 
> > Changing all the indirect calls to use relpolines is done using 
> > assembly
> > macro magic. There are alternative solutions, but this one is
> > relatively simple and transparent. There is also logic to retrain 
> > the
> > software predictor, but the policy it uses may need to be refined.
> > 
> > Eventually the results are not bad (2 VCPU VM, throughput reported):
> > 
> >             base    relpoline
> >             ----    ---------
> > nginx      22898    25178 (+10%)
> > redis-ycsb 24523    25486 (+4%)
> > dbench      2144     2103 (+2%)
> > 
> > When retpolines are disabled, and if retraining is off, performance
> > benefits are up to 2% (nginx), but are much less impressive.
>  
>  Hi Nadav,
>  
>  Peter pointed me to these patches during a discussion about retpoline
>  profiling.  Personally, I think this is brilliant.  This could help
>  networking and filesystem intensive workloads a lot.
> >>> 
> >>> Thanks! I was a bit held-back by the relatively limited number of 
> >>> responses.
> >> 
> >> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >> 
> >>> I finished another version two weeks ago, and every day I think: 
> >>> "should it
> >>> be RFCv2 or v1”, ending up not sending it…
> >>> 
> >>> There is one issue that I realized while working on the new version: 
> >>> I’m not
> >>> sure it is well-defined what an outline retpoline is allowed to do. 
> >>> The
> >>> indirect branch promotion code can change rflags, which might cause
> >>> correctness issues. In practice, using gcc, it is not a problem.
> >> 
> >> Callees can clobber flags, so it seems fine to me.
> > 
> > Just to check I understand your approach right: you made a macro
> > called "call", and you're therefore causing all instances of "call" to
> > become magic?  This is... terrifying.  It's even plausibly worse than
> > "#define if" :)  The scariest bit is that it will impact inline asm as
> > well.  Maybe a gcc plugin would be less alarming?
>  
>  It is likely to look less alarming. When I looked at the inline 
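
[A rough C-level model of the call-site structure described in the quoted
cover letter may help readers follow the thread. Everything below is an
illustrative sketch: the real series emits this as assembly, and all names
here are made up rather than taken from the patches.]

    /* One promoted ("relpoline") call site, modeled in C.  "hot_target"
     * stands for the value that is patched directly into the instruction
     * stream, not a real variable. */
    typedef void (*callee_t)(void);

    static callee_t hot_target;                 /* learned, patched-in target */

    static void slow_path(callee_t target)
    {
            /* Retpoline-protected call, or the training routine that records
             * (call site, target) and bumps a frequency counter. */
            target();
    }

    static void promoted_call(callee_t target)
    {
            if (target == hot_target)           /* fast path: compare ...      */
                    hot_target();               /* ... and direct, predictable call */
            else
                    slow_path(target);          /* slow path: retpoline/training */
    }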

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-30 Thread Nadav Amit
> On Nov 29, 2018, at 7:19 AM, Josh Poimboeuf  wrote:
> 
> On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski  wrote:
>>> On Nov 28, 2018, at 6:06 PM, Nadav Amit  wrote:
>>> 
> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski  wrote:
> 
>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  
>> wrote:
>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
 On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  
 wrote:
 
> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion at runtime, which for the
> matter of simplification (and branding) will be called here 
> "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
> 
> Unlike indirect call promotion through profile guided optimization, 
> the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
> 
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the 
> target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves 
> the
> call source and target in a hash-table and keeps a count of call
> frequency. The most common target is then patched into the hot path.
> 
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows directing all cores to the fast-path, while 
> patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over 
> this
> code.
> 
> Changing all the indirect calls to use relpolines is done using 
> assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
> 
> Eventually the results are not bad (2 VCPU VM, throughput reported):
> 
>             base    relpoline
>             ----    ---------
> nginx      22898    25178 (+10%)
> redis-ycsb 24523    25486 (+4%)
> dbench      2144     2103 (+2%)
> 
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.
 
 Hi Nadav,
 
 Peter pointed me to these patches during a discussion about retpoline
 profiling.  Personally, I think this is brilliant.  This could help
 networking and filesystem intensive workloads a lot.
>>> 
>>> Thanks! I was a bit held-back by the relatively limited number of 
>>> responses.
>> 
>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>> 
>>> I finished another version two weeks ago, and every day I think: 
>>> "should it
>>> be RFCv2 or v1”, ending up not sending it…
>>> 
>>> There is one issue that I realized while working on the new version: 
>>> I’m not
>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>> indirect branch promotion code can change rflags, which might cause
>>> correctness issues. In practice, using gcc, it is not a problem.
>> 
>> Callees can clobber flags, so it seems fine to me.
> 
> Just to check I understand your approach right: you made a macro
> called "call", and you're therefore causing all instances of "call" to
> become magic?  This is... terrifying.  It's even plausibly worse than
> "#define if" :)  The scariest bit is that it will impact inline asm as
> well.  Maybe a gcc plugin would be less alarming?
 
 It is likely to look less alarming. When I looked at the inline retpoline
 implementation of gcc, it didn’t look much better than what I did - it
 basically just emits assembly instructions.
>>> 
>>> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
>> 
>> Although... how do you avoid matching on things that really don't want
>> this treatment?  paravirt ops come to mind.
> 
> Paravirt ops don't use 
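
[As a side note for readers, the two patching rules quoted in the cover
letter above can be sketched in C roughly as follows. The text_poke_1_byte()
and text_poke_multi() helpers are placeholder names, not the series' real
interfaces; synchronize_sched() is the RCU-sched grace-period primitive that
existed at the time.]

    struct relpoline_site;                              /* opaque in this sketch   */

    void text_poke_1_byte(struct relpoline_site *s);    /* placeholder             */
    void text_poke_multi(struct relpoline_site *s);     /* placeholder             */
    void synchronize_sched(void);                       /* RCU-sched grace period  */

    static void repatch_slow_path(struct relpoline_site *site)
    {
            /* Rule 1: while any CPU may still execute the site, change only a
             * single byte -- the conditional branch -- so that every CPU is
             * steered onto the fast path, away from the bytes edited below. */
            text_poke_1_byte(site);

            /* Rule 2: the slow path runs with preemption disabled, so after a
             * grace period no CPU can still be inside it ... */
            synchronize_sched();

            /* ... and its multi-byte body can now be rewritten safely. */
            text_poke_multi(site);
    }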

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-29 Thread Josh Poimboeuf
On Wed, Nov 28, 2018 at 10:06:52PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski  wrote:
> >
> >
> > On Nov 28, 2018, at 6:06 PM, Nadav Amit  wrote:
> >
> > >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski  wrote:
> > >>
> > >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  
> > >>> wrote:
> > >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
> > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  
> > > wrote:
> > >
> > >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >> This RFC introduces indirect call promotion at runtime, which for the
> > >> matter of simplification (and branding) will be called here 
> > >> "relpolines"
> > >> (relative call + trampoline). Relpolines are mainly intended as a way
> > >> of reducing retpoline overheads due to Spectre v2.
> > >>
> > >> Unlike indirect call promotion through profile guided optimization, 
> > >> the
> > >> proposed approach does not require a profiling stage, works well with
> > >> modules whose address is unknown and can adapt to changing workloads.
> > >>
> > >> The main idea is simple: for every indirect call, we inject a piece 
> > >> of
> > >> code with fast- and slow-path calls. The fast path is used if the 
> > >> target
> > >> matches the expected (hot) target. The slow-path uses a retpoline.
> > >> During training, the slow-path is set to call a function that saves 
> > >> the
> > >> call source and target in a hash-table and keeps a count of call
> > >> frequency. The most common target is then patched into the hot path.
> > >>
> > >> The patching is done on-the-fly by patching the conditional branch
> > >> (opcode and offset) that is used to compare the target to the hot
> > >> target. This allows directing all cores to the fast-path, while 
> > >> patching
> > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >> patch a single byte when the code might be executed by any core. (2)
> > >> When patching more than one byte, ensure that all cores do not run 
> > >> the
> > >> to-be-patched-code by preventing this code from being preempted, and
> > >> using synchronize_sched() after patching the branch that jumps over 
> > >> this
> > >> code.
> > >>
> > >> Changing all the indirect calls to use relpolines is done using 
> > >> assembly
> > >> macro magic. There are alternative solutions, but this one is
> > >> relatively simple and transparent. There is also logic to retrain the
> > >> software predictor, but the policy it uses may need to be refined.
> > >>
> > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>
> > >>              base    relpoline
> > >>              ----    ---------
> > >> nginx      22898    25178 (+10%)
> > >> redis-ycsb 24523    25486 (+4%)
> > >> dbench      2144     2103 (+2%)
> > >>
> > >> When retpolines are disabled, and if retraining is off, performance
> > >> benefits are up to 2% (nginx), but are much less impressive.
> > >
> > > Hi Nadav,
> > >
> > > Peter pointed me to these patches during a discussion about retpoline
> > > profiling.  Personally, I think this is brilliant.  This could help
> > > networking and filesystem intensive workloads a lot.
> > 
> >  Thanks! I was a bit held-back by the relatively limited number of 
> >  responses.
> > >>>
> > >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> > >>>
> >  I finished another version two weeks ago, and every day I think: 
> >  "should it
> >  be RFCv2 or v1”, ending up not sending it…
> > 
> >  There is one issue that I realized while working on the new version: 
> >  I’m not
> >  sure it is well-defined what an outline retpoline is allowed to do. The
> >  indirect branch promotion code can change rflags, which might cause
> >  correctness issues. In practice, using gcc, it is not a problem.
> > >>>
> > >>> Callees can clobber flags, so it seems fine to me.
> > >>
> > >> Just to check I understand your approach right: you made a macro
> > >> called "call", and you're therefore causing all instances of "call" to
> > >> become magic?  This is... terrifying.  It's even plausibly worse than
> > >> "#define if" :)  The scariest bit is that it will impact inline asm as
> > >> well.  Maybe a gcc plugin would be less alarming?
> > >
> > > It is likely to look less alarming. When I looked at the inline retpoline
> > > implementation of gcc, it didn’t look much better than what I did - it
> > > basically just emits assembly instructions.
> >
> > To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> 
> Although... how do you avoid matching on things that really don't want
> this treatment?  paravirt ops come 
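
[The training step quoted above -- record call source and target, count
frequencies, pick the most common target -- can be illustrated with a toy
table like the one below. The hashing, sizes, and names are purely
illustrative and not taken from the patches.]

    #include <stdint.h>
    #include <stddef.h>

    #define RELPOLINE_HASH_SIZE 64

    struct target_count {
            uintptr_t source;       /* address of the call site             */
            uintptr_t target;       /* observed indirect-call destination   */
            unsigned long count;    /* times this (source, target) was seen */
    };

    static struct target_count table[RELPOLINE_HASH_SIZE];

    static void record_call(uintptr_t source, uintptr_t target)
    {
            size_t i = (source ^ target) % RELPOLINE_HASH_SIZE;

            if (table[i].source != source || table[i].target != target) {
                    table[i].source = source;   /* simple overwrite on collision */
                    table[i].target = target;
                    table[i].count = 0;
            }
            table[i].count++;
    }

    static uintptr_t most_common_target(uintptr_t source)
    {
            uintptr_t best = 0;
            unsigned long best_count = 0;
            size_t i;

            for (i = 0; i < RELPOLINE_HASH_SIZE; i++)
                    if (table[i].source == source && table[i].count > best_count) {
                            best_count = table[i].count;
                            best = table[i].target;
                    }
            return best;            /* candidate to patch into the fast path */
    }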

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Andy Lutomirski
On Wed, Nov 28, 2018 at 7:24 PM Andy Lutomirski  wrote:
>
>
> On Nov 28, 2018, at 6:06 PM, Nadav Amit  wrote:
>
> >> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski  wrote:
> >>
> >>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  
> >>> wrote:
> >>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
> > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  wrote:
> >
> >> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >> This RFC introduces indirect call promotion at runtime, which for the
> >> matter of simplification (and branding) will be called here 
> >> "relpolines"
> >> (relative call + trampoline). Relpolines are mainly intended as a way
> >> of reducing retpoline overheads due to Spectre v2.
> >>
> >> Unlike indirect call promotion through profile guided optimization, the
> >> proposed approach does not require a profiling stage, works well with
> >> modules whose address is unknown and can adapt to changing workloads.
> >>
> >> The main idea is simple: for every indirect call, we inject a piece of
> >> code with fast- and slow-path calls. The fast path is used if the 
> >> target
> >> matches the expected (hot) target. The slow-path uses a retpoline.
> >> During training, the slow-path is set to call a function that saves the
> >> call source and target in a hash-table and keeps a count of call
> >> frequency. The most common target is then patched into the hot path.
> >>
> >> The patching is done on-the-fly by patching the conditional branch
> >> (opcode and offset) that is used to compare the target to the hot
> >> target. This allows directing all cores to the fast-path, while 
> >> patching
> >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >> patch a single byte when the code might be executed by any core. (2)
> >> When patching more than one byte, ensure that all cores do not run the
> >> to-be-patched-code by preventing this code from being preempted, and
> >> using synchronize_sched() after patching the branch that jumps over 
> >> this
> >> code.
> >>
> >> Changing all the indirect calls to use relpolines is done using 
> >> assembly
> >> macro magic. There are alternative solutions, but this one is
> >> relatively simple and transparent. There is also logic to retrain the
> >> software predictor, but the policy it uses may need to be refined.
> >>
> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >>
> >>              base    relpoline
> >>              ----    ---------
> >> nginx      22898    25178 (+10%)
> >> redis-ycsb 24523    25486 (+4%)
> >> dbench      2144     2103 (+2%)
> >>
> >> When retpolines are disabled, and if retraining is off, performance
> >> benefits are up to 2% (nginx), but are much less impressive.
> >
> > Hi Nadav,
> >
> > Peter pointed me to these patches during a discussion about retpoline
> > profiling.  Personally, I think this is brilliant.  This could help
> > networking and filesystem intensive workloads a lot.
> 
>  Thanks! I was a bit held-back by the relatively limited number of 
>  responses.
> >>>
> >>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
> >>>
>  I finished another version two weeks ago, and every day I think: "should 
>  it
>  be RFCv2 or v1”, ending up not sending it…
> 
>  There is one issue that I realized while working on the new version: I’m 
>  not
>  sure it is well-defined what an outline retpoline is allowed to do. The
>  indirect branch promotion code can change rflags, which might cause
>  correctness issues. In practice, using gcc, it is not a problem.
> >>>
> >>> Callees can clobber flags, so it seems fine to me.
> >>
> >> Just to check I understand your approach right: you made a macro
> >> called "call", and you're therefore causing all instances of "call" to
> >> become magic?  This is... terrifying.  It's even plausibly worse than
> >> "#define if" :)  The scariest bit is that it will impact inline asm as
> >> well.  Maybe a gcc plugin would be less alarming?
> >
> > It is likely to look less alarming. When I looked at the inline retpoline
> > implementation of gcc, it didn’t look much better than what I did - it
> > basically just emits assembly instructions.
>
> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”

Although... how do you avoid matching on things that really don't want
this treatment?  paravirt ops come to mind.


Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Josh Poimboeuf
On Wed, Nov 28, 2018 at 07:24:08PM -0800, Andy Lutomirski wrote:
> To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”
> 
> Hey Josh - you could potentially do the same hack to generate the
> static call tables. Take that, objtool.

Ha, after witnessing Nadav's glorious hack, I considered that.  But I
didn't see a way to pull it off, because asm macro conditionals don't
seem to have a way to test for a regex (or at least a named prefix) for
the call target.

I'd need a way to detect "call __static_call_tramp_".

-- 
Josh
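
[For readers unfamiliar with the static-call work referenced here: the
"static call tables" are, roughly, per-call-site records collected into a
dedicated section so the call instructions can be patched later. The sketch
below follows the shape of the static_call infrastructure that was eventually
merged into mainline, not necessarily the RFC being discussed.]

    /* One table entry per patchable call site (layout illustrative). */
    struct static_call_site {
            int addr;       /* self-relative offset of the call instruction */
            int key;        /* self-relative offset of the owning key       */
    };

    /* Entries accumulate in a dedicated ELF section (e.g. .static_call_sites)
     * and are walked at boot or module load to point every site at the
     * current target of its key. */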


Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Andy Lutomirski


On Nov 28, 2018, at 6:06 PM, Nadav Amit  wrote:

>> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski  wrote:
>> 
>>> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  wrote:
>>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  wrote:
> 
>> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>> This RFC introduces indirect call promotion at runtime, which for the
>> matter of simplification (and branding) will be called here "relpolines"
>> (relative call + trampoline). Relpolines are mainly intended as a way
>> of reducing retpoline overheads due to Spectre v2.
>> 
>> Unlike indirect call promotion through profile guided optimization, the
>> proposed approach does not require a profiling stage, works well with
>> modules whose address is unknown and can adapt to changing workloads.
>> 
>> The main idea is simple: for every indirect call, we inject a piece of
>> code with fast- and slow-path calls. The fast path is used if the target
>> matches the expected (hot) target. The slow-path uses a retpoline.
>> During training, the slow-path is set to call a function that saves the
>> call source and target in a hash-table and keeps a count of call
>> frequency. The most common target is then patched into the hot path.
>> 
>> The patching is done on-the-fly by patching the conditional branch
>> (opcode and offset) that is used to compare the target to the hot
>> target. This allows directing all cores to the fast-path, while patching
>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>> patch a single byte when the code might be executed by any core. (2)
>> When patching more than one byte, ensure that all cores do not run the
>> to-be-patched-code by preventing this code from being preempted, and
>> using synchronize_sched() after patching the branch that jumps over this
>> code.
>> 
>> Changing all the indirect calls to use relpolines is done using assembly
>> macro magic. There are alternative solutions, but this one is
>> relatively simple and transparent. There is also logic to retrain the
>> software predictor, but the policy it uses may need to be refined.
>> 
>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>> 
>>              base    relpoline
>>              ----    ---------
>> nginx      22898    25178 (+10%)
>> redis-ycsb 24523    25486 (+4%)
>> dbench      2144     2103 (+2%)
>> 
>> When retpolines are disabled, and if retraining is off, performance
>> benefits are up to 2% (nginx), but are much less impressive.
> 
> Hi Nadav,
> 
> Peter pointed me to these patches during a discussion about retpoline
> profiling.  Personally, I think this is brilliant.  This could help
> networking and filesystem intensive workloads a lot.
 
 Thanks! I was a bit held-back by the relatively limited number of 
 responses.
>>> 
>>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>>> 
 I finished another version two weeks ago, and every day I think: "should it
 be RFCv2 or v1”, ending up not sending it…
 
 There is one issue that I realized while working on the new version: I’m 
 not
 sure it is well-defined what an outline retpoline is allowed to do. The
 indirect branch promotion code can change rflags, which might cause
 correctness issues. In practice, using gcc, it is not a problem.
>>> 
>>> Callees can clobber flags, so it seems fine to me.
>> 
>> Just to check I understand your approach right: you made a macro
>> called "call", and you're therefore causing all instances of "call" to
>> become magic?  This is... terrifying.  It's even plausibly worse than
>> "#define if" :)  The scariest bit is that it will impact inline asm as
>> well.  Maybe a gcc plugin would be less alarming?
> 
> It is likely to look less alarming. When I looked at the inline retpoline
> implementation of gcc, it didn’t look much better than what I did - it
> basically just emits assembly instructions.

To be clear, that wasn’t a NAK.  It was merely a “this is alarming.”

Hey Josh - you could potentially do the same hack to generate the static call 
tables. Take that, objtool.

> 
> Anyhow, I'll look (again) into using gcc-plugins.
> 
 1. An indirect branch inside the BP handler might be the one we patch
>>> 
>>> I _think_ nested INT3s should be doable, because they don't use IST.
>>> Maybe Andy can clarify.
>> 
>> int3 should survive recursion these days.  Although I admit I'm
>> currently wondering what happens if one thread puts a kprobe on an
>> address that another thread tries to text_poke.
> 
> The issue I was referring to is having an indirect call *inside* the handler.
> For example, you try to patch the call to bp_int3_handler and then get an
> 

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Nadav Amit
> On Nov 28, 2018, at 5:40 PM, Andy Lutomirski  wrote:
> 
> On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  wrote:
>> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
 On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  wrote:
 
 On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion at runtime, which for the
> matter of simplification (and branding) will be called here "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
> 
> Unlike indirect call promotion through profile guided optimization, the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
> 
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves the
> call source and target in a hash-table and keeps a count of call
> frequency. The most common target is then patched into the hot path.
> 
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows directing all cores to the fast-path, while patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over this
> code.
> 
> Changing all the indirect calls to use relpolines is done using assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
> 
> Eventually the results are not bad (2 VCPU VM, throughput reported):
> 
>             base    relpoline
>             ----    ---------
> nginx      22898    25178 (+10%)
> redis-ycsb 24523    25486 (+4%)
> dbench      2144     2103 (+2%)
> 
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.
 
 Hi Nadav,
 
 Peter pointed me to these patches during a discussion about retpoline
 profiling.  Personally, I think this is brilliant.  This could help
 networking and filesystem intensive workloads a lot.
>>> 
>>> Thanks! I was a bit held-back by the relatively limited number of responses.
>> 
>> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>> 
>>> I finished another version two weeks ago, and every day I think: "should it
>>> be RFCv2 or v1”, ending up not sending it…
>>> 
>>> There is one issue that I realized while working on the new version: I’m not
>>> sure it is well-defined what an outline retpoline is allowed to do. The
>>> indirect branch promotion code can change rflags, which might cause
>>> correctness issues. In practice, using gcc, it is not a problem.
>> 
>> Callees can clobber flags, so it seems fine to me.
> 
> Just to check I understand your approach right: you made a macro
> called "call", and you're therefore causing all instances of "call" to
> become magic?  This is... terrifying.  It's even plausibly worse than
> "#define if" :)  The scariest bit is that it will impact inline asm as
> well.  Maybe a gcc plugin would be less alarming?

It is likely to look less alarming. When I looked at the inline retpoline
implementation of gcc, it didn’t look much better than what I did - it
basically just emits assembly instructions.

Anyhow, I'll look (again) into using gcc-plugins.

>>> 1. An indirect branch inside the BP handler might be the one we patch
>> 
>> I _think_ nested INT3s should be doable, because they don't use IST.
>> Maybe Andy can clarify.
> 
> int3 should survive recursion these days.  Although I admit I'm
> currently wondering what happens if one thread puts a kprobe on an
> address that another thread tries to text_poke.

The issue I was referring to is having an indirect call *inside* the handler.
For example, you try to patch the call to bp_int3_handler and then get an
int3. They can be annotated to prevent them from being patched. Then again,
I need to see how gcc plugins can get these annotations.

> 
> Also, this relpoline magic is likely to start patching text at runtime
> on a semi-regular basis.  This type of patching is *slow*.  Is it a
> problem?

It didn’t appear so. Although there are >1 indirect 

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Andy Lutomirski
On Wed, Nov 28, 2018 at 4:38 PM Josh Poimboeuf  wrote:
>
> On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
> > > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  wrote:
> > >
> > > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> > >> This RFC introduces indirect call promotion at runtime, which for the
> > >> matter of simplification (and branding) will be called here "relpolines"
> > >> (relative call + trampoline). Relpolines are mainly intended as a way
> > >> of reducing retpoline overheads due to Spectre v2.
> > >>
> > >> Unlike indirect call promotion through profile guided optimization, the
> > >> proposed approach does not require a profiling stage, works well with
> > >> modules whose address is unknown and can adapt to changing workloads.
> > >>
> > >> The main idea is simple: for every indirect call, we inject a piece of
> > >> code with fast- and slow-path calls. The fast path is used if the target
> > >> matches the expected (hot) target. The slow-path uses a retpoline.
> > >> During training, the slow-path is set to call a function that saves the
> > >> call source and target in a hash-table and keep count for call
> > >> frequency. The most common target is then patched into the hot path.
> > >>
> > >> The patching is done on-the-fly by patching the conditional branch
> > >> (opcode and offset) that is used to compare the target to the hot
> > >> target. This allows to direct all cores to the fast-path, while patching
> > >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> > >> patch a single byte when the code might be executed by any core. (2)
> > >> When patching more than one byte, ensure that all cores do not run the
> > >> to-be-patched-code by preventing this code from being preempted, and
> > >> using synchronize_sched() after patching the branch that jumps over this
> > >> code.
> > >>
> > >> Changing all the indirect calls to use relpolines is done using assembly
> > >> macro magic. There are alternative solutions, but this one is
> > >> relatively simple and transparent. There is also logic to retrain the
> > >> software predictor, but the policy it uses may need to be refined.
> > >>
> > >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> > >>
> > >>             base    relpoline
> > >>             -----------------------
> > >> nginx       22898   25178 (+10%)
> > >> redis-ycsb  24523   25486 (+4%)
> > >> dbench      2144    2103  (+2%)
> > >>
> > >> When retpolines are disabled, and if retraining is off, performance
> > >> benefits are up to 2% (nginx), but are much less impressive.
> > >
> > > Hi Nadav,
> > >
> > > Peter pointed me to these patches during a discussion about retpoline
> > > profiling.  Personally, I think this is brilliant.  This could help
> > > networking and filesystem intensive workloads a lot.
> >
> > Thanks! I was a bit held-back by the relatively limited number of responses.
>
> It is a rather, erm, ambitious idea, maybe they were speechless :-)
>
> > I finished another version two weeks ago, and every day I think: "should it
> > be RFCv2 or v1”, ending up not sending it…
> >
> > There is one issue that I realized while working on the new version: I’m not
> > sure it is well-defined what an outline retpoline is allowed to do. The
> > indirect branch promotion code can change rflags, which might cause
> > correction issues. In practice, using gcc, it is not a problem.
>
> Callees can clobber flags, so it seems fine to me.

Just to check I understand your approach right: you made a macro
called "call", and you're therefore causing all instances of "call" to
become magic?  This is... terrifying.  It's even plausibly worse than
"#define if" :)  The scariest bit is that it will impact inline asm as
well.  Maybe a gcc plugin would be less alarming?

> >
> > 1. An indirect branch inside the BP handler might be the one we patch
>
> I _think_ nested INT3s should be doable, because they don't use IST.
> Maybe Andy can clarify.

int3 should survive recursion these days.  Although I admit I'm
currently wondering what happens if one thread puts a kprobe on an
address that another thread tries to text_poke.

Also, this relpoline magic is likely to start patching text at runtime
on a semi-regular basis.  This type of patching is *slow*.  Is it a
problem?

> > 2. An indirect branch inside an interrupt or NMI handler might be the
> >one we patch
>
> But INT3s just use the existing stack, and NMIs support nesting, so I'm
> thinking that should also be doable.  Andy?
>

In principle, as long as the code isn't NOKPROBE_SYMBOL-ified, we
should be fine, right?  I'd be a little nervous if we get an int3 in
the C code that handles the early part of an NMI from user mode.  It's
*probably* okay, but one of the alarming issues is that the int3
return path will implicitly unmask NMI, which isn't fantastic.  Maybe
we finally need to dust off my old "return using RET" code to get rid

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Josh Poimboeuf
On Wed, Nov 28, 2018 at 07:34:52PM +, Nadav Amit wrote:
> > On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  wrote:
> > 
> > On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> >> This RFC introduces indirect call promotion in runtime, which for the
> >> matter of simplification (and branding) will be called here "relpolines"
> >> (relative call + trampoline). Relpolines are mainly intended as a way
> >> of reducing retpoline overheads due to Spectre v2.
> >> 
> >> Unlike indirect call promotion through profile guided optimization, the
> >> proposed approach does not require a profiling stage, works well with
> >> modules whose address is unknown and can adapt to changing workloads.
> >> 
> >> The main idea is simple: for every indirect call, we inject a piece of
> >> code with fast- and slow-path calls. The fast path is used if the target
> >> matches the expected (hot) target. The slow-path uses a retpoline.
> >> During training, the slow-path is set to call a function that saves the
> >> call source and target in a hash-table and keep count for call
> >> frequency. The most common target is then patched into the hot path.
> >> 
> >> The patching is done on-the-fly by patching the conditional branch
> >> (opcode and offset) that is used to compare the target to the hot
> >> target. This allows to direct all cores to the fast-path, while patching
> >> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> >> patch a single byte when the code might be executed by any core. (2)
> >> When patching more than one byte, ensure that all cores do not run the
> >> to-be-patched-code by preventing this code from being preempted, and
> >> using synchronize_sched() after patching the branch that jumps over this
> >> code.
> >> 
> >> Changing all the indirect calls to use relpolines is done using assembly
> >> macro magic. There are alternative solutions, but this one is
> >> relatively simple and transparent. There is also logic to retrain the
> >> software predictor, but the policy it uses may need to be refined.
> >> 
> >> Eventually the results are not bad (2 VCPU VM, throughput reported):
> >> 
> >>            base    relpoline
> >>            -----------------------
> >> nginx      22898   25178 (+10%)
> >> redis-ycsb 24523   25486 (+4%)
> >> dbench     2144    2103  (+2%)
> >> 
> >> When retpolines are disabled, and if retraining is off, performance
> >> benefits are up to 2% (nginx), but are much less impressive.
> > 
> > Hi Nadav,
> > 
> > Peter pointed me to these patches during a discussion about retpoline
> > profiling.  Personally, I think this is brilliant.  This could help
> > networking and filesystem intensive workloads a lot.
> 
> Thanks! I was a bit held-back by the relatively limited number of responses.

It is a rather, erm, ambitious idea, maybe they were speechless :-)

> I finished another version two weeks ago, and every day I think: "should it
> be RFCv2 or v1”, ending up not sending it…
> 
> There is one issue that I realized while working on the new version: I’m not
> sure it is well-defined what an outline retpoline is allowed to do. The
> indirect branch promotion code can change rflags, which might cause
> correction issues. In practice, using gcc, it is not a problem.

Callees can clobber flags, so it seems fine to me.

> > Some high-level comments:
> > 
> > - "Relpoline" looks confusingly a lot like "retpoline".  How about
> >  "optpoline"?  To avoid confusing myself I will hereafter refer to it
> >  as such :-)
> 
> Sure. For the academic paper we submitted, we used a much worse name that my
> colleague came up with. I’m ok with anything other than that name (not
> mentioned to prevent double-blinding violations). I’ll go with your name.
> 
> > - Instead of patching one byte at a time, is there a reason why
> >  text_poke_bp() can't be used?  That would greatly simplify the
> >  patching process, as everything could be patched in a single step.
> 
> I thought of it and maybe it is somehow possible, but there are several
> problems, for which I didn’t find a simple solution:
> 
> 1. An indirect branch inside the BP handler might be the one we patch

I _think_ nested INT3s should be doable, because they don't use IST.
Maybe Andy can clarify.

To avoid recursion we'd have to make sure not to use any function
pointers in do_int3() before or during the call to poke_int3_handler().

Incidentally I'm working on a patch which adds an indirect call to
poke_int3_handler().  We'd have to disable optpolines for that code.

> 2. An indirect branch inside an interrupt or NMI handler might be the
>one we patch

But INT3s just use the existing stack, and NMIs support nesting, so I'm
thinking that should also be doable.  Andy?

> 3. Overall, we need to patch more than a single instruction, and
>text_poke_bp() is not suitable

Hm, I suppose that's true.
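
One way to picture the single-byte rule from the cover letter: a JNE rel8
and a JMP rel8 differ only in the opcode byte (0x75 vs 0xEB), so flipping
that one byte is enough to steer every CPU away from (or back to) the fast
path while the multi-byte parts are rewritten under the preempt-disable +
synchronize_sched() dance. A toy userspace sketch that only edits a byte
buffer, not live kernel text:

#include <stdint.h>
#include <stdio.h>

/* Steer all CPUs to the slow path: JNE rel8 (0x75) becomes JMP rel8 (0xEB),
 * so the fast-path direct call can be rewritten with nobody executing it. */
static void steer_to_slow_path(uint8_t *branch)
{
	branch[0] = 0xEB;
}

/* Re-enable the fast path once its new target is in place. */
static void steer_to_fast_path(uint8_t *branch)
{
	branch[0] = 0x75;
}

int main(void)
{
	uint8_t jcc[2] = { 0x75, 0x05 };	/* jne over a 5-byte direct call */

	steer_to_slow_path(jcc);
	printf("opcode: %#x (jmp)\n", (unsigned)jcc[0]);
	steer_to_fast_path(jcc);
	printf("opcode: %#x (jne)\n", (unsigned)jcc[0]);
	return 0;
}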

> > - In many cases, a single direct call may not be sufficient, as there
> >   could be for example multiple tasks using different network protocols
> >   which need different callbacks for the same call site.

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Nadav Amit
> On Nov 28, 2018, at 8:08 AM, Josh Poimboeuf  wrote:
> 
> On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
>> This RFC introduces indirect call promotion in runtime, which for the
>> matter of simplification (and branding) will be called here "relpolines"
>> (relative call + trampoline). Relpolines are mainly intended as a way
>> of reducing retpoline overheads due to Spectre v2.
>> 
>> Unlike indirect call promotion through profile guided optimization, the
>> proposed approach does not require a profiling stage, works well with
>> modules whose address is unknown and can adapt to changing workloads.
>> 
>> The main idea is simple: for every indirect call, we inject a piece of
>> code with fast- and slow-path calls. The fast path is used if the target
>> matches the expected (hot) target. The slow-path uses a retpoline.
>> During training, the slow-path is set to call a function that saves the
>> call source and target in a hash-table and keep count for call
>> frequency. The most common target is then patched into the hot path.
>> 
>> The patching is done on-the-fly by patching the conditional branch
>> (opcode and offset) that is used to compare the target to the hot
>> target. This allows to direct all cores to the fast-path, while patching
>> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
>> patch a single byte when the code might be executed by any core. (2)
>> When patching more than one byte, ensure that all cores do not run the
>> to-be-patched-code by preventing this code from being preempted, and
>> using synchronize_sched() after patching the branch that jumps over this
>> code.
>> 
>> Changing all the indirect calls to use relpolines is done using assembly
>> macro magic. There are alternative solutions, but this one is
>> relatively simple and transparent. There is also logic to retrain the
>> software predictor, but the policy it uses may need to be refined.
>> 
>> Eventually the results are not bad (2 VCPU VM, throughput reported):
>> 
>>             base    relpoline
>>             -----------------------
>> nginx       22898   25178 (+10%)
>> redis-ycsb  24523   25486 (+4%)
>> dbench      2144    2103  (+2%)
>> 
>> When retpolines are disabled, and if retraining is off, performance
>> benefits are up to 2% (nginx), but are much less impressive.
> 
> Hi Nadav,
> 
> Peter pointed me to these patches during a discussion about retpoline
> profiling.  Personally, I think this is brilliant.  This could help
> networking and filesystem intensive workloads a lot.

Thanks! I was a bit held back by the relatively limited number of responses.
I finished another version two weeks ago, and every day I think "should it
be RFCv2 or v1?", and end up not sending it…

There is one issue that I realized while working on the new version: I’m not
sure it is well defined what an out-of-line retpoline is allowed to do. The
indirect branch promotion code can change rflags, which might cause
correctness issues. In practice, using gcc, it is not a problem.

> Some high-level comments:
> 
> - "Relpoline" looks confusingly a lot like "retpoline".  How about
>  "optpoline"?  To avoid confusing myself I will hereafter refer to it
>  as such :-)

Sure. For the academic paper we submitted, we used a much worse name that my
colleague came up with. I’m ok with anything other than that name (not
mentioned to prevent double-blinding violations). I’ll go with your name.

> - Instead of patching one byte at a time, is there a reason why
>  text_poke_bp() can't be used?  That would greatly simplify the
>  patching process, as everything could be patched in a single step.

I thought of it and maybe it is somehow possible, but there are several
problems, for which I didn’t find a simple solution:

1. An indirect branch inside the BP handler might be the one we patch

2. An indirect branch inside an interrupt or NMI handler might be the
   one we patch

3. Overall, we need to patch more than a single instruction, and
   text_poke_bp() is not suitable

> - In many cases, a single direct call may not be sufficient, as there
>  could be for example multiple tasks using different network protocols
>  which need different callbacks for the same call site.

We need to know at compile time how many targets to use. It is not
super-simple to support multiple inlined targets, but it is feasible if you
are willing to annotate the call sites where multiple targets are needed. We
have a version which uses outlined indirect branch promotion when there are
multiple targets, but it’s not ready for prime time, and the code-cache
misses can add some overhead.
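
To make the inlined case concrete, here is a minimal userspace C sketch of
what promotion to two known targets amounts to at the source level; the
series does this with patched cmp/jne/call sequences rather than
compiler-visible branches, and all names below are made up:

#include <stdio.h>

/* Two plausible targets for one indirect call site. */
static int tcp_handler(int x) { return x + 1; }
static int udp_handler(int x) { return x + 2; }

/* Promoted dispatch: try the known targets with direct calls first, and
 * fall back to the indirect (retpolined) call only on a miss. */
static int dispatch(int (*fn)(int), int x)
{
	if (fn == tcp_handler)
		return tcp_handler(x);	/* fast path #1: direct call */
	if (fn == udp_handler)
		return udp_handler(x);	/* fast path #2: direct call */
	return fn(x);			/* slow path: indirect call */
}

int main(void)
{
	printf("%d %d\n", dispatch(tcp_handler, 0), dispatch(udp_handler, 0));
	return 0;
}

This is the same transformation that PGO-based indirect call promotion does
statically; the point of the series is getting it at run time, without a
profiling build.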

> - I'm not sure about the periodic retraining logic, it seems a bit
>  nondeterministic and bursty.

I agree. It can be limited to cases in which modules are loaded/removed,
or when the user explicitly asks for it to take place.

> 
> So I'd propose the following changes:
> 
> - In the optpoline, reserve space for multiple (5 or so) comparisons and
>   direct calls.  Maybe the number of reserved cmp/jne/call slots can be
>   tweaked by the caller somehow.  Or maybe it could grow as needed.
>   Starting out, they would just be NOPs.

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-11-28 Thread Josh Poimboeuf
On Wed, Oct 17, 2018 at 05:54:15PM -0700, Nadav Amit wrote:
> This RFC introduces indirect call promotion in runtime, which for the
> matter of simplification (and branding) will be called here "relpolines"
> (relative call + trampoline). Relpolines are mainly intended as a way
> of reducing retpoline overheads due to Spectre v2.
> 
> Unlike indirect call promotion through profile guided optimization, the
> proposed approach does not require a profiling stage, works well with
> modules whose address is unknown and can adapt to changing workloads.
> 
> The main idea is simple: for every indirect call, we inject a piece of
> code with fast- and slow-path calls. The fast path is used if the target
> matches the expected (hot) target. The slow-path uses a retpoline.
> During training, the slow-path is set to call a function that saves the
> call source and target in a hash-table and keep count for call
> frequency. The most common target is then patched into the hot path.
> 
> The patching is done on-the-fly by patching the conditional branch
> (opcode and offset) that is used to compare the target to the hot
> target. This allows to direct all cores to the fast-path, while patching
> the slow-path and vice-versa. Patching follows 2 more rules: (1) Only
> patch a single byte when the code might be executed by any core. (2)
> When patching more than one byte, ensure that all cores do not run the
> to-be-patched-code by preventing this code from being preempted, and
> using synchronize_sched() after patching the branch that jumps over this
> code.
> 
> Changing all the indirect calls to use relpolines is done using assembly
> macro magic. There are alternative solutions, but this one is
> relatively simple and transparent. There is also logic to retrain the
> software predictor, but the policy it uses may need to be refined.
> 
> Eventually the results are not bad (2 VCPU VM, throughput reported):
> 
>              base    relpoline
>              -----------------------
> nginx        22898   25178 (+10%)
> redis-ycsb   24523   25486 (+4%)
> dbench       2144    2103  (+2%)
> 
> When retpolines are disabled, and if retraining is off, performance
> benefits are up to 2% (nginx), but are much less impressive.

Hi Nadav,

Peter pointed me to these patches during a discussion about retpoline
profiling.  Personally, I think this is brilliant.  This could help
networking and filesystem intensive workloads a lot.
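
As background for the comments below, the training scheme quoted above (the
slow path records which target each call site jumps to, and the most common
target is then promoted) can be modeled with a toy userspace sketch. This is
an illustration only, not the code from the series; the real thing keys a
hash table on the call-site address:

#include <stdint.h>
#include <stdio.h>

#define MAX_TARGETS 8

/* Per-call-site training state: a tiny stand-in for the hash table the
 * cover letter describes. */
struct site_stats {
	void     *target[MAX_TARGETS];
	uint64_t  count[MAX_TARGETS];
};

/* Slow-path hook: remember which target this call site used. */
static void record_call(struct site_stats *s, void *target)
{
	for (int i = 0; i < MAX_TARGETS; i++) {
		if (s->target[i] == target || !s->target[i]) {
			s->target[i] = target;
			s->count[i]++;
			return;
		}
	}
	/* table full: just drop the sample */
}

/* Candidate to patch into the fast path once training ends. */
static void *hottest_target(const struct site_stats *s)
{
	void *best = NULL;
	uint64_t best_count = 0;

	for (int i = 0; i < MAX_TARGETS && s->target[i]; i++) {
		if (s->count[i] > best_count) {
			best_count = s->count[i];
			best = s->target[i];
		}
	}
	return best;
}

int main(void)
{
	struct site_stats s = { 0 };
	int a, b;

	record_call(&s, &a);
	record_call(&s, &b);
	record_call(&s, &b);
	printf("hottest target: %p\n", hottest_target(&s));
	return 0;
}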

Some high-level comments:

- "Relpoline" looks confusingly a lot like "retpoline".  How about
  "optpoline"?  To avoid confusing myself I will hereafter refer to it
  as such :-)

- Instead of patching one byte at a time, is there a reason why
  text_poke_bp() can't be used?  That would greatly simplify the
  patching process, as everything could be patched in a single step.

- In many cases, a single direct call may not be sufficient, as there
  could be for example multiple tasks using different network protocols
  which need different callbacks for the same call site.

- I'm not sure about the periodic retraining logic, it seems a bit
  nondeterministic and bursty.
  
So I'd propose the following changes:

- In the optpoline, reserve space for multiple (5 or so) comparisons and
  direct calls.  Maybe the number of reserved cmp/jne/call slots can be
  tweaked by the caller somehow.  Or maybe it could grow as needed.
  Starting out, they would just be NOPs.

- Instead of the temporary learning mode, add permanent tracking to
  detect a direct call "miss" -- i.e., when none of the existing direct
  calls are applicable and the retpoline will be used.

- In the case of a miss (or N misses), it could trigger a direct call
  patching operation to be run later (workqueue or syscall exit).  If
  all the direct call slots are full, it could patch the least recently
  modified one.  If this causes thrashing (>x changes over y time), it
  could increase the number of direct call slots using a trampoline.
  Even if there were several slots, CPU branch prediction would
  presumably help make it much faster than a basic retpoline.
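
A rough userspace sketch of the shape this suggests, with made-up names; in
the kernel the slots would be patched cmp/jne/call instructions rather than
data, and the repatch would be deferred (workqueue or syscall exit) rather
than done here:

#include <stddef.h>
#include <stdint.h>

#define NR_SLOTS 5

typedef void (*target_fn)(void *);

/* One promoted call site: a few direct-call slots plus miss tracking. */
struct optpoline {
	target_fn slot[NR_SLOTS];	/* promoted targets, NULL = unused */
	uint64_t  miss;			/* calls that fell through to the retpoline */
};

static void optpoline_call(struct optpoline *op, target_fn fn, void *arg)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		if (op->slot[i] == fn) {
			op->slot[i](arg);	/* slot hit: direct call */
			return;
		}
	}

	/* Miss: count it; past some threshold a real implementation would
	 * queue a patching operation to install fn, evicting the least
	 * recently modified slot if all of them are taken. */
	op->miss++;
	fn(arg);			/* slow path: retpolined indirect call */
}

static void noop(void *arg) { (void)arg; }

int main(void)
{
	struct optpoline op = { .slot = { noop } };

	optpoline_call(&op, noop, NULL);	/* hits slot 0 */
	return (int)op.miss;
}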

Thoughts?

-- 
Josh


Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-10-23 Thread Dave Hansen
On 10/23/18 1:32 PM, Nadav Amit wrote:
>> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>>             base    relpoline
>>>             -----------------------
>>> nginx       22898   25178 (+10%)
>>> redis-ycsb  24523   25486 (+4%)
>>> dbench      2144    2103  (+2%)
>> Just out of curiosity, which indirect branches are the culprits here for
>> causing the slowdowns?
> So I didn’t try to measure exactly which one. There are roughly 500 that
> actually “run” in my tests.

OK, cool, that's pretty much all I wanted to know, just that there
aren't 3 of them or something for which we need all this infrastructure.


Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-10-23 Thread Nadav Amit
at 11:36 AM, Dave Hansen  wrote:

> On 10/17/18 5:54 PM, Nadav Amit wrote:
>>             base    relpoline
>>             -----------------------
>> nginx       22898   25178 (+10%)
>> redis-ycsb  24523   25486 (+4%)
>> dbench      2144    2103  (+2%)
> 
> Just out of curiosity, which indirect branches are the culprits here for
> causing the slowdowns?

So I didn’t try to measure exactly which ones. There are roughly 500 that
actually “run” in my tests. Initially, I took the silly approach of trying
to patch the C source code using semi-automatically generated Coccinelle
scripts, so I can tell you it is not just a few branches but many. The
network stack is full of function pointers (e.g., tcp_congestion_ops,
tcp_sock_af_ops, dst_ops). The filesystem layer also uses many function
pointers (file_operations specifically). Compound pages have a destructor,
and so on.

If you want, you can rebuild the kernel without retpolines and run

  perf record -e br_inst_exec.taken_indirect_near_call:k (your workload)

For some reason I didn’t manage to use PEBS (:ppp) from either the guest or
the host, so my results are a bit skewed (i.e., the sampled location is
usually after the call was taken). Running dbench in the VM gives me the
following “hot-spots”:

# Samples: 304  of event 'br_inst_exec.taken_indirect_near_call'
# Event count (approx.): 60800912
#
# Overhead  Command   Shared Object            Symbol
# ........  ........  .......................  ...................................
#
 5.26%  :197970  [guest.kernel.kallsyms]  [g] __fget_light
 4.28%  :197969  [guest.kernel.kallsyms]  [g] __fget_light
 3.95%  :197969  [guest.kernel.kallsyms]  [g] dcache_readdir
 3.29%  :197970  [guest.kernel.kallsyms]  [g] next_positive.isra.14
 2.96%  :197970  [guest.kernel.kallsyms]  [g] __do_sys_kill
 2.30%  :197970  [guest.kernel.kallsyms]  [g] apparmor_file_open
 1.97%  :197969  [guest.kernel.kallsyms]  [g] __do_sys_kill
 1.97%  :197969  [guest.kernel.kallsyms]  [g] next_positive.isra.14
 1.97%  :197970  [guest.kernel.kallsyms]  [g] _raw_spin_lock
 1.64%  :197969  [guest.kernel.kallsyms]  [g] __alloc_file
 1.64%  :197969  [guest.kernel.kallsyms]  [g] common_file_perm
 1.64%  :197969  [guest.kernel.kallsyms]  [g] filldir
 1.64%  :197970  [guest.kernel.kallsyms]  [g] do_dentry_open
 1.64%  :197970  [guest.kernel.kallsyms]  [g] kmem_cache_free
 1.32%  :197969  [guest.kernel.kallsyms]  [g] __raw_callee_save___pv_queued_spin_unlock
 1.32%  :197969  [guest.kernel.kallsyms]  [g] __slab_free

Regards,
Nadav

Re: [RFC PATCH 0/5] x86: dynamic indirect call promotion

2018-10-23 Thread Dave Hansen
On 10/17/18 5:54 PM, Nadav Amit wrote:
>              base    relpoline
>              -----------------------
> nginx        22898   25178 (+10%)
> redis-ycsb   24523   25486 (+4%)
> dbench       2144    2103  (+2%)

Just out of curiosity, which indirect branches are the culprits here for
causing the slowdowns?

