Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-11 Thread Qing Zhao via Gcc-patches
Hi, Alexandre,

CC’ing Richard for his comments on this.


> On Aug 10, 2020, at 9:39 PM, Alexandre Oliva  wrote:
>> I think that moving how to zeroing the registers part to each target
>> will be a better solution since each target has
>> Better idea on how to use the most efficient insns to do the work.
> 
> It's certainly good to allow machine-specific optimized code sequences,
> but it would certainly be desirable to have a machine-independent
> fallback.  It doesn't seem exceedingly hard to loop over the registers
> and emit a (set (reg:M N) (const_int 0)) for each one that is to be
> zeroed out.

The current implementation already includes such machine-independent code, it 
should be very easy to add this.

Richard, what’s your opinion on this?
Do we need a machine-independent implementation to zeroing the registers for 
the default when the target does not provide a optimized
Implementation?

Thanks.

Qing

> 
> 



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-19 Thread Qing Zhao via Gcc-patches
Hi,

Based on all the previous discussion and more extensive study on ROP and its 
mitigation techniques these days, I came up with the following
High-level proposal as requested, please take a look and let me know what I 
should change in this high-level design:

> On Aug 6, 2020, at 6:37 PM, Segher Boessenkool  
> wrote:
> 
> Anyway.  This all needs a good description in the user manual (is there?
> I couldn't find any), explaining what exactly it does (user-visible),
> and when you would want to use it, etc.  We need that before we can
> review anything else in this patch sanely.
> 
> 
> Segher

zeroing call-used registers for security purpose

8/19/2020
Qing Zhao
=

**Motivation:
There are two purposes of this patch:

1. ROP mitigation:

ROP (Return-oriented programming, 
https://en.wikipedia.org/wiki/Return-oriented_programming) is 
one of the most popular code reuse attack technique, which executes gadget 
chains to perform malicious tasks.
A gadget is a carefully chosen machine instruction sequence that is already 
present in the machines' memory. 
Each gadget typically ends in a return instruction and is located in a 
subroutine within the existing program 
and/or shared library code.

There are two variations that use gadgets that end with indirect call (COP, 
Call Oriented Programming )
 and jump instruction (JOP, Jump-Oriented Programming). However, performing ROP 
without return 
instructions in reality is difficult because the gadgets of COP and JOP that 
can form a completed chain 
are almost nonexistent. 

As a result, gadgets based on return instructions remain the most popular.

One important feature of ROP attack is (Clean the Scratch Registers:A Way to 
Mitigate Return-Oriented
Programming Attacks https://ieeexplore.ieee.org/document/8445132):
the destination of using gadget chains usually call system functions to perform 
malicious behaviour,
on many of the mordern architectures, the registers would be used to pass 
parameters for those 
system functions.

So, cleaning the scratch registers that are used to pass parameters at return 
instructions should 
effectively mitigate ROP attack. 

2. Register Erasure:

In the SECURE project and GCC (https://gcc.gnu.org/wiki/cauldron2018#secure)

One of the well known security techniques is stack and register erasure. 
Ensuring that on return from a function, no data is left lying on the stack or 
in registers.

As mentioned in the slides 
(https://gmarkall.files.wordpress.com/2018/09/secure_and_gcc.pdf), 
there is a seperate project that tried to resolve the stack erasure problem. 
and the patch for
 stack erasure had been ready to submit. That specific patch does not handle 
register erasure problem. 

So, we will also address the register erasure problem with this patch along 
with the ROP mitigation. 

** Questions and Answers:

Q1. Which registers should be set to zeros at the return of the function?
A. the caller-saved, i.e, call-used, or call-clobbered registers.
   For ROP mitigation purpose, only the call-used registers that pass
parameters need to be zeroed. 
   For register erasure purpose, all the call-used registers might need to
be zeroed. we can provide multiple levels to user for controling the runtime
overhead. 

Q2. Why zeroing the registers other than randomalize them?
A. (From Kees Cook)
In the performance analysis I looked at a while ago, doing the
register-self-xor is extremely fast to run (IIRC the cycle counts on x86
were absolutely tiny), and it's smaller for code size which minimized
the overall image footprint.
a fixed value is a significantly better defensive position to take
for ROP. And specifically zero _tends_ to be the safest choice as it's
less "useful" to be used as a size, index, or pointer. And, no, it is
not perfect, but nothing can be if we're dealing with trying to defend
against arbitrary ROP gadget finding (or uninitialized stack contents,
where the same argument for "zero is best" also holds[1]).

-Kees
([1]https://lists.llvm.org/pipermail/cfe-dev/2020-April/065221.html)

So, from both run-time performance and code-size aspects, setting the
registers to zero is a better approach. 

** Proposal:

We will provide a new feature into GCC for the above security purposes. 

Add -fzero-call-used-regs=[skip|rop-mitigation|used-gpr|all-gpr|used|all] 
command-line option
and 
zero_call_used_regs("skip|used-arg-gpr|used-arg|arg|used-gpr|all-gpr|used|all") 
function attribues:

1. -mzero-call-used-regs=skip and zero_call_used_regs("skip")

Don't zero call-used registers upon function return. This is the default 
behavior.

2. -mzero-call-used-regs=used-arg-gpr and 
zero_call_used_regs("used-arg-gpr")

Zero used call-used general purpose registers that are used to pass 
parameters upon function return.

3. -mzero-call-used-regs=used-arg and zero_call_used_regs("used-arg")

Zero used call-used registers that are used to pass parameters upo

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-19 Thread Segher Boessenkool
Hi!

On Wed, Aug 19, 2020 at 03:05:36PM -0500, Qing Zhao wrote:
> So, cleaning the scratch registers that are used to pass parameters at return 
> instructions should 
> effectively mitigate ROP attack. 

But that is *very* expensive, in general.  Instead of doing just a
return instruction (which effectively costs 0 cycles, and is just one
insn), you now have to zero all call-clobbered register at every return
(typically many returns per function, and you are talking 10+ registers
even if only considering the simple integer registers).

Numbers on how expensive this is (for what arch, in code size and in
execution time) would be useful.  If it is so expensive that no one will
use it, it helps security at most none at all :-(

> Q1. Which registers should be set to zeros at the return of the function?
> A. the caller-saved, i.e, call-used, or call-clobbered registers.
>For ROP mitigation purpose, only the call-used registers that pass
> parameters need to be zeroed. 
>For register erasure purpose, all the call-used registers might need to
> be zeroed. we can provide multiple levels to user for controling the runtime
> overhead. 

The call-clobbered regs are the only ones you *can* touch.  That does
not mean you should clear them all (it doesn't help much at all in some
cases).  Only the backend knows.

> So, from both run-time performance and code-size aspects, setting the
> registers to zero is a better approach. 

>From a security perspective, this isn't clear though.  But that is a lot
of extra research ;-)


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-19 Thread Qing Zhao via Gcc-patches



> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool  
> wrote:
> 
> Hi!
> 
> On Wed, Aug 19, 2020 at 03:05:36PM -0500, Qing Zhao wrote:
>> So, cleaning the scratch registers that are used to pass parameters at 
>> return instructions should 
>> effectively mitigate ROP attack. 
> 
> But that is *very* expensive, in general.  Instead of doing just a
> return instruction (which effectively costs 0 cycles, and is just one
> insn), you now have to zero all call-clobbered register at every return
> (typically many returns per function, and you are talking 10+ registers
> even if only considering the simple integer registers).

Yes, the run-time overhead and also the code-size overhead are major concerns. 
We should minimize the overhead
as much as we can during implementation. However, such overhead cannot be 
completely avoided for the security purpose. 

In order to reduce the overhead for the ROP mitigation, I added 3 new values 
for -fzero-call-used-regs=used-arg-grp|used-arg|arg

For “used-arg-grp”, we only zero the integer registers that are used in the 
routine and can pass parameters; this should provide ROP mitigation
with the minimum overhead. 

For “used-arg”, in addition to “used-arg-grp”, the other registers (for 
example, FP registers) that can pass parameters will be zeroed. But I am not
very sure whether this option is really needed in practical. 

For “arg”, in addition to “used-arg”, all registers that pass parameters will 
be zeroed. Same as “used-arg”, I am not very sure whether we need this option
Or not. 

> 
> Numbers on how expensive this is (for what arch, in code size and in
> execution time) would be useful.  If it is so expensive that no one will
> use it, it helps security at most none at all :-(

CLEAR Linux project has been using a similar patch since GCC 8, the option it 
used is an equivalent to -fzero-call-used-regs=used-gpr.

-fzero-call-used-regs=used-arg-gpr in this new proposal will have smaller 
overhead than the one currently being used in CLEAR Linux.

Victor, do you have any data on the overhead of the option that currently is 
used by CLEAR project?

> 
>> Q1. Which registers should be set to zeros at the return of the function?
>> A. the caller-saved, i.e, call-used, or call-clobbered registers.
>>   For ROP mitigation purpose, only the call-used registers that pass
>> parameters need to be zeroed. 
>>   For register erasure purpose, all the call-used registers might need to
>> be zeroed. we can provide multiple levels to user for controling the runtime
>> overhead. 
> 
> The call-clobbered regs are the only ones you *can* touch.  That does
> not mean you should clear them all (it doesn't help much at all in some
> cases).  Only the backend knows.

I think that for ROP mitigation purpose, we only need to clear the call-used 
(i.e, call-clobbered) registers that are used in the current routine and
can pass parameters. 

But for preventing information leak from callee registers, we might need to 
clear all the call-used registers at return.


> 
>>So, from both run-time performance and code-size aspects, setting the
>> registers to zero is a better approach. 
> 
> From a security perspective, this isn't clear though.  But that is a lot
> of extra research ;-)

There has been quite some discussion on this topic at

https://lists.llvm.org/pipermail/cfe-dev/2020-April/065221.html 


From those old discussion, we can see that zero value should be good enough for 
the security purpose (though it’s not perfect).

Qing

> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Richard Biener
On Tue, 11 Aug 2020, Qing Zhao wrote:

> Hi, Alexandre,
> 
> CC’ing Richard for his comments on this.
> 
> 
> > On Aug 10, 2020, at 9:39 PM, Alexandre Oliva  wrote:
> >> I think that moving how to zeroing the registers part to each target
> >> will be a better solution since each target has
> >> Better idea on how to use the most efficient insns to do the work.
> > 
> > It's certainly good to allow machine-specific optimized code sequences,
> > but it would certainly be desirable to have a machine-independent
> > fallback.  It doesn't seem exceedingly hard to loop over the registers
> > and emit a (set (reg:M N) (const_int 0)) for each one that is to be
> > zeroed out.
> 
> The current implementation already includes such machine-independent code, it 
> should be very easy to add this.
> 
> Richard, what’s your opinion on this?
> Do we need a machine-independent implementation to zeroing the registers for 
> the default when the target does not provide a optimized
> Implementation?

Well, at least silently doing nothing when the option is used would be 
bad.  So at least a diagnostic would be required.  Note since the
option is quite elaborate on what (sub-)set of regs is supposed to be
cleared I'm not sure an implementation not involving any target hook
is possible?

Richard.

> Thanks.
> 
> Qing
> 
> > 
> > 
> 
> 

-- 
Richard Biener 
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Rodriguez Bahena, Victor via Gcc-patches


-Original Message-
From: Segher Boessenkool 
Date: Wednesday, August 19, 2020 at 5:58 PM
To: Qing Zhao 
Cc: Richard Biener , Jeff Law , 
Uros Bizjak , "H. J. Lu" , Jakub 
Jelinek , GCC Patches , Kees Cook 
, "Rodriguez Bahena, Victor" 

Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

Hi!

On Wed, Aug 19, 2020 at 03:05:36PM -0500, Qing Zhao wrote:
> So, cleaning the scratch registers that are used to pass parameters at 
return instructions should 
> effectively mitigate ROP attack. 

But that is *very* expensive, in general.  Instead of doing just a
return instruction (which effectively costs 0 cycles, and is just one
insn), you now have to zero all call-clobbered register at every return
(typically many returns per function, and you are talking 10+ registers
even if only considering the simple integer registers).

Numbers on how expensive this is (for what arch, in code size and in
execution time) would be useful.  If it is so expensive that no one will
use it, it helps security at most none at all :-(

It is used in some operating systems and packages such as 

https://github.com/clearlinux-pkgs/gettext/blob/master/gettext.spec#L138

export CFLAGS="$CFLAGS -O3 -ffat-lto-objects -flto=4 -fstack-protector-strong 
-mzero-caller-saved-regs=used "

There is no record that this flag creates a considerable penalty in execution 
time.

> Q1. Which registers should be set to zeros at the return of the function?
> A. the caller-saved, i.e, call-used, or call-clobbered registers.
>For ROP mitigation purpose, only the call-used registers that pass
> parameters need to be zeroed. 
>For register erasure purpose, all the call-used registers might need to
> be zeroed. we can provide multiple levels to user for controling the 
runtime
> overhead. 

The call-clobbered regs are the only ones you *can* touch.  That does
not mean you should clear them all (it doesn't help much at all in some
cases).  Only the backend knows.

> So, from both run-time performance and code-size aspects, setting the
> registers to zero is a better approach. 

From a security perspective, this isn't clear though.  But that is a lot
of extra research ;-)

The paper from IEEE provide a clear example on how to use mzero-caller

I think the patch has a solid background and there are multiple projects that 
highlight the importance of cleaning as technique to prevent security issues in 
ROP attacks

Regards

Victor Rodriguez



Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Rodriguez Bahena, Victor via Gcc-patches


From: Qing Zhao 
Date: Wednesday, August 19, 2020 at 6:28 PM
To: Segher Boessenkool , "Rodriguez Bahena, Victor" 

Cc: Richard Biener , Jeff Law , 
Uros Bizjak , "H. J. Lu" , Jakub 
Jelinek , GCC Patches , Kees Cook 

Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]




On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
mailto:seg...@kernel.crashing.org>> wrote:

Hi!

On Wed, Aug 19, 2020 at 03:05:36PM -0500, Qing Zhao wrote:

So, cleaning the scratch registers that are used to pass parameters at return 
instructions should
effectively mitigate ROP attack.

But that is *very* expensive, in general.  Instead of doing just a
return instruction (which effectively costs 0 cycles, and is just one
insn), you now have to zero all call-clobbered register at every return
(typically many returns per function, and you are talking 10+ registers
even if only considering the simple integer registers).

Yes, the run-time overhead and also the code-size overhead are major concerns. 
We should minimize the overhead
as much as we can during implementation. However, such overhead cannot be 
completely avoided for the security purpose.

In order to reduce the overhead for the ROP mitigation, I added 3 new values 
for -fzero-call-used-regs=used-arg-grp|used-arg|arg

For “used-arg-grp”, we only zero the integer registers that are used in the 
routine and can pass parameters; this should provide ROP mitigation
with the minimum overhead.

For “used-arg”, in addition to “used-arg-grp”, the other registers (for 
example, FP registers) that can pass parameters will be zeroed. But I am not
very sure whether this option is really needed in practical.

For “arg”, in addition to “used-arg”, all registers that pass parameters will 
be zeroed. Same as “used-arg”, I am not very sure whether we need this option
Or not.


Numbers on how expensive this is (for what arch, in code size and in
execution time) would be useful.  If it is so expensive that no one will
use it, it helps security at most none at all :-(

CLEAR Linux project has been using a similar patch since GCC 8, the option it 
used is an equivalent to -fzero-call-used-regs=used-gpr.

-fzero-call-used-regs=used-arg-gpr in this new proposal will have smaller 
overhead than the one currently being used in CLEAR Linux.

Victor, do you have any data on the overhead of the option that currently is 
used by CLEAR project?


This is a quick list of packages compiled with similar flag as you mention

https://gist.github.com/bryteise/f3469f318e82c626d20a83f557d897a2

The spec files can be located at:

https://github.com/clearlinux-pkgs

I don’t have any data on the overhead, the patch as you mention was implemented 
since GCC8 (2018) . The distro has been measure by community since then. I was 
looking for any major drop detected by community after this patches but I was 
not able to find it.

Maybe it will be worth to ask in the Clear Linux community project mailing list

Regards

Victor Rodriguez


Q1. Which registers should be set to zeros at the return of the function?
A. the caller-saved, i.e, call-used, or call-clobbered registers.
  For ROP mitigation purpose, only the call-used registers that pass
parameters need to be zeroed.
  For register erasure purpose, all the call-used registers might need to
be zeroed. we can provide multiple levels to user for controling the runtime
overhead.

The call-clobbered regs are the only ones you *can* touch.  That does
not mean you should clear them all (it doesn't help much at all in some
cases).  Only the backend knows.

I think that for ROP mitigation purpose, we only need to clear the call-used 
(i.e, call-clobbered) registers that are used in the current routine and
can pass parameters.

But for preventing information leak from callee registers, we might need to 
clear all the call-used registers at return.





   So, from both run-time performance and code-size aspects, setting the
registers to zero is a better approach.

From a security perspective, this isn't clear though.  But that is a lot
of extra research ;-)

There has been quite some discussion on this topic at

https://lists.llvm.org/pipermail/cfe-dev/2020-April/065221.html

From those old discussion, we can see that zero value should be good enough for 
the security purpose (though it’s not perfect).

Qing



I saw the same discussion on latest ELC/OSSNA conference this year by LLVM 
community. The flag is getting a lot of attraction

Regards

Victor Rodriguez



Segher




Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Qing Zhao via Gcc-patches



> On Aug 24, 2020, at 5:50 AM, Richard Biener  wrote:
> 
> On Tue, 11 Aug 2020, Qing Zhao wrote:
> 
>> Hi, Alexandre,
>> 
>> CC’ing Richard for his comments on this.
>> 
>> 
>>> On Aug 10, 2020, at 9:39 PM, Alexandre Oliva  wrote:
 I think that moving how to zeroing the registers part to each target
 will be a better solution since each target has
 Better idea on how to use the most efficient insns to do the work.
>>> 
>>> It's certainly good to allow machine-specific optimized code sequences,
>>> but it would certainly be desirable to have a machine-independent
>>> fallback.  It doesn't seem exceedingly hard to loop over the registers
>>> and emit a (set (reg:M N) (const_int 0)) for each one that is to be
>>> zeroed out.
>> 
>> The current implementation already includes such machine-independent code, 
>> it should be very easy to add this.
>> 
>> Richard, what’s your opinion on this?
>> Do we need a machine-independent implementation to zeroing the registers for 
>> the default when the target does not provide a optimized
>> Implementation?
> 
> Well, at least silently doing nothing when the option is used would be 
> bad.  So at least a diagnostic would be required.

Yes, this is the current behavior in the current implementation. 
>  Note since the
> option is quite elaborate on what (sub-)set of regs is supposed to be
> cleared I'm not sure an implementation not involving any target hook
> is possible?

Agreed.

Thanks 

Qing
> 
> Richard.
> 
>> Thanks.
>> 
>> Qing
>> 
>>> 
>>> 
>> 
>> 
> 
> -- 
> Richard Biener mailto:rguent...@suse.de>>
> SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
> Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Segher Boessenkool
On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
> > On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
> >  wrote:
> > Numbers on how expensive this is (for what arch, in code size and in
> > execution time) would be useful.  If it is so expensive that no one will
> > use it, it helps security at most none at all :-(

Without numbers on this, no one can determine if it is a good tradeoff
for them.  And we (the GCC people) cannot know if it will be useful for
enough users that it will be worth the effort for us.  Which is why I
keep hammering on this point.

(The other side of the coin is how much this helps prevent exploitation;
numbers on that would be good to see, too.)

> >>So, from both run-time performance and code-size aspects, setting the
> >> registers to zero is a better approach. 
> > 
> > From a security perspective, this isn't clear though.  But that is a lot
> > of extra research ;-)
> 
> There has been quite some discussion on this topic at
> 
> https://lists.llvm.org/pipermail/cfe-dev/2020-April/065221.html 
> 
> 
> From those old discussion, we can see that zero value should be good enough 
> for the security purpose (though it’s not perfect).

And there has been zero proof or even any arguments from the security
angle, only "anything other than 0 is too expensive", which isn't
obviously true either (it isn't even cheaper than other small numbers,
on many archs).

A large fraction of function arguments is zero in valid executions, so
zeroing them out to try to prevent exploitation attempts might not help
so much.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Segher Boessenkool
[ Please quote correctly.  I fixed this up a bit. ]

On Mon, Aug 24, 2020 at 02:47:22PM +, Rodriguez Bahena, Victor wrote:
> > The call-clobbered regs are the only ones you *can* touch.  That does
> > not mean you should clear them all (it doesn't help much at all in some
> > cases).  Only the backend knows.
> 
> I think that for ROP mitigation purpose, we only need to clear the call-used 
> (i.e, call-clobbered) registers that are used in the current routine and
> can pass parameters.

Which is more than you *can* do as well (consider return value registers
for example; there are more cases, in general; only the backend code can
know what is safe to do).


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Qing Zhao via Gcc-patches



> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool  
> wrote:
> 
> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
>>> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>>>  wrote:
>>> Numbers on how expensive this is (for what arch, in code size and in
>>> execution time) would be useful.  If it is so expensive that no one will
>>> use it, it helps security at most none at all :-(
> 
> Without numbers on this, no one can determine if it is a good tradeoff
> for them.  And we (the GCC people) cannot know if it will be useful for
> enough users that it will be worth the effort for us.  Which is why I
> keep hammering on this point.
I can collect some run-time overhead data on this, do you have a recommendation 
on what test suite I can use
For this testing? (Is CPU2017 good enough)?

> 
> (The other side of the coin is how much this helps prevent exploitation;
> numbers on that would be good to see, too.)

This can be well showed from the paper:

"Clean the Scratch Registers: A Way to Mitigate Return-Oriented Programming 
Attacks"

https://ieeexplore.ieee.org/document/8445132 


Please take a look at this paper. 

> 
   So, from both run-time performance and code-size aspects, setting the
 registers to zero is a better approach. 
>>> 
>>> From a security perspective, this isn't clear though.  But that is a lot
>>> of extra research ;-)
>> 
>> There has been quite some discussion on this topic at
>> 
>> https://urldefense.com/v3/__https://lists.llvm.org/pipermail/cfe-dev/2020-April/065221.html__;!!GqivPVa7Brio!PFjWvu3miQeS8XQehbw1moYxXTbbRvu9MTbjQxtxad_YQQGSdZg97Dl8-c2w5Y32$
>>   
>> >  >
>> 
>> From those old discussion, we can see that zero value should be good enough 
>> for the security purpose (though it’s not perfect).
> 
> And there has been zero proof or even any arguments from the security
> angle, only "anything other than 0 is too expensive", which isn't
> obviously true either (it isn't even cheaper than other small numbers,
> on many archs).
> 
> A large fraction of function arguments is zero in valid executions, so
> zeroing them out to try to prevent exploitation attempts might not help
> so much.

Please take a look at the paper:
"Clean the Scratch Registers: A Way to Mitigate Return-Oriented Programming 
Attacks"

https://ieeexplore.ieee.org/document/8445132 


From the study, zeroing out the registers mitigate the ROP very well.

thanks.

Qing



> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Qing Zhao via Gcc-patches



> On Aug 24, 2020, at 12:59 PM, Segher Boessenkool  
> wrote:
> 
> [ Please quote correctly.  I fixed this up a bit. ]
> 
> On Mon, Aug 24, 2020 at 02:47:22PM +, Rodriguez Bahena, Victor wrote:
>>> The call-clobbered regs are the only ones you *can* touch.  That does
>>> not mean you should clear them all (it doesn't help much at all in some
>>> cases).  Only the backend knows.
>> 
>> I think that for ROP mitigation purpose, we only need to clear the call-used 
>> (i.e, call-clobbered) registers that are used in the current routine and
>> can pass parameters.
> 
> Which is more than you *can* do as well (consider return value registers
> for example; there are more cases, in general; only the backend code can
> know what is safe to do).

Yes, So, we agreed to move the code generation implementation part into backend.

In Middle-end, we will only compute the hard register set based on call abi 
information and data flow information, also handle the command line option.

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Segher Boessenkool
Hi!

On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
> > On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
> >  wrote:
> > On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
> >>> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
> >>>  wrote:
> >>> Numbers on how expensive this is (for what arch, in code size and in
> >>> execution time) would be useful.  If it is so expensive that no one will
> >>> use it, it helps security at most none at all :-(
> > 
> > Without numbers on this, no one can determine if it is a good tradeoff
> > for them.  And we (the GCC people) cannot know if it will be useful for
> > enough users that it will be worth the effort for us.  Which is why I
> > keep hammering on this point.
> I can collect some run-time overhead data on this, do you have a 
> recommendation on what test suite I can use
> For this testing? (Is CPU2017 good enough)?

I would use something more real-life, not 12 small pieces of code.

> > (The other side of the coin is how much this helps prevent exploitation;
> > numbers on that would be good to see, too.)
> 
> This can be well showed from the paper:
> 
> "Clean the Scratch Registers: A Way to Mitigate Return-Oriented Programming 
> Attacks"
> 
> https://ieeexplore.ieee.org/document/8445132 
> 
> 
> Please take a look at this paper. 

As I told you before, that isn't open information, I cannot reply to
any of that.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Segher Boessenkool
On Mon, Aug 24, 2020 at 01:48:02PM -0500, Qing Zhao wrote:
> 
> 
> > On Aug 24, 2020, at 12:59 PM, Segher Boessenkool 
> >  wrote:
> > 
> > [ Please quote correctly.  I fixed this up a bit. ]
> > 
> > On Mon, Aug 24, 2020 at 02:47:22PM +, Rodriguez Bahena, Victor wrote:
> >>> The call-clobbered regs are the only ones you *can* touch.  That does
> >>> not mean you should clear them all (it doesn't help much at all in some
> >>> cases).  Only the backend knows.
> >> 
> >> I think that for ROP mitigation purpose, we only need to clear the 
> >> call-used (i.e, call-clobbered) registers that are used in the current 
> >> routine and
> >> can pass parameters.
> > 
> > Which is more than you *can* do as well (consider return value registers
> > for example; there are more cases, in general; only the backend code can
> > know what is safe to do).
> 
> Yes, So, we agreed to move the code generation implementation part into 
> backend.
> 
> In Middle-end, we will only compute the hard register set based on call abi 
> information and data flow information, also handle the command line option.

You cannot in general figure out what registers you can clobber without
asking the backend.  You can figure out some that you *cannot* clobber,
but that isn't very useful.

Do you want to do this before or after the epilogue code is generated?


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Qing Zhao via Gcc-patches



> On Aug 24, 2020, at 3:20 PM, Segher Boessenkool  
> wrote:
> 
> Hi!
> 
> On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
>>> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
>>>  wrote:
>>> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>  wrote:
> Numbers on how expensive this is (for what arch, in code size and in
> execution time) would be useful.  If it is so expensive that no one will
> use it, it helps security at most none at all :-(
>>> 
>>> Without numbers on this, no one can determine if it is a good tradeoff
>>> for them.  And we (the GCC people) cannot know if it will be useful for
>>> enough users that it will be worth the effort for us.  Which is why I
>>> keep hammering on this point.
>> I can collect some run-time overhead data on this, do you have a 
>> recommendation on what test suite I can use
>> For this testing? (Is CPU2017 good enough)?
> 
> I would use something more real-life, not 12 small pieces of code.

Then, what kind of real-life benchmark you are suggesting? 

> 
>>> (The other side of the coin is how much this helps prevent exploitation;
>>> numbers on that would be good to see, too.)
>> 
>> This can be well showed from the paper:
>> 
>> "Clean the Scratch Registers: A Way to Mitigate Return-Oriented Programming 
>> Attacks"
>> 
>> https://urldefense.com/v3/__https://ieeexplore.ieee.org/document/8445132__;!!GqivPVa7Brio!JbdLvo54xB3ORTeZqpy_PwZsL9drNLaKjbg14bTKMOwxt8LWnjZ8gJWlqtlrFKPh$
>>   
>> >  >
>> 
>> Please take a look at this paper. 
> 
> As I told you before, that isn't open information, I cannot reply to
> any of that.

A little confused here, what’s you mean by “open information”? Is the 
information in a published paper not open information?

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Qing Zhao via Gcc-patches



> On Aug 24, 2020, at 3:26 PM, Segher Boessenkool  
> wrote:
> 
> On Mon, Aug 24, 2020 at 01:48:02PM -0500, Qing Zhao wrote:
>> 
>> 
>>> On Aug 24, 2020, at 12:59 PM, Segher Boessenkool 
>>>  wrote:
>>> 
>>> [ Please quote correctly.  I fixed this up a bit. ]
>>> 
>>> On Mon, Aug 24, 2020 at 02:47:22PM +, Rodriguez Bahena, Victor wrote:
> The call-clobbered regs are the only ones you *can* touch.  That does
> not mean you should clear them all (it doesn't help much at all in some
> cases).  Only the backend knows.
 
 I think that for ROP mitigation purpose, we only need to clear the 
 call-used (i.e, call-clobbered) registers that are used in the current 
 routine and
 can pass parameters.
>>> 
>>> Which is more than you *can* do as well (consider return value registers
>>> for example; there are more cases, in general; only the backend code can
>>> know what is safe to do).
>> 
>> Yes, So, we agreed to move the code generation implementation part into 
>> backend.
>> 
>> In Middle-end, we will only compute the hard register set based on call abi 
>> information and data flow information, also handle the command line option.
> 
> You cannot in general figure out what registers you can clobber without
> asking the backend.  You can figure out some that you *cannot* clobber,
> but that isn't very useful.
> 
> Do you want to do this before or after the epilogue code is generated?

static rtx_insn *
make_epilogue_seq (void)
{
  if (!targetm.have_epilogue ())
return NULL;

  start_sequence ();
  emit_note (NOTE_INSN_EPILOGUE_BEG);

 + gen_call_used_regs_seq (); // this is the place to 
emit the zeroing insn sequence

  rtx_insn *seq = targetm.gen_epilogue ();
…
}

Any comment on this?

thanks.

Qing




> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Alexandre Oliva
On Aug 24, 2020, Richard Biener  wrote:

> since the option is quite elaborate on what (sub-)set of regs is
> supposed to be cleared I'm not sure an implementation not involving
> any target hook is possible?

I don't think this follows.  Machine-independent code has a pretty good
notion of what registers are call-saved or call-clobbered, which ones
could be changed in this regard for function-specific calling
conventions, which ones may be used by a function to hold its return
value, which ones are used within a function...

It *should* be possible to introduce this in machine-independent code,
emitting insns to set registers to zero and regarding them as holding
values to be returned from the function.  Machine-specific code could
use more efficient insns to get the same result, but I can't see good
reason to not have a generic fallback implementation with at least a
best-effort attempt to offer the desired feature.


Now, this is for the regular return path.  Is zeroing registers in
exception-propagation paths not relevant?

I thought it is, and I think we could have generic code that identifies
the registers that ought to be zeroed, issues CFI notes to get them
zeroed in the exception path, and requests a target hook to emit the
insns to zero them in the regular return path.

-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-24 Thread Uros Bizjak via Gcc-patches
On Mon, Aug 24, 2020 at 10:43 PM Qing Zhao  wrote:
>
>
>
> > On Aug 24, 2020, at 3:20 PM, Segher Boessenkool 
> >  wrote:
> >
> > Hi!
> >
> > On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
> >>> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
> >>>  wrote:
> >>> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
> > On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
> >  wrote:
> > Numbers on how expensive this is (for what arch, in code size and in
> > execution time) would be useful.  If it is so expensive that no one will
> > use it, it helps security at most none at all :-(
> >>>
> >>> Without numbers on this, no one can determine if it is a good tradeoff
> >>> for them.  And we (the GCC people) cannot know if it will be useful for
> >>> enough users that it will be worth the effort for us.  Which is why I
> >>> keep hammering on this point.
> >> I can collect some run-time overhead data on this, do you have a 
> >> recommendation on what test suite I can use
> >> For this testing? (Is CPU2017 good enough)?
> >
> > I would use something more real-life, not 12 small pieces of code.
>
> Then, what kind of real-life benchmark you are suggesting?
>
> >
> >>> (The other side of the coin is how much this helps prevent exploitation;
> >>> numbers on that would be good to see, too.)
> >>
> >> This can be well showed from the paper:
> >>
> >> "Clean the Scratch Registers: A Way to Mitigate Return-Oriented 
> >> Programming Attacks"
> >>
> >> https://urldefense.com/v3/__https://ieeexplore.ieee.org/document/8445132__;!!GqivPVa7Brio!JbdLvo54xB3ORTeZqpy_PwZsL9drNLaKjbg14bTKMOwxt8LWnjZ8gJWlqtlrFKPh$
> >>   
> >>  >>  >
> >>
> >> Please take a look at this paper.
> >
> > As I told you before, that isn't open information, I cannot reply to
> > any of that.
>
> A little confused here, what’s you mean by “open information”? Is the 
> information in a published paper not open information?

No, because it is behind a paywall.

Uros.


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-25 Thread Qing Zhao via Gcc-patches



> On Aug 25, 2020, at 1:41 AM, Uros Bizjak  wrote:
> 
>>> 
> (The other side of the coin is how much this helps prevent exploitation;
> numbers on that would be good to see, too.)
 
 This can be well showed from the paper:
 
 "Clean the Scratch Registers: A Way to Mitigate Return-Oriented 
 Programming Attacks"
 
 https://urldefense.com/v3/__https://ieeexplore.ieee.org/document/8445132__;!!GqivPVa7Brio!JbdLvo54xB3ORTeZqpy_PwZsL9drNLaKjbg14bTKMOwxt8LWnjZ8gJWlqtlrFKPh$
  
 
 
 Please take a look at this paper.
>>> 
>>> As I told you before, that isn't open information, I cannot reply to
>>> any of that.
>> 
>> A little confused here, what’s you mean by “open information”? Is the 
>> information in a published paper not open information?
> 
> No, because it is behind a paywall.

Still don’t understand here:  this paper has been published in the proceeding 
of “ 2018 IEEE 29th International Conference on Application-specific Systems, 
Architectures and Processors (ASAP)”.
If you want to read the complete version online, you need to pay for it.

However, it’s still a published paper, and the information inside it should be 
“open information”. 

So, what’s the definition of “open information” you have?

I downloaded a PDF copy of this paper through my company’s paid account.  But I 
am not sure whether it’s legal for me to attach it to this mailing list?

Qing


> 
> Uros.



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-25 Thread Jeff Law via Gcc-patches
On Tue, 2020-08-25 at 02:16 -0300, Alexandre Oliva wrote:
> On Aug 24, 2020, Richard Biener  wrote:
> 
> > since the option is quite elaborate on what (sub-)set of regs is
> > supposed to be cleared I'm not sure an implementation not involving
> > any target hook is possible?
> 
> I don't think this follows.  Machine-independent code has a pretty good
> notion of what registers are call-saved or call-clobbered, which ones
> could be changed in this regard for function-specific calling
> conventions, which ones may be used by a function to hold its return
> value, which ones are used within a function...
> 
> It *should* be possible to introduce this in machine-independent code,
> emitting insns to set registers to zero and regarding them as holding
> values to be returned from the function.  Machine-specific code could
> use more efficient insns to get the same result, but I can't see good
> reason to not have a generic fallback implementation with at least a
> best-effort attempt to offer the desired feature.
I think part of the problem here is you have to worry about stubs which can
change caller-saved registers.  Return path stubs aren't particularly common, 
but
they do exist -- 32 bit hpux for example :(

Jeff



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-25 Thread Qing Zhao via Gcc-patches



> On Aug 24, 2020, at 3:20 PM, Segher Boessenkool  
> wrote:
> 
> Hi!
> 
> On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
>>> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
>>>  wrote:
>>> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>  wrote:
> Numbers on how expensive this is (for what arch, in code size and in
> execution time) would be useful.  If it is so expensive that no one will
> use it, it helps security at most none at all :-(
>>> 
>>> Without numbers on this, no one can determine if it is a good tradeoff
>>> for them.  And we (the GCC people) cannot know if it will be useful for
>>> enough users that it will be worth the effort for us.  Which is why I
>>> keep hammering on this point.
>> I can collect some run-time overhead data on this, do you have a 
>> recommendation on what test suite I can use
>> For this testing? (Is CPU2017 good enough)?
> 
> I would use something more real-life, not 12 small pieces of code.

There is some basic information about the benchmarks of CPU2017 in below link:

https://www.spec.org/cpu2017/Docs/overview.html#suites 


GCC itself is one of the benchmarks in CPU2017 (502.gcc_r). And 526.blender_r 
is even larger than 502.gcc_r. 
And there are several other quite big benchmarks as well (perlbench, xalancbmk, 
parest, imagick, etc).

thanks.

Qing

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-25 Thread Qing Zhao via Gcc-patches



> On Aug 25, 2020, at 9:05 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> 
>> On Aug 25, 2020, at 1:41 AM, Uros Bizjak  wrote:
>> 
 
>> (The other side of the coin is how much this helps prevent exploitation;
>> numbers on that would be good to see, too.)
> 
> This can be well showed from the paper:
> 
> "Clean the Scratch Registers: A Way to Mitigate Return-Oriented 
> Programming Attacks"
> 
> https://urldefense.com/v3/__https://ieeexplore.ieee.org/document/8445132__;!!GqivPVa7Brio!JbdLvo54xB3ORTeZqpy_PwZsL9drNLaKjbg14bTKMOwxt8LWnjZ8gJWlqtlrFKPh$
>  
>   >
> 
> Please take a look at this paper.
 
 As I told you before, that isn't open information, I cannot reply to
 any of that.
>>> 
>>> A little confused here, what’s you mean by “open information”? Is the 
>>> information in a published paper not open information?
>> 
>> No, because it is behind a paywall.
> 
> Still don’t understand here:  this paper has been published in the proceeding 
> of “ 2018 IEEE 29th International Conference on Application-specific Systems, 
> Architectures and Processors (ASAP)”.
> If you want to read the complete version online, you need to pay for it.
> 
> However, it’s still a published paper, and the information inside it should 
> be “open information”. 
> 
> So, what’s the definition of “open information” you have?
> 
> I downloaded a PDF copy of this paper through my company’s paid account.  But 
> I am not sure whether it’s legal for me to attach it to this mailing list?

After consulting, it turned out that I was not allowed to further forward the 
copy I downloaded through my company’s account to this alias. 
There is some more information on this paper online though:

https://www.semanticscholar.org/paper/Clean-the-Scratch-Registers:-A-Way-to-Mitigate-Rong-Xie/6f2ce4fd31baa0f6c02f9eb5c57b90d39fe5fa13

All the figures and tables in this paper are available in this link. 

In which, Figure 1 is an illustration  of a typical ROP attack, please pay 
special attention on the “Gadgets”, which are carefully chosen machine 
instruction sequences that are already present in the machine's memory, Each 
gadget typically ends in a return instruction and is located in a subroutine 
within the existing program and/or shared library code. Chained together, these 
gadgets allow an attacker to perform arbitrary operations on a machine 
employing defenses that thwart simpler attacks.

The paper identified the important features of ROP attack as following:

"First, the destination of using gadget chains in usual is performing system 
call or system fucntion to perform malicious behaviour such as file access, 
network access and W ⊕ X disable. In most cases, the adversary would like to 
disable W ⊕ X. Because once W ⊕ X has been disabled, shellcode can be executed 
directly instead of rewritting shellcode to ROP chains which may cause some 
troubles for the adversary. 

Second, if the adversary performs ROP attacks using system call instruction, no 
matter on x86 or x64 architecture, the register would be used to pass 
parameter. Or if the adversary performs ROP attacks using system function such 
as “read” or “mprotect”, on x64 system, the register would still be used to 
pass parameters, as mentioned in subsection B and C.”
As a result, the paper proposed the idea to zeroing scratch registers that pass 
parameters at the “return” insns to mitigate the ROP attack. 

Table III, Table IV and Table V are the results of “zeroing scratch register 
mitigate ROP attack”. From the tables, zeroing scratch registers can 
successfully mitigate the ROP on all those benchmarks. 

Table VI is the performance overhead of their implementation, it looks like 
very high, average 16.2X runtime overhead.  However, this implementation is not 
use compiler to statically generate zeroing sequence, instead, it used "dynamic 
binary instrumentation at runtime “ to check every instruction to 
1. Set/unset flags to check which scratch registers are used in the routine;
2. Whether the instruction is return instruction or not, if it’s is return, 
insert the zeroing used scratch register sequence before the “return” insn. 

Due to the above run-time dynamic instrumentation method, the high runtime 
overhead is expecting, I think.

If we use GCC to statically check the “used” information and add zeroing 
sequence before return insn, the run-time overhead will be much smaller. 

I will provide run-time overhead information with the 2nd version of the patch 
by using CPU2017 applications.

thanks.

Qing


> Qing
> 
> 
>> 
>> Uros.



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-26 Thread Alexandre Oliva
On Aug 25, 2020, Jeff Law  wrote:

> On Tue, 2020-08-25 at 02:16 -0300, Alexandre Oliva wrote:
>> On Aug 24, 2020, Richard Biener  wrote:
>> 
>> > since the option is quite elaborate on what (sub-)set of regs is
>> > supposed to be cleared I'm not sure an implementation not involving
>> > any target hook is possible?
>> 
>> I don't think this follows.  Machine-independent code has a pretty good
>> notion of what registers are call-saved or call-clobbered, which ones
>> could be changed in this regard for function-specific calling
>> conventions, which ones may be used by a function to hold its return
>> value, which ones are used within a function...
>> 
>> It *should* be possible to introduce this in machine-independent code,
>> emitting insns to set registers to zero and regarding them as holding
>> values to be returned from the function.  Machine-specific code could
>> use more efficient insns to get the same result, but I can't see good
>> reason to not have a generic fallback implementation with at least a
>> best-effort attempt to offer the desired feature.
> I think part of the problem here is you have to worry about stubs which can
> change caller-saved registers.  Return path stubs aren't particularly common, 
> but
> they do exist -- 32 bit hpux for example :(

This suggests that such targets might have to implement the
target-specific hook to deal with this, but does it detract in any way
from the notion of having generic code to fall back to on targets that
do NOT require any special handling?

-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-26 Thread Qing Zhao via Gcc-patches



> On Aug 26, 2020, at 7:02 AM, Alexandre Oliva  wrote:
> 
> On Aug 25, 2020, Jeff Law mailto:l...@redhat.com>> wrote:
> 
>> On Tue, 2020-08-25 at 02:16 -0300, Alexandre Oliva wrote:
>>> On Aug 24, 2020, Richard Biener  wrote:
>>> 
 since the option is quite elaborate on what (sub-)set of regs is
 supposed to be cleared I'm not sure an implementation not involving
 any target hook is possible?
>>> 
>>> I don't think this follows.  Machine-independent code has a pretty good
>>> notion of what registers are call-saved or call-clobbered, which ones
>>> could be changed in this regard for function-specific calling
>>> conventions, which ones may be used by a function to hold its return
>>> value, which ones are used within a function...
>>> 
>>> It *should* be possible to introduce this in machine-independent code,
>>> emitting insns to set registers to zero and regarding them as holding
>>> values to be returned from the function.  Machine-specific code could
>>> use more efficient insns to get the same result, but I can't see good
>>> reason to not have a generic fallback implementation with at least a
>>> best-effort attempt to offer the desired feature.
>> I think part of the problem here is you have to worry about stubs which can
>> change caller-saved registers.  Return path stubs aren't particularly 
>> common, but
>> they do exist -- 32 bit hpux for example :(
> 
> This suggests that such targets might have to implement the
> target-specific hook to deal with this, but does it detract in any way
> from the notion of having generic code to fall back to on targets that
> do NOT require any special handling?

There are two issues I can see with adding a default generator in middle end:

1. In order to determine where a target should not use the generic code to emit 
the zeroing sequence, 
a new target hook to determine this has to be added;

2. In order to avoid the generated zeroing insns (which are simply insns that 
set registers) being deleted, 
We have to define a new insn “pro_epilogue_use” in the target. 
So, any target that want to use the default generator in middle end, must 
provide such a new target hook.

Based on the above 2, I don’t think that adding the default generator in middle 
end is a good idea.

Qing

> 
> -- 
> Alexandre Oliva, happy hacker
> https://urldefense.com/v3/__https://FSFLA.org/blogs/lxo/__;!!GqivPVa7Brio!Pee3_l4yYpNOUbnymMqrEM68oDGk-2Q3zebqLnQ255SX5go78t8Sq1RmM72wJP3a$
>  
> 
>  
> Free Software Activist
> GNU Toolchain Engineer



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-26 Thread Jeff Law via Gcc-patches
On Wed, 2020-08-26 at 09:02 -0300, Alexandre Oliva wrote:
> On Aug 25, 2020, Jeff Law  wrote:
> 
> > On Tue, 2020-08-25 at 02:16 -0300, Alexandre Oliva wrote:
> > > On Aug 24, 2020, Richard Biener  wrote:
> > > 
> > > > since the option is quite elaborate on what (sub-)set of regs is
> > > > supposed to be cleared I'm not sure an implementation not involving
> > > > any target hook is possible?
> > > 
> > > I don't think this follows.  Machine-independent code has a pretty good
> > > notion of what registers are call-saved or call-clobbered, which ones
> > > could be changed in this regard for function-specific calling
> > > conventions, which ones may be used by a function to hold its return
> > > value, which ones are used within a function...
> > > 
> > > It *should* be possible to introduce this in machine-independent code,
> > > emitting insns to set registers to zero and regarding them as holding
> > > values to be returned from the function.  Machine-specific code could
> > > use more efficient insns to get the same result, but I can't see good
> > > reason to not have a generic fallback implementation with at least a
> > > best-effort attempt to offer the desired feature.
> > I think part of the problem here is you have to worry about stubs which can
> > change caller-saved registers.  Return path stubs aren't particularly 
> > common, but
> > they do exist -- 32 bit hpux for example :(
> 
> This suggests that such targets might have to implement the
> target-specific hook to deal with this, but does it detract in any way
> from the notion of having generic code to fall back to on targets that
> do NOT require any special handling?
Agreed.  Sorry if I wasn't clear that generic code + a hook should be 
sufficient.

jeff



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-28 Thread Alexandre Oliva
On Aug 26, 2020, Qing Zhao  wrote:

> There are two issues I can see with adding a default generator in middle end:

> 1. In order to determine where a target should not use the generic
> code to emit the zeroing sequence,
> a new target hook to determine this has to be added;

Yeah, a target hook whose default is the generic code, and that targets
that need it, or that benefit from it, can override.  That's the point
of hooks, to enable overriding.  Why should this be an issue?

> 2. In order to avoid the generated zeroing insns (which are simply
> insns that set registers) being deleted,
> We have to define a new insn “pro_epilogue_use” in the target. 

Why won't a naked USE pattern do?  We already issue those in generic
code, for regs holding return values.  If we were to pretend that other
registers are also holding zeros as values to be returned, why shouldn't
that work for them as well?

-- 
Alexandre Oliva, happy hacker
https://FSFLA.org/blogs/lxo/
Free Software Activist
GNU Toolchain Engineer


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-28 Thread Qing Zhao via Gcc-patches



> On Aug 28, 2020, at 2:47 AM, Alexandre Oliva  wrote:
> 
> On Aug 26, 2020, Qing Zhao  wrote:
> 
>> There are two issues I can see with adding a default generator in middle end:
> 
>> 1. In order to determine where a target should not use the generic
>> code to emit the zeroing sequence,
>> a new target hook to determine this has to be added;
> 
> Yeah, a target hook whose default is the generic code, and that targets
> that need it, or that benefit from it, can override.  That's the point
> of hooks, to enable overriding.  Why should this be an issue?

A default handler will be invoked for all the targets. So, if the target does 
not provide any 
target-specific handler to override it. The default handler should be correct 
on this target. 

So, the default handler should be correct on all the targets assuming no 
override happening. 

Correct me if I am wrong with the above understanding.

Then, for example, for the 32 bit hpux, is a default handler without any 
special target handling 
correct on it? My understanding from the previous discussion is, we need some 
special handling 
On 32 bit hpux to make it correct, So, in order to make the default handler 
correct on 32 bit hpux,
We need to add another target hook, for example, targetm.has_return_stubs() to 
check whether
A target has such feature, then in the default handler, we can call this new 
target hook to check and
Then make sure the default handler is correct on 32 bit hpux. 

There might be other targets that might need other special handlings which we 
currently don’t know
Yet. Do we need to identify all those targets and all those special features, 
and then add new 
Target hook for each of the identified special feature?

Yes, theoretically, it’s doable to run testing on all the targets and to see 
which targets need special
Handling and what kind of special handling we need, however, is doing this 
really necessary?


> 
>> 2. In order to avoid the generated zeroing insns (which are simply
>> insns that set registers) being deleted,
>> We have to define a new insn “pro_epilogue_use” in the target. 
> 
> Why won't a naked USE pattern do?  We already issue those in generic
> code, for regs holding return values.  If we were to pretend that other
> registers are also holding zeros as values to be returned, why shouldn't
> that work for them as well?

From the current implementation based on X86, I see the following comments:

;; As USE insns aren't meaningful after reload, this is used instead
;; to prevent deleting instructions setting registers for PIC code
(define_insn "pro_epilogue_use"
  [(unspec_volatile [(match_operand 0)] UNSPECV_PRO_EPILOGUE_USE)]

My understanding is, the “USE” will not be useful after reload. So a new 
“pro_eplogue_use” should
be added.

HongJiu, could you please provide more information on this?

Thanks.

Qing

> 
> -- 
> Alexandre Oliva, happy hacker
> https://urldefense.com/v3/__https://FSFLA.org/blogs/lxo/__;!!GqivPVa7Brio!NzNvCeA4fLoYPOD4RHTzKJd3QtgXG8bY2zXVcztQohMQRn5yROpYDp9CRbjjtcRV$
>  
> Free Software Activist
> GNU Toolchain Engineer



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-08-28 Thread H.J. Lu via Gcc-patches
On Fri, Aug 28, 2020 at 8:22 AM Qing Zhao  wrote:
>
>
>
> > On Aug 28, 2020, at 2:47 AM, Alexandre Oliva  wrote:
> >
> > On Aug 26, 2020, Qing Zhao  wrote:
> >
> >> There are two issues I can see with adding a default generator in middle 
> >> end:
> >
> >> 1. In order to determine where a target should not use the generic
> >> code to emit the zeroing sequence,
> >> a new target hook to determine this has to be added;
> >
> > Yeah, a target hook whose default is the generic code, and that targets
> > that need it, or that benefit from it, can override.  That's the point
> > of hooks, to enable overriding.  Why should this be an issue?
>
> A default handler will be invoked for all the targets. So, if the target does 
> not provide any
> target-specific handler to override it. The default handler should be correct 
> on this target.
>
> So, the default handler should be correct on all the targets assuming no 
> override happening.
>
> Correct me if I am wrong with the above understanding.
>
> Then, for example, for the 32 bit hpux, is a default handler without any 
> special target handling
> correct on it? My understanding from the previous discussion is, we need some 
> special handling
> On 32 bit hpux to make it correct, So, in order to make the default handler 
> correct on 32 bit hpux,
> We need to add another target hook, for example, targetm.has_return_stubs() 
> to check whether
> A target has such feature, then in the default handler, we can call this new 
> target hook to check and
> Then make sure the default handler is correct on 32 bit hpux.
>
> There might be other targets that might need other special handlings which we 
> currently don’t know
> Yet. Do we need to identify all those targets and all those special features, 
> and then add new
> Target hook for each of the identified special feature?
>
> Yes, theoretically, it’s doable to run testing on all the targets and to see 
> which targets need special
> Handling and what kind of special handling we need, however, is doing this 
> really necessary?
>
>
> >
> >> 2. In order to avoid the generated zeroing insns (which are simply
> >> insns that set registers) being deleted,
> >> We have to define a new insn “pro_epilogue_use” in the target.
> >
> > Why won't a naked USE pattern do?  We already issue those in generic
> > code, for regs holding return values.  If we were to pretend that other
> > registers are also holding zeros as values to be returned, why shouldn't
> > that work for them as well?
>
> From the current implementation based on X86, I see the following comments:
>
> ;; As USE insns aren't meaningful after reload, this is used instead
> ;; to prevent deleting instructions setting registers for PIC code
> (define_insn "pro_epilogue_use"
>   [(unspec_volatile [(match_operand 0)] UNSPECV_PRO_EPILOGUE_USE)]
>
> My understanding is, the “USE” will not be useful after reload. So a new 
> “pro_eplogue_use” should
> be added.
>
> HongJiu, could you please provide more information on this?

pro_epilogue_use is needed.  Otherwise, these zeroing instructions
will be removed after reload.

-- 
H.J.


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches
Hi,

Per request, I collected runtime performance data and code size data with 
CPU2017 on a X86 platform. 

*** Machine info:
model name>-: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
$ lscpu | grep NUMA
NUMA node(s):  2
NUMA node0 CPU(s): 0-21,44-65
NUMA node1 CPU(s): 22-43,66-87

***CPU2017 benchmarks: 
all the benchmarks with C/C++, 9 Integer benchmarks, 10 FP benchmarks. 

***Configures:
Intrate and fprate, 22 copies. 

***Compiler options:
no :-g -O2 -march=native
used_gpr_arg:   no + -fzero-call-used-regs=used-gpr-arg
used_arg:   no + -fzero-call-used-regs=used-arg
all_arg:no + -fzero-call-used-regs=all-arg
used_gpr:   no + -fzero-call-used-regs=used-gpr
all_gpr:no + -fzero-call-used-regs=all-gpr
used:   no + -fzero-call-used-regs=used
all:no + -fzero-call-used-regs=all

***each benchmark runs 3 times. 

***runtime performance data:
Please see the attached csv file


From the data, we can see that:
On average, all the options starting with “used_…”  (i.e, only the registers 
that are used in the routine will be zeroed) have very low runtime overheads, 
at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
If all the registers will be zeroed, the runtime overhead is bigger, all_arg is 
5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on average. 
Looks like the overhead of zeroing vector registers is much bigger. 

For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
runtime overhead with this is very small.

***code size increase data:

Please see the attached file 


From the data, we can see that:
The code size impact in general is very small, the biggest is “all_arg”, which 
is 1.06% for integer benchmark, and 1.13% for FP benchmarks.

So, from the data collected, I think that the run-time overhead and code size 
increase from this option are very reasonable. 

Let me know you comments and opinions.

thanks.

Qing

> On Aug 25, 2020, at 4:54 PM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> 
>> On Aug 24, 2020, at 3:20 PM, Segher Boessenkool  
>> wrote:
>> 
>> Hi!
>> 
>> On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
 On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
  wrote:
 On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
>> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>>  wrote:
>> Numbers on how expensive this is (for what arch, in code size and in
>> execution time) would be useful.  If it is so expensive that no one will
>> use it, it helps security at most none at all :-(
 
 Without numbers on this, no one can determine if it is a good tradeoff
 for them.  And we (the GCC people) cannot know if it will be useful for
 enough users that it will be worth the effort for us.  Which is why I
 keep hammering on this point.
>>> I can collect some run-time overhead data on this, do you have a 
>>> recommendation on what test suite I can use
>>> For this testing? (Is CPU2017 good enough)?
>> 
>> I would use something more real-life, not 12 small pieces of code.
> 
> There is some basic information about the benchmarks of CPU2017 in below link:
> 
> https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/overview.html*suites__;Iw!!GqivPVa7Brio!PmRE_sLg10gVnn1UZLs1q1TPoTV0SCnw0Foo5QQlZgD03MeL0KIyPVXl0XlvVVRP$
>  
> 
>  
>   
> 
>  >
> 
> GCC itself is one of the benchmarks in CPU2017 (502.gcc_r). And 526.blender_r 
> is even larger than 502.gcc_r. 
> And there are several other quite big benchmarks as well (perlbench, 
> xalancbmk, parest, imagick, etc).
> 
> thanks.
> 
> Qing



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches


Hi,

Looks like both attached .csv files were deleted during the email delivery 
procedure. Not sure what’s the reason for this.

Then I have to copy the text file here for you reference:

benchmarks:
C   500.perlbench_r  
C   502.gcc_r 
C   505.mcf_r   
C++ 520.omnetpp_r
C++ 523.xalancbmk_r  
C   525.x264_r
C++ 531.deepsjeng_r
C++ 541.leela_r
C   557.xz_r   
  

C++/C/Fortran   507.cactuBSSN_r  
C++ 508.namd_r
C++ 510.parest_r 
C++/C   511.povray_r   
C   519.lbm_r 
Fortran/C   521.wrf_r 
C++/C   526.blender_r   
Fortran/C   527.cam4_r  
C   538.imagick_r  
C   544.nab_r

***runtime overhead data and code size overhead data, I converted then to PDF 
files, hopefully this time I can attach it with the email:

thanks.

Qing






> On Sep 3, 2020, at 9:29 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> Hi,
> 
> Per request, I collected runtime performance data and code size data with 
> CPU2017 on a X86 platform. 
> 
> *** Machine info:
> model name>-: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> $ lscpu | grep NUMA
> NUMA node(s):  2
> NUMA node0 CPU(s): 0-21,44-65
> NUMA node1 CPU(s): 22-43,66-87
> 
> ***CPU2017 benchmarks: 
> all the benchmarks with C/C++, 9 Integer benchmarks, 10 FP benchmarks. 
> 
> ***Configures:
> Intrate and fprate, 22 copies. 
> 
> ***Compiler options:
> no :  -g -O2 -march=native
> used_gpr_arg: no + -fzero-call-used-regs=used-gpr-arg
> used_arg: no + -fzero-call-used-regs=used-arg
> all_arg:  no + -fzero-call-used-regs=all-arg
> used_gpr: no + -fzero-call-used-regs=used-gpr
> all_gpr:  no + -fzero-call-used-regs=all-gpr
> used: no + -fzero-call-used-regs=used
> all:  no + -fzero-call-used-regs=all
> 
> ***each benchmark runs 3 times. 
> 
> ***runtime performance data:
> Please see the attached csv file
> 
> 
> From the data, we can see that:
> On average, all the options starting with “used_…”  (i.e, only the registers 
> that are used in the routine will be zeroed) have very low runtime overheads, 
> at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
> average. 
> Looks like the overhead of zeroing vector registers is much bigger. 
> 
> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
> runtime overhead with this is very small.
> 
> ***code size increase data:
> 
> Please see the attached file 
> 
> 
> From the data, we can see that:
> The code size impact in general is very small, the biggest is “all_arg”, 
> which is 1.06% for integer benchmark, and 1.13% for FP benchmarks.
> 
> So, from the data collected, I think that the run-time overhead and code size 
> increase from this option are very reasonable. 
> 
> Let me know you comments and opinions.
> 
> thanks.
> 
> Qing
> 
>> On Aug 25, 2020, at 4:54 PM, Qing Zhao via Gcc-patches 
>>  wrote:
>> 
>> 
>> 
>>> On Aug 24, 2020, at 3:20 PM, Segher Boessenkool 
>>>  wrote:
>>> 
>>> Hi!
>>> 
>>> On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
>  wrote:
> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
>>> On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
>>>  wrote:
>>> Numbers on how expensive this is (for what arch, in code size and in
>>> execution time) would be useful.  If it is so expensive that no one will
>>> use it, it helps security at most none at all :-(
> 
> Without numbers on this, no one can determine if it is a good tradeoff
> for them.  And we (the GCC people) cannot know if it will be useful for
> enough users that it will be worth the effort for us.  Which is why I
> keep hammering on this point.
 I can collect some run-time overhead data on this, do you have a 
 recommendation on what test suite I can use
 For this testing? (Is CPU2017 good enough)?
>>> 
>>> I would use something more real-life, not 12 small pieces of code.
>> 
>> There is some basic information about the benchmarks of CPU2017 in below 
>> link:
>> 
>> https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/overview.html*suites__;Iw!!GqivPVa7Brio!PmRE_sLg10gVnn1UZLs1q1TPoTV0SCnw0Foo5QQlZgD03MeL0KIyPVXl0XlvVVRP$
>>  
>> 

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches
Looks like that the PDF attachments do not work with this alias either. 
H.J. LU helped me to upload the performance data and code size data to the 
following wiki page:

https://gitlab.com/x86-gcc/gcc/-/wikis/Zero-call-used-registers-data

Please refer to this link for the data.

thanks.

Qing

> On Sep 3, 2020, at 10:08 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> Hi,
> 
> Looks like both attached .csv files were deleted during the email delivery 
> procedure. Not sure what’s the reason for this.
> 
> Then I have to copy the text file here for you reference:
> 
> benchmarks:
> C   500.perlbench_r  
> C   502.gcc_r 
> C   505.mcf_r   
> C++ 520.omnetpp_r
> C++ 523.xalancbmk_r  
> C   525.x264_r
> C++ 531.deepsjeng_r
> C++ 541.leela_r
> C   557.xz_r   
> 
> 
> C++/C/Fortran   507.cactuBSSN_r  
> C++ 508.namd_r
> C++ 510.parest_r 
> C++/C   511.povray_r   
> C   519.lbm_r 
> Fortran/C   521.wrf_r 
> C++/C   526.blender_r   
> Fortran/C   527.cam4_r  
> C   538.imagick_r  
> C   544.nab_r
> 
> ***runtime overhead data and code size overhead data, I converted then to PDF 
> files, hopefully this time I can attach it with the email:
> 
> thanks.
> 
> Qing
> 
> 
> 
> 
> 
> 
>> On Sep 3, 2020, at 9:29 AM, Qing Zhao via Gcc-patches 
>>  wrote:
>> 
>> Hi,
>> 
>> Per request, I collected runtime performance data and code size data with 
>> CPU2017 on a X86 platform. 
>> 
>> *** Machine info:
>> model name>-: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
>> $ lscpu | grep NUMA
>> NUMA node(s):  2
>> NUMA node0 CPU(s): 0-21,44-65
>> NUMA node1 CPU(s): 22-43,66-87
>> 
>> ***CPU2017 benchmarks: 
>> all the benchmarks with C/C++, 9 Integer benchmarks, 10 FP benchmarks. 
>> 
>> ***Configures:
>> Intrate and fprate, 22 copies. 
>> 
>> ***Compiler options:
>> no : -g -O2 -march=native
>> used_gpr_arg:no + -fzero-call-used-regs=used-gpr-arg
>> used_arg:no + -fzero-call-used-regs=used-arg
>> all_arg: no + -fzero-call-used-regs=all-arg
>> used_gpr:no + -fzero-call-used-regs=used-gpr
>> all_gpr: no + -fzero-call-used-regs=all-gpr
>> used:no + -fzero-call-used-regs=used
>> all: no + -fzero-call-used-regs=all
>> 
>> ***each benchmark runs 3 times. 
>> 
>> ***runtime performance data:
>> Please see the attached csv file
>> 
>> 
>> From the data, we can see that:
>> On average, all the options starting with “used_…”  (i.e, only the registers 
>> that are used in the routine will be zeroed) have very low runtime 
>> overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>> benchmarks. 
>> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
>> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
>> average. 
>> Looks like the overhead of zeroing vector registers is much bigger. 
>> 
>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
>> runtime overhead with this is very small.
>> 
>> ***code size increase data:
>> 
>> Please see the attached file 
>> 
>> 
>> From the data, we can see that:
>> The code size impact in general is very small, the biggest is “all_arg”, 
>> which is 1.06% for integer benchmark, and 1.13% for FP benchmarks.
>> 
>> So, from the data collected, I think that the run-time overhead and code 
>> size increase from this option are very reasonable. 
>> 
>> Let me know you comments and opinions.
>> 
>> thanks.
>> 
>> Qing
>> 
>>> On Aug 25, 2020, at 4:54 PM, Qing Zhao via Gcc-patches 
>>>  wrote:
>>> 
>>> 
>>> 
 On Aug 24, 2020, at 3:20 PM, Segher Boessenkool 
  wrote:
 
 Hi!
 
 On Mon, Aug 24, 2020 at 01:02:03PM -0500, Qing Zhao wrote:
>> On Aug 24, 2020, at 12:49 PM, Segher Boessenkool 
>>  wrote:
>> On Wed, Aug 19, 2020 at 06:27:45PM -0500, Qing Zhao wrote:
 On Aug 19, 2020, at 5:57 PM, Segher Boessenkool 
  wrote:
 Numbers on how expensive this is (for what arch, in code size and in
 execution time) would be useful.  If it is so expensive that no one 
 will
 use it, it helps security at most none at all :-(
>> 
>> Without numbers on this, no one can determine if it is a good tradeoff
>> for them.  And we (the GCC people) cannot know if it will be useful for
>> enough users that it will be worth the effort for us.  Which is why I
>> keep hammering on this point.
> I can collect some run-time overhead data on this, do you have a 
> recommendation on what test suite I can use
> For this testing? (Is CPU2017 good enough)?
 
 I would use something more real-life, not 12 small pieces of code.
>>> 
>>> There is some basic information about the benchmarks of CPU2017 in below 
>>> link:
>>> 
>>> https://urldefense.com/v3/__

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Kees Cook via Gcc-patches
On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> On average, all the options starting with “used_…”  (i.e, only the registers 
> that are used in the routine will be zeroed) have very low runtime overheads, 
> at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
> average. 
> Looks like the overhead of zeroing vector registers is much bigger. 
> 
> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
> runtime overhead with this is very small.

That looks great; thanks for doing those tests!

(And it seems like these benchmarks are kind of a "worst case" scenario
with regard to performance, yes? As in it's mostly tight call loops?)

-- 
Kees Cook


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches



> On Sep 3, 2020, at 12:13 PM, Kees Cook  wrote:
> 
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>> On average, all the options starting with “used_…”  (i.e, only the registers 
>> that are used in the routine will be zeroed) have very low runtime 
>> overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>> benchmarks. 
>> If all the registers will be zeroed, the runtime overhead is bigger, all_arg 
>> is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
>> average. 
>> Looks like the overhead of zeroing vector registers is much bigger. 
>> 
>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
>> runtime overhead with this is very small.
> 
> That looks great; thanks for doing those tests!
> 
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)

The top 3 benchmarks that have the most overhead from this option are: 
531.deepsjeng_r, 541.leela_r, and 511.povray_r.
All of them are C++ benchmarks. 
I guess that the most important reason is  the smaller routine size in general 
(especially at the hot execution path or loops).
As a result, the overhead of these additional zeroing instructions in each 
routine will be relatively higher.  

Qing

> 
> -- 
> Kees Cook



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Ramana Radhakrishnan via Gcc-patches
On Thu, Sep 3, 2020 at 6:13 PM Kees Cook via Gcc-patches
 wrote:
>
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> > On average, all the options starting with “used_…”  (i.e, only the 
> > registers that are used in the routine will be zeroed) have very low 
> > runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
> > benchmarks.
> > If all the registers will be zeroed, the runtime overhead is bigger, 
> > all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
> > on average.
> > Looks like the overhead of zeroing vector registers is much bigger.
> >
> > For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
> > the runtime overhead with this is very small.
>
> That looks great; thanks for doing those tests!
>
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)


That's true of some of them but definitely not all - the GCC benchmark
springs to mind in SPEC as having quite a flat profile, so I'd take a
look there and probe a bit more in that one to see what happens. Don't
ask me what else , that's all I have in my cache this evening :)

I'd also query the "average" slowdown metric in those numbers as
something that's being measured in a different way here. IIRC the SPEC
scores for int and FP are computed with a geometric mean of the
individual ratios of each of the benchmark. Thus I don't think the
average of the slowdowns is enough to talk about slowdowns for the
benchmark suite. A quick calculation of the arithmetic mean of column
B in my head suggests that it's the arithmetic mean of all the
slowdowns ?

i.e. Slowdown (Geometric Mean (x, y, z, ))  != Arithmetic mean (
Slowdown (x), Slowdown (y) .)

So another metric to look at would be to look at the Slowdown of your
estimated (probably non-reportable) SPEC scores as well to get a more
"spec like" metric.

regards
Ramana
>
> --
> Kees Cook


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Qing Zhao via Gcc-patches



> On Sep 3, 2020, at 12:48 PM, Ramana Radhakrishnan  
> wrote:
> 
> On Thu, Sep 3, 2020 at 6:13 PM Kees Cook via Gcc-patches
>  wrote:
>> 
>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>>> On average, all the options starting with “used_…”  (i.e, only the 
>>> registers that are used in the routine will be zeroed) have very low 
>>> runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>>> benchmarks.
>>> If all the registers will be zeroed, the runtime overhead is bigger, 
>>> all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
>>> on average.
>>> Looks like the overhead of zeroing vector registers is much bigger.
>>> 
>>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
>>> the runtime overhead with this is very small.
>> 
>> That looks great; thanks for doing those tests!
>> 
>> (And it seems like these benchmarks are kind of a "worst case" scenario
>> with regard to performance, yes? As in it's mostly tight call loops?)
> 
> 
> That's true of some of them but definitely not all - the GCC benchmark
> springs to mind in SPEC as having quite a flat profile, so I'd take a
> look there and probe a bit more in that one to see what happens. Don't
> ask me what else , that's all I have in my cache this evening :)
> 
> I'd also query the "average" slowdown metric in those numbers as
> something that's being measured in a different way here. IIRC the SPEC
> scores for int and FP are computed with a geometric mean of the
> individual ratios of each of the benchmark. Thus I don't think the
> average of the slowdowns is enough to talk about slowdowns for the
> benchmark suite. A quick calculation of the arithmetic mean of column
> B in my head suggests that it's the arithmetic mean of all the
> slowdowns ?
> 
> i.e. Slowdown (Geometric Mean (x, y, z, ))  != Arithmetic mean (
> Slowdown (x), Slowdown (y) .)
> 
> So another metric to look at would be to look at the Slowdown of your
> estimated (probably non-reportable) SPEC scores as well to get a more
> "spec like" metric.

Please take a look at the new csv file at:

https://gitlab.com/x86-gcc/gcc/-/wikis/Zero-call-used-registers-data 


I just uploaded the slowdown data computed based on 
Est.SPECrate(R)2017_int_base and Est.SPECrate(R)2017_fp_base. All data are 
computed against “no”. 

Compare this slowdown data to the one I computed previously as (Arithmetic mean 
(Slowdown(x), slowdown(y)…), the numbers do change a little bit, however, the 
basic information provided from the data keeps the same as before. 

Let me know if you have further comments.

thanks.

Qing


> 
> regards
> Ramana
>> 
>> --
>> Kees Cook



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-03 Thread Rodriguez Bahena, Victor via Gcc-patches


-Original Message-
From: Qing Zhao 
Date: Thursday, September 3, 2020 at 12:55 PM
To: Kees Cook 
Cc: Segher Boessenkool , Jakub Jelinek 
, Uros Bizjak , "Rodriguez Bahena, Victor" 
, GCC Patches 
Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]



> On Sep 3, 2020, at 12:13 PM, Kees Cook  wrote:
> 
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>> On average, all the options starting with “used_…”  (i.e, only the 
registers that are used in the routine will be zeroed) have very low runtime 
overheads, at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks. 
>> If all the registers will be zeroed, the runtime overhead is bigger, 
all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on 
average. 
>> Looks like the overhead of zeroing vector registers is much bigger. 
>> 
>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
the runtime overhead with this is very small.
> 
> That looks great; thanks for doing those tests!
> 
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)

The top 3 benchmarks that have the most overhead from this option are: 
531.deepsjeng_r, 541.leela_r, and 511.povray_r.
All of them are C++ benchmarks. 
I guess that the most important reason is  the smaller routine size in 
general (especially at the hot execution path or loops).
As a result, the overhead of these additional zeroing instructions in each 
routine will be relatively higher.  

Qing

I think that overhead is expected in benchmarks like 541.leela_r, according to 
https://www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html is a benchmark 
for Artificial Intelligence (Monte Carlo simulation, game tree search & pattern 
recognition). The addition of fzero-call-used-regs will represent an overhead 
each time the functions are being call and in areas like game tree search is 
high. 

Qing, thanks a lot for the measurement, I am not sure if this is the limit of 
overhead the community is willing to accept by adding extra security (me as gcc 
user will be willing to accept). 

Regards

Victor 


> 
> -- 
> Kees Cook




Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Qing Zhao via Gcc-patches



> On Sep 3, 2020, at 8:23 PM, Rodriguez Bahena, Victor 
>  wrote:
> 
> 
> 
> -Original Message-
> From: Qing Zhao mailto:qing.z...@oracle.com>>
> Date: Thursday, September 3, 2020 at 12:55 PM
> To: Kees Cook mailto:keesc...@chromium.org>>
> Cc: Segher Boessenkool  <mailto:seg...@kernel.crashing.org>>, Jakub Jelinek  <mailto:ja...@redhat.com>>, Uros Bizjak  <mailto:ubiz...@gmail.com>>, "Rodriguez Bahena, Victor" 
>  <mailto:victor.rodriguez.bah...@intel.com>>, GCC Patches 
> mailto:gcc-patches@gcc.gnu.org>>
> Subject: Re: PING [Patch][Middle-end]Add 
> -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]
> 
> 
> 
>> On Sep 3, 2020, at 12:13 PM, Kees Cook  wrote:
>> 
>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>>> On average, all the options starting with “used_…”  (i.e, only the 
>>> registers that are used in the routine will be zeroed) have very low 
>>> runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>>> benchmarks. 
>>> If all the registers will be zeroed, the runtime overhead is bigger, 
>>> all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
>>> on average. 
>>> Looks like the overhead of zeroing vector registers is much bigger. 
>>> 
>>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
>>> the runtime overhead with this is very small.
>> 
>> That looks great; thanks for doing those tests!
>> 
>> (And it seems like these benchmarks are kind of a "worst case" scenario
>> with regard to performance, yes? As in it's mostly tight call loops?)
> 
>The top 3 benchmarks that have the most overhead from this option are: 
> 531.deepsjeng_r, 541.leela_r, and 511.povray_r.
>All of them are C++ benchmarks. 
>I guess that the most important reason is  the smaller routine size in 
> general (especially at the hot execution path or loops).
>As a result, the overhead of these additional zeroing instructions in each 
> routine will be relatively higher.  
> 
>Qing
> 
> I think that overhead is expected in benchmarks like 541.leela_r, according 
> to 
> https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html__;!!GqivPVa7Brio!I4c2wyzrNGbeOTsX7BSD-4C9Cv3ypQ4N1qfRzSK__STxRGa5M4VarBKof2ak8-dT$
>  
> <https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html__;!!GqivPVa7Brio!I4c2wyzrNGbeOTsX7BSD-4C9Cv3ypQ4N1qfRzSK__STxRGa5M4VarBKof2ak8-dT$>
>   is a benchmark for Artificial Intelligence (Monte Carlo simulation, game 
> tree search & pattern recognition). The addition of fzero-call-used-regs will 
> represent an overhead each time the functions are being call and in areas 
> like game tree search is high. 
> 
> Qing, thanks a lot for the measurement, I am not sure if this is the limit of 
> overhead the community is willing to accept by adding extra security (me as 
> gcc user will be willing to accept). 

From the performance data, we can see that the runtime overhead of clearing 
only_used registers is very reasonable, even for 541.leela_r, 531.deepsjent_r, 
and 511.povray.   If try to clear all registers whatever used or not in the 
current routine, the overhead will be increased dramatically. 

So, my question is:

From the security point of view, does clearing ALL registers have more benefit 
than clearing USED registers?  
From my understanding, clearing registers that are not used in the current 
routine does NOT provide additional benefit, correct me if I am wrong here.

Thanks.

Qing


> 
> Regards
> 
> Victor 
> 
> 
>> 
>> -- 
>> Kees Cook



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Segher Boessenkool
Hi!

On Mon, Aug 24, 2020 at 03:49:50PM -0500, Qing Zhao wrote:
> > Do you want to do this before or after the epilogue code is generated?
> 
> static rtx_insn *
> make_epilogue_seq (void)
> {
>   if (!targetm.have_epilogue ())
> return NULL;
> 
>   start_sequence ();
>   emit_note (NOTE_INSN_EPILOGUE_BEG);
> 
>  + gen_call_used_regs_seq (); // this is the place to 
> emit the zeroing insn sequence
> 
>   rtx_insn *seq = targetm.gen_epilogue ();
> …
> }
> 
> Any comment on this?

So, before.  This is problematic if the epilogue uses any of those
registers: if the epilogue expects some value there, you just destroyed
it; and, conversely, if the epilogue writes such a reg, your zeroing is
useless.


You probably have to do this for every target separately?  But it is not
enough to handle it in the epilogue, you also need to make sure it is
done on every path that returns *without* epilogue.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Segher Boessenkool
On Mon, Aug 24, 2020 at 03:43:11PM -0500, Qing Zhao wrote:
> > On Aug 24, 2020, at 3:20 PM, Segher Boessenkool 
> >  wrote:
> >> For this testing? (Is CPU2017 good enough)?
> > 
> > I would use something more real-life, not 12 small pieces of code.
> 
> Then, what kind of real-life benchmark you are suggesting? 

Picking benchmark code is Hard (and that is your job, not mine, sorry).
Maybe firefox or openoffice or whatever.  Some *bigger* code.  Real-life
code.

> >> Please take a look at this paper. 
> > 
> > As I told you before, that isn't open information, I cannot reply to
> > any of that.
> 
> A little confused here, what’s you mean by “open information”? Is the 
> information in a published paper not open information?

I am not allowed to quote it here.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Segher Boessenkool
On Thu, Sep 03, 2020 at 10:13:35AM -0700, Kees Cook wrote:
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> > On average, all the options starting with “used_…”  (i.e, only the 
> > registers that are used in the routine will be zeroed) have very low 
> > runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
> > benchmarks. 
> > If all the registers will be zeroed, the runtime overhead is bigger, 
> > all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
> > on average. 
> > Looks like the overhead of zeroing vector registers is much bigger. 
> > 
> > For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
> > the runtime overhead with this is very small.
> 
> That looks great; thanks for doing those tests!
> 
> (And it seems like these benchmarks are kind of a "worst case" scenario
> with regard to performance, yes? As in it's mostly tight call loops?)

I call this very expensive, already, and it is benchmarked on a target
where this should be very cheap (it has few registers) :-/


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Qing Zhao via Gcc-patches



> On Sep 4, 2020, at 10:43 AM, Segher Boessenkool  
> wrote:
> 
> On Thu, Sep 03, 2020 at 10:13:35AM -0700, Kees Cook wrote:
>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>>> On average, all the options starting with “used_…”  (i.e, only the 
>>> registers that are used in the routine will be zeroed) have very low 
>>> runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>>> benchmarks. 
>>> If all the registers will be zeroed, the runtime overhead is bigger, 
>>> all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
>>> on average. 
>>> Looks like the overhead of zeroing vector registers is much bigger. 
>>> 
>>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
>>> the runtime overhead with this is very small.
>> 
>> That looks great; thanks for doing those tests!
>> 
>> (And it seems like these benchmarks are kind of a "worst case" scenario
>> with regard to performance, yes? As in it's mostly tight call loops?)
> 
> I call this very expensive, already,

Yes, I think that 17.56% on average is quite expensive. That’s the data for 
-fzero-call-used-regs=all, the worst case i.e, clearing all the call-used 
registers at the return.

However, if we only clear USED registers, the worst case is 1.72% on average.  
This overhead is very reasonable. 
Furthermore, if we only clear used_gpr_arg, i.e used general purpose registers 
that pass parameters, this should be enough to be used for mitigation ROP, the 
overhead is even smaller, it’s 0.84% on average. 


> and it is benchmarked on a target
> where this should be very cheap (it has few registers) :-/

It’s a tradeoff to improve the software security with some runtime overhead. 

For compiler, we should provide such option to the users to satisfy their 
security need even though the runtime overhead.  Of course, during compiler 
implementation, we will do our best to minimize the runtime overhead.

Qing



> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread H.J. Lu via Gcc-patches
On Fri, Sep 4, 2020 at 8:18 AM Segher Boessenkool
 wrote:
>
> Hi!
>
> On Mon, Aug 24, 2020 at 03:49:50PM -0500, Qing Zhao wrote:
> > > Do you want to do this before or after the epilogue code is generated?
> >
> > static rtx_insn *
> > make_epilogue_seq (void)
> > {
> >   if (!targetm.have_epilogue ())
> > return NULL;
> >
> >   start_sequence ();
> >   emit_note (NOTE_INSN_EPILOGUE_BEG);
> >
> >  + gen_call_used_regs_seq (); // this is the place 
> > to emit the zeroing insn sequence
> >
> >   rtx_insn *seq = targetm.gen_epilogue ();
> > …
> > }
> >
> > Any comment on this?
>
> So, before.  This is problematic if the epilogue uses any of those
> registers: if the epilogue expects some value there, you just destroyed
> it; and, conversely, if the epilogue writes such a reg, your zeroing is
> useless.
>
>
> You probably have to do this for every target separately?  But it is not
> enough to handle it in the epilogue, you also need to make sure it is
> done on every path that returns *without* epilogue.

This feature is designed for normal return with epilogue.

-- 
H.J.


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Segher Boessenkool
On Fri, Sep 04, 2020 at 12:18:12PM -0500, Qing Zhao wrote:
> > I call this very expensive, already,
> 
> Yes, I think that 17.56% on average is quite expensive. That’s the data for 
> -fzero-call-used-regs=all, the worst case i.e, clearing all the call-used 
> registers at the return.
> 
> However, if we only clear USED registers, the worst case is 1.72% on average. 
>  This overhead is very reasonable. 

No, that is the number I meant.  2% overhead is extremely much, unless
this is magically super effective, and actually protects many things
from exploitation (that aren't already protected some other way, SSP for
example).

> > and it is benchmarked on a target
> > where this should be very cheap (it has few registers) :-/
> 
> It’s a tradeoff to improve the software security with some runtime overhead. 

Yes.  Which is why I asked for numbers of both sides of the equation:
how much it costs, vs. how much value it brings.

> For compiler, we should provide such option to the users to satisfy their 
> security need even though the runtime overhead.  Of course, during compiler 
> implementation, we will do our best to minimize the runtime overhead.

There also is a real cost to the compiler *developers*.  Which is my
prime worry here.  If this gives users at most marginal value, then it
is real cost to us, but nothing to hold up to that.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Segher Boessenkool
On Fri, Sep 04, 2020 at 10:34:23AM -0700, H.J. Lu wrote:
> > You probably have to do this for every target separately?  But it is not
> > enough to handle it in the epilogue, you also need to make sure it is
> > done on every path that returns *without* epilogue.
> 
> This feature is designed for normal return with epilogue.

Very many normal returns do *not* pass through an epilogue, but are
simple_return.  Disabling that is *much* more expensive than that 2%.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread H.J. Lu via Gcc-patches
On Fri, Sep 4, 2020 at 11:09 AM Segher Boessenkool
 wrote:
>
> On Fri, Sep 04, 2020 at 10:34:23AM -0700, H.J. Lu wrote:
> > > You probably have to do this for every target separately?  But it is not
> > > enough to handle it in the epilogue, you also need to make sure it is
> > > done on every path that returns *without* epilogue.
> >
> > This feature is designed for normal return with epilogue.
>
> Very many normal returns do *not* pass through an epilogue, but are
> simple_return.  Disabling that is *much* more expensive than that 2%.

Sibcall isn't covered.  What other cases don't have an epilogue?

-- 
H.J.


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-04 Thread Qing Zhao via Gcc-patches



> On Sep 4, 2020, at 1:04 PM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 04, 2020 at 12:18:12PM -0500, Qing Zhao wrote:
>>> I call this very expensive, already,
>> 
>> Yes, I think that 17.56% on average is quite expensive. That’s the data for 
>> -fzero-call-used-regs=all, the worst case i.e, clearing all the call-used 
>> registers at the return.
>> 
>> However, if we only clear USED registers, the worst case is 1.72% on 
>> average.  This overhead is very reasonable. 
> 
> No, that is the number I meant.  2% overhead is extremely much, unless
> this is magically super effective, and actually protects many things
> from exploitation (that aren't already protected some other way, SSP for
> example).

Then how about the 0.81% overhead on average for 
-fzero-call-used-regs=used_gpr_arg? 

This option can be used to effectively mitigate ROP attack. 

and currently,   Clear Linux project has been using a similar option as this 
one since GCC 8 (similar as -fzero-call-used-regs=used_gpr). 


>>> and it is benchmarked on a target
>>> where this should be very cheap (it has few registers) :-/
>> 
>> It’s a tradeoff to improve the software security with some runtime overhead. 
> 
> Yes.  Which is why I asked for numbers of both sides of the equation:
> how much it costs, vs. how much value it brings.

Reasonable. 

> 
>> For compiler, we should provide such option to the users to satisfy their 
>> security need even though the runtime overhead.  Of course, during compiler 
>> implementation, we will do our best to minimize the runtime overhead.
> 
> There also is a real cost to the compiler *developers*.  Which is my
> prime worry here.  If this gives users at most marginal value, then it
> is real cost to us, but nothing to hold up to that.

Here, you mean the future maintenance  cost  for this part of the code?

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-07 Thread Rodriguez Bahena, Victor via Gcc-patches


From: Qing Zhao 
Date: Friday, September 4, 2020 at 9:19 AM
To: "Rodriguez Bahena, Victor" , Kees Cook 

Cc: Segher Boessenkool , Jakub Jelinek 
, Uros Bizjak , GCC Patches 

Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]




On Sep 3, 2020, at 8:23 PM, Rodriguez Bahena, Victor 
mailto:victor.rodriguez.bah...@intel.com>> 
wrote:



-Original Message-
From: Qing Zhao mailto:qing.z...@oracle.com>>
Date: Thursday, September 3, 2020 at 12:55 PM
To: Kees Cook mailto:keesc...@chromium.org>>
Cc: Segher Boessenkool 
mailto:seg...@kernel.crashing.org>>, Jakub Jelinek 
mailto:ja...@redhat.com>>, Uros Bizjak 
mailto:ubiz...@gmail.com>>, "Rodriguez Bahena, Victor" 
mailto:victor.rodriguez.bah...@intel.com>>, 
GCC Patches mailto:gcc-patches@gcc.gnu.org>>
Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]




On Sep 3, 2020, at 12:13 PM, Kees Cook 
mailto:keesc...@chromium.org>> wrote:

On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:

On average, all the options starting with “used_…”  (i.e, only the registers 
that are used in the routine will be zeroed) have very low runtime overheads, 
at most 1.72% for integer benchmarks, and 1.17% for FP benchmarks.
If all the registers will be zeroed, the runtime overhead is bigger, all_arg is 
5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks on average.
Looks like the overhead of zeroing vector registers is much bigger.

For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, the 
runtime overhead with this is very small.

That looks great; thanks for doing those tests!

(And it seems like these benchmarks are kind of a "worst case" scenario
with regard to performance, yes? As in it's mostly tight call loops?)

   The top 3 benchmarks that have the most overhead from this option are: 
531.deepsjeng_r, 541.leela_r, and 511.povray_r.
   All of them are C++ benchmarks.
   I guess that the most important reason is  the smaller routine size in 
general (especially at the hot execution path or loops).
   As a result, the overhead of these additional zeroing instructions in each 
routine will be relatively higher.

   Qing

I think that overhead is expected in benchmarks like 541.leela_r, according to 
https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html__;!!GqivPVa7Brio!I4c2wyzrNGbeOTsX7BSD-4C9Cv3ypQ4N1qfRzSK__STxRGa5M4VarBKof2ak8-dT$<https://urldefense.com/v3/__https:/www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html__;!!GqivPVa7Brio!I4c2wyzrNGbeOTsX7BSD-4C9Cv3ypQ4N1qfRzSK__STxRGa5M4VarBKof2ak8-dT$>
  is a benchmark for Artificial Intelligence (Monte Carlo simulation, game tree 
search & pattern recognition). The addition of fzero-call-used-regs will 
represent an overhead each time the functions are being call and in areas like 
game tree search is high.

Qing, thanks a lot for the measurement, I am not sure if this is the limit of 
overhead the community is willing to accept by adding extra security (me as gcc 
user will be willing to accept).

From the performance data, we can see that the runtime overhead of clearing 
only_used registers is very reasonable, even for 541.leela_r, 531.deepsjent_r, 
and 511.povray.   If try to clear all registers whatever used or not in the 
current routine, the overhead will be increased dramatically.

So, my question is:

From the security point of view, does clearing ALL registers have more benefit 
than clearing USED registers?
From my understanding, clearing registers that are not used in the current 
routine does NOT provide additional benefit, correct me if I am wrong here.

You are right, it does not provide additional security


Thanks.

Qing



Regards

Victor




--
Kees Cook




Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-07 Thread Segher Boessenkool
On Fri, Sep 04, 2020 at 11:52:13AM -0700, H.J. Lu wrote:
> On Fri, Sep 4, 2020 at 11:09 AM Segher Boessenkool
>  wrote:
> > On Fri, Sep 04, 2020 at 10:34:23AM -0700, H.J. Lu wrote:
> > > > You probably have to do this for every target separately?  But it is not
> > > > enough to handle it in the epilogue, you also need to make sure it is
> > > > done on every path that returns *without* epilogue.
> > >
> > > This feature is designed for normal return with epilogue.
> >
> > Very many normal returns do *not* pass through an epilogue, but are
> > simple_return.  Disabling that is *much* more expensive than that 2%.
> 
> Sibcall isn't covered.  What other cases don't have an epilogue?

Shrink-wrapped stuff.  Quite important for performance.  Not something
you can throw away.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-07 Thread Segher Boessenkool
On Fri, Sep 04, 2020 at 02:00:41PM -0500, Qing Zhao wrote:
> >> However, if we only clear USED registers, the worst case is 1.72% on 
> >> average.  This overhead is very reasonable. 
> > 
> > No, that is the number I meant.  2% overhead is extremely much, unless
> > this is magically super effective, and actually protects many things
> > from exploitation (that aren't already protected some other way, SSP for
> > example).
> 
> Then how about the 0.81% overhead on average for 
> -fzero-call-used-regs=used_gpr_arg? 

That is still quite a lot.

> This option can be used to effectively mitigate ROP attack. 

Nice assertion.  Show it!

> > Yes.  Which is why I asked for numbers of both sides of the equation:
> > how much it costs, vs. how much value it brings.
> 
> Reasonable. 

I'm glad you agree :-)

> >> For compiler, we should provide such option to the users to satisfy their 
> >> security need even though the runtime overhead.  Of course, during 
> >> compiler implementation, we will do our best to minimize the runtime 
> >> overhead.
> > 
> > There also is a real cost to the compiler *developers*.  Which is my
> > prime worry here.  If this gives users at most marginal value, then it
> > is real cost to us, but nothing to hold up to that.
> 
> Here, you mean the future maintenance  cost  for this part of the code?

Not just that.  *All* support costs, and consider all other
optimisations it will interfere with, etc.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-07 Thread Segher Boessenkool
On Fri, Sep 04, 2020 at 01:23:14AM +, Rodriguez Bahena, Victor wrote:
> Qing, thanks a lot for the measurement, I am not sure if this is the limit of 
> overhead the community is willing to accept by adding extra security (me as 
> gcc user will be willing to accept). 

The overhead is of course bearable for most programs / users, but what
is the return?  For what percentage of programs are ROP attacks no
longer possible, for example.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-07 Thread H.J. Lu via Gcc-patches
On Mon, Sep 7, 2020 at 7:06 AM Segher Boessenkool
 wrote:
>
> On Fri, Sep 04, 2020 at 11:52:13AM -0700, H.J. Lu wrote:
> > On Fri, Sep 4, 2020 at 11:09 AM Segher Boessenkool
> >  wrote:
> > > On Fri, Sep 04, 2020 at 10:34:23AM -0700, H.J. Lu wrote:
> > > > > You probably have to do this for every target separately?  But it is 
> > > > > not
> > > > > enough to handle it in the epilogue, you also need to make sure it is
> > > > > done on every path that returns *without* epilogue.
> > > >
> > > > This feature is designed for normal return with epilogue.
> > >
> > > Very many normal returns do *not* pass through an epilogue, but are
> > > simple_return.  Disabling that is *much* more expensive than that 2%.
> >
> > Sibcall isn't covered.  What other cases don't have an epilogue?
>
> Shrink-wrapped stuff.  Quite important for performance.  Not something
> you can throw away.
>

Qing, can you check how it interacts with shrink-wrap?

-- 
H.J.


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-08 Thread Qing Zhao via Gcc-patches


> On Sep 7, 2020, at 9:36 AM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 04, 2020 at 02:00:41PM -0500, Qing Zhao wrote:
 However, if we only clear USED registers, the worst case is 1.72% on 
 average.  This overhead is very reasonable. 
>>> 
>>> No, that is the number I meant.  2% overhead is extremely much, unless
>>> this is magically super effective, and actually protects many things
>>> from exploitation (that aren't already protected some other way, SSP for
>>> example).
>> 
>> Then how about the 0.81% overhead on average for 
>> -fzero-call-used-regs=used_gpr_arg? 
> 
> That is still quite a lot.
> 
>> This option can be used to effectively mitigate ROP attack. 
> 
> Nice assertion.  Show it!

As I mentioned multiple times,  one important background of this patch is this  
paper which was published at 2018 IEEE 29th International Conference on 
Application-specific Systems, Architectures and Processors (ASAP):

"Clean the Scratch Registers: A Way to Mitigate Return-Oriented Programming 
Attacks”

https://ieeexplore.ieee.org/document/8445132

Downloading this paper form IEEE needs a fee. I have downloaded it from my 
company’s account, however, After consulting, it turned out that I was not 
allowed to further forward the copy I downloaded through my company’s account 
to this alias. 

However, There is some more information on this paper online though:

https://www.semanticscholar.org/paper/Clean-the-Scratch-Registers:-A-Way-to-Mitigate-Rong-Xie/6f2ce4fd31baa0f6c02f9eb5c57b90d39fe5fa13

All the figures and tables in this paper are available in this link. 

In which, Table III, Table IV and Table V are the results of “zeroing scratch 
register mitigate ROP attack”. From the tables, zeroing scratch registers can 
successfully mitigate the ROP on all those benchmarks. 

What other information you need to show the effective of mitigation ROP attack?

> 
>>> Yes.  Which is why I asked for numbers of both sides of the equation:
>>> how much it costs, vs. how much value it brings.--- Begin Message ---


> On Aug 25, 2020, at 9:05 AM, Qing Zhao via Gcc-patches 
>  wrote:
> 
> 
> 
>> On Aug 25, 2020, at 1:41 AM, Uros Bizjak  wrote:
>> 
 
>> (The other side of the coin is how much this helps prevent exploitation;
>> numbers on that would be good to see, too.)
> 
> This can be well showed from the paper:
> 
> "Clean the Scratch Registers: A Way to Mitigate Return-Oriented 
> Programming Attacks"
> 
> https://urldefense.com/v3/__https://ieeexplore.ieee.org/document/8445132__;!!GqivPVa7Brio!JbdLvo54xB3ORTeZqpy_PwZsL9drNLaKjbg14bTKMOwxt8LWnjZ8gJWlqtlrFKPh$
>  
>   >
> 
> Please take a look at this paper.
 
 As I told you before, that isn't open information, I cannot reply to
 any of that.
>>> 
>>> A little confused here, what’s you mean by “open information”? Is the 
>>> information in a published paper not open information?
>> 
>> No, because it is behind a paywall.
> 
> Still don’t understand here:  this paper has been published in the proceeding 
> of “ 2018 IEEE 29th International Conference on Application-specific Systems, 
> Architectures and Processors (ASAP)”.
> If you want to read the complete version online, you need to pay for it.
> 
> However, it’s still a published paper, and the information inside it should 
> be “open information”. 
> 
> So, what’s the definition of “open information” you have?
> 
> I downloaded a PDF copy of this paper through my company’s paid account.  But 
> I am not sure whether it’s legal for me to attach it to this mailing list?

After consulting, it turned out that I was not allowed to further forward the 
copy I downloaded through my company’s account to this alias. 
There is some more information on this paper online though:

https://urldefense.com/v3/__https://www.semanticscholar.org/paper/Clean-the-Scratch-Registers:-A-Way-to-Mitigate-Rong-Xie/6f2ce4fd31baa0f6c02f9eb5c57b90d39fe5fa13__;!!GqivPVa7Brio!I4MGz7_DH7Dtcfzmgz7MxfDNnuJO-CiNo1jUcp4OOQOiPi4uEEOfuoT7_1SSMt1D$
 

All the figures and tables in this paper are available in this link. 

In which, Figure 1 is an illustration  of a typical ROP attack, please pay 
special attention on the “Gadgets”, which are carefully chosen machine 
instruction sequences that are already present in the machine's memory, Each 
gadget typically ends in a return instruction and is located in a subroutine 
within the existing program and/or shared library code. Chained together, these 
gadgets allow an attacker to perform arbitrary operations on a machine 
employing defenses that thwart simpler attacks.

The paper identified the important features of ROP attack as following:

"First, the destination of using gadget chains in usual is performing system 
call or system fucntion to perform malicious behav

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-08 Thread Qing Zhao via Gcc-patches



> On Sep 7, 2020, at 8:06 AM, Rodriguez Bahena, Victor 
>  wrote:
> 
>  
>  
> From: Qing Zhao mailto:qing.z...@oracle.com>>
> Date: Friday, September 4, 2020 at 9:19 AM
> To: "Rodriguez Bahena, Victor"  <mailto:victor.rodriguez.bah...@intel.com>>, Kees Cook  <mailto:keesc...@chromium.org>>
> Cc: Segher Boessenkool  <mailto:seg...@kernel.crashing.org>>, Jakub Jelinek  <mailto:ja...@redhat.com>>, Uros Bizjak  <mailto:ubiz...@gmail.com>>, GCC Patches  <mailto:gcc-patches@gcc.gnu.org>>
> Subject: Re: PING [Patch][Middle-end]Add 
> -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]
>  
>  
> 
> 
>> On Sep 3, 2020, at 8:23 PM, Rodriguez Bahena, Victor 
>> > <mailto:victor.rodriguez.bah...@intel.com>> wrote:
>>  
>> 
>> 
>> -Original Message-
>> From: Qing Zhao mailto:qing.z...@oracle.com>>
>> Date: Thursday, September 3, 2020 at 12:55 PM
>> To: Kees Cook mailto:keesc...@chromium.org>>
>> Cc: Segher Boessenkool > <mailto:seg...@kernel.crashing.org>>, Jakub Jelinek > <mailto:ja...@redhat.com>>, Uros Bizjak > <mailto:ubiz...@gmail.com>>, "Rodriguez Bahena, Victor" 
>> > <mailto:victor.rodriguez.bah...@intel.com>>, GCC Patches 
>> mailto:gcc-patches@gcc.gnu.org>>
>> Subject: Re: PING [Patch][Middle-end]Add 
>> -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]
>> 
>> 
>> 
>> 
>>> On Sep 3, 2020, at 12:13 PM, Kees Cook >> <mailto:keesc...@chromium.org>> wrote:
>>> 
>>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>>> 
>>>> On average, all the options starting with “used_…”  (i.e, only the 
>>>> registers that are used in the routine will be zeroed) have very low 
>>>> runtime overheads, at most 1.72% for integer benchmarks, and 1.17% for FP 
>>>> benchmarks. 
>>>> If all the registers will be zeroed, the runtime overhead is bigger, 
>>>> all_arg is 5.7%, all_gpr is 3.5%, and all is 17.56% for integer benchmarks 
>>>> on average. 
>>>> Looks like the overhead of zeroing vector registers is much bigger. 
>>>> 
>>>> For ROP mitigation, -fzero-call-used-regs=used-gpr-arg should be enough, 
>>>> the runtime overhead with this is very small.
>>> 
>>> That looks great; thanks for doing those tests!
>>> 
>>> (And it seems like these benchmarks are kind of a "worst case" scenario
>>> with regard to performance, yes? As in it's mostly tight call loops?)
>> 
>>The top 3 benchmarks that have the most overhead from this option are: 
>> 531.deepsjeng_r, 541.leela_r, and 511.povray_r.
>>All of them are C++ benchmarks. 
>>I guess that the most important reason is  the smaller routine size in 
>> general (especially at the hot execution path or loops).
>>As a result, the overhead of these additional zeroing instructions in 
>> each routine will be relatively higher.  
>> 
>>Qing
>> 
>> I think that overhead is expected in benchmarks like 541.leela_r, according 
>> to 
>> https://urldefense.com/v3/__https://www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html__;!!GqivPVa7Brio!I4c2wyzrNGbeOTsX7BSD-4C9Cv3ypQ4N1qfRzSK__STxRGa5M4VarBKof2ak8-dT$
>>  
>> <https://urldefense.com/v3/__https:/www.spec.org/cpu2017/Docs/benchmarks/541.leela_r.html__;!!GqivPVa7Brio!I4c2wyzrNGbeOTsX7BSD-4C9Cv3ypQ4N1qfRzSK__STxRGa5M4VarBKof2ak8-dT$>
>>   is a benchmark for Artificial Intelligence (Monte Carlo simulation, game 
>> tree search & pattern recognition). The addition of fzero-call-used-regs 
>> will represent an overhead each time the functions are being call and in 
>> areas like game tree search is high. 
>> 
>> Qing, thanks a lot for the measurement, I am not sure if this is the limit 
>> of overhead the community is willing to accept by adding extra security (me 
>> as gcc user will be willing to accept). 
>  
> From the performance data, we can see that the runtime overhead of clearing 
> only_used registers is very reasonable, even for 541.leela_r, 
> 531.deepsjent_r, and 511.povray.   If try to clear all registers whatever 
> used or not in the current routine, the overhead will be increased 
> dramatically. 
>  
> So, my question is:
>  
> From the security point of view, does clearing ALL registers have more 
> benefit than clearing USED registers?  
> From my understanding, clearing registers that are not used in the current 
> routine does NOT provide additional benefit, correct me if I am wrong here.
>  
> You are right, it does not provide additional security

Then, is it necessary to provide 

-fzero-call-used-regs=all-arg|all-gpr|all   to the user?

Can we just delete these 3 sub options?


Qing


>  
>  
> Thanks.
>  
> Qing
>  
>  
>> 
>> Regards
>> 
>> Victor 
>> 
>> 
>> 
>>> 
>>> -- 
>>> Kees Cook
> 
> 
> 



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-08 Thread Patrick McGehearty via Gcc-patches

My understanding is this feature/flag is not intended to be "default on".
It is intended to be used in security sensitive environments such
as the Linux kernel where it was requested by kernel security experts.
I'm not understanding the objection here if the feature is requested
by security teams and the average cost is modest.

My background is in performance and application optimization. I agree
that for typical computation oriented, non-secure applications, I would
not use the feature, but for system applications that have the ability
to cross protection boundaries, it seems to be clearly a worthwhile
feature.

- patrick


On 9/7/2020 9:44 AM, Segher Boessenkool wrote:

On Fri, Sep 04, 2020 at 01:23:14AM +, Rodriguez Bahena, Victor wrote:

Qing, thanks a lot for the measurement, I am not sure if this is the limit of 
overhead the community is willing to accept by adding extra security (me as gcc 
user will be willing to accept).

The overhead is of course bearable for most programs / users, but what
is the return?  For what percentage of programs are ROP attacks no
longer possible, for example.


Segher




Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-08 Thread Qing Zhao via Gcc-patches



> On Sep 7, 2020, at 10:58 AM, H.J. Lu  wrote:
> 
> On Mon, Sep 7, 2020 at 7:06 AM Segher Boessenkool
> mailto:seg...@kernel.crashing.org>> wrote:
>> 
>> On Fri, Sep 04, 2020 at 11:52:13AM -0700, H.J. Lu wrote:
>>> On Fri, Sep 4, 2020 at 11:09 AM Segher Boessenkool
>>>  wrote:
 On Fri, Sep 04, 2020 at 10:34:23AM -0700, H.J. Lu wrote:
>> You probably have to do this for every target separately?  But it is not
>> enough to handle it in the epilogue, you also need to make sure it is
>> done on every path that returns *without* epilogue.
> 
> This feature is designed for normal return with epilogue.
 
 Very many normal returns do *not* pass through an epilogue, but are
 simple_return.  Disabling that is *much* more expensive than that 2%.
>>> 
>>> Sibcall isn't covered.  What other cases don't have an epilogue?
>> 
>> Shrink-wrapped stuff.  Quite important for performance.  Not something
>> you can throw away.
>> 
> 
> Qing, can you check how it interacts with shrink-wrap?

We have some discussion on shrink-wrapping previously.  And we agreed on  the 
following at that time:

"Shrink-wrapping often deals with the non-volatile registers, so that
doesn't matter much for this patch series.”

On the other hand, we deal with volatile registers in this patch, so from the 
registers point of view, there is NO overlap between this
Patch and the shrink-wrapping. 

So, what’s the other possible issues when this patch interacting with 
shrink-wrapping?

When I checked the gcc source code on shrink-wrapping as following 
(gcc/function.c):


…….
  rtx_insn *epilogue_seq = make_epilogue_seq ();

  /* Try to perform a kind of shrink-wrapping, making sure the
 prologue/epilogue is emitted only around those parts of the
 function that require it.  */
  try_shrink_wrapping (&entry_edge, prologue_seq);

  /* If the target can handle splitting the prologue/epilogue into separate
 components, try to shrink-wrap these components separately.  */
  try_shrink_wrapping_separate (entry_edge->dest);

  /* If that did anything for any component we now need the generate the
 "main" prologue again.  Because some targets require some of these
 to be called in a specific order (i386 requires the split prologue
 to be first, for example), we create all three sequences again here.
 If this does not work for some target, that target should not enable
 separate shrink-wrapping.  */
  if (crtl->shrink_wrapped_separate)
{
  split_prologue_seq = make_split_prologue_seq ();
  prologue_seq = make_prologue_seq ();
  epilogue_seq = make_epilogue_seq ();
}
…….

My understanding from the above is:

1. “try_shrink_wrapping” should NOT interact with make_epilogue_seq since only 
“prologue_seq” will not touched. 
2. “try_shrink_wrapping_seperate”  might interact with epilogue, however, if 
there is anything changed with “try_shrink_wrapping_seperate”,
make_epilogue_seq() will be called again, and then the zeroing sequence 
will be generated still at the end of the routine. 

So, from the above, I didn’t see any obvious issues.

But I might miss some important  issues here, please let me know what I am 
missing here?

Thanks a lot for any help.

Qing



> 
> -- 
> H.J.



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Richard Sandiford
Patrick McGehearty via Gcc-patches  writes:
> My understanding is this feature/flag is not intended to be "default on".
> It is intended to be used in security sensitive environments such
> as the Linux kernel where it was requested by kernel security experts.
> I'm not understanding the objection here if the feature is requested
> by security teams and the average cost is modest.

Agreed.  And of course, “is modest” here means “is modest in the eyes
of the people who want to use it”.

IMO it's been established at this point that the feature is useful
enough to some people.  It might be too expensive for others,
but that's OK.

I've kind-of lost track of where we stand given all the subthreads.
If we've now decided which suboptions we want to support, would it
make sense to start a new thread with the current patch, and then
just concentrate on code review for that subthread?

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Qing Zhao via Gcc-patches
Richard,

Thank you!

> On Sep 10, 2020, at 7:11 AM, Richard Sandiford  
> wrote:
> 
> Patrick McGehearty via Gcc-patches  writes:
>> My understanding is this feature/flag is not intended to be "default on".
>> It is intended to be used in security sensitive environments such
>> as the Linux kernel where it was requested by kernel security experts.
>> I'm not understanding the objection here if the feature is requested
>> by security teams and the average cost is modest.
> 
> Agreed.  And of course, “is modest” here means “is modest in the eyes
> of the people who want to use it”.
> 
> IMO it's been established at this point that the feature is useful
> enough to some people.  It might be too expensive for others,
> but that's OK.
> 
> I've kind-of lost track of where we stand given all the subthreads.
> If we've now decided which suboptions we want to support,

From the performance data, we saw that clearing ALL registers cost too much 
more without any additional benefit, so, I’d like to delete all those 
sub-options including “ALL”, i.e, all-arg, all-gpr, all.

Now, the option will be:

-fzero-call-used-regs=skip|gpr-arg|all-arg|gpr|all

Add -fzero-call-used-regs=[skip|gpr-arg|all-arg|gpr|all] command-line option
and
zero_call_used_regs("skip|gpr-arg|all-arg|gpr|all") function attribues:

1. -mzero-call-used-regs=skip and zero_call_used_regs("skip")

Don't zero call-used registers upon function return. This is the default 
behavior.

2. -mzero-call-used-regs=gpr-arg and zero_call_used_regs("gpr-arg")

Upon function return,  zero call-used general purpose registers that are 
used in the routine and might pass parameters.

3. -mzero-call-used-regs=used-arg and zero_call_used_regs(“all-arg")

Upon function return, zero call-used registers that are used in the routine 
and might pass parameters.
4. -mzero-call-used-regs=used-gpr and zero_call_used_regs("gpr")

Upon function return, zero call-used general purpose registers that are 
used in the routine.

5. -mzero-call-used-regs=used and zero_call_used_regs(“all")

Upon function return, zero call-used registers that are used in the routine.

Let me know any objection or comment. 

> would it
> make sense to start a new thread with the current patch, and then
> just concentrate on code review for that subthread?

I will start the new thread after my new patch is ready.

Thanks again.

Qing
> 
> Thanks,
> Richard



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Rodriguez Bahena, Victor via Gcc-patches


-Original Message-
From: Qing Zhao 
Date: Thursday, September 10, 2020 at 9:34 AM
To: Richard Sandiford , kees Cook 
, "Rodriguez Bahena, Victor" 

Cc: Patrick McGehearty via Gcc-patches 
Subject: Re: PING [Patch][Middle-end]Add 
-fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

Richard,

Thank you!

> On Sep 10, 2020, at 7:11 AM, Richard Sandiford 
 wrote:
> 
> Patrick McGehearty via Gcc-patches  writes:
>> My understanding is this feature/flag is not intended to be "default on".
>> It is intended to be used in security sensitive environments such
>> as the Linux kernel where it was requested by kernel security experts.
>> I'm not understanding the objection here if the feature is requested
>> by security teams and the average cost is modest.
> 
> Agreed.  And of course, “is modest” here means “is modest in the eyes
> of the people who want to use it”.
> 
> IMO it's been established at this point that the feature is useful
> enough to some people.  It might be too expensive for others,
> but that's OK.
> 
> I've kind-of lost track of where we stand given all the subthreads.
> If we've now decided which suboptions we want to support,

From the performance data, we saw that clearing ALL registers cost too much 
more without any additional benefit, so, I’d like to delete all those 
sub-options including “ALL”, i.e, all-arg, all-gpr, all.

Now, the option will be:

-fzero-call-used-regs=skip|gpr-arg|all-arg|gpr|all

Add -fzero-call-used-regs=[skip|gpr-arg|all-arg|gpr|all] command-line option
and
zero_call_used_regs("skip|gpr-arg|all-arg|gpr|all") function attribues:

1. -mzero-call-used-regs=skip and zero_call_used_regs("skip")

Don't zero call-used registers upon function return. This is the 
default behavior.

2. -mzero-call-used-regs=gpr-arg and zero_call_used_regs("gpr-arg")

Upon function return,  zero call-used general purpose registers that 
are used in the routine and might pass parameters.

3. -mzero-call-used-regs=used-arg and zero_call_used_regs(“all-arg")

Upon function return, zero call-used registers that are used in the 
routine and might pass parameters.
4. -mzero-call-used-regs=used-gpr and zero_call_used_regs("gpr")

Upon function return, zero call-used general purpose registers that are 
used in the routine.

5. -mzero-call-used-regs=used and zero_call_used_regs(“all")

Upon function return, zero call-used registers that are used in the 
routine.

Let me know any objection or comment. 

+1

> would it
> make sense to start a new thread with the current patch, and then
> just concentrate on code review for that subthread?

I will start the new thread after my new patch is ready.

Thanks again.

Qing
> 
> Thanks,
> Richard




Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Kees Cook via Gcc-patches
[tried to clean up quoting...]

On Tue, Sep 08, 2020 at 10:00:09AM -0500, Qing Zhao wrote:
> 
> > On Sep 7, 2020, at 8:06 AM, Rodriguez Bahena, Victor 
> >  wrote:
> > 
> >>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> >>> So, my question is:
> >>>
> >>> From the security point of view, does clearing ALL registers have more 
> >>> benefit than clearing USED registers?  
> >>> From my understanding, clearing registers that are not used in the 
> >>> current routine does NOT provide additional benefit, correct me if I am 
> >>> wrong here.
> >  
> > You are right, it does not provide additional security
> 
> Then, is it necessary to provide 
> 
> -fzero-call-used-regs=all-arg|all-gpr|all   to the user?
> 
> Can we just delete these 3 sub options?

Well... I'd say there is some benefit (remember that ROP gadgets are
built from function trailers, so there is rarely a concern over what the
rest of the function is doing). Generally, they are chained together
based on just the last couple instructions:

 *useful action*
 *ret*

So with ...=used this turns into:

 *useful action*
 *clear some registers*
 *ret*

Which may still be helpful (if, for example, the state being built by
the attacker is using registers _not_ in the cleared list). However:

 *useful action*
 *clear all registers*
 *ret*

Means that suddenly the ROP chain cannot use *any* of the caller-saved
registers to hold state.

So, while ...=used is likely going to block a lot, ...=all will block
even more. I'd prefer to have both available, if for no other reason
than to compare the ROP gadget availability for any given binary (e.g.
if some future attack is found that bypasses ...=used, does it also
bypass ...=all?)

-- 
Kees Cook


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Segher Boessenkool
On Tue, Sep 08, 2020 at 09:55:19AM -0500, Qing Zhao wrote:
> Downloading this paper form IEEE needs a fee.

Yes, and we cannot discuss it here.

> What other information you need to show the effective of mitigation ROP 
> attack?

Anything that we *can* talk about.  Stuff we cannot talk about does not
let us progress in one way or the other.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Segher Boessenkool
On Tue, Sep 08, 2020 at 11:43:30AM -0500, Qing Zhao wrote:
> > On Sep 7, 2020, at 10:58 AM, H.J. Lu  wrote:
> > On Mon, Sep 7, 2020 at 7:06 AM Segher Boessenkool
> > mailto:seg...@kernel.crashing.org>> wrote:
> >> On Fri, Sep 04, 2020 at 11:52:13AM -0700, H.J. Lu wrote:
> >>> On Fri, Sep 4, 2020 at 11:09 AM Segher Boessenkool
> >>>  wrote:
>  Very many normal returns do *not* pass through an epilogue, but are
>  simple_return.  Disabling that is *much* more expensive than that 2%.
> >>> 
> >>> Sibcall isn't covered.  What other cases don't have an epilogue?
> >> 
> >> Shrink-wrapped stuff.  Quite important for performance.  Not something
> >> you can throw away.
> > 
> > Qing, can you check how it interacts with shrink-wrap?
> >> 



> But I might miss some important  issues here, please let me know what I am 
> missing here?

Start looking at handle_simple_exit()?  targetm.gen_simple_return()...


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Qing Zhao via Gcc-patches



> On Sep 10, 2020, at 2:07 PM, Kees Cook  wrote:
> 
> [tried to clean up quoting...]
> 
> On Tue, Sep 08, 2020 at 10:00:09AM -0500, Qing Zhao wrote:
>> 
>>> On Sep 7, 2020, at 8:06 AM, Rodriguez Bahena, Victor 
>>>  wrote:
>>> 
> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
> So, my question is:
> 
> From the security point of view, does clearing ALL registers have more 
> benefit than clearing USED registers?  
> From my understanding, clearing registers that are not used in the 
> current routine does NOT provide additional benefit, correct me if I am 
> wrong here.
>>> 
>>> You are right, it does not provide additional security
>> 
>> Then, is it necessary to provide 
>> 
>> -fzero-call-used-regs=all-arg|all-gpr|all   to the user?
>> 
>> Can we just delete these 3 sub options?
> 
> Well... I'd say there is some benefit (remember that ROP gadgets are
> built from function trailers, so there is rarely a concern over what the
> rest of the function is doing). Generally, they are chained together
> based on just the last couple instructions:
> 
> *useful action*
> *ret*
> 
> So with ...=used this turns into:
> 
> *useful action*
> *clear some registers*
> *ret*
> 
> Which may still be helpful (if, for example, the state being built by
> the attacker is using registers _not_ in the cleared list). However:
> 
> *useful action*
> *clear all registers*
> *ret*
> 
> Means that suddenly the ROP chain cannot use *any* of the caller-saved
> registers to hold state.
> 
> So, while ...=used is likely going to block a lot, ...=all will block
> even more. I'd prefer to have both available,

Okay. I am fine with this. 

My biggest concern is the much bigger run-time overhead from zeroing those 
unused-registers.
Might need to mention the big run-time overhead in the users’s manual.

Qing
> if for no other reason
> than to compare the ROP gadget availability for any given binary (e.g.
> if some future attack is found that bypasses ...=used, does it also
> bypass ...=all?)
> 
> -- 
> Kees Cook



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-10 Thread Qing Zhao via Gcc-patches



> On Sep 10, 2020, at 5:05 PM, Segher Boessenkool  
> wrote:
> 
> On Tue, Sep 08, 2020 at 11:43:30AM -0500, Qing Zhao wrote:
>>> On Sep 7, 2020, at 10:58 AM, H.J. Lu  wrote:
>>> On Mon, Sep 7, 2020 at 7:06 AM Segher Boessenkool
>>> mailto:seg...@kernel.crashing.org>> wrote:
 On Fri, Sep 04, 2020 at 11:52:13AM -0700, H.J. Lu wrote:
> On Fri, Sep 4, 2020 at 11:09 AM Segher Boessenkool
>  wrote:
>> Very many normal returns do *not* pass through an epilogue, but are
>> simple_return.  Disabling that is *much* more expensive than that 2%.
> 
> Sibcall isn't covered.  What other cases don't have an epilogue?
 
 Shrink-wrapped stuff.  Quite important for performance.  Not something
 you can throw away.
>>> 
>>> Qing, can you check how it interacts with shrink-wrap?
 
> 
> 
> 
>> But I might miss some important  issues here, please let me know what I am 
>> missing here?
> 
> Start looking at handle_simple_exit()?  targetm.gen_simple_return()…

Yes, I have been looking at this since this morning. 
You are right, we also need to insert zeroing sequence before  this 
simple_return which the current patch missed.

I am currently try to resolve this issue with the following idea:

In the routine “thread_prologue_and_epilogue_insns”,  After both 
“make_epilogue_seq” and “try_shrink_wrapping” finished, 

Scan every exit block to see whether the last insn is a ANY_RETURN_P(insn), 
If YES, generate the zero sequence before this RETURN insn. 

Then we should take care all the exit path that returns.

Do you see any issue from this idea? 

Thanks a lot for your help.

Qing

> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Richard Sandiford
Kees Cook via Gcc-patches  writes:
> [tried to clean up quoting...]
>
> On Tue, Sep 08, 2020 at 10:00:09AM -0500, Qing Zhao wrote:
>> 
>> > On Sep 7, 2020, at 8:06 AM, Rodriguez Bahena, Victor 
>> >  wrote:
>> > 
>> >>> On Thu, Sep 03, 2020 at 09:29:54AM -0500, Qing Zhao wrote:
>> >>> So, my question is:
>> >>>
>> >>> From the security point of view, does clearing ALL registers have more 
>> >>> benefit than clearing USED registers?  
>> >>> From my understanding, clearing registers that are not used in the 
>> >>> current routine does NOT provide additional benefit, correct me if I am 
>> >>> wrong here.
>> >  
>> > You are right, it does not provide additional security
>> 
>> Then, is it necessary to provide 
>> 
>> -fzero-call-used-regs=all-arg|all-gpr|all   to the user?
>> 
>> Can we just delete these 3 sub options?
>
> Well... I'd say there is some benefit (remember that ROP gadgets are
> built from function trailers, so there is rarely a concern over what the
> rest of the function is doing). Generally, they are chained together
> based on just the last couple instructions:
>
>  *useful action*
>  *ret*
>
> So with ...=used this turns into:
>
>  *useful action*
>  *clear some registers*
>  *ret*
>
> Which may still be helpful (if, for example, the state being built by
> the attacker is using registers _not_ in the cleared list). However:
>
>  *useful action*
>  *clear all registers*
>  *ret*
>
> Means that suddenly the ROP chain cannot use *any* of the caller-saved
> registers to hold state.
>
> So, while ...=used is likely going to block a lot, ...=all will block
> even more. I'd prefer to have both available, if for no other reason
> than to compare the ROP gadget availability for any given binary (e.g.
> if some future attack is found that bypasses ...=used, does it also
> bypass ...=all?)

This might have already been discussed/answered, sorry, but:
when there's a choice, is there an obvious winner between:

(1) clearing call-clobbered registers and then restoring call-preserved ones
(2) restoring call-preserved registers and then clearing call-clobbered ones

Is one option more likely to be useful to attackers than the other?

(For some frames, it might be necessary to use a small number of
call-clobbered registers to perform the restore sequence, so (1)
wouldn't be fully achievable in all cases.)

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 11:06:03AM +0100, Richard Sandiford wrote:
> This might have already been discussed/answered, sorry, but:
> when there's a choice, is there an obvious winner between:
> 
> (1) clearing call-clobbered registers and then restoring call-preserved ones
> (2) restoring call-preserved registers and then clearing call-clobbered ones
> 
> Is one option more likely to be useful to attackers than the other?
> 
> (For some frames, it might be necessary to use a small number of
> call-clobbered registers to perform the restore sequence, so (1)
> wouldn't be fully achievable in all cases.)

The same is true for what you have to do *after* restoring registers, as
I said before.  Clearing all is not correct in all cases, and also it is
not useful in all cases (code right after it might write the registers
again.

This really is very (sub-)target-specific, it cannot be done by generic
code on its own *at all*.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 11:14 AM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 11, 2020 at 11:06:03AM +0100, Richard Sandiford wrote:
>> This might have already been discussed/answered, sorry, but:
>> when there's a choice, is there an obvious winner between:
>> 
>> (1) clearing call-clobbered registers and then restoring call-preserved ones
>> (2) restoring call-preserved registers and then clearing call-clobbered ones
>> 
>> Is one option more likely to be useful to attackers than the other?

for mitigating ROP purpose, I think that (2) is better than (1). i.e, the 
clearing
call-clobbered register sequence will be immediately before “ret” instruction. 
This will prevent the gadget from doing any useful things.

>> 
>> (For some frames, it might be necessary to use a small number of
>> call-clobbered registers to perform the restore sequence, so (1)
>> wouldn't be fully achievable in all cases.)
> 

Yes, looks like that (1) is also not correct.

> The same is true for what you have to do *after* restoring registers, as
> I said before.  Clearing all is not correct in all cases, and also it is
> not useful in all cases (code right after it might write the registers
> again.

I don’t understand why it’s not correct if we clearing call-clobbered registers 
AFTER restoring call-preserved registers?

Even though we might need to use some call-clobbered registers to restore 
the call-preserved registers, after the restoring is done, we can use data flow
to make sure the call-clobbered registers not lived at that point anymore, then
Clearing those not-lived call-clobbered registers immediately before “ret”.

For me, this should be correct. 

Let me know anything I am missing here.

Thanks.

Qing



> 
> This really is very (sub-)target-specific, it cannot be done by generic
> code on its own *at all*.
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 11:52:29AM -0500, Qing Zhao wrote:
> I don’t understand why it’s not correct if we clearing call-clobbered 
> registers 
> AFTER restoring call-preserved registers?

Because the compiler backend (or the linker!  Or the dynamic linker!
Etc.) can use volatile registers for their own purposes.

Like, on Power, r11 and r12 are used for various calling convention
purposes; they are also used for other purposes; and r0 is used as some
all-purpose volatile (it typically holds the return address near the
end of a function).

"Call-clobbered" is pretty meaningless.  It only holds meaning for a
function calling another, and then only to know which registers lose
their value then.  It has no semantics for other cases, like a function
that will return soonish, as here.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Thu, Sep 10, 2020 at 05:50:40PM -0500, Qing Zhao wrote:
>  Shrink-wrapped stuff.  Quite important for performance.  Not something
>  you can throw away.

^^^ !!! ^^^

> > Start looking at handle_simple_exit()?  targetm.gen_simple_return()…
> 
> Yes, I have been looking at this since this morning. 
> You are right, we also need to insert zeroing sequence before  this 
> simple_return which the current patch missed.

Please run the performance loss numbers again after you have something
more realistic :-(

> I am currently try to resolve this issue with the following idea:
> 
> In the routine “thread_prologue_and_epilogue_insns”,  After both 
> “make_epilogue_seq” and “try_shrink_wrapping” finished, 
> 
> Scan every exit block to see whether the last insn is a ANY_RETURN_P(insn), 
> If YES, generate the zero sequence before this RETURN insn. 
> 
> Then we should take care all the exit path that returns.
> 
> Do you see any issue from this idea? 

You need to let the backend decide what to do, for this as well as for
all other cases.  I do not know how often I will have to repeat that.

There also is separate shrink-wrapping, which you haven't touched on at
all yet.  Joy.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Richard Sandiford
Qing Zhao  writes:
>> On Sep 11, 2020, at 11:14 AM, Segher Boessenkool 
>>  wrote:
>> 
>> On Fri, Sep 11, 2020 at 11:06:03AM +0100, Richard Sandiford wrote:
>>> This might have already been discussed/answered, sorry, but:
>>> when there's a choice, is there an obvious winner between:
>>> 
>>> (1) clearing call-clobbered registers and then restoring call-preserved ones
>>> (2) restoring call-preserved registers and then clearing call-clobbered ones
>>> 
>>> Is one option more likely to be useful to attackers than the other?
>
> for mitigating ROP purpose, I think that (2) is better than (1). i.e, the 
> clearing
> call-clobbered register sequence will be immediately before “ret” 
> instruction. 
> This will prevent the gadget from doing any useful things.

OK.  The reason I was asking was that (from the naive perspective of
someone not well versed in this stuff): if the effect of one of the
register restores is itself a useful gadget, the clearing wouldn't
protect against it.  But if the register restores are not part of the
intended effect, it seemed that having them immediately before the
ret might make the gadget harder to use than clearing registers would,
because the side-effects of restores would be harder to control than the
(predictable) effect of clearing registers.

But like I say, this is very much not my area of expertise, so that's
probably missing the point in a major way. ;-)

I think the original patch plugged into pass_thread_prologue_and_epilogue,
is that right?  If we go for (2), then I think it would be better to do
it at the start of pass_late_compilation instead.  (Some targets wouldn't
cope with doing it later.)  The reason for doing it so late is that the
set of used “volatile”/caller-saved registers is not fixed at prologue
and epilogue generation: later optimisation passes can introduce uses
of volatile registers that weren't used previously.  (Sorry if this
has already been suggested.)

Unlike Segher, I think this can/should be done in target-independent
code as far as possible (like the patch seemed to do).

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 12:13 PM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 11, 2020 at 11:52:29AM -0500, Qing Zhao wrote:
>> I don’t understand why it’s not correct if we clearing call-clobbered 
>> registers 
>> AFTER restoring call-preserved registers?
> 
> Because the compiler backend (or the linker!  Or the dynamic linker!
> Etc.) can use volatile registers for their own purposes.

For the following sequence at the end of a routine:

*...*
“restore call-preserved registers”
*clear call-clobbered registers"*
*ret*

“Clear call-clobbered registers” will only clear the call-clobbered registers 
that are not live at the end of the routine.
If the call-clobbered register is live at the end of the routine, for example, 
holding the return value,
It will NOT be cleared at all.  

If the call-clobbered register has some other usage after the routine return, 
then the backend should know this and will not
clear it. Then we will resolve this issue, right?


> 
> Like, on Power, r11 and r12 are used for various calling convention
> purposes; they are also used for other purposes; and r0 is used as some
> all-purpose volatile (it typically holds the return address near the
> end of a function).

In the new version of the patch,  the implementation of clearing call-clobbered 
registers is done in backend, middle end only 
computes a hard register set based on user option, source attribute, data flow 
information, and function abi information, and
Then pass this hard register set to the target hook to generate the clearing 
sequence.  The backend will have all the details
on the special situations you mentioned. 

Let me know any more concerns here.

thanks.

Qing

> 
> "Call-clobbered" is pretty meaningless.  It only holds meaning for a
> function calling another, and then only to know which registers lose
> their value then.  It has no semantics for other cases, like a function
> that will return soonish, as here.
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 12:18 PM, Segher Boessenkool  
> wrote:
> 
> On Thu, Sep 10, 2020 at 05:50:40PM -0500, Qing Zhao wrote:
>> Shrink-wrapped stuff.  Quite important for performance.  Not something
>> you can throw away.
> 
> ^^^ !!! ^^^
> 
>>> Start looking at handle_simple_exit()?  targetm.gen_simple_return()…
>> 
>> Yes, I have been looking at this since this morning. 
>> You are right, we also need to insert zeroing sequence before  this 
>> simple_return which the current patch missed.
> 
> Please run the performance loss numbers again after you have something
> more realistic :-(

Yes, I will collect the performance data with the new patch. 

> 
>> I am currently try to resolve this issue with the following idea:
>> 
>> In the routine “thread_prologue_and_epilogue_insns”,  After both 
>> “make_epilogue_seq” and “try_shrink_wrapping” finished, 
>> 
>> Scan every exit block to see whether the last insn is a ANY_RETURN_P(insn), 
>> If YES, generate the zero sequence before this RETURN insn. 
>> 
>> Then we should take care all the exit path that returns.
>> 
>> Do you see any issue from this idea? 
> 
> You need to let the backend decide what to do, for this as well as for
> all other cases.  I do not know how often I will have to repeat that.

Yes, the new patch will separate the whole task into two parts:

A. Compute the hard register set based on user option, source code attribute, 
data flow information, function abi information, 
 The result will be “need_zeroed_register_set”, and then pass this hard reg 
set to the target hook.
B. Each target will have it’s own implementation of emitting the zeroing 
sequence based on the “need_zeroed_register_set”.


> 
> There also is separate shrink-wrapping, which you haven't touched on at
> all yet.  Joy.

Yes, in addition to shrink-wrapping, I also noticed that there are other places 
that generate “simple_return” or “return” that are not in
The epilogue, for example, in “dbr” phase (delay_slots phase), in “mach” phase 
(machine reorg phase), etc. 

So, only generate zeroing sequence in epilogue is not enough. 

Hongjiu and I discussed this more, and we came up with a new implementation, I 
will describe this new implementation in another email later.

Thanks.

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 06:32:56PM +0100, Richard Sandiford wrote:
> Unlike Segher, I think this can/should be done in target-independent
> code as far as possible (like the patch seemed to do).

My problem with that is that it is both incorrect *and* inefficient.  It
writes registers it should not touch; and some of those will be written
with other values later again anyway; and if the goal is to clear as
many parameter passing registers as possible, so why does it touch
others at all?  This makes no sense.

Only the backend knows which registers it can write when.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 02:40:06PM -0500, Qing Zhao wrote:
> > On Sep 11, 2020, at 12:13 PM, Segher Boessenkool 
> >  wrote:
> > On Fri, Sep 11, 2020 at 11:52:29AM -0500, Qing Zhao wrote:
> >> I don’t understand why it’s not correct if we clearing call-clobbered 
> >> registers 
> >> AFTER restoring call-preserved registers?
> > 
> > Because the compiler backend (or the linker!  Or the dynamic linker!
> > Etc.) can use volatile registers for their own purposes.
> 
> For the following sequence at the end of a routine:
> 
> *...*
> “restore call-preserved registers”
> *clear call-clobbered registers"*
> *ret*
> 
> “Clear call-clobbered registers” will only clear the call-clobbered registers 
> that are not live at the end of the routine.

And they can be written again right after the routine, by linker-
generated code for example.  This is a waste.

> In the new version of the patch,  the implementation of clearing 
> call-clobbered registers is done in backend, middle end only 
> computes a hard register set based on user option, source attribute, data 
> flow information, and function abi information, and
> Then pass this hard register set to the target hook to generate the clearing 
> sequence.  The backend will have all the details
> on the special situations you mentioned. 
> 
> Let me know any more concerns here.

I cannot find that patch?


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 12:32 PM, Richard Sandiford  
> wrote:
> 
> Qing Zhao  writes:
>>> On Sep 11, 2020, at 11:14 AM, Segher Boessenkool 
>>>  wrote:
>>> 
>>> On Fri, Sep 11, 2020 at 11:06:03AM +0100, Richard Sandiford wrote:
 This might have already been discussed/answered, sorry, but:
 when there's a choice, is there an obvious winner between:
 
 (1) clearing call-clobbered registers and then restoring call-preserved 
 ones
 (2) restoring call-preserved registers and then clearing call-clobbered 
 ones
 
 Is one option more likely to be useful to attackers than the other?
>> 
>> for mitigating ROP purpose, I think that (2) is better than (1). i.e, the 
>> clearing
>> call-clobbered register sequence will be immediately before “ret” 
>> instruction. 
>> This will prevent the gadget from doing any useful things.
> 
> OK.  The reason I was asking was that (from the naive perspective of
> someone not well versed in this stuff): if the effect of one of the
> register restores is itself a useful gadget, the clearing wouldn't
> protect against it.  But if the register restores are not part of the
> intended effect, it seemed that having them immediately before the
> ret might make the gadget harder to use than clearing registers would,
> because the side-effects of restores would be harder to control than the
> (predictable) effect of clearing registers.
> 
> But like I say, this is very much not my area of expertise, so that's
> probably missing the point in a major way. ;-)

I am not an expert on the security area either. :-)

My understanding of how this scheme helps ROP is:  the attacker usually uses 
scratch register to pass
parameters to the sys call in the gadget, if clearing the scratch registers 
immediately before “ret”, then 
The parameters that are passed to sys call will be destroyed, therefore, the 
attack will likely failed.

So, clearing the scratch registers immediately before “ret” will be very 
helpful to mitigate ROP.

> 
> I think the original patch plugged into pass_thread_prologue_and_epilogue,
> is that right?

Yes.

>  If we go for (2), then I think it would be better to do
> it at the start of pass_late_compilation instead.  (Some targets wouldn't
> cope with doing it later.)  The reason for doing it so late is that the
> set of used “volatile”/caller-saved registers is not fixed at prologue
> and epilogue generation: later optimisation passes can introduce uses
> of volatile registers that weren't used previously.  (Sorry if this
> has already been suggested.)

Yes, I agree.

I thought that it might be better to move this task at the very late of the RTL 
stage, i.e, before “final” phase. 

Another solution is (discussed with Hongjiu):

1. Define a new target hook:

targetm.return_with_zeroing(bool simple_return_p, HARD_REG_SET 
need_zeroed_hardregs, bool gpr_only)

2. Add the following routine in middle end:

rtx_insn *
generate_return_rtx (bool simple_return_p)
{
  if (targetm.return_with_zeroing)
{
  Compute the hardregs set for clearing into “need_zeroed_hardregs”;
 return targetm.return_with_zeroing (simple_return_p, need_zeroed_hardregs, 
gpr_only);
   }
 else
{
 if (simple_return_p)
   return targetm.gen_simple_return ( );
else
   return targetm.gen_return ();
  }
}

Then replace all call to “targetm.gen_simple_return” and “targetm.gen_return” 
to “generate_return_rtx()”.

3. In the target, 
Implement “return_with_zeroing”.


Let me know your comments on this.

Thanks a lot.

Qing
> 
> Unlike Segher, I think this can/should be done in target-independent
> code as far as possible (like the patch seemed to do).
> 
> Thanks,
> Richard



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 3:05 PM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 11, 2020 at 02:40:06PM -0500, Qing Zhao wrote:
>>> On Sep 11, 2020, at 12:13 PM, Segher Boessenkool 
>>>  wrote:
>>> On Fri, Sep 11, 2020 at 11:52:29AM -0500, Qing Zhao wrote:
 I don’t understand why it’s not correct if we clearing call-clobbered 
 registers 
 AFTER restoring call-preserved registers?
>>> 
>>> Because the compiler backend (or the linker!  Or the dynamic linker!
>>> Etc.) can use volatile registers for their own purposes.
>> 
>> For the following sequence at the end of a routine:
>> 
>> *...*
>> “restore call-preserved registers”
>> *clear call-clobbered registers"*
>> *ret*
>> 
>> “Clear call-clobbered registers” will only clear the call-clobbered 
>> registers that are not live at the end of the routine.
> 
> And they can be written again right after the routine, by linker-
> generated code for example.  This is a waste.
> 
>> In the new version of the patch,  the implementation of clearing 
>> call-clobbered registers is done in backend, middle end only 
>> computes a hard register set based on user option, source attribute, data 
>> flow information, and function abi information, and
>> Then pass this hard register set to the target hook to generate the clearing 
>> sequence.  The backend will have all the details
>> on the special situations you mentioned. 
>> 
>> Let me know any more concerns here.
> 
> I cannot find that patch?

Haven’t finished yet. -:).

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 03:17:19PM -0500, Qing Zhao wrote:
> > On Sep 11, 2020, at 3:05 PM, Segher Boessenkool 
> >  wrote:
> > On Fri, Sep 11, 2020 at 02:40:06PM -0500, Qing Zhao wrote:
> >>> On Sep 11, 2020, at 12:13 PM, Segher Boessenkool 
> >>>  wrote:
> >>> On Fri, Sep 11, 2020 at 11:52:29AM -0500, Qing Zhao wrote:
>  I don’t understand why it’s not correct if we clearing call-clobbered 
>  registers 
>  AFTER restoring call-preserved registers?
> >>> 
> >>> Because the compiler backend (or the linker!  Or the dynamic linker!
> >>> Etc.) can use volatile registers for their own purposes.
> >> 
> >> For the following sequence at the end of a routine:
> >> 
> >> *...*
> >> “restore call-preserved registers”
> >> *clear call-clobbered registers"*
> >> *ret*
> >> 
> >> “Clear call-clobbered registers” will only clear the call-clobbered 
> >> registers that are not live at the end of the routine.
> > 
> > And they can be written again right after the routine, by linker-
> > generated code for example.  This is a waste.
> > 
> >> In the new version of the patch,  the implementation of clearing 
> >> call-clobbered registers is done in backend, middle end only 
> >> computes a hard register set based on user option, source attribute, data 
> >> flow information, and function abi information, and
> >> Then pass this hard register set to the target hook to generate the 
> >> clearing sequence.  The backend will have all the details
> >> on the special situations you mentioned. 
> >> 
> >> Let me know any more concerns here.
> > 
> > I cannot find that patch?
> 
> Haven’t finished yet. -:).

Ah okay :-)

If you have, please send it in a new thread (not as a reply)?  So that
it will be much easirer to handle :-)


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
Hi!

On Fri, Sep 11, 2020 at 03:14:57PM -0500, Qing Zhao wrote:
> My understanding of how this scheme helps ROP is:  the attacker usually uses 
> scratch register to pass

Help obstruct ROP ;-)

> parameters to the sys call in the gadget, if clearing the scratch registers 
> immediately before “ret”, then 
> The parameters that are passed to sys call will be destroyed, therefore, the 
> attack will likely failed.

But you do not need more than one non-zero argument for execv*, and that
is usually the same register as the normal return value register; all
other registers *should* be zero for a simple execv*("/bin/sh", ...)!

(There is also the system call number register, rax on x86-64, but if
overwriting that would be any effective, you could just do that one
always and everywhere.  This is only an effective defence if there are
no gadgets that do the system call an attacker wants, and he has to
construct that sequence himself; but it very effective and cheap then).


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 3:36 PM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 11, 2020 at 03:17:19PM -0500, Qing Zhao wrote:
>>> On Sep 11, 2020, at 3:05 PM, Segher Boessenkool 
>>>  wrote:
>>> On Fri, Sep 11, 2020 at 02:40:06PM -0500, Qing Zhao wrote:
> On Sep 11, 2020, at 12:13 PM, Segher Boessenkool 
>  wrote:
> On Fri, Sep 11, 2020 at 11:52:29AM -0500, Qing Zhao wrote:
>> I don’t understand why it’s not correct if we clearing call-clobbered 
>> registers 
>> AFTER restoring call-preserved registers?
> 
> Because the compiler backend (or the linker!  Or the dynamic linker!
> Etc.) can use volatile registers for their own purposes.
 
 For the following sequence at the end of a routine:
 
 *...*
 “restore call-preserved registers”
 *clear call-clobbered registers"*
 *ret*
 
 “Clear call-clobbered registers” will only clear the call-clobbered 
 registers that are not live at the end of the routine.
>>> 
>>> And they can be written again right after the routine, by linker-
>>> generated code for example.  This is a waste.
>>> 
 In the new version of the patch,  the implementation of clearing 
 call-clobbered registers is done in backend, middle end only 
 computes a hard register set based on user option, source attribute, data 
 flow information, and function abi information, and
 Then pass this hard register set to the target hook to generate the 
 clearing sequence.  The backend will have all the details
 on the special situations you mentioned. 
 
 Let me know any more concerns here.
>>> 
>>> I cannot find that patch?
>> 
>> Haven’t finished yet. -:).
> 
> Ah okay :-)
> 
> If you have, please send it in a new thread (not as a reply)?  So that
> it will be much easirer to handle :-)

Okay. Will do.

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 4:03 PM, Segher Boessenkool  
> wrote:
> 
> Hi!
> 
> On Fri, Sep 11, 2020 at 03:14:57PM -0500, Qing Zhao wrote:
>> My understanding of how this scheme helps ROP is:  the attacker usually uses 
>> scratch register to pass
> 
> Help obstruct ROP ;-)
Thanks for catching my mistake.
> 
>> parameters to the sys call in the gadget, if clearing the scratch registers 
>> immediately before “ret”, then 
>> The parameters that are passed to sys call will be destroyed, therefore, the 
>> attack will likely failed.
> 
> But you do not need more than one non-zero argument for execv*, and that
> is usually the same register as the normal return value register; all
> other registers *should* be zero for a simple execv*("/bin/sh", ...)!
> 
> (There is also the system call number register, rax on x86-64, but if
> overwriting that would be any effective, you could just do that one
> always and everywhere.  This is only an effective defence if there are
> no gadgets that do the system call an attacker wants, and he has to
> construct that sequence himself; but it very effective and cheap then).

In the above, do you mean only clearing “rax” on x86-64 should be effective 
enough? 

Qing
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Richard Sandiford
Qing Zhao  writes:
>> On Sep 11, 2020, at 12:32 PM, Richard Sandiford  
>> >>  If we go for (2), then I think it would be better to do
>> it at the start of pass_late_compilation instead.  (Some targets wouldn't
>> cope with doing it later.)  The reason for doing it so late is that the
>> set of used “volatile”/caller-saved registers is not fixed at prologue
>> and epilogue generation: later optimisation passes can introduce uses
>> of volatile registers that weren't used previously.  (Sorry if this
>> has already been suggested.)
>
> Yes, I agree.
>
> I thought that it might be better to move this task at the very late of the 
> RTL stage, i.e, before “final” phase. 
>
> Another solution is (discussed with Hongjiu):
>
> 1. Define a new target hook:
>
> targetm.return_with_zeroing(bool simple_return_p, HARD_REG_SET 
> need_zeroed_hardregs, bool gpr_only)
>
> 2. Add the following routine in middle end:
>
> rtx_insn *
> generate_return_rtx (bool simple_return_p)
> {
>   if (targetm.return_with_zeroing)
> {
>   Compute the hardregs set for clearing into “need_zeroed_hardregs”;
>  return targetm.return_with_zeroing (simple_return_p, 
> need_zeroed_hardregs, gpr_only);
>}
>  else
> {
>  if (simple_return_p)
>return targetm.gen_simple_return ( );
> else
>return targetm.gen_return ();
>   }
> }
>
> Then replace all call to “targetm.gen_simple_return” and “targetm.gen_return” 
> to “generate_return_rtx()”.
>
> 3. In the target, 
> Implement “return_with_zeroing”.
>
>
> Let me know your comments on this.

I think having a separate pass is better.  We don't normally know
at the point of generating the return which registers will need
to be cleared.  So IMO the pass should just search for all the
returns in a function and insert the zeroing instructions before
each one.

Having a target hook sounds good, but I think it should have a
default definition that just uses the move patterns to zero each
selected register.  I expect the default will be good enough for
most targets.

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 04:29:16PM -0500, Qing Zhao wrote:
> > On Sep 11, 2020, at 4:03 PM, Segher Boessenkool 
> >  wrote:
> >> The parameters that are passed to sys call will be destroyed, therefore, 
> >> the attack will likely failed.
> > 
> > But you do not need more than one non-zero argument for execv*, and that
> > is usually the same register as the normal return value register; all
> > other registers *should* be zero for a simple execv*("/bin/sh", ...)!
> > 
> > (There is also the system call number register, rax on x86-64, but if
> > overwriting that would be any effective, you could just do that one
> > always and everywhere.  This is only an effective defence if there are
> > no gadgets that do the system call an attacker wants, and he has to
> > construct that sequence himself; but it very effective and cheap then).
> 
> In the above, do you mean only clearing “rax” on x86-64 should be effective 
> enough? 

(rax=0 is "read", you might want to do another value, but that's just
details.)

"This is only an effective defence if there are
no gadgets that do the system call an attacker wants, and he has to
construct that sequence himself; but it very effective and cheap then)."

It is definitely *not* effective if there are gadgets that set rax to
a value the attacker wants and then do a syscall.  It of course is quite
effective in breaking a ROP chain of (set rax) -> (syscall).  How
effective it is in practice, I have no idea.

My point was that your proposed scheme does not protect the other
syscall parameters very much either.

And, hrm, rax is actually the first return value.  On most ABIs the
same registers are used for arguments and for return values, I got
confused.  Sorry.  So this cannot be very effective for x86-64 no
matter what.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 4:44 PM, Richard Sandiford  
> wrote:
> 
> Qing Zhao  writes:
>>> On Sep 11, 2020, at 12:32 PM, Richard Sandiford  
>>> >>  If we go for (2), then I think it would be better to do
>>> it at the start of pass_late_compilation instead.  (Some targets wouldn't
>>> cope with doing it later.)  The reason for doing it so late is that the
>>> set of used “volatile”/caller-saved registers is not fixed at prologue
>>> and epilogue generation: later optimisation passes can introduce uses
>>> of volatile registers that weren't used previously.  (Sorry if this
>>> has already been suggested.)
>> 
>> Yes, I agree.
>> 
>> I thought that it might be better to move this task at the very late of the 
>> RTL stage, i.e, before “final” phase. 
>> 
>> Another solution is (discussed with Hongjiu):
>> 
>> 1. Define a new target hook:
>> 
>> targetm.return_with_zeroing(bool simple_return_p, HARD_REG_SET 
>> need_zeroed_hardregs, bool gpr_only)
>> 
>> 2. Add the following routine in middle end:
>> 
>> rtx_insn *
>> generate_return_rtx (bool simple_return_p)
>> {
>>  if (targetm.return_with_zeroing)
>>{
>>  Compute the hardregs set for clearing into “need_zeroed_hardregs”;
>> return targetm.return_with_zeroing (simple_return_p, 
>> need_zeroed_hardregs, gpr_only);
>>   }
>> else
>>{
>> if (simple_return_p)
>>   return targetm.gen_simple_return ( );
>>else
>>   return targetm.gen_return ();
>>  }
>> }
>> 
>> Then replace all call to “targetm.gen_simple_return” and 
>> “targetm.gen_return” to “generate_return_rtx()”.
>> 
>> 3. In the target, 
>> Implement “return_with_zeroing”.
>> 
>> 
>> Let me know your comments on this.
> 
> I think having a separate pass is better.  We don't normally know
> at the point of generating the return which registers will need
> to be cleared.  

At the point of generating the return, we can compute the 
“need_zeroed_hardregs” HARD_REG_SET 
by using data flow information, the function abi of the routine, and also the 
user option and source code 
attribute information together. These information should be available at each 
point when generating the return.


> So IMO the pass should just search for all the
> returns in a function and insert the zeroing instructions before
> each one.

I was considering this approach too for some time, however, there is one major 
issue with this as 
Segher mentioned, The middle end does not know some details on the registers, 
lacking such 
detailed information might result incorrect code generation at middle end.

For example, on x86_64 target, when “return” with pop, the scratch register 
“ECX” will be 
used for returning, then it’s incorrect to zero “ecx” before generating the 
return. Since middle end
doesn't have such information, it cannot avoid to zero “ecx”. Therefore 
incorrect code might be 
generated. 

Segher also mentioned that on Power, there are some scratch registers also are 
used for 
Other purpose, clearing them before return is not correct. 


> 
> Having a target hook sounds good, but I think it should have a
> default definition that just uses the move patterns to zero each
> selected register.  I expect the default will be good enough for
> most targets.

Based on the above, I think that generating the zeroing instructions at middle 
end is not correct. 

Thanks.

Qing
> 
> Thanks,
> Richard



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Qing Zhao via Gcc-patches



> On Sep 11, 2020, at 4:51 PM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 11, 2020 at 04:29:16PM -0500, Qing Zhao wrote:
>>> On Sep 11, 2020, at 4:03 PM, Segher Boessenkool 
>>>  wrote:
 The parameters that are passed to sys call will be destroyed, therefore, 
 the attack will likely failed.
>>> 
>>> But you do not need more than one non-zero argument for execv*, and that
>>> is usually the same register as the normal return value register; all
>>> other registers *should* be zero for a simple execv*("/bin/sh", ...)!
>>> 
>>> (There is also the system call number register, rax on x86-64, but if
>>> overwriting that would be any effective, you could just do that one
>>> always and everywhere.  This is only an effective defence if there are
>>> no gadgets that do the system call an attacker wants, and he has to
>>> construct that sequence himself; but it very effective and cheap then).
>> 
>> In the above, do you mean only clearing “rax” on x86-64 should be effective 
>> enough? 
> 
> (rax=0 is "read", you might want to do another value, but that's just
> details.)
> 
> "This is only an effective defence if there are
> no gadgets that do the system call an attacker wants, and he has to
> construct that sequence himself; but it very effective and cheap then)."
> 
> It is definitely *not* effective if there are gadgets that set rax to
> a value the attacker wants and then do a syscall.

You mean the following gadget:


Gadget 1:

mov  rax,  value
syscall
ret

Qing

> It of course is quite
> effective in breaking a ROP chain of (set rax) -> (syscall).  How
> effective it is in practice, I have no idea.
> 
> My point was that your proposed scheme does not protect the other
> syscall parameters very much either.
> 
> And, hrm, rax is actually the first return value.  On most ABIs the
> same registers are used for arguments and for return values, I got
> confused.  Sorry.  So this cannot be very effective for x86-64 no
> matter what.
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-11 Thread Richard Sandiford
Qing Zhao  writes:
>> On Sep 11, 2020, at 4:44 PM, Richard Sandiford  
>> wrote:
>> 
>> Qing Zhao  writes:
 On Sep 11, 2020, at 12:32 PM, Richard Sandiford 
  >>  If we go for (2), then I think it would be 
 better to do
 it at the start of pass_late_compilation instead.  (Some targets wouldn't
 cope with doing it later.)  The reason for doing it so late is that the
 set of used “volatile”/caller-saved registers is not fixed at prologue
 and epilogue generation: later optimisation passes can introduce uses
 of volatile registers that weren't used previously.  (Sorry if this
 has already been suggested.)
>>> 
>>> Yes, I agree.
>>> 
>>> I thought that it might be better to move this task at the very late of the 
>>> RTL stage, i.e, before “final” phase. 
>>> 
>>> Another solution is (discussed with Hongjiu):
>>> 
>>> 1. Define a new target hook:
>>> 
>>> targetm.return_with_zeroing(bool simple_return_p, HARD_REG_SET 
>>> need_zeroed_hardregs, bool gpr_only)
>>> 
>>> 2. Add the following routine in middle end:
>>> 
>>> rtx_insn *
>>> generate_return_rtx (bool simple_return_p)
>>> {
>>>  if (targetm.return_with_zeroing)
>>>{
>>>  Compute the hardregs set for clearing into “need_zeroed_hardregs”;
>>> return targetm.return_with_zeroing (simple_return_p, 
>>> need_zeroed_hardregs, gpr_only);
>>>   }
>>> else
>>>{
>>> if (simple_return_p)
>>>   return targetm.gen_simple_return ( );
>>>else
>>>   return targetm.gen_return ();
>>>  }
>>> }
>>> 
>>> Then replace all call to “targetm.gen_simple_return” and 
>>> “targetm.gen_return” to “generate_return_rtx()”.
>>> 
>>> 3. In the target, 
>>> Implement “return_with_zeroing”.
>>> 
>>> 
>>> Let me know your comments on this.
>> 
>> I think having a separate pass is better.  We don't normally know
>> at the point of generating the return which registers will need
>> to be cleared.  
>
> At the point of generating the return, we can compute the 
> “need_zeroed_hardregs” HARD_REG_SET 
> by using data flow information, the function abi of the routine, and also the 
> user option and source code 
> attribute information together. These information should be available at each 
> point when generating the return.

Like I mentioned earlier though, passes that run after
pass_thread_prologue_and_epilogue can use call-clobbered registers that
weren't previously used.  For example, on x86_64, the function might
not use %r8 when the prologue, epilogue and returns are generated,
but pass_regrename might later introduce a new use of %r8.  AIUI,
the “used” version of the new command-line option is supposed to clear
%r8 in these circumstances, but it wouldn't do so if the data was
collected at the point that the return is generated.

That's why I think it's more robust to do this later (at the beginning
of pass_late_compilation) and insert the zeroing before returns that
already exist.

>> So IMO the pass should just search for all the
>> returns in a function and insert the zeroing instructions before
>> each one.
>
> I was considering this approach too for some time, however, there is one 
> major issue with this as 
> Segher mentioned, The middle end does not know some details on the registers, 
> lacking such 
> detailed information might result incorrect code generation at middle end.
>
> For example, on x86_64 target, when “return” with pop, the scratch register 
> “ECX” will be 
> used for returning, then it’s incorrect to zero “ecx” before generating the 
> return. Since middle end
> doesn't have such information, it cannot avoid to zero “ecx”. Therefore 
> incorrect code might be 
> generated. 
>
> Segher also mentioned that on Power, there are some scratch registers also 
> are used for 
> Other purpose, clearing them before return is not correct. 

But the dataflow information has to be correct between
pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
any pass in that region could clobber the registers in the same way.

To get the registers that are live before the return, you can start with
the registers that are live out from the block that contains the return,
then “simulate” the return instruction backwards to get the set of
registers that are live before the return instruction
(see df_simulate_one_insn_backwards).

In the x86_64 case you mention, the pattern is:

(define_insn "*simple_return_indirect_internal"
  [(simple_return)
   (use (match_operand:W 0 "register_operand" "r"))]
  "reload_completed"
  …)

This (use …) tells the df machinery that the instruction needs
operand 0 (= ecx).  The process above would therefore realise
that ecx can't be clobbered.

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Qing Zhao via Gcc-patches
Hi, Richard,

> On Sep 11, 2020, at 5:56 PM, Richard Sandiford  
> wrote:
> 
> Qing Zhao mailto:qing.z...@oracle.com>> writes:
>>> On Sep 11, 2020, at 4:44 PM, Richard Sandiford  
>>> wrote:
>>> 
>>> Qing Zhao  writes:
> On Sep 11, 2020, at 12:32 PM, Richard Sandiford 
>  >>  If we go for (2), then I think it would 
> be better to do
> it at the start of pass_late_compilation instead.  (Some targets wouldn't
> cope with doing it later.)  The reason for doing it so late is that the
> set of used “volatile”/caller-saved registers is not fixed at prologue
> and epilogue generation: later optimisation passes can introduce uses
> of volatile registers that weren't used previously.  (Sorry if this
> has already been suggested.)
 
 Yes, I agree.
 
 I thought that it might be better to move this task at the very late of 
 the RTL stage, i.e, before “final” phase. 
 
 Another solution is (discussed with Hongjiu):
 
 1. Define a new target hook:
 
 targetm.return_with_zeroing(bool simple_return_p, HARD_REG_SET 
 need_zeroed_hardregs, bool gpr_only)
 
 2. Add the following routine in middle end:
 
 rtx_insn *
 generate_return_rtx (bool simple_return_p)
 {
 if (targetm.return_with_zeroing)
   {
 Compute the hardregs set for clearing into “need_zeroed_hardregs”;
return targetm.return_with_zeroing (simple_return_p, 
 need_zeroed_hardregs, gpr_only);
  }
 else
   {
if (simple_return_p)
  return targetm.gen_simple_return ( );
   else
  return targetm.gen_return ();
 }
 }
 
 Then replace all call to “targetm.gen_simple_return” and 
 “targetm.gen_return” to “generate_return_rtx()”.
 
 3. In the target, 
 Implement “return_with_zeroing”.
 
 
 Let me know your comments on this.
>>> 
>>> I think having a separate pass is better.  We don't normally know
>>> at the point of generating the return which registers will need
>>> to be cleared.  
>> 
>> At the point of generating the return, we can compute the 
>> “need_zeroed_hardregs” HARD_REG_SET 
>> by using data flow information, the function abi of the routine, and also 
>> the user option and source code 
>> attribute information together. These information should be available at 
>> each point when generating the return.
> 
> Like I mentioned earlier though, passes that run after
> pass_thread_prologue_and_epilogue can use call-clobbered registers that
> weren't previously used.  For example, on x86_64, the function might
> not use %r8 when the prologue, epilogue and returns are generated,
> but pass_regrename might later introduce a new use of %r8.  AIUI,
> the “used” version of the new command-line option is supposed to clear
> %r8 in these circumstances, but it wouldn't do so if the data was
> collected at the point that the return is generated.

Thanks for the information.

> 
> That's why I think it's more robust to do this later (at the beginning
> of pass_late_compilation) and insert the zeroing before returns that
> already exist.

Yes, looks like it’s not correct to insert the zeroing at the time when 
prologue, epilogue and return are generated.
As I also checked, “return” might be also generated as late as pass 
“pass_delay_slots”,  So, shall we move the
New pass as late as possible?

Can I put it immediately before “pass_final”? What’s the latest place I can put 
it?


> 
>>> So IMO the pass should just search for all the
>>> returns in a function and insert the zeroing instructions before
>>> each one.
>> 
>> I was considering this approach too for some time, however, there is one 
>> major issue with this as 
>> Segher mentioned, The middle end does not know some details on the 
>> registers, lacking such 
>> detailed information might result incorrect code generation at middle end.
>> 
>> For example, on x86_64 target, when “return” with pop, the scratch register 
>> “ECX” will be 
>> used for returning, then it’s incorrect to zero “ecx” before generating the 
>> return. Since middle end
>> doesn't have such information, it cannot avoid to zero “ecx”. Therefore 
>> incorrect code might be 
>> generated. 
>> 
>> Segher also mentioned that on Power, there are some scratch registers also 
>> are used for 
>> Other purpose, clearing them before return is not correct. 
> 
> But the dataflow information has to be correct between
> pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
> any pass in that region could clobber the registers in the same way.

You mean, the data flow information will be not correct after pass_free_cfg? 
 “pass_delay_slots” is after “pass_free_cfg”,  and there might be new “return” 
generated in “pass_delay_slots”, 
If we want to generate zeroing for the new “return” which was generated in 
“pass_delay_slots”, can we correctly to do so?

> 
> To get the registers that are live before the return, you can start with
>

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Richard Sandiford
Qing Zhao  writes:
>> Like I mentioned earlier though, passes that run after
>> pass_thread_prologue_and_epilogue can use call-clobbered registers that
>> weren't previously used.  For example, on x86_64, the function might
>> not use %r8 when the prologue, epilogue and returns are generated,
>> but pass_regrename might later introduce a new use of %r8.  AIUI,
>> the “used” version of the new command-line option is supposed to clear
>> %r8 in these circumstances, but it wouldn't do so if the data was
>> collected at the point that the return is generated.
>
> Thanks for the information.
>
>> 
>> That's why I think it's more robust to do this later (at the beginning
>> of pass_late_compilation) and insert the zeroing before returns that
>> already exist.
>
> Yes, looks like it’s not correct to insert the zeroing at the time when 
> prologue, epilogue and return are generated.
> As I also checked, “return” might be also generated as late as pass 
> “pass_delay_slots”,  So, shall we move the
> New pass as late as possible?

If we insert the zeroing before pass_delay_slots and describe the
result correctly, pass_delay_slots should do the right thing.

Describing the result correctly includes ensuring that the cleared
registers are treated as live on exit from the function, so that the
zeroing doesn't get deleted again, or skipped by pass_delay_slots.

> Can I put it immediately before “pass_final”? What’s the latest place
> I can put it?

Like you say here…

 So IMO the pass should just search for all the
 returns in a function and insert the zeroing instructions before
 each one.
>>> 
>>> I was considering this approach too for some time, however, there is one 
>>> major issue with this as 
>>> Segher mentioned, The middle end does not know some details on the 
>>> registers, lacking such 
>>> detailed information might result incorrect code generation at middle end.
>>> 
>>> For example, on x86_64 target, when “return” with pop, the scratch register 
>>> “ECX” will be 
>>> used for returning, then it’s incorrect to zero “ecx” before generating the 
>>> return. Since middle end
>>> doesn't have such information, it cannot avoid to zero “ecx”. Therefore 
>>> incorrect code might be 
>>> generated. 
>>> 
>>> Segher also mentioned that on Power, there are some scratch registers also 
>>> are used for 
>>> Other purpose, clearing them before return is not correct. 
>> 
>> But the dataflow information has to be correct between
>> pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
>> any pass in that region could clobber the registers in the same way.
>
> You mean, the data flow information will be not correct after pass_free_cfg? 
>  “pass_delay_slots” is after “pass_free_cfg”,  and there might be new 
> “return” generated in “pass_delay_slots”, 
> If we want to generate zeroing for the new “return” which was generated in 
> “pass_delay_slots”, can we correctly to do so?

…the zeroing has to be done before pass_free_cfg, because the information
isn't reliable after that point.  I think it would make sense to do it
before pass_compute_alignments, because inserting the zeros will affect
alignment.

>> To get the registers that are live before the return, you can start with
>> the registers that are live out from the block that contains the return,
>> then “simulate” the return instruction backwards to get the set of
>> registers that are live before the return instruction
>> (see df_simulate_one_insn_backwards).
>
> Okay. 
> Currently, I am using the following to check whether a reg is live out the 
> block that contains the return:
>
> /* Check whether the hard register REGNO is live at the exit block
>  * of the current routine.  */
> static bool
> is_live_reg_at_exit (unsigned int regno)
> {
>   edge e;
>   edge_iterator ei;
>
>   FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> {
>   bitmap live_out = df_get_live_out (e->src);
>   if (REGNO_REG_SET_P (live_out, regno))
> return true;
> }
>
>   return false;
> }
>
> Is this correct?

df_get_live_out is the right way to get the set of live registers
on exit from a block.  But if we search for return instructions
and find a return instruction R, we should do:

  basic_block bb = BLOCK_FOR_INSN (R);
  auto_bitmap live_regs;
  bitmap_copy (regs, df_get_live_out (bb));
  df_simulate_one_insn_backwards (bb, R, live_regs);

and then use LIVE_REGS as the set of registers that are live before R,
and so can't be clobbered.

For extra safety, you could/should also check targetm.hard_regno_scratch_ok
to see whether there's a target-specific reason why a register can't
be clobbered.

>> In the x86_64 case you mention, the pattern is:
>> 
>> (define_insn "*simple_return_indirect_internal"
>>  [(simple_return)
>>   (use (match_operand:W 0 "register_operand" "r"))]
>>  "reload_completed"
>>  …)
>> 
>> This (use …) tells the df machinery that the instruction needs
>> operand 0 (= ecx).  The process above would therefore 

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Qing Zhao via Gcc-patches



> On Sep 14, 2020, at 11:33 AM, Richard Sandiford  
> wrote:
> 
> Qing Zhao  writes:
>>> Like I mentioned earlier though, passes that run after
>>> pass_thread_prologue_and_epilogue can use call-clobbered registers that
>>> weren't previously used.  For example, on x86_64, the function might
>>> not use %r8 when the prologue, epilogue and returns are generated,
>>> but pass_regrename might later introduce a new use of %r8.  AIUI,
>>> the “used” version of the new command-line option is supposed to clear
>>> %r8 in these circumstances, but it wouldn't do so if the data was
>>> collected at the point that the return is generated.
>> 
>> Thanks for the information.
>> 
>>> 
>>> That's why I think it's more robust to do this later (at the beginning
>>> of pass_late_compilation) and insert the zeroing before returns that
>>> already exist.
>> 
>> Yes, looks like it’s not correct to insert the zeroing at the time when 
>> prologue, epilogue and return are generated.
>> As I also checked, “return” might be also generated as late as pass 
>> “pass_delay_slots”,  So, shall we move the
>> New pass as late as possible?
> 
> If we insert the zeroing before pass_delay_slots and describe the
> result correctly, pass_delay_slots should do the right thing.
> 
> Describing the result correctly includes ensuring that the cleared
> registers are treated as live on exit from the function, so that the
> zeroing doesn't get deleted again, or skipped by pass_delay_slots.

In the current implementation for x86, when we generating a zeroing insn as the 
following:

(insn 18 16 19 2 (set (reg:SI 1 dx)
(const_int 0 [0])) "t10.c":11:1 -1
 (nil))
(insn 19 18 20 2 (unspec_volatile [
(reg:SI 1 dx)
] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
 (nil))

i.e, after each zeroing insn, the register that is zeroed is marked as 
“UNSPECV_PRO_EPILOGUE_USE”, 
By doing this, we can avoid this zeroing insn from being deleted or skipped. 

Is doing this enough to describe the result correctly?
Is there other thing we need to do in addition to this?

> 
>> Can I put it immediately before “pass_final”? What’s the latest place
>> I can put it?
> 
> Like you say here…
> 
> So IMO the pass should just search for all the
> returns in a function and insert the zeroing instructions before
> each one.
 
 I was considering this approach too for some time, however, there is one 
 major issue with this as 
 Segher mentioned, The middle end does not know some details on the 
 registers, lacking such 
 detailed information might result incorrect code generation at middle end.
 
 For example, on x86_64 target, when “return” with pop, the scratch 
 register “ECX” will be 
 used for returning, then it’s incorrect to zero “ecx” before generating 
 the return. Since middle end
 doesn't have such information, it cannot avoid to zero “ecx”. Therefore 
 incorrect code might be 
 generated. 
 
 Segher also mentioned that on Power, there are some scratch registers also 
 are used for 
 Other purpose, clearing them before return is not correct. 
>>> 
>>> But the dataflow information has to be correct between
>>> pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
>>> any pass in that region could clobber the registers in the same way.
>> 
>> You mean, the data flow information will be not correct after pass_free_cfg? 
>> “pass_delay_slots” is after “pass_free_cfg”,  and there might be new 
>> “return” generated in “pass_delay_slots”, 
>> If we want to generate zeroing for the new “return” which was generated in 
>> “pass_delay_slots”, can we correctly to do so?
> 
> …the zeroing has to be done before pass_free_cfg, because the information
> isn't reliable after that point.  I think it would make sense to do it
> before pass_compute_alignments, because inserting the zeros will affect
> alignment.

Okay. 

Then there is another problem:  what about the new “return”s that are generated 
at pass_delay_slots?

Should we generate the zeroing for these new returns? Since the data flow 
information might not be correct at this pass,
It looks like that there is no correct way to add the zeroing insn for these 
new “return”, then, what should we do about this?

> 
>>> To get the registers that are live before the return, you can start with
>>> the registers that are live out from the block that contains the return,
>>> then “simulate” the return instruction backwards to get the set of
>>> registers that are live before the return instruction
>>> (see df_simulate_one_insn_backwards).
>> 
>> Okay. 
>> Currently, I am using the following to check whether a reg is live out the 
>> block that contains the return:
>> 
>> /* Check whether the hard register REGNO is live at the exit block
>> * of the current routine.  */
>> static bool
>> is_live_reg_at_exit (unsigned int regno)
>> {
>>  edge e;
>>  edge_iterator ei;
>> 
>>  FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Richard Sandiford
Qing Zhao  writes:
>> On Sep 14, 2020, at 11:33 AM, Richard Sandiford  
>> wrote:
>> 
>> Qing Zhao  writes:
 Like I mentioned earlier though, passes that run after
 pass_thread_prologue_and_epilogue can use call-clobbered registers that
 weren't previously used.  For example, on x86_64, the function might
 not use %r8 when the prologue, epilogue and returns are generated,
 but pass_regrename might later introduce a new use of %r8.  AIUI,
 the “used” version of the new command-line option is supposed to clear
 %r8 in these circumstances, but it wouldn't do so if the data was
 collected at the point that the return is generated.
>>> 
>>> Thanks for the information.
>>> 
 
 That's why I think it's more robust to do this later (at the beginning
 of pass_late_compilation) and insert the zeroing before returns that
 already exist.
>>> 
>>> Yes, looks like it’s not correct to insert the zeroing at the time when 
>>> prologue, epilogue and return are generated.
>>> As I also checked, “return” might be also generated as late as pass 
>>> “pass_delay_slots”,  So, shall we move the
>>> New pass as late as possible?
>> 
>> If we insert the zeroing before pass_delay_slots and describe the
>> result correctly, pass_delay_slots should do the right thing.
>> 
>> Describing the result correctly includes ensuring that the cleared
>> registers are treated as live on exit from the function, so that the
>> zeroing doesn't get deleted again, or skipped by pass_delay_slots.
>
> In the current implementation for x86, when we generating a zeroing insn as 
> the following:
>
> (insn 18 16 19 2 (set (reg:SI 1 dx)
> (const_int 0 [0])) "t10.c":11:1 -1
>  (nil))
> (insn 19 18 20 2 (unspec_volatile [
> (reg:SI 1 dx)
> ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>  (nil))
>
> i.e, after each zeroing insn, the register that is zeroed is marked as 
> “UNSPECV_PRO_EPILOGUE_USE”, 
> By doing this, we can avoid this zeroing insn from being deleted or skipped. 
>
> Is doing this enough to describe the result correctly?
> Is there other thing we need to do in addition to this?

I guess that works, but I think it would be better to abstract
EPILOGUE_USES into a new target-independent wrapper function that
(a) returns true if EPILOGUE_USES itself returns true and (b) returns
true for registers that need to be zero on return, if the zeroing
instructions have already been inserted.  The places that currently
test EPILOGUE_USES should then test this new wrapper function instead.

After inserting the zeroing instructions, the pass should recompute the
live-out sets based on this.

 But the dataflow information has to be correct between
 pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
 any pass in that region could clobber the registers in the same way.
>>> 
>>> You mean, the data flow information will be not correct after 
>>> pass_free_cfg? 
>>> “pass_delay_slots” is after “pass_free_cfg”,  and there might be new 
>>> “return” generated in “pass_delay_slots”, 
>>> If we want to generate zeroing for the new “return” which was generated in 
>>> “pass_delay_slots”, can we correctly to do so?
>> 
>> …the zeroing has to be done before pass_free_cfg, because the information
>> isn't reliable after that point.  I think it would make sense to do it
>> before pass_compute_alignments, because inserting the zeros will affect
>> alignment.
>
> Okay. 
>
> Then there is another problem:  what about the new “return”s that are 
> generated at pass_delay_slots?
>
> Should we generate the zeroing for these new returns? Since the data flow 
> information might not be correct at this pass,
> It looks like that there is no correct way to add the zeroing insn for these 
> new “return”, then, what should we do about this?

pass_delay_slots isn't a problem.  It doesn't change *what* happens
on each return path, it just changes how the instructions to achieve
it are arranged.

So e.g. if every path through the function clears register R before
pass_delay_slots, and if that clearing is represented as being necessary,
then every path through the function will clear register R after the pass
as well.

>> For extra safety, you could/should also check targetm.hard_regno_scratch_ok
>> to see whether there's a target-specific reason why a register can't
>> be clobbered.
>
> /* Return true if is OK to use a hard register REGNO as scratch register
>in peephole2.  */
> DEFHOOK
> (hard_regno_scratch_ok,
>
>
> Is this checking only valid for pass_peephole2?

No, that comment looks out of date.  The hook is already used in
postreload, for example.

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Qing Zhao via Gcc-patches



> On Sep 14, 2020, at 2:20 PM, Richard Sandiford  
> wrote:
> 
> Qing Zhao mailto:qing.z...@oracle.com>> writes:
>>> On Sep 14, 2020, at 11:33 AM, Richard Sandiford  
>>> wrote:
>>> 
>>> Qing Zhao  writes:
> Like I mentioned earlier though, passes that run after
> pass_thread_prologue_and_epilogue can use call-clobbered registers that
> weren't previously used.  For example, on x86_64, the function might
> not use %r8 when the prologue, epilogue and returns are generated,
> but pass_regrename might later introduce a new use of %r8.  AIUI,
> the “used” version of the new command-line option is supposed to clear
> %r8 in these circumstances, but it wouldn't do so if the data was
> collected at the point that the return is generated.
 
 Thanks for the information.
 
> 
> That's why I think it's more robust to do this later (at the beginning
> of pass_late_compilation) and insert the zeroing before returns that
> already exist.
 
 Yes, looks like it’s not correct to insert the zeroing at the time when 
 prologue, epilogue and return are generated.
 As I also checked, “return” might be also generated as late as pass 
 “pass_delay_slots”,  So, shall we move the
 New pass as late as possible?
>>> 
>>> If we insert the zeroing before pass_delay_slots and describe the
>>> result correctly, pass_delay_slots should do the right thing.
>>> 
>>> Describing the result correctly includes ensuring that the cleared
>>> registers are treated as live on exit from the function, so that the
>>> zeroing doesn't get deleted again, or skipped by pass_delay_slots.
>> 
>> In the current implementation for x86, when we generating a zeroing insn as 
>> the following:
>> 
>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>(const_int 0 [0])) "t10.c":11:1 -1
>> (nil))
>> (insn 19 18 20 2 (unspec_volatile [
>>(reg:SI 1 dx)
>>] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>> (nil))
>> 
>> i.e, after each zeroing insn, the register that is zeroed is marked as 
>> “UNSPECV_PRO_EPILOGUE_USE”, 
>> By doing this, we can avoid this zeroing insn from being deleted or skipped. 
>> 
>> Is doing this enough to describe the result correctly?
>> Is there other thing we need to do in addition to this?
> 
> I guess that works, but I think it would be better to abstract
> EPILOGUE_USES into a new target-independent wrapper function that
> (a) returns true if EPILOGUE_USES itself returns true and (b) returns
> true for registers that need to be zero on return, if the zeroing
> instructions have already been inserted.  The places that currently
> test EPILOGUE_USES should then test this new wrapper function instead.

Okay, I see. 
Looks like that EPILOGUE_USES is used in df-scan.c to compute the data flow 
information. If EPILOUGE_USES return true
for registers that need to be zeroed on return, those registers will be 
included in the data flow information, as a result, later
passes will not be able to delete them. 

This sounds to be a cleaner approach than the current one that marks the 
registers  “UNSPECV_PRO_EPILOGUE_USE”. 

A more detailed implementation question on this: 
Where should I put this new target-independent wrapper function in? Which 
header file will be a proper place to hold this new function?

> 
> After inserting the zeroing instructions, the pass should recompute the
> live-out sets based on this.

Is only computing the live-out sets of the block that including the return insn 
enough? Or we should re-compute the whole procedure? 

Which utility routine I should use to recompute the live-out sets?

> 
> But the dataflow information has to be correct between
> pass_thread_prologue_and_epilogue and pass_free_cfg, otherwise
> any pass in that region could clobber the registers in the same way.
 
 You mean, the data flow information will be not correct after 
 pass_free_cfg? 
 “pass_delay_slots” is after “pass_free_cfg”,  and there might be new 
 “return” generated in “pass_delay_slots”, 
 If we want to generate zeroing for the new “return” which was generated in 
 “pass_delay_slots”, can we correctly to do so?
>>> 
>>> …the zeroing has to be done before pass_free_cfg, because the information
>>> isn't reliable after that point.  I think it would make sense to do it
>>> before pass_compute_alignments, because inserting the zeros will affect
>>> alignment.
>> 
>> Okay. 
>> 
>> Then there is another problem:  what about the new “return”s that are 
>> generated at pass_delay_slots?
>> 
>> Should we generate the zeroing for these new returns? Since the data flow 
>> information might not be correct at this pass,
>> It looks like that there is no correct way to add the zeroing insn for these 
>> new “return”, then, what should we do about this?
> 
> pass_delay_slots isn't a problem.  It doesn't change *what* happens
> on each return path, it just changes how the instructions to achieve
> it are

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Segher Boessenkool
On Fri, Sep 11, 2020 at 05:41:47PM -0500, Qing Zhao wrote:
> > On Sep 11, 2020, at 4:51 PM, Segher Boessenkool 
> >  wrote:
> > It is definitely *not* effective if there are gadgets that set rax to
> > a value the attacker wants and then do a syscall.
> 
> You mean the following gadget:
> 
> 
> Gadget 1:
> 
> mov  rax,  value
> syscall
> ret

No, just

mov rax,59
syscall

(no ret necessary!)

I.e. just anything that already does an execve.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Segher Boessenkool
Hi!

On Fri, Sep 11, 2020 at 05:24:58PM -0500, Qing Zhao wrote:
> > So IMO the pass should just search for all the
> > returns in a function and insert the zeroing instructions before
> > each one.
> 
> I was considering this approach too for some time, however, there is one 
> major issue with this as 
> Segher mentioned, The middle end does not know some details on the registers, 
> lacking such 
> detailed information might result incorrect code generation at middle end.
> 
> For example, on x86_64 target, when “return” with pop, the scratch register 
> “ECX” will be 
> used for returning, then it’s incorrect to zero “ecx” before generating the 
> return. Since middle end
> doesn't have such information, it cannot avoid to zero “ecx”. Therefore 
> incorrect code might be 
> generated. 
> 
> Segher also mentioned that on Power, there are some scratch registers also 
> are used for 
> Other purpose, clearing them before return is not correct. 

Depending where you insert those insns, it can be non-harmful, but in
most places it will not be useful.


What you can do (easy and safe) is change the RTL return instructions to
clear all necessary registers (by outputting extra assembler
instructions).  I still have big doubts how effective that will be, and
esp. compared with how expensive that is, but at least its effect on the
compiler is very local, and it does not get in the way of most things.

(This also works with shrink-wrapping and similar.)

(The effectiveness of this whole scheme depends a *lot* on specifics of
the ABI, btw; in that way it is not generic at all!)


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Segher Boessenkool
On Mon, Sep 14, 2020 at 05:33:33PM +0100, Richard Sandiford wrote:
> > However, for the cases on Power as Segher mentioned, there are also some 
> > scratch registers used for
> > Other purpose, not sure whether we can correctly generate zeroing in 
> > middle-end for Power?
> 
> Segher would be better placed to answer that, but I think the process
> above has to give a conservatively-accurate list of live registers.
> If it misses a register, the other late rtl passes could clobber
> that same register.

It will zero a whole bunch of registers that are overwritten later, that
are not parameter passing registers either.

Doing this with the limited information the middle end has is not the
best idea.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-14 Thread Qing Zhao via Gcc-patches



> On Sep 14, 2020, at 6:09 PM, Segher Boessenkool  
> wrote:
> 
> On Fri, Sep 11, 2020 at 05:41:47PM -0500, Qing Zhao wrote:
>>> On Sep 11, 2020, at 4:51 PM, Segher Boessenkool 
>>>  wrote:
>>> It is definitely *not* effective if there are gadgets that set rax to
>>> a value the attacker wants and then do a syscall.
>> 
>> You mean the following gadget:
>> 
>> 
>> Gadget 1:
>> 
>> mov  rax,  value
>> syscall
>> ret
> 
> No, just
> 
> mov rax,59
> syscall
> 
> (no ret necessary!)

But for ROP, a typical gadget should be ended with a “ret” (or indirect 
branch), right?

Qing
> 
> I.e. just anything that already does an execve.
> 
> 
> Segher



Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Richard Sandiford
Qing Zhao  writes:
>> On Sep 14, 2020, at 2:20 PM, Richard Sandiford  
>> wrote:
>> 
>> Qing Zhao mailto:qing.z...@oracle.com>> writes:
 On Sep 14, 2020, at 11:33 AM, Richard Sandiford 
  wrote:
 
 Qing Zhao  writes:
>> Like I mentioned earlier though, passes that run after
>> pass_thread_prologue_and_epilogue can use call-clobbered registers that
>> weren't previously used.  For example, on x86_64, the function might
>> not use %r8 when the prologue, epilogue and returns are generated,
>> but pass_regrename might later introduce a new use of %r8.  AIUI,
>> the “used” version of the new command-line option is supposed to clear
>> %r8 in these circumstances, but it wouldn't do so if the data was
>> collected at the point that the return is generated.
> 
> Thanks for the information.
> 
>> 
>> That's why I think it's more robust to do this later (at the beginning
>> of pass_late_compilation) and insert the zeroing before returns that
>> already exist.
> 
> Yes, looks like it’s not correct to insert the zeroing at the time when 
> prologue, epilogue and return are generated.
> As I also checked, “return” might be also generated as late as pass 
> “pass_delay_slots”,  So, shall we move the
> New pass as late as possible?
 
 If we insert the zeroing before pass_delay_slots and describe the
 result correctly, pass_delay_slots should do the right thing.
 
 Describing the result correctly includes ensuring that the cleared
 registers are treated as live on exit from the function, so that the
 zeroing doesn't get deleted again, or skipped by pass_delay_slots.
>>> 
>>> In the current implementation for x86, when we generating a zeroing insn as 
>>> the following:
>>> 
>>> (insn 18 16 19 2 (set (reg:SI 1 dx)
>>>(const_int 0 [0])) "t10.c":11:1 -1
>>> (nil))
>>> (insn 19 18 20 2 (unspec_volatile [
>>>(reg:SI 1 dx)
>>>] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
>>> (nil))
>>> 
>>> i.e, after each zeroing insn, the register that is zeroed is marked as 
>>> “UNSPECV_PRO_EPILOGUE_USE”, 
>>> By doing this, we can avoid this zeroing insn from being deleted or 
>>> skipped. 
>>> 
>>> Is doing this enough to describe the result correctly?
>>> Is there other thing we need to do in addition to this?
>> 
>> I guess that works, but I think it would be better to abstract
>> EPILOGUE_USES into a new target-independent wrapper function that
>> (a) returns true if EPILOGUE_USES itself returns true and (b) returns
>> true for registers that need to be zero on return, if the zeroing
>> instructions have already been inserted.  The places that currently
>> test EPILOGUE_USES should then test this new wrapper function instead.
>
> Okay, I see. 
> Looks like that EPILOGUE_USES is used in df-scan.c to compute the data flow 
> information. If EPILOUGE_USES return true
> for registers that need to be zeroed on return, those registers will be 
> included in the data flow information, as a result, later
> passes will not be able to delete them. 
>
> This sounds to be a cleaner approach than the current one that marks the 
> registers  “UNSPECV_PRO_EPILOGUE_USE”. 
>
> A more detailed implementation question on this: 
> Where should I put this new target-independent wrapper function in? Which 
> header file will be a proper place to hold this new function?

Not a strong opinion, but: maybe df.h and df-scan.c, since this is
really a DF query.

>> After inserting the zeroing instructions, the pass should recompute the
>> live-out sets based on this.

Sorry, I was wrong here.  It should *cause* the sets to be recomputed
where necessary (rather than recompute them directly), but see below.

> Is only computing the live-out sets of the block that including the return 
> insn enough? Or we should re-compute the whole procedure? 
>
> Which utility routine I should use to recompute the live-out sets?

Inserting the instructions will cause the containing block to be marked
dirty, via df_set_bb_dirty.  I think the pass should also call
df_set_bb_dirty on the exit block itself, to indicate that the
wrapper around EPILOGUE_USES has changed behaviour, but that might
not be necessary.

This gives the df machinery enough information to work out what has changed.
It will then propagate those changes throughout the function.  (I don't
think any propagation would be necessary here, but if I'm wrong about that,
then the df machinery will do whatever propagation is necessary.)

However, the convention is for a pass that uses the df machinery to call
df_analyze first.  This call to df_analyze updates any stale df information.

So unlike what I said yesterday, the pass itself doesn't need to make sure
that the df information is up-to-date.  It just needs to indicate what
has changed, as above.

In the case of pass_delay_slots, pass_free_cfg has:

  /* The resource.c machinery uses DF but the CFG isn't guara

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Richard Sandiford
Segher Boessenkool  writes:
> On Mon, Sep 14, 2020 at 05:33:33PM +0100, Richard Sandiford wrote:
>> > However, for the cases on Power as Segher mentioned, there are also some 
>> > scratch registers used for
>> > Other purpose, not sure whether we can correctly generate zeroing in 
>> > middle-end for Power?
>> 
>> Segher would be better placed to answer that, but I think the process
>> above has to give a conservatively-accurate list of live registers.
>> If it misses a register, the other late rtl passes could clobber
>> that same register.
>
> It will zero a whole bunch of registers that are overwritten later, that
> are not parameter passing registers either.

This thread has covered two main issues: correctness and cost.
The question above was about correctness, but your reply seems to be
about cost.  The correctness question was instead: would the process
described in my previous message lead the compiler to think that a
register wasn't live before a Power return instruction when the
register actually was live?  (And if so, how do we get around that
for other post prologue-epilogue passes that use df?)

On the cost issue: when you say some registers are “overwritten later”:
which registers do you mean, and who would be doing the overwriting?
We were talking about inserting zeroing instructions immediately before
returns that already exist.  It looks like the main Power return
pattern is:

(define_insn "return"
  [(any_return)]
  ""
  "blr"
  [(set_attr "type" "jmpreg")])

Does this overwrite anything other than the PC?  If not, it doesn't
look like anything in the function itself would clobber other registers
later (i.e. later than the inserted zeroing instructions).  And of course,
if an attacker is performing a ROP attack, the attacker controls which
address the BLR returns to.

Thanks,
Richard


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Qing Zhao via Gcc-patches



> On Sep 15, 2020, at 4:11 AM, Richard Sandiford  
> wrote:
> 
> Qing Zhao mailto:qing.z...@oracle.com>> writes:
>>> On Sep 14, 2020, at 2:20 PM, Richard Sandiford  
>>> wrote:
>>> 
>>> Qing Zhao mailto:qing.z...@oracle.com>> writes:
> On Sep 14, 2020, at 11:33 AM, Richard Sandiford 
>  wrote:
> 
> Qing Zhao  writes:
>>> Like I mentioned earlier though, passes that run after
>>> pass_thread_prologue_and_epilogue can use call-clobbered registers that
>>> weren't previously used.  For example, on x86_64, the function might
>>> not use %r8 when the prologue, epilogue and returns are generated,
>>> but pass_regrename might later introduce a new use of %r8.  AIUI,
>>> the “used” version of the new command-line option is supposed to clear
>>> %r8 in these circumstances, but it wouldn't do so if the data was
>>> collected at the point that the return is generated.
>> 
>> Thanks for the information.
>> 
>>> 
>>> That's why I think it's more robust to do this later (at the beginning
>>> of pass_late_compilation) and insert the zeroing before returns that
>>> already exist.
>> 
>> Yes, looks like it’s not correct to insert the zeroing at the time when 
>> prologue, epilogue and return are generated.
>> As I also checked, “return” might be also generated as late as pass 
>> “pass_delay_slots”,  So, shall we move the
>> New pass as late as possible?
> 
> If we insert the zeroing before pass_delay_slots and describe the
> result correctly, pass_delay_slots should do the right thing.
> 
> Describing the result correctly includes ensuring that the cleared
> registers are treated as live on exit from the function, so that the
> zeroing doesn't get deleted again, or skipped by pass_delay_slots.
 
 In the current implementation for x86, when we generating a zeroing insn 
 as the following:
 
 (insn 18 16 19 2 (set (reg:SI 1 dx)
   (const_int 0 [0])) "t10.c":11:1 -1
(nil))
 (insn 19 18 20 2 (unspec_volatile [
   (reg:SI 1 dx)
   ] UNSPECV_PRO_EPILOGUE_USE) "t10.c":11:1 -1
(nil))
 
 i.e, after each zeroing insn, the register that is zeroed is marked as 
 “UNSPECV_PRO_EPILOGUE_USE”, 
 By doing this, we can avoid this zeroing insn from being deleted or 
 skipped. 
 
 Is doing this enough to describe the result correctly?
 Is there other thing we need to do in addition to this?
>>> 
>>> I guess that works, but I think it would be better to abstract
>>> EPILOGUE_USES into a new target-independent wrapper function that
>>> (a) returns true if EPILOGUE_USES itself returns true and (b) returns
>>> true for registers that need to be zero on return, if the zeroing
>>> instructions have already been inserted.  The places that currently
>>> test EPILOGUE_USES should then test this new wrapper function instead.
>> 
>> Okay, I see. 
>> Looks like that EPILOGUE_USES is used in df-scan.c to compute the data flow 
>> information. If EPILOUGE_USES return true
>> for registers that need to be zeroed on return, those registers will be 
>> included in the data flow information, as a result, later
>> passes will not be able to delete them. 
>> 
>> This sounds to be a cleaner approach than the current one that marks the 
>> registers  “UNSPECV_PRO_EPILOGUE_USE”. 
>> 
>> A more detailed implementation question on this: 
>> Where should I put this new target-independent wrapper function in? Which 
>> header file will be a proper place to hold this new function?
> 
> Not a strong opinion, but: maybe df.h and df-scan.c, since this is
> really a DF query.

Okay.

> 
>>> After inserting the zeroing instructions, the pass should recompute the
>>> live-out sets based on this.
> 
> Sorry, I was wrong here.  It should *cause* the sets to be recomputed
> where necessary (rather than recompute them directly), but see below.
> 
>> Is only computing the live-out sets of the block that including the return 
>> insn enough? Or we should re-compute the whole procedure? 
>> 
>> Which utility routine I should use to recompute the live-out sets?
> 
> Inserting the instructions will cause the containing block to be marked
> dirty, via df_set_bb_dirty.  I think the pass should also call
> df_set_bb_dirty on the exit block itself, to indicate that the
> wrapper around EPILOGUE_USES has changed behaviour, but that might
> not be necessary.
> 
> This gives the df machinery enough information to work out what has changed.
> It will then propagate those changes throughout the function.  (I don't
> think any propagation would be necessary here, but if I'm wrong about that,
> then the df machinery will do whatever propagation is necessary.)
> 
> However, the convention is for a pass that uses the df machinery to call
> df_analyze first.  This call to df_analyze updates any stale df information.
> 
> So unlike what I said yesterday, the pass itself d

Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Segher Boessenkool
On Mon, Sep 14, 2020 at 10:07:31PM -0500, Qing Zhao wrote:
> > On Sep 14, 2020, at 6:09 PM, Segher Boessenkool 
> >  wrote:
> >> Gadget 1:
> >> 
> >> mov  rax,  value
> >> syscall
> >> ret
> > 
> > No, just
> > 
> > mov rax,59
> > syscall
> > 
> > (no ret necessary!)
> 
> But for ROP, a typical gadget should be ended with a “ret” (or indirect 
> branch), right?

Not the last one :-)  (Especially if it is exec!)


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Segher Boessenkool
On Tue, Sep 15, 2020 at 12:46:00PM +0100, Richard Sandiford wrote:
> Segher Boessenkool  writes:
> > On Mon, Sep 14, 2020 at 05:33:33PM +0100, Richard Sandiford wrote:
> >> > However, for the cases on Power as Segher mentioned, there are also some 
> >> > scratch registers used for
> >> > Other purpose, not sure whether we can correctly generate zeroing in 
> >> > middle-end for Power?
> >> 
> >> Segher would be better placed to answer that, but I think the process
> >> above has to give a conservatively-accurate list of live registers.
> >> If it misses a register, the other late rtl passes could clobber
> >> that same register.
> >
> > It will zero a whole bunch of registers that are overwritten later, that
> > are not parameter passing registers either.
> 
> This thread has covered two main issues: correctness and cost.
> The question above was about correctness, but your reply seems to be
> about cost.

The issues are very heavily intertwined.  A much too high execution
cost is unacceptable, just like machine code that does not implement the
source code faithfully.

> On the cost issue: when you say some registers are “overwritten later”:
> which registers do you mean, and who would be doing the overwriting?

(Glue) code that is generated by the linker.

> We were talking about inserting zeroing instructions immediately before
> returns that already exist.  It looks like the main Power return
> pattern is:

It is.

> (define_insn "return"
>   [(any_return)]
>   ""
>   "blr"
>   [(set_attr "type" "jmpreg")])
> 
> Does this overwrite anything other than the PC?  If not, it doesn't

(We do not have a "PC" register, but :-) )

Nope.  The blr instruction does not write any register.  (The base
"bclr[l]" insn can write to CTR and LR).

> look like anything in the function itself would clobber other registers
> later (i.e. later than the inserted zeroing instructions).  And of course,
> if an attacker is performing a ROP attack, the attacker controls which
> address the BLR returns to.

That does not matter for the *normal* case.  Making the normal case even
more expensive than this scheme already is is no good.


Anyway, I was concerned about other architectures, too (that may not
even *have* a GCC port (yet)).  The point is that this should follow all
the rules we have for RTL.  Now that it will use DF (thanks!), most of
that will follow automatically (or easily, anyway).


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Segher Boessenkool
On Tue, Sep 15, 2020 at 10:11:41AM +0100, Richard Sandiford wrote:
> Qing Zhao  writes:
> >> On Sep 14, 2020, at 2:20 PM, Richard Sandiford  
> >> wrote:
(Putting correct info in DF, inserting the new insns in pro_and_epi).

But, scheduling runs *after* that, and then you need to prevent the
inserted (zeroing) insns from moving -- if you don't, the code after
some zeroing can be used as gadget!  You want to always have all
zeroing insns after *any* computational insn, or it becomes a gadget.


Segher


Re: PING [Patch][Middle-end]Add -fzero-call-used-regs=[skip|used-gpr|all-gpr|used|all]

2020-09-15 Thread Qing Zhao via Gcc-patches



> On Sep 15, 2020, at 2:41 PM, Segher Boessenkool  
> wrote:
> 
> On Tue, Sep 15, 2020 at 10:11:41AM +0100, Richard Sandiford wrote:
>> Qing Zhao  writes:
 On Sep 14, 2020, at 2:20 PM, Richard Sandiford  
 wrote:
> (Putting correct info in DF, inserting the new insns in pro_and_epi).
> 
> But, scheduling runs *after* that, and then you need to prevent the
> inserted (zeroing) insns from moving -- if you don't, the code after
> some zeroing can be used as gadget!  You want to always have all
> zeroing insns after *any* computational insn, or it becomes a gadget.

Please see the previous discussion, we have agreed to put the new pass   
(pass_zero_call_used_regs) 
in the beginning of the pass_late_compilation as following:

 PUSH_INSERT_PASSES_WITHIN (pass_late_compilation)
  NEXT_PASS (pass_zero_call_used_regs);
 NEXT_PASS (pass_compute_alignments);
 NEXT_PASS (pass_variable_tracking);
 NEXT_PASS (pass_free_cfg);
 NEXT_PASS (pass_machine_reorg);
 NEXT_PASS (pass_cleanup_barriers);
 NEXT_PASS (pass_delay_slots);

Scheduling has been done already. 

Qing


> 
> 
> Segher



  1   2   >