Re: gcc feature request / RFC: extra clobbered regs

2015-07-01 Thread Vladimir Makarov



On 07/01/2015 01:43 PM, Jakub Jelinek wrote:

On Wed, Jul 01, 2015 at 01:35:16PM -0400, Vladimir Makarov wrote:

Actually, this raises a question for me.  Suppose we declare that a function
clobbers more than the calling convention says, then use it as a value
(assigning it to a variable or passing it as an argument), lose track of it,
and then call it.  How can RA know what the call actually clobbers?  So for a
function with these attributes we should prohibit using it as a value, or make
the attributes part of the function type, or at least say it is unsafe.
So now I see this as a *bigger problem* with this extension.  Although I
guess it already exists, since we have descriptions of different ABIs as an
extension.

Unfortunately, the target attribute is a function decl attribute rather than
a function type attribute.  And having more attributes affect switchable
targets will be non-fun.



Making the attributes part of the type would probably create a lot of issues
too.

Although I am not a front-end developer, I still think it is hard to
implement in the front end.  Sticking fully to this approach, it would be
logical to describe this in the debug info as well (I am not sure it is even
possible).


Portability would be an issue too.  It is hard to prevent a regular
C developer from assigning such a function to a variable, because it is OK on
his system, while the compilation of such code may fail on another system.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: gcc feature request / RFC: extra clobbered regs

2015-07-01 Thread Vladimir Makarov



On 07/01/2015 11:27 AM, Andy Lutomirski wrote:

On Wed, Jul 1, 2015 at 8:23 AM, Vladimir Makarov wrote:


On 06/30/2015 05:37 PM, Jakub Jelinek wrote:

On Tue, Jun 30, 2015 at 02:22:33PM -0700, Andy Lutomirski wrote:

I'm working on a massive set of cleanups to Linux's syscall handling.
We currently have a nasty optimization in which we don't save rbx,
rbp, r12, r13, r14, and r15 on x86_64 before calling C functions.
This works, but it makes the code a huge mess.  I'd rather save all
regs in asm and then call C code.

Unfortunately, this will add five cycles (on SNB) to one of the
hottest paths in the kernel.  To counteract it, I have a gcc feature
request that might not be all that crazy.  When writing C functions
intended to be called from asm, what if we could do:

__attribute__((extra_clobber("rbx", "rbp", "r12", "r13", "r14",
"r15"))) void func(void);

This will save enough pushes and pops that it could easily give us our
five cycles back and then some.  It's also easy to be compatible with
old GCC versions -- we could just omit the attribute, since preserving
a register is always safe.
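
A minimal sketch of the fallback described above.  The extra_clobber
attribute is only the proposal from this thread and does not exist in GCC;
HAVE_EXTRA_CLOBBER, ASM_CALLED_CLOBBERS and entry_helper() are hypothetical
names chosen for illustration.  Without the feature macro, the attribute
expands to nothing and the function keeps the normal calling convention:

```c
/*
 * Sketch only: extra_clobber is the attribute proposed in this thread and
 * is not implemented by GCC.  HAVE_EXTRA_CLOBBER is a hypothetical macro a
 * build system would define only if a compiler ever supported it.
 */
#ifdef HAVE_EXTRA_CLOBBER
#define ASM_CALLED_CLOBBERS \
    __attribute__((extra_clobber("rbx", "rbp", "r12", "r13", "r14", "r15")))
#else
/* Omitting the attribute is always safe: the callee just preserves more. */
#define ASM_CALLED_CLOBBERS
#endif

/* Hypothetical helper meant to be called from the asm entry code. */
ASM_CALLED_CLOBBERS int entry_helper(int nr);

int entry_helper(int nr)
{
    return nr + 1;
}
```

With current compilers the macro is empty, so entry_helper() is an ordinary
function and any caller, asm or C, can rely on the standard ABI.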

Thoughts?  Is this totally crazy?  Is it easy to implement?

(I'm not necessarily suggesting that we do this for the syscall bodies
themselves.  I want to do it for the entry and exit helpers, so we'd
still lose the five cycles in the full fast-path case, but we'd do
better in the slower paths, and the slower paths are becoming
increasingly important in real workloads.)

GCC already supports the -ffixed-REG, -fcall-used-REG and -fcall-saved-REG
options, which allow tweaking the calling conventions, but only per
translation unit right now.  It isn't clear which of these options
you mean with extra_clobber.
I assume you are looking for a way to make this per-function, with a
caller that uses a different calling convention having to adjust for the
callee's ABI.  To some extent, recent GCC versions do that automatically
with -fipa-ra already: if some call-used registers are not clobbered by
some call and the caller can analyze that callee, it can keep values in
such registers across the call.
I'd say the most natural API for this would be to allow
f{fixed,call-{used,saved}}-REG in the target attribute.



One consequence of frequently changing the calling convention or register
usage per function could be a GCC slowdown.  RA calculates a lot of data, and
it takes a lot of time to recalculate it after something in the register
usage convention changes.

Do you mean that RA precalculates things based on the calling
convention and saves it across functions?

RA calculates a lot of info (register classes, class x class relations, etc.)
based on the register usage convention (fixed regs, call-used registers,
etc.).  If the register usage convention has not changed since the previous
function was compiled, RA reuses the info.  Otherwise, RA recalculates it.
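
The reuse-or-recalculate behaviour described above can be pictured with a
toy model.  This is only a sketch under stated assumptions: the structures,
the bitmask encoding of registers and get_ra_info() are hypothetical and are
not GCC internals.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one bit per hard register (hypothetical encoding). */
struct reg_convention {
    uint32_t fixed_regs;      /* never available to the allocator */
    uint32_t call_used_regs;  /* clobbered by calls */
};

/* Stand-in for the data derived from the convention. */
struct ra_info {
    uint32_t allocatable;          /* everything that is not fixed */
    uint32_t preserved_over_call;  /* allocatable and not call-used */
};

static struct reg_convention cached_conv;
static struct ra_info cached_info;
static bool cache_valid;
static unsigned recompute_count;   /* counts slow-path runs, for illustration */

/* Return derived info, recomputing only when the convention changed. */
const struct ra_info *get_ra_info(const struct reg_convention *conv)
{
    if (!cache_valid ||
        conv->fixed_regs != cached_conv.fixed_regs ||
        conv->call_used_regs != cached_conv.call_used_regs) {
        cached_conv = *conv;
        cached_info.allocatable = ~conv->fixed_regs;
        cached_info.preserved_over_call =
            ~conv->fixed_regs & ~conv->call_used_regs;
        cache_valid = true;
        recompute_count++;
    }
    return &cached_info;
}

unsigned ra_recompute_count(void)
{
    return recompute_count;
}
```

Compiling many functions with one convention hits the cached path; the cost
only shows up when consecutive functions alternate between conventions.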

Hmm.  I don't think this
would be a big problem in my intended use case -- there would only be
a handful of functions using this extension, and they'd have very few
non-asm callers.

Good.  I guess it will be rarely used and people will tolerate some
extra compilation time.

Another consequence would be that RA fails to generate code in some cases,
and even worse, the failure might depend on the GCC version (I have already
seen PRs where RA worked for an asm in one GCC version because a pseudo was
replaced by an equivalent constant, and failed in another GCC version where
that did not happen).


Would this be a problem generating code for a function with extra
"used" regs or just a problem generating code to call such a function?
I imagine that, in the former case, RA's job would be easier, not
harder, since there would be more registers to work with.

Sorry, I meant that the problem will mostly arise when the attributes
describe more fixed regs.  If you describe more clobbered regs, they
can still be used by the allocator, which can spill/restore them (around
calls) when they cannot be used.  Still, I think there will be some rare
and complicated cases where even describing only clobbered regs can make
RA fail in a function calling the function with additional clobbered regs.

In practice, though, I think it would just end up changing the prologue
and epilogue.





Re: gcc feature request / RFC: extra clobbered regs

2015-07-01 Thread Vladimir Makarov



On 07/01/2015 11:31 AM, Jakub Jelinek wrote:

On Wed, Jul 01, 2015 at 11:23:17AM -0400, Vladimir Makarov wrote:

(I'm not necessarily suggesting that we do this for the syscall bodies
themselves.  I want to do it for the entry and exit helpers, so we'd
still lose the five cycles in the full fast-path case, but we'd do
better in the slower paths, and the slower paths are becoming
increasingly important in real workloads.)

GCC already supports the -ffixed-REG, -fcall-used-REG and -fcall-saved-REG
options, which allow tweaking the calling conventions, but only per
translation unit right now.  It isn't clear which of these options
you mean with extra_clobber.
I assume you are looking for a way to make this per-function, with a
caller that uses a different calling convention having to adjust for the
callee's ABI.  To some extent, recent GCC versions do that automatically
with -fipa-ra already: if some call-used registers are not clobbered by
some call and the caller can analyze that callee, it can keep values in
such registers across the call.
I'd say the most natural API for this would be to allow
f{fixed,call-{used,saved}}-REG in the target attribute.



One consequence of frequently changing the calling convention or register
usage per function could be a GCC slowdown.  RA calculates a lot of data, and
it takes a lot of time to recalculate it after something in the register
usage convention changes.

That is true.  i?86/x86_64 is a switchable target, so at least for the case
of info computed for the callee with a non-standard calling convention, such
info can be computed just once, when the function with such a target
attribute is first seen.

Yes, a cleverer approach could be used.  We can calculate the info for a
specific calling convention, save it, and reuse it for functions with
the same attributes.  The compilation speed will be OK even with the
current implementation if there are few calling convention changes.

But for the caller side, I agree not
everything can be precomputed, if we can't use e.g. regsets saved in the
callee; as a single function can call different functions with different
ABIs.  But to some extent we have that already with -fipa-ra, don't we?


Yes, with -fipa-ra, if we have already seen the function, we know which
registers it actually clobbers.  If we have not processed it yet, we assume
the worst case (all registers that the calling convention allows a call to
clobber are clobbered).
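
The decision described above can be sketched as follows.  This is a toy model
under stated assumptions, not GCC code: the register bitmask,
ABI_CALL_CLOBBERED, struct callee_summary and both functions are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy ABI: registers 0..3 may be clobbered by any call (assumption). */
#define ABI_CALL_CLOBBERED 0x0fu

/* What the caller knows about a callee (hypothetical summary record). */
struct callee_summary {
    bool     analyzed;         /* was the callee's body already processed? */
    uint32_t actual_clobbers;  /* meaningful only when analyzed is true */
};

/* Registers the caller must treat as destroyed by a call to callee. */
uint32_t call_clobber_set(const struct callee_summary *callee)
{
    if (callee != NULL && callee->analyzed)
        return callee->actual_clobbers;
    /* Unknown or not yet processed callee: assume the worst case. */
    return ABI_CALL_CLOBBERED;
}

/* True if a value held in register reg (0..31) can stay live across the call. */
bool value_survives_call(unsigned reg, const struct callee_summary *callee)
{
    return (call_clobber_set(callee) & (UINT32_C(1) << reg)) == 0;
}
```

A callee known to clobber only registers 0 and 1 leaves register 2 usable
across the call, while an unknown callee forces the conservative answer.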


Actually, this raises a question for me.  Suppose we declare that a function
clobbers more than the calling convention says, then use it as a value
(assigning it to a variable or passing it as an argument), lose track of
it, and then call it.  How can RA know what the call actually clobbers?
So for a function with these attributes we should prohibit using it as a
value, or make the attributes part of the function type, or at least
say it is unsafe.  So now I see this as a *bigger problem* with this
extension.  Although I guess it already exists, since we have descriptions
of different ABIs as an extension.
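
The situation in question looks roughly like this.  It is only an
illustrative sketch: EXTRA_CLOBBER stands in for the proposed attribute
(which GCC does not implement, so it expands to nothing here), and all
function names are hypothetical.

```c
/* Placeholder for the proposed attribute; expands to nothing (assumption). */
#define EXTRA_CLOBBER /* __attribute__((extra_clobber(...))) */

/* The pointer type carries no information about clobbered registers. */
typedef int (*handler_t)(int);

/* Would clobber extra registers if the attribute existed. */
EXTRA_CLOBBER int special_helper(int x)
{
    return x * 2;
}

/* Follows the normal calling convention. */
int ordinary_helper(int x)
{
    return x + 1;
}

/*
 * At this call site the compiler sees only handler_t.  If the attribute
 * lives on the declaration rather than the type, nothing here tells the
 * register allocator that fn might be special_helper.
 */
int call_through_pointer(handler_t fn, int arg)
{
    return fn(arg);
}
```

Both helpers convert to handler_t without any diagnostic, which is exactly
why the attribute would have to be part of the function type, or taking the
address of such a function would have to be rejected.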




Re: gcc feature request / RFC: extra clobbered regs

2015-07-01 Thread Vladimir Makarov



On 06/30/2015 05:37 PM, Jakub Jelinek wrote:

On Tue, Jun 30, 2015 at 02:22:33PM -0700, Andy Lutomirski wrote:

I'm working on a massive set of cleanups to Linux's syscall handling.
We currently have a nasty optimization in which we don't save rbx,
rbp, r12, r13, r14, and r15 on x86_64 before calling C functions.
This works, but it makes the code a huge mess.  I'd rather save all
regs in asm and then call C code.

Unfortunately, this will add five cycles (on SNB) to one of the
hottest paths in the kernel.  To counteract it, I have a gcc feature
request that might not be all that crazy.  When writing C functions
intended to be called from asm, what if we could do:

__attribute__((extra_clobber("rbx", "rbp", "r12", "r13", "r14",
"r15"))) void func(void);

This will save enough pushes and pops that it could easily give us our
five cycles back and then some.  It's also easy to be compatible with
old GCC versions -- we could just omit the attribute, since preserving
a register is always safe.

Thoughts?  Is this totally crazy?  Is it easy to implement?

(I'm not necessarily suggesting that we do this for the syscall bodies
themselves.  I want to do it for the entry and exit helpers, so we'd
still lose the five cycles in the full fast-path case, but we'd do
better in the slower paths, and the slower paths are becoming
increasingly important in real workloads.)

GCC already supports the -ffixed-REG, -fcall-used-REG and -fcall-saved-REG
options, which allow tweaking the calling conventions, but only per
translation unit right now.  It isn't clear which of these options
you mean with extra_clobber.
I assume you are looking for a way to make this per-function, with a
caller that uses a different calling convention having to adjust for the
callee's ABI.  To some extent, recent GCC versions do that automatically
with -fipa-ra already: if some call-used registers are not clobbered by
some call and the caller can analyze that callee, it can keep values in
such registers across the call.
I'd say the most natural API for this would be to allow
f{fixed,call-{used,saved}}-REG in the target attribute.


One consequence of frequently changing the calling convention or register
usage per function could be a GCC slowdown.  RA calculates a lot of data, and
it takes a lot of time to recalculate it after something in the
register usage convention changes.


Another consequence would be that RA fails to generate code in some
cases, and even worse, the failure might depend on the GCC version (I have
already seen PRs where RA worked for an asm in one GCC version because a
pseudo was replaced by an equivalent constant, and failed in another GCC
version where that did not happen).


Other than that, I don't see other complications with implementing such
a feature.




Re: [PATCH] x86: Optimize variable_test_bit()

2015-05-04 Thread Vladimir Makarov



On 02/05/15 08:43 AM, Peter Zijlstra wrote:

On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote:

   Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
PPC, SH, and ARC are moving to LRA.  All other targets are still
reload based.

   So I could implement the output reloads in LRA, probably for the
next GCC release.  How to enable and mostly use it for multi-target
code like the kernel is another question.

Pretty much all inline asm is in per arch code; so one arch having
different asm features than another should not be a problem at all.

Ok, then.  I'll try to implement output operands for asm-goto in LRA for
the next GCC release.


Of course, if nobody objects to changing asm goto semantics from

 An 'asm goto' statement cannot have outputs ...

to

 An 'asm goto' statement cannot have outputs on some targets ...




Re: [PATCH] x86: Optimize variable_test_bit()

2015-05-01 Thread Vladimir Makarov



On 01/05/15 04:49 PM, Linus Torvalds wrote:

On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov wrote:

   GCC RA is a major reason to prohibit output operands for asm goto.

Hmm.. Thinking some more about it, I think that what would actually
work really well at least for the kernel is:

(a) allow *memory* operands (ie "=m") as outputs and having them be
meaningful even at any output labels (obviously with the caveat that
the asm instructions that write to memory would have to happen before
the branch ;)

This covers the somewhat common case of having magic instructions that
result in conditions that can't be tested at a C level. Things like
"bit clear and test" on x86 (with or without the lock) .

(b) allow other operands to be meaningful only for the fallthrough case.

From a register allocation standpoint, these should be the easy cases.
(a) doesn't need any register allocation of the output (only on the
input to set up the effective address of the memory location), and (b)
would explicitly mean that an "asm goto" would leave any non-memory
outputs undefined in any of the goto cases, so from a RA standpoint it
ends up being equivalent to a non-goto asm..
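
For reference, here is a sketch of what asm goto can already express without
outputs: the result of a flag-setting instruction is conveyed purely through
the branch target.  word_bit_is_set() is a hypothetical helper written for
illustration, not the kernel's variable_test_bit(); the x86-64 path assumes
a compiler with asm goto support, and other targets use a plain C fallback.

```c
/*
 * Sketch only: asm goto with inputs, clobbers and a label, but no outputs.
 */
#if defined(__GNUC__) && defined(__x86_64__)
static inline int word_bit_is_set(const unsigned long *word, unsigned long nr)
{
    nr &= 63;  /* stay inside the single word named by the "m" operand */
    __asm__ goto("bt %1, %0\n\t"   /* CF = selected bit of *word */
                 "jc %l[is_set]"   /* branch straight to the C label */
                 : /* outputs are not permitted here */
                 : "m" (*word), "r" (nr)
                 : "cc"
                 : is_set);
    return 0;
is_set:
    return 1;
}
#else
/* Portable fallback so the sketch still builds on other targets. */
static inline int word_bit_is_set(const unsigned long *word, unsigned long nr)
{
    return (int)((*word >> (nr % (8 * sizeof *word))) & 1UL);
}
#endif
```

Because only control flow leaves the asm, the register allocator never has
to place a reload after the jump, which is the part the proposals in this
thread would change.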

Thanks for explaining what you need in the most common case.

A big part of GCC RA (at least of the local register allocators, the reload
pass and LRA), besides assigning hard registers to pseudos, is making
transformations to satisfy insn constraints.  If there are not enough
hard registers, a pseudo can be allocated to a stack slot, and if an insn
using the pseudo needs a hard register, a load and/or store has to be
generated before/after the insn.  The problem for both the old RA (the reload
pass) and the new one (LRA) is that they were not designed to put new insns
after an insn that changes control flow.  Assigning hard registers itself is
not an issue in the asm goto case.


If I understood you correctly, you assume that just permitting =m will
make GCC generate the correct code.  Unfortunately, it is more
complicated.  The operand might not be memory at all, or it might be memory
that does not satisfy the memory constraint 'm'.  So insns that move memory
satisfying 'm' into the output operand location might still be necessary
after the asm goto.


We could define asm goto semantics so that the user must provide
memory for such an output operand (e.g. a pointer dereference in your
case), and generate an error otherwise.  By the way, the same could be
done for an output *register* operand: to avoid the error, the user would
have to use a local register variable (a GCC extension) as the operand.  But
that might be a bad idea from a code performance point of view.


Unfortunately, the operand can be substituted by an equivalent value during
various transformations, and even if a user thinks it will be memory
before RA, that might be wrong.  Although I believe there are some cases
where we can be sure that it will be memory (e.g. dereferencing a pointer
which is a function argument and is not used anywhere else in the
function).  Still, it makes asm goto semantics complicated, IMHO.


We could prevent equivalence substitution for an output memory operand of asm
goto through all the optimizations, but that is probably an even harder task
than implementing output reloads in the *reload* pass (it is a 28-year-old
pass with so many changes during its life that practically nobody
understands it well enough now to change it without introducing a new bug).
As for LRA, as I wrote, implementing output reloads is a doable task.



Hmm?

So an example of something that the kernel does, and which wants to
have an output register, is a load from user space that can
fault. When it faults, we obviously simply don't *have* an actual
result, and we return an error. But for the successful fallthrough
case, we get a value in a register.

I'd love to be able to write it as (this is simplified, and doesn't
worry about all the different access sizes, or the "stac/clac"
sequence to enable user accesses on modern Intel CPU's):

 asm goto(
 "1:"
 "\tmovl %1,%0\n"
 _ASM_EXTABLE(1b,%l[error])
 : "=r" (val)
 : "m" (*userptr)
 : : error);

where that "_ASM_EXTABLE()" is our magic macro for generating an
exception entry for that instruction, so that if the load takes an
exception, it will instead go to the "error" label.

But if it goes to the error label, the "val" output register really
doesn't contain anything, so we wouldn't even *want* gcc to try to do
any register allocation for the "jump to label from assembly" case.

So at least for one of the major cases that I'd like to use "asm goto"
with an output, I actually don't *want* any register allocation for
anything but the fallthrough case. And I suspect that's a
not-too-uncommon pattern - it's probably often about error handling.


As I wrote already if we implement output reloads after the control f

Re: [PATCH] x86: Optimize variable_test_bit()

2015-05-01 Thread Vladimir Makarov



On 01/05/15 12:33 PM, Jakub Jelinek wrote:

On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote:

PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto;
  is this something we can get 'fixed' ?

CCing Richard as author of asm goto and Vlad as register allocator
maintainer.  There are a few enhancement requests to support this, like
http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the
reason why no outputs are allowed is the register allocation issue.
Don't know if LRA would be better suited to handle that case, but it would
indeed be pretty hard.



  GCC RA is a major reason to prohibit output operands for asm goto.

  The reload pass was not designed to deal with output reloads for control
flow insns.  It is very hard to implement this feature there and to
implement it in a reliable way.  Also, nobody has done any development on
reload for a long time.  So I doubt that somebody would do this for reload.

  LRA is more suitable for implementing the feature.  In general, even
outputs used on any branch can be permitted, although critical edges
can complicate the implementation as new BBs are created.  But it is
doable too.

  The only problem is that asm goto semantics in this case should be
defined depending on what local register allocator (reload or LRA) GCC
uses for a given target.

  Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS.
PPC, SH, and ARC are moving to LRA.  All other targets are still
reload based.

  So I could implement the output reloads in LRA, probably for the
next GCC release.  How to enable and mostly use it for multi-target
code like the kernel is another question.



Re: [PATCH] x86: Optimize variable_test_bit()

2015-05-01 Thread Vladimir Makarov



On 01/05/15 04:49 PM, Linus Torvalds wrote:

On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov vmaka...@redhat.com wrote:

   GCC RA is a major reason to prohibit output operands for asm goto.

Hmm.. Thinking some more about it, I think that what would actually
work really well at least for the kernel is:

(a) allow *memory* operands (ie =m) as outputs and having them be
meaningful even at any output labels (obviously with the caveat that
the asm instructions that write to memory would have to happen before
the branch ;)

This covers the somewhat common case of having magic instructions that
result in conditions that can't be tested at a C level. Things like
bit clear and test on x86 (with or without the lock).

  (b) allow other operands to be meaningful only for the fallthrough case.

 From a register allocation standpoint, these should be the easy cases.
(a) doesn't need any register allocation of the output (only on the
input to set up the effective address of the memory location), and (b)
would explicitly mean that an asm goto would leave any non-memory
outputs undefined in any of the goto cases, so from a RA standpoint it
ends up being equivalent to a non-goto asm..
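
For illustration only (this is not code from the thread): while asm goto
accepts no outputs, case (a) can be approximated by passing the memory
location as an *input* and adding a "memory" clobber, so the compiler
assumes the word may have changed on every path.  A minimal sketch,
assuming GCC or clang on x86; the helper name test_and_clear_bit_sketch()
is hypothetical, and other targets get a plain C fallback:

```c
#include <stdbool.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Clear bit nr in *addr and report whether it was set before.
 * __asm__ is the same keyword as asm, spelled so strict -std modes accept it. */
static bool test_and_clear_bit_sketch(long nr, unsigned long *addr)
{
    __asm__ goto("btr %1, %0\n\t"      /* CF = old bit value, then clear it */
                 "jc %l[was_set]"      /* branch straight to the C label    */
                 : /* no outputs allowed here */
                 : "m" (*addr), "r" (nr)
                 : "cc", "memory"      /* memory clobber covers the write   */
                 : was_set);
    return false;
was_set:
    return true;
}
#else
/* Plain C fallback (not atomic), only so the sketch builds elsewhere. */
static bool test_and_clear_bit_sketch(long nr, unsigned long *addr)
{
    unsigned long mask = 1UL << nr;
    bool was_set = (*addr & mask) != 0;

    *addr &= ~mask;
    return was_set;
}
#endif
```

The "memory" clobber is heavier than a real =m output would be, which is
one reason a proper output operand is attractive.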

Thanks for the explanation of what you need in the most common case.

A big part of GCC RA (at least the local register allocators -- the reload
pass and LRA), besides assigning hard registers to pseudos, is to make
transformations to satisfy insn constraints.  If there are not enough
hard registers, a pseudo can be allocated to a stack slot, and if an insn
using the pseudo needs a hard register, a load and/or store should be
generated before/after the insn.  And the problem for the old (reload
pass) and new RA (LRA) is that they were not designed to put new insns
after an insn changing control flow.  Assigning hard registers itself is
not an issue for the asm goto case.
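
To make "output reload" concrete, here is a hand-written sketch of the
shape of the transformation for an ordinary asm (not asm goto).  It is an
illustration under assumptions, not what GCC literally emits: the name
add_into() is hypothetical, the empty asm template with a matching "0"
constraint is just a target-neutral way to get a value produced "by the
asm" in a register, and GCC or clang is assumed.

```c
/* The asm can only write a register ("=r"), but the value's home is memory.
 * tmp plays the role of the reload register; the store after the asm plays
 * the role of the output reload the allocator would insert. */
static void add_into(int a, int b, int *home)
{
    int tmp;

    /* Empty template: the output shares a register with the input a + b. */
    __asm__("" : "=r" (tmp) : "0" (a + b));

    *home = tmp;    /* "output reload": a store placed after the insn */
}
```

For a plain asm there is exactly one place to put that store.  With asm
goto the same store would be needed after an insn that changes control
flow, which is the part neither allocator was designed for.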


If I understood you correctly, you assume that just permitting =m will
make GCC generate the correct code. Unfortunately, it is more
complicated.  The operand might not be a memory, or it might be a memory
not satisfying memory constraint 'm'.  So insns for moving memory
satisfying 'm' into the output operand location might still be necessary
after the asm goto.


We could define asm goto semantics to require that the user provide
memory for such an output operand (e.g. a pointer dereference in your
case) and generate an error otherwise.  By the way, the same could be
done for an output *register* operand: to avoid the error, the user would
use a local register variable (a GCC extension) as the operand.  But it
might be a bad idea from a code performance point of view.
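
As a rough illustration of the local register variable extension
mentioned above (shown with an ordinary asm, since asm goto takes no
outputs here): the variable is pinned to a named hard register when it is
used as an asm operand, so no choice is left to the allocator for that
operand.  This is a sketch assuming GCC or clang on x86; load_word() is a
hypothetical name, and other targets get a plain C fallback.

```c
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
static unsigned int load_word(const unsigned int *p)
{
    /* GCC extension: val lives in %eax when used as an asm operand. */
    register unsigned int val __asm__("eax");

    __asm__("movl %1, %0" : "=r" (val) : "m" (*p));
    return val;
}
#else
/* Plain C fallback so the sketch builds on other targets. */
static unsigned int load_word(const unsigned int *p)
{
    return *p;
}
#endif
```

Pinning the operand this way removes the allocator's freedom to pick a
cheaper register, which is the performance concern raised above.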


Unfortunately, the operand can be substituted by an equiv. value during
different transformations, and even if a user thinks it will be a memory
before RA, that might be wrong.  Although I believe there are some cases
where we can be sure that it will be memory (e.g. dereferencing a pointer
which is a function argument and is not used anywhere else in the
function).  Still, it makes asm goto semantics complicated imho.


We could prevent equiv. substitution for an output memory operand of asm
goto through all the optimizations, but it is probably an even harder task
than implementing output reloads in the *reload* pass (it is a 28-year-old
pass with so many changes during its life that practically nobody can
understand it well now and change it w/o introducing a new bug).  As for
LRA, as I wrote, implementing output reloads is a doable task.



Hmm?

So an example of something that the kernel does and which wants to
have an output register is to do a load from user space that can
fault. When it faults, we obviously simply don't *have* an actual
result, and we return an error. But for the successful fallthrough
case, we get a value in a register.

I'd love to be able to write it as (this is simplified, and doesn't
worry about all the different access sizes, or the "stac/clac"
sequence to enable user accesses on modern Intel CPU's):

 asm goto(
 "1:"
 "\tmovl %1,%0\n"
 _ASM_EXTABLE(1b,%l[error])
 : "=r" (val)
 : "m" (*userptr)
 : : error);

where that "_ASM_EXTABLE()" is our magic macro for generating an
exception entry for that instruction, so that if the load takes an
exception, it will instead go to the "error" label.

But if it goes to the error label, the "val" output register really
doesn't contain anything, so we wouldn't even *want* gcc to try to do
any register allocation for the "jump to label from assembly" case.

So at least for one of the major cases that I'd like to use "asm goto"
with an output, I actually don't *want* any register allocation for
anything but the fallthrough case. And I suspect that's a
not-too-uncommon pattern - it's probably often about error handling.


As I wrote already if we implement output reloads after the control flow 
insn, it does not matter what operand constraint should be (memory or 
register).  Implementing it only for fall-through case
