Re: gcc feature request / RFC: extra clobbered regs
On 07/01/2015 01:43 PM, Jakub Jelinek wrote:
> On Wed, Jul 01, 2015 at 01:35:16PM -0400, Vladimir Makarov wrote:
>> Actually, it raises a question for me. Suppose we declare that a function clobbers more than the calling convention allows, then use it as a value (assigning it to a variable or passing it as an argument), lose track of it, and then call it. How can RA know what the call actually clobbers? So for functions with these attributes we should either prohibit using them as values, or make the attributes part of the function type, or at least say it is unsafe. So now I see this as a *bigger problem* with this extension. Although I guess it already exists, as we have descriptions of different ABIs as an extension.
>
> Unfortunately the target attribute is a function decl attribute rather than a function type attribute. And having more attributes affect switchable targets will be non-fun.

Making the attributes a part of the type would probably create a lot of issues too. Although I am not a front-end developer, I still think it is hard to implement in the front end. Sticking fully to this approach, it would be logical to describe this in debug info (I am not sure it is even possible).

Portability would be an issue too. It is hard to prevent a regular C developer from assigning such a function to a variable, because it is OK on his system while compilation of such code may fail on another system.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: gcc feature request / RFC: extra clobbered regs
On 07/01/2015 11:27 AM, Andy Lutomirski wrote:
> On Wed, Jul 1, 2015 at 8:23 AM, Vladimir Makarov wrote:
>> On 06/30/2015 05:37 PM, Jakub Jelinek wrote:
>>> On Tue, Jun 30, 2015 at 02:22:33PM -0700, Andy Lutomirski wrote:
>>>> I'm working on a massive set of cleanups to Linux's syscall handling. We currently have a nasty optimization in which we don't save rbx, rbp, r12, r13, r14, and r15 on x86_64 before calling C functions. This works, but it makes the code a huge mess. I'd rather save all regs in asm and then call C code. Unfortunately, this will add five cycles (on SNB) to one of the hottest paths in the kernel.
>>>>
>>>> To counteract it, I have a gcc feature request that might not be all that crazy. When writing C functions intended to be called from asm, what if we could do:
>>>>
>>>>   __attribute__((extra_clobber("rbx", "rbp", "r12", "r13", "r14", "r15"))) void func(void);
>>>>
>>>> This will save enough pushes and pops that it could easily give us our five cycles back and then some. It's also easy to be compatible with old GCC versions -- we could just omit the attribute, since preserving a register is always safe.
>>>>
>>>> Thoughts? Is this totally crazy? Is it easy to implement?
>>>>
>>>> (I'm not necessarily suggesting that we do this for the syscall bodies themselves. I want to do it for the entry and exit helpers, so we'd still lose the five cycles in the full fast-path case, but we'd do better in the slower paths, and the slower paths are becoming increasingly important in real workloads.)
>>>
>>> GCC already supports the -ffixed-REG, -fcall-used-REG and -fcall-saved-REG options, which allow tweaking the calling conventions, but only per translation unit right now. It isn't clear which of these options you mean with the extra_clobber. I assume you are looking for a possibility to change this to be per-function, with a caller using a different calling convention having to adjust for the callee's ABI. To some extent, recent GCC versions do that automatically with -fipa-ra already - if some call-used registers are not clobbered by some call and the caller can analyze that callee, it can stick values in such registers across the call. I'd say the most natural API for this would be to allow f{fixed,call-{used,saved}}-REG in the target attribute.
>>
>> One consequence of frequently changing the calling convention or register usage per function could be a GCC slowdown. RA calculates a lot of data, and it takes a lot of time to recalculate it after something in the register usage convention changes.
>
> Do you mean that RA precalculates things based on the calling convention and saves it across functions?

RA calculates a lot of info (register classes, class x class relations, etc.) based on the register usage convention (fixed regs, call-used registers, etc.). If the register usage convention has not changed since the previous function was compiled, RA reuses the info. Otherwise, RA recalculates it.

> Hmm. I don't think this would be a big problem in my intended use case -- there would only be a handful of functions using this extension, and they'd have very few non-asm callers.

Good. I guess it will be rarely used and people will tolerate some extra compilation time.

>> Another consequence would be that RA fails to generate code in some cases and, even worse, the failure might depend on the GCC version (I have already seen PRs where RA worked for an asm in one GCC version because a pseudo was replaced by an equivalent constant, and failed in another GCC version where that did not happen).
>
> Would this be a problem generating code for a function with extra "used" regs or just a problem generating code to call such a function? I imagine that, in the former case, RA's job would be easier, not harder, since there would be more registers to work with.

Sorry, I meant that the problem will mostly arise when the attributes describe more fixed regs. If you describe more clobbered regs, the allocator can still use them and spill/restore them (around calls) when they cannot be used. Still, I think there will be some rare and complicated cases where even describing only clobbered regs can make RA fail in a function that calls the function with additional clobbered regs.

> In practice, though, I think it would just end up changing the prologue and epilogue.
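The reuse-or-recompute behavior described in this message can be pictured with a small sketch. This is not GCC code: every type and function name below is a hypothetical stand-in, the "expensive" derived data is reduced to a single counter, and a 16-register machine is assumed purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not GCC internals. */
typedef struct {
    uint32_t fixed_regs;      /* bit i set: register i is never allocated */
    uint32_t call_used_regs;  /* bit i set: register i is clobbered by calls */
} reg_convention;

typedef struct {
    reg_convention conv;      /* convention this info was computed for */
    int allocatable;          /* stand-in for the expensive derived tables */
    bool valid;
} ra_info;

static ra_info cached;        /* info left over from the previous function */
static int recompute_count;   /* how often the slow path ran */

/* Stand-in for the expensive recalculation: count the non-fixed registers
 * among 16 general-purpose registers (an assumption for this sketch). */
static int count_allocatable(reg_convention conv)
{
    int n = 0;
    for (int i = 0; i < 16; i++)
        if (!(conv.fixed_regs & (UINT32_C(1) << i)))
            n++;
    return n;
}

/* Return the info for `conv`, recomputing only if the convention differs
 * from the one used for the previously compiled function. */
const ra_info *get_ra_info(reg_convention conv)
{
    if (cached.valid &&
        cached.conv.fixed_regs == conv.fixed_regs &&
        cached.conv.call_used_regs == conv.call_used_regs)
        return &cached;               /* convention unchanged: reuse */

    cached.conv = conv;               /* convention changed: recompute */
    cached.allocatable = count_allocatable(conv);
    cached.valid = true;
    recompute_count++;
    return &cached;
}

int ra_recompute_count(void) { return recompute_count; }
```

With a single cached entry like this, alternating between two conventions recomputes on every switch, which is why a handful of functions using the extension costs little while frequent switching could be slow.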
Re: gcc feature request / RFC: extra clobbered regs
On 07/01/2015 11:31 AM, Jakub Jelinek wrote:
> On Wed, Jul 01, 2015 at 11:23:17AM -0400, Vladimir Makarov wrote:
>>>> (I'm not necessarily suggesting that we do this for the syscall bodies themselves. I want to do it for the entry and exit helpers, so we'd still lose the five cycles in the full fast-path case, but we'd do better in the slower paths, and the slower paths are becoming increasingly important in real workloads.)
>>>
>>> GCC already supports the -ffixed-REG, -fcall-used-REG and -fcall-saved-REG options, which allow tweaking the calling conventions, but only per translation unit right now. It isn't clear which of these options you mean with the extra_clobber. I assume you are looking for a possibility to change this to be per-function, with a caller using a different calling convention having to adjust for the callee's ABI. To some extent, recent GCC versions do that automatically with -fipa-ra already - if some call-used registers are not clobbered by some call and the caller can analyze that callee, it can stick values in such registers across the call. I'd say the most natural API for this would be to allow f{fixed,call-{used,saved}}-REG in the target attribute.
>>
>> One consequence of frequently changing the calling convention or register usage per function could be a GCC slowdown. RA calculates a lot of data, and it takes a lot of time to recalculate it after something in the register usage convention changes.
>
> That is true. i?86/x86_64 is a switchable target, so at least for the info computed for a callee with a non-standard calling convention, such info can be computed just once, when a function with such a target attribute is first seen.

Yes, a more clever way could be used. We can calculate the info for a specific calling convention, save it, and reuse it for functions with the same attributes. The compilation speed will be OK even with the current implementation if there are few calling convention changes.

> But for the caller side, I agree not everything can be precomputed, if we can't use e.g. regsets saved in the callee, as a single function can call different functions with different ABIs. But to some extent we have that already with -fipa-ra, don't we?

Yes, for -fipa-ra, if we have seen the function, we know what registers it actually clobbers. If we have not processed it yet, we use the worst-case scenario (clobbering all registers that the calling convention allows to be clobbered).

Actually, it raises a question for me. Suppose we declare that a function clobbers more than the calling convention allows, then use it as a value (assigning it to a variable or passing it as an argument), lose track of it, and then call it. How can RA know what the call actually clobbers? So for functions with these attributes we should either prohibit using them as values, or make the attributes part of the function type, or at least say it is unsafe.

So now I see this as a *bigger problem* with this extension. Although I guess it already exists, as we have descriptions of different ABIs as an extension.
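The function-as-value concern raised at the end of this message can be made concrete with a short sketch. The `EXTRA_CLOBBER` macro is hypothetical and stands in for the proposed attribute; it expands to nothing here, on the assumption that no released compiler implements it. `helper` and `call_indirect` are likewise made-up names.

```c
#ifndef EXTRA_CLOBBER
/* HYPOTHETICAL: stands in for the proposed attribute; expands to nothing
 * because no such attribute is assumed to exist in released compilers. */
#define EXTRA_CLOBBER(...)
#endif

/* Declared as clobbering more than the standard convention allows. */
EXTRA_CLOBBER("rbx", "r12") int helper(int x);

int helper(int x) { return x * 2; }

/* An ordinary function pointer type: nothing here records the clobbers. */
typedef int (*plain_fn)(int);

int call_indirect(plain_fn fn, int x)
{
    /* At this call site the compiler sees only `plain_fn`. With a
     * decl-level attribute it would have to assume the standard ABI and
     * could keep a live value in rbx across the call, which `helper`
     * would then be free to destroy. */
    return fn(x);
}
```

Calling `call_indirect(helper, 21)` is exactly the pattern in question: the code is valid C, yet the clobber information attached to the declaration of `helper` does not travel with the pointer unless it is part of the function type.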
Re: gcc feature request / RFC: extra clobbered regs
On 06/30/2015 05:37 PM, Jakub Jelinek wrote:
> On Tue, Jun 30, 2015 at 02:22:33PM -0700, Andy Lutomirski wrote:
>> I'm working on a massive set of cleanups to Linux's syscall handling. We currently have a nasty optimization in which we don't save rbx, rbp, r12, r13, r14, and r15 on x86_64 before calling C functions. This works, but it makes the code a huge mess. I'd rather save all regs in asm and then call C code. Unfortunately, this will add five cycles (on SNB) to one of the hottest paths in the kernel.
>>
>> To counteract it, I have a gcc feature request that might not be all that crazy. When writing C functions intended to be called from asm, what if we could do:
>>
>>   __attribute__((extra_clobber("rbx", "rbp", "r12", "r13", "r14", "r15"))) void func(void);
>>
>> This will save enough pushes and pops that it could easily give us our five cycles back and then some. It's also easy to be compatible with old GCC versions -- we could just omit the attribute, since preserving a register is always safe.
>>
>> Thoughts? Is this totally crazy? Is it easy to implement?
>>
>> (I'm not necessarily suggesting that we do this for the syscall bodies themselves. I want to do it for the entry and exit helpers, so we'd still lose the five cycles in the full fast-path case, but we'd do better in the slower paths, and the slower paths are becoming increasingly important in real workloads.)
>
> GCC already supports the -ffixed-REG, -fcall-used-REG and -fcall-saved-REG options, which allow tweaking the calling conventions, but only per translation unit right now. It isn't clear which of these options you mean with the extra_clobber. I assume you are looking for a possibility to change this to be per-function, with a caller using a different calling convention having to adjust for the callee's ABI. To some extent, recent GCC versions do that automatically with -fipa-ra already - if some call-used registers are not clobbered by some call and the caller can analyze that callee, it can stick values in such registers across the call. I'd say the most natural API for this would be to allow f{fixed,call-{used,saved}}-REG in the target attribute.

One consequence of frequently changing the calling convention or register usage per function could be a GCC slowdown. RA calculates a lot of data, and it takes a lot of time to recalculate it after something in the register usage convention changes.

Another consequence would be that RA fails to generate code in some cases and, even worse, the failure might depend on the GCC version (I have already seen PRs where RA worked for an asm in one GCC version because a pseudo was replaced by an equivalent constant, and failed in another GCC version where that did not happen).

Other than that, I don't see other complications with implementing such a feature.
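The backward-compatibility point in the original request (older compilers can simply omit the attribute, since preserving a register is always safe) could look roughly like the sketch below. `extra_clobber` is only the attribute proposed in this thread and `EXTRA_CLOBBER` and `entry_helper` are made-up names; the feature test is expected to fail on existing compilers, so the macro expands to nothing.

```c
/* HYPOTHETICAL: `extra_clobber` is the attribute proposed in this thread.
 * It is assumed not to exist in any released compiler, so the feature
 * test below is expected to fail and the fallback to be used. */
#if defined(__has_attribute)
# if __has_attribute(extra_clobber)
#  define EXTRA_CLOBBER(...) __attribute__((extra_clobber(__VA_ARGS__)))
# endif
#endif
#ifndef EXTRA_CLOBBER
/* Fallback: omit the attribute. The function then preserves the
 * callee-saved registers as usual, which is always safe, only slower. */
# define EXTRA_CLOBBER(...)
#endif

/* A C helper intended to be called from assembly that has already saved
 * every register (name and body are illustrative only). */
EXTRA_CLOBBER("rbx", "rbp", "r12", "r13", "r14", "r15")
long entry_helper(long nr);

long entry_helper(long nr)
{
    return nr + 1;
}
```

Keeping the register list behind one macro means the call sites and declarations stay identical whether or not the compiler understands the attribute.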
Re: [PATCH] x86: Optimize variable_test_bit()
On 02/05/15 08:43 AM, Peter Zijlstra wrote:
> On Fri, May 01, 2015 at 03:02:24PM -0400, Vladimir Makarov wrote:
>> Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS. PPC, SH, and ARC are moving to LRA. All other targets are still reload based.
>>
>> So I could implement the output reloads in LRA, probably for the next GCC release. How to enable and mostly use it for multi-target code like the kernel is another question.
>
> Pretty much all inline asm is in per arch code; so one arch having different asm features than another should not be a problem at all.

Ok, then. I'll try to implement output operands for asm-goto in LRA for the next GCC release. Of course, if nobody objects to changing asm goto semantics from

  An 'asm goto' statement cannot have outputs ...

to

  An 'asm goto' statement cannot have outputs on some targets ...
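For readers unfamiliar with the restriction being discussed, here is a minimal sketch of an asm goto with no output operands, where the result is conveyed only by which label control reaches. The function name `test_bit_goto` is made up for illustration and this is not the kernel's variable_test_bit(); the x86 path assumes GCC-style extended asm in AT&T syntax, and other targets fall back to plain C.

```c
#include <stdbool.h>

/* Illustrative only: test bit `nr` of `word`, reporting the result
 * through control flow rather than through an asm output operand. */
bool test_bit_goto(unsigned long word, unsigned int nr)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ goto("bt %1, %0\n\t"          /* CF = bit %1 of %0 */
                 "jc %l[bit_set]"         /* jump if the bit was set */
                 : /* no outputs */
                 : "r"(word), "r"((unsigned long)nr)
                 : "cc"
                 : bit_set);
    return false;                          /* fallthrough: bit clear */
bit_set:
    return true;
#else
    /* Portable fallback for targets without the x86 sequence above. */
    return (word >> (nr % (8 * sizeof word))) & 1UL;
#endif
}
```

Because the asm cannot hand a value back in a register here, every outcome has to be encoded as a separate label, which is the limitation the proposed change to the documentation wording would relax on LRA targets.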
Re: [PATCH] x86: Optimize variable_test_bit()
On 01/05/15 04:49 PM, Linus Torvalds wrote:
> On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov wrote:
>> GCC RA is a major reason to prohibit output operands for asm goto.
>
> Hmm.. Thinking some more about it, I think that what would actually work really well at least for the kernel is:
>
> (a) allow *memory* operands (ie "=m") as outputs and having them be meaningful even at any output labels (obviously with the caveat that the asm instructions that write to memory would have to happen before the branch ;)
>
> This covers the somewhat common case of having magic instructions that result in conditions that can't be tested at a C level. Things like "bit clear and test" on x86 (with or without the lock).
>
> (b) allow other operands to be meaningful only for the fallthrough case.
>
> From a register allocation standpoint, these should be the easy cases. (a) doesn't need any register allocation of the output (only on the input to set up the effective address of the memory location), and (b) would explicitly mean that an "asm goto" would leave any non-memory outputs undefined in any of the goto cases, so from a RA standpoint it ends up being equivalent to a non-goto asm..

Thanks for explaining what you need in the most common case.

A big part of GCC RA (at least of the local register allocators, the reload pass and LRA), besides assigning hard registers to pseudos, is making transformations to satisfy insn constraints. If there are not enough hard registers, a pseudo can be allocated to a stack slot, and if an insn using the pseudo needs a hard register, a load and/or store has to be generated before/after the insn. The problem for both the old RA (reload pass) and the new one (LRA) is that they were not designed to put new insns after an insn that changes control flow. Assigning hard registers itself is not an issue for the asm goto case.

If I understood you correctly, you assume that just permitting =m will make GCC generate the correct code. Unfortunately, it is more complicated. The operand might not be a memory, or might be a memory not satisfying the memory constraint 'm'. So insns moving the memory satisfying 'm' into the output operand location might still be necessary after the asm goto.

We could make the asm goto semantics require that the user provide memory for such an output operand (e.g. a pointer dereference in your case) and generate an error otherwise. By the way, the same could be done for an output *register* operand: to avoid the error, the user would have to use a local register variable (a GCC extension) as the operand. But that might be a bad idea from a code performance point of view.

Unfortunately, the operand can be substituted by an equivalent value during various transformations, so even if the user thinks it will be a memory before RA, that might be wrong. I do believe there are some cases where we can be sure that it will be memory (e.g. dereferencing a pointer which is a function argument and is not used anywhere else in the function). Still, it makes the asm goto semantics complicated, imho.

We could prevent equivalence substitution for output memory operands of asm goto through all the optimizations, but that is probably an even harder task than implementing output reloads in the *reload* pass (it is a 28-year-old pass with so many changes during its life that practically nobody can now understand it well and change it without introducing a new bug).

As for LRA, as I wrote, implementing output reloads is a doable task.

> Hmm?
>
> So as an example of something that the kernel does and which wants to have an output register, is to do a load from user space that can fault. When it faults, we obviously simply don't *have* an actual result, and we return an error. But for the successful fallthrough case, we get a value in a register.
>
> I'd love to be able to write it as (this is simplified, and doesn't worry about all the different access sizes, or the "stac/clac" sequence to enable user accesses on modern Intel CPU's):
>
>   asm goto(
>       "1:"
>       "\tmovl %1,%0\n"
>       _ASM_EXTABLE(1b,%l[error])
>       : "=r" (val)
>       : "m" (*userptr)
>       : : error);
>
> where that "_ASM_EXTABLE()" is our magic macro for generating an exception entry for that instruction, so that if the load takes an exception, it will instead go to the "error" label.
>
> But if it goes to the error label, the "val" output register really doesn't contain anything, so we wouldn't even *want* gcc to try to do any register allocation for the "jump to label from assembly" case.
>
> So at least for one of the major cases that I'd like to use "asm goto" with an output, I actually don't *want* any register allocation for anything but the fallthrough case. And I suspect that's a not-too-uncommon pattern - it's probably often about error handling.

As I wrote already if we implement output reloads after the control f
Re: [PATCH] x86: Optimize variable_test_bit()
On 01/05/15 12:33 PM, Jakub Jelinek wrote: On Fri, May 01, 2015 at 09:03:32AM -0700, Linus Torvalds wrote: PPS. Jakub, I see gcc5.1 still hasn't got output operands for asm goto; is this something we can get 'fixed' ? CCing Richard as author of asm goto and Vlad as register allocator maintainer. There are a few enhancement requests to support this, like http://gcc.gnu.org/PR59615 and http://gcc.gnu.org/PR52381 , but indeed the reason why no outputs are allowed is the register allocation issue. Don't know if LRA would be better suited to handle that case, but it would indeed be pretty hard. GCC RA is a major reason to prohibit output operands for asm goto. The reload pass was not designed to deal with output reloads for control flow insns. It is very hard to implement this feature there and to implement it in a reliable way. Also, nobody has done any development on reload for a long time, so I doubt that somebody would do this for reload. LRA is more suitable for implementing the feature. In general, even outputs used on any branch can be permitted, although critical edges can complicate the implementation because new BBs are created. But it is doable too. The only problem is that asm goto semantics in this case would have to be defined depending on which local register allocator (reload or LRA) GCC uses for the given target. Currently LRA is used by x86/x86-64, ARM, AARCH64, s390, and MIPS. PPC, SH, and ARC are moving to LRA. All other targets are still reload based. So I could implement the output reloads in LRA, probably for the next GCC release. How to enable it and, mainly, how to use it for multi-target code like the kernel is another question.
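One way multi-target code could cope with allocator-dependent semantics is to gate the new form behind a build-time switch and keep a plain C fallback. The sketch below is only an illustration under stated assumptions: HAVE_ASM_GOTO_OUTPUT and checked_load() are hypothetical names that do not come from this thread, the NULL test merely stands in for a faulting user-space access, and the asm branch assumes an x86 compiler that accepts outputs on asm goto, which GCC did not at the time of this message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical feature switch: a build system would set this to 1 only
 * for compilers and targets known to accept outputs on asm goto. */
#ifndef HAVE_ASM_GOTO_OUTPUT
#define HAVE_ASM_GOTO_OUTPUT 0
#endif

/* Load *ptr into *val. Returns 0 on success, -1 if ptr is NULL.
 * The NULL check is a stand-in for the fault path of a real user access. */
static int checked_load(const int *ptr, int *val)
{
    assert(val != NULL);
#if HAVE_ASM_GOTO_OUTPUT && (defined(__x86_64__) || defined(__i386__))
    int tmp;
    /* __asm__ is used instead of asm so the code also builds in strict
     * -std= modes. AT&T operand order: source first, destination last. */
    __asm__ goto("test %1, %1\n\t"
                 "jz %l[error]\n\t"
                 "movl (%1), %0"
                 : "=&r"(tmp)
                 : "r"(ptr)
                 : "cc", "memory"
                 : error);
    *val = tmp;   /* tmp is read only on the fallthrough path */
    return 0;
error:
    return -1;    /* tmp is treated as undefined here */
#else
    /* Portable fallback used when the switch is off or on other targets. */
    if (ptr == NULL)
        return -1;
    *val = *ptr;
    return 0;
#endif
}
```

Callers see the same interface either way, so the switch can be flipped per target without touching them.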
Re: [PATCH] x86: Optimize variable_test_bit()
On 01/05/15 04:49 PM, Linus Torvalds wrote: On Fri, May 1, 2015 at 12:02 PM, Vladimir Makarov vmaka...@redhat.com wrote: GCC RA is a major reason to prohibit output operands for asm goto. Hmm.. Thinking some more about it, I think that what would actually work really well at least for the kernel is: (a) allow *memory* operands (ie =m) as outputs and having them be meaningful even at any output labels (obviously with the caveat that the asm instructions that write to memory would have to happen before the branch ;) This covers the somewhat common case of having magic instructions that result in conditions that can't be tested at a C level. Things like bit clear and test on x86 (with or without the lock) . (b) allow other operands to be meaningful only for the fallthrough case. From a register allocation standpoint, these should be the easy cases. (a) doesn't need any register allocation of the output (only on the input to set up the effective address of the memory location), and (b) would explicitly mean that an asm goto would leave any non-memory outputs undefined in any of the goto cases, so from a RA standpoint it ends up being equivalent to a non-goto asm.. Thanks for the explanation of what you need in the most common case. A big part of GCC RA (at least of the local register allocators -- the reload pass and LRA), besides assigning hard registers to pseudos, is making transformations to satisfy insn constraints. If there are not enough hard registers, a pseudo can be allocated to a stack slot, and if an insn using the pseudo needs a hard register, a load and/or store has to be generated before/after the insn. The problem for both the old RA (reload pass) and the new one (LRA) is that they were not designed to put new insns after an insn that changes control flow. Assigning hard registers itself is not an issue for the asm goto case. If I understood you correctly, you assume that just permitting =m will make GCC generate the correct code. Unfortunately, it is more complicated.
The operand might not be memory at all, or it might be memory that does not satisfy the memory constraint 'm'. So insns that move memory satisfying 'm' into the output operand location might still be necessary after the asm goto. We could define asm goto semantics to require that the user provide memory for such an output operand (e.g. a pointer dereference in your case) and generate an error otherwise. By the way, the same could be done for an output *register* operand: to avoid the error, the user would have to use a local register variable (a GCC extension) as the operand. But that might be a bad idea from a code performance point of view. Unfortunately, the operand can be substituted by an equivalent value during various transformations, so even if a user thinks it will be memory before RA, that might be wrong. I do believe there are some cases where we can be sure it will be memory (e.g. dereferencing a pointer which is a function argument and is not used anywhere else in the function). Still, it makes asm goto semantics complicated imho. We could prevent equivalence substitution for the output memory operand of asm goto through all the optimizations, but that is probably an even harder task than implementing output reloads in the *reload* pass (it is a 28-year-old pass with so many changes during its life that practically nobody understands it well enough now to change it w/o introducing a new bug). As for LRA, as I wrote, implementing output reloads is a doable task. Hmm? So an example of something that the kernel does, and which wants to have an output register, is a load from user space that can fault. When it faults, we obviously simply don't *have* an actual result, and we return an error. But for the successful fallthrough case, we get a value in a register.
I'd love to be able to write it as (this is simplified, and doesn't worry about all the different access sizes, or the "stac/clac" sequence to enable user accesses on modern Intel CPUs): asm goto( "1:" "\tmovl %1,%0\n" _ASM_EXTABLE(1b,%l[error]) : "=r" (val) : "m" (*userptr) : : error); where that "_ASM_EXTABLE()" is our magic macro for generating an exception entry for that instruction, so that if the load takes an exception, it will instead go to the "error" label. But if it goes to the error label, the "val" output register really doesn't contain anything, so we wouldn't even *want* gcc to try to do any register allocation for the "jump to label from assembly" case. So at least for one of the major cases that I'd like to use "asm goto" with an output, I actually don't *want* any register allocation for anything but the fallthrough case. And I suspect that's a not-too-uncommon pattern - it's probably often about error handling. As I wrote already, if we implement output reloads after the control flow insn, it does not matter what the operand constraint is (memory or register). Implementing it only for fall-through case
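For reference, case (a) can already be approximated without outputs: the memory location is passed as an "m" input and a "memory" clobber tells the compiler that the asm may have written it. The following is a minimal sketch, assuming x86 and GCC-style asm goto; test_and_clear_bit32() is a hypothetical helper name, and the fallback for other targets is plain C and not atomic.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bit nr in *addr and report whether it was set before.
 * Returns 1 if the bit was set, 0 otherwise. */
static int test_and_clear_bit32(unsigned int nr, volatile uint32_t *addr)
{
    assert(nr < 32);
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* No output operands: the word is named as an "m" input and the
     * "memory" clobber makes the compiler assume it was modified.
     * btrl copies the old bit into the carry flag, jc branches on it. */
    __asm__ goto("lock; btrl %1, %0\n\t"
                 "jc %l[was_set]"
                 : /* no outputs */
                 : "m"(*addr), "Ir"(nr)
                 : "cc", "memory"
                 : was_set);
    return 0;
was_set:
    return 1;
#else
    /* Non-atomic fallback, for illustration on other targets only. */
    uint32_t mask = UINT32_C(1) << nr;
    uint32_t old = *addr;
    *addr = old & ~mask;
    return (old & mask) != 0;
#endif
}
```

The condition is consumed directly by the branch, so no register has to carry a result out of the asm, which is why this form avoids the output-reload problem discussed in the message.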