Re: [Suggestion] about h8/300 architecture in gcc and binutils
On 09/10/2013 10:19 AM, Jeff Law wrote:
> On 09/09/2013 07:13 PM, Chen Gang wrote:
>> Hello Maintainers:
>>
>> After a Google search and checking the Linux kernel, H8/300 is dead, but gcc-4.9.0 and binutils-2.23.2 still have h8300 support. Do we still need it for another OS?
>>
>> Any suggestions or corrections are welcome, thanks.
>>
>> The related information in the linux-next tree:
>>
>> commit d02babe847bf96b82b12cc4e4e90028ac3fac73f
>> Author: Guenter Roeck
>> Date: Fri Aug 30 06:01:49 2013 -0700
>>
>>     Drop support for Renesas H8/300 (h8300) architecture
>>
>>     H8/300 has been dead for several years, and the kernel for it
>>     has not compiled for ages. Drop support for it.
>>
>>     Cc: Yoshinori Sato
>>     Acked-by: Greg Kroah-Hartman
>>     Signed-off-by: Guenter Roeck
>>
>> The related information in gcc/binutils:
>>
>> We can successfully build an h8300 cross-compiler for the Linux kernel, but it has many bugs when building the Linux kernel with -Os. If we still need h8300 for another OS, is it still valuable to report these bugs to Bugzilla (although they were found under Linux)?
>
> It is still useful to send code generation bugs for the H8/300 series to
> the GCC folks.

OK, thanks. I will wait for 1-2 days in case other members want to add their opinions to the discussion. If there are no additional opinions, I will report the bugs to Bugzilla, and I will try to continue working with the relevant members (although I am a newbie at compiler and binutils programming).

Thanks.
--
Chen Gang
Re: RFC: Inlines, LTO and GCC
On 09/09/2013 02:45 PM, Andrew MacLeod wrote:
> A number of header files have inline functions declared in them. Some of
> these functions are actually quite large, and I doubt that inlining them
> is the right thing. For instance, tree-flow-inline.h has some quite
> large functions. Many of the op_iter* functions are 30-40 lines long,
> and get_addr_base_and_unit_offset_1() is 130 lines. Doesn't seem like it
> should be static inline! :-P During the process of re-factoring header
> files, it could be worthwhile to also move functions like this to a .c
> file...
>
> I know a lot of work has gone in to the inliner and LTO, and I was
> wondering what its state was with regard to the current gcc source base.
> My questions are:
>
> 1) Is everyone in favour of moving these largish inlines out of header
> files and making them not inline?
>
> 2) What size of function is rational for inlining? Small ones,
> obviously, but where does the line start to get vague, and what would be
> a good litmus test for an inline? Functions which "do a lot" and look
> like they would use a number of registers seem like candidates to
> move... I think we have a lot of functions that end up being compiled
> quite large because they inline functions which inline functions which
> inline functions.
>
> 3) The significance of moving these out would be greatly reduced if GCC
> were produced with LTO... have we tried or considered doing this and
> possibly releasing gcc compiled this way? It seems to me we could have
> significantly less stuff in header files tagged as inline, but still see
> the benefit in our final product... maybe all we'd need is the teeny
> tiny ones... and let the machinery figure it all out. Now that would be
> sweet...

Unless we have evidence to show that inlining a nontrivial function is a performance win, my inclination is not to have them in .h files decorated with inline directives. Instead, put them back in a .c file where they belong and let LTO do its thing.
I haven't done any research, but I suspect that once you go beyond trivial functions, size is no longer a good indicator of whether or not something should be inlined. Instead, I suspect the question should be: if I inline this nontrivial code, how much code, either in the caller or the inlined callee, gets simplified away? Of course, that's not always an easy question to answer :-)

Jeff
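The simplification Jeff describes can be sketched in C. Below is a minimal, invented example (the function names are illustrative, not from GCC's sources): a callee that looks nontrivial at source level mostly folds away once inlined with a constant argument, because one whole branch becomes dead.

```c
#include <assert.h>

/* Hypothetical callee: nontrivial at source level, with two branches.  */
static inline int
access_kind (int is_store, int size)
{
  if (is_store)
    {
      /* Imagine many lines handling the store case...  */
      return size * 2 + 1;
    }
  /* ...and many more handling the load case.  */
  return size + 1;
}

/* Caller passing a constant: after inlining, the is_store branch folds
   away entirely, so the inlined body is far smaller than the callee's
   source size suggests -- size alone was a poor inlining indicator.  */
int
load_cost (int size)
{
  return access_kind (0, size);
}
```

This is the situation where "how much gets simplified away" matters more than the callee's raw line count.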
Re: [Suggestion] about h8/300 architecture in gcc and binutils
On 09/09/2013 07:13 PM, Chen Gang wrote:
> Hello Maintainers:
>
> After a Google search and checking the Linux kernel, H8/300 is dead, but gcc-4.9.0 and binutils-2.23.2 still have h8300 support. Do we still need it for another OS?
>
> Any suggestions or corrections are welcome, thanks.
>
> The related information in the linux-next tree:
>
> commit d02babe847bf96b82b12cc4e4e90028ac3fac73f
> Author: Guenter Roeck
> Date: Fri Aug 30 06:01:49 2013 -0700
>
>     Drop support for Renesas H8/300 (h8300) architecture
>
>     H8/300 has been dead for several years, and the kernel for it
>     has not compiled for ages. Drop support for it.
>
>     Cc: Yoshinori Sato
>     Acked-by: Greg Kroah-Hartman
>     Signed-off-by: Guenter Roeck
>
> The related information in gcc/binutils:
>
> We can successfully build an h8300 cross-compiler for the Linux kernel, but it has many bugs when building the Linux kernel with -Os. If we still need h8300 for another OS, is it still valuable to report these bugs to Bugzilla (although they were found under Linux)?

It is still useful to send code generation bugs for the H8/300 series to the GCC folks.

jeff
[Suggestion] about h8/300 architecture in gcc and binutils
Hello Maintainers:

After a Google search and checking the Linux kernel, H8/300 is dead, but gcc-4.9.0 and binutils-2.23.2 still have h8300 support. Do we still need it for another OS?

Any suggestions or corrections are welcome, thanks.

The related information in the linux-next tree:

commit d02babe847bf96b82b12cc4e4e90028ac3fac73f
Author: Guenter Roeck
Date: Fri Aug 30 06:01:49 2013 -0700

    Drop support for Renesas H8/300 (h8300) architecture

    H8/300 has been dead for several years, and the kernel for it
    has not compiled for ages. Drop support for it.

    Cc: Yoshinori Sato
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Guenter Roeck

The related information in gcc/binutils:

We can successfully build an h8300 cross-compiler for the Linux kernel, but it has many bugs when building the Linux kernel with -Os. If we still need h8300 for another OS, is it still valuable to report these bugs to Bugzilla (although they were found under Linux)?

Thanks.
--
Chen Gang
Re: [ping] [buildrobot] gcc/config/linux-android.c:40:7: error: ‘OPTION_BIONIC’ was not declared in this scope
On 7/09/2013, at 1:31 AM, Jan-Benedict Glaw wrote:
> On Mon, 2013-08-26 12:51:53 +0200, Jan-Benedict Glaw wrote:
>> On Tue, 2013-08-20 11:24:31 +0400, Alexander Ivchenko wrote:
>>> Hi, thanks for catching this.
>>>
>>> I certainly missed that OPTION_BIONIC is not defined for linux targets
>>> that do not include config/linux.h in their tm.h.
>>>
>>> This patch fixed build for powerpc64le-linux and mn10300-linux.
>>> linux_libc, LIBC_GLIBC, LIBC_BIONIC should be defined for all targets.
>> [...]
>
> Seems the commit at Thu Sep 5 13:01:35 2013 (CEST) fixed most of the
> fallout. Thanks!
>
>> mn10300-linux:
>> http://toolchain.lug-owl.de/buildbot/showlog.php?id=9657&mode=view
>
> This however still seems to have issues in a current build:
>
> http://toolchain.lug-owl.de/buildbot/showlog.php?id=10520&mode=view

Jan-Benedict,

mn10300-linux does not appear to support Linux. The mn10300-linux target specifier expands to mn10300-unknown-linux-gnu, where *-gnu implies using the glibc library, which doesn't have an mn10300 port.

Jeff,

You are the mn10300 maintainer; is building GCC for mn10300-unknown-linux-gnu supposed to work?

Thanks,

--
Maxim Kuvyrkov
www.kugelworks.com
RE: mips16 LRA vs reload - Excess reload registers
> -----Original Message-----
> From: Vladimir Makarov [mailto:vmaka...@redhat.com]
> Sent: 08 September 2013 17:51
> To: Matthew Fortune
> Cc: gcc@gcc.gnu.org; ber...@codesourcery.com
> Subject: Re: mips16 LRA vs reload - Excess reload registers
>
> On 13-08-23 5:26 AM, Matthew Fortune wrote:
> > Hi Vladimir,
> >
> > I've been working on code size improvements for mips16 and have been
> > pleased to see some improvement when switching to use LRA instead of
> > classic reload. At the same time, though, I have also seen some
> > differences between reload and LRA in terms of how efficiently reload
> > registers are reused.
> >
> > The trigger for LRA to underperform compared with classic reload is
> > when IRA allocates inappropriate registers and thus puts a lot of
> > stress on reloading. Mips16 showed this because it can only access a
> > small subset of the MIPS registers for general instructions. The
> > remaining MIPS registers are still available, as they can be accessed
> > by some special instructions and used via move instructions as
> > temporaries. In the current mips16 backend, register move costings
> > lead IRA to determine that although the preferred class for most
> > pseudos is M16_REGS, the allocno class ends up as GR_REGS. IRA then
> > resorts to allocating registers outside of M16_REGS more and more as
> > register pressure increases, even though this is fairly stupid.
> >
> > When using classic reload, the inappropriate register allocations are
> > effectively reverted, as the reload pseudos that get invented tend to
> > all converge on the same hard register, completely removing the
> > original pseudo. For LRA, the reloads tend to diverge, and different
> > hard registers are assigned to the reload pseudos, leaving us with two
> > new pseudos and the original: two extra move instructions and two
> > extra hard registers used. While I'm not saying it is LRA's fault for
> > not fixing this situation perfectly, it does seem that classic reload
> > is better at it.
> > I have found a potential solution to the original IRA register
> > allocation problem, but I think there may still be something to
> > address in LRA to improve this scenario anyway. My proposed solution
> > to the IRA problem for mips16 is to adjust register move costings such
> > that the total cost of moving between M16_REGS and GR_REGS and back is
> > more expensive than memory, but moving from GR_REGS to GR_REGS is
> > cheaper than memory (even though this is a bit weird, as you have to
> > go through an M16_REG to move from one GR_REG to another GR_REG).
> >
> > GR_REGS to GR_REGS has to be cheaper than memory, as it needs to be a
> > candidate pressure class, but the additional cost for M16->GR->M16
> > means that IRA does not use GR_REGS as an alternative class, and the
> > allocno class is just M16_REGS, as desired. This feels a bit like a
> > hack but may be the best solution. The hard register costings used
> > when allocating registers from an allocno class just don't seem to be
> > strong enough to prevent poor register allocation in this case; I
> > don't know if the hard register costs are supposed to resolve this
> > issue or if they are just about fine tuning.
> >
> > With the fix in place, LRA outperforms classic reload, which is
> > fantastic!
> >
> > I have a small(ish) test case for this and dumps for IRA, LRA and
> > classic reload, along with the patch to enable LRA for mips16. I can
> > also provide the fix to register costing that effectively avoids/hides
> > this problem for mips16. Should I post them here or put them in a
> > bugzilla ticket?
> >
> > Any advice on which area needs fixing would be welcome, and I am quite
> > happy to work on this given some direction. I suspect these issues are
> > relevant for any architecture that is not 100% orthogonal, which is
> > pretty much all of them, and particularly important for compressed
> > instruction sets.
>
> Sorry again that I did not find time to answer you earlier, Matt.
>
> Your hack could work.
> And I guess it is always worth posting the patch in public with
> examples of the generated code before and after the patch. Maybe some
> collective mind helps to figure out more what to do with the patch.

I'll post that shortly.

> But I guess there is still a thing to do. After constraining allocation
> only to MIPS16 regs, we still could use non-MIPS16 GR_REGS for storing
> values of less frequently used pseudos (as storing them in non-MIPS16
> GR_REGS is better than in memory). E.g. x86-64 LRA can use SSE regs for
> storing values of less frequently used pseudos requiring GENERAL_REGS.
> Please look at the spill_class target hook and its implementation for
> x86-64.

I have indeed implemented that for mips16 and found that not only does it help to enable the use of non-mips16 registers as spill_class registers, but including the mips16 call-clobbered registers is also worthwhile. It seems that the spill_class logic is able to find some instances where spilled pseudos could actually have been colored, and it effectively eliminates the reload. My original post was t
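The cost scheme Matthew describes can be illustrated with a toy model. This is not GCC's actual mips register-move-cost hook; the numbers and names below are invented purely to show the invariants the hack relies on: a round trip M16->GR->M16 must cost more than memory, while GR->GR (although physically routed through an M16 temporary) must stay cheaper than memory so GR_REGS remains a valid pressure class.

```c
#include <assert.h>

/* Toy register classes standing in for the mips16 situation.  */
enum toy_class { M16_REGS, GR_REGS };

/* Invented memory access cost, for comparison only.  */
#define TOY_MEMORY_COST 8

/* Toy move-cost function embodying the proposed costing:
   - M16 <-> M16 is cheap (native mips16 moves);
   - each M16 <-> GR hop costs 5, so a round trip is 10 > memory (8),
     which stops IRA from widening the allocno class to GR_REGS;
   - GR -> GR is 7 < memory (8), keeping GR_REGS a pressure-class
     candidate even though the move really goes via an M16 reg.  */
static int
toy_register_move_cost (enum toy_class from, enum toy_class to)
{
  if (from == M16_REGS && to == M16_REGS)
    return 2;
  if (from != to)
    return 5;
  return 7;
}
```

A real implementation would live in the target's register-move-cost hook; the point here is only the ordering of the three costs relative to memory.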
RFC: Inlines, LTO and GCC
A number of header files have inline functions declared in them. Some of these functions are actually quite large, and I doubt that inlining them is the right thing. For instance, tree-flow-inline.h has some quite large functions. Many of the op_iter* functions are 30-40 lines long, and get_addr_base_and_unit_offset_1() is 130 lines. Doesn't seem like it should be static inline! :-P During the process of re-factoring header files, it could be worthwhile to also move functions like this to a .c file...

I know a lot of work has gone in to the inliner and LTO, and I was wondering what its state was with regard to the current gcc source base. My questions are:

1) Is everyone in favour of moving these largish inlines out of header files and making them not inline?

2) What size of function is rational for inlining? Small ones, obviously, but where does the line start to get vague, and what would be a good litmus test for an inline? Functions which "do a lot" and look like they would use a number of registers seem like candidates to move... I think we have a lot of functions that end up being compiled quite large because they inline functions which inline functions which inline functions.

3) The significance of moving these out would be greatly reduced if GCC were produced with LTO... have we tried or considered doing this and possibly releasing gcc compiled this way? It seems to me we could have significantly less stuff in header files tagged as inline, but still see the benefit in our final product... maybe all we'd need is the teeny tiny ones... and let the machinery figure it all out. Now that would be sweet...

Andrew
Replacement of c99_runtime in testsuite
Hi,

I have a little question. Right now, internally in gcc, we flexibly check whether a particular function (or rather a "function class", which could easily be extended) is present in libc by calling the target hook libc_has_function. However, in the testsuite we still check whether full C99 runtime support is in place (in gcc.dg/builtins-config.h). And so some tests for some targets (e.g. gcc.dg/builtins-58.c for bionic) are unsupported right now, while actually they are OK.

I wonder whether we can somehow get the value of the libc_has_function hook for a particular function in a test, so as to flexibly decide whether the test should be unsupported or not? E.g. by adding some debug option to gcc that would return the result of the hook? But I doubt that such an option would be a pretty solution.

thanks
--Alexander
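The per-class query Alexander refers to can be sketched as follows. The enum values mirror the function classes GCC's libc_has_function hook distinguishes, but the toy hook itself and its answers (a bionic-like libc with C94 support but without the full C99 runtime) are illustrative assumptions, not a statement about what bionic actually provides.

```c
#include <assert.h>
#include <stdbool.h>

/* Function classes a libc may or may not provide, modeled on the
   classes GCC's libc_has_function target hook distinguishes.  */
enum function_class {
  function_c94,
  function_c99_misc,
  function_c99_math_complex,
  function_sincos
};

/* Toy hook for a hypothetical bionic-like libc: C94 functions are
   available, the full C99 runtime is not.  A testsuite that queried
   this per class, instead of requiring full c99_runtime, could keep
   tests like builtins-58.c supported where they actually work.  */
static bool
toy_libc_has_function (enum function_class fn_class)
{
  return fn_class == function_c94;
}
```

The open question in the mail is how to surface such a per-function answer to the DejaGnu testsuite; the sketch only shows the shape of the hook being queried.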
Re: [RFC] Vectorization of indexed elements
On Mon, 9 Sep 2013, Vidya Praveen wrote:

> Hello,
>
> This post details some thoughts on an enhancement to the vectorizer that
> could take advantage of the SIMD instructions that allow an indexed
> element as an operand, thus reducing the need for duplication and
> possibly improving reuse of previously loaded data. I'd appreciate your
> opinion on this.
>
> ---
>
> A phrase like this:
>
>   for (i = 0; i < 4; i++)
>     a[i] = b[i] <op> c[2];
>
> is usually vectorized as:
>
>   va:V4SI = a[0:3]
>   vb:V4SI = b[0:3]
>   t = c[2]
>   vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
>   ...
>   va:V4SI = vb:V4SI <op> vc:V4SI
>
> But this could be simplified further if a target has instructions that
> support an indexed element as a parameter. For example, an instruction
> like this:
>
>   mul v0.4s, v1.4s, v2.4s[2]
>
> can perform multiplication of each element of v1.4s with the third
> element of v2.4s (specified as v2.4s[2]) and store the results in the
> corresponding elements of v0.4s.
>
> For this to happen, the vectorizer needs to understand this idiom and
> treat the operand c[2] specially (taking into consideration whether the
> machine supports an indexed element as an operand for <op>, through a
> target hook or macro) and consider this a vectorizable statement without
> having to duplicate the elements explicitly.
>
> There are a few ways this could be represented in gimple:
>
>   ...
>   va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
>   ...
>
> or by allowing the vectorizer to treat an indexed element as a valid
> operand in a vectorizable statement:

Might as well allow any scalar then...

>   ...
>   va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
>   ...
>
> For the sake of explanation, the above two representations assume that
> c[0:3] is loaded in vc for some other use and reused here. But when c[2]
> is the only use of 'c', then it may be safer to just load one element
> and use it like this:
>
>   vc:V4SI[0] = c[2]
>   va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
>
> This could also mean that expressions involving a scalar could be
> treated similarly.
> For example,
>
>   for (i = 0; i < 4; i++)
>     a[i] = b[i] <op> c
>
> could be vectorized as:
>
>   vc:V4SI[0] = c
>   va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
>
> Such a change would also require new standard pattern names to be
> defined for each <op>. Alternatively, having something like this:
>
>   ...
>   vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
>   va:V4SI = vb:V4SI <op> vt:V4SI
>   ...
>
> would remove the need to introduce several new standard pattern names,
> having just one to represent vec_duplicate(vec_select()), but of course
> this will expect the target to have combiner patterns.

The cost estimation wouldn't be very good, but aren't combine patterns enough for the whole thing? Don't you model your mul instruction as:

  (mult:V4SI (match_operand:V4SI ...)
    (vec_duplicate:V4SI
      (vec_select:SI (match_operand:V4SI ...) (parallel [...]))))

anyway? It seems that combine should be able to handle it. What currently happens that we fail to generate the right instruction?

In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR for vec_duplicate; adding new nodes is always painful.

> This enhancement could possibly help further optimizing larger
> scenarios such as linear systems.
>
> Regards
> VP

--
Marc Glisse
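A combiner pattern of the shape Marc describes might look roughly like this in a machine description. AArch64 has patterns of this general form; the pattern name, constraints, and output template below are illustrative, not copied from any real backend:

```
;; Hedged sketch: multiply each element of operand 3 by element
;; [operand 2] of operand 1, matching the vec_duplicate(vec_select)
;; combination that combine would form from separate insns.
(define_insn "*mul3_elt_v4si"
  [(set (match_operand:V4SI 0 "register_operand" "=w")
        (mult:V4SI
          (vec_duplicate:V4SI
            (vec_select:SI
              (match_operand:V4SI 1 "register_operand" "w")
              (parallel [(match_operand:SI 2 "immediate_operand" "i")])))
          (match_operand:V4SI 3 "register_operand" "w")))]
  ""
  "mul\t%0.4s, %3.4s, %1.4s[%2]"
)
```

If the backend exposes such a pattern, combine can fuse the duplicate and the multiply without any new gimple nodes or standard pattern names, which is exactly Marc's point.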
[RFC] Vectorization of indexed elements
Hello,

This post details some thoughts on an enhancement to the vectorizer that could take advantage of the SIMD instructions that allow an indexed element as an operand, thus reducing the need for duplication and possibly improving reuse of previously loaded data. I'd appreciate your opinion on this.

---

A phrase like this:

  for (i = 0; i < 4; i++)
    a[i] = b[i] <op> c[2];

is usually vectorized as:

  va:V4SI = a[0:3]
  vb:V4SI = b[0:3]
  t = c[2]
  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
  ...
  va:V4SI = vb:V4SI <op> vc:V4SI

But this could be simplified further if a target has instructions that support an indexed element as a parameter. For example, an instruction like this:

  mul v0.4s, v1.4s, v2.4s[2]

can perform multiplication of each element of v1.4s with the third element of v2.4s (specified as v2.4s[2]) and store the results in the corresponding elements of v0.4s.

For this to happen, the vectorizer needs to understand this idiom and treat the operand c[2] specially (taking into consideration whether the machine supports an indexed element as an operand for <op>, through a target hook or macro) and consider this a vectorizable statement without having to duplicate the elements explicitly.

There are a few ways this could be represented in gimple:

  ...
  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  ...

or by allowing the vectorizer to treat an indexed element as a valid operand in a vectorizable statement:

  ...
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
  ...

For the sake of explanation, the above two representations assume that c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the only use of 'c', then it may be safer to just load one element and use it like this:

  vc:V4SI[0] = c[2]
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

This could also mean that expressions involving a scalar could be treated similarly.
For example,

  for (i = 0; i < 4; i++)
    a[i] = b[i] <op> c

could be vectorized as:

  vc:V4SI[0] = c
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

Such a change would also require new standard pattern names to be defined for each <op>. Alternatively, having something like this:

  ...
  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  va:V4SI = vb:V4SI <op> vt:V4SI
  ...

would remove the need to introduce several new standard pattern names, having just one to represent vec_duplicate(vec_select()), but of course this will expect the target to have combiner patterns.

This enhancement could possibly help further optimizing larger scenarios such as linear systems.

Regards
VP
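For concreteness, here is the source loop from the post with `<op>` taken to be multiplication. On a target with indexed-element multiply (e.g. AArch64's `mul v0.4s, v1.4s, v2.4s[2]`), a vectorizer aware of the idiom could avoid duplicating `c[2]` into a full vector:

```c
#include <assert.h>

/* The example loop: every element of b is combined with the single
   scalar element c[2].  With <op> = '*', this is the candidate for an
   indexed-element multiply instead of an explicit vec_duplicate.  */
void
mul_by_elt (int *a, const int *b, const int *c)
{
  for (int i = 0; i < 4; i++)
    a[i] = b[i] * c[2];
}
```

Whether the indexed form is actually emitted depends on the target patterns and the vectorizer or combine, as discussed in the thread; the C source is the same either way.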
Re: RFC: SIMD pragma independent of Cilk Plus / OpenMPv4
On Mon, Sep 09, 2013 at 10:18:20AM -0400, Tim Prince wrote:
> I pulled down an update of gcc gomp-4_0-branch yesterday and see in
> the not-yet-working additions to gcc testsuite there appears to be a
> move toward adding more cilkplus clauses to omp simd, such as
> firstprivate lastprivate (which are accepted but apparently ignored
> in the Intel omp simd implementation).

lastprivate is a valid OpenMP 4.0 #pragma omp simd clause; just firstprivate is not, and that one is easy to support for the Cilk+ #pragma simd, which allows it.

Jakub
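A minimal example of the lastprivate clause Jakub mentions, which OpenMP 4.0 does allow on #pragma omp simd. Without -fopenmp the pragma is simply ignored and the loop computes the same result sequentially:

```c
#include <assert.h>

/* lastprivate(last) gives 'last' the value from the logically final
   iteration after the simd loop, matching sequential semantics.  */
int
last_square (int n)
{
  int last = 0;
#pragma omp simd lastprivate(last)
  for (int i = 0; i < n; i++)
    last = i * i;
  return last;
}
```

firstprivate, by contrast, is the clause that is not part of the OpenMP 4.0 simd construct, only of Cilk+ #pragma simd.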
Re: RFC: SIMD pragma independent of Cilk Plus / OpenMPv4
On 9/9/2013 9:37 AM, Tobias Burnus wrote:
> Dear all,
>
> sometimes it can be useful to annotate loops for better vectorization,
> which is rather independent from parallelization. For vectorization,
> GCC has [0]:
>
> a) Cilk Plus's #pragma simd [1]
> b) OpenMP 4.0's #pragma omp simd [2]
>
> Those require -fcilkplus and -fopenmp, respectively, and activate much
> more. The question is whether it makes sense to provide a means to ask
> the compiler for SIMD vectorization without enabling all the other
> things of Cilk Plus/OpenMP.
>
> What's your opinion? [If one provides it, the questions are whether it
> is always on or not, which syntax/semantics it uses (e.g. just the one
> of Cilk or OpenMP) and what to do with conflicting pragmas, which can
> occur in this case.]
>
> Side remark: For vectorization, the widely supported #pragma ivdep,
> vector, novector can also be useful, even if they are less formally
> defined. "ivdep" seems to be one of the more useful ones, whose
> semantics one can map to a safelen of infinity in OpenMP's semantics
> [i.e. loop->safelen = INT_MAX].
>
> Tobias
>
> [0] In the trunk there is currently only some initial middle-end
> support. OpenMP's omp simd is in the gomp-4_0-branch; Cilk Plus's simd
> has been submitted for the trunk at
> http://gcc.gnu.org/ml/gcc-patches/2013-08/msg01626.html
> [1] http://www.cilkplus.org/download#open-specification
> [2] http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf

ifort/icc have a separate option, -openmp-simd, for the purpose of activating omp simd directives without invoking OpenMP. In the previous release, in order to activate both OpenMP parallel and omp simd, both options were required (-openmp -openmp-simd). In the new "SP1" release last week, -openmp implies -openmp-simd. Last time I checked, turning off the options did not cause the compiler to accept but ignore all omp simd directives, as I personally thought would be desirable. A few cases are active regardless of compile-line options, but many will be rejected without matching options.
Current Intel implementations of safelen will fail to vectorize and give notice if the value is set unnecessarily large. It has been agreed that increasing the safelen value beyond the optimum level should not turn off vectorization. safelen(32) is optimum for several float/single-precision cases in the Intel(r) Xeon Phi(tm) cross compiler; needless to say, safelen(8) is sufficient for 128-bit SSE2.

I pulled down an update of the gcc gomp-4_0-branch yesterday, and in the not-yet-working additions to the gcc testsuite there appears to be a move toward adding more cilkplus clauses to omp simd, such as firstprivate and lastprivate (which are accepted but apparently ignored in the Intel omp simd implementation).

I'll be discussing in a meeting later today my effort to publish material including discussion of OpenMP 4.0 implementations.

--
Tim Prince
RFC: SIMD pragma independent of Cilk Plus / OpenMPv4
Dear all,

sometimes it can be useful to annotate loops for better vectorization, which is rather independent from parallelization. For vectorization, GCC has [0]:

a) Cilk Plus's #pragma simd [1]
b) OpenMP 4.0's #pragma omp simd [2]

Those require -fcilkplus and -fopenmp, respectively, and activate much more. The question is whether it makes sense to provide a means to ask the compiler for SIMD vectorization without enabling all the other things of Cilk Plus/OpenMP.

What's your opinion? [If one provides it, the questions are whether it is always on or not, which syntax/semantics it uses (e.g. just the one of Cilk or OpenMP) and what to do with conflicting pragmas, which can occur in this case.]

Side remark: For vectorization, the widely supported #pragma ivdep, vector, novector can also be useful, even if they are less formally defined. "ivdep" seems to be one of the more useful ones, whose semantics one can map to a safelen of infinity in OpenMP's semantics [i.e. loop->safelen = INT_MAX].

Tobias

[0] In the trunk there is currently only some initial middle-end support. OpenMP's omp simd is in the gomp-4_0-branch; Cilk Plus's simd has been submitted for the trunk at http://gcc.gnu.org/ml/gcc-patches/2013-08/msg01626.html
[1] http://www.cilkplus.org/download#open-specification
[2] http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
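The safelen semantics Tobias maps ivdep onto can be shown with a small example. The loop below has a dependence distance of exactly K, so safelen(K) truthfully promises that vector lanes up to K wide are safe; #pragma ivdep would make the stronger (effectively infinite-safelen) promise. The function and constant are invented for illustration; without -fopenmp the pragma is ignored and the loop runs sequentially with the same result:

```c
#include <assert.h>

#define K 8

/* a[i] reads a[i + K], so iterations i and i + K are dependent, but
   any K consecutive iterations are independent: safelen(K) holds.
   The caller must provide n + K accessible elements in a[].  */
void
shift_add (int *a, int n)
{
#pragma omp simd safelen(K)
  for (int i = 0; i < n; i++)
    a[i] = a[i + K] + 1;
}
```

Per Tim's remark in the thread, choosing safelen larger than the hardware vector width should not be penalized; here K is simply the largest value that is semantically true.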