Re: [Suggestion] about h8/300 architecture in gcc and binutils

2013-09-09 Thread Chen Gang
On 09/10/2013 10:19 AM, Jeff Law wrote:
> On 09/09/2013 07:13 PM, Chen Gang wrote:
>> Hello Maintainers:
>>
>> After a Google search and a check of the Linux kernel, H8/300 is dead,
>> but gcc-4.9.0 and binutils-2.23.2 still have h8300 support. Do we still
>> need it for another OS?
>>
>> Any suggestions or corrections are welcome, thanks.
>>
>>
>> The related information in linux kernel next tree:
>>
>>commit d02babe847bf96b82b12cc4e4e90028ac3fac73f
>>Author: Guenter Roeck 
>>Date:   Fri Aug 30 06:01:49 2013 -0700
>>
>>Drop support for Renesas H8/300 (h8300) architecture
>>
>>H8/300 has been dead for several years, and the kernel for it
>>has not compiled for ages. Drop support for it.
>>
>>Cc: Yoshinori Sato 
>>Acked-by: Greg Kroah-Hartman 
>>Signed-off-by: Guenter Roeck 
>>
>>
>> The related information in gcc/binutils:
>>
>>We can build an h8300 cross-compiler for the Linux kernel successfully,
>> but it shows many bugs when building the Linux kernel with -Os.
>>If we still need h8300 for another OS, is it still worthwhile to report
>> these bugs to Bugzilla (even though they were found under Linux)?
> It is still useful to send code generation bugs for the H8/300 series to
> the GCC folks.
> 

OK, thanks. I will wait 1-2 days so that other members can share their
opinions.

If there are no additional opinions, I will report the bugs to Bugzilla and
try to continue working with the related members (although I am a newbie at
compiler and binutils programming).

> jeff
> 
> 
> 

Thanks.
-- 
Chen Gang


Re: RFC: Inlines, LTO and GCC

2013-09-09 Thread Jeff Law

On 09/09/2013 02:45 PM, Andrew MacLeod wrote:

A number of header files have inline functions declared in them. Some of
these functions are actually quite large, and I doubt that inlining them
is the right thing.   For instance, tree-flow-inline.h has some quite
large functions.  Many of the op_iter* functions are 30-40 lines long,
and get_addr_base_and_unit_offset_1() is 130 lines.  Doesn't seem like
it should be static inline! :-P

During the process of re-factoring header files, it could be worthwhile
to also move  functions like this to a .c file...

I know a lot of work has gone into the inliner and LTO, and I was
wondering what its state is with regard to the current gcc source base.

My questions are:

1) Is everyone in favour of moving these largish inlines out of header
files and making them not inline?
2) What size of function is reasonable for inlining? Small ones obviously,
but where does the line start to get vague, and what would be a good
litmus test for an inline?  Functions which "do a lot" and
look like they would use a number of registers seem like candidates to
move..   I think we have a lot of functions that end up being compiled
quite large because they inline functions which inline functions which
inline functions
3) The significance of moving these out would be greatly reduced if GCC
were produced with LTO.. have we tried or considered doing this and
possibly releasing gcc compiled this way?  It seems to me we could have
significantly  less stuff in header files tagged as inline, but still
see the benefit in our final product...   maybe all we'd need is the
teeny tiny ones... and let the machinery figure it all out.  Now that
would be sweet...
Unless we have evidence to show inlining a nontrivial function is a 
performance win, my inclination is not to have them in .h files decorated 
with inline directives.  Instead, put them back in a .c file where they 
belong and let LTO do its thing.


I haven't done any research, but I suspect once you go beyond the 
trivial functions, size is no longer a good indicator of whether or not 
something should be inlined.  Instead I suspect the question should be: 
if I inline this nontrivial code, how much code either in the caller or 
the inlined callee gets simplified away?   Of course, that's not always 
an easy question to answer :-)


Jeff


Re: [Suggestion] about h8/300 architecture in gcc and binutils

2013-09-09 Thread Jeff Law

On 09/09/2013 07:13 PM, Chen Gang wrote:

Hello Maintainers:

After a Google search and a check of the Linux kernel, H8/300 is dead,
but gcc-4.9.0 and binutils-2.23.2 still have h8300 support. Do we still
need it for another OS?

Any suggestions or corrections are welcome, thanks.


The related information in linux kernel next tree:

   commit d02babe847bf96b82b12cc4e4e90028ac3fac73f
   Author: Guenter Roeck 
   Date:   Fri Aug 30 06:01:49 2013 -0700

   Drop support for Renesas H8/300 (h8300) architecture

   H8/300 has been dead for several years, and the kernel for it
   has not compiled for ages. Drop support for it.

   Cc: Yoshinori Sato 
   Acked-by: Greg Kroah-Hartman 
   Signed-off-by: Guenter Roeck 


The related information in gcc/binutils:

   We can build an h8300 cross-compiler for the Linux kernel successfully, but 
it shows many bugs when building the Linux kernel with -Os.
   If we still need h8300 for another OS, is it still worthwhile to report 
these bugs to Bugzilla (even though they were found under Linux)?
It is still useful to send code generation bugs for the H8/300 series to 
the GCC folks.


jeff



[Suggestion] about h8/300 architecture in gcc and binutils

2013-09-09 Thread Chen Gang
Hello Maintainers:

After a Google search and a check of the Linux kernel, H8/300 is dead,
but gcc-4.9.0 and binutils-2.23.2 still have h8300 support. Do we still
need it for another OS?

Any suggestions or corrections are welcome, thanks.


The related information in linux kernel next tree:

  commit d02babe847bf96b82b12cc4e4e90028ac3fac73f
  Author: Guenter Roeck 
  Date:   Fri Aug 30 06:01:49 2013 -0700
  
  Drop support for Renesas H8/300 (h8300) architecture
  
  H8/300 has been dead for several years, and the kernel for it
  has not compiled for ages. Drop support for it.
  
  Cc: Yoshinori Sato 
  Acked-by: Greg Kroah-Hartman 
  Signed-off-by: Guenter Roeck 


The related information in gcc/binutils:

  We can build an h8300 cross-compiler for the Linux kernel successfully, but 
it shows many bugs when building the Linux kernel with -Os.
  If we still need h8300 for another OS, is it still worthwhile to report 
these bugs to Bugzilla (even though they were found under Linux)?


Thanks.
--
Chen Gang


Re: [ping] [buildrobot] gcc/config/linux-android.c:40:7: error: ‘OPTION_BIONIC’ was not declared in this scope

2013-09-09 Thread Maxim Kuvyrkov
On 7/09/2013, at 1:31 AM, Jan-Benedict Glaw wrote:

> On Mon, 2013-08-26 12:51:53 +0200, Jan-Benedict Glaw  
> wrote:
>> On Tue, 2013-08-20 11:24:31 +0400, Alexander Ivchenko  
>> wrote:
>>> Hi, thanks for catching this.
>>> 
>>> I certainly missed that OPTION_BIONIC is not defined for linux targets
>>> that do not include config/linux.h in their tm.h.
>>> 
>>> This patch fixed build for powerpc64le-linux and mn10300-linux.
>>> linux_libc, LIBC_GLIBC, LIBC_BIONIC should be defined for all targets.
>> [...]
> 
> Seems the commit at Thu Sep 5 13:01:35 2013 (CEST) fixed most of the
> fallout.  Thanks!
> 
>> mn10300-linux: 
>> http://toolchain.lug-owl.de/buildbot/showlog.php?id=9657&mode=view
> 
> This however still seems to have issues in a current build:
> 
>   http://toolchain.lug-owl.de/buildbot/showlog.php?id=10520&mode=view

Jan-Benedict,

mn10300-linux does not appear to support Linux.  The mn10300-linux target 
specifier expands into mn10300-unknown-linux-gnu, where *-gnu implies using 
the glibc library, which does not have an mn10300 port.

Jeff,

You are the mn10300 maintainer; is building GCC for mn10300-unknown-linux-gnu 
supposed to work?

Thanks,

--
Maxim Kuvyrkov
www.kugelworks.com



RE: mips16 LRA vs reload - Excess reload registers

2013-09-09 Thread Matthew Fortune


> -Original Message-
> From: Vladimir Makarov [mailto:vmaka...@redhat.com]
> Sent: 08 September 2013 17:51
> To: Matthew Fortune
> Cc: gcc@gcc.gnu.org; ber...@codesourcery.com
> Subject: Re: mips16 LRA vs reload - Excess reload registers
> 
> On 13-08-23 5:26 AM, Matthew Fortune wrote:
> > Hi Vladimir,
> >
> > I've been working on code size improvements for mips16 and have been
> pleased to see some improvement when switching to use LRA instead of
> classic reload. At the same time though I have also seen some differences
> between reload and LRA in terms of how efficiently reload registers are
> reused.
> >
> > The trigger for LRA to underperform compared with classic reload is when
> IRA allocates inappropriate registers and thus puts a lot of stress on
> reloading. Mips16 showed this because it can only access a small subset of
> the MIPS registers for general instructions. The remaining MIPS registers are
> still available as they can be accessed by some special instructions and used
> via move instructions as temporaries. In the current mips16 backend,
> register move costings lead IRA to determine that although the preferred
> class for most pseudos is M16_REGS, the allocno class ends up as GR_REGS.
> IRA then resorts to allocating registers outside of M16_REGS more and more
> as register pressure increases, even though this is fairly stupid.
> >
> > When using classic reload the inappropriate register allocations are
> effectively reverted as the reload pseudos that get invented tend to all
> converge on the same hard register completely removing the original
> pseudo. For LRA the reloads tend to diverge and different hard registers are
> assigned to the reload pseudos leaving us with two new pseudos and the
> original. Two extra move instructions and two extra hard registers used.
> While I'm not saying it is LRA's fault for not fixing this situation 
> perfectly it
> does seem that classic reload is better at it.
> >
> > I have found a potential solution to the original IRA register allocation
> problem but I think there may still be something to address in LRA to
> improve this scenario anyway. My proposed solution to the IRA problem for
> mips16 is to adjust register move costings such that the total of moving
> between M16_REGS and GR_REGS and back is more expensive than memory,
> but moving from GR_REGS to GR_REGS is cheaper than memory (even
> though this is a bit weird as you have to go through an M16_REG to move
> from one GR_REG to another GR_REG).
> >
> > GR_REGS to GR_REGS has to be cheaper than memory as it needs to be a
> candidate pressure class but the additional cost for M16->GR->M16 means
> that IRA does not use GR_REGS as an alternative class and the allocno class is
> just M16_REGS as desired. This feels a bit like a hack but may be the best
> solution. The hard register costings used when allocating registers from an
> allocno class just don't seem to be strong enough to prevent poor register
> allocation in this case, I don't know if the hard register costs are supposed 
> to
> resolve this issue or if they are just about fine tuning.
> >
> > With the fix in place, LRA outperforms classic reload which is fantastic!
> >
> > I have a small(ish) test case for this and dumps for IRA, LRA and classic
> reload along with the patch to enable LRA for mips16. I can also provide the
> fix to register costing that effectively avoids/hides this problem for mips16.
> Should I post them here or put them in a bugzilla ticket?
> >
> > Any advice on which area needs fixing would be welcome and I am quite
> happy to work on this given some direction. I suspect these issues are
> relevant for any architecture that is not 100% orthogonal which is pretty
> much all and particularly important for compressed instruction sets.
> >
> Sorry again that I did not find time to answer you earlier, Matt.
> 
> Your hack could work.  And I guess it is always worth posting the patch
> publicly with examples of the generated code before and after the patch.
> Maybe the collective mind can help figure out what more to do with the
> patch.

I'll post that shortly.
 
> But I guess there is still a thing to do. After constraining allocation only 
> to
> MIPS16 regs we still could use non-MIPS16 GR_REGS for storing values of
> less frequently used pseudos (as storing them in non-MIPS16 GR_REGS is
> better than in memory).  E.g. x86-64 LRA can use SSE regs for storing values
> of less frequently used pseudos requiring GENERAL_REGS.
> Please look at spill_class target hook and its implementation for x86-64.

I have indeed implemented that for mips16 and found that not only is it 
helpful to enable the use of non-mips16 registers as spill_class registers, 
but including the mips16 call-clobbered registers is also worthwhile. It seems 
that the spill_class logic is able to find some instances where spilled 
pseudos could actually have been colored, effectively eliminating the reload.
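For readers unfamiliar with the hook, here is a hypothetical sketch of what a mips16 spill_class implementation could look like. This is not the actual patch: the condition shown is illustrative, and only TARGET_MIPS16, M16_REGS, and GR_REGS are existing mips backend names.

```c
/* Hypothetical sketch, not the actual patch: tell LRA that pseudos
   requiring M16_REGS may be kept in the remaining general registers
   instead of being spilled to memory.  Compare the x86-64
   implementation, which returns SSE regs for GENERAL_REGS pseudos.  */
static reg_class_t
mips_spill_class (reg_class_t rclass, machine_mode mode ATTRIBUTE_UNUSED)
{
  if (TARGET_MIPS16 && reg_class_subset_p (rclass, M16_REGS))
    return GR_REGS;   /* spill to non-MIPS16 GPRs via moves */
  return NO_REGS;     /* otherwise fall back to memory */
}

#undef TARGET_SPILL_CLASS
#define TARGET_SPILL_CLASS mips_spill_class
```

This fragment only illustrates the shape of the hook; it is not standalone code and would live in the mips backend next to the other target macros.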

My original post was t

RFC: Inlines, LTO and GCC

2013-09-09 Thread Andrew MacLeod
A number of header files have inline functions declared in them. Some of 
these functions are actually quite large, and I doubt that inlining them 
is the right thing.   For instance, tree-flow-inline.h has some quite 
large functions.  Many of the op_iter* functions are 30-40 lines long, 
and get_addr_base_and_unit_offset_1() is 130 lines.  Doesn't seem like 
it should be static inline! :-P


During the process of re-factoring header files, it could be worthwhile 
to also move  functions like this to a .c file...


I know a lot of work has  going in to the inliner and LTO, and I was 
wondering what its state was with regards to the current gcc source base.


My questions are:

1) is everyone in favour of moving these largish inlines out of header 
files and making them not inline,
2) what size of function is rationale for inlining. Small one obviously, 
but where does the line start to get vague, and what would be a good 
rationale litmus test for an inline?  Functions which "do a lot" and 
look like they would use a number of registers seem like candidates to 
move..   I think we have a lot of functions that end up being compiled 
quite large because they inline functions which inline functions which 
inline functions
3) The significance of moving these out would be greatly reduced if GCC 
were produced with LTO.. have we tried or considered doing this and 
possibly releasing gcc compiled this way?  It seems to me we could have 
significantly  less stuff in header files tagged as inline, but still 
see the benefit in our final product...   maybe all we'd need is the 
teeny tiny ones... and let the machinery figure it all out.  Now that 
would be sweet...


Andrew


Replacement of c99_runtime in testsuite

2013-09-09 Thread Alexander Ivchenko
Hi, I have a little question

Right now internally in gcc we flexibly check whether a particular
function (or rather "function class", which could be easily extended)
is present or not in libc by calling target hook "libc_has_function",
however in the testsuite for c99 runtime we still check whether the
full support of it is in place (in gcc.dg/builtins-config.h).

And so some tests for some targets (e.g. gcc.dg/builtins-58.c for
bionic) are unsupported right now, while actually they are OK.

I wonder may be we can somehow get the value of libc_has_function hook
for a particular function in the test so to flexibly define whether
the test should be unsupported or not?
E.g. by adding some debug options for gcc that will return the result
of the hook? But I doubt that such option would be a pretty solution

thanks
--Alexander


Re: [RFC] Vectorization of indexed elements

2013-09-09 Thread Marc Glisse

On Mon, 9 Sep 2013, Vidya Praveen wrote:


Hello,

This post details some thoughts on an enhancement to the vectorizer that
could take advantage of SIMD instructions that allow an indexed element
as an operand, thus reducing the need for duplication and possibly
improving reuse of previously loaded data.

Appreciate your opinion on this.

---

A phrase like this:

for(i=0;i<4;i++)
  a[i] = b[i] <op> c[2];

is usually vectorized as:

 va:V4SI = a[0:3]
 vb:V4SI = b[0:3]
 t = c[2]
 vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
 ...
 va:V4SI = vb:V4SI <op> vc:V4SI

But this could be simplified further if a target has instructions that support
indexed element as a parameter. For example an instruction like this:

 mul v0.4s, v1.4s, v2.4s[2]

can multiply each element of v1.4s by the third element of
v2.4s (specified as v2.4s[2]) and store the results in the corresponding
elements of v0.4s.

For this to happen, the vectorizer needs to understand this idiom and treat
the operand c[2] specially (taking into consideration whether the machine
supports an indexed element as an operand for <op>, through a target hook or
macro) and consider this a vectorizable statement without having to duplicate
the elements explicitly.

There are a few ways this could be represented in gimple:

 ...
 va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
 ...

or by allowing the vectorizer to treat an indexed element as a valid operand
in a vectorizable statement:


Might as well allow any scalar then...


 ...
 va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
 ...

For the sake of explanation, the above two representations assume that
c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
only use of 'c' then it may be safer to just load one element and use it like
this:

 vc:V4SI[0] = c[2]
 va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

This could also mean that expressions involving a scalar could be treated
similarly. For example,

 for(i=0;i<4;i++)
   a[i] = b[i] <op> c

could be vectorized as:

 vc:V4SI[0] = c
 va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

Such a change would also require new standard pattern names to be defined for
each <op>.

Alternatively, having something like this:

 ...
 vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
 va:V4SI = vb:V4SI <op> vt:V4SI
 ...

would remove the need to introduce several new standard pattern names but have
just one to represent vec_duplicate(vec_select()), but of course this will expect
the target to have combiner patterns.


The cost estimation wouldn't be very good, but aren't combine patterns 
enough for the whole thing? Don't you model your mul instruction as:


(mult:V4SI
  (match_operand:V4SI)
  (vec_duplicate:V4SI (vec_select:SI (match_operand:V4SI

anyway? It seems that combine should be able to handle it. Why do we 
currently fail to generate the right instruction?


In gimple, we already have BIT_FIELD_REF for vec_select and CONSTRUCTOR 
for vec_duplicate; adding new nodes is always painful.



This enhancement could possibly help further optimizing larger scenarios such
as linear systems.

Regards
VP


--
Marc Glisse


[RFC] Vectorization of indexed elements

2013-09-09 Thread Vidya Praveen
Hello,

This post details some thoughts on an enhancement to the vectorizer that 
could take advantage of SIMD instructions that allow an indexed element
as an operand, thus reducing the need for duplication and possibly
improving reuse of previously loaded data.

Appreciate your opinion on this. 

--- 

A phrase like this:

 for(i=0;i<4;i++)
   a[i] = b[i] <op> c[2];

is usually vectorized as:

  va:V4SI = a[0:3]
  vb:V4SI = b[0:3]
  t = c[2]
  vc:V4SI = { t, t, t, t } // typically expanded as vec_duplicate at vec_init
  ...
  va:V4SI = vb:V4SI <op> vc:V4SI

But this could be simplified further if a target has instructions that support
indexed element as a parameter. For example an instruction like this:

  mul v0.4s, v1.4s, v2.4s[2]

can multiply each element of v1.4s by the third element of
v2.4s (specified as v2.4s[2]) and store the results in the corresponding 
elements of v0.4s. 

For this to happen, the vectorizer needs to understand this idiom and treat 
the operand c[2] specially (taking into consideration whether the machine 
supports an indexed element as an operand for <op>, through a target hook or
macro) and consider this a vectorizable statement without having to duplicate 
the elements explicitly. 

There are a few ways this could be represented in gimple:

  ...
  va:V4SI = vb:V4SI <op> VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  ...

or by allowing the vectorizer to treat an indexed element as a valid operand 
in a vectorizable statement:

  ...
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 2)
  ...

For the sake of explanation, the above two representations assume that 
c[0:3] is loaded in vc for some other use and reused here. But when c[2] is the
only use of 'c' then it may be safer to just load one element and use it like
this:

  vc:V4SI[0] = c[2]
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)

This could also mean that expressions involving a scalar could be treated 
similarly. For example,

  for(i=0;i<4;i++)
a[i] = b[i] <op> c

could be vectorized as:

  vc:V4SI[0] = c
  va:V4SI = vb:V4SI <op> VEC_SELECT_EXPR (vc:V4SI 0)
  
Such a change would also require new standard pattern names to be defined for
each <op>.

Alternatively, having something like this:

  ...
  vt:V4SI = VEC_DUPLICATE_EXPR (VEC_SELECT_EXPR (vc:V4SI 2))
  va:V4SI = vb:V4SI <op> vt:V4SI
  ...

would remove the need to introduce several new standard pattern names but have
just one to represent vec_duplicate(vec_select()), but of course this will expect
the target to have combiner patterns.

This enhancement could possibly help further optimizing larger scenarios such 
as linear systems.

Regards
VP





Re: RFC: SIMD pragma independent of Cilk Plus / OpenMPv4

2013-09-09 Thread Jakub Jelinek
On Mon, Sep 09, 2013 at 10:18:20AM -0400, Tim Prince wrote:
> I pulled down an update of gcc gomp-4_0-branch yesterday and see in
> the not-yet-working additions to gcc testsuite there appears to be a
> move toward adding more cilkplus clauses to omp simd, such as
> firstprivate lastprivate (which are accepted but apparently ignored
> in the Intel omp simd implementation).

lastprivate is a valid OpenMP 4.0 #pragma omp simd clause; only firstprivate
is not, and that one is easy to support for the Cilk+ #pragma simd, which
allows it.

Jakub


Re: RFC: SIMD pragma independent of Cilk Plus / OpenMPv4

2013-09-09 Thread Tim Prince

On 9/9/2013 9:37 AM, Tobias Burnus wrote:

Dear all,

sometimes it can be useful to annotate loops for better vectorization,
which is rather independent from parallelization.

For vectorization, GCC has [0]:
a) Cilk Plus's  #pragma simd  [1]
b) OpenMP 4.0's #pragma omp simd [2]

Those require -fcilkplus and -fopenmp, respectively, and activate much
more. The question is whether it makes sense to provide a means to ask
the compiler for SIMD vectorization without enabling all the other things
of Cilk Plus/OpenMP. What's your opinion?

[If one provides it, the question is whether it is always on or not,
which syntax/semantics it uses [e.g. just the one of Cilk or OpenMP]
and what to do with conflicting pragmas which can occur in this case.]


Side remark: For vectorization, the widely supported #pragma ivdep,
vector, novector can be also useful, even if they are less formally
defined. "ivdep" seems to be one of the more useful ones, whose
semantics one can map to a safelen of infinity in OpenMP's semantics
[i.e. loop->safelen = INT_MAX].

Tobias

[0] In the trunk is currently only some initial middle-end support.
OpenMP's omp simd is in the gomp-4_0-branch; Cilk Plus's simd has been
submitted for the trunk at
http://gcc.gnu.org/ml/gcc-patches/2013-08/msg01626.html
[1] http://www.cilkplus.org/download#open-specification
[2] http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
ifort/icc have a separate option -openmp-simd for the purpose of 
activating omp simd directives without invoking OpenMP.  In the previous 
release, in order to activate both OpenMP parallel and omp simd, both 
options were required (-openmp -openmp-simd).  In the new "SP1" release 
last week, -openmp implies -openmp-simd.  Last time I checked, turning 
off the options did not cause the compiler to accept but ignore all omp 
simd directives, as I personally thought would be desirable.  A few 
cases are active regardless of compile line option, but many will be 
rejected without matching options.


Current Intel implementations of safelen will fail to vectorize and give 
notice if the value is set unnecessarily large.  It's been agreed that 
increasing the safelen value beyond the optimum level should not turn 
off vectorization.  safelen(32) is optimum for several float/single 
precision cases in the Intel(r) Xeon Phi(tm) cross compiler; needless to 
say, safelen(8) is sufficient for 128-bit SSE2.


I pulled down an update of gcc gomp-4_0-branch yesterday and see in the 
not-yet-working additions to gcc testsuite there appears to be a move 
toward adding more cilkplus clauses to omp simd, such as firstprivate 
lastprivate (which are accepted but apparently ignored in the Intel omp 
simd implementation).
I'll be discussing in a meeting later today my effort to publish 
material including discussion of OpenMP 4.0 implementations.


--
Tim Prince



RFC: SIMD pragma independent of Cilk Plus / OpenMPv4

2013-09-09 Thread Tobias Burnus
Dear all,

sometimes it can be useful to annotate loops for better vectorization,
which is rather independent from parallelization.

For vectorization, GCC has [0]:
a) Cilk Plus's  #pragma simd  [1]
b) OpenMP 4.0's #pragma omp simd [2]

Those require -fcilkplus and -fopenmp, respectively, and activate much
more. The question is whether it makes sense to provide a means to ask
the compiler for SIMD vectorization without enabling all the other things
of Cilk Plus/OpenMP. What's your opinion?

[If one provides it, the question is whether it is always on or not,
which syntax/semantics it uses [e.g. just the one of Cilk or OpenMP]
and what to do with conflicting pragmas which can occur in this case.]


Side remark: For vectorization, the widely supported #pragma ivdep,
vector, novector can be also useful, even if they are less formally
defined. "ivdep" seems to be one of the more useful ones, whose
semantics one can map to a safelen of infinity in OpenMP's semantics
[i.e. loop->safelen = INT_MAX].

Tobias

[0] In the trunk is currently only some initial middle-end support.
OpenMP's omp simd is in the gomp-4_0-branch; Cilk Plus's simd has been
submitted for the trunk at
http://gcc.gnu.org/ml/gcc-patches/2013-08/msg01626.html
[1] http://www.cilkplus.org/download#open-specification
[2] http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf