autovectorization in gcc

2019-01-09 Thread Kay F. Jahnke

Hi there!

I am developing software which tries to deliberately exploit the 
compiler's autovectorization facilities by feeding data in 
autovectorization-friendly loops. I'm currently using both g++ and 
clang++ to see how well this approach works. Using simple arithmetic, I 
often get good results. To widen the scope of my work, I was looking for 
documentation on which constructs would be recognized by the 
autovectorization stage, and found


https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html

By the looks of it, this document has not seen any changes for several 
years. Has development on the autovectorization stage stopped, or is 
there simply no documentation?


In my experience, vectorization is essential to speed up arithmetic on 
the CPU, and reliable recognition of vectorization opportunities by the 
compiler can provide vectorization to programs which don't bother to 
code it explicitly. I feel the topic is being neglected - at least the 
documentation I found suggests this. To demonstrate what I mean, I have 
two concrete scenarios which I'd like to be handled by the 
autovectorization stage:


- gather/scatter with arbitrary indexes

In C, this would be loops like

// gather from B to A using gather indexes

for ( int i = 0 ; i < vsz ; i++ )
  A [ i ] = B [ indexes [ i ] ] ;

From the AVX2 ISA onwards, there are hardware gather/scatter 
operations, which can speed things up a good deal.


- repeated use of vectorizable functions

for ( int i = 0 ; i < vsz ; i++ )
  A [ i ] = sqrt ( B [ i ] ) ;

Here, replacing the repeated call of sqrt with the vectorized equivalent 
gives a dramatic speedup (ca. 4X)
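Here is a self-contained version of the two loops above, offered as a sketch. The function names and the `__restrict` qualifiers are assumptions and do not come from the original post; `__restrict` is a GCC/Clang extension that promises the arrays do not overlap. Compiling with something like `g++ -O3 -mavx2 -fopt-info-vec` reports which loops GCC vectorized. Whether the first loop uses AVX2 gather instructions depends on the GCC version and its cost model. The second loop normally also needs `-fno-math-errno` (implied by `-ffast-math`/`-Ofast`), as discussed further down the thread.

```cpp
#include <cmath>

// Gather from B to A using gather indexes: A[i] = B[indexes[i]].
void gather(float* __restrict A, const float* __restrict B,
            const int* __restrict indexes, int vsz)
{
  for (int i = 0; i < vsz; i++)
    A[i] = B[indexes[i]];
}

// Repeated use of a vectorizable function: element-wise square root.
void sqrt_all(float* __restrict A, const float* __restrict B, int vsz)
{
  for (int i = 0; i < vsz; i++)
    A[i] = std::sqrt(B[i]);
}
```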


If the compiler were to provide the autovectorization facilities, and if 
the patterns it recognizes were well-documented, users could rely on 
certain code patterns being recognized and autovectorized - sort of a 
contract between the user and the compiler. With a well-chosen spectrum 
of patterns, this would make it unnecessary to have to rely on explicit 
vectorization in many cases. My hope is that such an interface would 
help vectorization to become more frequently used - as I understand the 
status quo, this is still a niche topic, even though many processors 
provide suitable hardware nowadays.


Can you point me to where 'the action is' in this regard?

With regards

Kay F. Jahnke




[RFC] Update Stage 4 description

2019-01-09 Thread Tom de Vries
[ To revisit https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00385.html ]

The current formulation for the description of Stage 4 here (
https://gcc.gnu.org/develop.html ) is:
...
During this period, the only (non-documentation) changes that may be
made are changes that fix regressions.

Other changes may not be done during this period.

Note that the same constraints apply to release branches.

This period lasts until stage 1 opens for the next release.
...

This updated formulation was proposed by Richi (with a request for
review of wording):
...
 During this period, the only (non-documentation) changes that may
 be made are changes that fix regressions.

-Other changes may not be done during this period.
+Other important bugs like wrong-code, rejects-valid or build issues may
+be fixed as well.  All changes during this period should be done with
+extra care on not introducing new regressions - fixing bugs at all cost
+is not wanted.

 Note that the same constraints apply to release branches.

 This period lasts until stage 1 opens for the next release.
...

If a text can be agreed upon, then I can prepare a patch for wwwdocs.

Thanks,
- Tom


Re: [RFC] Update Stage 4 description

2019-01-09 Thread Richard Biener
On Wed, 9 Jan 2019, Tom de Vries wrote:

> [ To revisit https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00385.html ]
> 
> The current formulation for the description of Stage 4 here (
> https://gcc.gnu.org/develop.html ) is:
> ...
> During this period, the only (non-documentation) changes that may be
> made are changes that fix regressions.
> 
> Other changes may not be done during this period.
> 
> Note that the same constraints apply to release branches.
> 
> This period lasts until stage 1 opens for the next release.
> ...
> 
> This updated formulation was proposed by Richi (with a request for
> review of wording):
> ...
>  During this period, the only (non-documentation) changes that may
>  be made are changes that fix regressions.
> 
> -Other changes may not be done during this period.
> +Other important bugs like wrong-code, rejects-valid or build issues may
> +be fixed as well.  All changes during this period should be done with
> +extra care on not introducing new regressions - fixing bugs at all cost
> +is not wanted.
> 
>  Note that the same constraints apply to release branches.
> 
>  This period lasts until stage 1 opens for the next release.
> ...
> 
> If a text can be agreed upon, then I can prepare a patch for wwwdocs.

The proposed text sounds good, please post a patch and apply!

Thanks,
Richard.


Re: [RFC] Update Stage 4 description

2019-01-09 Thread Jonathan Wakely
On Wed, 9 Jan 2019 at 08:41, Tom de Vries  wrote:
>
> [ To revisit https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00385.html ]
>
> The current formulation for the description of Stage 4 here (
> https://gcc.gnu.org/develop.html ) is:
> ...
> During this period, the only (non-documentation) changes that may be
> made are changes that fix regressions.
>
> Other changes may not be done during this period.
>
> Note that the same constraints apply to release branches.
>
> This period lasts until stage 1 opens for the next release.
> ...
>
> This updated formulation was proposed by Richi (with a request for
> review of wording):
> ...
>  During this period, the only (non-documentation) changes that may
>  be made are changes that fix regressions.
>
> -Other changes may not be done during this period.
> +Other important bugs like wrong-code, rejects-valid or build issues may
> +be fixed as well.  All changes during this period should be done with
> +extra care on not introducing new regressions - fixing bugs at all cost

ISTM that this should be either "at any cost" or "at all costs". The
current wording can't make up its mind if it's singular or plural.

I also stumbled over "on not introducing" ... would that be better as
"to not introduce"?


> +is not wanted.
>
>  Note that the same constraints apply to release branches.
>
>  This period lasts until stage 1 opens for the next release.
> ...
>
> If a text can be agreed upon, then I can prepare a patch for wwwdocs.
>
> Thanks,
> - Tom


[wwwdocs, committed] Update Stage 4 description

2019-01-09 Thread Tom de Vries
[ was: Re: [RFC] Update Stage 4 description ]

On 09-01-19 09:47, Richard Biener wrote:
> On Wed, 9 Jan 2019, Tom de Vries wrote:
> 
>> [ To revisit https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00385.html ]
>>
>> The current formulation for the description of Stage 4 here (
>> https://gcc.gnu.org/develop.html ) is:
>> ...
>> During this period, the only (non-documentation) changes that may be
>> made are changes that fix regressions.
>>
>> Other changes may not be done during this period.
>>
>> Note that the same constraints apply to release branches.
>>
>> This period lasts until stage 1 opens for the next release.
>> ...
>>
>> This updated formulation was proposed by Richi (with a request for
>> review of wording):
>> ...
>>  During this period, the only (non-documentation) changes that may
>>  be made are changes that fix regressions.
>>
>> -Other changes may not be done during this period.
>> +Other important bugs like wrong-code, rejects-valid or build issues may
>> +be fixed as well.  All changes during this period should be done with
>> +extra care on not introducing new regressions - fixing bugs at all cost
>> +is not wanted.
>>
>>  Note that the same constraints apply to release branches.
>>
>>  This period lasts until stage 1 opens for the next release.
>> ...
>>
>> If a text can be agreed upon, then I can prepare a patch for wwwdocs.
> 
> The proposed text sounds good, please post a patch and apply!

Attached patch committed.

Thanks,
- Tom
Index: htdocs/develop.html
===
RCS file: /cvs/gcc/wwwdocs/htdocs/develop.html,v
retrieving revision 1.190
diff -r1.190 develop.html
135,138c135,140
< be made are changes that fix regressions.  Other changes may not be
< done during this period.  Note that the same constraints apply
< to release branches.  This period lasts until stage 1 opens for
< the next release.
---
> be made are changes that fix regressions.  Other important bugs
> like wrong-code, rejects-valid or build issues may be fixed as well.
> All changes during this period should be done with extra care on
> not introducing new regressions - fixing bugs at all cost is not
> wanted.  Note that the same constraints apply to release branches.
> This period lasts until stage 1 opens for the next release.


Garbage collection bugs

2019-01-09 Thread Joern Wolfgang Rennecke
We've been running builds/regression tests for GCC 8.2 configured with 
--enable-checking=all, and have observed some failures related to 
garbage collection.


First problem:

The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but 
-std=c++11 and -std=c++14 appear to follow the same pattern) see gcc 
garbage-collecting a live vector.  A subsequent access to the vector 
with vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is 
0xa5a5a5a5 . The vec data is also being freed / poisoned. The vector in 
question is an auto-variable of cp_parser_parenthesized_expression_list, 
which is declared as: vec *expression_list;


According to doc/gty.texi: "you should reference all your data from 
static or external @code{GTY}-ed variables, and it is advised to call 
@code{ggc_collect} with a shallow call stack."

In this case, cgraph_node::finalize_function calls the garbage collector,
as we are finishing a member function of a struct. gdb shows a backtrace 
of 34 frames, which is not really much as far as C++ parsing goes. The 
caller of finalize_function is expand_or_defer_fn, which uses the 
expression "function_depth > 1" to compute the no_collect parameter to 
finalize_function.
cp_parser_parenthesized_expression_list is in frame 21 of the backtrace 
at this point.
So, if we consider this shallow, cp_parser_parenthesized_expression_list 
either has to refrain from using a vector with garbage-collected 
allocation, or it has to make the pointer reachable from a GC root - at 
least if function_depth <= 1.
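The gty.texi rule quoted above can be illustrated with a toy precise mark-and-sweep collector. This is a hypothetical sketch and not GCC's ggc: `ToyHeap`, `Obj` and `add_root` are invented names. As with ggc_collect, only objects reachable from registered roots survive. Registered roots play the part of GTY-ed static variables. A pointer held only in a local variable, like `expression_list`, is never scanned, so its target is freed.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy GC object: 'refs' are the outgoing pointers the collector knows about.
struct Obj {
  std::vector<Obj*> refs;
  bool marked = false;
};

class ToyHeap {
 public:
  ToyHeap() = default;
  ToyHeap(const ToyHeap&) = delete;
  ToyHeap& operator=(const ToyHeap&) = delete;
  ~ToyHeap() { for (Obj* o : objs_) delete o; }

  Obj* alloc() { Obj* o = new Obj; objs_.insert(o); return o; }

  // Register a root slot, the analogue of a GTY-ed static variable.
  void add_root(Obj** slot) { roots_.push_back(slot); }

  bool alive(const Obj* o) const { return objs_.count(const_cast<Obj*>(o)) != 0; }
  std::size_t size() const { return objs_.size(); }

  // Mark everything reachable from the roots, then free the rest.
  // Pointers that live only in local (auto) variables are NOT scanned.
  void collect() {
    for (Obj** slot : roots_) mark(*slot);
    for (auto it = objs_.begin(); it != objs_.end();) {
      Obj* o = *it;
      if (o->marked) { o->marked = false; ++it; }
      else { delete o; it = objs_.erase(it); }
    }
  }

 private:
  void mark(Obj* o) {  // recursive for brevity; fine for a toy
    if (!o || o->marked) return;
    o->marked = true;
    for (Obj* r : o->refs) mark(r);
  }

  std::unordered_set<Obj*> objs_;
  std::vector<Obj**> roots_;
};
```

In the model, an object referenced only from a local variable does not survive `collect()`. It corresponds to the vector found poisoned with 0xa5a5a5a5 in the report above. Registering its slot as a root (or not collecting at that point) keeps it alive.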

Is the attached patch the right approach?

When looking at regression test results for gcc version 9.0.0 20181028 
(experimental), the excess errors test for g++.dg/pr85039-2.C seems to 
pass, yet I can see no definite reason in the source code why that is 
so. I tried running the test by hand in order to check whether the 
patch for PR c++/84455 plays a role, but when run by hand the test 
crashes again, and gdb shows the telltale a5 pattern in a pointer register.

#0  vec::quick_push (obj=,
this=0x705ece60)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:974

#1  vec_safe_push (obj=,
v=@0x7fffd038: 0x705ece60)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:766

#2  cp_parser_parenthesized_expression_list (
parser=parser@entry=0x77ff83f0,
is_attribute_list=is_attribute_list@entry=0, cast_p=cast_p@entry=false,
allow_expansion_p=allow_expansion_p@entry=true,
non_constant_p=non_constant_p@entry=0x7fffd103,
close_paren_loc=close_paren_loc@entry=0x0, wrap_locations_p=false)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:7803

#3  0x006e910d in cp_parser_initializer (
parser=parser@entry=0x77ff83f0,
is_direct_init=is_direct_init@entry=0x7fffd102,
non_constant_p=non_constant_p@entry=0x7fffd103,
subexpression_p=subexpression_p@entry=false)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:22009

#4  0x0070954e in cp_parser_init_declarator (
parser=parser@entry=0x77ff83f0,
decl_specifiers=decl_specifiers@entry=0x7fffd1c0,
checks=checks@entry=0x0,
function_definition_allowed_p=function_definition_allowed_p@entry=true,
member_p=member_p@entry=false, declares_class_or_enum=,
function_definition_p=0x7fffd250, maybe_range_for_decl=0x0,
init_loc=0x7fffd1ac, auto_result=0x7fffd2e0)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:19827
#5  0x00711c5d in cp_parser_simple_declaration 
(parser=0x77ff83f0,
function_definition_allowed_p=, 
maybe_range_for_decl=0x0)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:13179

#6  0x00717bb5 in cp_parser_declaration (parser=0x77ff83f0)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:12876

#7  0x0071837d in cp_parser_translation_unit (parser=0x77ff83f0)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:4631

#8  c_parse_file ()
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:39108

#9  0x00868db1 in c_common_parse_file ()
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/c-family/c-opts.c:1150

#10 0x00e0aaaf in compile_file ()
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/toplev.c:455

#11 0x0059248a in do_compile ()
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/toplev.c:2172

#12 toplev::main (this=this@entry=0x7fffd54e, argc=,
argc@entry=100, argv=, argv@entry=0x7fffd648)
at 
/data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/toplev.c:2307

#13 0x00594b5b in main (argc=100, argv=0x7fffd648)
at 
/data/hudson/jobs/gcc-9.0.0-linu

Re: autovectorization in gcc

2019-01-09 Thread Kyrill Tkachov

Hi Kay,

On 09/01/19 08:29, Kay F. Jahnke wrote:

> Hi there!
>
> I am developing software which tries to deliberately exploit the
> compiler's autovectorization facilities by feeding data in
> autovectorization-friendly loops. I'm currently using both g++ and
> clang++ to see how well this approach works. Using simple arithmetic, I
> often get good results. To widen the scope of my work, I was looking for
> documentation on which constructs would be recognized by the
> autovectorization stage, and found
>
> https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html

Yeah, that page hasn't been updated in ages AFAIK.


> By the looks of it, this document has not seen any changes for several
> years. Has development on the autovectorization stage stopped, or is
> there simply no documentation?

There's plenty of work being done on auto-vectorisation in GCC.
Auto-vectorisation is a performance optimisation and as such is not really
a user-visible feature that absolutely requires user documentation.


> In my experience, vectorization is essential to speed up arithmetic on
> the CPU, and reliable recognition of vectorization opportunities by the
> compiler can provide vectorization to programs which don't bother to
> code it explicitly. I feel the topic is being neglected - at least the
> documentation I found suggests this. To demonstrate what I mean, I have
> two concrete scenarios which I'd like to be handled by the
> autovectorization stage:
>
> - gather/scatter with arbitrary indexes
>
> In C, this would be loops like
>
> // gather from B to A using gather indexes
>
> for ( int i = 0 ; i < vsz ; i++ )
>    A [ i ] = B [ indexes [ i ] ] ;
>
> From the AVX2 ISA onwards, there are hardware gather/scatter
> operations, which can speed things up a good deal.
>
> - repeated use of vectorizable functions
>
> for ( int i = 0 ; i < vsz ; i++ )
>    A [ i ] = sqrt ( B [ i ] ) ;
>
> Here, replacing the repeated call of sqrt with the vectorized equivalent
> gives a dramatic speedup (ca. 4X)

I believe GCC will do some of that already given a high-enough optimisation
level and floating-point constraints.
Do you have examples where it doesn't? Testcases with self-contained source code
and compiler flags would be useful to analyse.

> If the compiler were to provide the autovectorization facilities, and if
> the patterns it recognizes were well-documented, users could rely on
> certain code patterns being recognized and autovectorized - sort of a
> contract between the user and the compiler. With a well-chosen spectrum
> of patterns, this would make it unnecessary to have to rely on explicit
> vectorization in many cases. My hope is that such an interface would
> help vectorization to become more frequently used - as I understand the
> status quo, this is still a niche topic, even though many processors
> provide suitable hardware nowadays.

I wouldn't say it's a niche topic :)
From my monitoring of the GCC development over the last few years there's been
lots of improvements in auto-vectorisation in compilers (at least in GCC).

The thing is, auto-vectorisation is not always profitable for performance.
Sometimes the runtime loop iteration count is so low that setting up the
vectorised loop (alignment checks, loads/permutes) is slower than just doing
the scalar form, especially since SIMD performance varies from CPU to CPU.
So we would want the compiler to have the freedom to make its own judgement on
when to auto-vectorise rather than enforce a "contract". If the user really
only wants vector code, they should use one of the explicit programming
paradigms.
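The trip-count argument can be made concrete with a hand-written sketch of the shape a vectorized loop takes: a main body that covers VF elements per iteration, followed by a scalar epilogue. The names and the fixed factor VF = 8 are illustrative assumptions, not GCC internals. GCC's real vectorizer may also insert runtime alias or alignment checks and a peeled prologue.

```cpp
#include <cstddef>

// Assumed vectorization factor: e.g. 8 floats per 256-bit AVX register.
constexpr std::size_t VF = 8;

struct TripSplit {
  std::size_t vector_iters;  // iterations of the VF-wide main body
  std::size_t scalar_iters;  // leftover elements handled one at a time
};

TripSplit split_trip_count(std::size_t n) {
  return TripSplit{n / VF, n % VF};
}

// Same structure written out by hand; the inner j-loop stands in for a
// single SIMD instruction operating on VF lanes.
void scale(float* a, std::size_t n, float k) {
  std::size_t i = 0;
  for (; i + VF <= n; i += VF)
    for (std::size_t j = 0; j < VF; ++j)
      a[i + j] *= k;
  for (; i < n; ++i)  // scalar epilogue
    a[i] *= k;
}
```

With n = 5, `split_trip_count` gives 0 vector iterations and 5 scalar ones, so any setup for the vector path is pure overhead. With n = 32768 it gives 4096 vector iterations and no epilogue.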

HTH,
Kyrill


> Can you point me to where 'the action is' in this regard?
>
> With regards
>
> Kay F. Jahnke


Re: autovectorization in gcc

2019-01-09 Thread Andrew Haley
On 1/9/19 9:45 AM, Kyrill Tkachov wrote:
> Hi Kay,
> 
> On 09/01/19 08:29, Kay F. Jahnke wrote:
>> Hi there!
>>
>> I am developing software which tries to deliberately exploit the
>> compiler's autovectorization facilities by feeding data in
>> autovectorization-friendly loops. I'm currently using both g++ and
>> clang++ to see how well this approach works. Using simple arithmetic, I
>> often get good results. To widen the scope of my work, I was looking for
>> documentation on which constructs would be recognized by the
>> autovectorization stage, and found
>>
>> https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html
>>
> 
> Yeah, that page hasn't been updated in ages AFAIK.
> 
>> By the looks of it, this document has not seen any changes for several
>> years. Has development on the autovectorization stage stopped, or is
>> there simply no documentation?
>>
> 
> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.

I don't agree. Sometimes vectorization is critical. It would be nice
to have a warning which would fire if vectorization failed. That would
surely help the OP.

-- 
Andrew Haley
Java Platform Lead Engineer
Red Hat UK Ltd. 
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


Re: autovectorization in gcc

2019-01-09 Thread Jonathan Wakely
On Wed, 9 Jan 2019 at 09:50, Andrew Haley wrote:
> I don't agree. Sometimes vectorization is critical. It would be nice
> to have a warning which would fire if vectorization failed. That would
> surely help the OP.

Dave Malcolm has been working on something like that:
https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01749.html


Re: Garbage collection bugs

2019-01-09 Thread Arseny Solokha
> First problem:
> 
> The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but
> -std=c++11 and -std=c++14 appear to follow the same pattern) see gcc
> garbage-collecting a live vector.  A subsequent access to the vector with
> vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is 0xa5a5a5a5 .
> The vec data is also being freed / poisoned. The vector in question is an
> auto-variable of cp_parser_parenthesized_expression_list, which is declared 
> as:
> vec *expression_list;

It looks like PR88180 to me.


Re: autovectorization in gcc

2019-01-09 Thread Ramana Radhakrishnan
On Wed, Jan 9, 2019 at 9:50 AM Andrew Haley  wrote:
>
> On 1/9/19 9:45 AM, Kyrill Tkachov wrote:
> > Hi Kay,
> >
> > On 09/01/19 08:29, Kay F. Jahnke wrote:
> >> Hi there!
> >>
> >> I am developing software which tries to deliberately exploit the
> >> compiler's autovectorization facilities by feeding data in
> >> autovectorization-friendly loops. I'm currently using both g++ and
> >> clang++ to see how well this approach works. Using simple arithmetic, I
> >> often get good results. To widen the scope of my work, I was looking for
> >> documentation on which constructs would be recognized by the
> >> autovectorization stage, and found
> >>
> >> https://www.gnu.org/software/gcc/projects/tree-ssa/vectorization.html
> >>
> >
> > Yeah, that page hasn't been updated in ages AFAIK.
> >
> >> By the looks of it, this document has not seen any changes for several
> >> years. Has development on the autovectorization stage stopped, or is
> >> there simply no documentation?
> >>
> >
> > There's plenty of work being done on auto-vectorisation in GCC.
> > Auto-vectorisation is a performance optimisation and as such is not really
> > a user-visible feature that absolutely requires user documentation.
>
> I don't agree. Sometimes vectorization is critical. It would be nice
> to have a warning which would fire if vectorization failed. That would
> surely help the OP.

That would certainly help: the user can get some information out
today with the debug dumps, but those are designed more for
compiler writers than for users.

regards
Ramana


Re: autovectorization in gcc

2019-01-09 Thread Kay F. Jahnke

On 09.01.19 10:45, Kyrill Tkachov wrote:


> There's plenty of work being done on auto-vectorisation in GCC.
> Auto-vectorisation is a performance optimisation and as such is not really
> a user-visible feature that absolutely requires user documentation.

Since I'm trying to deliberately exploit it, a more user-visible guise 
would help ;)



>> - repeated use of vectorizable functions
>>
>> for ( int i = 0 ; i < vsz ; i++ )
>>    A [ i ] = sqrt ( B [ i ] ) ;
>>
>> Here, replacing the repeated call of sqrt with the vectorized equivalent
>> gives a dramatic speedup (ca. 4X)

The above is a typical example. So, to give a complete source 'vec_sqrt.cc':

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
  #pragma vectorize enable
  for ( int i = 0 ; i < 32768 ; i++ )
    data [ i ] = std::sqrt ( data [ i ] ) ;
}

This has a large trip count, the loop is trivial. It's an ideal 
candidate for autovectorization. When I compile this source, using


g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

the inner loop translates to:

.L2:
        vmovss   (%rbx), %xmm0
        vucomiss %xmm0, %xmm2
        vsqrtss  %xmm0, %xmm1, %xmm1
        jbe      .L3
        vmovss   %xmm2, 12(%rsp)
        addq     $4, %rbx
        vmovss   %xmm1, 8(%rsp)
        call     sqrtf@PLT
        vmovss   8(%rsp), %xmm1
        vmovss   %xmm1, -4(%rbx)
        cmpq     %rbp, %rbx
        vmovss   12(%rsp), %xmm2
        jne      .L2

AFAICT this is not vectorized: it only processes a single float at a time.
In vector code, I'd expect the vsqrtps mnemonic to show up.

> I believe GCC will do some of that already given a high-enough optimisation
> level and floating-point constraints.
> Do you have examples where it doesn't? Testcases with self-contained source code
> and compiler flags would be useful to analyse.

So, see above. With -Ofast the output is similar, except that the inner loop
is unrolled. But maybe I'm missing something? Any hints for additional flags?



>> If the compiler were to provide the autovectorization facilities, and if
>> the patterns it recognizes were well-documented, users could rely on
>> certain code patterns being recognized and autovectorized - sort of a
>> contract between the user and the compiler. With a well-chosen spectrum
>> of patterns, this would make it unnecessary to have to rely on explicit
>> vectorization in many cases. My hope is that such an interface would
>> help vectorization to become more frequently used - as I understand the
>> status quo, this is still a niche topic, even though many processors
>> provide suitable hardware nowadays.

> I wouldn't say it's a niche topic :)
> From my monitoring of the GCC development over the last few years there's been
> lots of improvements in auto-vectorisation in compilers (at least in GCC).

Okay, I'll take your word for it.


> The thing is, auto-vectorisation is not always profitable for performance.
> Sometimes the runtime loop iteration count is so low that setting up the
> vectorised loop (alignment checks, loads/permutes) is slower than just doing
> the scalar form, especially since SIMD performance varies from CPU to CPU.
> So we would want the compiler to have the freedom to make its own judgement on
> when to auto-vectorise rather than enforce a "contract". If the user really
> only wants vector code, they should use one of the explicit programming
> paradigms.

I know that these issues are important. I am using Vc for explicit 
vectorization, so I can easily write code that produces vector code for 
common targets, and I can compare the performance. I have tried the example 
given above on my AVX2 machine, linking with a main program which calls 
'vf1' 32768 times, to get one gigaroot (giggle). The vectorized version 
takes about half a second, the unvectorized one about three. With 
functions like sqrt, trigonometric functions, exp and pow, vectorization 
is very profitable. Some further details:


Here's the main program 'memaxs.cc':

float data [ 32768 ] ;
extern void vf1() ;

int main ( int argc , char * argv[] )
{
  for ( int k = 0 ; k < 32768 ; k++ )
  {
    vf1() ;
  }
}

And the compiler call to get a binary:

g++ -O3 -mavx2 -o memaxs sqrt.s memaxs.cc

Here's the performance:

$ time ./memaxs

real    0m3,205s
user    0m3,200s
sys     0m0,004s
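The "gigaroot" figure can be checked with a little arithmetic. The inputs are the user time above and the roughly half a second reported for the Vc version. Both numbers are approximate and include loop and call overhead.

```latex
\begin{aligned}
N &= 32768 \times 32768 = 2^{15} \cdot 2^{15} = 2^{30} \approx 1.07 \times 10^{9} \\
t_{\text{scalar}} &\approx 3.2\,\text{s} / 2^{30} \approx 3.0\,\text{ns per root} \\
t_{\text{Vc}} &\approx 0.5\,\text{s} / 2^{30} \approx 0.47\,\text{ns per root} \\
t_{\text{scalar}} / t_{\text{Vc}} &\approx 6.4
\end{aligned}
```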

This variant of vec_sqrt.cc uses Vc ('vc_vec_sqrt.cc')

#include <Vc/Vc>

extern float data [ 32768 ] ;

extern void vf1()
{
  for ( int k = 0 ; k < 32768 ; k += 8 )
  {
    Vc::float_v fv ( data + k ) ;
    fv = sqrt ( fv ) ;
    fv.store ( data + k ) ;
  }
}

Translated to assembler, I get the inner loop

.L2:
        vmovups      (%rax), %xmm0
        addq         $32, %rax
        vinsertf128  $0x1, -16(%rax), %ymm0, %ymm0
        vsqrtps      %ymm0, %ymm0
        vmovups      %xmm0, -32(%rax)
        vextractf128 $0x1, %ymm0, -16(%rax)
        cmpq         %rax, %rdx
        jne          .L2
        vzeroupper
        ret
        .cfi_endproc

Note how the data are read 32 bytes at a time and processed with vsqrtps.

Creating the corresponding binary and executing it:

$ g++ -O3 -mavx2 -o memaxs sqr

Re: autovectorization in gcc

2019-01-09 Thread Jakub Jelinek
On Wed, Jan 09, 2019 at 11:56:03AM +0100, Kay F. Jahnke wrote:
> The above is a typical example. So, to give a complete source 'vec_sqrt.cc':
> 
> #include <cmath>
> 
> extern float data [ 32768 ] ;
> 
> extern void vf1()
> {
>   #pragma vectorize enable
>   for ( int i = 0 ; i < 32768 ; i++ )
>     data [ i ] = std::sqrt ( data [ i ] ) ;
> }
> 
> This has a large trip count, the loop is trivial. It's an ideal candidate
> for autovectorization. When I compile this source, using
> 
> g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc

Generally you want -Ofast or -ffast-math or at least some suboptions of that
if you want to vectorize floating point loops, because vectorization in many
cases changes where FPU exceptions would be generated, can affect precision
by reordering the ops etc. In the above case it is just that glibc
declares the vector math functions for #ifdef __FAST_MATH__ only, as they
have worse precision.

Note, gcc doesn't recognize #pragma vectorize, you can use e.g.
#pragma omp simd
or
#pragma GCC ivdep
if you want to assert some properties of the loop the compiler can't easily
prove itself that would help the vectorization.
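As an example, the vf1 loop from the original post could carry the OpenMP pragma instead of the unrecognized one. This is a sketch. It defines data locally rather than as extern so that it is self-contained. It assumes a command line such as `g++ -O3 -mavx2 -fopenmp-simd -fno-math-errno`. The pragma only asserts that the iterations are independent. As the follow-up message explains, the sqrt loop also needs `-fno-math-errno` (or `-ffast-math`) before GCC can turn it into vsqrtps.

```cpp
#include <cmath>

// Defined here so the sketch is self-contained; the original declares it extern.
float data[32768];

void vf1()
{
  // Honoured with -fopenmp-simd (or -fopenmp); without either flag GCC
  // ignores the pragma (and warns under -Wunknown-pragmas, part of -Wall).
  #pragma omp simd
  for (int i = 0; i < 32768; i++)
    data[i] = std::sqrt(data[i]);
}
```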

Jakub


Re: autovectorization in gcc

2019-01-09 Thread Jakub Jelinek
On Wed, Jan 09, 2019 at 12:03:45PM +0100, Jakub Jelinek wrote:
> > The above is a typical example. So, to give a complete source 'vec_sqrt.cc':
> > 
> > #include <cmath>
> > 
> > extern float data [ 32768 ] ;
> > 
> > extern void vf1()
> > {
> >   #pragma vectorize enable
> >   for ( int i = 0 ; i < 32768 ; i++ )
> >     data [ i ] = std::sqrt ( data [ i ] ) ;
> > }
> > 
> > This has a large trip count, the loop is trivial. It's an ideal candidate
> > for autovectorization. When I compile this source, using
> > 
> > g++ -O3 -mavx2 -S -o sqrt.s sqrt_gcc.cc
> 
> Generally you want -Ofast or -ffast-math or at least some suboptions of that
> if you want to vectorize floating point loops, because vectorization in many
> cases changes where FPU exceptions would be generated, can affect precision
> by reordering the ops etc. In the above case it is just that glibc
> declares the vector math functions for #ifdef __FAST_MATH__ only, as they
> have worse precision.

Actually, the last sentence was just a wrong guess in this case: for sqrt no
glibc libcall is needed (that is only needed for trigonometric functions and
the like). Of the -ffast-math suboptions, all you need for the above to
vectorize is -fno-math-errno, which tells the compiler you don't need errno
set if you call sqrt on a negative argument etc.
With -fopt-info-vec-missed the compiler would tell you:
/tmp/1.c:5:3: note: not vectorized: control flow in loop.
/tmp/1.c:5:3: note: bad loop form.
and you could look at the dumps to see that there is
  _2 = .SQRT (_1);
  if (_1 u>= 0.0)
goto ; [99.95%]
  else
goto ; [0.05%]
...
   [local count: 531495]:
  __builtin_sqrt (_1);
which is the idiom to do sqrt inline using instruction, but in the unlikely
case when the argument is negative, also call the library function so that
it handles the errno setting.
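You can observe the errno side effect behind that branch directly. The sketch below assumes a C library such as glibc, whose `math_errhandling` includes `MATH_ERRNO`. On libraries where it does not, errno is not set here. `sqrt_sets_edom` is an illustrative name. Built without `-ffast-math`, a negative argument takes the slow path through the library call, which sets errno to EDOM.

```cpp
#include <cerrno>
#include <cmath>

// Hypothetical helper: reports whether std::sqrt set errno to EDOM for x.
// Preserving this side effect is why GCC keeps the out-of-line library call
// in the loop; -fno-math-errno tells GCC it may drop it.
bool sqrt_sets_edom(float x)
{
  volatile float arg = x;             // keep the call from being constant-folded
  errno = 0;
  volatile float r = std::sqrt(arg);  // NaN for negative arguments
  (void)r;
  return errno == EDOM;
}
```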

Jakub


Re: Garbage collection bugs

2019-01-09 Thread Richard Biener
On Wed, Jan 9, 2019 at 10:46 AM Joern Wolfgang Rennecke
 wrote:
>
> We've been running builds/regression tests for GCC 8.2 configured with
> --enable-checking=all, and have observed some failures related to
> garbage collection.
>
> First problem:
>
> The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but
> -std=c++11 and -std=c++14 appear to follow the same pattern) see gcc
> garbage-collecting a live vector.  A subsequent access to the vector
> with vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is
> 0xa5a5a5a5 . The vec data is also being freed / poisoned. The vector in
> question is an auto-variable of cp_parser_parenthesized_expression_list,
> which is declared as: vec *expression_list;
>
> According to doc/gty.texi: "you should reference all your data from
> static or external @code{GTY}-ed variables, and it is advised to call
> @code{ggc_collect} with a shallow call stack."
>
> In this case, cgraph_node::finalize_function calls the garbage collector,
> as we are finishing a member function of a struct. gdb shows a backtrace
> of 34 frames, which is not really much as far as C++ parsing goes. The
> caller of finalize_function is expand_or_defer_fn, which uses the
> expression "function_depth > 1" to compute the no_collect parameter to
> finalize_function.
> cp_parser_parenthesized_expression_list is in frame 21 of the backtrace
> at this point.
> So, if we consider this shallow, cp_parser_parenthesized_expression_list
> either has to refrain from using a vector with garbage-collected
> allocation, or it has to make the pointer reachable from a GC root - at
> least if function_depth <= 1.
> Is the attached patch the right approach?
>
> When looking at regression test results for gcc version 9.0.0 20181028
> (experimental), the excess errors test for g++.dg/pr85039-2.C seems to
> pass, yet I can see no definite reason in the source code why that is
> so.  I tried running the test by hand in order to check if maybe the
> patch for PR c++/84455 plays a role,
> but running the test by hand, it crashes again, and gdb shows the
> telltale a5 pattern in a pointer register.
> #0  vec::quick_push (obj=,
>  this=0x705ece60)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:974
> #1  vec_safe_push (obj=,
>  v=@0x7fffd038: 0x705ece60)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:766
> #2  cp_parser_parenthesized_expression_list (
>  parser=parser@entry=0x77ff83f0,
>  is_attribute_list=is_attribute_list@entry=0, cast_p=cast_p@entry=false,
>  allow_expansion_p=allow_expansion_p@entry=true,
>  non_constant_p=non_constant_p@entry=0x7fffd103,
>  close_paren_loc=close_paren_loc@entry=0x0, wrap_locations_p=false)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:7803
> #3  0x006e910d in cp_parser_initializer (
>  parser=parser@entry=0x77ff83f0,
>  is_direct_init=is_direct_init@entry=0x7fffd102,
>  non_constant_p=non_constant_p@entry=0x7fffd103,
>  subexpression_p=subexpression_p@entry=false)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:22009
> #4  0x0070954e in cp_parser_init_declarator (
>  parser=parser@entry=0x77ff83f0,
>  decl_specifiers=decl_specifiers@entry=0x7fffd1c0,
>  checks=checks@entry=0x0,
> function_definition_allowed_p=function_definition_allowed_p@entry=true,
>  member_p=member_p@entry=false, declares_class_or_enum=,
>  function_definition_p=0x7fffd250, maybe_range_for_decl=0x0,
>  init_loc=0x7fffd1ac, auto_result=0x7fffd2e0)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:19827
> #5  0x00711c5d in cp_parser_simple_declaration
> (parser=0x77ff83f0,
>  function_definition_allowed_p=,
> maybe_range_for_decl=0x0)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:13179
> #6  0x00717bb5 in cp_parser_declaration (parser=0x77ff83f0)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:12876
> #7  0x0071837d in cp_parser_translation_unit (parser=0x77ff83f0)
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:4631
> #8  c_parse_file ()
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:39108
> #9  0x00868db1 in c_common_parse_file ()
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/c-family/c-opts.c:1150
> #10 0x00e0aaaf in compile_file ()
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/toplev.c:455
> #11 0x0059248a in do_compile ()
>  at
> /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/toplev.c:2172
> #12 toplev::main (this=this@entry=0x7fffd54e, argc=,
>
>  argc@
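The failure mode in the report above can be illustrated with a toy precise collector. A collection runs deep in the parse and poisons a vector whose only reference is a local in cp_parser_parenthesized_expression_list, which is where the 0xa5a5a5a5 pattern comes from. This is a hypothetical sketch, not GCC's ggc: ToyHeap, Obj, demo_unrooted and demo_rooted are made up for illustration. The collector only sees pointers that were explicitly registered as roots, loosely mirroring the gty.texi advice to reference live data from GTY-ed variables.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <set>
#include <vector>

// Hypothetical stand-in for a GC-allocated vec; num plays the role of
// m_vecpfx.m_num from the report.
struct Obj {
  std::uint32_t num;
};

// Toy precise mark-and-sweep heap: it marks only from explicitly registered
// root slots and never scans the C++ stack.
class ToyHeap {
 public:
  ToyHeap() = default;
  ToyHeap(const ToyHeap &) = delete;
  ToyHeap &operator=(const ToyHeap &) = delete;
  ~ToyHeap() {
    for (Obj *o : objects_) delete o;
  }

  Obj *alloc(std::uint32_t num) {
    Obj *o = new Obj{num};
    objects_.push_back(o);
    return o;
  }

  void add_root(Obj **slot) { roots_.push_back(slot); }

  void remove_root(Obj **slot) {
    roots_.erase(std::remove(roots_.begin(), roots_.end(), slot), roots_.end());
  }

  // Mark whatever the roots point to, then "sweep" everything else by
  // poisoning it with 0xa5 bytes. The memory stays allocated so that a
  // stale pointer can still be read and shows the poison pattern.
  void collect() {
    std::set<Obj *> marked;
    for (Obj **slot : roots_)
      if (*slot) marked.insert(*slot);
    for (Obj *o : objects_)
      if (!marked.count(o)) std::memset(o, 0xa5, sizeof *o);
  }

 private:
  std::vector<Obj *> objects_;
  std::vector<Obj **> roots_;
};

// Pointer held only in a local: the collector cannot see it.
std::uint32_t demo_unrooted() {
  ToyHeap heap;
  Obj *expression_list = heap.alloc(3);
  heap.collect();               // e.g. triggered from deep inside the parse
  return expression_list->num;  // reads the poison pattern
}

// Same pattern, but the local is registered as a root while it is live.
std::uint32_t demo_rooted() {
  ToyHeap heap;
  Obj *expression_list = heap.alloc(3);
  heap.add_root(&expression_list);
  heap.collect();
  std::uint32_t seen = expression_list->num;  // survives the collection
  heap.remove_root(&expression_list);
  return seen;
}
```

The two demo functions correspond to the two options named in the report: keeping a GC-allocated vector that no root reaches (unsafe if a collection can happen), or making the pointer reachable from a root for as long as it is in use.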

Re: Garbage collection bugs

2019-01-09 Thread Richard Biener
On Wed, Jan 9, 2019 at 12:48 PM Richard Biener
 wrote:
>
> On Wed, Jan 9, 2019 at 10:46 AM Joern Wolfgang Rennecke
>  wrote:
> >
> > We've been running builds/regression tests for GCC 8.2 configured with
> > --enable-checking=all, and have observed some failures related to
> > garbage collection.
> >
> > First problem:
> >
> > The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but
> > -std=c++11 and -std=c++14 appear to follow the same pattern) see gcc
> > garbage-collecting a live vector.  A subsequent access to the vector
> > with vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is
> > 0xa5a5a5a5 . The vec data is also being freed / poisoned. The vector in
> > question is an auto-variable of cp_parser_parenthesized_expression_list,
> > which is declared as: vec *expression_list;
> >
> > According to doc/gty.texi: "you should reference all your data from
> > static or external @code{GTY}-ed variables, and it is advised to call
> > @code{ggc_collect} with a shallow call stack."
> >
> > In this case, cgraph_node::finalize_function calls the garbage collector,
> > as we are finishing a member function of a struct. gdb shows a backtrace
> > of 34 frames, which is not really much as far as C++ parsing goes. The
> > caller of finalize_function is expand_or_defer_fn, which uses the
> > expression "function_depth > 1" to compute the no_collect parameter to
> > finalize_function.
> > cp_parser_parenthesized_expression_list is in frame 21 of the backtrace
> > at this point.
> > So, if we consider this shallow, cp_parser_parenthesized_expression_list
> > either has to refrain from using a vector with garbage-collected
> > allocation, or it has to make the pointer reachable from a GC root - at
> > least if function_depth <= 1.
> > Is the attached patch the right approach?
> >
> > When looking at regression test results for gcc version 9.0.0 20181028
> > (experimental), the excess errors test for g++.dg/pr85039-2.C seems to
> > pass, yet I can see no definite reason in the source code why that is
> > so.  I tried running the test by hand in order to check if maybe the
> > patch for PR c++/84455 plays a role,
> > but running the test by hand, it crashes again, and gdb shows the
> > telltale a5 pattern in a pointer register.
> > #0  vec::quick_push (obj=,
> >  this=0x705ece60)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:974
> > #1  vec_safe_push (obj=,
> >  v=@0x7fffd038: 0x705ece60)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:766
> > #2  cp_parser_parenthesized_expression_list (
> >  parser=parser@entry=0x77ff83f0,
> >  is_attribute_list=is_attribute_list@entry=0, cast_p=cast_p@entry=false,
> >  allow_expansion_p=allow_expansion_p@entry=true,
> >  non_constant_p=non_constant_p@entry=0x7fffd103,
> >  close_paren_loc=close_paren_loc@entry=0x0, wrap_locations_p=false)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:7803
> > #3  0x006e910d in cp_parser_initializer (
> >  parser=parser@entry=0x77ff83f0,
> >  is_direct_init=is_direct_init@entry=0x7fffd102,
> >  non_constant_p=non_constant_p@entry=0x7fffd103,
> >  subexpression_p=subexpression_p@entry=false)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:22009
> > #4  0x0070954e in cp_parser_init_declarator (
> >  parser=parser@entry=0x77ff83f0,
> >  decl_specifiers=decl_specifiers@entry=0x7fffd1c0,
> >  checks=checks@entry=0x0,
> > function_definition_allowed_p=function_definition_allowed_p@entry=true,
> >  member_p=member_p@entry=false, declares_class_or_enum=,
> >  function_definition_p=0x7fffd250, maybe_range_for_decl=0x0,
> >  init_loc=0x7fffd1ac, auto_result=0x7fffd2e0)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:19827
> > #5  0x00711c5d in cp_parser_simple_declaration
> > (parser=0x77ff83f0,
> >  function_definition_allowed_p=,
> > maybe_range_for_decl=0x0)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:13179
> > #6  0x00717bb5 in cp_parser_declaration (parser=0x77ff83f0)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:12876
> > #7  0x0071837d in cp_parser_translation_unit (parser=0x77ff83f0)
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:4631
> > #8  c_parse_file ()
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:39108
> > #9  0x00868db1 in c_common_parse_file ()
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/c-family/c-opts.c:1150
> > #10 0x00e0aaaf in compile_file ()
> >  at
> > /data/hudson/jobs/gcc-9.0.0-linux/w

Re: Garbage collection bugs

2019-01-09 Thread Jan Hubicka
> On Wed, Jan 9, 2019 at 12:48 PM Richard Biener
>  wrote:
> >
> > On Wed, Jan 9, 2019 at 10:46 AM Joern Wolfgang Rennecke
> >  wrote:
> > >
> > > We've been running builds/regression tests for GCC 8.2 configured with
> > > --enable-checking=all, and have observed some failures related to
> > > garbage collection.
> > >
> > > First problem:
> > >
> > > The g++.dg/pr85039-2.C tests (I've looked in detail at -std=c++98, but
> > > -std=c++11 and -std=c++14 appear to follow the same pattern) see gcc
> > > garbage-collecting a live vector.  A subsequent access to the vector
> > > with vec_quick_push causes a segmentation fault, as m_vecpfx.m_num is
> > > 0xa5a5a5a5 . The vec data is also being freed / poisoned. The vector in
> > > question is an auto-variable of cp_parser_parenthesized_expression_list,
> > > which is declared as: vec *expression_list;
> > >
> > > According to doc/gty.texi: "you should reference all your data from
> > > static or external @code{GTY}-ed variables, and it is advised to call
> > > @code{ggc_collect} with a shallow call stack."
> > >
> > > In this case, cgraph_node::finalize_function calls the garbage collector,
> > > as we are finishing a member function of a struct. gdb shows a backtrace
> > > of 34 frames, which is not really much as far as C++ parsing goes. The
> > > caller of finalize_function is expand_or_defer_fn, which uses the
> > > expression "function_depth > 1" to compute the no_collect parameter to
> > > finalize_function.
> > > cp_parser_parenthesized_expression_list is in frame 21 of the backtrace
> > > at this point.
> > > So, if we consider this shallow, cp_parser_parenthesized_expression_list
> > > either has to refrain from using a vector with garbage-collected
> > > allocation, or it has to make the pointer reachable from a GC root - at
> > > least if function_depth <= 1.
> > > Is the attached patch the right approach?
> > >
> > > When looking at regression test results for gcc version 9.0.0 20181028
> > > (experimental), the excess errors test for g++.dg/pr85039-2.C seems to
> > > pass, yet I can see no definite reason in the source code why that is
> > > so.  I tried running the test by hand in order to check if maybe the
> > > patch for PR c++/84455 plays a role,
> > > but running the test by hand, it crashes again, and gdb shows the
> > > telltale a5 pattern in a pointer register.
> > > #0  vec::quick_push (obj=,
> > >  this=0x705ece60)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:974
> > > #1  vec_safe_push (obj=,
> > >  v=@0x7fffd038: 0x705ece60)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/vec.h:766
> > > #2  cp_parser_parenthesized_expression_list (
> > >  parser=parser@entry=0x77ff83f0,
> > >  is_attribute_list=is_attribute_list@entry=0, cast_p=cast_p@entry=false,
> > >  allow_expansion_p=allow_expansion_p@entry=true,
> > >  non_constant_p=non_constant_p@entry=0x7fffd103,
> > >  close_paren_loc=close_paren_loc@entry=0x0, wrap_locations_p=false)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:7803
> > > #3  0x006e910d in cp_parser_initializer (
> > >  parser=parser@entry=0x77ff83f0,
> > >  is_direct_init=is_direct_init@entry=0x7fffd102,
> > >  non_constant_p=non_constant_p@entry=0x7fffd103,
> > >  subexpression_p=subexpression_p@entry=false)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:22009
> > > #4  0x0070954e in cp_parser_init_declarator (
> > >  parser=parser@entry=0x77ff83f0,
> > >  decl_specifiers=decl_specifiers@entry=0x7fffd1c0,
> > >  checks=checks@entry=0x0,
> > > function_definition_allowed_p=function_definition_allowed_p@entry=true,
> > >  member_p=member_p@entry=false, declares_class_or_enum=<optimized out>,
> > >  function_definition_p=0x7fffd250, maybe_range_for_decl=0x0,
> > >  init_loc=0x7fffd1ac, auto_result=0x7fffd2e0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:19827
> > > #5  0x00711c5d in cp_parser_simple_declaration
> > > (parser=0x77ff83f0,
> > >  function_definition_allowed_p=,
> > > maybe_range_for_decl=0x0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:13179
> > > #6  0x00717bb5 in cp_parser_declaration (parser=0x77ff83f0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:12876
> > > #7  0x0071837d in cp_parser_translation_unit (parser=0x77ff83f0)
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:4631
> > > #8  c_parse_file ()
> > >  at
> > > /data/hudson/jobs/gcc-9.0.0-linux/workspace/gcc/build/../gcc/gcc/cp/parser.c:39108
> > > #9  0x00868db1 in c_common_pars

Re: [RFC] Update Stage 4 description

2019-01-09 Thread Paul Koning



> On Jan 9, 2019, at 3:42 AM, Tom de Vries  wrote:
> 
> [ To revisit https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00385.html ]
> 
> The current formulation for the description of Stage 4 here (
> https://gcc.gnu.org/develop.html ) is:
> ...
> During this period, the only (non-documentation) changes that may be
> made are changes that fix regressions.
> 
> Other changes may not be done during this period.
> 
> Note that the same constraints apply to release branches.
> 
> This period lasts until stage 1 opens for the next release.
> ...
> 
> This updated formulation was proposed by Richi (with a request for
> review of wording):
> ...
> During this period, the only (non-documentation) changes that may
> be made are changes that fix regressions.
> 
> -Other changes may not be done during this period.
> +Other important bugs like wrong-code, rejects-valid or build issues may
> +be fixed as well.  All changes during this period should be done with
> +extra care on not introducing new regressions - fixing bugs at all cost
> +is not wanted.
...

Is there, or should there be, a distinction between primary and non-primary 
platforms?  While platform bugs typically require fixes in platform-specific 
code, I would think we would want to stay away from bugfixes in minor platforms 
during stage 4.  The wording seems to say that I could fix wrong-code bugs in 
pdp11 during stage 4; I have been assuming I should not do that.  Is this 
something that should be explicitly stated?

paul



Re: [RFC] Update Stage 4 description

2019-01-09 Thread Richard Biener
On Wed, 9 Jan 2019, Paul Koning wrote:

> 
> 
> > On Jan 9, 2019, at 3:42 AM, Tom de Vries  wrote:
> > 
> > [ To revisit https://gcc.gnu.org/ml/gcc-patches/2018-04/msg00385.html ]
> > 
> > The current formulation for the description of Stage 4 here (
> > https://gcc.gnu.org/develop.html ) is:
> > ...
> > During this period, the only (non-documentation) changes that may be
> > made are changes that fix regressions.
> > 
> > Other changes may not be done during this period.
> > 
> > Note that the same constraints apply to release branches.
> > 
> > This period lasts until stage 1 opens for the next release.
> > ...
> > 
> > This updated formulation was proposed by Richi (with a request for
> > review of wording):
> > ...
> > During this period, the only (non-documentation) changes that may
> > be made are changes that fix regressions.
> > 
> > -Other changes may not be done during this period.
> > +Other important bugs like wrong-code, rejects-valid or build issues may
> > +be fixed as well.  All changes during this period should be done with
> > +extra care on not introducing new regressions - fixing bugs at all cost
> > +is not wanted.
> ...
> 
> Is there, or should there be, a distinction between primary and non-primary 
> platforms?  While platform bugs typically require fixes in platform-specific 
> code, I would think we would want to stay away from bugfixes in minor 
> platforms during stage 4.  The wording seems to say that I could fix 
> wrong-code bugs in pdp11 during stage 4; I have been assuming I should not do 
> that.  Is this something that should be explicitly stated?

I think it's stated somewhere that during Stage 3, non-primary/secondary
targets as well as non-C/C++ languages have no restrictions.  Of course,
while technically true, breaking builds is still not wanted.

For Stage 4, things are somewhat different, I think; I'm not sure if that's
spelled out anywhere.

Richard.


Re: [RFC] Update Stage 4 description

2019-01-09 Thread Nathan Sidwell

On 1/9/19 4:02 AM, Jonathan Wakely wrote:


>> +extra care on not introducing new regressions - fixing bugs at all cost
>
> ISTM that this should be either "at any cost" or "at all costs". The
> current wording can't make up its mind if it's singular or plural.

Agreed, as a native English speaker, 'at any cost' would be my preferred
formulation.

> I also stumbled over "on not introducing" ... would that be better as
> "to not introduce"?

Either seems fine to me.

nathan
--
Nathan Sidwell


Re: autovectorization in gcc

2019-01-09 Thread David Malcolm
On Wed, 2019-01-09 at 09:56 +, Jonathan Wakely wrote:
> On Wed, 9 Jan 2019 at 09:50, Andrew Haley wrote:
> > I don't agree. Sometimes vectorization is critical. It would be nice
> > to have a warning which would fire if vectorization failed. That would
> > surely help the OP.
> 
> Dave Malcolm has been working on something like that:
> https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01749.html

Yes: this code is in trunk for gcc 9, but it doesn't help much for the
case given elsewhere in this thread:

#include <cmath>

extern float data [ 32768 ] ;

extern void vf1()
{
   #pragma vectorize enable
   for ( int i = 0 ; i < 32768 ; i++ )
     data [ i ] = std::sqrt ( data [ i ] ) ;
}

Compiling on this x86_64 box with -fopt-info-vec-missed shows the
rather cryptic:

g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-missed
/tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
/tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
 missed: statement clobbers memory: __builtin_sqrtf (_1);

and with -fopt-info-vec-all-internals shows:

g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-all-internals

Analyzing loop at /tmp/sqrt-test.cc:8
/tmp/sqrt-test.cc:8:24: note:  === analyze_loop_nest ===
/tmp/sqrt-test.cc:8:24: note:   === vect_analyze_loop_form ===
/tmp/sqrt-test.cc:8:24: missed:   not vectorized: control flow in loop.
/tmp/sqrt-test.cc:8:24: missed:  bad loop form.
/tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
/tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
/tmp/sqrt-test.cc:5:13: note: vectorized 0 loops in function.
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
 note:  === vect_slp_analyze_bb ===
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
 note:   === vect_analyze_data_refs ===
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
 note:   got vectype for stmt: _1 = data[i_12];
vector(8) float
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
 missed:  not vectorized: not enough data-refs in basic block.
/home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
 missed: statement clobbers memory: __builtin_sqrtf (_1);
/tmp/sqrt-test.cc:8:24: note:  === vect_slp_analyze_bb ===
/tmp/sqrt-test.cc:8:24: note:   === vect_analyze_data_refs ===
/tmp/sqrt-test.cc:8:24: note:   got vectype for stmt: data[i_12] = _7;
vector(8) float
/tmp/sqrt-test.cc:8:24: missed:  not vectorized: not enough data-refs in basic block.
/tmp/sqrt-test.cc:10:1: note:  === vect_slp_analyze_bb ===
/tmp/sqrt-test.cc:10:1: note:   === vect_analyze_data_refs ===
/tmp/sqrt-test.cc:10:1: missed:  not vectorized: not enough data-refs in basic block.

I had to turn on -fdump-tree-all to try to figure out what that
"control flow in loop" was; it seems to be a guard against the input
value being negative:

   [local count: 1063004407]:
  # i_12 = PHI <0(2), i_6(7)>
  # ivtmp_10 = PHI <32768(2), ivtmp_2(7)>
  # DEBUG i => i_12
  # DEBUG BEGIN_STMT
  _1 = data[i_12];
  # DEBUG __x => _1
  # DEBUG BEGIN_STMT
  _7 = .SQRT (_1);
  if (_1 u>= 0.0)
goto ; [99.95%]
  else
goto ; [0.05%]

   [local count: 1062472912]:
  goto ; [100.00%]

   [local count: 531495]:
  __builtin_sqrtf (_1);

I'm not sure where that control flow came from: it isn't in
  sqrt-test.cc.104t.stdarg
but is in
  sqrt-test.cc.105t.cdce
so I think it's coming from the argument-range code in cdce.

Arguably the location on the statement is wrong: it's on the loop
header, when it presumably should be on the std::sqrt call.

Shall I file a bugzilla about this?

Dave


Re: autovectorization in gcc

2019-01-09 Thread Jakub Jelinek
On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
> extern void vf1()
> {
>#pragma vectorize enable
>for ( int i = 0 ; i < 32768 ; i++ )
>  data [ i ] = std::sqrt ( data [ i ] ) ;
> }
> 
> Compiling on this x86_64 box with -fopt-info-vec-missed shows the

>   _7 = .SQRT (_1);
>   if (_1 u>= 0.0)
> goto ; [99.95%]
>   else
> goto ; [0.05%]
> 
>[local count: 1062472912]:
>   goto ; [100.00%]
> 
>[local count: 531495]:
>   __builtin_sqrtf (_1);
> 
> I'm not sure where that control flow came from: it isn't in
>   sqrt-test.cc.104t.stdarg
> but is in
>   sqrt-test.cc.105t.cdce
> so I think it's coming from the argument-range code in cdce.
> 
> Arguably the location on the statement is wrong: it's on the loop
> header, when it presumably should be on the std::sqrt call.

See my earlier mail: it is the result of the -fmath-errno default.
The inline-emitted sqrt doesn't handle errno setting, so we emit
essentially x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg);
where the former sqrt is inline using HW instructions and the latter is the
library call.

With some extra work we could vectorize it.  E.g. if we made it handle
OpenMP #pragma omp ordered simd efficiently, it would be the same thing:
allow non-vectorizable portions of vectorized loops by running there a
scalar loop from 0 to vf-1 that does the non-vectorizable stuff, and drop
the limitation that the vectorized loop is a single bb.  Essentially, in
this case it would be
  vec1 = vec_load (data + i);
  vec2 = vec_sqrt (vec1);
  if (__builtin_expect (any (vec1 < 0.0), 0))
    {
      for (int j = 0; j < vf; j++)
        sqrt (vec1[j]);
    }
  vec_store (data + i, vec2);
If that turns out to be way too hard, then for vectorization purposes we
could hide that inside the .SQRT internal fn, say by adding a fndecl
argument to it telling it to treat the exceptional cases in some way, so
that the control flow isn't visible in the vectorized loop.

Jakub
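The loop shape sketched in the message above can be modelled in plain, scalar C++ to make the control structure and the errno contract concrete. This is a hypothetical illustration, not what the vectorizer emits. kVF and sqrt_in_place are made up, and a real implementation would use vector instructions for the lane loops. It relies on GCC's `__builtin_expect`, and on the C library setting errno to EDOM for a negative sqrt argument when `math_errhandling` includes `MATH_ERRNO`.

```cpp
#include <cassert>
#include <cerrno>
#include <cmath>
#include <cstddef>

// Hypothetical vectorization factor: 8 float lanes, matching the
// "vector(8) float" type reported for -mavx2 in this thread.
constexpr std::size_t kVF = 8;

// Scalar model of the sketched loop: a straight-line body over kVF lanes,
// plus a rarely taken fallback that calls the library sqrt only for its
// errno side effect. Values match a plain "data[i] = std::sqrt(data[i])"
// loop (NaN for negative inputs), and errno is set when an input is negative.
void sqrt_in_place(float *data, std::size_t n) {
  std::size_t i = 0;
  for (; i + kVF <= n; i += kVF) {
    float out[kVF];
    bool any_negative = false;
    for (std::size_t j = 0; j < kVF; ++j) {
      const float x = data[i + j];
      // Test the input lane: sqrt of a negative is NaN, and NaN < 0.0f is
      // false, so testing the result would never trigger the fallback.
      any_negative |= x < 0.0f;
      // Stands in for a hardware sqrt that never touches errno.
      out[j] = x >= 0.0f ? std::sqrt(x) : std::nanf("");
    }
    if (__builtin_expect(any_negative, 0)) {
      for (std::size_t j = 0; j < kVF; ++j)
        if (data[i + j] < 0.0f)
          (void)std::sqrt(data[i + j]);  // result unused: called for errno only
    }
    for (std::size_t j = 0; j < kVF; ++j) data[i + j] = out[j];
  }
  for (; i < n; ++i) data[i] = std::sqrt(data[i]);  // scalar remainder
}
```

The fallback is only taken when some lane's input is negative, so the common path stays straight-line code. That is what would let the non-vectorizable part sit outside an otherwise single-bb loop body.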


Re: autovectorization in gcc

2019-01-09 Thread David Malcolm
On Wed, 2019-01-09 at 11:10 -0500, David Malcolm wrote:
> On Wed, 2019-01-09 at 09:56 +, Jonathan Wakely wrote:
> > On Wed, 9 Jan 2019 at 09:50, Andrew Haley wrote:
> > > I don't agree. Sometimes vectorization is critical. It would be nice
> > > to have a warning which would fire if vectorization failed. That would
> > > surely help the OP.
> > 
> > Dave Malcolm has been working on something like that:
> > https://gcc.gnu.org/ml/gcc-patches/2018-09/msg01749.html
> 
> Yes: this code is in trunk for gcc 9, but it doesn't help much for the
> case given elsewhere in this thread:
> 
> #include <cmath>
> 
> extern float data [ 32768 ] ;
> 
> extern void vf1()
> {
>    #pragma vectorize enable
>    for ( int i = 0 ; i < 32768 ; i++ )
>      data [ i ] = std::sqrt ( data [ i ] ) ;
> }
> 
> Compiling on this x86_64 box with -fopt-info-vec-missed shows the
> rather cryptic:
> 
> g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-missed
> /tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
> /tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
> /home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
>  missed: statement clobbers memory: __builtin_sqrtf (_1);
> 
> and with -fopt-info-vec-all-internals shows:
> 
> g++ -c /tmp/sqrt-test.cc -O3 -mavx2 -fopt-info-vec-all-internals
> 
> Analyzing loop at /tmp/sqrt-test.cc:8
> /tmp/sqrt-test.cc:8:24: note:  === analyze_loop_nest ===
> /tmp/sqrt-test.cc:8:24: note:   === vect_analyze_loop_form ===
> /tmp/sqrt-test.cc:8:24: missed:   not vectorized: control flow in loop.
> /tmp/sqrt-test.cc:8:24: missed:  bad loop form.
> /tmp/sqrt-test.cc:8:24: missed: couldn't vectorize loop
> /tmp/sqrt-test.cc:8:24: missed: not vectorized: control flow in loop.
> /tmp/sqrt-test.cc:5:13: note: vectorized 0 loops in function.
> /home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
>  note:  === vect_slp_analyze_bb ===
> /home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
>  note:   === vect_analyze_data_refs ===
> /home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
>  note:   got vectype for stmt: _1 = data[i_12];
> vector(8) float
> /home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
>  missed:  not vectorized: not enough data-refs in basic block.
> /home/david/coding/gcc-python/gcc-svn-trunk/install-dogfood/include/c++/9.0.0/cmath:464:27:
>  missed: statement clobbers memory: __builtin_sqrtf (_1);
> /tmp/sqrt-test.cc:8:24: note:  === vect_slp_analyze_bb ===
> /tmp/sqrt-test.cc:8:24: note:   === vect_analyze_data_refs ===
> /tmp/sqrt-test.cc:8:24: note:   got vectype for stmt: data[i_12] = _7;
> vector(8) float
> /tmp/sqrt-test.cc:8:24: missed:  not vectorized: not enough data-refs in basic block.
> /tmp/sqrt-test.cc:10:1: note:  === vect_slp_analyze_bb ===
> /tmp/sqrt-test.cc:10:1: note:   === vect_analyze_data_refs ===
> /tmp/sqrt-test.cc:10:1: missed:  not vectorized: not enough data-refs in basic block.
> 
> I had to turn on -fdump-tree-all to try to figure out what that
> "control flow in loop" was; it seems to be a guard against the input
> value being negative:
> 
>[local count: 1063004407]:
>   # i_12 = PHI <0(2), i_6(7)>
>   # ivtmp_10 = PHI <32768(2), ivtmp_2(7)>
>   # DEBUG i => i_12
>   # DEBUG BEGIN_STMT
>   _1 = data[i_12];
>   # DEBUG __x => _1
>   # DEBUG BEGIN_STMT
>   _7 = .SQRT (_1);
>   if (_1 u>= 0.0)
> goto ; [99.95%]
>   else
> goto ; [0.05%]
> 
>[local count: 1062472912]:
>   goto ; [100.00%]
> 
>[local count: 531495]:
>   __builtin_sqrtf (_1);
> 
> I'm not sure where that control flow came from: it isn't in
>   sqrt-test.cc.104t.stdarg
> but is in
>   sqrt-test.cc.105t.cdce
> so I think it's coming from the argument-range code in cdce.
> 
> Arguably the location on the statement is wrong: it's on the loop
> header, when it presumably should be on the std::sqrt call.
> 
> Shall I file a bugzilla about this?

...and -fno-tree-builtin-call-dce eliminates the control flow, but it
still doesn't vectorize the loop; on godbolt.org with:
  -O3 -mavx2 -fopt-info-vec-all -fno-tree-builtin-call-dce
gcc trunk x86_64 gives:

<source>:8:24: missed: couldn't vectorize loop
/opt/compiler-explorer/gcc-trunk-20190109/include/c++/9.0.0/cmath:464:27: 
missed: statement clobbers memory: _7 = __builtin_sqrtf (_1);
<source>:5:13: note: vectorized 0 loops in function.
/opt/compiler-explorer/gcc-trunk-20190109/include/c++/9.0.0/cmath:464:27: 
missed: statement clobbers memory: _7 = __builtin_sqrtf (_1);
Compiler returned: 0

...so presumably it doesn't know how to vectorize that builtin call.

Dave
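Following the errno explanation earlier in this thread, a variant worth trying is to build the same loop with -fno-math-errno (also implied by -ffast-math). With that flag, std::sqrt on float no longer has to set errno, so neither the cdce guard nor the "statement clobbers memory" objection should apply. This is a minimal sketch, assuming errno handling is the only obstacle. The function name sqrt_all and the file name sqrt-all.cc are hypothetical. The expectation that -fopt-info-vec then reports the loop as vectorized has not been checked here against the trunk build used above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Same loop body as sqrt-test.cc, but with the array passed in so the
// function can be exercised. Intended to be built with, e.g.,
//   g++ -c -O3 -mavx2 -fno-math-errno -fopt-info-vec sqrt-all.cc
// Without the errno requirement, the call no longer needs a library-call
// fallback for negative inputs inside the loop.
void sqrt_all(float *data, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    data[i] = std::sqrt(data[i]);
}
```

The trade-off is semantic. Under -fno-math-errno, a negative input still yields NaN, but errno is no longer guaranteed to be set, so code that checks errno after sqrt would silently change behavior.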



Re: [RFC] Update Stage 4 description

2019-01-09 Thread Joseph Myers
On Wed, 9 Jan 2019, Paul Koning wrote:

> Is there, or should there be, a distinction between primary and 
> non-primary platforms?  While platform bugs typically require fixes in 
> platform-specific code, I would think we would want to stay away from 
> bugfixes in minor platforms during stage 4.  The wording seems to say 
> that I could fix wrong-code bugs in pdp11 during stage 4; I have been 
> assuming I should not do that.  Is this something that should be 
> explicitly stated?

In target-specific code for a minor target you can more or less do as you 
want - but the decision on when to branch won't take account of what 
you're doing for a minor target, so any major work runs the risk of the 
branch happening at an unstable point in the middle of that work.

-- 
Joseph S. Myers
jos...@codesourcery.com