Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc
> the dump-scans.  Can we do something like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?

It took some fighting with Tcl syntax until I arrived at the regex
pattern below, but it looks like it is possible.

 "vect_recog_dot_prod_pattern: detected(?:(?!failed).)*succeeded".

This seems to work for the failing cases and I'm going to send a patch
tomorrow if an x86 testsuite run is unchanged.
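
To spell out what the pattern accepts: after the dot-prod "detected"
line, any text may follow as long as no "failed" starts before the next
"succeeded". The following is a minimal C sketch of the same check using
plain string search instead of a regex. The function name
dot_prod_detected_then_succeeded is hypothetical, and newlines are treated
as ordinary characters; this is an illustration, not what the testsuite
runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns true if DUMP contains the dot-prod
   "detected" marker followed by "succeeded" with no "failed" starting
   anywhere in between.  */
bool
dot_prod_detected_then_succeeded (const char *dump)
{
  static const char detected[] = "vect_recog_dot_prod_pattern: detected";
  const char *p = dump;

  while ((p = strstr (p, detected)) != NULL)
    {
      const char *rest = p + strlen (detected);
      const char *ok = strstr (rest, "succeeded");
      const char *bad = strstr (rest, "failed");

      if (ok == NULL)
        return false;   /* No "succeeded" here or after any later match.  */
      if (bad == NULL || ok < bad)
        return true;    /* "succeeded" comes first: accepted.  */
      p = rest;         /* A "failed" intervened; try the next "detected".  */
    }
  return false;
}
```

Only the first "succeeded" after each "detected" needs checking: if a
"failed" precedes it, that "failed" also precedes every later "succeeded".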

Regards
 Robin


Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc


>> I am wondering whether there are situations where the
>> vec_pack/vec_unpack/vec_widen_xxx/dot_prod patterns can be
>> beneficial for RVV? I once ran into a situation where vec_unpack
>> was beneficial when working on SELECT_VL, but I don't remember
>> which case.
> 
> With fixed size vectors you'll face the situation that the vectorizer
> chooses the "wrong" vector type so yes, I think implementing
> vec_unpack[s]_{lo,hi} might be useful.  But I wouldn't prioritize this
> until you have a clearer picture of how useful it would be.

Another thing that comes to mind is that we currently don't do
vectorizable calls with mismatched vector sizes.  So even if we detected
e.g. vec_widen_plus early it wouldn't get us much further.
On the other hand, I don't think we perform many optimizations on such
patterns between vect and combine (where we finally generate those).

Regards
 Robin



Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc
> it's target dependent what we choose first so it's going to be
> a bit difficult to adjust testcases like this (and it looks like
> a testsuite issue).  I think for this specific testcase changing
> scan-tree-dump-times to scan-tree-dump is reasonable.  Note that what
> we really want to check is that the sdot pattern is used for the case
> we finally choose, but I don't see how we can easily constrain
> the dump-scans.  Can we do something like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?
> 
> The other way would be to somehow add a dump flag that produces
> dumps only for the succeeded part.  Of course we have targets that
> evaluate multiple succeeded parts for costing (but luckily costing
> is disabled for most tests).

I'm going to have a try at fixing the test expectations but not before
tomorrow.  Right now I can see
 "vect_recog_widen_mult_pattern: detected"
as many as four times in the dump, twice for each try, so I first need
to understand what's going on.

Regards
 Robin


Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc
>> To fix it, is it necessary to support 'vec_unpack' ?
> 
> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
> ties its hands by choosing vector types early and based on the number
> of incoming/outgoing vectors it chooses one or the other method.
> 
> More precise dumping would probably help here but somewhere earlier you
> should be able to see the vector type used for _2.

We usually try with a "normal" mode like VNx4SI (RVVM1SI or so) and
then switch to VNx4QI (i.e. a mode that only determines the number of
units/elements) and have vectorize_related_mode return modes with the
same number of units.  This will then result in the sext/zext patterns
matching.  The first round where we try the normal mode will not match
those because the related mode has a different number of units.

So it's somewhat expected that the first try fails.

My dump shows that we vectorize, so IMHO no problem.  I can take a look
at this but it doesn't look like a case for pack/unpack.  
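
For reference, a loop of the general shape under discussion: signed
8-bit inputs that are sign-extended before the multiply and accumulated
in a wider type. This is a hypothetical sketch, not the source of
wrapv-vect-reduc-dot-s8b.c; the name dot_s8 and the 32-bit accumulator
are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dot product over signed 8-bit elements.  Each element is
   sign-extended to 32 bits before the multiply, so the narrow inputs and
   the wide products have the same number of elements.  */
int32_t
dot_s8 (const int8_t *x, const int8_t *y, size_t n)
{
  int32_t sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += (int32_t) x[i] * (int32_t) y[i];
  return sum;
}
```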

Regards
 Robin


Vectorization regression on s390x GCC6 vs GCC5

2017-01-26 Thread Robin Dapp
Hi,

while analyzing a test case with many nested loops (more than 7) and
double-precision floating point operations, I noticed a performance
regression of GCC 6/7 vs GCC 5 on s390x. It seems to be due to GCC 6
vectorizing something GCC 5 couldn't.

Basically, each loop iterates over three dimensions, and we fully unroll
some of the inner loops until we have straight-line code of roughly 2000
insns that is executed three times in GCC 5. GCC 6 vectorizes two
iterations and adds a scalar epilogue for the third iteration. The
epilogue code is so bad that it slows down execution by at least
50%, using only two hard registers and lots of spill slots.
Although my analysis is not yet complete, I believe this is because
register pressure is high in the epilogue and the live ranges span the
vectorized code as well as the epilogue.
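
To illustrate the structure described above (not the actual test case,
which is far larger), here is a hand-written C sketch of how a loop is
split for a vectorization factor of 2: a main loop handling two
elements per iteration plus a scalar epilogue for the leftover. The
function names and the multiply-add body are assumptions chosen for
brevity.

```c
#include <stddef.h>

/* Scalar reference: a[i] += b[i] * c[i] for n elements.  */
void
fma_scalar (double *a, const double *b, const double *c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    a[i] += b[i] * c[i];
}

/* Same computation, split the way a VF = 2 vectorization splits it.  */
void
fma_split_vf2 (double *a, const double *b, const double *c, size_t n)
{
  size_t i = 0;

  /* Main loop: two elements per iteration stand in for one vector op.  */
  for (; i + 2 <= n; i += 2)
    {
      a[i] += b[i] * c[i];
      a[i + 1] += b[i + 1] * c[i + 1];
    }

  /* Scalar epilogue: at most one leftover element.  */
  for (; i < n; i++)
    a[i] += b[i] * c[i];
}
```

With n = 3 the main loop runs once and the epilogue runs once, so any
values live across both parts stay live through the whole body.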

Even reduced, the test case is huge, therefore I didn't include it. Some
high-level questions instead:

- Has anybody else observed similar problems and got around them?

- Is there some way around the register pressure/long live ranges?
Perhaps something we could/should fix in the s390 backend? (Probably
hard to tell without source)

- Would it make sense to allow a backend to specify the minimal number
of loop iterations considered for vectorization? Is this
perhaps already possible somehow? I added a check to disable
vectorization for loops with <= 3 iterations that shows no regressions
and improves two SPEC benchmarks noticeably. I'm even considering <= 5,
since a vectorization factor of 4 should exhibit the same problematic
pattern.
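
The arithmetic behind the <= 3 and <= 5 thresholds can be sketched as
follows. The struct and function names are hypothetical and not part of
GCC; the sketch only splits a trip count into full vector iterations
and leftover scalar iterations, assuming a nonzero vectorization factor.

```c
/* Hypothetical split of a trip count for a given vectorization factor.  */
struct vect_split
{
  unsigned vector_iters;   /* full vector iterations */
  unsigned epilogue_iters; /* leftover scalar iterations */
};

/* VF is assumed to be nonzero.  */
struct vect_split
split_iterations (unsigned niters, unsigned vf)
{
  struct vect_split s = { niters / vf, niters % vf };
  return s;
}
```

Three iterations at a factor of 2 and five iterations at a factor of 4
both give one vector iteration plus one epilogue iteration, which is the
shape described above.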

Regards
 Robin



Re: k-byte memset/memcpy/strlen builtins

2017-01-12 Thread Robin Dapp
> Yes, for memset with larger element we could add an optab plus
> internal function combination and use that when the target wants.  Or
> always use such IFN and fall back to loopy expansion.

So, adding additional patterns in tree-loop-distribution.c (and mapping
them to dedicated optabs) is fine? Or does the "yes" refer to the
"else"/"or" part of my question (how would the backend recognize the
patterns then)?

> I'd say a multibyte memchr might make sense, but strlen specifically?
> Not sure.

ok, memchr would also work for the snippet I have in mind.
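
A multibyte memchr in this sense could look like the sketch below. The
name memchr16 is hypothetical (it is not a standard or GCC builtin), and
the element count n is an assumed bound supplied by the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-byte memchr: returns a pointer to the first element
   equal to C among the first N elements of S, or NULL if none.  */
const uint16_t *
memchr16 (const uint16_t *s, uint16_t c, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (s[i] == c)
      return &s[i];
  return NULL;
}
```

Searching for 0 within a known buffer size yields the position of the
terminator, which covers the strlen-like use.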

Regards
 Robin



k-byte memset/memcpy/strlen builtins

2017-01-11 Thread Robin Dapp
Hi,

When examining the performance of some test cases on s390 I realized
that we could do better for constructs like 2-byte memcpys or
2-byte/4-byte memsets. Due to some s390-specific architectural
properties, we could be faster by e.g. avoiding excessive unrolling and
using dedicated memory instructions (or similar).

For 1-byte memset/memcpy the builtin functions provide a straightforward
way to achieve this. At first sight it seemed possible to extend
tree-loop-distribution.c to include the additional variants we need.
However, multibyte memsets/memcpys are not covered by the C standard, and
I'm therefore unsure whether such an approach is preferable or whether
there are more idiomatic ways or places to add the functionality.
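
For concreteness, the kind of loop meant by a 2-byte memset is sketched
below. The name memset16 is hypothetical; no such function exists in the
C standard library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-byte memset: stores VAL into each of the N elements
   of DST.  */
void
memset16 (uint16_t *dst, uint16_t val, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = val;
}
```

The standard memset can only express this loop when both bytes of val
are equal, since it fills memory with a single byte value.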

The same question goes for 2-byte strlen. I didn't see a recognition
pattern for strlen (apart from optimizations due to known string length
in tree-ssa-strlen.c). Would it make sense to include strlen recognition
and subsequent handling for 2-byte strlen? The situation might of
course be more complicated than for memset because of encodings etc. My snippet
in question used a fixed-length encoding of 2 bytes, however.
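
A 2-byte strlen over a fixed-length 2-byte encoding would be a loop of
roughly this shape. The name strlen16 is hypothetical, and the sketch
assumes the string is terminated by a 16-bit zero element.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2-byte strlen: counts 16-bit elements before the first
   zero element.  */
size_t
strlen16 (const uint16_t *s)
{
  size_t len = 0;
  while (s[len] != 0)
    len++;
  return len;
}
```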

Another simple idea to tackle this would be a peephole optimization but
I'm not sure if this is really feasible for something like memset.
Wouldn't the peephole have to be recursive then?

Regards
 Robin