Re: Question about wrapv-vect-reduc-dot-s8b.c
> the dump-scans.  Can we do sth like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?

It took some fighting with Tcl syntax, but it looks like it is possible.
I arrived at the regex pattern

  "vect_recog_dot_prod_pattern: detected(?:(?!failed).)*succeeded"

This seems to work for the failing cases and I'm going to send a patch
tomorrow if an x86 testsuite run is unchanged.

Regards
 Robin
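For the testcase itself, the directive would then look roughly like this (the dump file name "vect" is from memory, so treat the exact spelling as an assumption):

```
/* Tempered pattern: after "detected", allow anything that does not
   start a "failed", then require a "succeeded".  */
/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected(?:(?!failed).)*succeeded" "vect" } } */
```

The `(?:(?!failed).)*` part consumes one character at a time, but only at positions where "failed" does not begin, so a discarded try that dumped "failed" in between can no longer satisfy the scan.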
Re: Question about wrapv-vect-reduc-dot-s8b.c
>> I am wondering whether we do have some situations that
>> vec_pack/vec_unpack/vec_widen_xxx/dot_prod pattern can be
>> beneficial for RVV ?  I have ever met some situation that vec_unpack
>> can be beneficial when working on SELECT_VL but I don't which
>> case
>
> With fixed size vectors you'll face the situation that the vectorizer
> chooses the "wrong" vector type so yes, I think implementing
> vec_unpack[s]_{lo,hi} might be useful.  But I wouldn't prioritize this
> until you have a more clear picture of how useful it would be.

Another thing that comes to mind is that we currently don't do
vectorizable calls with mismatched vector sizes.  So even if we detected
e.g. vec_widen_plus early it wouldn't get us much further.  On the other
hand, I don't think we perform many optimizations on such patterns
between vect and combine (where we finally generate those).

Regards
 Robin
Re: Question about wrapv-vect-reduc-dot-s8b.c
> it's target dependent what we choose first so it's going to be
> a bit difficult to adjust testcases like this (and it looks like
> a testsuite issue).  I think for this specific testcase changing
> scan-tree-dump-times to scan-tree-dump is reasonable.  Note we
> really want to check that for the case we choose finally
> we use the sdot pattern, but I don't see how we can easily constrain
> the dump-scans.  Can we do sth like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?
>
> The other way would be to somehow add a dump flag that produces
> dumps only for the succeeded part.  Of course we have targets that
> evaluate multiple succeeded parts for costing (but luckily costing
> is disabled for most tests).

I'm going to have a try at fixing the test expectations, but not before
tomorrow.  Right now I can see "vect_recog_widen_mult_pattern: detected"
even four times in the dump, 2x for each try, so I first need to
understand what's going on.

Regards
 Robin
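As a sketch, the minimal change discussed above would be along these lines (dump name and count taken as assumptions, not copied from the testcase):

```
/* Before: insists on exactly one detection, so an additional detection
   in a later discarded try makes the scan fail.  */
/* { dg-final { scan-tree-dump-times "vect_recog_dot_prod_pattern: detected" 1 "vect" } } */

/* After: only requires that the pattern was detected at least once.  */
/* { dg-final { scan-tree-dump "vect_recog_dot_prod_pattern: detected" "vect" } } */
```

This loses the guarantee that the finally chosen variant used the pattern, which is exactly the weakness the tempered regex tries to address.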
Re: Question about wrapv-vect-reduc-dot-s8b.c
>> To fix it, is it necessary to support 'vec_unpack' ?
>
> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
> ties its hands by choosing vector types early and based on the number
> of incoming/outgoing vectors it chooses one or the other method.
>
> More precise dumping would probably help here but somewhere earlier you
> should be able to see the vector type used for _2

We usually try with a "normal" mode like VNx4SI (RVVM1SI or so) and then
switch to VNx4QI (i.e. a mode that only determines the number of
units/elements) and have vectorize_related_mode return modes with the
same number of units.  This will then result in the sext/zext patterns
matching.  The first round, where we try the normal mode, will not match
those because the related mode has a different number of units.  So it's
somewhat expected that the first try fails.  My dump shows that we
vectorize, so IMHO no problem.

I can take a look at this but it doesn't look like a case for
pack/unpack.

Regards
 Robin
Vectorization regression on s390x GCC6 vs GCC5
Hi,

while analyzing a test case with a lot of nested loops (>7) and double
floating-point operations I noticed a performance regression of GCC 6/7
vs GCC 5 on s390x.  It seems due to GCC 6 vectorizing something GCC 5
couldn't.

Basically, each loop iterates over three dimensions; we fully unroll
some of the inner loops until we have straight-line code of roughly 2000
insns that are executed three times in GCC 5.  GCC 6 vectorizes two
iterations and adds a scalar epilogue for the third iteration.  The
epilogue code is so bad that it slows down the execution by at least
50%, using only two hard registers and lots of spill slots.  Although my
analysis is not complete, I believe this is because register pressure is
high in the epilogue and the live ranges span the vectorized code as
well as the epilogue.

Even reduced, the test case is huge, therefore I didn't include it.
Some high-level questions instead:

- Has anybody else observed similar problems and got around them?

- Is there some way around the register pressure/long live ranges?
  Perhaps something we could/should fix in the s390 backend?  (Probably
  hard to tell without the source.)

- Would it make sense to allow a backend to specify the minimal number
  of loop iterations considered for vectorization?  Is this perhaps
  already possible somehow?  I added a check to disable vectorization
  for loops with <= 3 iterations that shows no regressions and improves
  two SPEC benchmarks noticeably.  I'm even considering <= 5, since a
  vectorization factor of 4 should exhibit the same problematic pattern.

Regards
 Robin
Re: k-byte memset/memcpy/strlen builtins
> Yes, for memset with larger element we could add an optab plus
> internal function combination and use that when the target wants.  Or
> always use such IFN and fall back to loopy expansion.

So adding additional patterns in tree-loop-distribution.c (and mapping
them to dedicated optabs) is fine?  Or does the "yes" refer to the
"else"/"or" part of my question (how would the backend recognize the
patterns then)?

> I'd say a multibyte memchr might make sense, but strlen specifically?
> Not sure.

OK, memchr would also work for the snippet I have in mind.

Regards
 Robin
k-byte memset/memcpy/strlen builtins
Hi,

when examining the performance of some test cases on s390 I realized
that we could do better for constructs like 2-byte memcpys or
2-byte/4-byte memsets.  Due to some s390-specific architectural
properties, we could be faster by e.g. avoiding excessive unrolling and
using dedicated memory instructions (or similar).

For 1-byte memset/memcpy the builtin functions provide a straightforward
way to achieve this.  At first sight it seemed possible to extend
tree-loop-distribution.c to include the additional variants we need.
However, multibyte memsets/memcpys are not covered by the C standard and
I'm therefore unsure whether such an approach is preferable or whether
there are more idiomatic ways or places to add the functionality.

The same question goes for 2-byte strlen.  I didn't see a recognition
pattern for strlen (apart from optimizations due to known string length
in tree-ssa-strlen.c).  Would it make sense to include strlen
recognition and subsequently handling for 2-byte strlen?  The situation
might of course be more complicated than memset because of encodings
etc.  My snippet in question used a fixed-length encoding of 2 bytes,
however.

Another simple idea to tackle this would be a peephole optimization, but
I'm not sure if this is really feasible for something like memset.
Wouldn't the peephole have to be recursive then?

Regards
 Robin