https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88545

--- Comment #12 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
I had a bug in the benchmark: I forgot to pin it with taskset.

These are the correct ones:

+--------+-----------+---------+---------+
| NEEDLE | scalar 1x | vect    | memchr  |
+--------+-----------+---------+---------+
| 1      | -0.14%    | 174.95% | 373.69% |
| 0      | 0.00%     | -90.60% | -95.21% |
| 100    | 0.03%     | -80.28% | -80.39% |
| 1000   | 0.00%     | -89.46% | -94.06% |
| 10000  | 0.00%     | -90.33% | -95.19% |
| -1     | 0.00%     | -90.60% | -95.21% |
+--------+-----------+---------+---------+

So this shows that on modern cores the unrolled scalar loop makes no measurable
difference, so we should just remove it.

It also shows that memchr is universally faster and that for the rest the
vectorizer does a pretty good job.  We'll get some additional speedups there
soon as well, but memchr should still win as it's hand-tuned.

So I think for 1-byte types we should use memchr, and for the rest we should
remove the unrolled code and let the vectorizer handle it.
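
To make the proposal concrete, here is a minimal sketch of that dispatch.  It
is not the libstdc++ implementation: find_sketch and find_impl are
hypothetical names, it only handles raw pointers, and the "1-byte" test
(integral type with sizeof 1) is an assumption about which types would
qualify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// 1-byte integral types: hand the search to memchr.
template <typename T>
const T* find_impl(const T* first, const T* last, const T& value,
                   std::true_type)
{
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n == 0)
        return last;  // avoid calling memchr on an empty (possibly null) range
    const void* p = std::memchr(first, static_cast<unsigned char>(value), n);
    return p ? static_cast<const T*>(p) : last;
}

// Everything else: a plain loop with no manual unrolling, left to the
// vectorizer.
template <typename T>
const T* find_impl(const T* first, const T* last, const T& value,
                   std::false_type)
{
    for (; first != last; ++first)
        if (*first == value)
            return first;
    return last;
}

// Hypothetical entry point; returns last when value is not found.
template <typename T>
const T* find_sketch(const T* first, const T* last, const T& value)
{
    assert(first <= last);  // precondition: a valid [first, last) range
    using one_byte = std::integral_constant<
        bool, std::is_integral<T>::value && sizeof(T) == 1>;
    return find_impl(first, last, value, one_byte{});
}
```

With this shape, a search over a char buffer goes through memchr, while a
search over an int buffer compiles to the simple loop.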
