On Tue, 27 Jul 2021 02:32:12 PDT (-0700), cmuell...@gcc.gnu.org wrote:
Ok, so if I understand correctly, Palmer and Andrew prefer
overlap_op_by_pieces to be controlled
by its own field in the riscv_tune_param struct, and not by the
slow_unaligned_access field in that struct
(i.e. slow_unaligned_access==false is not enough to imply
overlap_op_by_pieces==true).

I guess, but I'm not really worried about this at that level of detail right now. It's not like the tune structures form any sort of external interface we have to keep stable; we can do whatever we want with those fields, so I'd just aim for encoding the desired behavior as simply as possible rather than trying to build something extensible.
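For concreteness, the simplest encoding of that knob would be something like the sketch below (a cut-down illustration only; the real riscv_tune_param has many more fields, and the values shown for the C906 are assumptions, not measurements):

  #include <stdbool.h>

  /* Cut-down sketch of a dedicated tune knob; illustrative only, not
     the real riscv_tune_param.  */
  struct riscv_tune_param_sketch
  {
    bool slow_unaligned_access;
    bool overlap_op_by_pieces;   /* independent of the field above */
  };

  /* Hypothetical C906 entry: fast unaligned access (false below) does
     not have to imply overlapping stores (true below); the two knobs
     stay independent.  */
  static const struct riscv_tune_param_sketch thead_c906_tune_sketch =
  {
    false,  /* slow_unaligned_access */
    true,   /* overlap_op_by_pieces */
  };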

There are really two questions we need to answer: is this code actually faster for the C906, and is this what the average user wants under -Os. That first one is pretty easy: just running those simple code sequences under a sweep of page offsets should be sufficient to determine whether this is always faster (in which case it's an easy yes), whether it's always slower (an easy no), or whether there are some slow cases like page/cache line crossing (in which case we'd need to think a bit). A rough sketch of such a sweep follows below.
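As a minimal sketch of that sweep (everything here is an assumption: rdcycle availability in user mode, the 15-byte memset case quoted later in this thread, and the iteration counts), one could do something like:

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>

  /* Read the RISC-V cycle CSR.  */
  static inline uint64_t rdcycle (void)
  {
    uint64_t c;
    asm volatile ("rdcycle %0" : "=r" (c));
    return c;
  }

  int main (void)
  {
    /* Two pages, so a 15-byte write near the end of the first page can
       cross into the second.  */
    char *buf = mmap (0, 2 * 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
      return 1;
    for (int off = 0; off < 4096; off += 8)
      {
        uint64_t start = rdcycle ();
        for (int i = 0; i < 10000; i++)
          __builtin_memset (buf + off, 0, 15);  /* sequence under test */
        printf ("%d %lu\n", off, (unsigned long) (rdcycle () - start));
      }
    return 0;
  }

Plotting cycles against offset should make any page- or line-crossing penalty stand out immediately.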

The second one is a bit trickier. In the past we'd said this sort of "actively misalign accesses to generate smaller code" thing isn't suitable for -Os (as most machines still have very slow unaligned accesses) but is suitable for -Oz (I don't remember if that ever ended up in GCC, though). That still seems like a reasonable decision, but if implementations with fast unaligned accesses become the norm then it'd probably be worth revisiting. I'm not sure exactly how to determine that tipping point, but I think we're a long way away from it right now.

IMO it's really just premature to try to design an encoding of the tuning parameters until we have an idea of what they are, as we'll just end up going down the path of trying to encode all possible hardware, and that's generally a huge waste of time. Since there's no ABI here we can refactor this however we want as new tunings show up.

I don't have access to pipeline details that would prove whether there are
cases where this patch causes a performance penalty.

So, I leave this here as a summary for someone who has enough information and
interest to move this forward:
* the original patch should be sufficient, but does not have tests:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575791.html
* the tests can be taken from this patch:
  https://gcc.gnu.org/pipermail/gcc-patches/2021-July/575864.html
  Note that there is a duplicated "sw" in builtins-overlap-6.c, which
should be an "sd" (a rough sketch of such a test follows below).
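For illustration, a test along those lines might look roughly like this (the dg directives, options, and scan counts are assumptions based on the 15-byte example quoted later in this thread, not the actual patch):

  /* { dg-do compile } */
  /* { dg-options "-O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64" } */

  void foo (char *dst)
  {
    __builtin_memset (dst, 0, 15);
  }

  /* With overlapping enabled, two overlapping sd stores should cover
     all 15 bytes, so no sb/sh tail should remain (assumed counts).  */
  /* { dg-final { scan-assembler-times "sd\t" 2 } } */
  /* { dg-final { scan-assembler-not "sb\t" } } */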

Thanks for the feedback!

Cool. Looks like the C906 is starting to show up in the real world, so we should be able to find someone who has access to one and cares enough to at least run some simple benchmarks of these code sequences. IMO that's a pretty low interest bar, so I don't see any harm in waiting -- when the hardware is common then I'm sure someone will care enough to give this a shot, and until then it's not really impacting anyone either way.

The -Os thing is a bigger discussion, and while I'm happy to have it I don't really think we're even close to these being common enough yet. I saw your memmove patch and think the same rationale might apply there, but I haven't looked closely and won't have time to for a bit, as I've got other projects to get to.

On Tue, Jul 27, 2021 at 3:48 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:

On Mon, 26 Jul 2021 03:05:21 PDT (-0700), Andrew Waterman wrote:
> On Thu, Jul 22, 2021 at 10:27 AM Palmer Dabbelt <pal...@dabbelt.com> wrote:
>>
>> On Thu, 22 Jul 2021 06:29:46 PDT (-0700), gcc-patches@gcc.gnu.org wrote:
>> > Could you add a testcase? Otherwise LGTM.
>> >
>> > Option: -O2 -mtune=thead-c906 -march=rv64gc -mabi=lp64
>> > void foo(char *dst){
>> >    __builtin_memset(dst, 0, 15);
>> > }
>>
>> I'd like to see:
>>
>> * Test results.  This is only on for one target right now, so relying on
>>   it to just work on others isn't a good idea.
>> * Something to demonstrate this doesn't break -mstrict-align.
>> * Some sort of performance analysis.  Most machines that support
>>   unaligned access do so with some performance degradation,
>
> Also, some machines that gracefully support misaligned accesses under
> most circumstances nevertheless experience a perf degradation when the
> load depends on two stores that overlap partially but not fully.  This
> transformation will obviously trigger such behavior from time to time.

Ya, I thought I wrote a response to this but I guess it's just in a
buffer somewhere.  The code sequences this is generating are really the
worst case for unaligned stores: one of them is always guaranteed to be
misaligned, and it partially overlaps with a store one cycle away.

We're really only saving a handful of instructions at best here, so
there's not much room for error when it comes to these sorts of things.
Even if this difficult case is handled fast we're only talking about
saving 2 cycles, which is pretty borderline, as the single-issue in-order
machines I've worked with that do support misaligned accesses tend to
take at least a few cycles of performance hit on misaligned accesses.
Even if misaligned accesses are single cycle, some back-of-the-envelope
calculation says that a pipeline flush when crossing a cache line would
definitely push this negative and one per page crossing would be in the
margins (depending on how we assume the original accesses are aligned).
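To make that concrete (all numbers assumed for illustration: dst 8-byte
aligned, 64-byte cache lines, 4 KiB pages, and a ~20-cycle flush on a
line-crossing store): the second sd lands at offset 7, so it crosses a
cache line for 1 of the 8 possible base positions within a line, giving
an expected cost of about 20/8 = 2.5 cycles -- already more than the ~2
cycles the shorter sequence saves.  A page crossing happens for only 1
in 512 aligned bases, so it only matters if that penalty runs to
hundreds of cycles.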

This is way too subtle a thing to analyze without at least some
knowledge of how this pipeline works, whether that's from a benchmark or
just a pipeline description.

> Note, I'm not objecting to this patch; I'm only responding to Palmer's
> comment.  It makes sense to enable this kind of optimization for
> -mtune=native etc., just not for standard software distributions.

IMO there are really two cases here: -mtune=c906 and -Os (-mtune=native
is sort of a red herring; it's just syntax).  For -mtune=c906 I'm happy
enabling this as long as it's actually correct and improves performance,
but that'll need at least some hardware-oriented analysis -- whether
it's a benchmark or just a confirmation that these sequences are
actually expected to run fast.

-Os is a different case, though.  Last time this came up we decided that
-Os shouldn't purposefully misalign accesses, even when that saves code
size, as that's too likely to hit pathological performance cases.  I
don't know if the weights have changed here: the C906 is currently by
far the cheapest chip (which likely means it's going to be the most
popular), but it's unclear whether it's even RISC-V compliant and
I don't know of many people who've actually gotten one.  IMO it's too
early to flip -Os over to this behavior; we at least need to know that
this chip is going to be sufficiently RISC-V-ey that standard software
will even run on it, much less be popular.

>
>
>>   it's unclear
>>   that this conversion will actually manifest a performance improvement.
>>   I don't have a C906 and don't know what workloads people care about
>>   running on one, but I'm certainly not convinced this is a win --
>>   what's listed here looks to be the best case, and that's only saving
>>   two instructions to generate a pretty pathological sequence
>>   (misaligned access that conflicts with a prior store).

Ah, I guess there it was ;)

>>
>> Jojo: do you have any description of the C906 pipeline?  Specifically in
>> this case it'd be good to know how it handles unaligned accesses.
>>
>> >
>> > On Thu, Jul 22, 2021 at 8:53 PM Christoph Muellner via Gcc-patches
>> > <gcc-patches@gcc.gnu.org> wrote:
>> >>
>> >> This patch enables the overlap-by-pieces feature of the by-pieces
>> >> infrastructure for inlining builtins in case the target has set
>> >> riscv_slow_unaligned_access_p to false.
>> >>
>> >> To demonstrate the effect for targets with fast unaligned access,
>> >> the following code sequences are generated for a 15-byte memset-zero.
>> >>
>> >> Without overlap_op_by_pieces we get:
>> >>   8e:   00053023                sd      zero,0(a0)
>> >>   92:   00052423                sw      zero,8(a0)
>> >>   96:   00051623                sh      zero,12(a0)
>> >>   9a:   00050723                sb      zero,14(a0)
>> >>
>> >> With overlap_op_by_pieces we get:
>> >>   7e:   00053023                sd      zero,0(a0)
>> >>   82:   000533a3                sd      zero,7(a0)
>> >>
>> >> gcc/ChangeLog:
>> >>
>> >>         * config/riscv/riscv.c (riscv_overlap_op_by_pieces): New function.
>> >>         (TARGET_OVERLAP_OP_BY_PIECES_P): Connect to
>> >>         riscv_overlap_op_by_pieces.
>> >>
>> >> Signed-off-by: Christoph Muellner <cmuell...@gcc.gnu.org>
>> >> ---
>> >>  gcc/config/riscv/riscv.c | 11 +++++++++++
>> >>  1 file changed, 11 insertions(+)
>> >>
>> >> diff --git a/gcc/config/riscv/riscv.c b/gcc/config/riscv/riscv.c
>> >> index 576960bb37c..98c76ba657a 100644
>> >> --- a/gcc/config/riscv/riscv.c
>> >> +++ b/gcc/config/riscv/riscv.c
>> >> @@ -5201,6 +5201,14 @@ riscv_slow_unaligned_access (machine_mode, unsigned int)
>> >>    return riscv_slow_unaligned_access_p;
>> >>  }
>> >>
>> >> +/* Implement TARGET_OVERLAP_OP_BY_PIECES_P.  */
>> >> +
>> >> +static bool
>> >> +riscv_overlap_op_by_pieces (void)
>> >> +{
>> >> +  return !riscv_slow_unaligned_access_p;
>> >> +}
>> >> +
>> >>  /* Implement TARGET_CAN_CHANGE_MODE_CLASS.  */
>> >>
>> >>  static bool
>> >> @@ -5525,6 +5533,9 @@ riscv_asan_shadow_offset (void)
>> >>  #undef TARGET_SLOW_UNALIGNED_ACCESS
>> >>  #define TARGET_SLOW_UNALIGNED_ACCESS riscv_slow_unaligned_access
>> >>
>> >> +#undef TARGET_OVERLAP_OP_BY_PIECES_P
>> >> +#define TARGET_OVERLAP_OP_BY_PIECES_P riscv_overlap_op_by_pieces
>> >> +
>> >>  #undef TARGET_SECONDARY_MEMORY_NEEDED
>> >>  #define TARGET_SECONDARY_MEMORY_NEEDED riscv_secondary_memory_needed
>> >>
>> >> --
>> >> 2.31.1
>> >>
