On Wed, 2024-03-27 at 10:38 +0800, chenglulu wrote:
> 
> On 2024/3/26 at 5:48 PM, Xi Ruoyao wrote:
> > The latency of LA464 and LA664 division instructions depends on the
> > input.  When I updated the costs in r14-6642, I unintentionally set the
> > division costs to the best-case latency (when the first operand is 0).
> > Per a recent discussion [1] we should use "something sensible"
> > instead.
> > 
> > Use the average of the minimum and maximum latency observed instead.
> > This enables the reduction of division into a multiply-by-reciprocal
> > sequence and speeds up the following test case by about 30%:
> > 
> >      int
> >      main (void)
> >      {
> >        unsigned long stat = 0xdeadbeef;
> >        for (int i = 0; i < 100000000; i++)
> >          stat = (stat * stat + stat * 114514 + 1919810) % 1000000007;
> >        asm(""::"r"(stat));
> >      }
> > 
> > [1]: https://gcc.gnu.org/pipermail/gcc-patches/2024-March/648348.html
> 
> The test case div-const-reduction.c is modified to assemble the instruction
> sequence as follows:
>       lu12i.w $r12,999997440>>12                      # 0x3b9ac000
>       ori     $r12,$r12,2567
>       mod.w   $r13,$r13,$r12
> 
> This sequence of instructions takes 5 clock cycles.

Hmm indeed, it seems a waste to do this reduction for int / 1000000007.
I'll try to make a better heuristic as Richard suggests...
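
For anyone following along, the reduction in question replaces a division
by a constant with a multiplication by a precomputed reciprocal. Below is
a minimal C sketch of the idea for a 32-bit unsigned operand and the
divisor 1000000007. The names (DIVISOR, MAGIC, div_by_reciprocal,
mod_by_reciprocal, self_check) are made up for illustration, it relies on
the unsigned __int128 extension, and it is not necessarily the sequence
GCC actually emits:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; not taken from GCC or from the patch. */
#define DIVISOR UINT32_C(1000000007)

/* ceil(2^64 / DIVISOR).  DIVISOR is odd, so it never divides 2^64 and
   floor((2^64 - 1) / DIVISOR) + 1 equals the ceiling. */
static const uint64_t MAGIC = UINT64_MAX / DIVISOR + 1;

/* n / DIVISOR without a divide: keep the high 64 bits of the 128-bit
   product n * MAGIC.  The rounding error is below 2^-32, which is
   smaller than 1 / DIVISOR, so the result is exact for any 32-bit n. */
static uint32_t
div_by_reciprocal (uint32_t n)
{
  return (uint32_t) (((unsigned __int128) n * MAGIC) >> 64);
}

/* n % DIVISOR as n - (n / DIVISOR) * DIVISOR: one more multiply and a
   subtract on top of the quotient. */
static uint32_t
mod_by_reciprocal (uint32_t n)
{
  return n - div_by_reciprocal (n) * DIVISOR;
}

/* Compare against the plain % operator on a few edge values. */
static void
self_check (void)
{
  const uint32_t samples[] = { 0, 1, DIVISOR - 1, DIVISOR, DIVISOR + 1,
                               UINT32_C (0xdeadbeef), UINT32_MAX };
  for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
    assert (mod_by_reciprocal (samples[i]) == samples[i] % DIVISOR);
}
```

The reduced form costs a constant load, a widening multiply, a second
multiply and a subtract, so it only pays off when the divide is modelled
as more expensive than that whole sequence.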


-- 
Xi Ruoyao <xry...@xry111.site>
School of Aerospace Science and Technology, Xidian University
