On Thu, 2010-12-30 at 18:56 +0200, Revital1 Eres wrote:
> Hello,
>
> The attached patch is my latest attempt to model doloop for arm.
> I followed Chung-Lin Tang suggestion and used subs+jump similar to your
> patch.
> On crotex-A8 I see gain of 29% on autocor benchmark (telecom suite) with
> SMS using the following flags: -fmodulo-sched-allow-regmoves
> -funsafe-loop-optimizations -fmodulo-sched -fno-auto-inc-dec
> -fdump-rtl-sms -mthumb -mcpu=cortex-a8 -O3. (compare to using only
> -mthumb -mcpu=cortex-a8 -O3)
>
> I have not fully tested the patch and it's not in the proper format of
> submission yet.
>
> Thanks,
> Revital
>
> (See attached file: patch_arm_doloop.txt)
>
>
>
> From: Roman Zhuykov <[email protected]>
> To: [email protected]
> Cc: [email protected]
> Date: 30/12/2010 04:04 PM
> Subject: [ARM] Implementing doloop pattern
> Sent by: [email protected]
>
>
>
> Hello!
>
> The main idea of the work described below was to estimate speedup we can
> gain from SMS on ARM. SMS depends on doloop_end pattern and there is no
> appropriate instruction on ARM. We decided to create a "fake"
> doloop_end pattern on ARM using a pair of "subs" and "bne" assembler
> instructions. In implementation we used ideas from machine description
> files of other architectures, e. g. spu, which expands doloop_end
> pattern only when SMS is enabled. The patch is attached.
>
> This patch allows to use any possible register for the doloop pattern.
> It was tested on trunk snapshot from 30 Aug 2010. It works fine on
> several small examples, but gives an ICE on sqlite-amalgamation-3.6.1
> source:
> sqlite3.c: In function 'sqlite3WhereBegin':
> sqlite3.c:76683:1: internal compiler error: in patch_jump_insn, at
> cfgrtl.c:1020
>
> ICE happens in ira pass, when cleanup_cfg is called at the end or ira.
>
> The "bad" instruction looks like
> (jump_insn 3601 628 4065 76 (parallel [
> (set (pc)
> (if_then_else (ne (mem/c:SI (plus:SI (reg/f:SI 13 sp)
> (const_int 36 [0x24])) [105 %sfp+-916
> S4 A32])
> (const_int 1 [0x1]))
> (label_ref 3600)
> (pc)))
> (set (mem/c:SI (plus:SI (reg/f:SI 13 sp)
> (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
> (plus:SI (mem/c:SI (plus:SI (reg/f:SI 13 sp)
> (const_int 36 [0x24])) [105 %sfp+-916 S4 A32])
> (const_int -1 [0xffffffffffffffff])))
> ]) sqlite3.c:75235 328 {doloop_end_internal}
> (expr_list:REG_BR_PROB (const_int 9100 [0x238c])
> (nil))
> -> 3600)
>
> So, the problem seems to be with ira. Memory is used instead of a
> register to store doloop counter. We tried to fix this by explicitly
> specifying hard register (r5) for doloop pattern. The fixed version
> seems to work, but this doesn't look like a proper fix. On trunk
> snapshot from 17 Dec 2010 the ICE described above have disappeared, but
> probably it's just a coincidence, and it will shop up anyway on some
> other test case.
>
> The r5-fix shows the following results (compare "-O2 -fno-auto-inc-dec
> -fmodulo-sched" vs "-O2 -fno-auto-inc-dec").
> Aburto benchmarks: heapsort and matmult - 3% speedup. nsieve - 7% slowdown.
> Other aburto tests, sqlite tests and libevas rasterization library
> (expedite testsuite) show around zero results.
>
> A motivating example shows about 23% speedup:
>
> char scal (int n, char *a, char *b)
> {
> int i;
> char s = 0;
> for (i = 0; i < n; i++)
> s += a[i] * b[i];
> return s;
> }
>
> We have analyzed SMS results, and can conclude that if SMS has
> successfully built a schedule for the loop we usually gain a speedup,
> and when SMS fails, we often have some slowdown, which have appeared
> because of do-loop conversion.
>
> The questions are:
> How to properly fix the ICE described?
> Do you think this approach (after the fixes) can make its way into trunk?
>
> Happy holidays!
> --
> Roman Zhuykov
>
> [attachment "sms-doloop-any-reg.diff" deleted by Revital1 Eres/Haifa/IBM]
@@ -162,6 +175,7 @@ doloop_condition_get (rtx doloop_pat)
return 0;
if ((XEXP (condition, 0) == reg)
+ || (REGNO (XEXP (condition, 0)) == CC_REGNUM)
|| (GET_CODE (XEXP (condition, 0)) == PLUS
&& XEXP (XEXP (condition, 0), 0) == reg))
You can't depend on CC_REGNUM in generic code. That's part of the
private machine description for ARM. Other cores have different ways of
representing condition codes.
R.