Hello,
I am trying to use the doloop_end pattern to reduce loop overhead on our
target processor, which has a dedicated loop instruction. However, even a
simple loop fails the check in doloop_condition_get, which
requires the following canonical pattern:
/* The canonical doloop pattern we expect has one of the following
   forms:

   1)  (parallel [(set (pc) (if_then_else (condition)
                                          (label_ref (label))
                                          (pc)))
                  (set (reg) (plus (reg) (const_int -1)))
                  (additional clobbers and uses)])

   The branch must be the first entry of the parallel (also required
   by jump.c), and the second entry of the parallel must be a set of
   the loop counter register.  Some targets (IA-64) wrap the set of
   the loop counter in an if_then_else too.

   2)  (set (reg) (plus (reg) (const_int -1))
       (set (pc) (if_then_else (reg != 0)
                               (label_ref (label))
                               (pc))).  */
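To make form 2 concrete, here is a rough C-level picture of the loop shape it
describes: a counter that is decremented by one and then tested against zero
at the bottom of the loop. This is only an illustration. The function name
count_down_sum is made up, and the real check is performed on RTL, not on C
source.

```c
#include <stddef.h>

/* Hypothetical helper: sums the first n elements of a[] with a
   down-counting loop.  The comments map each statement to the RTL
   of form 2 in the quoted comment.  */
int count_down_sum(const int *a, unsigned int n)
{
    int sum = 0;
    unsigned int count = n;      /* the loop counter register */

    if (count == 0)              /* a do-while body runs at least once */
        return 0;

    do {
        sum += *a++;
        count = count - 1;       /* (set (reg) (plus (reg) (const_int -1))) */
    } while (count != 0);        /* (if_then_else (reg != 0) (label_ref ...) (pc)) */

    return sum;
}
```

For example, count_down_sum applied to the array {1, 2, 3} with n = 3 runs the
body three times and returns 6.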
Here is the simple function I used; it should meet all of the doloop
optimization requirements.
void Unroll( short s, int * restrict b_inout, int *restrict out, int N)
{
  int i;
  for (i=0; i<64; i++)
  {
    out[i] = b_inout[i] + s;
  }
}
After the tree ivcanon pass, it is converted to:
;; Function Unroll (Unroll)
Unroll (short int s, int * restrict b_inout, int * restrict out, int N)
{
  unsigned int ivtmp.14;
  int pretmp.9;
  long unsigned int pretmp.8;
  int storetmp.6;
  int i;
  int D.1459;
  int D.1458;
  int D.1457;
  int * D.1456;
  int * D.1455;
  long unsigned int D.1454;
  long unsigned int D.1453;

<bb 2>:
  pretmp.9_8 = (int) s_12(D);

<bb 3>:
  # ivtmp.14_13 = PHI <ivtmp.14_21(4), 64(2)>
  # i_19 = PHI <i_15(4), 0(2)>
  D.1453_3 = (long unsigned int) i_19;
  D.1454_4 = D.1453_3 * 4;
  D.1455_6 = out_5(D) + D.1454_4;
  D.1456_10 = b_inout_9(D) + D.1454_4;
  D.1457_11 = *D.1456_10;
  D.1459_14 = pretmp.9_8 + D.1457_11;
  *D.1455_6 = D.1459_14;
  i_15 = i_19 + 1;
  ivtmp.14_21 = ivtmp.14_13 - 1;
  if (ivtmp.14_21 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  goto <bb 3>;

<bb 5>:
  return;
}
This should match the requirements of doloop_condition_get: the counter
ivtmp.14 is decremented by 1 and compared with 0. But after the
ivopts pass, the code is transformed to:
;; Function Unroll (Unroll)
Unroll (short int s, int * restrict b_inout, int * restrict out, int N)
{
  long unsigned int ivtmp.21;
  unsigned int ivtmp.14;
  int pretmp.9;
  long unsigned int pretmp.8;
  int storetmp.6;
  int i;
  int D.1459;
  int D.1458;
  int D.1457;
  int * D.1456;
  int * D.1455;
  long unsigned int D.1454;
  long unsigned int D.1453;

<bb 2>:
  pretmp.9_8 = (int) s_12(D);

<bb 3>:
  # ivtmp.21_7 = PHI <ivtmp.21_16(4), 0(2)>
  D.1457_11 = MEM[base: b_inout_9(D), index: ivtmp.21_7];
  D.1459_14 = pretmp.9_8 + D.1457_11;
  MEM[base: out_5(D), index: ivtmp.21_7] = D.1459_14;
  ivtmp.21_16 = ivtmp.21_7 + 4;
  if (ivtmp.21_16 != 256)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  goto <bb 3>;

<bb 5>:
  return;
}
The loop is no longer in the required canonical form: the only remaining
induction variable, ivtmp.21, counts up by 4 and is compared with 256.
Later RTL-level optimizations cannot convert it back. Since the loop doesn't
pass the doloop_condition_get test, the modulo scheduling pass doesn't work
either. Am I missing something here? Any hint is greatly appreciated.
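For reference, the two loop shapes from the dumps can be written by hand in C
roughly as follows. This is my own rendering, not compiler output. The
function names are made up, the unused parameter N is dropped, and the byte
step is written as sizeof(int), which is 4 in the dumps (so the limit
64 * sizeof(int) is the 256 seen there).

```c
#include <stddef.h>

/* Shape after ivcanon: a separate counter starts at 64, is decremented
   by 1, and is compared with 0.  This is the down-counting form.  */
void unroll_ivcanon_shape(short s, const int *restrict in, int *restrict out)
{
    int pretmp = (int) s;
    int i = 0;
    unsigned int ivtmp = 64;

    do {
        out[i] = pretmp + in[i];
        i = i + 1;
        ivtmp = ivtmp - 1;
    } while (ivtmp != 0);
}

/* Shape after ivopts: the counter is gone.  A byte offset counts up
   from 0 in steps of sizeof(int) and is compared with the end value.  */
void unroll_ivopts_shape(short s, const int *restrict in, int *restrict out)
{
    int pretmp = (int) s;
    size_t off = 0;

    do {
        const int *src = (const int *) ((const char *) in + off);
        int *dst = (int *) ((char *) out + off);

        *dst = pretmp + *src;
        off = off + sizeof(int);          /* "+ 4" in the dump */
    } while (off != 64 * sizeof(int));    /* "!= 256" in the dump */
}
```

Both functions run the body 64 times and store the same values; only the
induction variable and the exit test differ, and that difference is what
matters for the pattern check.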
Cheers,
Bingfeng Mei