2014-08-26 11:49 GMT+04:00 Ilya Enkovich <enkovich....@gmail.com>:
> 2014-08-25 19:08 GMT+04:00 Vladimir Makarov <vmaka...@redhat.com>:
>> On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
>>>
>>> Hi,
>>>
>>> On Cauldron 2014 we had a couple of talks about relaxation of ebx usage in
>>> 32bit PIC mode.  It was decided that the best approach would be to not fix
>>> ebx register, use speudo register for GOT base address and let allocator do
>>> the rest.  This should be similar to how clang and icc work with GOT base
>>> address.  I've been working for some time on such patch and now want to
>>> share my results.
>>>
>>> The idea of the patch was very simple and included few things;
>>>   1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do
>>> not have any hard reg fixed for PIC.
>>>   2.  Initialize pic_offset_table_rtx with a new pseudo register in the
>>> begining of a function expand.
>>>   3.  Change ABI so that there is a possible implicit PIC argument for
>>> calls; pic_offset_table_rtx is used as an arg value if such implicit arg
>>> exist.
>>>
>>> Such approach worked well on small tests but trying to run some benchmarks
>>> we faced a problem with reload of address constants.  The problem is that
>>> when we try to rematerialize address constant or some constant memory
>>> reference, we have to use pic_offset_table_rtx.  It means we insert new
>>> usages of a speudo register and alocator cannot handle it correctly.  Same
>>> problem also applies for float and vector constants.
>>>
>>> Rematerialization is not the only case causing new pic_offset_table_rtx
>>> usage.  Another case is a split of some instructions using constant but not
>>> having proper constraints.  E.g. pushtf pattern allows push of constant but
>>> it has to be replaced with push of memory in reload pass causing additional
>>> usage of pic_offset_table_rtx.
>>>
>>> There are two ways to fix it.  The first one is to support modifications
>>> of pseudo register live range during reload and correctly allocate hard regs
>>> for its new usages (currently we have some hard reg allocated for new usage
>>> of pseudo reg but it may contain value of some other pseudo reg; thus we
>>> reveal the problem at runtime only).
>>>
>>
>> I believe there is already code to deal with this situation.  It is code for
>> risky transformations (please check flag lra_risky_transformation_p).  If
>> this flag is set, next lra assign subpass is running and checking
>> correctness of assignments (e.g. checking situation when two different
>> pseudos have intersected live ranges and the same assigned hard reg.  If
>> such dangerous situation is found, it is fixed).
>
> I tried to remove my restrictions from setup_reg_equiv and initialize
> lra_risky_transformation_p with 'true' in lra_constraints instead.  I
> got only 50% pass rate for SPEC2000 on Ofast with LTO.  Will search
> for fail reason.

I've looked into one of fails.  There is still a problem with
allocation in reload. Here is a piece of code which uses float
constant:

(insn 1199 1198 1200 96 (set (reg:SI 3 bx)
        (reg:SI 1301 [528])) /usr/include/bits/stdlib-float.h:28 90
{*movsi_internal}
     (nil))
(call_insn 1200 1199 1201 96 (set (reg:DF 8 st)
        (call (mem:QI (symbol_ref:SI ("strtod") [flags 0x41]
<function_decl 0x2b29b8ea8900 strtod>) [0 strtod S1 A8])
            (const_int 8 [0x8]))) /usr/include/bits/stdlib-float.h:28
661 {*call_value}
     (expr_list:REG_DEAD (reg:SI 3 bx)
        (expr_list:REG_CALL_DECL (symbol_ref:SI ("strtod") [flags
0x41]  <function_decl 0x2b29b8ea8900 strtod>)
            (expr_list:REG_EH_REGION (const_int 0 [0])
                (nil))))
    (expr_list (use (reg:SI 3 bx))
        (expr_list:SI (use (reg:SI 3 bx))
            (expr_list:SI (use (mem/f:SI (reg/f:SI 7 sp) [0  S4 A32]))
                (expr_list:SI (use (mem/f:SI (plus:SI (reg/f:SI 7 sp)
                                (const_int 4 [0x4])) [0  S4 A32]))
                    (nil))))))
(insn 1201 1200 1202 96 (set (reg:DF 321 [ D.7817 ])
        (reg:DF 8 st)) /usr/include/bits/stdlib-float.h:28 128 {*movdf_internal}
     (expr_list:REG_DEAD (reg:DF 8 st)
        (nil)))
(insn 1202 1201 1203 96 (set (reg:SF 322 [ D.7804 ])
        (float_truncate:SF (reg:DF 321 [ D.7817 ]))) read_arch.c:700
157 {*truncdfsf_fast_sse}
     (expr_list:REG_DEAD (reg:DF 321 [ D.7817 ])
        (nil)))
(insn 1203 1202 1204 96 (set (mem:SF (reg/f:SI 198 [ D.7812 ]) [4
_130->frequency+0 S4 A32])
        (reg:SF 322 [ D.7804 ])) read_arch.c:700 129 {*movsf_internal}
     (nil))
(insn 1204 1203 1205 96 (set (reg:SF 1209)
        (mem/u/c:SF (plus:SI (reg:SI 1301 [528])
                (const:SI (unspec:SI [
                            (symbol_ref/u:SI ("*.LC12") [flags 0x2])
                        ] UNSPEC_GOTOFF))) [4  S4 A32]))
read_arch.c:701 129 {*movsf_internal}
     (expr_list:REG_EQUAL (const_double:SF 0.0 [0x0.0p+0])
        (nil)))
(note 1205 1204 1206 96 NOTE_INSN_DELETED)
(note 1206 1205 1207 96 NOTE_INSN_DELETED)
(insn 1207 1206 1208 96 (set (reg:CCFP 17 flags)
        (compare:CCFP (reg:SF 1209)
            (reg:SF 322 [ D.7804 ]))) read_arch.c:701 53 {*cmpisf_sse}
     (nil))
(jump_insn 1208 1207 3075 96 (set (pc)
        (if_then_else (ge (reg:CCFP 17 flags)
                (const_int 0 [0]))
            (label_ref:SI 3114)
            (pc))) read_arch.c:701 606 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCFP 17 flags)
        (int_list:REG_BR_PROB 2 (nil)))
 -> 3114)
(note 3075 1208 1209 97 [bb 97] NOTE_INSN_BASIC_BLOCK)
(insn 1209 3075 1210 97 (set (reg:SF 1208)
        (mem/u/c:SF (plus:SI (reg:SI 1301 [528])
                (const:SI (unspec:SI [
                            (symbol_ref/u:SI ("*.LC11") [flags 0x2])
                        ] UNSPEC_GOTOFF))) [4  S4 A32]))
read_arch.c:701 129 {*movsf_internal}
     (expr_list:REG_EQUIV (const_double:SF 1.0e+0 [0x0.8p+1])
        (nil)))
(note 1210 1209 1211 97 NOTE_INSN_DELETED)
(note 1211 1210 1212 97 NOTE_INSN_DELETED)
(insn 1212 1211 1213 97 (set (reg:CCFP 17 flags)
        (compare:CCFP (reg:SF 322 [ D.7804 ])
            (reg:SF 1208))) read_arch.c:701 53 {*cmpisf_sse}
     (nil))

We have PIC register r1301 (former r528) used for constant load (insn
1209).  This register was actually loaded to bx (insn 1199) and this
hard reg may be used by insn 1209.  During reload we have insn 1209
removed and a new one created instead:

(insn 3864 1211 1212 104 (set (reg:SI 0 ax [1468])
        (plus:SI (reg:SI 6 bp [528])
            (const:SI (unspec:SI [
                        (symbol_ref/u:SI ("*.LC11") [flags 0x2])
                    ] UNSPEC_GOTOFF)))) read_arch.c:701 213 {*leasi}
     (expr_list:REG_EQUAL (symbol_ref/u:SI ("*.LC11") [flags 0x2])
        (nil)))
(insn 1212 3864 1213 104 (set (reg:CCFP 17 flags)
        (compare:CCFP (reg:SF 21 xmm0 [orig:322 D.7804 ] [322])
            (mem/u/c:SF (reg:SI 0 ax [1468]) [4  S4 A32])))
read_arch.c:701 53 {*cmpisf_sse}
     (nil))

In this new instruction bp is used which is wrong. We actually have
required value in bx. In debugger I also checked that bp doesn't have
required value.  I suppose I enabled flag correctly because found this
in the log: "Spill r1301 after risky transformations".  Is it possible
we are still not allowed to use the original PIC register (r528) and
should use a reg copy created for particular region (in this case
r1301)?

Ilya

>
> Ilya
>
>>
>>
>>> The second way is to avoid all cases when new usages of
>>> pic_offset_table_rtx appear in reload.  That is a way I chose because it
>>> appeared simplier to me and would allow me to get some performance data
>>> faster.  Also having rematerialization of address anf float constants in PIC
>>> mode would mean we have higher register pressure, thus having them on stack
>>> should be even more efficient.  To achieve it I had to cut off reg equivs to
>>> all exprs using symbol references and all constants living in the memory.  I
>>> also had to avoid instructions requiring split in reload causing load of
>>> constant from memory (*push[txd]f).
>>>
>>> Resulting compiler successfully passes make check, compiles EEMBC and
>>> SPEC2000 benchmarks.  There is no confidence I covered all cases and there
>>> still may be some templates causing split in reload with new
>>> pic_offset_table_rtx usages.  I think support of reload with pseudo PIC
>>> would be better and more general solution.  But I don't know how difficult
>>> is to implement it though.  Any ideas on resolving this reload issue?
>>>
>>
>> Please see what I mentioned above.  May be it can fix the degradation.
>> Rematerialization is important for performance and switching it of
>> completely is not wise.
>>
>>
>>
>>> I collected some performance numbers for EEMBC and SPEC2000 benchmarks.
>>> Here are patch results for -Ofast optlevel with LTO collectd on Avoton
>>> server:
>>> AUTOmark +1,9%
>>> TELECOMmark +4,0%
>>> DENmark +10,0%
>>> SPEC2000 -0,5%
>>>
>>> There are few degradations on EEMBC benchmarks but on SPEC2000 situation
>>> is different and we see more performance losses.  Some of them are caused by
>>> disabled rematerialization of address constants.  In some cases relaxed ebx
>>> causes more spills/fills in plaecs where GOT is frequently used.  There are
>>> also some minor fixes required in the patch to allow more efficient function
>>> prolog (avoid unnecessary GOT register initialization and allow its
>>> initialization without ebx usage).  Suppose some performance problems may be
>>> resolved but a good fix for reload should go first.
>>>
>>>
>>
>> Ilya, the optimization you are trying to implement is important in many
>> cases and should be in some way included in gcc.  If the degradations can be
>> solved in a way i mentioned above we could introduce a machine-dependent
>> flag.
>>

Reply via email to