On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
Hi,
At Cauldron 2014 we had a couple of talks about relaxing ebx usage in
32-bit PIC mode. It was decided that the best approach would be to not fix the
ebx register, use a pseudo register for the GOT base address, and let the
allocator do the rest. This should be similar to how clang and icc work with
the GOT base address.
I've been working for some time on such a patch and now want to share my results.
The idea of the patch was very simple and included a few things:
1. Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do not
have any hard reg fixed for PIC.
2. Initialize pic_offset_table_rtx with a new pseudo register at the
beginning of function expansion.
3. Change the ABI so that there is a possible implicit PIC argument for calls;
pic_offset_table_rtx is used as the arg value if such an implicit arg exists.
This approach worked well on small tests, but when trying to run some
benchmarks we faced a problem with reload of address constants. The problem is
that when we try to rematerialize an address constant or some constant memory
reference, we have to use pic_offset_table_rtx. It means we insert new usages
of a pseudo register and the allocator cannot handle that correctly. The same
problem also applies to float and vector constants.
Rematerialization is not the only case causing new pic_offset_table_rtx usage.
Another case is a split of some instructions that use a constant but do not
have proper constraints. E.g. the pushtf pattern allows a push of a constant,
but it has to be replaced with a push of memory in the reload pass, causing an
additional usage of pic_offset_table_rtx.
There are two ways to fix it. The first one is to support modifications of the
pseudo register's live range during reload and correctly allocate hard regs for
its new usages (currently we have some hard reg allocated for a new usage of
the pseudo reg, but it may contain the value of some other pseudo reg; thus we
reveal the problem at runtime only).
I believe there is already code to deal with this situation. It is the code
for risky transformations (please check the flag
lra_risky_transformations_p). If this flag is set, the next lra assign
subpass runs and checks the correctness of assignments (e.g. it checks
for the situation where two different pseudos have intersecting live
ranges and the same assigned hard reg). If such a dangerous situation is
found, it is fixed.
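To make the check concrete, here is a toy Python sketch (not GCC code). It
assumes each pseudo has one contiguous live range and an already assigned hard
reg; the names Pseudo, add_use and find_conflicts are made up for illustration
and do not correspond to LRA data structures. It shows how a new use of the PIC
pseudo added during reload can extend its live range into that of another
pseudo sharing the same hard reg.

```python
from dataclasses import dataclass, replace
from itertools import combinations


@dataclass(frozen=True)
class Pseudo:
    """Toy model of an allocated pseudo register (hypothetical, not GCC's)."""
    name: str
    start: int     # first program point where the value is live
    end: int       # last program point where the value is live (inclusive)
    hard_reg: str  # hard register picked by the allocator


def add_use(pseudo, point):
    """Return a copy of `pseudo` whose live range also covers `point`."""
    return replace(pseudo,
                   start=min(pseudo.start, point),
                   end=max(pseudo.end, point))


def find_conflicts(pseudos):
    """Return name pairs that share a hard reg while both values are live."""
    return [
        (a.name, b.name)
        for a, b in combinations(pseudos, 2)
        if a.hard_reg == b.hard_reg
        and a.start <= b.end and b.start <= a.end
    ]


if __name__ == "__main__":
    pic = Pseudo("pic", start=0, end=10, hard_reg="ebx")
    other = Pseudo("p1", start=12, end=20, hard_reg="ebx")
    print(find_conflicts([pic, other]))               # [] (ranges are disjoint)
    # Reload rematerializes a symbol address at point 15, using the PIC pseudo.
    print(find_conflicts([add_use(pic, 15), other]))  # [('pic', 'p1')]
```

In this model a detected pair would have to be fixed by spilling or
reassigning one of the two pseudos; how LRA actually does that is not shown
here.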
The second way is to avoid all cases where new usages of pic_offset_table_rtx
appear in reload. That is the way I chose because it appeared simpler to me and
would allow me to get some performance data faster. Also, having
rematerialization of address and float constants in PIC mode would mean we have
higher register pressure, thus having them on the stack should be even more
efficient. To achieve it I had to cut off reg equivs to all exprs using symbol
references and all constants living in memory. I also had to avoid
instructions requiring a split in reload that causes a load of a constant from
memory (*push[txd]f).
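As a rough illustration of that filtering, here is a toy Python sketch (not
the actual patch). It assumes expressions are nested tuples of the form
(code, operand, ...); the codes "symbol_ref" and "const_pool_mem" and the
function names are hypothetical stand-ins. An equivalence is dropped whenever
materializing it would need the GOT base.

```python
def needs_pic_base(expr):
    """True if materializing `expr` would require the GOT base register."""
    if not isinstance(expr, tuple):
        return False  # leaf operand such as an integer or a name
    code, *operands = expr
    if code in ("symbol_ref", "const_pool_mem"):
        return True
    return any(needs_pic_base(op) for op in operands)


def usable_equivs(equivs):
    """Keep only the equivalences that can be rematerialized without PIC."""
    return {reg: expr for reg, expr in equivs.items()
            if not needs_pic_base(expr)}


if __name__ == "__main__":
    equivs = {
        "r100": ("plus", ("symbol_ref", "foo"), ("const_int", 4)),
        "r101": ("const_int", 42),
        "r102": ("const_pool_mem", ("const_double", 1.5)),
    }
    print(usable_equivs(equivs))  # {'r101': ('const_int', 42)}
```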
The resulting compiler successfully passes make check and compiles the EEMBC
and SPEC2000 benchmarks. I am not confident that I covered all cases, and there
may still be some templates causing a split in reload with new
pic_offset_table_rtx usages. I think support for reload with a pseudo PIC
register would be a better and more general solution, but I don't know how
difficult it is to implement. Any ideas on resolving this reload issue?
Please see what I mentioned above. Maybe it can fix the degradation.
Rematerialization is important for performance and switching it off
completely is not wise.
I collected some performance numbers for the EEMBC and SPEC2000 benchmarks. Here
are the patch results for the -Ofast optimization level with LTO, collected on
an Avoton server:
AUTOmark +1.9%
TELECOMmark +4.0%
DENmark +10.0%
SPEC2000 -0.5%
There are few degradations on the EEMBC benchmarks, but on SPEC2000 the
situation is different and we see more performance losses. Some of them are
caused by disabled rematerialization of address constants. In some cases
relaxed ebx causes more spills/fills in places where the GOT is frequently
used. There are also some minor fixes required in the patch to allow a more
efficient function prologue (avoid unnecessary GOT register initialization and
allow its initialization without ebx usage). I suppose some performance
problems may be resolved, but a good fix for reload should go first.
Ilya, the optimization you are trying to implement is important in many
cases and should be included in gcc in some way. If the degradations
can be solved in the way I mentioned above, we could introduce a
machine-dependent flag.