On 2014-08-22 8:21 AM, Ilya Enkovich wrote:
Hi,

At Cauldron 2014 we had a couple of talks about relaxation of ebx usage in
32bit PIC mode.  It was decided that the best approach would be to not fix the
ebx register, use a pseudo register for the GOT base address and let the
allocator do the rest.  This should be similar to how clang and icc work with
the GOT base address.  I've been working for some time on such a patch and now
want to share my results.

The idea of the patch was very simple and included a few things:
  1.  Set PIC_OFFSET_TABLE_REGNUM to INVALID_REGNUM to specify that we do not 
have any hard reg fixed for PIC.
  2.  Initialize pic_offset_table_rtx with a new pseudo register at the
beginning of function expand.
  3.  Change the ABI so that there is a possible implicit PIC argument for
calls; pic_offset_table_rtx is used as the arg value if such an implicit arg
exists.

This approach worked well on small tests, but when trying to run some
benchmarks we faced a problem with reload of address constants.  The problem is
that when we try to rematerialize an address constant or some constant memory
reference, we have to use pic_offset_table_rtx.  It means we insert new usages
of a pseudo register and the allocator cannot handle it correctly.  The same
problem also applies to float and vector constants.

Rematerialization is not the only case causing new pic_offset_table_rtx usage.
Another case is a split of some instructions that use a constant but do not
have proper constraints.  E.g. the pushtf pattern allows a push of a constant,
but it has to be replaced with a push of memory in the reload pass, causing an
additional usage of pic_offset_table_rtx.

There are two ways to fix it.  The first one is to support modifications of 
pseudo register live range during reload and correctly allocate hard regs for 
its new usages (currently we have some hard reg allocated for new usage of 
pseudo reg but it may contain value of some other pseudo reg; thus we reveal 
the problem at runtime only).


I believe there is already code to deal with this situation. It is the code for risky transformations (please check the flag lra_risky_transformation_p). If this flag is set, the next LRA assignment subpass runs and checks the correctness of assignments, e.g. it checks for the situation when two different pseudos have intersecting live ranges and the same assigned hard reg. If such a dangerous situation is found, it is fixed.
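
To illustrate the kind of check I mean, here is a minimal sketch in plain C. It is not the LRA code: the struct, the function names and the single-interval live range model are all hypothetical simplifications, and "fixing" a conflict is modelled simply as spilling the pseudo whose range starts later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model (not GCC data structures): each pseudo
   has one live range [start, end] and at most one assigned hard reg.  */
struct pseudo_assignment {
    int pseudo;    /* pseudo register number */
    int hard_reg;  /* assigned hard register, or -1 if spilled */
    int start;     /* first program point where the pseudo is live */
    int end;       /* last program point where the pseudo is live */
};

/* Two closed intervals intersect if each starts before the other ends.  */
static bool ranges_intersect(const struct pseudo_assignment *a,
                             const struct pseudo_assignment *b)
{
    return a->start <= b->end && b->start <= a->end;
}

/* Find pairs of pseudos that share a hard reg while their live ranges
   intersect, and spill the one whose range starts later.  Returns the
   number of pseudos spilled.  */
int fix_conflicts(struct pseudo_assignment *v, size_t n)
{
    int spilled = 0;

    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (v[i].hard_reg < 0 || v[j].hard_reg < 0)
                continue;  /* a spilled pseudo cannot conflict */
            if (v[i].hard_reg != v[j].hard_reg)
                continue;
            if (!ranges_intersect(&v[i], &v[j]))
                continue;
            struct pseudo_assignment *victim =
                v[j].start >= v[i].start ? &v[j] : &v[i];
            victim->hard_reg = -1;
            spilled++;
        }
    }
    return spilled;
}
```

In this toy model, a new usage of the PIC pseudo that got a hard reg already holding another live pseudo would be detected as such a pair instead of showing up only at runtime.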

The second way is to avoid all cases when new usages of pic_offset_table_rtx
appear in reload.  That is the way I chose because it appeared simpler to me
and would allow me to get some performance data faster.  Also, having
rematerialization of address and float constants in PIC mode would mean we have
higher register pressure, thus having them on the stack should be even more
efficient.  To achieve it I had to cut off reg equivs to all exprs using symbol
references and all constants living in memory.  I also had to avoid
instructions requiring a split in reload that causes a load of a constant from
memory (*push[txd]f).

The resulting compiler successfully passes make check and compiles the EEMBC
and SPEC2000 benchmarks.  I am not confident that I covered all cases, and
there still may be some templates causing a split in reload with new
pic_offset_table_rtx usages.  I think support of reload with a pseudo PIC
register would be a better and more general solution, but I don't know how
difficult it is to implement.  Any ideas on resolving this reload issue?


Please see what I mentioned above. Maybe it can fix the degradation. Rematerialization is important for performance and switching it off completely is not wise.


I collected some performance numbers for the EEMBC and SPEC2000 benchmarks.
Here are the patch results for the -Ofast optlevel with LTO, collected on an
Avoton server:
AUTOmark +1,9%
TELECOMmark +4,0%
DENmark +10,0%
SPEC2000 -0,5%

There are few degradations on the EEMBC benchmarks, but on SPEC2000 the
situation is different and we see more performance losses.  Some of them are
caused by disabled rematerialization of address constants.  In some cases
relaxed ebx causes more spills/fills in places where the GOT is frequently
used.  There are also some minor fixes required in the patch to allow a more
efficient function prolog (avoid unnecessary GOT register initialization and
allow its initialization without ebx usage).  I suppose some performance
problems may be resolved, but a good fix for reload should go first.



Ilya, the optimization you are trying to implement is important in many cases and should be included in gcc in some way. If the degradations can be solved in the way I mentioned above, we could introduce a machine-dependent flag.
