This series implements some basic machine-independent optimizations. They simplify code and allow liveness analysis to do its work better.
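To make the idea concrete, here is a minimal sketch of constant propagation plus constant folding over a toy IR, using the movw/movt sequence discussed below as input. The `struct op` encoding, the `OP_*` and temp names, `movw_movt_example` and `fold_constants` are all hypothetical: this is not QEMU's TCG representation or the tcg/optimize.c code, and it leaves out the copy propagation that the series also implements. The pass walks the ops once, remembers which temps hold a known constant, and rewrites every op whose inputs are all known into a movi of the computed result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy IR for illustration only; QEMU's TCG stores ops and
 * their arguments differently. */
enum opc { OP_MOVI, OP_MOV, OP_EXT16U, OP_OR };
enum temp { R12, TMP8, TMP9, NB_TEMPS };

struct op {
    enum opc opc;
    enum temp dst, src1, src2; /* operands an opcode does not use are ignored */
    uint32_t imm;              /* only meaningful for OP_MOVI */
};

/* The unoptimized sequence for "movw r12, #0xb6db; movt r12, #0xdb6d". */
static const struct op movw_movt_example[] = {
    { .opc = OP_MOVI,   .dst = TMP8, .imm = 0xb6db },
    { .opc = OP_MOV,    .dst = R12,  .src1 = TMP8 },
    { .opc = OP_MOV,    .dst = TMP8, .src1 = R12 },
    { .opc = OP_EXT16U, .dst = TMP8, .src1 = TMP8 },
    { .opc = OP_MOVI,   .dst = TMP9, .imm = 0xdb6d0000 },
    { .opc = OP_OR,     .dst = TMP8, .src1 = TMP8, .src2 = TMP9 },
    { .opc = OP_MOV,    .dst = R12,  .src1 = TMP8 },
};
#define EXAMPLE_LEN ((int)(sizeof movw_movt_example / sizeof movw_movt_example[0]))

/* Single forward pass: propagate known constants from temp to temp and
 * fold any op whose inputs are all known into "movi dst, result". */
static void fold_constants(struct op *ops, int n)
{
    bool known[NB_TEMPS] = { false };
    uint32_t val[NB_TEMPS] = { 0 };

    for (int i = 0; i < n; i++) {
        struct op *o = &ops[i];
        bool ok = false;   /* are all inputs of this op constant? */
        uint32_t res = 0;  /* folded result, valid only when ok */

        assert(o->dst < NB_TEMPS);
        switch (o->opc) {
        case OP_MOVI:
            ok = true;
            res = o->imm;
            break;
        case OP_MOV:
            ok = known[o->src1];
            res = val[o->src1];
            break;
        case OP_EXT16U:
            ok = known[o->src1];
            res = val[o->src1] & 0xffff;
            break;
        case OP_OR:
            ok = known[o->src1] && known[o->src2];
            res = val[o->src1] | val[o->src2];
            break;
        }

        if (ok) {
            o->opc = OP_MOVI; /* constant folding: replace the op */
            o->imm = res;
        }
        /* The destination is a known constant only if the op folded. */
        known[o->dst] = ok;
        val[o->dst] = res;
    }
}
```

To use it, copy `movw_movt_example` into a writable `struct op ops[EXAMPLE_LEN]` and call `fold_constants(ops, EXAMPLE_LEN)`. Every op becomes a movi and the last one is `movi r12, 0xdb6db6db`. Assuming tmp8 and tmp9 are ordinary temps that are dead at the end of the block, the earlier movi results are either overwritten or never read, so a dead-code pass such as liveness analysis can then drop them and leave a single instruction.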
Suppose we have the following ARM code:

  movw r12, #0xb6db
  movt r12, #0xdb6d

In TCG, before optimizations, we'll have:

  movi_i32 tmp8,$0xb6db
  mov_i32 r12,tmp8
  mov_i32 tmp8,r12
  ext16u_i32 tmp8,tmp8
  movi_i32 tmp9,$0xdb6d0000
  or_i32 tmp8,tmp8,tmp9
  mov_i32 r12,tmp8

And after optimizations we'll have this:

  movi_i32 r12,$0xdb6db6db

Here are performance evaluation results on SPEC CPU2000 integer tests in user-mode emulation on an x86_64 host. There were 5 runs of each test on the reference data set. The tables below show the runtime in seconds for all these runs. Gain is the reduction of the median runtime relative to the build without optimizations; negative values mean the optimized build was slower.

ARM guest without optimizations:

Test name           #1        #2        #3        #4        #5    Median
164.gzip      1408.891  1402.323  1407.623  1404.955  1405.396  1405.396
175.vpr        1245.31  1248.758  1247.936  1248.534  1247.534  1247.936
176.gcc        912.561   809.481   847.057   912.636   912.544   912.544
181.mcf        198.384   197.841   199.127   197.976    197.29   197.976
186.crafty    1545.881  1546.051  1546.002  1545.927  1545.945  1545.945
197.parser    3779.954  3779.878   3779.79   3779.94   3779.88   3779.88
252.eon       2563.168  2776.152  2776.395  2776.577  2776.202  2776.202
253.perlbmk   2591.781  2504.078   2507.07  2591.337  2463.401   2507.07
256.bzip2     1306.197  1304.639  1184.853  1305.141  1305.606  1305.141
300.twolf     2918.984  2918.926   2918.93   2918.97  2918.914   2918.93

ARM guest with optimizations:

Test name           #1        #2        #3        #4        #5    Median      Gain
164.gzip      1401.198  1376.337  1401.117   1401.23  1401.246  1401.198     0.30%
175.vpr       1247.964  1151.468   1247.76  1154.419  1242.017  1242.017     0.47%
176.gcc        896.882   918.546   918.297   851.465    918.39   918.297    -0.63%
181.mcf         198.19   197.399   198.421   198.663   198.312   198.312    -0.17%
186.crafty    1520.425  1520.362  1520.477  1520.445  1520.957  1520.445     1.65%
197.parser    3770.943  3770.927  3770.578  3771.048  3770.904  3770.927     0.24%
252.eon       2752.371  2752.111  2752.005  2752.214  2752.109  2752.111     0.87%
253.perlbmk   2577.462  2578.588  2493.567  2578.571  2578.318  2578.318    -2.84%
256.bzip2     1296.198  1271.128  1296.044  1296.321  1296.147  1296.147     0.69%
300.twolf     2888.984  2889.023  2889.225  2889.039   2889.05  2889.039     1.02%

x86_64 guest without optimizations:

Test name           #1        #2        #3        #4        #5    Median
164.gzip       857.654   857.646   857.678   798.119   857.675   857.654
175.vpr        959.265   959.207   959.185   959.461   959.332   959.265
176.gcc        625.722   637.257   646.638   646.614    646.56    646.56
181.mcf        221.666   220.194   220.079   219.868     221.5   220.194
186.crafty    1129.531  1129.739  1129.573  1129.588  1129.624  1129.588
197.parser    1809.517  1809.516  1809.386  1809.477  1809.427  1809.477
253.perlbmk   1774.944  1776.046  1769.865  1774.052  1775.236  1774.944
254.gap       1061.033  1061.158  1061.064  1061.047   1061.01  1061.047
255.vortex    1871.261  1914.144  1914.057  1914.086  1914.127  1914.086
256.bzip2      918.916  1011.828  1011.819   1012.11  1011.932  1011.828
300.twolf     1332.797   1330.56  1330.687  1330.917  1330.602  1330.687

x86_64 guest with optimizations:

Test name           #1        #2        #3        #4        #5    Median      Gain
164.gzip       806.198   854.159   854.184   854.168   854.187   854.168     0.41%
175.vpr        955.905    950.86   955.876   876.397   955.957   955.876     1.82%
176.gcc        641.663   640.189    641.57   641.552   641.514   641.552     0.03%
181.mcf        217.619   218.627   218.699   217.977   216.955   217.977     1.18%
186.crafty    1123.909  1123.852  1123.917  1123.781  1123.805  1123.852     0.51%
197.parser     1813.94  1814.643  1815.286  1814.445   1813.72  1814.445    -0.27%
253.perlbmk   1791.536  1795.642    1793.0  1797.486  1791.401    1793.0    -1.02%
254.gap       1070.605  1070.216  1070.637  1070.168  1070.491  1070.491    -0.89%
255.vortex    1918.764  1918.573  1917.411  1918.287  1918.735  1918.573    -0.23%
256.bzip2     1017.179  1017.083  1017.283  1016.913  1017.189  1017.179    -0.53%
300.twolf     1321.072  1321.109  1321.019  1321.072  1321.004  1321.072     0.72%

254.gap and 255.vortex for the ARM guest and 252.eon for the x86_64 guest do not work under QEMU for unrelated reasons, so they are missing from the tables.

Changes:

v1 -> v2
- State and Vals arrays merged into an array of structures.
- Added reference counting of a temp's copies. This helps reset a temp's state faster in most cases.
- Do not propagate copies through operations with the TCG_OPF_CALL_CLOBBER or TCG_OPF_SIDE_EFFECTS flag.
- Split some expression simplifications into an independent switch.
- Let the compiler handle signed shifts and sign/zero extensions in its implementation-defined way.

v2 -> v3
- Elements of an equivalence class are placed in a doubly linked circular list, so it's easier to choose a new representative.
- The CASE_OP_32_64 macro is used to reduce the number of #ifdefs. Checkpatch is not happy about this change, but I do not think spaces would be appropriate here.
- Some constraints during copy propagation are relaxed.
- Functions tcg_opt_gen_mov and tcg_opt_gen_movi are introduced to reduce code duplication.

Kirill Batuzov (6):
  Add TCG optimizations stub
  Add copy and constant propagation.
  Do constant folding for basic arithmetic operations.
  Do constant folding for boolean operations.
  Do constant folding for shift operations.
  Do constant folding for unary operations.

 Makefile.target |    2 +-
 tcg/optimize.c  |  568 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tcg/tcg.c       |    6 +
 tcg/tcg.h       |    3 +
 4 files changed, 578 insertions(+), 1 deletions(-)
 create mode 100644 tcg/optimize.c

-- 
1.7.4.1