Re: A case that PRE optimization hurts performance
On Mon, Sep 26, 2011 at 6:42 PM, Jeff Law &lt;l...@redhat.com&gt; wrote:

On 09/26/11 05:00, Richard Guenther wrote:

On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time, considering the preceding movl $1, %ebx. Am I missing something?

Yes. Can this be done by PRE or any other optimization in the middle end?

Hm, the paths as you quote them are obfuscated by missed jump-threading. On the tree level we have

  # s_2 = PHI &lt;2(5), 3(4), 2(6), s_25(7)&gt;
  # prephitmp.6_1 = PHI &lt;1(5), 1(4), 1(6), prephitmp.6_3(7)&gt;
  L10:
  t_14 = t_24 + 1;
  D.2729_6 = MEM[base: t_14, offset: 0B];
  D.2732_7 = D.2729_6 != 0;
  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;
  if (D.2734_9 != 0)

where we could thread the cases with prephitmp.6_1 == 1, ultimately removing the '&amp;' and forwarding the D.2729_6 != 0 test. Which would of course cause some code duplication. Jeff, you recently looked at tree jump-threading, can you see if we can improve things on this particular testcase?

There's nothing threading can do here because it doesn't know anything about the value MEM[t_14].

It knows something about prephitmp.6_1 and thus could simplify

  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;

to

  D.2734_9 = D.2732_7;

But admittedly I have no idea if DOM tries to simplify things other than comparisons within the jump threading machinery ... (the above block basically ends in if (D.2729_6 != 0 &amp;&amp; prephitmp.6_1), so I'd guess it's worthwhile to simplify the (complex) conditional expression).

Richard.
Re: A case that PRE optimization hurts performance
On 09/27/11 01:30, Richard Guenther wrote:

It knows something about prephitmp.6_1 and thus could simplify

  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;

to

  D.2734_9 = D.2732_7;

But admittedly I have no idea if DOM tries to simplify things other than comparisons within the jump threading machinery ... (the above block basically ends in if (D.2729_6 != 0 &amp;&amp; prephitmp.6_1), so I'd guess it's worthwhile to simplify the (complex) conditional expression).

Jump threading will enter simplified expressions into the hash tables for every statement in the block, in an effort to utilize any information available to determine the result of the comparison. However, those simplifications aren't ever reflected back into the IL.

The thing to remember is that jump threading is tasked with detecting cases where the control statement has a predetermined destination, utilizing path-sensitive information. Expanding it to do general path-sensitive optimizations is possible, but at even greater cost.

jeff
Re: A case that PRE optimization hurts performance
On Tue, Sep 27, 2011 at 4:56 PM, Jeff Law &lt;l...@redhat.com&gt; wrote:

On 09/27/11 01:30, Richard Guenther wrote:

It knows something about prephitmp.6_1 and thus could simplify

  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;

to

  D.2734_9 = D.2732_7;

But admittedly I have no idea if DOM tries to simplify things other than comparisons within the jump threading machinery ... (the above block basically ends in if (D.2729_6 != 0 &amp;&amp; prephitmp.6_1), so I'd guess it's worthwhile to simplify the (complex) conditional expression).

Jump threading will enter simplified expressions into the hash tables for every statement in the block, in an effort to utilize any information available to determine the result of the comparison. However, those simplifications aren't ever reflected back into the IL. The thing to remember is that jump threading is tasked with detecting cases where the control statement has a predetermined destination, utilizing path-sensitive information. Expanding it to do general path-sensitive optimizations is possible, but at even greater cost.

Yeah, I realize that. It's just that this case is the non-CFG form of

  if (pretmp)
    if (D.1234)
      ...

where jump-threading would probably handle the CFG variant just fine. Oh well ;)

Richard.
Re: A case that PRE optimization hurts performance
On 09/27/11 09:00, Richard Guenther wrote:

The thing to remember is that jump threading is tasked with detecting cases where the control statement has a predetermined destination, utilizing path-sensitive information. Expanding it to do general path-sensitive optimizations is possible, but at even greater cost.

Yeah, I realize that. It's just that this case is the non-CFG form of

  if (pretmp)
    if (D.1234)
      ...

where jump-threading would probably handle the CFG variant just fine.

Possibly. I haven't looked at the RTL version in a long, long time, but it's still around; see cfgcleanup.c::thread_jump. I've been meaning to do some final experiments and declare that code effectively dead (yes, it still does stuff, but not much, at least not the last time I looked).

WRT generalized path-sensitive optimizations, the best work I've seen in this space is from Bodik. I've been wanting to play with some of his ideas for path isolation to simplify our current jump threading implementation. In theory, if that experiment works out, we could utilize that infrastructure to implement other path-sensitive analysis and optimization. Obviously, that's a ways out.

jeff
RE: A case that PRE optimization hurts performance
* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time, considering the preceding movl $1, %ebx. Am I missing something?

Richard.

Yes. Can this be done by PRE or any other optimization in the middle end?

Thanks,
-Jiangning
Re: A case that PRE optimization hurts performance
On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time, considering the preceding movl $1, %ebx. Am I missing something?

Yes. Can this be done by PRE or any other optimization in the middle end?

Thanks,
-Jiangning

Hm, the paths as you quote them are obfuscated by missed jump-threading. On the tree level we have

  # s_2 = PHI &lt;2(5), 3(4), 2(6), s_25(7)&gt;
  # prephitmp.6_1 = PHI &lt;1(5), 1(4), 1(6), prephitmp.6_3(7)&gt;
  L10:
  t_14 = t_24 + 1;
  D.2729_6 = MEM[base: t_14, offset: 0B];
  D.2732_7 = D.2729_6 != 0;
  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;
  if (D.2734_9 != 0)

where we could thread the cases with prephitmp.6_1 == 1, ultimately removing the '&amp;' and forwarding the D.2729_6 != 0 test. Which would of course cause some code duplication. Jeff, you recently looked at tree jump-threading, can you see if we can improve things on this particular testcase?

Thanks,
Richard.
Re: A case that PRE optimization hurts performance
On 09/26/11 05:00, Richard Guenther wrote:

On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time, considering the preceding movl $1, %ebx. Am I missing something?

Yes. Can this be done by PRE or any other optimization in the middle end?

Hm, the paths as you quote them are obfuscated by missed jump-threading. On the tree level we have

  # s_2 = PHI &lt;2(5), 3(4), 2(6), s_25(7)&gt;
  # prephitmp.6_1 = PHI &lt;1(5), 1(4), 1(6), prephitmp.6_3(7)&gt;
  L10:
  t_14 = t_24 + 1;
  D.2729_6 = MEM[base: t_14, offset: 0B];
  D.2732_7 = D.2729_6 != 0;
  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;
  if (D.2734_9 != 0)

where we could thread the cases with prephitmp.6_1 == 1, ultimately removing the '&amp;' and forwarding the D.2729_6 != 0 test. Which would of course cause some code duplication. Jeff, you recently looked at tree jump-threading, can you see if we can improve things on this particular testcase?

There's nothing threading can do here because it doesn't know anything about the value MEM[t_14].

Jeff
RE: A case that PRE optimization hurts performance
-----Original Message-----
From: Jeff Law [mailto:l...@redhat.com]
Sent: Tuesday, September 27, 2011 12:43 AM
To: Richard Guenther
Cc: Jiangning Liu; gcc@gcc.gnu.org
Subject: Re: A case that PRE optimization hurts performance

On 09/26/11 05:00, Richard Guenther wrote:

On Mon, Sep 26, 2011 at 9:39 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time, considering the preceding movl $1, %ebx. Am I missing something?

Yes. Can this be done by PRE or any other optimization in the middle end?

Hm, the paths as you quote them are obfuscated by missed jump-threading. On the tree level we have

  # s_2 = PHI &lt;2(5), 3(4), 2(6), s_25(7)&gt;
  # prephitmp.6_1 = PHI &lt;1(5), 1(4), 1(6), prephitmp.6_3(7)&gt;
  L10:
  t_14 = t_24 + 1;
  D.2729_6 = MEM[base: t_14, offset: 0B];
  D.2732_7 = D.2729_6 != 0;
  D.2734_9 = prephitmp.6_1 &amp; D.2732_7;
  if (D.2734_9 != 0)

where we could thread the cases with prephitmp.6_1 == 1, ultimately removing the '&amp;' and forwarding the D.2729_6 != 0 test. Which would of course cause some code duplication. Jeff, you recently looked at tree jump-threading, can you see if we can improve things on this particular testcase?

There's nothing threading can do here because it doesn't know anything about the value MEM[t_14].

Jeff,

Could you please explain more about this? What information does jump threading want to know about MEM[t_14]? Do you mean it's hard to duplicate that basic block for some reason?
Thanks,
-Jiangning
Re: A case that PRE optimization hurts performance
On Fri, Sep 16, 2011 at 4:00 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

Hi Richard,

I slightly changed the case to be like below,

int f(char *t)
{
    int s = 0;
    while (*t &amp;&amp; s != 1) {
        switch (s) {
        case 0:            /* path 1 */
            s = 2;
            break;
        case 2:            /* path 2 */
            s = 3;         /* changed */
            break;
        default:           /* path 3 */
            if (*t == '-')
                s = 2;
            break;
        }
        t++;
    }
    return s;
}

-O2 is still worse than -O2 -fno-tree-pre. The -O2 -fno-tree-pre result is

f:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movzbl  (%edx), %ecx
        jmp     .L14
        .p2align 4,,7
        .p2align 3
.L5:
        movl    $2, %eax
.L7:
        addl    $1, %edx
        cmpl    $1, %eax
        movzbl  (%edx), %ecx
        je      .L3
.L14:
        testb   %cl, %cl
        je      .L3
        testl   %eax, %eax
        je      .L5
        cmpl    $2, %eax
        .p2align 4,,5
        je      .L17
        cmpb    $45, %cl
        .p2align 4,,5
        je      .L5
        addl    $1, %edx
        cmpl    $1, %eax
        movzbl  (%edx), %ecx
        jne     .L14
        .p2align 4,,7
        .p2align 3
.L3:
        popl    %ebp
        .p2align 4,,2
        ret
        .p2align 4,,7
        .p2align 3
.L17:
        movb    $3, %al
        .p2align 4,,3
        jmp     .L7

while the -O2 result is

f:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        pushl   %ebx
        movzbl  (%edx), %ecx
        jmp     .L14
        .p2align 4,,7
        .p2align 3
.L5:
        movl    $1, %ebx
        movl    $2, %eax
.L7:
        addl    $1, %edx
        testb   %bl, %bl
        movzbl  (%edx), %ecx
        je      .L3
.L14:
        testb   %cl, %cl
        je      .L3
        testl   %eax, %eax
        je      .L5
        cmpl    $2, %eax
        .p2align 4,,5
        je      .L16
        cmpb    $45, %cl
        .p2align 4,,5
        je      .L5
        cmpl    $1, %eax
        setne   %bl
        addl    $1, %edx
        testb   %bl, %bl
        movzbl  (%edx), %ecx
        jne     .L14
        .p2align 4,,7
        .p2align 3
.L3:
        popl    %ebx
        popl    %ebp
        ret
        .p2align 4,,7
        .p2align 3
.L16:
        movl    $1, %ebx
        movb    $3, %al
        jmp     .L7

You may notice that register ebx is introduced, and some more instructions around ebx are generated as well, i.e.

        setne   %bl
        testb   %bl, %bl

I agree with you that in theory PRE does the right thing to minimize the computation cost at the GIMPLE level. However, the problem is that the cost of converting a comparison result to a bool value is not considered, so it actually makes the binary code worse.

For this case, as I summarized below, to complete the same functionality, "with PRE" is worse than "without PRE" for all three paths.

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

It seems to me that with PRE all the testb %bl, %bl should be evaluated at compile-time, considering the preceding movl $1, %ebx. Am I missing something?

Richard.

Thanks,
-Jiangning

-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, August 02, 2011 5:23 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org
Subject: Re: A case that PRE optimization hurts performance

On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

Hi,

For the following simple test case, PRE optimization hoists the computation (s != 1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs on both X86 and ARM, and I believe it is also a problem for other targets.

int f(char *t)
{
    int s = 0;
    while (*t &amp;&amp; s != 1) {
        switch (s) {
        case 0:
            s = 2;
            break;
        case 2:
            s = 1;
            break;
        default:
            if (*t == '-')
                s = 1;
            break;
        }
        t++;
    }
    return s;
}

Taking X86 as an example, with option -O2 you may find 52 instructions generated like below,

f:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
RE: A case that PRE optimization hurts performance
Hi Richard,

I slightly changed the case to be like below,

int f(char *t)
{
    int s = 0;
    while (*t &amp;&amp; s != 1) {
        switch (s) {
        case 0:            /* path 1 */
            s = 2;
            break;
        case 2:            /* path 2 */
            s = 3;         /* changed */
            break;
        default:           /* path 3 */
            if (*t == '-')
                s = 2;
            break;
        }
        t++;
    }
    return s;
}

-O2 is still worse than -O2 -fno-tree-pre. The -O2 -fno-tree-pre result is

f:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movzbl  (%edx), %ecx
        jmp     .L14
        .p2align 4,,7
        .p2align 3
.L5:
        movl    $2, %eax
.L7:
        addl    $1, %edx
        cmpl    $1, %eax
        movzbl  (%edx), %ecx
        je      .L3
.L14:
        testb   %cl, %cl
        je      .L3
        testl   %eax, %eax
        je      .L5
        cmpl    $2, %eax
        .p2align 4,,5
        je      .L17
        cmpb    $45, %cl
        .p2align 4,,5
        je      .L5
        addl    $1, %edx
        cmpl    $1, %eax
        movzbl  (%edx), %ecx
        jne     .L14
        .p2align 4,,7
        .p2align 3
.L3:
        popl    %ebp
        .p2align 4,,2
        ret
        .p2align 4,,7
        .p2align 3
.L17:
        movb    $3, %al
        .p2align 4,,3
        jmp     .L7

while the -O2 result is

f:
        pushl   %ebp
        xorl    %eax, %eax
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        pushl   %ebx
        movzbl  (%edx), %ecx
        jmp     .L14
        .p2align 4,,7
        .p2align 3
.L5:
        movl    $1, %ebx
        movl    $2, %eax
.L7:
        addl    $1, %edx
        testb   %bl, %bl
        movzbl  (%edx), %ecx
        je      .L3
.L14:
        testb   %cl, %cl
        je      .L3
        testl   %eax, %eax
        je      .L5
        cmpl    $2, %eax
        .p2align 4,,5
        je      .L16
        cmpb    $45, %cl
        .p2align 4,,5
        je      .L5
        cmpl    $1, %eax
        setne   %bl
        addl    $1, %edx
        testb   %bl, %bl
        movzbl  (%edx), %ecx
        jne     .L14
        .p2align 4,,7
        .p2align 3
.L3:
        popl    %ebx
        popl    %ebp
        ret
        .p2align 4,,7
        .p2align 3
.L16:
        movl    $1, %ebx
        movb    $3, %al
        jmp     .L7

You may notice that register ebx is introduced, and some more instructions around ebx are generated as well, i.e.

        setne   %bl
        testb   %bl, %bl

I agree with you that in theory PRE does the right thing to minimize the computation cost at the GIMPLE level. However, the problem is that the cost of converting a comparison result to a bool value is not considered, so it actually makes the binary code worse.

For this case, as I summarized below, to complete the same functionality, "with PRE" is worse than "without PRE" for all three paths.

* Without PRE,

Path1:
        movl    $2, %eax
        cmpl    $1, %eax
        je      .L3
Path2:
        movb    $3, %al
        cmpl    $1, %eax
        je      .L3
Path3:
        cmpl    $1, %eax
        jne     .L14

* With PRE,

Path1:
        movl    $1, %ebx
        movl    $2, %eax
        testb   %bl, %bl
        je      .L3
Path2:
        movl    $1, %ebx
        movb    $3, %al
        testb   %bl, %bl
        je      .L3
Path3:
        cmpl    $1, %eax
        setne   %bl
        testb   %bl, %bl
        jne     .L14

Do you have any more thoughts?

Thanks,
-Jiangning

-----Original Message-----
From: Richard Guenther [mailto:richard.guent...@gmail.com]
Sent: Tuesday, August 02, 2011 5:23 PM
To: Jiangning Liu
Cc: gcc@gcc.gnu.org
Subject: Re: A case that PRE optimization hurts performance

On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

Hi,

For the following simple test case, PRE optimization hoists the computation (s != 1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs on both X86 and ARM, and I believe it is also a problem for other targets.

int f(char *t)
{
    int s = 0;
    while (*t &amp;&amp; s != 1) {
        switch (s) {
        case 0:
            s = 2;
            break;
        case 2:
            s = 1;
            break;
        default:
            if (*t == '-')
                s = 1;
            break;
        }
        t++;
    }
    return s;
}

Taking X86 as an example, with option -O2 you may find 52 instructions generated like below,

f:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   57                      push   %edi
   6:   56                      push   %esi
   7:   53                      push   %ebx
   8:   8b 55 08                mov    0x8(%ebp),%edx
   b:   0f b6 0a                movzbl (%edx),%ecx
   e:   84 c9
Re: A case that PRE optimization hurts performance
On Tue, 2 Aug 2011 10:37:03 +0800, Jiangning Liu wrote:

Hi,

For the following simple test case, PRE optimization hoists the computation (s != 1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs on both X86 and ARM, and I believe it is also a problem for other targets.

[...]

Do you have any idea about this?

File a bug report in GCC Bugzilla: http://gcc.gnu.org/bugzilla/

--
VZ
Re: A case that PRE optimization hurts performance
On Tue, Aug 2, 2011 at 4:37 AM, Jiangning Liu &lt;jiangning@arm.com&gt; wrote:

Hi,

For the following simple test case, PRE optimization hoists the computation (s != 1) into the default branch of the switch statement, and finally causes very poor code generation. This problem occurs on both X86 and ARM, and I believe it is also a problem for other targets.

int f(char *t)
{
    int s = 0;
    while (*t &amp;&amp; s != 1) {
        switch (s) {
        case 0:
            s = 2;
            break;
        case 2:
            s = 1;
            break;
        default:
            if (*t == '-')
                s = 1;
            break;
        }
        t++;
    }
    return s;
}

Taking X86 as an example, with option -O2 you may find 52 instructions generated like below,

f:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   57                      push   %edi
   6:   56                      push   %esi
   7:   53                      push   %ebx
   8:   8b 55 08                mov    0x8(%ebp),%edx
   b:   0f b6 0a                movzbl (%edx),%ecx
   e:   84 c9                   test   %cl,%cl
  10:   74 50                   je     62 &lt;f+0x62&gt;
  12:   83 c2 01                add    $0x1,%edx
  15:   85 c0                   test   %eax,%eax
  17:   75 23                   jne    3c &lt;f+0x3c&gt;
  19:   8d b4 26 00 00 00 00    lea    0x0(%esi,%eiz,1),%esi
  20:   0f b6 0a                movzbl (%edx),%ecx
  23:   84 c9                   test   %cl,%cl
  25:   0f 95 c0                setne  %al
  28:   89 c7                   mov    %eax,%edi
  2a:   b8 02 00 00 00          mov    $0x2,%eax
  2f:   89 fb                   mov    %edi,%ebx
  31:   83 c2 01                add    $0x1,%edx
  34:   84 db                   test   %bl,%bl
  36:   74 2a                   je     62 &lt;f+0x62&gt;
  38:   85 c0                   test   %eax,%eax
  3a:   74 e4                   je     20 &lt;f+0x20&gt;
  3c:   83 f8 02                cmp    $0x2,%eax
  3f:   74 1f                   je     60 &lt;f+0x60&gt;
  41:   80 f9 2d                cmp    $0x2d,%cl
  44:   74 22                   je     68 &lt;f+0x68&gt;
  46:   0f b6 0a                movzbl (%edx),%ecx
  49:   83 f8 01                cmp    $0x1,%eax
  4c:   0f 95 c3                setne  %bl
  4f:   89 df                   mov    %ebx,%edi
  51:   84 c9                   test   %cl,%cl
  53:   0f 95 c3                setne  %bl
  56:   89 de                   mov    %ebx,%esi
  58:   21 f7                   and    %esi,%edi
  5a:   eb d3                   jmp    2f &lt;f+0x2f&gt;
  5c:   8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
  60:   b0 01                   mov    $0x1,%al
  62:   5b                      pop    %ebx
  63:   5e                      pop    %esi
  64:   5f                      pop    %edi
  65:   5d                      pop    %ebp
  66:   c3                      ret
  67:   90                      nop
  68:   b8 01 00 00 00          mov    $0x1,%eax
  6d:   5b                      pop    %ebx
  6e:   5e                      pop    %esi
  6f:   5f                      pop    %edi
  70:   5d                      pop    %ebp
  71:   c3                      ret

But with command line option -O2 -fno-tree-pre, there are only 12 instructions generated, and the code would be very clean, like below,

f:
   0:   55                      push   %ebp
   1:   31 c0                   xor    %eax,%eax
   3:   89 e5                   mov    %esp,%ebp
   5:   8b 55 08                mov    0x8(%ebp),%edx
   8:   80 3a 00                cmpb   $0x0,(%edx)
   b:   74 0e                   je     1b &lt;f+0x1b&gt;
   d:   80 7a 01 00             cmpb   $0x0,0x1(%edx)
  11:   b0 02                   mov    $0x2,%al
  13:   ba 01 00 00 00          mov    $0x1,%edx
  18:   0f 45 c2                cmovne %edx,%eax
  1b:   5d                      pop    %ebp
  1c:   c3                      ret

Do you have any idea about this?

If you look at what PRE does, it is clear that it's a profitable transformation, as it reduces the number of instructions computed on the paths where s is set to a constant. If you ask for size optimization you won't get this transformation.

Now - on the tree level we don't see the if-conversion opportunity (we don't have a conditional move instruction), but still we could improve phiopt to handle switch statements - not sure what we could do with the case in question, though, which is

&lt;bb 3&gt;:
  switch (s_3) &lt;default: &lt;L3&gt;, case 0: &lt;L11&gt;, case 2: &lt;L10&gt;&gt;

&lt;L11&gt;:
  goto &lt;bb 7&gt; (&lt;L10&gt;);

&lt;L3&gt;:
  if (D.2703_6 == 45)
    goto &lt;bb 6&gt;;
  else
    goto &lt;bb 7&gt; (&lt;L10&gt;);

&lt;bb 6&gt;:

  # s_2 = PHI &lt;2(4), 1(3), 1(6), s_3(5)&gt;
&lt;L10&gt;:

as the condition at &lt;L3&gt; complicates things. The code is also really, really special (and written in an awful way ...).

Richard.

Thanks,
-Jiangning