Hi,

Have you had a chance to compare the speed of the two versions of the code?

> David,
>
> On 2012-05-16 08:23, David Crayford wrote:
> > Robert,
> >
> > I'm no expert but I have read that newer hardware models (Z10 and above) are
> > essentially RISC processors that run complex instructions in millicode. In the
>
> I may be wrong, but I think the z196 is the first OOO machine and Enterprise PL/I V3R9 pre-dates it
> by two years.
>
> > case of a MVC instruction it would have to do that in a loop which would require
> > branching, the enemy of pipelined execution units. It's also possible to run
> > simple instructions
> > in parallel. It's plausible an MVC instruction can be executed more efficiently
> > as a sequence of LG/STG instructions.
>
> Given that moves are the most executed instructions, at least on x86, (see, among many others
> <www.ijpg.org/index.php/IJACSci/article/download/118/29>) and I have little doubt that the same
> holds true for about any other architecture and that there is special x86 circuitry to optimize MOVS
> instructions, it would be highly surprising if IBM did not make MVC as fast as possible, millicoded
> or not.
>
> > The OOO decode units do this for you with instruction cracking on a z196, it
> > seems that on a z10 the optimizer is doing the same thing.
>
> Possibly, but that does not explain the 10 superfluous reloads of r1.
>
> > See this document - page 21
> > http://www-01.ibm.com/software/htp/tpf/tpfug/tgf11/How_do_you_do_when_youre_a_z196_CPU.pdf
> >
> > Optimizers create arcane code. It's almost impossible to verify without
> > understanding the secret sauce. A lot of the code the optimizers spit out is
> > intractable,
>
> I don't know much about z/OS assembler, but at least I sort of managed to understand the code
> generated by the OS PL/I compiler. The code generated by Enterprise PL/I is completely unreadable,
> even some (or more than some) on this list might have trouble figuring out why it does what it does.
>
> > and it's almost a paradox that a longer code path produces faster code.
> >
> > If you don't like it you can always compile at a different ARCH() level and ask
> > IBM.
>
> Going back to ARCH(5) doesn't produce anything that seems much shorter, still the ridiculous
> reloading of the same register, and oodles and oodles instructions which would run and take time on
> a definitely not-OOO CPU:
>
> 003A58  E300 8238 0014  003119 |  LGF   r0,LINE_PTR(,r8,568)
> 003A5E  4110 E00C       003119 |  LA    r1,_shadow21(,r14,12)
> 003A62  B914 00E0       003119 |  LGFR  r14,r0
> 003A66  D278 B38E 6D33  003118 |  MVC   LINE(121,r11,910),REPT_INIT(r6,3379)
> 003A6C  E3B0 DC20 0004  003119 |  LG    r11,#SPILL17(,r13,3104)
> 003A72  50B0 D25C       003119 |  ST    r11,_temp9(,r13,604)
> 003A76  DE03 D25C 1000  003119 |  ED    _temp9(4,r13,604),_shadow21(r1,0)
> 003A7C  4110 E003       003119 |  LA    r1,#AddressShadow(,r14,3)
> 003A80  41F0 E00A       003119 |  LA    r15,#AddressShadow(,r14,10)
> 003A84  D202 1001 D25D  003119 |  MVC   _shadow21(3,r1,1),_temp9(r13,605)
> 003A8A  9240 E003       003119 |  MVI   _shadow21(r14,3),64
> 003A8E  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003A92  50B0 D2E4       003119 |  ST    r11,_temp8(,r13,740)
> 003A96  41B0 E017       003119 |  LA    r11,#AddressShadow(,r14,23)
> 003A9A  4110 100E       003119 |  LA    r1,_shadow21(,r1,14)
> 003A9E  DE03 D2E4 1000  003119 |  ED    _temp8(4,r13,740),_shadow21(r1,0)
> 003AA4  D202 F001 D2E5  003119 |  MVC   _shadow21(3,r15,1),_temp8(r13,741)
> 003AAA  9240 E00A       003119 |  MVI   _shadow21(r14,10),64
> 003AAE  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003AB2  E3F0 DB98 0004  003119 |  LG    r15,#SPILL0(,r13,2968)
> 003AB8  D202 E011 1010  003119 |  MVC   _shadow21(3,r14,17),_shadow21(r1,16)
> 003ABE  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003AC2  D206 D2D4 F4A4  003119 |  MVC   _temp19(7,r13,724),' ......'(r15,1188)
> 003AC8  D203 D26C 1013  003119 |  MVC   _temp15(4,r13,620),_shadow18(r1,19)
> 003ACE  4110 D26C       003119 |  LA    r1,_temp15(,r13,620)
> 003AD2  D202 D24C 1001  003119 |  MVC   _temp11(3,r13,588),_shadow12(r1,1)
> 003AD8  4110 D24C       003119 |  LA    r1,_temp11(,r13,588)
> 003ADC  DE06 D2D4 1000  003119 |  ED    _temp19(7,r13,724),_temp11(r1,0)
> 003AE2  D205 B000 D2D5  003119 |  MVC   _shadow21(6,r11,0),_temp19(r13,725)
> 003AE8  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003AEC  D206 D2CC F4A4  003119 |  MVC   _temp21(7,r13,716),' ......'(r15,1188)
> 003AF2  D202 D249 101B  003119 |  MVC   _temp18(3,r13,585),_shadow12(r1,27)
> 003AF8  D202 D246 D249  003119 |  MVC   _temp20(3,r13,582),_temp18(r13,585)
> 003AFE  4110 E028       003119 |  LA    r1,#AddressShadow(,r14,40)
> 003B02  E300 D246 0090  003119 |  LLGC  r0,<a1:d582:l1>(,r13,582)
> 003B08  E300 3114 0080  003119 |  NG    r0,=X'00000000 0000000F'
> 003B0E  41B0 D246       003119 |  LA    r11,_temp20(,r13,582)
> 003B12  4200 D246       003119 |  STC   r0,<a1:d582:l1>(,r13,582)
> 003B16  DE06 D2CC B000  003119 |  ED    _temp21(7,r13,716),_temp20(r11,0)
> 003B1C  D204 1000 D2CE  003119 |  MVC   _shadow21(5,r1,0),_temp21(r13,718)
> 003B22  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B26  E300 1026 0014  003119 |  LGF   r0,_shadow19(,r1,38)
> 003B2C  5000 E030       003119 |  ST    r0,_shadow19(,r14,48)
> 003B30  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B34  E300 102A 0014  003119 |  LGF   r0,_shadow19(,r1,42)
> 003B3A  5000 E036       003119 |  ST    r0,_shadow19(,r14,54)
> 003B3E  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B42  E300 102E 0014  003119 |  LGF   r0,_shadow19(,r1,46)
> 003B48  5000 E03D       003119 |  ST    r0,_shadow19(,r14,61)
> 003B4C  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B50  4300 1036       003119 |  IC    r0,_shadow21(,r1,54)
> 003B54  4200 E04B       003119 |  STC   r0,_shadow21(,r14,75)
> 003B58  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B5C  E300 1043 0014  003119 |  LGF   r0,_shadow19(,r1,67)
> 003B62  5000 E05F       003119 |  ST    r0,_shadow19(,r14,95)
> 003B66  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B6A  4800 1047       003119 |  LH    r0,_shadow20(,r1,71)
> 003B6E  B914 0000       003119 |  LGFR  r0,r0
> 003B72  4000 E064       003119 |  STH   r0,_shadow20(,r14,100)
> 003B76  5810 8000       003119 |  L     r1,REPT_PTR(,r8,0)
> 003B7A  4800 1049       003119 |  LH    r0,_shadow20(,r1,73)
> 003B7E  B914 0000       003119 |  LGFR  r0,r0
> 003B82  4000 E067       003119 |  STH   r0,_shadow20(,r14,103)
>
> When I started with PL/I in 1985, we were told never to initialize a structure multiple times with
> '', but to do it once, copy the initialized structure to a copy and re-initialize it with this copy,
> as the compiler would just generate a simple MVC. For arrays of structures the multi-init using ''
> was even worse, but by doing a
>
> array_of_structure(1) = ''
>
> followed by
>
> array_of_structure = array_of_structure(1)
>
> The code was near optimal (and by having an initialized STATIC copy of the structure at hand and
> using that), the code was for all intents and purposes optimal. Not so with Enterprise PL/I although
> I believe, but lacking access to EPLI 4.1 & 4.2, some issues have been addressed.
> Here's an example of OS PL/I V2.3.0 versus Enterprise PL/I V3R9M0:
>
> dcl 1 rept_line(10),
>       2 z00001 char (3),
>       2 tr     pic 'zzz9',
>       2 z00002 char (3),
>       2 ri     pic 'zzz9',
>       2 z00003 char (3),
>       2 da     char (3),
>       2 z00004 char (3),
>       2 km     pic '(3)z9v.9',
>       2 z00005 char (3),
>       2 hh     pic 'z9.',
>       2 mm     pic '99',
>       2 z00006 char (3),
>       2 v      pic 'zz9v.9',
>       2 z00007 char (3),
>       2 na     char (4),
>       2 z00008 char (2),
>       2 ty     char (4),
>       2 z00009 char (3),
>       2 co     char (4),
>       2 z00010 char (2),
>       2 wa,
>         3 whh  pic 'z9.',
>         3 wmm  pic '99',
>       2 z00011 char (3),
>       2 sp     char (1),
>       2 z00012 char (255),
>       2 de,
>         3 dhh  pic 'z9.',
>         3 dmm  pic '99',
>       2 z00013 char (3),
>       2 ar,
>         3 ahh  pic 'z9.',
>         3 amm  pic '99',
>       2 z00014 char (3),
>       2 date,
>         3 year   pic '9999',
>         3 z00015 char (1),
>         3 month  pic '99',
>         3 z00016 char (1),
>         3 day    pic '99';
>
> /* Just fill the first element with something total random */
> /* Q&D, too long, not good, but that's safe in PL/I */
> string(rept_line(1)) = '.z,dmbvn;aehj,mzncbkmsdlkjsjsndvfkl\hjsb' ||
>                        'fjhbc.blkwaioyuh.m,jdnsvkjxbvhjbzdfwtytk' ||
>                        'vkjbnsegfirahjgouegkjnzkjgh8eryghkjghxjv' ||
>                        'uye9tkjgkjvuhkjzxng-oipu8ynkjh4268srtjsc' ||
>                        'uhkdlgozdugjnrg;hzdfgi.zdlnhg;zfhjgiozdh' ||
>                        'iorjhdhgjzndg;hzdohgjdrhgjiozd-862jhaso9' ||
>                        'fhhgoishiojsdrhjdiuhz,dmbvn;aehj,mzncbkk' ||
>                        'l59uyjhlkxjbxofyixhjjhbc.blkwaioyuh.m,j0' ||
>                        'ftkjgkjvuhkjzxng-oipkjbnsegfirahjgouegkk' ||
>                        'llgozdugjnrg;hzdfgi.mbx/hjjxfhj(*^^^%$?0';
> rept_line = rept_line(1);
>
> The code generated by OS PL/I V2.3.0 - OPT(2):
>
> * STATEMENT NUMBER 15
> 0000E0  41 E0 D 0D0                LA    14,REPT_LINE.Z00001+357
> 0000E4  50 E0 D 0C8                ST    14,200(0,13)
> 0000E8  41 70 D 0C8                LA    7,200(0,13)
> 0000EC  50 70 3 57C                ST    7,1404(0,3)
> 0000F0  41 10 3 57C                LA    1,1404(0,3)
> 0000F4  58 F0 3 020                L     15,A..IBMBAPMA
> 0000F8  05 EF                      BALR  14,15
>
> Not very nice, a call to the library, but once in a program? We have to live with it if we choose
> this kind of initialization.
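
A side note on the declaration quoted above: adding up the field widths gives the element size that both compilers have to step through. The sketch below does that sum in Python (my choice for illustration, not something from Robert's post). It assumes CHAR(n) occupies n bytes and that a PICTURE item occupies one byte per picture character except V, which takes no storage. The name FIELDS and the flattening of the substructures wa, de, ar and date are mine.

```python
# Field widths in bytes, read off the rept_line declaration.
# Assumption: CHAR(n) is n bytes; a PICTURE item is one byte per picture
# character, except V, which occupies no storage.
# pic '(3)z9v.9' expands to zzz9v.9 -> 6 bytes; pic 'zz9v.9' -> 5 bytes.
FIELDS = [
    ("z00001", 3), ("tr", 4), ("z00002", 3), ("ri", 4), ("z00003", 3),
    ("da", 3), ("z00004", 3), ("km", 6), ("z00005", 3), ("hh", 3),
    ("mm", 2), ("z00006", 3), ("v", 5), ("z00007", 3), ("na", 4),
    ("z00008", 2), ("ty", 4), ("z00009", 3), ("co", 4), ("z00010", 2),
    ("whh", 3), ("wmm", 2), ("z00011", 3), ("sp", 1), ("z00012", 255),
    ("dhh", 3), ("dmm", 2), ("z00013", 3), ("ahh", 3), ("amm", 2),
    ("z00014", 3), ("year", 4), ("z00015", 1), ("month", 2),
    ("z00016", 1), ("day", 2),
]

ELEMENTS = 10  # rept_line(10)

element_size = sum(width for _, width in FIELDS)
array_size = ELEMENTS * element_size

print("bytes per element:", element_size)
print("bytes in the array:", array_size)
```

Under those assumptions each element comes to 357 bytes and the whole array to 3570 bytes.
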
>
> * STATEMENT NUMBER 16
> 0000FA  41 90 D 0D0                LA    9,REPT_LINE.Z00001+357
> 0000FE  41 80 0 00C                LA    8,12(0,0)
> 000102  41 70 D 235                LA    7,REPT_LINE.Z00001+714
> 000106                     CL.7    EQU   *
> 000106  D2 FF 7 000 9 000          MVC   0(256,7),0(9)
> 00010C  41 70 7 100                LA    7,256(0,7)
> 000110  41 90 9 100                LA    9,256(0,9)
> 000114  46 80 2 02A                BCT   8,CL.7
> 000118  D2 8C 7 000 9 000          MVC   0(141,7),0(9)
>
> Inner loop: MVC, 2 x LA and BCT
>
> The code generated by Enterprise PL/I V3R9 OPT(3), ARCH(9) - 108 is statement 15 above, 119 is 17:
>
> 00008C  4110 D0CC       108 |        LA     r1,REPT_LINE(,r13,204)
> 000090  A709 0001       119 |        LGHI   r0,H'1'
>
> Optimizer seemed to have moved other code here...
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
> 000094  5020 2010       002 |        ST     r2,<s43:d16:l4>(,r2,16)
> 000098  9231 D0C1       059 |        MVI    _Sfi(r13,193),49
> 00009C  E3F0 3004 0014  059 |        LGF    r15,=A(_ON_Begin_60_Blk_2)(,r3,4)
> 0000A2  50D0 DF04       059 |        ST     r13,<a1:d3844:l4>(,r13,3844)
> 0000A6  50F0 DF00       059 |        ST     r15,<a1:d3840:l4>(,r13,3840)
> 0000AA  E3E0 DF00 0004  059 |        LG     r14,_temp1(,r13,3840)
> 0000B0  E3E0 D0C4 0024  059 |        STG    r14,_Sfi(,r13,196)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> 0000B6  C0E0 0000 00E9  108 |        LARL   r14,F'233'
> 0000BC  D2FF 1000 E1E8  108 |        MVC    _...(256,r1,0),'.z,...'(r14,488)
> 0000C2  D264 1100 E2E8  108 |        MVC    _shad...(101,r1,256),'.z,...'(r14,744)
>
> 0000C8                  119 | @1L2   DS     0H
> 0000C8  EBE0 0009 000D  119 |        SLLG   r14,r0,9                  r14 = 512 x r0
> 0000CE  EB10 0007 000D  119 |        SLLG   r1,r0,7                   r1  = 128 x r0
> 0000D4  EBF0 0005 000D  119 |        SLLG   r15,r0,5                  r15 =  32 x r0
> 0000DA  1FE1            119 |        SLR    r14,r1                    r14 = 384 x r0
> 0000DC  EB40 0002 000D  119 |        SLLG   r4,r0,2                   r4  =   4 x r0
> 0000E2  1FEF            119 |        SLR    r14,r15                   r14 = 352 x r0
> 0000E4  B904 0010       119 |        LGR    r1,r0                     r1  = r0
> 0000E8  1EE4            119 |        ALR    r14,r4                    r14 = 356 x r0
> 0000EA  A70A 0001       119 |        AHI    r0,H'1'
> 0000EE  1E1E            119 |        ALR    r1,r14                    r1  = 357 x r0
> 0000F0  E311 DF67 FF71  119 |        LAY    r1,REPT_LINE(r1,r13,-153)
> 0000F6  D2FF 1000 D0CC  119 |        MVC    REPT_LINE(256,r1,0),REPT_LINE(r13,204)
> 0000FC  D264 1100 D1CC  119 |        MVC    REPT_LINE(101,r1,256),REPT_LINE(r13,460)
> 000102  EC0C FFE3 0A7E  119 |        CIJNH  r0,H'10',@1L2
>
> Inner loop: WTH! For crying out loud... Is this really a "fast" multiply by 357??? And why waste
> three extra registers on it??? Oh yes, because the instructions overlap...
>
> The equivalent inner loop using ARCH(5), the lowest possible by EPLI V3R9:
>
> 0000D8                  000119 | @1L2   DS     0H
> 0000D8  B904 00E0       000119 |        LGR    r14,r0
> 0000DC  A7EC 0165       000119 |        MHI    r14,H'357'
> 0000E0  A70A 0001       000119 |        AHI    r0,H'1'
> 0000E4  A70E 000A       000119 |        CHI    r0,H'10'
> 0000E8  41EE FE9A       000119 |        LA     r14,REPT_LINE(r14,r15,3738)
> 0000EC  D2FF E000 1000  000119 |        MVC    REPT_LINE(256,r14,0),REPT_LINE(r1,0)
> 0000F2  D264 E100 1100  000119 |        MVC    REPT_LINE(101,r14,256),REPT_LINE(r1,256)
> 0000F8  A7D4 FFF0       000119 |        JNH    @1L2
>
> OK, it contains a multiply, a "slow" instruction, be it that it can be made pretty fast if you look
> at the x86 offerings from AMD & Intel (Sandy Bridge: 64 bit mul in 3 cycles). However, given that
> this is a normal non-interleaved array, why do you need a multiplication at all. The V2.3.0 compiler
> clearly demonstrated that you don't, and did so almost three decades ago!!!
>
> Again, I just observe, your boss picks up the bill for the CPU cycles used...
>
> If your company is paying thousands of dollars per year to be able to use Enterprise PL/I, don't you
> think you are entitled to a compiler that generates the best possible code?
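
The complaint about the multiply can be checked with a little arithmetic. The sketch below, in Python purely for illustration, does two things. It compares the shift-and-add sequence from the ARCH(9) loop with a plain multiply and with a running offset that is simply bumped by 357 on each pass. It then mimics the V2.3.0 approach of one forward copy in chunks of at most 256 bytes. The function names are hypothetical, and the 357-byte element size is taken from the listings (256 + 101); none of this says anything about actual timings on any machine.

```python
ELEMENT = 357  # bytes per rept_line element (256 + 101 in the listings)
COUNT = 10     # rept_line(10)


def shift_add_357(i):
    """Shift-and-add form of i * 357, following the ARCH(9) register comments."""
    r14 = (i << 9) - (i << 7)   # 512i - 128i = 384i
    r14 -= i << 5               # 384i - 32i  = 352i
    r14 += i << 2               # 352i + 4i   = 356i
    return i + r14              # 357i


def running_offsets(count, step):
    """Strength-reduced alternative: no multiply, just add the step each pass."""
    offset = 0
    for _ in range(count):
        yield offset
        offset += step


def propagate_first_element(buf, element=ELEMENT, count=COUNT, chunk=256):
    """Copy element 1 over elements 2..count with one forward chunked copy.

    Each chunk is no larger than one element, so the bytes it reads were
    already final when it runs (the idea behind the MVC/LA/LA/BCT loop,
    not a model of the hardware).
    """
    src, dst = 0, element
    remaining = (count - 1) * element
    chunks = []
    while remaining > 0:
        n = min(chunk, remaining)
        buf[dst:dst + n] = buf[src:src + n]
        chunks.append(n)
        src, dst, remaining = src + n, dst + n, remaining - n
    return chunks


if __name__ == "__main__":
    assert all(shift_add_357(i) == 357 * i for i in range(1, COUNT + 1))
    assert list(running_offsets(COUNT, ELEMENT)) == [ELEMENT * i for i in range(COUNT)]

    first = bytes((7 * k + 3) % 256 for k in range(ELEMENT))
    buf = bytearray(first) + bytearray((COUNT - 1) * ELEMENT)
    chunks = propagate_first_element(buf)
    assert all(buf[i * ELEMENT:(i + 1) * ELEMENT] == first for i in range(COUNT))
    print(len(chunks) - 1, "chunks of 256 plus one of", chunks[-1])
```

The copy covers 9 x 357 = 3213 bytes, which splits into 12 chunks of 256 plus one of 141. That matches the BCT count of 12 and the trailing 141-byte MVC in the V2.3.0 listing. All three ways of producing the element offsets agree, so the open question is only which sequence is cheapest on a given machine, and that needs measuring.
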
>
> Robert
> --
> Robert AH Prins
> robert(a)prino(d)org
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@bama.ua.edu with the message: INFO IBM-MAIN