David,
On 2012-05-16 08:23, David Crayford wrote:
> Robert,
>
> I'm no expert but I have read that newer hardware models (Z10 and above) are
> essentially RISC processors that run complex instructions in millicode.
I may be wrong, but I think the z196 is the first OOO machine and Enterprise PL/I V3R9 pre-dates it
by two years.
> In the case of a MVC instruction it would have to do that in a loop which would
> require branching, the enemy of pipelined execution units. It's also possible to
> run simple instructions in parallel. It's plausible an MVC instruction can be
> executed more efficiently as a sequence of LG/STG instructions.
Moves are the most executed instructions, at least on x86 (see, among many others,
<www.ijpg.org/index.php/IJACSci/article/download/118/29>), and I have little doubt that the same
holds true for just about any other architecture. Given that x86 has special circuitry to optimize
MOVS instructions, it would be highly surprising if IBM did not make MVC as fast as possible,
millicoded or not.
> The OOO decode units do this for you with instruction cracking on a z196, it
> seems that on a z10 the optimizer is doing the same thing.
Possibly, but that does not explain the 10 superfluous reloads of r1.
> See this document - page 21
>
> http://www-01.ibm.com/software/htp/tpf/tpfug/tgf11/How_do_you_do_when_youre_a_z196_CPU.pdf
>
> Optimizers create arcane code. It's almost impossible to verify without
> understanding the secret sauce. A lot of the code the optimizers spit out is
> intractable,
I don't know much about z/OS assembler, but at least I sort of managed to understand the code
generated by the OS PL/I compiler. The code generated by Enterprise PL/I is completely unreadable;
even some (or more than some) on this list might have trouble figuring out why it does what it does.
> and it's almost a paradox that a longer code path produces faster code.
>
> If you don't like it you can always compile at a different ARCH() level and
> ask IBM.
Going back to ARCH(5) doesn't produce anything that seems much shorter: still the ridiculous
reloading of the same register, and oodles and oodles of instructions which would run and take time
on a definitely not-OOO CPU:
003A58 E300 8238 0014 003119 | LGF r0,LINE_PTR(,r8,568)
003A5E 4110 E00C 003119 | LA r1,_shadow21(,r14,12)
003A62 B914 00E0 003119 | LGFR r14,r0
003A66 D278 B38E 6D33 003118 | MVC LINE(121,r11,910),REPT_INIT(r6,3379)
003A6C E3B0 DC20 0004 003119 | LG r11,#SPILL17(,r13,3104)
003A72 50B0 D25C 003119 | ST r11,_temp9(,r13,604)
003A76 DE03 D25C 1000 003119 | ED _temp9(4,r13,604),_shadow21(r1,0)
003A7C 4110 E003 003119 | LA r1,#AddressShadow(,r14,3)
003A80 41F0 E00A 003119 | LA r15,#AddressShadow(,r14,10)
003A84 D202 1001 D25D 003119 | MVC _shadow21(3,r1,1),_temp9(r13,605)
003A8A 9240 E003 003119 | MVI _shadow21(r14,3),64
003A8E 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003A92 50B0 D2E4 003119 | ST r11,_temp8(,r13,740)
003A96 41B0 E017 003119 | LA r11,#AddressShadow(,r14,23)
003A9A 4110 100E 003119 | LA r1,_shadow21(,r1,14)
003A9E DE03 D2E4 1000 003119 | ED _temp8(4,r13,740),_shadow21(r1,0)
003AA4 D202 F001 D2E5 003119 | MVC _shadow21(3,r15,1),_temp8(r13,741)
003AAA 9240 E00A 003119 | MVI _shadow21(r14,10),64
003AAE 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003AB2 E3F0 DB98 0004 003119 | LG r15,#SPILL0(,r13,2968)
003AB8 D202 E011 1010 003119 | MVC _shadow21(3,r14,17),_shadow21(r1,16)
003ABE 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003AC2 D206 D2D4 F4A4 003119 | MVC _temp19(7,r13,724),' ......'(r15,1188)
003AC8 D203 D26C 1013 003119 | MVC _temp15(4,r13,620),_shadow18(r1,19)
003ACE 4110 D26C 003119 | LA r1,_temp15(,r13,620)
003AD2 D202 D24C 1001 003119 | MVC _temp11(3,r13,588),_shadow12(r1,1)
003AD8 4110 D24C 003119 | LA r1,_temp11(,r13,588)
003ADC DE06 D2D4 1000 003119 | ED _temp19(7,r13,724),_temp11(r1,0)
003AE2 D205 B000 D2D5 003119 | MVC _shadow21(6,r11,0),_temp19(r13,725)
003AE8 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003AEC D206 D2CC F4A4 003119 | MVC _temp21(7,r13,716),' ......'(r15,1188)
003AF2 D202 D249 101B 003119 | MVC _temp18(3,r13,585),_shadow12(r1,27)
003AF8 D202 D246 D249 003119 | MVC _temp20(3,r13,582),_temp18(r13,585)
003AFE 4110 E028 003119 | LA r1,#AddressShadow(,r14,40)
003B02 E300 D246 0090 003119 | LLGC r0,<a1:d582:l1>(,r13,582)
003B08 E300 3114 0080 003119 | NG r0,=X'00000000 0000000F'
003B0E 41B0 D246 003119 | LA r11,_temp20(,r13,582)
003B12 4200 D246 003119 | STC r0,<a1:d582:l1>(,r13,582)
003B16 DE06 D2CC B000 003119 | ED _temp21(7,r13,716),_temp20(r11,0)
003B1C D204 1000 D2CE 003119 | MVC _shadow21(5,r1,0),_temp21(r13,718)
003B22 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B26 E300 1026 0014 003119 | LGF r0,_shadow19(,r1,38)
003B2C 5000 E030 003119 | ST r0,_shadow19(,r14,48)
003B30 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B34 E300 102A 0014 003119 | LGF r0,_shadow19(,r1,42)
003B3A 5000 E036 003119 | ST r0,_shadow19(,r14,54)
003B3E 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B42 E300 102E 0014 003119 | LGF r0,_shadow19(,r1,46)
003B48 5000 E03D 003119 | ST r0,_shadow19(,r14,61)
003B4C 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B50 4300 1036 003119 | IC r0,_shadow21(,r1,54)
003B54 4200 E04B 003119 | STC r0,_shadow21(,r14,75)
003B58 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B5C E300 1043 0014 003119 | LGF r0,_shadow19(,r1,67)
003B62 5000 E05F 003119 | ST r0,_shadow19(,r14,95)
003B66 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B6A 4800 1047 003119 | LH r0,_shadow20(,r1,71)
003B6E B914 0000 003119 | LGFR r0,r0
003B72 4000 E064 003119 | STH r0,_shadow20(,r14,100)
003B76 5810 8000 003119 | L r1,REPT_PTR(,r8,0)
003B7A 4800 1049 003119 | LH r0,_shadow20(,r1,73)
003B7E B914 0000 003119 | LGFR r0,r0
003B82 4000 E067 003119 | STH r0,_shadow20(,r14,103)
When I started with PL/I in 1985, we were told never to initialize a structure multiple times with
'', but to do it once, save the initialized structure in a copy, and re-initialize from that copy,
as the compiler would then just generate a simple MVC. For arrays of structures the multi-init
using '' was even worse, but by doing a
array_of_structure(1) = ''
followed by
array_of_structure = array_of_structure(1)
the code was near optimal (and with an initialized STATIC copy of the structure at hand and used
instead, the code was for all intents and purposes optimal). Not so with Enterprise PL/I, although
I believe (lacking access to EPLI 4.1 & 4.2, I cannot check) that some issues have since been
addressed. Here's an example of OS PL/I V2.3.0 versus Enterprise PL/I V3R9M0:
dcl 1 rept_line(10),
2 z00001 char (3),
2 tr pic 'zzz9',
2 z00002 char (3),
2 ri pic 'zzz9',
2 z00003 char (3),
2 da char (3),
2 z00004 char (3),
2 km pic '(3)z9v.9',
2 z00005 char (3),
2 hh pic 'z9.',
2 mm pic '99',
2 z00006 char (3),
2 v pic 'zz9v.9',
2 z00007 char (3),
2 na char (4),
2 z00008 char (2),
2 ty char (4),
2 z00009 char (3),
2 co char (4),
2 z00010 char (2),
2 wa,
3 whh pic 'z9.',
3 wmm pic '99',
2 z00011 char (3),
2 sp char (1),
2 z00012 char (255),
2 de,
3 dhh pic 'z9.',
3 dmm pic '99',
2 z00013 char (3),
2 ar,
3 ahh pic 'z9.',
3 amm pic '99',
2 z00014 char (3),
2 date,
3 year pic '9999',
3 z00015 char (1),
3 month pic '99',
3 z00016 char (1),
3 day pic '99';
/* Just fill the first element with something totally random */
/* Q&D, too long, not good, but that's safe in PL/I */
string(rept_line(1)) = '.z,dmbvn;aehj,mzncbkmsdlkjsjsndvfkl\hjsb' ||
'fjhbc.blkwaioyuh.m,jdnsvkjxbvhjbzdfwtytk' ||
'vkjbnsegfirahjgouegkjnzkjgh8eryghkjghxjv' ||
'uye9tkjgkjvuhkjzxng-oipu8ynkjh4268srtjsc' ||
'uhkdlgozdugjnrg;hzdfgi.zdlnhg;zfhjgiozdh' ||
'iorjhdhgjzndg;hzdohgjdrhgjiozd-862jhaso9' ||
'fhhgoishiojsdrhjdiuhz,dmbvn;aehj,mzncbkk' ||
'l59uyjhlkxjbxofyixhjjhbc.blkwaioyuh.m,j0' ||
'ftkjgkjvuhkjzxng-oipkjbnsegfirahjgouegkk' ||
'llgozdugjnrg;hzdfgi.mbx/hjjxfhj(*^^^%$?0';
rept_line = rept_line(1);
The code generated by OS PL/I V2.3.0 - OPT(2):
* STATEMENT NUMBER 15
0000E0 41 E0 D 0D0 LA 14,REPT_LINE.Z00001+357
0000E4 50 E0 D 0C8 ST 14,200(0,13)
0000E8 41 70 D 0C8 LA 7,200(0,13)
0000EC 50 70 3 57C ST 7,1404(0,3)
0000F0 41 10 3 57C LA 1,1404(0,3)
0000F4 58 F0 3 020 L 15,A..IBMBAPMA
0000F8 05 EF BALR 14,15
Not very nice, a call to the library, but once in a program? We have to live with it if we choose
this kind of initialization.
* STATEMENT NUMBER 16
0000FA 41 90 D 0D0 LA 9,REPT_LINE.Z00001+357
0000FE 41 80 0 00C LA 8,12(0,0)
000102 41 70 D 235 LA 7,REPT_LINE.Z00001+714
000106 CL.7 EQU *
000106 D2 FF 7 000 9 000 MVC 0(256,7),0(9)
00010C 41 70 7 100 LA 7,256(0,7)
000110 41 90 9 100 LA 9,256(0,9)
000114 46 80 2 02A BCT 8,CL.7
000118 D2 8C 7 000 9 000 MVC 0(141,7),0(9)
Inner loop: MVC, 2 x LA and BCT
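To make explicit what that loop does, here is a minimal Python sketch of the same idea, not the compiler's actual algorithm: two running addresses, twelve 256-byte moves and one 141-byte tail, which together cover 9 x 357 = 3213 bytes without a single multiply. The names ELEM, COUNT and propagate_first_element are mine and purely illustrative, and the fill pattern is arbitrary. Because source and target are 357 bytes apart, each 256-byte chunk reads only bytes that already hold their final value.

```python
ELEM = 357    # bytes per rept_line element, per the +357/+714 offsets in the listing
COUNT = 10    # rept_line(10)

def propagate_first_element(buf):
    """Copy element 1 over elements 2..COUNT in 256-byte chunks, no multiply."""
    src, dst = 0, ELEM                            # LA 9,... and LA 7,...
    for _ in range(12):                           # LA 8,12 ... BCT 8,CL.7
        buf[dst:dst + 256] = buf[src:src + 256]   # MVC 0(256,7),0(9)
        dst += 256                                # LA 7,256(0,7)
        src += 256                                # LA 9,256(0,9)
    buf[dst:dst + 141] = buf[src:src + 141]       # MVC 0(141,7),0(9)
    return buf

# Hypothetical test data: element 1 holds an arbitrary pattern, the rest is zero.
buf = bytearray(ELEM * COUNT)
buf[:ELEM] = bytes(i % 251 for i in range(ELEM))
propagate_first_element(buf)
```

After the call every one of the ten 357-byte slices of buf should equal the first one.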
The code generated by Enterprise PL/I V3R9 OPT(3), ARCH(9) - 108 is statement
15 above, 119 is 16:
00008C 4110 D0CC 108 | LA r1,REPT_LINE(,r13,204)
000090 A709 0001 119 | LGHI r0,H'1'
Optimizer seemed to have moved other code here...
vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
000094 5020 2010 002 | ST r2,<s43:d16:l4>(,r2,16)
000098 9231 D0C1 059 | MVI _Sfi(r13,193),49
00009C E3F0 3004 0014 059 | LGF r15,=A(_ON_Begin_60_Blk_2)(,r3,4)
0000A2 50D0 DF04 059 | ST r13,<a1:d3844:l4>(,r13,3844)
0000A6 50F0 DF00 059 | ST r15,<a1:d3840:l4>(,r13,3840)
0000AA E3E0 DF00 0004 059 | LG r14,_temp1(,r13,3840)
0000B0 E3E0 D0C4 0024 059 | STG r14,_Sfi(,r13,196)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0000B6 C0E0 0000 00E9 108 | LARL r14,F'233'
0000BC D2FF 1000 E1E8 108 | MVC _...(256,r1,0),'.z,...'(r14,488)
0000C2 D264 1100 E2E8 108 | MVC _shad...(101,r1,256),'.z,...'(r14,744)
0000C8 119 | @1L2 DS 0H
0000C8 EBE0 0009 000D 119 | SLLG r14,r0,9 r14 = 512 x r0
0000CE EB10 0007 000D 119 | SLLG r1,r0,7 r1 = 128 x r0
0000D4 EBF0 0005 000D 119 | SLLG r15,r0,5 r15 = 32 x r0
0000DA 1FE1 119 | SLR r14,r1 r14 = 384 x r0
0000DC EB40 0002 000D 119 | SLLG r4,r0,2 r4 = 4 x r0
0000E2 1FEF 119 | SLR r14,r15 r14 = 352 x r0
0000E4 B904 0010 119 | LGR r1,r0 r1 = r0
0000E8 1EE4 119 | ALR r14,r4 r14 = 356 x r0
0000EA A70A 0001 119 | AHI r0,H'1'
0000EE 1E1E 119 | ALR r1,r14 r1 = 357 x r0
0000F0 E311 DF67 FF71 119 | LAY r1,REPT_LINE(r1,r13,-153)
0000F6 D2FF 1000 D0CC 119 | MVC REPT_LINE(256,r1,0),REPT_LINE(r13,204)
0000FC D264 1100 D1CC 119 | MVC REPT_LINE(101,r1,256),REPT_LINE(r13,460)
000102 EC0C FFE3 0A7E 119 | CIJNH r0,H'10',@1L2
Inner loop: WTH! For crying out loud... Is this really a "fast" multiply by 357??? And why waste
three extra registers on it??? Oh yes, because the instructions overlap...
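For anyone who wants to check my register comments, here is a small Python sketch that replays the shift/add/subtract sequence step by step. The function name times_357_shift_add is hypothetical, and Python integers stand in for the registers, so register widths and overflow are ignored (harmless for loop counters of 1 to 10).

```python
def times_357_shift_add(r0):
    """Replay the SLLG/SLR/ALR sequence from the ARCH(9) listing for one value of r0."""
    r14 = r0 << 9     # SLLG r14,r0,9   r14 = 512 x r0
    r1 = r0 << 7      # SLLG r1,r0,7    r1  = 128 x r0
    r15 = r0 << 5     # SLLG r15,r0,5   r15 =  32 x r0
    r14 -= r1         # SLR  r14,r1     r14 = 384 x r0
    r4 = r0 << 2      # SLLG r4,r0,2    r4  =   4 x r0
    r14 -= r15        # SLR  r14,r15    r14 = 352 x r0
    r1 = r0           # LGR  r1,r0      r1  = r0
    r14 += r4         # ALR  r14,r4     r14 = 356 x r0
    r1 += r14         # ALR  r1,r14     r1  = 357 x r0
    return r1

# Offsets produced for the ten loop iterations (r0 = 1..10).
offsets = [times_357_shift_add(i) for i in range(1, 11)]
```

So 512 - 128 - 32 + 4 + 1 = 357 and the result is indeed r0 x 357: nine instructions and three scratch registers to do what a single MHI does in the ARCH(5) code.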
The equivalent inner loop using ARCH(5), the lowest possible by EPLI V3R9:
0000D8 000119 | @1L2 DS 0H
0000D8 B904 00E0 000119 | LGR r14,r0
0000DC A7EC 0165 000119 | MHI r14,H'357'
0000E0 A70A 0001 000119 | AHI r0,H'1'
0000E4 A70E 000A 000119 | CHI r0,H'10'
0000E8 41EE FE9A 000119 | LA r14,REPT_LINE(r14,r15,3738)
0000EC D2FF E000 1000 000119 | MVC REPT_LINE(256,r14,0),REPT_LINE(r1,0)
0000F2 D264 E100 1100 000119 | MVC REPT_LINE(101,r14,256),REPT_LINE(r1,256)
0000F8 A7D4 FFF0 000119 | JNH @1L2
OK, it contains a multiply, a "slow" instruction, although it can be made pretty fast if you look
at the x86 offerings from AMD & Intel (Sandy Bridge: 64 bit mul in 3 cycles). However, given that
this is a normal non-interleaved array, why do you need a multiplication at all? The V2.3.0 compiler
clearly demonstrated that you don't, and did so almost three decades ago!!!
Again, I just observe, your boss picks up the bill for the CPU cycles used...
If your company is paying thousands of dollars per year to be able to use Enterprise PL/I, don't you
think you are entitled to a compiler that generates the best possible code?
Robert
--
Robert AH Prins
robert(a)prino(d)org