Re: Millicode Instructions
MVCX sounds a bit like MVCOS with R00=0. The PoOP says MVCOS may be significantly slower than MVC, but I would be interested to see a comparison between it and an executed MVC, i.e. for use in short(ish) variable-length moves. Robert Ngan CSC Financial Services Group From: Peurifoy, Richard L r-peuri...@neo.tamu.edu To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Date: 2013/04/17 10:49 Subject: Re: Millicode Instructions Sent by: IBM Mainframe Assembler List ASSEMBLER-LIST@LISTSERV.UGA.EDU Some millicode instructions will outperform their PoOp-code counterparts because millicode has access to hardware features not available to ordinary code. For example, MVCL(E) has the ability to move data under certain conditions without loading it into cache. (You can't do that with looping MVC.) Millicode routines also have access to the MVCX instruction which performs a variable-length MVC -- something ordinary programs cannot do without using the EXecute instruction. MVCX sounds like it would be useful for non-millicode; any idea why it was not externalized? Is there a corresponding CLCX? -- Richard
Re: Millicode Instructions
On 4/17/2013 8:44 AM, Peurifoy, Richard L wrote: Is there a corresponding CLCX? I assume yes, although I know not for sure... -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
Re: Good Performing Code (Was: Millicode Instructions)
I would assume that 1 branch (the first option) is always faster than 2 branches (the second option). The branch prediction in the CPU should figure out which execution path is most likely. That is not a correct assumption, even if the branch prediction table of the CPU were long enough to remember every branch ever taken for the life of the power-on. You would need to know if this code is even executed enough to stay in any sort of table (let alone in cache). I'm not saying which is faster, just that the assumption is incorrect. In this land of pipelines and out-of-order execution (mixed with operating system dispatches and redispatches), hard and fast rules are hard to come by. What is knowable is that the general approach of the machine is to look ahead and to prefer that conditional branches not be taken. And when looking ahead, it knows that an unconditional branch will be taken so it can continue forward. So maybe it will be able to look further forward in the fall-through case. Peter Relson z/OS Core Technology Design
Re: Good Performing Code (Was: Millicode Instructions)
And according to Dr John, BCT/BCTG, BXLE/BXLEG are predicted to always branch. Ciao, -- Raphael Dal-Pos / z/OS Support Generali France Assurances DSIO - DIO - IT Infrastructure Support Saint Denis - Wilo W 03 B1 028 F rdal...@generali.fr +(33)1-58-38-59-67 or mobile +(33)6.24.33.20.87 -- MVS: Guilty, until proven innocent !! RDP 2009
Re: Good Performing Code (Was: Millicode Instructions)
I understand also that unconditional branches are faster than conditional branches. So, which is faster:
    BNZ   LABEL      Branch most frequent
or:
    BZ    *+8        fall through most frequent
    B     LABEL      Unconditional
It might seem naïve, but I would assume that 1 branch (the first option) is always faster than 2 branches (the second option). The branch prediction in the CPU should figure out which execution path is most likely. Fred!
Re: Good Performing Code (Was: Millicode Instructions)
John Ehrman wrote: Performance concerns about individual instructions aren't worth much effort. Things like operand alignment, data and instruction cache retention, locality of reference, branch frequency etc. can have really significant effects. For sure. But for the pathologically curious, if you have z/VM source, look at the main module for EXEC 2 (DMSEXE). It's full of lines like: SLR R0,R8 (DO IT WHILE R3 SETTLES) These must go back to what, 303x? 370 itself? Christopher J. Stephenson (Sir Chris the EXECutor) wrote this code 30+ years ago, when that stuff DID matter. There were giants in those days...! ...phsiii
Re: Good Performing Code (Was: Millicode Instructions)
The two newest processors (z196 and zEC12) do out-of-order processing. Does that mean that we do not need to 'intermingle' instructions because the processor will do it for us? Fred! Sent from my new iPad
Re: Millicode Instructions
Well, I for one don't go along with the obsession with saving a nanosecond here and there. In any case, as someone just pointed out, a compiler can do far more optimization than one can manage by hand, and compiler writers spend large amounts of time determining optimal instruction sequences for certain operations and developing algorithms to compile an optimal solution for each piece of code. Modern software systems like Microsoft .Net can compile at run-time for the current hardware architecture. So much for TRT vs TRTE. I suggest that few active product developers have time anyway to really learn all the endless new z/Architecture instructions, continually compare them, and determine which would be optimal in this or that situation. (But maybe I'm just too lazy nowadays). What rarely (if ever) gets addressed here is good programming practice (in a wider sense than how to load a base register or whatever), something you continually encounter on HLL forums, but rarely, somehow, in assembler. That for me means writing code that works correctly, is understandable, reflects in structure the logic of the problem, and can be easily modified and expanded in scope. Cryptically clever code is generally best avoided, much as it seems to appeal to certain kinds of programmers. Efficiency of individual small code sections is mostly pretty irrelevant unless at the center of a loop which is executed a vast number of times. Not so long ago it was suggested by one of the more august personalities here that I should not use a system macro for its intended purpose but rather some allegedly quicker set of instructions accessing the same data via control block pointers. However, since the code is executed once at start-up of a permanently active STC, the issue of speed was not very relevant. Good practices in my view would also exclude enormous code sections requiring numerous base registers (even if replaced by relative branches).
Our coding standards never gave rise to a need for more than one code base register, although it's all baseless nowadays and uses 64 bit code and the odd ZS3 instruction, even. In fact we recently implemented a pre-loader with the aim of loading different code versions for modern or older machines, but have seen no pressing need to use it yet for that purpose (it has other functions as well). Structured programming as small logical sections is something that can be practiced in assembler too. I am responsible for several products in use around the world in large IBM mainframe computer centers. They are all written in assembler (for various good reasons from the distant past; starting again today might change things, of course). Although we occasionally hear a customer complain that we are using too much CPU, it is generally due to poor use of the products' facilities and not to obvious weaknesses in the code. Speed in a program often depends more on the architecture of the code than on individual instructions. Running serially through long lists or tables to find stuff is a common cause of CPU hotspots. One solution is to use a hash table. Using methods like bubble sort rather than, say, quicksort to sort data in storage makes the programming easy, but the program much, much slower. Of course in pure assembler these kinds of things have to be programmed. (The nicer part of using HLL - in the wider world of Java, C#, C++ etc. anyway - is having large libraries of functions available to do such things). Misuse of system functions can cause issues too; some of our early I/O code caused problems by issuing unnecessary PGSER RELEASE requests, for example. Such things can be determined by suitable tools. Btw, my last reply to one of your posts got caught by the reply-to issue, but I didn't feel a great need to post it again to the list. It wasn't my intention to reply to you personally.
DS -Original Message- From: IBM Mainframe Assembler List [mailto:ASSEMBLER-LIST@LISTSERV.UGA.EDU] On Behalf Of Scott Ford Sent: Wednesday, April 17, 2013 00:55 To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Subject: Re: Millicode Instructions Ed, I want to ask a question: in this day and age, with today's processing power, is it really worth being concerned about Assembler instruction speed? Unless there is some application that is very time sensitive; that I understand. Regards, Scott J Ford Software Engineer http://www.identityforge.com/ From: Ed Jaffe edja...@phoenixsoftware.com To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Sent: Tuesday, April 16, 2013 6:13 PM Subject: Re: Millicode Instructions On 4/16/2013 12:43 PM, Gibney, Dave wrote: I don't get to work at this level often, but I am always interested. How can Millicode be faster than the equivalent using the hardware instructions? As I understand Millicode, that is really all it is (using the hardware instructions) plus any overhead in context switching to the Millicode
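On the point made above about serial searches of long tables versus a hash table, here is a minimal Python sketch that counts the work involved. The table contents, key format and function name are hypothetical and chosen only for illustration; nothing here comes from the products discussed in the thread.

```python
# Hypothetical table of (key, value) entries; contents are illustrative only.
entries = [(f"KEY{i:06d}", i) for i in range(10_000)]

def serial_lookup(key):
    """Scan the table front to back and count the key comparisons made."""
    probes = 0
    for k, v in entries:
        probes += 1
        if k == key:
            return v, probes
    return None, probes

# Build the hash index once; a later lookup is a single hashed probe on average.
index = dict(entries)

value, probes = serial_lookup("KEY009999")
print(value, probes)        # the last entry costs one comparison per table entry
print(index["KEY009999"])   # same value through the hash index
```

The serial scan pays for every entry in front of the one it wants, which is where the CPU hotspot comes from once the table is long and the lookup is frequent.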
Re: Good Performing Code (Was: Millicode Instructions)
On 17 April 2013 07:34, Ed Jaffe edja...@phoenixsoftware.com wrote: On 4/16/2013 3:55 PM, Scott Ford wrote: I want to ask a question, in this day/age and processing power is it really worth being concerned about Assembler instructions speed ? I am not unbiased. My answer is exactly what one would expect from the CTO of a software company that has been authoring far-better-performing code since 1978. Am I proud of slides 67-74 in this SHARE presentation? https://share.confex.com/share/120/webprogram/Handout/Session13319/%28E%29JES%20Update_SHARE%20120.pdf You bet I am! You may! :-) I think that most of the people on the list realize that much of this type of discussion is to hone your skills, understand what challenges your code offers to the machine, and be able to diagnose issues with code fragments where it is relevant. My experience is that those who don't appreciate the low level concepts often don't see the big picture either. Much of what I learn here while lurking provides background information for when I need to address an issue in the code base that I inherited. I recently found the code spending a lot of time in sequentially searching a linked list, comparing each key with an EX of a CLC instruction (which I understand is not a good idea anymore). Since only the search argument was variable length, I could copy it to a fixed length field and do plain CLC instead. While the big improvements are in the algorithms, some understanding of the machine architecture is helpful when thinking about those issues as well. I don't know which part helped most to write the next release that added a lot of new function, reduced the size of code by 30% and reduced CPU usage by a factor of 5-10. While it is true that CPUs have gotten faster, the volume of data we operate on has often increased as well.
And even when algorithms are O(N) the volume of data can still surprise you. My favorite quote from an application developer is: Rob, we know this is not efficient. But it works fine for 100,000 records. Why would it not work for 107 million? (hint, 100K records took less than 2 minutes to run, the nightly batch took 27 hrs). Rob
Re: Good Performing Code (Was: Millicode Instructions)
Long ago G. H. Hardy, one of the great figures of 20th-century mathematics, set out what he took to be the three most important characteristics of successful contributors to any technical field. They are 1) intellectual curiosity, the itch to know how things work, 2) craftsmanship, a commitment to doing the best job one knows how to do, and 3) a desire for recognition, fame, money, the esteem of one's colleagues and the like. Conspicuously absent from this list are preoccupation with rules of thumb and standard practices. Algorithms are indeed important: Linear search is polynomial-time; binary search is logarithmic time. Details are important too, not least because the cumulative effect of getting them wrong can swamp the advantages that the choice of good algorithms confers. Scale is important. Some problems are still inaccessible, others will remain so when computers that operate at the frequencies of hard cosmic rays become available. Taste and experience are important. Anyone who knows a little physics can sit down and make a long list of the things that may affect the path of, say, an artillery shell from muzzle to impact. Some of them indeed need to be considered; but it turns out that the Newtonian model of the parabolic path of a mass point in a gravitational field is usually sovereign. The capacity to clear away intellectual clutter is thus one of the chief marks of high talent, and the role for programmers of low talent is diminishing rapidly. (Yes, this is a species of Programming, like other engineering activities, is characterized, all but defined, by the need to make tradeoffs among conflicting, finally irreconcilable objectives; and few programmers are at all good at this, mostly because 1) no institutional premium has been placed on doing it well and 2) they have been poorly educated to do it. Over time vendor groups and ISVs like EJ's will, I think, very largely replace in-house programming staffs. 
We shall have a situation much like that which prevails in the legal profession today. For scut work, reviewing an employment contract or lease, say, the in-house lawyer has his or her uses. For real trouble and important advice, an outside firm must be turned to. John Gilmore, Ashland, MA 01721 - USA
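To put numbers on the linear versus binary search remark above, the following Python sketch counts key comparisons for both on a sorted table of one million integers. The function names and the table are assumptions made for illustration only.

```python
def linear_probes(sorted_keys, target):
    """Count comparisons made by a front-to-back scan."""
    probes = 0
    for k in sorted_keys:
        probes += 1
        if k == target:
            break
    return probes

def binary_probes(sorted_keys, target):
    """Count comparisons made by a binary search over the same sorted keys."""
    lo, hi, probes = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_keys[mid] == target:
            break
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes

keys = list(range(1_000_000))
print(linear_probes(keys, 999_999), binary_probes(keys, 999_999))
```

For the last key the scan needs a million comparisons, while the binary search needs at most 20, since 2 to the 20th power exceeds one million.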
Re: Good Performing Code (Was: Millicode Instructions)
Your elapsed times in going from 100,000 records to 107 million look like linear scaling. That's the best one can hope for. Working fine means running in less than two minutes. It did work for 107 million records. But it didn't work fine because it took longer than two minutes. I suppose this developer also expects it to take less than two minutes to process 100 billion records. The application developer needs to go to remedial multiplication class. I learned how to multiply in the third grade. Sixty years later I still remember how to multiply. It's also important to know when, why, and what to multiply. Bill Fairchild Franklin, TN
Re: Good Performing Code (Was: Millicode Instructions)
On 17 April 2013 15:14, DASDBILL2 dasdbi...@comcast.net wrote: Your elapsed times in going from 100,000 records to 107 million looks like linear scaling. That's the best one can hope for. Working fine means running in less than two minutes. It did work for 107 million records. But it didn't work fine because it took longer than two minutes. I suppose this developer also expects it to take less than two minutes to process 100 billion records. The application developer needs to go to remedial multiplication class. I learned how to multiply in the third grade. Sixty years later I still remember how to multiply. It's also important to know when, why, and what to multiply. Right. I like the case because it illustrates some of the issues. In this particular case the application went from an intimate interaction between z/OS and DB2 to a remote database on z/Linux. It's not even bad if you make the round trip from the application through TCP/IP to the database, now and then hit the disk, dispatch the virtual machine, and back to the application, and all on average under 1 ms. That's less than 2 minutes for 100K, but 27 hrs for 100M... I think it's not uncommon for people to have trouble absorbing several orders of magnitude. Many mainframe folks have learned to do that multiplication despite intuition. And some of us know how long it takes to copy a 3390-3 and can do the math during the meeting already. I've been involved in migration projects where people claimed it was pretty fast but in reality would not even do 5% of the total migration in 48 hrs. It's the same experience that makes me ask what about backup and D/R, and project managers blame the messenger for being negative... Rob
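The multiplication in question can be checked in a few lines of Python. The record counts and the 27-hour run are the figures quoted in the thread; the variable names are made up here, and the calculation assumes one round trip per record at a constant average cost.

```python
records_small = 100_000
records_large = 107_000_000
batch_seconds = 27 * 3600      # the reported 27-hour nightly batch

# Average cost per record implied by the nightly batch, in milliseconds.
per_record_ms = batch_seconds / records_large * 1000

# What the same per-record cost means for the 100,000-record test run.
small_run_seconds = records_small * per_record_ms / 1000

# For comparison: the large run at exactly 1 ms per record, in hours.
hours_at_1ms = records_large * 0.001 / 3600

print(round(per_record_ms, 3), round(small_run_seconds, 1), round(hours_at_1ms, 1))
```

Under those assumptions the implied cost is a little over 0.9 ms per record, which is about a minute and a half for the small run and more than a day for the large one, with no change in the code at all.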
Re: Good Performing Code (Was: Millicode Instructions)
On 4/17/2013 9:14 AM, DASDBILL2 wrote: I learned how to multiply in the third grade. Sixty years later I still remember how to multiply. It's also important to know when, why, and what to multiply. Simple - you write the two numbers with the larger on the left. In the next row, double the number on the left, and halve the number on the right, discarding any fraction. Upon reaching 1 on the right, cross out any row where the right number is even. Add the remaining rows on the left. Gerhard Postpischil Bradford, Vermont
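The doubling-and-halving method described above can be written out as a short Python sketch. It assumes non-negative integers, and the function name is invented for this illustration.

```python
def peasant_multiply(a, b):
    """Multiply two non-negative integers by doubling and halving."""
    left, right = max(a, b), min(a, b)   # larger number on the left
    total = 0
    while right >= 1:
        if right % 2 == 1:               # keep only rows whose right number is odd
            total += left
        left *= 2                        # double the left column
        right //= 2                      # halve the right column, dropping any fraction
    return total

print(peasant_multiply(81, 13))
```

For 81 and 13 the rows are (81, 13), (162, 6), (324, 3) and (648, 1); the row with 6 on the right is crossed out, and 81 + 324 + 648 gives 1053.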
Re: Good Performing Code (Was: Millicode Instructions)
I tried your algorithm with 13 multiplied by 81 and produced the correct answer. This algorithm is undoubtedly how the microcode for the M (multiply fullword) instruction does its math. There are many paths to the end of one's journey, Grasshopper. Bill Fairchild Franklin, TN “Political language is designed to make lies sound truthful and murder acceptable, and to give the appearance of solidity to pure wind.” [George Orwell]
Re: Millicode Instructions
Some millicode instructions will outperform their PoOp-code counterparts because millicode has access to hardware features not available to ordinary code. For example, MVCL(E) has the ability to move data under certain conditions without loading it into cache. (You can't do that with looping MVC.) Millicode routines also have access to the MVCX instruction which performs a variable-length MVC -- something ordinary programs cannot do without using the EXecute instruction. MVCX sounds like it would be useful for non-millicode; any idea why it was not externalized? Is there a corresponding CLCX? -- Richard
Re: Good Performing Code (Was: Millicode Instructions)
On 2013-04-17, at 09:31, DASDBILL2 wrote: I tried your algorithm with 13 multiplied by 81 and produced the correct answer. This algorithm is undoubtedly how the microcode for the M (multiply fullword) instruction does its math. It has a lot to do with where the 1-bits are in the binary representation of the multiplier, yes. GIYF: "Wallace tree". PDP-6 et al. inspected two bits of the multiplier at each iteration and mixed adds and subtracts to get a 2s complement product without a restoring step. -- gil
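One well-known scheme that fits the description above (two multiplier bits per step, a mix of adds and subtracts, signed result with no correction step) is radix-4 Booth recoding. The Python sketch below illustrates that scheme; whether the PDP-6 or any IBM multiplier worked exactly this way is an assumption, and the function name and the 32-bit default width are chosen only for the example.

```python
def booth_radix4_multiply(multiplicand, multiplier, bits=32):
    """Signed multiply that consumes the multiplier two bits per iteration.

    The multiplier must fit in `bits` two's-complement bits, and `bits`
    must be even. Python integers are unbounded, so the product is exact.
    """
    assert bits % 2 == 0
    assert -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1))
    pattern = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern
    product = 0
    prev = 0                                   # implicit 0 to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digit = b0 + prev - 2 * b1             # recoded digit in {-2, -1, 0, 1, 2}
        # Hardware would add or subtract the multiplicand, possibly shifted
        # left once more; a small multiply stands in for that choice here.
        product += digit * (multiplicand << i)
        prev = b1
    return product

print(booth_radix4_multiply(13, 81), booth_radix4_multiply(-7, 9))
```

Because the topmost recoded digit carries the sign of the multiplier, negative operands come out right without a separate fix-up pass.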
Re: Good Performing Code (Was: Millicode Instructions)
Hey Gil, I assume that type of math with bits is super fast ... I had a friend show me similar techniques using SRL or SLL, but in my old age ... I forgot. Will have to revisit. Scott ford www.identityforge.com from my IPAD 'Infinite wisdom through infinite means'
Re: Good Performing Code (Was: Millicode Instructions)
On 2013-04-17, at 11:26, Scott Ford wrote: I assume that type of math with bits is super fast ... I had a friend show me similar techniques using SRL or SLL, but in my old age ... I forgot. Will have to revisit. I expect it's hardwired in the Multiply instruction. Or, you could do it with a MACRO. But we're already having that discussion. Similar techniques are applicable to SQRT. -- gil
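As an illustration of the SQRT remark, here is a sketch of the classic bit-by-bit integer square root, which uses only shifts, adds, subtracts and compares. It is a standard textbook method offered as an example, not a claim about how any machine implements SQRT, and the function name is hypothetical.

```python
def isqrt_bitwise(n):
    """Floor of the square root of a non-negative integer, two bits per step."""
    if n < 0:
        raise ValueError("n must be non-negative")
    bit = 1
    while (bit << 2) <= n:        # find the largest power of 4 not exceeding n
        bit <<= 2
    result = 0
    while bit:
        if n >= result + bit:     # trial subtraction succeeds: set this result bit
            n -= result + bit
            result = (result >> 1) + bit
        else:                     # trial fails: this result bit stays 0
            result >>= 1
        bit >>= 2
    return result

print(isqrt_bitwise(1053))
```

Like the shift-and-add multiply, each iteration settles one bit of the answer, so a 32-bit operand needs at most 16 passes.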
Re: Good Performing Code (Was: Millicode Instructions)
Performance concerns about individual instructions aren't worth much effort. Things like operand alignment, data and instruction cache retention, locality of reference, branch frequency etc. can have really significant effects. Remember that CPU speeds have increased much faster than memory speeds -- getting an operand from cache can take a cycle or two, but from memory can take hundreds or thousands (try causing a page fault!).
Re: Good Performing Code (Was: Millicode Instructions)
John, For example, what are Assembler no-nos in performance ... I am trying to put you 'on the spot', I am a curious and responsible person, so I would like to know. Best Regards, Scott ford www.identityforge.com from my IPAD 'Infinite wisdom through infinite means'
Re: Good Performing Code (Was: Millicode Instructions)
Trying not to put you on the spot, sorry... Scott ford www.identityforge.com from my IPAD 'Infinite wisdom through infinite means'
Re: Good Performing Code (Was: Millicode Instructions)
On 4/17/2013 12:35 PM, Scott Ford wrote: For example, what are Assembler no nos in performance ...I am trying to put you 'on the spot' , I am curious and responsible person, so I would like to know One example would be having a data area with various fields that are frequently updated by multiple, simultaneous units of work. The cache thrashing will eat your lunch. Instead, spread out the data so that each unit of work has its own cache line to play with. -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
Re: Good Performing Code (Was: Millicode Instructions)
Scott Ford asked: what are Assembler no nos in performance ... Here are some examples (from my session 12522 talk at SHARE in San Francisco):
1. Memory speed is very slow compared to CPU speed -- for example, use immediate operands wherever possible
2. Operand alignment can be very important (doubleword alignment if possible!)
3. Don't mix instructions and data -- keep them far apart
4. Modifying instructions on the fly is performance poison
5. Minimize Address Generation Interlock (you can put other unrelated instructions between these two at little or no cost because the CPU has to wait until the first Load completes before it can execute the second)
      L     1,Pointer
      L     2,0(,1)
6. Arrange branches so the fall through path is most frequent
7. Keep data references close in memory and time
8. Keep instruction references close in memory and time
That's a start, anyway. John Ehrman
Re: Good Performing Code (Was: Millicode Instructions)
On 2013-04-17, at 14:59, John Ehrman wrote:
6. Arrange branches so the fall through path is most frequent

I understand also that unconditional branches are faster than conditional branches. So, which is faster:

         BNZ   LABEL              Branch most frequent

or:

         BZ    *+8                fall through most frequent
         B     LABEL              Unconditional

?

-- gil
Re: Good Performing Code (Was: Millicode Instructions)
Haven't seen a timings table since the early 90's, and rather than show some gusto and code up a test, my official guess is: neither.

-----Original Message-----
From: IBM Mainframe Assembler List [mailto:ASSEMBLER-LIST@LISTSERV.UGA.EDU] On Behalf Of Paul Gilmartin
Sent: Wednesday, April 17, 2013 6:19 PM
To: ASSEMBLER-LIST@LISTSERV.UGA.EDU
Subject: Re: Good Performing Code (Was: Millicode Instructions)

On 2013-04-17, at 14:59, John Ehrman wrote:
6. Arrange branches so the fall through path is most frequent

I understand also that unconditional branches are faster than conditional branches. So, which is faster:

         BNZ   LABEL              Branch most frequent

or:

         BZ    *+8                fall through most frequent
         B     LABEL              Unconditional

?

-- gil
Millicode Instructions
-Original Message- From: IBM Mainframe Assembler List [mailto:ASSEMBLER- l...@listserv.uga.edu] On Behalf Of John Gilmore Sent: Tuesday, April 16, 2013 12:29 PM To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Subject: Re: TRTE and new instructions Peter Farley's points are interesting ones. My numbers tell a very different tale, and I suspect that these differences turn on when such measurements are taken. The first appearances of new instructions, millicoded ones anyway, do often exhibit 'bad' performance; but this performance sometimes, even usually, improves rapidly. Working with millicoded instructions has taught me two important lessons: Their performance is a moving target, and early measurements of it are usually misleading. It often improves significantly in the interval that would be required to replace them with alternative sequences. I don't get to work at this level often, but I am always interested. How can Millicode be faster than the equivalent using the hardware instructions? As I understand Millicode, that is really all it is (using the hardware instructions) plus any overhead in context switching to the Millicode environment. For the MVC/MVCL option, I can imagine a macro which generates an MVC loop, or unroll the loop into a sequence of MVC, or generate the MVCL depending on several criteria. I currently don't have the knowledge to determine the criteria and I would expect the criteria to change over time. John Gilmore, Ashland, MA 01721 - USA
Re: Millicode Instructions
Dave Gibney wrote: begin extract How can Millicode be faster than the equivalent using the hardware instructions? As I understand Millicode, that is really all it is (using the hardware instructions) plus any overhead in context switching to the Millicode environment. /end extract This is a common misunderstanding that has unfortunately been repeated many times. It is a radically misleading caricature. Millicode makes available many facilities not available in the HLASM. It does not make additional machine instructions available, but it does make its own powerful facilities for specifying the path of control among them available. I have always felt some impatience with this view. If it were at all accurate it would make millicode, which goes back to the System/390, unimportant, even dispensable; and, while IBM is not infallible, it is deeply serious about its hardware investments. GIYF. To begin see (watch wrap) http://ecc.marist.edu/conf2011/materials/SlegelSystemZ_APeekUnderTheHood_Slegel_MaristECC.pdf John Gilmore, Ashland, MA 01721 - USA
Re: Millicode Instructions
On 4/16/2013 12:43 PM, Gibney, Dave wrote: I don't get to work at this level often, but I am always interested. How can Millicode be faster than the equivalent using the hardware instructions? As I understand Millicode, that is really all it is (using the hardware instructions) plus any overhead in context switching to the Millicode environment. For the MVC/MVCL option, I can imagine a macro which generates an MVC loop, or unroll the loop into a sequence of MVC, or generate the MVCL depending on several criteria. I currently don't have the knowledge to determine the criteria and I would expect the criteria to change over time Some millicode instructions will outperform their PoOp-code counterparts because millicode has access to hardware features not available to ordinary code. For example, MVCL(E) has the ability to move data under certain conditions without loading it into cache. (You can't do that with looping MVC.) Millicode routines also have access to the MVCX instruction which performs a variable-length MVC -- something ordinary programs cannot do without using the EXecute instruction. Furthermore, a millicode instruction is perceived by the architecture as a single instruction. This allows millicode to do things that cannot be simulated in ordinary code. For example, it would be impossible to write a simulation of the PLO instruction. -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
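For anyone who has not seen it, the EXecute technique Ed refers to usually looks something like the fragment below. It is the conventional problem-state idiom, not a description of how MVCX works inside millicode. The register assignments and labels are arbitrary, and addressability is assumed.

```hlasm
* On entry: R2 = byte count, R4 -> target, R5 -> source.
* Handles counts of 1 to 256 only; anything else is skipped here.
         LTR   2,2                Is the count positive?
         BNP   NOMOVE             No, nothing to move
         CHI   2,256              More than one MVC can handle?
         BH    NOMOVE             Yes, caller needs MVCL or a loop
         BCTR  2,0                Length code is byte count minus 1
         EX    2,MOVEIT           Run the MVC with R2 as its length
NOMOVE   DS    0H
*        ...
* Keep the EX target out of the mainline instruction stream.
MOVEIT   MVC   0(0,4),0(5)        Executed only; EX supplies length
```

EX ORs the low-order byte of R2 into the length field of the MVC before running it, which is why the MVC is coded with a length of zero and why the count is reduced by one first.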
Re: Millicode Instructions
Ed, I want to ask a question, in this day/age and processing power is it really worth being concerned about Assembler instructions speed ? Unless there is some application that is very time sensitive, that I understand Regards, Scott J Ford Software Engineer http://www.identityforge.com/ From: Ed Jaffe edja...@phoenixsoftware.com To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Sent: Tuesday, April 16, 2013 6:13 PM Subject: Re: Millicode Instructions On 4/16/2013 12:43 PM, Gibney, Dave wrote: I don't get to work at this level often, but I am always interested. How can Millicode be faster than the equivalent using the hardware instructions? As I understand Millicode, that is really all it is (using the hardware instructions) plus any overhead in context switching to the Millicode environment. For the MVC/MVCL option, I can imagine a macro which generates an MVC loop, or unroll the loop into a sequence of MVC, or generate the MVCL depending on several criteria. I currently don't have the knowledge to determine the criteria and I would expect the criteria to change over time Some millicode instructions will outperform their PoOp-code counterparts because millicode has access to hardware features not available to ordinary code. For example, MVCL(E) has the ability to move data under certain conditions without loading it into cache. (You can't do that with looping MVC.) Millicode routines also have access to the MVCX instruction which performs a variable-length MVC -- something ordinary programs cannot do without using the EXecute instruction. Furthermore, a millicode instruction is perceived by the architecture as a single instruction. This allows millicode to do things that cannot be simulated in ordinary code. For example, it would be impossible to write a simulation of the PLO instruction. -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
Re: Millicode Instructions
For us, yes. We pay most of our software based on MSU usage. My boss says that one MSU reduction will save us $13,000/yr. Is this huge? To us, yes. We must constantly fight the management belief that Windows is better! Cheaper! faster! If some company could do a conversion with a 1 year ROI, they would go full blast without any other consideration being looked at. On Apr 16, 2013 5:56 PM, Scott Ford scott_j_f...@yahoo.com wrote: Ed, I want to ask a question, in this day/age and processing power is it really worth being concerned about Assembler instructions speed ? Unless there is some application that is very time sensitive, that I understand Regards, Scott J Ford Software Engineer http://www.identityforge.com/ From: Ed Jaffe edja...@phoenixsoftware.com To: ASSEMBLER-LIST@LISTSERV.UGA.EDU Sent: Tuesday, April 16, 2013 6:13 PM Subject: Re: Millicode Instructions On 4/16/2013 12:43 PM, Gibney, Dave wrote: I don't get to work at this level often, but I am always interested. How can Millicode be faster than the equivalent using the hardware instructions? As I understand Millicode, that is really all it is (using the hardware instructions) plus any overhead in context switching to the Millicode environment. For the MVC/MVCL option, I can imagine a macro which generates an MVC loop, or unroll the loop into a sequence of MVC, or generate the MVCL depending on several criteria. I currently don't have the knowledge to determine the criteria and I would expect the criteria to change over time Some millicode instructions will outperform their PoOp-code counterparts because millicode has access to hardware features not available to ordinary code. For example, MVCL(E) has the ability to move data under certain conditions without loading it into cache. (You can't do that with looping MVC.) Millicode routines also have access to the MVCX instruction which performs a variable-length MVC -- something ordinary programs cannot do without using the EXecute instruction. 
Furthermore, a millicode instruction is perceived by the architecture as a single instruction. This allows millicode to do things that cannot be simulated in ordinary code. For example, it would be impossible to write a simulation of the PLO instruction. -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
Good Performing Code (Was: Millicode Instructions)
On 4/16/2013 3:55 PM, Scott Ford wrote: I want to ask a question, in this day/age and processing power is it really worth being concerned about Assembler instructions speed ? I am not unbiased. My answer is exactly what one would expect from the CTO of a software company that has been authoring far-better-performing code since 1978. Am I proud of slides 67-74 in this SHARE presentation? https://share.confex.com/share/120/webprogram/Handout/Session13319/%28E%29JES%20Update_SHARE%20120.pdf You bet I am! -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/