Re: Why sort (was Microprocessor Optimization Primer)
David, I haven't seen benchmarks yet, just a bunch of wall-time comparisons. We have terabytes of memory on our z13 that hasn't been deployed yet. I'm cynical, but our guys will have to start running interference.

On Thu, Apr 7, 2016 at 9:46 PM, David Crayford wrote:
> On 7/04/2016 6:59 PM, Wayne Bickerdike wrote:
>> Of greater concern is the implication that Oracle on AIX outperforms DB2
>> on z/OS at our shop. Surely not :(
>
> Do you have real workload benchmarks that prove it, Wayne?
Re: Why sort (was Microprocessor Optimization Primer)
An excellent synopsis of mainframe history. It follows that most mature shops use SORT extensively because, until recently, the platform pretty much required it for reasonable performance as measured by wall clock. One could argue--maybe even prove--that today's DASD allows more random updating than in the days of yore, but a mature shop that has orchestrated batch around sorting would find it a hard sell to convince business units (i.e. paying customers) to reengineer massive production processes just because it's possible.

We explored TVS (Transactional VSAM) in ESP some years ago. As wonderful as it sounded--and probably was--the target applications folks balked at having to redesign their update programs because the processing logic is totally different. Unfortunately, I think they moved off of the mainframe instead. ;-((

J.O.Skip Robinson
Southern California Edison Company
Electric Dragon Team Paddler
SHARE MVS Program Co-Manager
323-715-0595 Mobile
626-302-7535 Office
robin...@sce.com

-----Original Message-----
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Joel C. Ewing
Sent: Wednesday, April 06, 2016 9:59 PM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: (External):Re: Why sort (was Microprocessor Optimization Primer)
[snip]
Re: Why sort (was Microprocessor Optimization Primer)
...hey Wayne.

Mitch Mccluhan
mitc...@aol.com

On Thursday, April 7, 2016 Wayne Bickerdike wrote:
> I'm slightly gobsmacked that this discussion is needed. I guess the forest
> is lost in the trees.
>
> I can recommend "Principles of Program Design" by Michael Jackson, c. 1975.
>
> Of greater concern is the implication that Oracle on AIX outperforms DB2 on
> z/OS at our shop. Surely not :(
[snip]
Re: Why sort (was Microprocessor Optimization Primer)
On 7/04/2016 7:56 PM, John McKown wrote:
> On Thu, Apr 7, 2016 at 5:59 AM, Wayne Bickerdike wrote:
>> Of greater concern is the implication that Oracle on AIX outperforms DB2 on
>> z/OS at our shop. Surely not :(
>
> Without knowing the hardware & software setups, I can believe this. Why?
> Because our distributed systems people "proved" that a Sun running Solaris
> was faster than z/Linux on a z. Of course, they were comparing a _dedicated_
> Sun server (don't know the exact model) with 10 CPs and 10 GB of memory to
> z/Linux running on a z890 with 2 GB of memory and a single IFL. Kind of like
> proving that a Chevy is better than a Mazda by comparing a Corvette's
> performance on a race track to my Mazda 3 on the same race track.

So that Sun setup is basically what you get with a $5000 x86 blade!

--
Wayne V. Bickerdike

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Why sort (was Microprocessor Optimization Primer)
On Thu, Apr 7, 2016 at 5:59 AM, Wayne Bickerdike wrote:
> I'm slightly gobsmacked that this discussion is needed. I guess the forest
> is lost in the trees.
>
> I can recommend "Principles of Program Design" by Michael Jackson, c. 1975.
>
> Of greater concern is the implication that Oracle on AIX outperforms DB2 on
> z/OS at our shop. Surely not :(

Without knowing the hardware & software setups, I can believe this. Why? Because our distributed systems people "proved" that a Sun running Solaris was faster than z/Linux on a z. Of course, they were comparing a _dedicated_ Sun server (don't know the exact model) with 10 CPs and 10 GB of memory to z/Linux running on a z890 with 2 GB of memory and a single IFL. Kind of like proving that a Chevy is better than a Mazda by comparing a Corvette's performance on a race track to my Mazda 3 on the same race track.

--
How many surrealists does it take to screw in a lightbulb? One to hold the giraffe and one to fill the bathtub with brightly colored power tools.

Maranatha! <><
John McKown
Re: Why sort (was Microprocessor Optimization Primer)
On 7/04/2016 6:59 PM, Wayne Bickerdike wrote:
> I'm slightly gobsmacked that this discussion is needed. I guess the forest
> is lost in the trees.
>
> I can recommend "Principles of Program Design" by Michael Jackson, c. 1975.
>
> Of greater concern is the implication that Oracle on AIX outperforms DB2 on
> z/OS at our shop. Surely not :(

Do you have real workload benchmarks that prove it, Wayne?
Re: Why sort (was Microprocessor Optimization Primer)
I'm slightly gobsmacked that this discussion is needed. I guess the forest is lost in the trees.

I can recommend "Principles of Program Design" by Michael Jackson, c. 1975.

Of greater concern is the implication that Oracle on AIX outperforms DB2 on z/OS at our shop. Surely not :(

On Thu, Apr 7, 2016 at 2:59 PM, Joel C. Ewing wrote:
> I believe others have already alluded to the potential time advantage of
> processing a large number of updates in key order rather than randomly
> when external data is indexed but actually physically ordered by some key.
[snip]
Re: Why sort (was Microprocessor Optimization Primer)
On 04/06/2016 07:01 AM, Andrew Rowley wrote:
> On 05/04/2016 01:20 AM, Tom Marchant wrote:
>> On Mon, 4 Apr 2016 16:45:37 +1000, Andrew Rowley wrote:
>>
>>> A Hashmap potentially allows you to read sequentially and match records
>>> between files, without caring about the order.
>>
>> Can you please explain what you mean by this? Are you talking about using
>> the hashmap to determine which record to read next, and so to read the
>> records in an order that is logically sequential, but physically random?
>> If so, that is not at all like reading the records sequentially.
>
> If one file fits in memory, you can read it sequentially into a Hashmap
> using the data you want to match as the key. Then read the second one,
> also sequentially, retrieving matching records from the Hashmap by key.
> You can also remove them from the Hashmap as they are found if you need
> to know if any are unmatched.
>
> But this is a solution for a made up case - I don't know whether it is a
> common situation. I was interested in hearing real reasons why sort is
> so common on z/OS i.e. Why sort?
>
> On Hashmaps etc. in general - they are the memory equivalent to indexed
> datasets (VSAM etc) versus sequential datasets. Their availability opens
> up many new ways to process data - and algorithm changes are often where
> the big savings can be made.

I believe others have already alluded to the potential time advantage of processing a large number of updates in key order rather than randomly when external data is indexed but actually physically ordered by some key. The reason why this has historically been the case is that external disk storage devices which allow random access have rotational-latency delay and access-head-positioning delay, which are minimized when doing full-track or even multi-track I/O and when accessing adjacent cylinders.
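Andrew's two-pass hashmap match can be sketched in a few lines. (Python here for brevity, with a dict standing in for a Java HashMap; the record layout and sample data are invented for illustration.)

```python
# Match records between two files without sorting either one:
# load the smaller file into a hash map keyed on the match field,
# then stream the larger file against it. "key,data" records assumed.

def hash_match(small_lines, large_lines):
    # Pass 1: read the smaller file sequentially into a map.
    lookup = {}
    for line in small_lines:
        key, _, rest = line.partition(",")
        lookup[key] = rest
    # Pass 2: read the larger file sequentially, matching by key.
    matched = []
    for line in large_lines:
        key, _, rest = line.partition(",")
        if key in lookup:
            # pop() removes the entry, so leftovers are the unmatched set
            matched.append((key, lookup.pop(key), rest))
    return matched, lookup  # lookup now holds only unmatched records

master = ["A01,apple", "B02,banana", "C03,cherry"]
trans = ["B02,sale", "A01,refund"]
matched, unmatched = hash_match(master, trans)
print(matched)    # [('B02', 'banana', 'sale'), ('A01', 'apple', 'refund')]
print(unmatched)  # {'C03': 'cherry'}
```

Note that neither input is in key order; the map absorbs the ordering problem, at the cost of holding one file in memory.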
The way to update the data in minimal real time is to do the I/O in minimal disk rotations, accessing all data needed on the same track in one rotation and all data in one cylinder before moving to an adjacent cylinder. Crucial to this concept is understanding that z/OS includes support within its I/O access methods which allows applications to exploit the ability of DASD hardware to transfer one, several, or all data blocks on a track as a single operation within a single disk revolution.

With emulated DASD and hardware DASD caching, the effects of physical track and cylinder boundaries may be unknown, but it is still likely that minimizing repeated visits to an emulated track or an emulated cylinder will achieve similar locality of reference on physical DASD, reduce latency delays, and improve the effectiveness of hardware caching. Processing transaction records in the same order as the database records are physically stored on an external file gives the best odds of grouping together transactions needing the same track and cylinder, minimizing I/O delays, and minimizing demands on DASD cache storage and on processor storage for file buffers. Processing transactions in a different order increases the likelihood that the file data needed to process a transaction is no longer in processor memory or disk cache, and that at a minimum the time equivalent of another disk revolution will be required to obtain it.

It was not uncommon with VSAM files for transaction sorting to improve real-time processing speed sufficiently that the break-even point, even with sorting overhead, could be as low as updating only 5% of the database. These techniques were common in MVS and its z/OS successor applications because it was common for those systems to deal with very large files and databases, where tricks like this were necessary in order to meet constrained nightly batch processing windows.
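The locality argument above can be made concrete with a toy model: if records live ten to an (emulated) track in key order, applying updates in key order visits each track at most once, while random order forces a track change on almost every update. (The track size and key range here are arbitrary, chosen only to illustrate the effect.)

```python
import random

RECORDS_PER_TRACK = 10  # toy figure; real geometry varies by device

def track_switches(keys):
    # Count how many times processing would have to move to a different
    # track, assuming records are stored in key order, ten per track.
    switches, current = 0, None
    for k in keys:
        track = k // RECORDS_PER_TRACK
        if track != current:
            switches += 1
            current = track
    return switches

random.seed(1)
updates = random.sample(range(1000), 300)  # update 30% of a 1000-record file

print(track_switches(sorted(updates)))  # at most 100: each track visited once
print(track_switches(updates))          # typically close to 300
```

The sorted run is bounded by the number of tracks touched; the unsorted run approaches one revisit per update, which is exactly the "another disk revolution" penalty described above.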
Since it is common in z/OS to be dealing with very large files and databases, there are always files in those environments that are too large to consider placing the entire file in memory, no matter how large processor memory becomes.

Hash maps are not really equivalent to VSAM data sets, because a VSAM file is not just indexed but indexed-sequential, which means once you have successfully stored records in the file, reading the records in key order from a VSAM file is just a trivial sequential read. A hash map makes it trivial to find a record with a given key, but if you also need to access the records in key order, a sort of the keys is still required. I have applications that have used hash tables in exactly that way, doing a tag-sort of the keys after the fact to allow ordered access, but that is not a feature inherent in hash-mapped records like it is with a VSAM data set. While, as you point out, it is possible to process a transaction file against a database file without either being sorted, by reading records from one file (presumably the smaller one) into a hash map memory table and then processing the other file and searching the hash table for records with
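Joel's tag-sort point -- a hash map gives fast lookup but no key order, so ordered access needs a separate sort over the keys -- looks like this in practice (Python dict standing in for a HashMap, invented data):

```python
# A hash map makes lookup by key trivial, but the keys come back in no
# useful key order; to read records in key sequence you must first
# "tag-sort" the keys, then walk the map in that order. A VSAM KSDS
# gives you the ordered walk for free; a hash map does not.
records = {"C03": "cherry", "A01": "apple", "B02": "banana"}

print(records["B02"])  # direct lookup, no order needed: banana

for key in sorted(records):    # the tag sort: order the keys, not the records
    print(key, records[key])   # A01 apple / B02 banana / C03 cherry
```

The sort touches only the keys (tags), not the records themselves, which is what keeps this cheap when records are large.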
Re: Why sort (was Microprocessor Optimization Primer)
On Wed, Apr 6, 2016 at 7:01 AM, Andrew Rowley wrote: > On 05/04/2016 01:20 AM, Tom Marchant wrote: > >> On Mon, 4 Apr 2016 16:45:37 +1000, Andrew Rowley wrote: >> >> A Hashmap potentially allows you to read sequentially and match records >>> between files, without caring about the order. >>> >> Can you please explain what you mean by this? Are you talking about using >> the hashmap to determine which record to read next, and so to read the >> records in an order that is logically sequential, but physically random? >> If so, >> that is not at all like reading the records sequentially. >> >> > If one file fits in memory, you can read it sequentially into a Hashmap > using the data you want to match as the key. > Then read the second one, also sequentially, retrieving matching records > from the Hashmap by key. You can also remove them from the Hashmap as they > are found if you need to know if any are unmatched. > > But this is a solution for a made up case - I don't know whether it is a > common situation. I was interested in hearing real reasons why sort is so > common on z/OS i.e. Why sort? > Not meaning to sound silly, but I fear the main reason may be the good old: "We've always done it that way". And, since most of the in-house software written on z/OS is in some version of COBOL, there is no other real choice because COBOL does not have anything like a content-addressable "array" built into the language. IMO, a major deficiency in IBM's COBOL, and maybe other vendors' COBOLs, is that it does not come with a great library of functionality. It is simple to do things in Java, Perl, PHP, Python, and Go because of the huge amount of support in the libraries. COBOL basically has the barest of native data types. And basically only has integer indexed arrays and structures as ways to "group" things together. Also, COBOL has pretty much the barest of run time routines. And the only invocation of anything in a library is via the CALL verb.
I guess that it's sad that the object-oriented portion of the latest COBOL compilers seems to be ignored. So, why not migrate away from COBOL to a more advanced language? Many places are doing so for new work or development (or going to a non-z platform). Also, do you really need to buffer up everything in a Hashmap if your data resides in a relational database? It is generally much better to let the RDBMS do most of the work. And it will buffer up the active data, not only from your program but every program which is accessing the data. In this case, doing a SORT could possibly be unnecessary. Or you may need to do a SORT if you are writing a report sorted by a value created in the program itself. Do you really want to use a Hashmap to store the unsorted electricity bills for Los Angeles, and then, at the end, read & write said bills by reading the Hashmap by key? This sort of thing goes on a _lot_ on z/OS. Just my take on it. I'm not against using something other than SORT if I think it will work well. But SORT (DFSORT & Syncsort) are extremely fast and efficient. So if I need something done which they can do, then I think it is best to use them rather than code something up myself, in any language. > > On Hashmaps etc. in general - they are the memory equivalent to indexed > datasets (VSAM etc) versus sequential datasets. Their availability opens up > many new ways to process data - and algorithm changes are often where the > big savings can be made. > > -- How many surrealists does it take to screw in a lightbulb? One to hold the giraffe and one to fill the bathtub with brightly colored power tools. Maranatha! <>< John McKown -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Why sort (was Microprocessor Optimization Primer)
On 05/04/2016 01:20 AM, Tom Marchant wrote: On Mon, 4 Apr 2016 16:45:37 +1000, Andrew Rowley wrote: A Hashmap potentially allows you to read sequentially and match records between files, without caring about the order. Can you please explain what you mean by this? Are you talking about using the hashmap to determine which record to read next, and so to read the records in an order that is logically sequential, but physically random? If so, that is not at all like reading the records sequentially. If one file fits in memory, you can read it sequentially into a Hashmap using the data you want to match as the key. Then read the second one, also sequentially, retrieving matching records from the Hashmap by key. You can also remove them from the Hashmap as they are found if you need to know if any are unmatched. But this is a solution for a made up case - I don't know whether it is a common situation. I was interested in hearing real reasons why sort is so common on z/OS i.e. Why sort? On Hashmaps etc. in general - they are the memory equivalent to indexed datasets (VSAM etc) versus sequential datasets. Their availability opens up many new ways to process data - and algorithm changes are often where the big savings can be made.
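The two-file match Andrew describes can be sketched in a few lines of Java. This is a minimal illustration, not production code: the "files" are simulated with lists of {key, payload} pairs, and the record layout is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the HashMap match: load the smaller file into a map keyed on
// the match field, then stream the second file and probe by key.
public class HashJoin {
    static List<String> match(List<String[]> small, List<String[]> large) {
        // Pass 1: read the smaller "file" sequentially into a HashMap.
        Map<String, String> byKey = new HashMap<>();
        for (String[] rec : small) byKey.put(rec[0], rec[1]);

        // Pass 2: read the larger "file" sequentially, probing by key.
        List<String> matched = new ArrayList<>();
        for (String[] rec : large) {
            String other = byKey.remove(rec[0]); // remove() so leftovers = unmatched
            if (other != null) matched.add(rec[0] + ":" + other + "+" + rec[1]);
        }
        // Anything still in byKey had no partner in the large file.
        return matched;
    }

    public static void main(String[] args) {
        List<String[]> small = Arrays.asList(
            new String[]{"K1", "left1"}, new String[]{"K2", "left2"});
        List<String[]> large = Arrays.asList(
            new String[]{"K2", "right2"}, new String[]{"K9", "right9"});
        System.out.println(match(small, large)); // only K2 has a partner
    }
}
```

Neither input needs to be in any particular order; removing matched entries also gives the unmatched set for free, as described above.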
Re: Why sort (was Microprocessor Optimization Primer)
On Mon, 4 Apr 2016 16:45:37 +1000, Andrew Rowley wrote: >A Hashmap potentially allows you to read sequentially and match records >between files, without caring about the order. Can you please explain what you mean by this? Are you talking about using the hashmap to determine which record to read next, and so to read the records in an order that is logically sequential, but physically random? If so, that is not at all like reading the records sequentially. -- Tom Marchant
Re: Why sort (was Microprocessor Optimization Primer)
On 4/04/2016 11:25, David Betten wrote: First the idea of loading all the data into a large hashmap to do the sort tends to eliminate one very important thing and that's overlap. Essentially, you read the entire input, conduct your massive hashsort, and then write the output with no overlap of those three phases. The approach I prefer is an iterative process of sorting smaller amounts and writing them to work files (either on disk or in memory) and then at end of input, you almost immediately begin the output process of merging those sorted strings. This technique is very efficient and I can tell you many z/OS customers are sorting tens to hundreds of gigabytes of data this way. I wasn't actually suggesting sorting using a Hashmap, or that Java sort was more efficient than DFSORT (although the overhead of transferring data between Java<->DFSORT might make Java sort preferable when the data is already in Java). I was more wondering whether collection classes like Hashmap could avoid the need to sort the data altogether, at which point the efficiency becomes moot. One common example given for sorting of data is to do grouping and totals, which can easily be implemented using a Hashmap with unordered data. Second point I'd like to make also is related to overlap. Sorting the files allows downstream process to read them sequentially rather than random gets from say VSAM or a data base. When you read or write sequentially, you have opportunities for I/O overlap along with blocking and chaining. So you can be reading the next set of data while your program is processing the previous set of data. This results in considerable elapsed time savings and reduction in I/O overhead since more data is transferred with each I/O. This is more what I had in mind - other reasons for sorting data before processing. I can see that VSAM would benefit from reading in order. 
I'm not so sure that a database like DB2 stores data in order - DB2 might be fastest if you don't specify a sort order and just take it as it comes from the database. There's also the question of whether you save enough CPU and I/O to make up for the cost of the sort. A Hashmap potentially allows you to read sequentially and match records between files, without caring about the order. This doesn't really relate to the work I am doing. It was just speculation about whether Java etc. on z/OS provided opportunity to reduce CPU by implementing better algorithms, prompted by the comment about the amount of batch DFSORT people run. -- Andrew Rowley Black Hill Software +61 413 302 386
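The grouping-and-totals case mentioned above can be sketched with `Map.merge()`, which accumulates per-key totals regardless of input order, so no sort is needed at all. Keys and amounts here are invented for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: compute totals per group from unordered records.
// A traditional batch approach would sort by key, then use control breaks;
// a HashMap makes the order irrelevant.
public class GroupTotals {
    static Map<String, Long> totals(List<Map.Entry<String, Long>> records) {
        Map<String, Long> sums = new HashMap<>();
        for (Map.Entry<String, Long> r : records)
            sums.merge(r.getKey(), r.getValue(), Long::sum); // add to running total
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> recs = Arrays.asList(
            Map.entry("EAST", 10L), Map.entry("WEST", 5L), Map.entry("EAST", 7L));
        System.out.println(totals(recs)); // EAST totals 17, WEST totals 5
    }
}
```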
Re: Why sort (was Microprocessor Optimization Primer)
On 2016-04-03, at 19:25, David Betten wrote: > First of all, full disclaimer that I was in DFSORT development for about 8 > years so I might be biased. But I just want to share a few thoughts. > > First the idea of loading all the data into a large hashmap to do the sort > tends to eliminate one very important thing and that's overlap. > Essentially, you read the entire input, conduct your massive hashsort, and > then write the output with no overlap of those three phases. ... > Strawman. Or red herring. Or some metaphor. You seem to have deliberately made an adverse choice so you can refute it. Rather than hash, use a B-tree so sorting fully overlaps input. One might argue that given sufficient page data space any sort could be performed in virtual storage. I suspect performance would be suboptimal. I suspect that for a large enough data set Cooley-Tukey FFT brutally defies LoR. But some of the operations in C-T are hauntingly similar to a balanced merge. Might sorting techniques with workfiles implement a C-T that outperforms a virtual storage implementation? -- gil
Re: Why sort (was Microprocessor Optimization Primer)
On 4/3/2016 6:21 PM, Ed Jaffe wrote: DFSORT, Syncsort, etc. use the CPC/UPT hardware instructions to implement the fastest sort on the platform. Typo. Of course, I meant to write CFC/UPT... :-[ -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
Re: Why sort (was Microprocessor Optimization Primer)
On Apr 3, 2016 20:53, "David Crayford" wrote: > > On 4/04/2016 7:41 AM, John McKown wrote: >> >> >> I'm not an application programmer. But I can just imagine the looks of >> astonishment and the "talk", if I were to write a COBOL program which does >> a SORT verb with INPUT PROCEDURE IS and OUTPUT PROCEDURE IS which only did >> a SORT FIELDS=COPY operation. Even more astonishment if I coded the INCLUDE >> or EXCLUDE to subset my data in addition to, or instead of, using COBOL >> code. I don't know if such coding would pass the majority of the "peer >> review" type processes. I'd love to try. Especially if I were smart enough >> to do so initially and keep the output listing. Then allow code review to >> force me to use normal COBOL methods. And then show the differences, >> assuming the SORT method actually is superior. Of course, I'd better know >> my management. I was at one shop (sysprog) where my boss (sysprog + >> manager) did that with a major application that would max the 3083 (long >> ago). Basically he proved it was due to a flawed design. Unfortunately, that >> cost him his job because the design was actually done by the head of >> the company (software development company). >> > > I'm sure the application folks would think you're a crazy, performance-obsessed systems programmer and should go back to your cave! And they'd be right! And they do, sometimes. But, my management would adore it __IF IT COULD BE DONE RELIABLY BY THE REGULAR PROGRAMMERS__. Why? Because more than __anything__ else at present, they want to decrease the cost of I.T. (They consider it a "money pit" and seem to emotionally consider it to be an "unnecessary" expense which is not really related to the core business). So if a technique, if consistently applied, would allow them to reduce the MSU cap, thus reducing our software bill, they want it to be done. I was typing more, but really got way too sarcastic. > FileManager was developed at the IBM APC labs in Perth.
I worked with one of the lead developers on that product and they try to utilize DFSORT as much as possible. > There must be significant man years of work optimizing the I/O in DFSORT. It's sensible to try and leverage that. In the case of Andrew's I/O-bound product he could possibly > significantly accelerate the throughput if he could somehow hook into sort. Is it a big deal that DFSORT doesn't run on a zIIP when most of the workload is I/O bound? > > http://www.ibm.com/support/knowledgecenter/SSXJAV_13.1.0/com.ibm.filemanager.doc_13.1/base/funtips.htm > > LOL! IBM had to write a FASTREXX subset because standard REXX was a dog! >
Re: Why sort (was Microprocessor Optimization Primer)
On 4/04/2016 7:41 AM, John McKown wrote: On Sun, Apr 3, 2016 at 6:00 PM, Andrew Rowley wrote: On 3/04/2016 22:43, David Crayford wrote: Good question! Sort can be utilised for other purposes than sorting, it can be used as an I/O engine. DFSORT (or Syncsort) implements bespoke highly optimized I/O using low-level programming interfaces such as chained EXCPs which are significantly faster than using standard access methods like QSAM or BSAM, including overlapping BSAM I/O. DFSORT has exit routines (callbacks) which get called for each record. Basically it's supercharged I/O. One of our products does just that as do many others. IIRC, IBM FileManager uses sort for I/O. The trouble with using this technique with Java is the JNI/callback overhead. I'm aware of the efficient I/O, but I'm more interested in the use to put data into a particular order. My own programs I never sort input data, frequently sort small subsets of data during processing (likely always too small quantities for something like DFSORT) and almost always sort for presentation. Presentation is hopefully also too small quantities for DFSORT. It is an interesting idea though to use it to read data via the exits without actually giving it back to DFSORT to process. I'm not an application programmer. But I can just imagine the looks of astonishment and the "talk", if I were to write a COBOL program which does a SORT verb with INPUT PROCEDURE IS and OUTPUT PROCEDURE IS which only did a SORT FIELDS=COPY operation. Even more astonishment if I coded the INCLUDE or EXCLUDE to subset my data in addition to, or instead of, using COBOL code. I don't know if such coding would pass the majority of the "peer review" type processes. I'd love to try. Especially if I were smart enough to do so initially and keep the output listing. Then allow code review to force me to use normal COBOL methods. And then show the differences, assuming the SORT method actually is superior. Of course, I'd better know my management.
I was at one shop (sysprog) where my boss (sysprog + manager) did that with a major application that would max the 3083 (long ago). Basically he proved it was due to a flawed design. Unfortunately, that cost him his job because the design was actually done by the head of the company (software development company). I'm sure the application folks would think you're a crazy, performance-obsessed systems programmer and should go back to your cave! FileManager was developed at the IBM APC labs in Perth. I worked with one of the lead developers on that product and they try to utilize DFSORT as much as possible. There must be significant man years of work optimizing the I/O in DFSORT. It's sensible to try and leverage that. In the case of Andrew's I/O-bound product he could possibly significantly accelerate the throughput if he could somehow hook into sort. Is it a big deal that DFSORT doesn't run on a zIIP when most of the workload is I/O bound? http://www.ibm.com/support/knowledgecenter/SSXJAV_13.1.0/com.ibm.filemanager.doc_13.1/base/funtips.htm LOL! IBM had to write a FASTREXX subset because standard REXX was a dog! -- Andrew Rowley Black Hill Software +61 413 302 386
Re: Why sort (was Microprocessor Optimization Primer)
First of all, full disclaimer that I was in DFSORT development for about 8 years so I might be biased. But I just want to share a few thoughts. First the idea of loading all the data into a large hashmap to do the sort tends to eliminate one very important thing and that's overlap. Essentially, you read the entire input, conduct your massive hashsort, and then write the output with no overlap of those three phases. The approach I prefer is an iterative process of sorting smaller amounts and writing them to work files (either on disk or in memory) and then at end of input, you almost immediately begin the output process of merging those sorted strings. This technique is very efficient and I can tell you many z/OS customers are sorting tens to hundreds of gigabytes of data this way. Second point I'd like to make also is related to overlap. Sorting the files allows downstream process to read them sequentially rather than random gets from say VSAM or a data base. When you read or write sequentially, you have opportunities for I/O overlap along with blocking and chaining. So you can be reading the next set of data while your program is processing the previous set of data. This results in considerable elapsed time savings and reduction in I/O overhead since more data is transferred with each I/O. And that's just my 2 cents! Have a nice day, Dave Betten z/OS Performance Specialist Cloud and Systems Performance IBM Corporation email: bet...@us.ibm.com IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU> wrote on 04/03/2016 07:28:39 PM: > From: Andrew Rowley <and...@blackhillsoftware.com> > To: IBM-MAIN@LISTSERV.UA.EDU > Date: 04/03/2016 07:32 PM > Subject: Re: Why sort (was Microprocessor Optimization Primer) > Sent by: IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU> > > The reason I like Java on Z so much is I got used to using Hashtable in > C#, then tried to use Rexx stems to do the same thing. 
(It was semi > successful but I always felt like it was very fragile due to the > potential for unexpected values etc. for the stems.) Then I found Java > had real hash tables. They make so many different problems so much easier. > > A million 1500 byte entries should be about 1.5 GB I think, and I would > expect a hashmap to handle it without difficulty as long as the real > storage was available. But typically a hashtable would hold an object > with the specific items you're interested in rather than the whole 1500 > byte item. > > As for sorting a List of a million 1500 byte items - again I would > expect Java to do this without difficulty as long as real storage is > available. Java is actually pretty efficient at this because you're > actually sorting a list of pointers - you go all over memory to do the > compares, but should be only shuffling 8MB of data in storage if you > have a million 64 bit pointers. I regularly test EasySMF (written in C#) > displaying lists of 1,000,000+ items on the PC. It has column click > sorting, and it copes just fine with 1,000,000+ lists. Sorting a column > takes a few seconds at most on a not particularly fast PC. > > DFSORT seems to be most useful where you need to sort more data than can > be processed in storage - but I'm wondering how often that really needs > to be done. I'm not so interested in utilities and databases calling it > under the covers - more in applications that require records in a > particular order. Nor am I saying that's wrong - I'm really just asking > whether languages like Java provide opportunities to eliminate some sorting. > > On 3/04/2016 22:36, John McKown wrote: > > Sure, but how often do you have a Java HashMap which contains, say, a > > million entries? Oh, and the entries are not something like an "int", but > > more like a C struct where the size of each struct is around 1500 bytes. > > That would require about 1.5 Terabytes of memory. 
Not many systems have > > that much to give you for a single "object". And yes, we _do_ sort such > > monsters. Not often, granted, but we're doing a conversion right now and > > the programmer is doing work on claims which go back 10 years! That's a > > _lot_ of data! And, we don't have _any_ data bases, just VSAM and > > sequential data sets. I've actually used VSAM to do "sorting", by inserting > > records randomly, then reading them back in keyed order. The performance > > was horrible. DB2, or other database system, could be used in such a > > manner to avoid sorting. But I'd bet it would also be horrible. Of course, > > if you're reading an already existing VSAM keyed file, or a database, then > > you're golden. I'd bet most of the data in the non-z/OS world is kept in > > such a manner, as opposed to a regular "file". >
Re: Why sort (was Microprocessor Optimization Primer)
On 4/3/2016 4:28 PM, Andrew Rowley wrote: DFSORT seems to be most useful where you need to sort more data than can be processed in storage - but I'm wondering how often that really needs to be done. I'm not so interested in utilities and databases calling it under the covers - more in applications that require records in a particular order. Nor am I saying that's wrong - I'm really just asking whether languages like Java provide opportunities to eliminate some sorting. DFSORT, Syncsort, etc. use the CPC/UPT hardware instructions to implement the fastest sort on the platform. Are there java methods that also do this? Or do they use relatively inefficient software-based algorithms? -- Edward E Jaffe Phoenix Software International, Inc 831 Parkview Drive North El Segundo, CA 90245 http://www.phoenixsoftware.com/
Re: Why sort (was Microprocessor Optimization Primer)
On Sun, Apr 3, 2016 at 6:28 PM, Andrew Rowley wrote: > The reason I like Java on Z so much is I got used to using Hashtable in > C#, then tried to use Rexx stems to do the same thing. (It was semi > successful but I always felt like it was very fragile due to the potential > for unexpected values etc. for the stems.) Then I found Java had real hash > tables. They make so many different problems so much easier. > > A million 1500 byte entries should be about 1.5 GB I think, and I would > expect a hashmap to handle it without difficulty as long as the real > storage was available. But typically a hashtable would hold an object with > the specific items you're interested in rather than the whole 1500 byte > item. > Yeah, my arithmetic is really bad. > As for sorting a List of a million 1500 byte items - again I would expect > Java to do this without difficulty as long as real storage is available. > Java is actually pretty efficient at this because you're actually sorting a > list of pointers - you go all over memory to do the compares Hmm, just a concern of mine (it may be obsolete) would be the working set in memory of doing that. z/OS, even on our small shop, is probably running 8 other batch jobs, 5 TSO users (we're small), and 7 CICS regions. I'd worry about sizing the real memory on the LPAR if all 8 jobs were "going all over memory". But, again, I have a very small z9BC system, so I worry about things that the big boys would sneer at. > , but should be only shuffling 8MB of data in storage if you have a > million 64 bit pointers. I regularly test EasySMF (written in C#) > displaying lists of 1,000,000+ items on the PC. It has column click > sorting, and it copes just fine with 1,000,000+ lists. Sorting a column > takes a few seconds at most on a not particularly fast PC. > > DFSORT seems to be most useful where you need to sort more data than can > be processed in storage - but I'm wondering how often that really needs to > be done.
I'm not so interested in utilities and databases calling it under > the covers - more in applications that require records in a particular > order. Nor am I saying that's wrong - I'm really just asking whether > languages like Java provide opportunities to eliminate some sorting. > You have a good point about using SORT directly. Let the thing in the infrastructure use sort, like SQL "ORDER BY" or other things. Of course, it would be easier in our shop to do this if the COBOL language had a hashing facility built into it. Most of our code is COBOL and a CA product called EasyTrieve. We don't have any "fancy" or "up to date" languages like Java, Python, Ruby, Go, ... insert others ... . Took me a while to write and I had to rewrite a number of times when my current bitterness about things at work got to be too much. I'm going to go watch some ALF and "Get Smart" episodes to cheer up. > > > -- > Andrew Rowley > Black Hill Software > +61 413 302 386 > -- How many surrealists does it take to screw in a lightbulb? One to hold the giraffe and one to fill the bathtub with brightly colored power tools. Maranatha! <>< John McKown
Re: Why sort (was Microprocessor Optimization Primer)
On Sun, Apr 3, 2016 at 6:00 PM, Andrew Rowley wrote: > On 3/04/2016 22:43, David Crayford wrote: > >> Good question! Sort can be utilised for other purposes than sorting, it >> can be used as an I/O engine. DFSORT (or Syncsort) implements bespoke >> highly optimized I/O using low-level programming interfaces such as chained >> EXCPs which are significantly faster than using standard access methods >> like QSAM or BSAM, including overlapping BSAM I/O. DFSORT has exit routines >> (callbacks) which get called for each record. Basically it's supercharged >> I/O. One of our products does just that as do many others. IIRC, IBM >> FileManager uses sort for I/O. The trouble with using this technique with >> Java is the JNI/callback overhead. >> > > I'm aware of the efficient I/O, but I'm more interested in the use to put > data into a particular order. My own programs I never sort input data, > frequently sort small subsets of data during processing (likely always too > small quantities for something like DFSORT) and almost always sort for > presentation. Presentation is hopefully also too small quantities for > DFSORT. > > It is an interesting idea though to use it to read data via the exits > without actually giving it back to DFSORT to process. I'm not an application programmer. But I can just imagine the looks of astonishment and the "talk", if I were to write a COBOL program which does a SORT verb with INPUT PROCEDURE IS and OUTPUT PROCEDURE IS which only did a SORT FIELDS=COPY operation. Even more astonishment if I coded the INCLUDE or EXCLUDE to subset my data in addition to, or instead of, using COBOL code. I don't know if such coding would pass the majority of the "peer review" type processes. I'd love to try. Especially if I were smart enough to do so initially and keep the output listing. Then allow code review to force me to use normal COBOL methods. And then show the differences, assuming the SORT method actually is superior.
Of course, I'd better know my management. I was at one shop (sysprog) where my boss (sysprog + manager) did that with a major application that would max the 3083 (long ago). Basically he proved it was due to a flawed design. Unfortunately, that cost him his job because the design was actually done by the head of the company (software development company). > > > -- > Andrew Rowley > Black Hill Software > +61 413 302 386 > -- How many surrealists does it take to screw in a lightbulb? One to hold the giraffe and one to fill the bathtub with brightly colored power tools. Maranatha! <>< John McKown
Re: Why sort (was Microprocessor Optimization Primer)
The reason I like Java on Z so much is I got used to using Hashtable in C#, then tried to use Rexx stems to do the same thing. (It was semi successful but I always felt like it was very fragile due to the potential for unexpected values etc. for the stems.) Then I found Java had real hash tables. They make so many different problems so much easier. A million 1500 byte entries should be about 1.5 GB I think, and I would expect a hashmap to handle it without difficulty as long as the real storage was available. But typically a hashtable would hold an object with the specific items you're interested in rather than the whole 1500 byte item. As for sorting a List of a million 1500 byte items - again I would expect Java to do this without difficulty as long as real storage is available. Java is actually pretty efficient at this because you're actually sorting a list of pointers - you go all over memory to do the compares, but should be only shuffling 8MB of data in storage if you have a million 64 bit pointers. I regularly test EasySMF (written in C#) displaying lists of 1,000,000+ items on the PC. It has column click sorting, and it copes just fine with 1,000,000+ lists. Sorting a column takes a few seconds at most on a not particularly fast PC. DFSORT seems to be most useful where you need to sort more data than can be processed in storage - but I'm wondering how often that really needs to be done. I'm not so interested in utilities and databases calling it under the covers - more in applications that require records in a particular order. Nor am I saying that's wrong - I'm really just asking whether languages like Java provide opportunities to eliminate some sorting. On 3/04/2016 22:36, John McKown wrote: Sure, but how often do you have a Java HashMap which contains, say, a million entries? Oh, and the entries are not something like an "int", but more like a C struct where the size of each struct is around 1500 bytes. That would require about 1.5 Terabytes of memory. 
Not many systems have that much to give you for a single "object". And yes, we _do_ sort such monsters. Not often, granted, but we're doing a conversion right now and the programmer is doing work on claims which go back 10 years! That's a _lot_ of data! And, we don't have _any_ data bases, just VSAM and sequential data sets. I've actually used VSAM to do "sorting", by inserting records randomly, then reading them back in keyed order. The performance was horrible. DB2, or other database system, could be used in such a manner to avoid sorting. But I'd bet it would also be horrible. Of course, if you're reading an already existing VSAM keyed file, or a database, then you're golden. I'd bet most of the data in the non-z/OS world is kept in such a manner, as opposed to a regular "file". On z/OS, REXX has "stem" variables which are "content addressable", much like a HashMap. The COBOL language doesn't have anything like this built in. Neither does PL/I. Of course, IBM's Java for z/OS does. As do other languages in the UNIX environment such as Perl. But there just aren't as many of them in z/OS due to the effort to make them work in an EBCDIC environment instead of an ASCII (or Unicode) environment. For Perl, Larry Wall just said "forget it, we're not doing it any more". I know that there is a port of LUA ( http://lua4z.com/ ), but I don't know how popular it is. Unfortunately, z/OS people (programmers, sysprogs, and management) don't really seem to be very interested in doing UNIX type work on z/OS. Possibly because "it's too expensive!" or "it's not how we have done things in the past and it's too difficult to bother learning." Or, maybe, just plain NIH syndrome (Not Invented Here). I mean, have you read the screams here about the latest COBOL requiring PDSEs for their executable output? You'd think that they'd been told to convert their COBOL to FORTRAN.
-- Andrew Rowley Black Hill Software +61 413 302 386 -- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
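The match-by-key approach described above (read one file into a HashMap keyed on the match field, then stream the second file against it, removing entries as they match) can be sketched in a few lines of Java. This is a hedged illustration, not code from the thread: the record shape (a String key mapped to a String payload) and the names KeyMatch and match are hypothetical.

```java
import java.util.*;

public class KeyMatch {
    // Load the smaller input into a map keyed on the match field, then
    // read the larger input sequentially and pull matches out by key.
    // Removing entries as they match leaves the unmatched ones behind.
    public static Map<String, String> match(Map<String, String> smaller,
                                            List<String> largerKeys,
                                            List<String[]> matchedOut) {
        Map<String, String> pending = new HashMap<>(smaller);
        for (String key : largerKeys) {
            String rec = pending.remove(key);        // O(1) average lookup, no sort needed
            if (rec != null) {
                matchedOut.add(new String[] { key, rec });
            }
        }
        return pending;                              // records with no match in the second file
    }
}
```

Because lookups and removals are constant time on average, neither input needs to be in any particular order; whatever remains in the returned map is unmatched.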
Re: Why sort (was Microprocessor Optimization Primer)
On 3/04/2016 22:43, David Crayford wrote: Good question! Sort can be utilised for other purposes than sorting; it can be used as an I/O engine. DFSORT (or Syncsort) implements bespoke, highly optimized I/O using low-level programming interfaces such as chained EXCPs which are significantly faster than standard access methods like QSAM or BSAM, including overlapping BSAM I/O. DFSORT has exit routines (callbacks) which get called for each record. Basically it's supercharged I/O. One of our products does just that, as do many others. IIRC, IBM FileManager uses sort for I/O. The trouble with using this technique with Java is the JNI/callback overhead. I'm aware of the efficient I/O, but I'm more interested in the use of sort to put data into a particular order. In my own programs I never sort input data, frequently sort small subsets of data during processing (likely always quantities too small for something like DFSORT), and almost always sort for presentation. Presentation hopefully also involves quantities too small for DFSORT. It is an interesting idea, though, to use it to read data via the exits without actually giving it back to DFSORT to process. -- Andrew Rowley Black Hill Software +61 413 302 386
Re: Why sort (was Microprocessor Optimization Primer)
On 3 April 2016 at 02:50, Andrew Rowley wrote: > One question that puzzles me (maybe it's my lack of an application > programming background): Why is sort used so much on z/OS? As others have pointed out, sort on z/OS (whether IBM's or other vendors') can be used for all sorts (heh) of general I/O with high performance. But "sort" also covers the notion of merge, and more generally of collation. Many languages have constructs that implicitly sort, and all relational (and probably other) databases will sort implicitly as required, whether they implement their own sorts or call the system one. The database product I worked on 20 years ago had three levels of sort: for a few rows it did its own in-storage sort, for thousands of rows it did its own with work files, and for bigger stuff it called the system sort. Today those thresholds would be at least 10x higher because of much cheaper and bigger main storage, but the concept holds. A historical reason for the use of sort on z/OS may be that "way back in the days of steam powered computing", main storage was very expensive, disk was expensive, and tape was cheap. It was not unusual for sort to use tapes for work files; how else would you sort tens of millions of records on a machine with, say, 512 kB of storage and a few hundred megabytes of disk? UNIX and indeed most other systems didn't start out doing commercial data processing, and to this day don't do batch processing very well. Tony H.
Re: Why sort (was Microprocessor Optimization Primer)
I have always argued that a company can buy more CPU, but no one can buy more wall clock time. Yes, sometimes you need CPU to be king, which is why MFX (Syncsort, to you old timers) has offered multiple optimization options for years. Chris Blaicher Technical Architect Ironstream Development Syncsort Incorporated 50 Tice Boulevard, Woodcliff Lake, NJ 07677 P: 201-930-8234 | M: 512-627-3803 E: cblaic...@syncsort.com -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Jesse 1 Robinson Sent: Sunday, April 03, 2016 4:21 PM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Re: Why sort (was Microprocessor Optimization Primer) I used to work for the late great Security Pacific, at the time the largest bank based in Los Angeles. When DFSORT was a pimply-faced teenager, some of us sysprogs were invited to Santa Teresa to meet with product developers to share some real-world feedback. They were a young and earnest bunch. They wanted us to help them decide between two frequently conflicting goals: minimize CPU time, and minimize I/O count. Enhancing one often came at the expense of the other. We couldn't wait to lay it on them. Every business day at 2 AM, a messenger would arrive at our data center to collect a bag containing all the checks processed that day along with reports tied to them. The bag was to be delivered to 'the feds downtown'. If the bag was ready for pickup, all was sweetness and light. If the bag was late, there would be h*ll to pay. That's all that mattered: wall clock time. It was a revelation to the developers. Every serious business has to sort data for myriad reasons, all of which boil down to this: somewhere along the line--surely more than once--data must be processed in some kind of order. Maybe by account number. Maybe by account name. Maybe by account value. Each of these needs requires ordering unsorted data or data previously sorted for another purpose. 
Sort is a huge linchpin in the foundation of any large business. Argue about CPU or I/O stats all you want. You either meet your 'messenger' deadline or you don't. . . . J.O.Skip Robinson Southern California Edison Company Electric Dragon Team Paddler SHARE MVS Program Co-Manager 323-715-0595 Mobile 626-302-7535 Office robin...@sce.com -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Blaicher, Christopher Y. Sent: Sunday, April 03, 2016 6:52 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: (External):Re: Why sort (was Microprocessor Optimization Primer) Along with the other reasons outlined by others, it significantly improves bulk processing. I shy away from the term batch because that has come to have a bad connotation. When dealing with individual transactions, such as an ATM transaction or a web transaction, sorted data is not needed. But when a company goes to process all the payments received that day, or checks that cleared, etc., processing is much improved when the data coming in is in the same sequence as the existing data structure. It improves because of locality of reference. Using a relational data base, or any other random access method, doesn't mean you have to access it randomly. Chris Blaicher Technical Architect Software Development Syncsort Incorporated 50 Tice Boulevard, Woodcliff Lake, NJ 07677 P: 201-930-8234 | M: 512-627-3803 E: cblaic...@syncsort.com -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Andrew Rowley Sent: Sunday, April 03, 2016 2:51 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Why sort (was Microprocessor Optimization Primer) On 02/04/2016 10:09 PM, David Crayford wrote: > IBM switched the magic bit to offload the JZOS JNI C/C++ workload to a > zIIP so they could do the same for DFSORT. A well engineered library > could handle the callbacks so the client just reads records like a > normal API. 
That would certainly push Java batch up a notch. One question that puzzles me (maybe it's my lack of an application programming background): Why is sort used so much on z/OS? I know you can then e.g. do grouping based on key changes, but is that really necessary in current programs? Is that the reason it is commonly used? I generally use e.g. Java HashMap, C# Hashtable for grouping so the data doesn't need to be sorted. Do other common languages on z/OS provide similar functions? (C++ I know.) Are there opportunities to use programming language features to avoid sorts altogether? Andrew Rowley
Re: Why sort (was Microprocessor Optimization Primer)
I used to work for the late great Security Pacific, at the time the largest bank based in Los Angeles. When DFSORT was a pimply-faced teenager, some of us sysprogs were invited to Santa Teresa to meet with product developers to share some real-world feedback. They were a young and earnest bunch. They wanted us to help them decide between two frequently conflicting goals: minimize CPU time, and minimize I/O count. Enhancing one often came at the expense of the other. We couldn't wait to lay it on them. Every business day at 2 AM, a messenger would arrive at our data center to collect a bag containing all the checks processed that day along with reports tied to them. The bag was to be delivered to 'the feds downtown'. If the bag was ready for pickup, all was sweetness and light. If the bag was late, there would be h*ll to pay. That's all that mattered: wall clock time. It was a revelation to the developers. Every serious business has to sort data for myriad reasons, all of which boil down to this: somewhere along the line--surely more than once--data must be processed in some kind of order. Maybe by account number. Maybe by account name. Maybe by account value. Each of these needs requires ordering unsorted data or data previously sorted for another purpose. Sort is a huge linchpin in the foundation of any large business. Argue about CPU or I/O stats all you want. You either meet your 'messenger' deadline or you don't. . . . J.O.Skip Robinson Southern California Edison Company Electric Dragon Team Paddler SHARE MVS Program Co-Manager 323-715-0595 Mobile 626-302-7535 Office robin...@sce.com -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Blaicher, Christopher Y. 
Sent: Sunday, April 03, 2016 6:52 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: (External):Re: Why sort (was Microprocessor Optimization Primer) Along with the other reasons outlined by others, it significantly improves bulk processing. I shy away from the term batch because that has come to have a bad connotation. When dealing with individual transactions, such as an ATM transaction or a web transaction, sorted data is not needed. But when a company goes to process all the payments received that day, or checks that cleared, etc., processing is much improved when the data coming in is in the same sequence as the existing data structure. It improves because of locality of reference. Using a relational data base, or any other random access method, doesn't mean you have to access it randomly. Chris Blaicher Technical Architect Software Development Syncsort Incorporated 50 Tice Boulevard, Woodcliff Lake, NJ 07677 P: 201-930-8234 | M: 512-627-3803 E: cblaic...@syncsort.com -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Andrew Rowley Sent: Sunday, April 03, 2016 2:51 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Why sort (was Microprocessor Optimization Primer) On 02/04/2016 10:09 PM, David Crayford wrote: > IBM switched the magic bit to offload the JZOS JNI C/C++ workload to a > zIIP so they could do the same for DFSORT. A well engineered library > could handle the callbacks so the client just reads records like a > normal API. That would certainly push Java batch up a notch. One question that puzzles me (maybe it's my lack of an application programming background): Why is sort used so much on z/OS? I know you can then e.g. do grouping based on key changes, but is that really necessary in current programs? Is that the reason it is commonly used? I generally use e.g. Java HashMap, C# Hashtable for grouping so the data doesn't need to be sorted. Do other common languages on z/OS provide similar functions? (C++ I know.) 
Are there opportunities to use programming language features to avoid sorts altogether? Andrew Rowley
Re: Why sort (was Microprocessor Optimization Primer)
Along with the other reasons outlined by others, it significantly improves bulk processing. I shy away from the term batch because that has come to have a bad connotation. When dealing with individual transactions, such as an ATM transaction or a web transaction, sorted data is not needed. But when a company goes to process all the payments received that day, or checks that cleared, etc., processing is much improved when the data coming in is in the same sequence as the existing data structure. It improves because of locality of reference. Using a relational data base, or any other random access method, doesn't mean you have to access it randomly. Chris Blaicher Technical Architect Software Development Syncsort Incorporated 50 Tice Boulevard, Woodcliff Lake, NJ 07677 P: 201-930-8234 | M: 512-627-3803 E: cblaic...@syncsort.com -Original Message- From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of Andrew Rowley Sent: Sunday, April 03, 2016 2:51 AM To: IBM-MAIN@LISTSERV.UA.EDU Subject: Why sort (was Microprocessor Optimization Primer) On 02/04/2016 10:09 PM, David Crayford wrote: > IBM switched the magic bit to offload the JZOS JNI C/C++ workload to a > zIIP so they could do the same for DFSORT. A well engineered library > could handle the callbacks so the client just reads records like a > normal API. That would certainly push Java batch up a notch. One question that puzzles me (maybe it's my lack of an application programming background): Why is sort used so much on z/OS? I know you can then e.g. do grouping based on key changes, but is that really necessary in current programs? Is that the reason it is commonly used? I generally use e.g. Java HashMap, C# Hashtable for grouping so the data doesn't need to be sorted. Do other common languages on z/OS provide similar functions? (C++ I know.) Are there opportunities to use programming language features to avoid sorts altogether? 
Andrew Rowley ATTENTION: - The information contained in this message (including any files transmitted with this message) may contain proprietary, trade secret or other confidential and/or legally privileged information. Any pricing information contained in this message or in any files transmitted with this message is always confidential and cannot be shared with any third parties without prior written approval from Syncsort. This message is intended to be read only by the individual or entity to whom it is addressed or by their designee. If the reader of this message is not the intended recipient, you are on notice that any use, disclosure, copying or distribution of this message, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or Syncsort and destroy all copies of this message in your possession, custody or control.
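Chris's locality-of-reference point is what the classic sequential match exploits: when the transaction file and the master file are already in the same key sequence, a single forward pass pairs them up with no random access at all. A minimal sketch in Java, with hypothetical names (MatchMerge; key lists stand in for full records):

```java
import java.util.*;

public class MatchMerge {
    // Sequential match of two inputs sorted on the same key:
    // advance whichever side has the lower key, emit on equality.
    public static List<String> merge(List<String> txnKeys, List<String> masterKeys) {
        List<String> matched = new ArrayList<>();
        int i = 0, j = 0;
        while (i < txnKeys.size() && j < masterKeys.size()) {
            int cmp = txnKeys.get(i).compareTo(masterKeys.get(j));
            if (cmp == 0) {            // keys match: process the pair
                matched.add(txnKeys.get(i));
                i++; j++;
            } else if (cmp < 0) {
                i++;                   // transaction with no master record
            } else {
                j++;                   // master record with no transaction today
            }
        }
        return matched;
    }
}
```

Each input is touched exactly once, in order, which is why sorted bulk input plays so well with sequential data structures and cache/DASD locality.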
Re: Why sort (was Microprocessor Optimization Primer)
On 3/04/2016 2:50 PM, Andrew Rowley wrote: On 02/04/2016 10:09 PM, David Crayford wrote: IBM switched the magic bit to offload the JZOS JNI C/C++ workload to a zIIP so they could do the same for DFSORT. A well engineered library could handle the callbacks so the client just reads records like a normal API. That would certainly push Java batch up a notch. One question that puzzles me (maybe it's my lack of an application programming background): Why is sort used so much on z/OS? Good question! Sort can be utilised for other purposes than sorting; it can be used as an I/O engine. DFSORT (or Syncsort) implements bespoke, highly optimized I/O using low-level programming interfaces such as chained EXCPs which are significantly faster than standard access methods like QSAM or BSAM, including overlapping BSAM I/O. DFSORT has exit routines (callbacks) which get called for each record. Basically it's supercharged I/O. One of our products does just that, as do many others. IIRC, IBM FileManager uses sort for I/O. The trouble with using this technique with Java is the JNI/callback overhead. I know you can then e.g. do grouping based on key changes, but is that really necessary in current programs? Is that the reason it is commonly used? I generally use e.g. Java HashMap, C# Hashtable for grouping so the data doesn't need to be sorted. Do other common languages on z/OS provide similar functions? (C++ I know.) Are there opportunities to use programming language features to avoid sorts altogether? Andrew Rowley
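The "sort as an I/O engine" pattern David describes is typically a copy operation with an exit routine doing the real work. A hedged sketch of what such a job might look like, based on DFSORT's documented OPTION COPY and MODS/E35 statements; the data set names, the exit name MYEXIT, and the storage figure are made up for illustration:

```jcl
//COPYSTEP EXEC PGM=SORT
//SYSOUT   DD SYSOUT=*
//SORTIN   DD DSN=MY.INPUT.DATA,DISP=SHR
//SORTOUT  DD DUMMY
//EXITLIB  DD DSN=MY.EXIT.LOADLIB,DISP=SHR
//SYSIN    DD *
  OPTION COPY
* MYEXIT sees every record on the way out; returning RC=4 deletes the
* record, so nothing reaches SORTOUT and SORT acts purely as a reader.
  MODS E35=(MYEXIT,4096,EXITLIB,N)
/*
```

The exit gets each record via DFSORT's high-performance I/O path, which is the effect the thread describes products like FileManager using.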
Re: Why sort (was Microprocessor Optimization Primer)
On Sun, Apr 3, 2016 at 1:50 AM, Andrew Rowley wrote: > On 02/04/2016 10:09 PM, David Crayford wrote: > >> IBM switched the magic bit to offload the JZOS JNI C/C++ workload to a >> zIIP so they could do the same for DFSORT. A well engineered library >> could handle the callbacks so the client just reads records like a normal >> API. That would certainly push Java batch up a notch. >> > One question that puzzles me (maybe it's my lack of an application > programming background): Why is sort used so much on z/OS? > > I know you can then e.g. do grouping based on key changes, but is that > really necessary in current programs? Is that the reason it is commonly > used? > In my shop, it is used mainly so that the output, such as reports sent to a web site for perusal or email, is in some order such as account number. Also, DB2 uses it a lot when you do a CREATE INDEX, I think. IDCAMS uses it when you build an alternate index. This is done when a VSAM file (yes, there are a lot of them still around) is "reorganized" for performance or space reasons. > > I generally use e.g. Java HashMap, C# Hashtable for grouping so the data > doesn't need to be sorted. Do other common languages on z/OS provide > similar functions? (C++ I know.) Are there opportunities to use programming > language features to avoid sorts altogether? > Sure, but how often do you have a Java HashMap which contains, say, a million entries? Oh, and the entries are not something like an "int", but more like a C struct where the size of each struct is around 1500 bytes. That would require about 1.5 Terabytes of memory. Not many systems have that much to give you for a single "object". And yes, we _do_ sort such monsters. Not often, granted, but we're doing a conversion right now and the programmer is doing work on claims which go back 10 years! That's a _lot_ of data! And, we don't have _any_ data bases, just VSAM and sequential data sets. 
I've actually used VSAM to do "sorting", by inserting records randomly, then reading them back in keyed order. The performance was horrible. DB2, or another database system, could be used in such a manner to avoid sorting. But I'd bet it would also be horrible. Of course, if you're reading an already existing VSAM keyed file, or a database, then you're golden. I'd bet most of the data in the non-z/OS world is kept in such a manner, as opposed to a regular "file". On z/OS, REXX has "stem" variables which are "content addressable", much like a HashMap. The COBOL language doesn't have anything like this built in. Neither does PL/I. Of course, IBM's Java for z/OS does. As do other languages in the UNIX environment such as Perl. But there just aren't as many of them on z/OS due to the effort to make them work in an EBCDIC environment instead of an ASCII (or Unicode) environment. For Perl, Larry Wall just said "forget it, we're not doing it any more". I know that there is a port of Lua ( http://lua4z.com/ ), but I don't know how popular it is. Unfortunately, z/OS people (programmers, sysprogs, and management) don't really seem to be very interested in doing UNIX type work on z/OS. Possibly because "it's too expensive!" or "it's not how we have done things in the past and it's too difficult to bother learning." Or, maybe, just plain NIH syndrome (Not Invented Here). I mean, have you read the screams here about the latest COBOL requiring PDSEs for their executable output? You'd think that they'd been told to convert their COBOL to FORTRAN. > > Andrew Rowley > -- How many surrealists does it take to screw in a lightbulb? One to hold the giraffe and one to fill the bathtub with brightly colored power tools. Maranatha! 
<>< John McKown
Why sort (was Microprocessor Optimization Primer)
On 02/04/2016 10:09 PM, David Crayford wrote: IBM switched the magic bit to offload the JZOS JNI C/C++ workload to a zIIP so they could do the same for DFSORT. A well engineered library could handle the callbacks so the client just reads records like a normal API. That would certainly push Java batch up a notch. One question that puzzles me (maybe it's my lack of an application programming background): Why is sort used so much on z/OS? I know you can then e.g. do grouping based on key changes, but is that really necessary in current programs? Is that the reason it is commonly used? I generally use e.g. Java HashMap, C# Hashtable for grouping so the data doesn't need to be sorted. Do other common languages on z/OS provide similar functions? (C++ I know.) Are there opportunities to use programming language features to avoid sorts altogether? Andrew Rowley
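The grouping Andrew describes, accumulating totals by key instead of sorting and then breaking on key change, can be sketched with Java's HashMap.merge. The record shape (account, amount pairs) and the name GroupTotals are hypothetical illustrations, not from the thread:

```java
import java.util.*;

public class GroupTotals {
    // Group-and-sum without a sort: totals accumulate in a HashMap
    // keyed on account, so the input order never matters.
    public static Map<String, Long> totalByAccount(List<String[]> txns) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] t : txns) {                  // t[0] = account, t[1] = amount
            totals.merge(t[0], Long.parseLong(t[1]), Long::sum);
        }
        return totals;
    }
}
```

If the output must then be presented in account order, only the (usually much smaller) set of distinct keys needs sorting, not the raw records.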