First of all, full disclaimer that I was in DFSORT development for about 8 years so I might be biased. But I just want to share a few thoughts.
First, the idea of loading all the data into a large hashmap to do the sort tends to eliminate one very important thing, and that's overlap. Essentially, you read the entire input, conduct your massive hashsort, and then write the output, with no overlap of those three phases. The approach I prefer is an iterative process of sorting smaller amounts and writing them to work files (either on disk or in memory); then, at end of input, you almost immediately begin the output process of merging those sorted strings. This technique is very efficient, and I can tell you many z/OS customers are sorting tens to hundreds of gigabytes of data this way.

The second point I'd like to make is also related to overlap. Sorting the files allows downstream processes to read them sequentially rather than doing random gets from, say, VSAM or a database. When you read or write sequentially, you have opportunities for I/O overlap along with blocking and chaining, so you can be reading the next set of data while your program is processing the previous set. This results in considerable elapsed time savings and a reduction in I/O overhead, since more data is transferred with each I/O.

And that's just my 2 cents!

Have a nice day,
Dave Betten
z/OS Performance Specialist
Cloud and Systems Performance
IBM Corporation
email: bet...@us.ibm.com

IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU> wrote on 04/03/2016 07:28:39 PM:

> From: Andrew Rowley <and...@blackhillsoftware.com>
> To: IBM-MAIN@LISTSERV.UA.EDU
> Date: 04/03/2016 07:32 PM
> Subject: Re: Why sort (was Microprocessor Optimization Primer)
> Sent by: IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU>
>
> The reason I like Java on Z so much is I got used to using Hashtable in
> C#, then tried to use Rexx stems to do the same thing. (It was semi
> successful but I always felt like it was very fragile due to the
> potential for unexpected values etc. for the stems.) Then I found Java
> had real hash tables.
> They make so many different problems so much easier.
>
> A million 1500 byte entries should be about 1.5 GB I think, and I would
> expect a hashmap to handle it without difficulty as long as the real
> storage was available. But typically a hashtable would hold an object
> with the specific items you're interested in rather than the whole 1500
> byte item.
>
> As for sorting a List of a million 1500 byte items - again I would
> expect Java to do this without difficulty as long as real storage is
> available. Java is actually pretty efficient at this because you're
> actually sorting a list of pointers - you go all over memory to do the
> compares, but should be only shuffling 8MB of data in storage if you
> have a million 64 bit pointers. I regularly test EasySMF (written in C#)
> displaying lists of 1,000,000+ items on the PC. It has column click
> sorting, and it copes just fine with 1,000,000+ lists. Sorting a column
> takes a few seconds at most on a not particularly fast PC.
>
> DFSORT seems to be most useful where you need to sort more data than can
> be processed in storage - but I'm wondering how often that really needs
> to be done. I'm not so interested in utilities and databases calling it
> under the covers - more in applications that require records in a
> particular order. Nor am I saying that's wrong - I'm really just asking
> whether languages like Java provide opportunities to eliminate some sorting.
>
> On 3/04/2016 22:36, John McKown wrote:
> > Sure, but how often do you have a Java HashMap which contains, say, a
> > million entries? Oh, and the entries are not something like an "int", but
> > more like a C struct where the size of each struct is around 1500 bytes.
> > That would require about 1.5 Terabytes of memory. Not many systems have
> > that much to give you for a single "object". And yes, we _do_ sort such
> > monsters.
> > Not often, granted, but we're doing a conversion right now and
> > the programmer is doing work on claims which go back 10 years! That's a
> > _lot_ of data! And <sob>, we don't have _any_ data bases, just VSAM and
> > sequential data sets. I've actually used VSAM to do "sorting", by inserting
> > records randomly, then reading them back in keyed order. The performance
> > was horrible. DB2, or other database system, could be used in such as
> > manner to avoid sorting. But I'd bet it would also be horrible. Of course,
> > if you're reading an already existing VSAM keyed file, or a database, then
> > you're golden. I'd bet most of the data in the non-z/OS world is kept in
> > such a manner, as opposed to a regular "file".
> >
> > On z/OS, REXX has "stem" variables which are "content addressable", much
> > like a HashMap (keep type HaspMap, <grin>). The COBOL language doesn't have
> > anything like this built in. Neither does PL/I. Of course, IBM's Java for
> > z/OS does. As do other languages in the UNIX environment such as Perl. But
> > there just aren't as many of them in z/OS due to the effort to make them
> > work in an EBCDIC environment instead of an ASCII (or Unicode) environment.
> > For Perl, Larry Wall just said "forget it, we're not doing it any more". I
> > know that there is a port of LUA ( http://lua4z.com/ ), but I don't know
> > how popular it is. Unfortunately, z/OS people (programmers, sysprogs, and
> > management) don't really seem to be very interested in doing UNIX type work
> > on z/OS. Possibly because "it's too expensive!" or "it's not how we have
> > done things in the past and it's too difficult to bother learning." Or,
> > maybe, just plain NIH syndrome (Not Invented Here). I mean, have you read
> > the screams here about the latest COBOL requiring PDSEs for their
> > executable output? You'd think that they'd been told to convert their COBOL
> > to FORTRAN.
> >
>
> --
> Andrew Rowley
> Black Hill Software
> +61 413 302 386
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
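
For readers who want to see the shape of the "sort smaller amounts, then merge" approach described at the top of this message, here is a minimal Java sketch. It is only an illustration, not how DFSORT is implemented: the class name, method names, and the `runSize` parameter are all made up for this example, the "work files" are plain in-memory lists, and it assumes `runSize` is positive. It reads "sorted strings" in the message as sorted runs, which is an interpretation. It also does not show real I/O overlap; it only shows that output can begin as soon as the last input record has been read, because every run is already sorted by then.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: names and structure are hypothetical, not DFSORT internals.
class RunMergeSortSketch {

    // Phase 1: consume the input in chunks of at most runSize records,
    // sort each chunk, and keep it as a "work file" (here an in-memory list).
    static List<List<String>> makeSortedRuns(Iterator<String> input, int runSize) {
        List<List<String>> runs = new ArrayList<>();
        List<String> current = new ArrayList<>(runSize);
        while (input.hasNext()) {
            current.add(input.next());
            if (current.size() == runSize) {
                Collections.sort(current);
                runs.add(current);
                current = new ArrayList<>(runSize);
            }
        }
        if (!current.isEmpty()) {          // final, possibly short, run
            Collections.sort(current);
            runs.add(current);
        }
        return runs;
    }

    // One cursor per run: the record at the head plus the rest of that run.
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    // Phase 2: k-way merge. The heap always exposes the smallest head record,
    // so output is produced one record at a time without re-sorting anything.
    static List<String> mergeRuns(List<List<String>> runs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {        // advance this run and put it back
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        return out;
    }

    static List<String> sort(Iterable<String> records, int runSize) {
        return mergeRuns(makeSortedRuns(records.iterator(), runSize));
    }
}
```

Calling `RunMergeSortSketch.sort(records, 2)` on five records builds three sorted runs and merges them. In a real external sort the runs would be written to and read back from work data sets sequentially, which is where the blocking and I/O overlap mentioned in the message would come in.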