Re: Why sort (was Microprocessor Optimization Primer)

David Betten Sun, 03 Apr 2016 18:26:11 -0700

First of all, full disclosure: I was in DFSORT development for about 8
years, so I might be biased.  But I just want to share a few thoughts.


First, the idea of loading all the data into a large hashmap to do the sort
tends to eliminate one very important thing, and that's overlap.
Essentially, you read the entire input, conduct your massive hashsort, and
then write the output, with no overlap between those three phases.  The
approach I prefer is an iterative process of sorting smaller amounts of data
and writing them to work files (either on disk or in memory); then, at end
of input, you almost immediately begin the output process of merging those
sorted strings.  This technique is very efficient, and I can tell you many
z/OS customers are sorting tens to hundreds of gigabytes of data this way.
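
To make the sort-then-merge idea concrete, here is a minimal sketch in
Python.  It is not how DFSORT is implemented; the function names, the
run_size parameter, and the use of temporary files as "work files" are all
assumptions made for illustration.  Records are assumed to be strings with
no embedded newlines.

```python
import heapq
import itertools
import tempfile


def sorted_runs(records, run_size):
    """Sort the input run_size records at a time; write each run to a work file."""
    it = iter(records)
    runs = []
    while True:
        chunk = sorted(itertools.islice(it, run_size))
        if not chunk:
            break
        work = tempfile.TemporaryFile(mode="w+")  # hypothetical "work file"
        work.writelines(rec + "\n" for rec in chunk)
        work.seek(0)
        runs.append(work)
    return runs


def external_sort(records, run_size=1000):
    """Yield records in sorted order by merging the sorted runs."""
    runs = sorted_runs(records, run_size)
    try:
        # Each work file is read sequentially; heapq.merge only keeps
        # one pending record per run in memory.
        streams = [(line.rstrip("\n") for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()
```

The sorting of each run happens while the input is still arriving, and the
first output record is available as soon as the merge starts, which is the
overlap described above.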

The second point I'd like to make is also related to overlap.  Sorting the
files allows downstream processes to read them sequentially rather than
doing random gets from, say, VSAM or a database.  When you read or write
sequentially, you have opportunities for I/O overlap along with blocking
and chaining, so you can be reading the next set of data while your
program is processing the previous set.  This results in considerable
elapsed time savings and a reduction in I/O overhead, since more data is
transferred with each I/O.
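
As a rough illustration of reading ahead on a sequential file, the sketch
below uses a helper thread to fetch the next blocks while the caller is
still processing the current one.  This is only an analogy in Python, not
how z/OS access methods do it; read_ahead, block_size, and depth are
hypothetical names chosen for the example.

```python
import io
import queue
import threading


def read_ahead(stream, block_size=64 * 1024, depth=2):
    """Yield blocks from stream while a helper thread reads the next ones."""
    blocks = queue.Queue(maxsize=depth)  # at most `depth` blocks buffered
    done = object()                      # end-of-input marker

    def reader():
        try:
            while True:
                block = stream.read(block_size)
                if not block:
                    break
                blocks.put(block)        # waits if the buffers are full
            blocks.put(done)
        except Exception as exc:         # hand read errors to the consumer
            blocks.put(exc)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        block = blocks.get()
        if block is done:
            return
        if isinstance(block, Exception):
            raise block
        yield block
```

A caller would simply loop over read_ahead(f) and process each block; the
larger the block, the more data is moved per read request.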

And that's just my 2 cents!


Have a nice day,
Dave Betten
z/OS Performance Specialist
Cloud and Systems Performance
IBM Corporation
email:  bet...@us.ibm.com

IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU> wrote on
04/03/2016 07:28:39 PM:

> From: Andrew Rowley <and...@blackhillsoftware.com>
> To: IBM-MAIN@LISTSERV.UA.EDU
> Date: 04/03/2016 07:32 PM
> Subject: Re: Why sort (was Microprocessor Optimization Primer)
> Sent by: IBM Mainframe Discussion List <IBM-MAIN@LISTSERV.UA.EDU>
>
> The reason I like Java on Z so much is I got used to using Hashtable in
> C#, then tried to use Rexx stems to do the same thing. (It was semi
> successful but I always felt like it was very fragile due to the
> potential for unexpected values etc. for the stems.) Then I found Java
> had real hash tables. They make so many different problems so much easier.
>
> A million 1500 byte entries should be about 1.5 GB I think, and I would
> expect a hashmap to handle it without difficulty as long as the real
> storage was available. But typically a hashtable would hold an object
> with the specific items you're interested in rather than the whole 1500
> byte item.
>
> As for sorting a List of a million 1500 byte items - again I would
> expect Java to do this without difficulty as long as real storage is
> available. Java is actually pretty efficient at this because you're
> actually sorting a list of pointers - you go all over memory to do the
> compares, but should be only shuffling 8MB of data in storage if you
> have a million 64 bit pointers. I regularly test EasySMF (written in C#)
> displaying lists of 1,000,000+ items on the PC. It has column click
> sorting, and it copes just fine with 1,000,000+ lists. Sorting a column
> takes a few seconds at most on a not particularly fast PC.
>
> DFSORT seems to be most useful where you need to sort more data than can
> be processed in storage - but I'm wondering how often that really needs
> to be done. I'm not so interested in utilities and databases calling it
> under the covers - more in applications that require records in a
> particular order. Nor am I saying that's wrong - I'm really just asking
> whether languages like Java provide opportunities to eliminate some sorting.
>
> On 3/04/2016 22:36, John McKown wrote:
> > Sure, but how often do you have a Java HashMap which contains, say, a
> > million entries? Oh, and the entries are not something like an "int", but
> > more like a C struct where the size of each struct is around 1500 bytes.
> > That would require about 1.5 Terabytes of memory. Not many systems have
> > that much to give you for a single "object". And yes, we _do_ sort such
> > monsters. Not often, granted, but we're doing a conversion right now and
> > the programmer is doing work on claims which go back 10 years! That's a
> > _lot_ of data! And <sob>, we don't have _any_ data bases, just VSAM and
> > sequential data sets. I've actually used VSAM to do "sorting", by inserting
> > records randomly, then reading them back in keyed order. The performance
> > was horrible. DB2, or other database system, could be used in such as
> > manner to avoid sorting. But I'd bet it would also be horrible. Of course,
> > if you're reading an already existing VSAM keyed file, or a database, then
> > you're golden. I'd bet most of the data in the non-z/OS world is kept in
> > such a manner, as opposed to a regular "file".
> >
> > On z/OS, REXX has "stem" variables which are "content addressable", much
> > like a HashMap (keep type HaspMap, <grin>). The COBOL language doesn't have
> > anything like this built in. Neither does PL/I. Of course, IBM's Java for
> > z/OS does. As do other languages in the UNIX environment such as Perl. But
> > there just aren't as many of them in z/OS due to the effort to make them
> > work in an EBCDIC environment instead of an ASCII (or Unicode) environment.
> > For Perl, Larry Wall just said "forget it, we're not doing it any more". I
> > know that there is a port of LUA ( http://lua4z.com/ ), but I don't know
> > how popular it is. Unfortunately, z/OS people (programmers, sysprogs, and
> > management) don't really seem to be very interested in doing UNIX type work
> > on z/OS. Possibly because "it's too expensive!" or "it's not how we have
> > done things in the past and it's too difficult to bother learning." Or,
> > maybe, just plain NIH syndrome (Not Invented Here). I mean, have you read
> > the screams here about the latest COBOL requiring PDSEs for their
> > executable output? You'd think that they'd been told to convert their COBOL
> > to FORTRAN.
> >
> >
>
>
> --
> Andrew Rowley
> Black Hill Software
> +61 413 302 386
>
> ----------------------------------------------------------------------
> For IBM-MAIN subscribe / signoff / archive access instructions,
> send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
>

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
