Sort performance improvement

Paul Rogers Sun, 01 Jan 2017 17:13:13 -0800

Hi All,

Happy New Year!


Recent experiments show that we can more than double in-memory sort performance 
by generating tighter code in the “inner loops” of the sort and merge 
implementations. The improved sort performance causes a 50% reduction in 
overall runtime (doubling of performance) for the sample query: 10 million rows 
of randomly-generated integers. YMMV.

The details are in [1]. In short, generated code makes expensive use of layer 
upon layer of function calls to fetch each value. The speed-up comes from 
eliminating 13 function calls per compare operation. In the sort we do 
something like O(n log n) compares: so savings add up quickly. For this test, 
the reduction is something like ~1.5 billion function calls.

Existing:

compare()
. SelectionVector2.getIndex( ) // x 2
. . DrillBuf.getChar( )
. . . DrillBuf.getShort()
. . . . PlatformDependent.getShort()
. doEval( ) // Generated, this is for required int
. . IntVector.getAccessor() // x 2
. . . Accessor.get()
. . . . DrillBuf.getInt()
. . . . . PlatformDependent.getInt()

Revised:

    public int compare(int leftIndex, int rightIndex) {
      int sv1 = PlatformDependent.getShort(addrSv + (leftIndex << 1)) & 0xFFFF;
      int sv2 = PlatformDependent.getShort(addrSv + (rightIndex << 1)) & 0xFFFF;
      int left = PlatformDependent.getInt(addr0 + (sv1<<2));
      int right = PlatformDependent.getInt(addr4 + (sv2<<2));

A similar change was made in the “MSorter”: the code that merges sorted batches 
in memory.

Interestingly, no additional performance gain was seen in replacing the 
PlatformDependent calls with direct calls to Unsafe. Presumably that path is 
simple enough to be optimized away by the JVM. 

This experiment hand-coded the replacement “SingleBatchSorter” and “MSorter". A 
next step would be to revise the code generator to generate the required code. 
And, of course, nullable, variable-length and array vectors will be more 
complex than the simple required int vector used in the experiment.

This technique can be used anywhere we have compute-intensive inner loops in 
generated code.

- Paul

[1] 
https://github.com/paul-rogers/drill/wiki/Optimization-of-External-Sort-Performance

Sort performance improvement

Reply via email to