Hi All,
Happy New Year!
Recent experiments show that we can more than double in-memory sort performance
by generating tighter code in the “inner loops” of the sort and merge
implementations. The improved sort performance causes a 50% reduction in
overall runtime (doubling of performance) for the sample query: 10 million rows
of randomly-generated integers. YMMV.
The details are in [1]. In short, generated code makes expensive use of layer
upon layer of function calls to fetch each value. The speed-up comes from
eliminating 13 function calls per compare operation. In the sort we do
something like O(n log n) compares: so savings add up quickly. For this test,
the reduction is something like ~1.5 billion function calls.
Existing:
compare()
. SelectionVector2.getIndex( ) // x 2
. . DrillBuf.getChar( )
. . . DrillBuf.getShort()
. . . . PlatformDependent.getShort()
. doEval( ) // Generated, this is for required int
. . IntVector.getAccessor() // x 2
. . . Accessor.get()
. . . . DrillBuf.getInt()
. . . . . PlatformDependent.getInt()
Revised:
public int compare(int leftIndex, int rightIndex) {
int sv1 = PlatformDependent.getShort(addrSv + (leftIndex << 1)) & 0xFFFF;
int sv2 = PlatformDependent.getShort(addrSv + (rightIndex << 1)) & 0xFFFF;
int left = PlatformDependent.getInt(addr0 + (sv1<<2));
int right = PlatformDependent.getInt(addr4 + (sv2<<2));
A similar change was made in the “MSorter”: the code that merges sorted batches
in memory.
Interestingly, no additional performance gain was seen in replacing the
PlatformDependent calls with direct calls to Unsafe. Presumably that path is
simple enough to be optimized away by the JVM.
This experiment hand-coded the replacement “SingleBatchSorter” and “MSorter". A
next step would be to revise the code generator to generate the required code.
And, of course, nullable, variable-length and array vectors will be more
complex than the simple required int vector used in the experiment.
This technique can be used anywhere we have compute-intensive inner loops in
generated code.
- Paul
[1]
https://github.com/paul-rogers/drill/wiki/Optimization-of-External-Sort-Performance