Oracle's Smart Arrays paper

Paul Rogers Fri, 10 Aug 2018 19:25:23 -0700

Hi All,

Recall that one of the claimed advantages of value vectors is that we could, in
theory, write operators in C/C++ to use SIMD instructions. Recall also that
developers have often tried to make vectors ever larger in order to benefit
from the CPU cache.

Since Drill is written in Java, and typically uses nullable types, the
vectorized SIMD instruction idea is more of an aspiration than an operational
reality. And, since per-core CPU caches are usually on the order of 256 KB (a
cache line itself is typically only 64 bytes), we don't actually gain much from
making vectors 4, 16 or 64 MB in size. Further, Drill's operators all process
batches row-wise, causing cache thrashing as we iterate over the vectors for
each row.
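
To make the row-wise versus column-wise distinction concrete, here is a minimal
sketch in Java. It is not Drill code: the class AccessOrder and both method
names are hypothetical, and plain int[] arrays stand in for value vectors. Both
methods compute the same per-column sums; only the traversal order differs.

```java
/** Hypothetical illustration of two traversal orders over columnar data. */
public final class AccessOrder {

    /** Row-wise: for each row, visit every column, jumping between arrays. */
    public static long[] sumRowWise(int[][] columns, int rowCount) {
        long[] sums = new long[columns.length];
        for (int row = 0; row < rowCount; row++) {
            for (int col = 0; col < columns.length; col++) {
                sums[col] += columns[col][row];
            }
        }
        return sums;
    }

    /** Column-wise: scan one column sequentially before touching the next. */
    public static long[] sumColumnWise(int[][] columns, int rowCount) {
        long[] sums = new long[columns.length];
        for (int col = 0; col < columns.length; col++) {
            int[] vector = columns[col];
            long sum = 0;
            for (int row = 0; row < rowCount; row++) {
                sum += vector[row];
            }
            sums[col] = sum;
        }
        return sums;
    }
}
```

With many wide columns, the row-wise loop keeps one active memory stream per
column, while the column-wise loop works through a single contiguous array at a
time. How much that matters in practice depends on the hardware and the number
of columns; the sketch only shows the access pattern.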

All that said, the core idea is correct; the key question is how to create the
proper implementation.

In this light, there is an interesting paper out of Oracle, "Analytics with
smart arrays: adaptive and efficient language-independent data", summarized at
The Morning Paper [1].

This paper outlines a structure called a "smart array" which is vaguely like a
Drill value vector. The paper describes a number of experiments regarding how
to place array operations on NUMA cores to optimize compute or cache
performance. The paper talks about a simple encoding of integers which
compresses memory at the cost of increased compute. (If memory bandwidth is the
bottleneck, then CPU operations that operate on data already in the cache are
essentially free, as the CPU would otherwise wait for a memory fetch.)
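
As a rough illustration of that trade-off, here is a minimal bit-packing sketch
in Java. It is an assumption about what such an encoding can look like, not the
paper's implementation or API: the class PackedIntArray and its methods are
hypothetical. Each value is stored in a fixed number of bits inside a long[],
so reads and writes cost a few extra shifts and masks in exchange for moving
fewer bytes through memory.

```java
/** Hypothetical fixed-width bit-packed array; not the paper's or Drill's API. */
public final class PackedIntArray {
    private final int size;      // number of logical values
    private final int bits;      // bits stored per value, 1..63
    private final long mask;     // the low `bits` bits set
    private final long[] words;  // packed storage, 64 bits per word

    public PackedIntArray(int size, int bits) {
        if (size < 0 || bits < 1 || bits > 63) {
            throw new IllegalArgumentException("size >= 0 and bits in 1..63 required");
        }
        this.size = size;
        this.bits = bits;
        this.mask = (1L << bits) - 1;
        this.words = new long[(int) (((long) size * bits + 63) / 64)];
    }

    public void set(int index, long value) {
        checkIndex(index);
        if ((value & ~mask) != 0) {
            throw new IllegalArgumentException("value does not fit in " + bits + " bits");
        }
        long bitPos = (long) index * bits;
        int word = (int) (bitPos >>> 6);   // which 64-bit word
        int offset = (int) (bitPos & 63);  // bit offset inside that word
        words[word] = (words[word] & ~(mask << offset)) | (value << offset);
        int spill = offset + bits - 64;    // bits that overflow into the next word
        if (spill > 0) {
            int kept = bits - spill;       // bits that fit in the first word
            words[word + 1] = (words[word + 1] & ~(mask >>> kept)) | (value >>> kept);
        }
    }

    public long get(int index) {
        checkIndex(index);
        long bitPos = (long) index * bits;
        int word = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long value = words[word] >>> offset;
        int spill = offset + bits - 64;
        if (spill > 0) {
            value |= words[word + 1] << (bits - spill);
        }
        return value & mask;
    }

    /** Bytes used by the packed storage. */
    public int storageBytes() {
        return words.length * Long.BYTES;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }
}
```

For example, 1000 values at 10 bits each occupy 157 words (1256 bytes), versus
4000 bytes for a plain int[1000]. Whether the extra shifts pay off depends on
whether the workload is bound by memory bandwidth or by compute.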

The gist of the paper is that getting good CPU and cache performance is more
complex than the classic Drill ideas of large vectors and SIMD instructions.
But, there can be a significant benefit from good array design.

The paper also points out that there is no "one size fits all" solution; it
shows that a variety of solutions are needed. The paper discusses software
developed to automatically pick the right approach for the available hardware.
(This is the "smart" part of the title.)

The Oracle solution is written in C/C++, with an API for Java. It uses the
latest Graal JVM in Java 11. Some very interesting ideas here.

Overall, it is worth a quick read of the summary to compare and contrast the
Oracle and Drill approaches.

Thanks,
- Paul

[1]
https://blog.acolyer.org/2018/06/14/analytics-with-smart-arrays-adaptive-and-efficient-language-independent-data/
