Hi All,

Recall that one of the claimed advantages of value vectors is that we could, in theory, write operators in C/C++ to use SIMD instructions. Recall also that developers have often tried to make vectors ever larger in order to benefit from CPU cache lines.
Since Drill is written in Java, and typically uses nullable types, the vectorized SIMD instruction idea is more of an aspiration than an operational reality. And, since per-core CPU caches are usually on the order of 256K, we don't actually gain much from making vectors 4, 16 or 64 MB in size. Further, Drill's operators all process batches row-wise, causing cache thrashing as we iterate over the vectors for each row.

All that said, the core idea is correct; the key question is how to create the proper implementation. In this light, there is an interesting paper out of Oracle, "Analytics with smart arrays: adaptive and efficient language-independent data", summarized at The Morning Paper [1].

This paper outlines a structure called a "smart array" which is vaguely like a Drill value vector. The paper describes a number of experiments on how to place array operations on NUMA cores to optimize compute or cache performance. The paper also talks about a simple encoding of integers which compresses memory at the cost of increased compute. (If memory bandwidth is the bottleneck, then CPU operations on data already in the cache are essentially free, as the CPU would otherwise be waiting for a memory fetch.)

The gist of the paper is that getting good CPU and cache performance is more complex than the classic Drill ideas of large vectors and SIMD instructions. But, there can be a significant benefit from good array design. The paper also points out that there is no "one size fits all" solution; it shows that a variety of solutions are needed. The paper discusses software developed to automatically pick the right approach for the available hardware. (This is the "smart" part of the title.)

The Oracle solution is written in C/C++, with an API for Java. It uses the latest Graal JVM in Java 11. Some very interesting ideas here. Overall, it is worth a quick read of the summary to compare and contrast the Oracle and Drill approaches.
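To make the integer-encoding trade-off mentioned above a bit more concrete, here is a minimal sketch of a fixed-width bit-packed integer array in Java. This is not the paper's code and not anything in Drill; the class name PackedIntArray and its methods are hypothetical, and the layout (values packed back to back into 64-bit words) is just one simple assumption about how such an encoding could work. Each read costs a few extra shifts and masks, but the array occupies far less memory, which is the point when memory bandwidth is the bottleneck.

```java
// Hypothetical sketch: stores non-negative integers using a fixed number of
// bits per value, packed back to back into 64-bit words.
final class PackedIntArray {
    private final long[] words; // backing storage
    private final int bits;     // bits per value (1..32)
    private final long mask;    // low 'bits' bits set

    PackedIntArray(int size, int bits) {
        if (size < 0 || bits < 1 || bits > 32) {
            throw new IllegalArgumentException("size must be >= 0, bits in 1..32");
        }
        this.bits = bits;
        this.mask = (1L << bits) - 1;
        // Round the total bit count up to a whole number of 64-bit words.
        this.words = new long[(int) (((long) size * bits + 63) / 64)];
    }

    void set(int index, long value) {
        long bitPos = (long) index * bits;
        int word = (int) (bitPos >>> 6);   // which 64-bit word
        int offset = (int) (bitPos & 63);  // bit offset within that word
        value &= mask;
        words[word] = (words[word] & ~(mask << offset)) | (value << offset);
        int spill = offset + bits - 64;    // bits that overflow into the next word
        if (spill > 0) {
            long highMask = (1L << spill) - 1;
            words[word + 1] = (words[word + 1] & ~highMask) | (value >>> (bits - spill));
        }
    }

    long get(int index) {
        long bitPos = (long) index * bits;
        int word = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long v = words[word] >>> offset;
        int spill = offset + bits - 64;
        if (spill > 0) {
            // Value straddles two words: pull the high bits from the next word.
            v |= words[word + 1] << (bits - spill);
        }
        return v & mask;
    }

    // Bytes used by the packed storage (ignoring object headers).
    int byteSize() {
        return words.length * Long.BYTES;
    }

    public static void main(String[] args) {
        PackedIntArray a = new PackedIntArray(100, 10);
        for (int i = 0; i < 100; i++) {
            a.set(i, (i * 7) % 1024);
        }
        System.out.println("value[42] = " + a.get(42));
        System.out.println("packed bytes = " + a.byteSize()
            + ", plain int[] bytes = " + (100 * Integer.BYTES));
    }
}
```

With 10 bits per value, 100 values fit in 16 longs rather than 100 ints. Whether the extra shifting pays off depends on whether the scan is memory-bound, which is exactly the kind of decision the paper argues should be made per workload and per machine.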
Thanks,
- Paul

[1] https://blog.acolyer.org/2018/06/14/analytics-with-smart-arrays-adaptive-and-efficient-language-independent-data/