[ https://issues.apache.org/jira/browse/DRILL-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468274#comment-17468274 ]
ASF GitHub Bot commented on DRILL-8088:
---------------------------------------
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817
All: so I've kicked the hornet's nest with the mention of value vectors and
Arrow. I'm going to put on my flame-proof suit and debunk some myths.
The columnar format is great for storage, for all the usual reasons. This is
why Parquet uses it, Druid uses it for segment files, and various DBs use it
for storage. The question we want to ask is, do those benefits apply to the
format within the Drill execution engine? I'm here to suggest that columnar has
no advantage, and many disadvantages, when used as the *internal* format of an
execution engine. "Thems is fighting words", so let's bring it on.
I've had the pleasure of working with several query engines: Drill
(columnar) and Impala (row-based) are two well-known examples. This has given
me a unique opportunity to see if all the marketing claims for columnar (which
still appear in the videos on Drill's website) actually hold up in practice.
Spoiler: they don't.
This is a PR about optimization. A good rule in optimization is to start
with the biggest issues, then work toward the details. So, rather than tinker
with the details of vector execution, let's look at the fundamental issues.
**Myth: Vectorized execution**: The biggest myth is around vectorized
execution. Of course, a minor myth is that Drill uses such execution (it
doesn't.) The bigger myth is that, if we invested enough, it could.
Vectorized execution is great when we have a simple operation we apply to a
large amount of data. Think the dot-product operation for neural networks, or
data compression, or image transforms, or graphics. In all cases, we apply a
simple operation (rescale, say) to a large amount of homogeneous data (the
pixels in an image.)
So, the question is, does typical, real-world SQL fit this pattern? I've now
seen enough crufty, complex, messy real-world queries to suggest that, no, SQL
is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed
business logic, and that logic is based on messy human rules, not crisp, clean
mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`,
etc.), lots of function calls (all those cool UDFs which @cgivre has written),
and so on. Plus, as noted above, SQL deals with NULL values, which must
short-circuit entire execution paths.
Hence, even if we could vectorize simple operations, we'd find that, in most
queries, we could not actually use that code.
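To make the contrast concrete, here's a minimal sketch in plain Java (not Drill code; `someUdf` is a hypothetical stand-in). The first loop is the shape compilers can auto-vectorize; the second is the shape real SQL expressions take once `CASE` logic, UDF calls, and NULL handling appear:
```java
// Minimal sketch, not Drill code; someUdf is a hypothetical stand-in.
public class VectorizationSketch {

  // Vectorizable: one simple operation over homogeneous data.
  // No branches, no nulls, no calls: a compiler can emit SIMD here.
  static void rescale(double[] pixels, double factor) {
    for (int i = 0; i < pixels.length; i++) {
      pixels[i] *= factor;
    }
  }

  // The shape of real SQL: a CASE WHEN branch, an opaque UDF call,
  // and a null bitmap that short-circuits the path. Each of these
  // defeats straight-line SIMD execution.
  static void evalExpr(double[] in, boolean[] isNull, double[] out) {
    for (int i = 0; i < in.length; i++) {
      if (isNull[i]) {
        out[i] = 0;               // NULL short-circuits the math
      } else if (in[i] > 100.0) {
        out[i] = someUdf(in[i]);  // opaque call, no SIMD across it
      } else {
        out[i] = in[i];
      }
    }
  }

  static double someUdf(double v) { return v * 0.9; } // placeholder UDF
}
```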
**Myth: Vectors are CPU Cache Friendly**: The second big myth is that
vectors are somehow more "friendly" to the CPU L1 cache than a row format. The
idea is that one can load a vector into the L1 cache, then zip through many
values in one go. This myth is related to the above one.
First, SQL expressions are not based on columns; they are based on rows.
Each calculation tends to involve multiple columns: `net_receipts = sales +
taxes - returns`. Here each calculation touches four vectors, so we need all
four to be in the CPU cache to benefit.
Second, SQL is row based: the above calculation is just one of perhaps many
that occur on each row. In the ideal case, the calculations form independent
groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In this case, we
could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load
`c`, `d`, `e`, and `y` into the cache, and so on. Of course, Drill doesn't work
this way (it does all the calculations for a single row before moving to the
next), but it could, and it would have to in order to benefit from vectorization.
A more typical case is that the same column is used in multiple expressions:
`SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must
load the `a` vector into the L1 cache multiple times. (Or, more properly, its
values would continually be bumped out of the cache, then reloaded.)
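Here's a sketch of the point, with plain Java arrays standing in for vectors (it mirrors the row-at-a-time order described above; it is not Drill's generated code):
```java
// Sketch only: arrays stand in for value vectors.
public class ProjectionSketch {

  // All expressions are evaluated for one row before moving to the
  // next, so every referenced vector must be cache-resident at once.
  static void project(double[] a, double[] b, double[] c, double[] d,
                      double[] e, double[] x, double[] y, double[] z) {
    for (int row = 0; row < a.length; row++) {
      x[row] = a[row] + b[row];            // touches a, b, x
      y[row] = a[row] / c[row];            // touches a again, plus c, y
      z[row] = (a[row] - d[row]) * e[row]; // touches a again, plus d, e, z
    }
    // The working set is all eight vectors at once; with large
    // vectors, the values of `a` keep getting evicted and reloaded.
  }
}
```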
**Myth: Bigger Vectors are Better**: Drill went through a phase when everyone
bought into the "L1 cache" myth. To get better performance, everyone wanted ever
larger vectors. In the code, you'll see that we started with 1K-row batches,
then grew to 4K, then other code would create 64K-row batches. It got so bad
we'd allocate vectors larger than 16MB, which caused memory fragmentation and
OOM errors. (This is the original reason for what evolved to be "EVF": to
control vector sizes to prevent memory fragmentation - very basic DB stuff.)
Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is
already 16x the L1 cache size. Combine that with real-world expressions and we
end up with a "working set" of 10s of MB in size: 20x or more the L1 cache
size. The result is lots of cache misses. (This stuff is really hard to
measure; it would be great for someone to do the experiments to show this
happening in practice.)
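The arithmetic, as a back-of-envelope check using the figures above (not a measurement):
```java
// Back-of-envelope check using the figures quoted above.
public class CacheMath {
  public static void main(String[] args) {
    long l1 = 256L * 1024;               // ~256K L1 cache, per the text
    long vector = 4L * 1024 * 1024;      // one 4MB vector
    long workingSet = 8 * vector;        // e.g. an eight-vector projection
    System.out.println(vector / l1);     // 16: one vector alone is 16x the L1
    System.out.println(workingSet / l1); // 128: the full working set is far worse
  }
}
```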
**Myth: Vectors are Efficient**: A related, minor myth is that writing to
vectors is more efficient than writing to rows. This is not true in general. In
Drill, it is especially false. Originally, vectors were allocated at some small
initial size (256K? Something like that.) Vectors grow as we add data. (That's
one of the things that make working with them difficult.) Fill the 256K vector?
We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB,
and so on. In the end, to grow to 4MB, we copy about 4MB of data. That is 4MB
of reads and 4MB of writes, in addition to the 4MB of writes needed to create
the data.
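A quick simulation of that doubling cost (the 256K starting size is the guess above, not a verified constant):
```java
// Simulates doubling growth and tallies the bytes copied along the way.
public class GrowthCostSketch {
  public static void main(String[] args) {
    long size = 256L * 1024;      // initial allocation (a guess, per the text)
    long copied = 0;
    while (size < 4L * 1024 * 1024) {
      copied += size;             // each realloc copies the old buffer
      size *= 2;                  // 256K -> 512K -> 1M -> 2M -> 4M
    }
    // 256K + 512K + 1M + 2M = 3932160 bytes: about 3.75MB of reads plus
    // the same in writes, on top of the 4MB of writes that create the data.
    System.out.println(copied);
  }
}
```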
Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a
good initial size for vectors. Not too hard for fixed-width vectors (`INT`,
say), but rather tricky for variable-width vectors (`VARCHAR`).
Remember that each vector is sized and grows independently. So, to create a
batch, we have to track the size of every vector, grow those that need growing,
but not over-allocate all the vectors because the space is not fungible: vector
`a` can't "borrow" unused space from vector `b`.
The point is, Drill needs a large amount of complex, add-on code just to
work around the fact that every vector will grow, and copy data, if not sized
correctly, and, in general, we don't know ahead of time what the size should
be. The result is inefficiency.
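A sketch of what that bookkeeping amounts to (the `Vector` interface here is a hypothetical stand-in, not Drill's actual ValueVector API):
```java
import java.util.List;

// Hypothetical stand-in; these are not Drill's real ValueVector methods.
public class BatchSizerSketch {

  interface Vector {
    long capacity();     // bytes currently allocated
    long bytesUsed();    // bytes written so far
    void reallocate();   // double the buffer and copy it (costly)
  }

  // Every vector is checked and grown independently: free space in one
  // vector cannot be lent to another, so none of this work is shared.
  static void ensureCapacity(List<Vector> batch, long incomingBytes) {
    for (Vector v : batch) {
      while (v.capacity() - v.bytesUsed() < incomingBytes) {
        v.reallocate();
      }
    }
  }
}
```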
**Single Data Format for In-Memory and Over-the-Wire**: One of Drill's
claims to fame is that value vectors (and, later, Arrow vectors) use the same
layout in memory as over the wire, leading to efficient exchanges via RPC. This
myth is true, as far as it goes. But, if you look at the code, the truth is
much more complex. On the sender side, vectors are independent buffers (as
explained above.) On the receiver side, the whole message comes in as one big
buffer. Special code slices up that buffer to recover vectors. A vast amount of
complex code is needed to handle the accounting. (Thanks to Chris, a one-time
Drill contributor who wrote all that code well enough that it "just works". It
would be hell to fix otherwise.)
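The receive-side idea looks roughly like this (a simplified sketch; the real code handles metadata, nullability buffers, and far more accounting):
```java
import java.nio.ByteBuffer;

// Simplified sketch of receive-side slicing, not Drill's actual code.
public class ReceiverSketch {

  // The whole message arrives as one big buffer; each vector is
  // recovered as a zero-copy slice at a known offset and length.
  static ByteBuffer[] sliceVectors(ByteBuffer body, int[] vectorLengths) {
    ByteBuffer[] vectors = new ByteBuffer[vectorLengths.length];
    int offset = 0;
    for (int i = 0; i < vectorLengths.length; i++) {
      body.position(offset).limit(offset + vectorLengths[i]);
      vectors[i] = body.slice();  // a view into the big buffer, no copy
      body.clear();               // reset position/limit for the next slice
      offset += vectorLengths[i];
    }
    return vectors;
  }
}
```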
> Improve expression evaluation performance
> -----------------------------------------
>
> Key: DRILL-8088
> URL: https://issues.apache.org/jira/browse/DRILL-8088
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Codegen
> Reporter: wtf
> Assignee: wtf
> Priority: Minor
>
> Found an unnecessary map copy when doing expression evaluation; it slows
> down codegen when the query includes many "case when" expressions or
> avg/stddev (whose reduced expressions include "case when"). In our case,
> the query includes 314 avg calls and takes 3+ seconds to generate the
> projector expressions (Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz, 32 cores).