[ https://issues.apache.org/jira/browse/DRILL-8088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17468274#comment-17468274 ]
ASF GitHub Bot commented on DRILL-8088:
---------------------------------------
paul-rogers commented on pull request #2412:
URL: https://github.com/apache/drill/pull/2412#issuecomment-1004438817
All: so I've kicked the hornet's nest with the mention of value vectors and
Arrow. I'm going to put on my flame-proof suit and debunk some myths.
The columnar format is great for storage, for all the usual reasons. This is
why Parquet uses it, Druid uses it for segment files, and various DBs use it
for storage. The question we want to ask is, do those benefits apply to the
format within the Drill execution engine? I'm here to suggest that columnar has
no advantage, and many disadvantages, when used as the *internal* format of an
execution engine. "Thems is fighting words", so let's bring it on.
I've had the pleasure of working with several query engines: Drill
(columnar) and Impala (row-based) are two well-known examples. This has given
me a unique opportunity to see if all the marketing claims for columnar (which
still appear in the videos on Drill's website) actually hold up in practice.
Spoiler: they don't.
This is a PR about optimization. A good rule in optimization is to start
with the biggest issues, then work toward the details. So, rather than tinker
with the details of vector execution, let's look at the fundamental issues.
**Myth: Vectorized execution**: The biggest myth is around vectorized
execution. Of course, a minor myth is that Drill uses such execution (it
doesn't.) The bigger myth is that, if we invested enough, it could.
Vectorized execution is great when we have a simple operation we apply to a
large amount of data. Think the dot-product operation for neural networks, or
data compression, or image transforms, or graphics. In all cases, we apply a
simple operation (rescale, say) to a large amount of homogeneous data (the
pixels in an image.)
So, the question is, does typical, real-world SQL fit this pattern? I've now
seen enough crufty, complex, messy real-world queries to suggest that, no, SQL
is not a good candidate for vectorization. `SELECT` and `WHERE` clauses embed
business logic, and that logic is based on messy human rules, not crisp, clean
mathematics. The resulting SQL tends to have conditionals (`WHEN` or `IF()`,
etc.), lots of function calls (all those cool UDFs which @cgivre has written),
and so on. Plus, as noted above, SQL deals with NULL values, which must
short-circuit entire execution paths.
Hence, even if we could vectorize simple operations, we'd find that, in most
queries, we could not actually use that code.
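To make the contrast concrete, here's a minimal sketch in plain Java (not Drill code; `someUdf` is a hypothetical stand-in). The first loop is the shape compilers can auto-vectorize; the second is the shape real SQL expressions take once `CASE` logic, UDF calls, and NULL handling appear:
```java
// Minimal sketch, not Drill code; someUdf is a hypothetical stand-in.
public class VectorizationSketch {

  // Vectorizable: one simple operation over homogeneous data.
  // No branches, no nulls, no calls: a compiler can emit SIMD here.
  static void rescale(double[] pixels, double factor) {
    for (int i = 0; i < pixels.length; i++) {
      pixels[i] *= factor;
    }
  }

  // The shape of real SQL: a CASE WHEN branch, an opaque UDF call,
  // and a null bitmap that short-circuits the path. Each of these
  // defeats straight-line SIMD execution.
  static void evalExpr(double[] in, boolean[] isNull, double[] out) {
    for (int i = 0; i < in.length; i++) {
      if (isNull[i]) {
        out[i] = 0;               // NULL short-circuits the math
      } else if (in[i] > 100.0) {
        out[i] = someUdf(in[i]);  // opaque call, no SIMD across it
      } else {
        out[i] = in[i];
      }
    }
  }

  static double someUdf(double v) { return v * 0.9; } // placeholder UDF
}
```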
**Myth: Vectors are CPU Cache Friendly**: The second big myth is that
vectors are somehow more "friendly" to the CPU L1 cache than a row format. The
idea is that one can load a vector into the L1 cache, then zip through many
values in one go. This myth is related to the above one.
First, SQL expressions are not based on columns; they are based on rows.
Each calculation tends to involve multiple columns: `net_receipts = sales +
taxes - returns`. Here each calculation touches four vectors, so we need all
four to be in the CPU cache to benefit.
Second, SQL is row based: the above calculation is just one of perhaps many
that occur on each row. In the ideal case, the calculations form independent
groups: `SELECT a + b AS x, c - d + e AS y, f / g AS z, ...`. In this case, we
could load vectors `a`, `b`, `x` into the L1 cache, do the calcs, then load
`c`, `d`, `e`, and `y` into the cache, and so on. Of course, Drill doesn't work
this way (it does all the calculations for a single row before moving to the
next), but it could, and it would have to in order to benefit from vectorization.
A more typical case is that the same column is used in multiple expressions:
`SELECT a + b AS x, a / c AS y, (a - d) * e AS z, ...` In this case, we must
load the `a` vector into the L1 cache multiple times. (Or, more properly, its
values would continually be bumped out of the cache, then reloaded.)
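Here's a sketch of the point, with plain Java arrays standing in for vectors (it mirrors the row-at-a-time order described above; it is not Drill's generated code):
```java
// Sketch only: arrays stand in for value vectors.
public class ProjectionSketch {

  // All expressions are evaluated for one row before moving to the
  // next, so every referenced vector must be cache-resident at once.
  static void project(double[] a, double[] b, double[] c, double[] d,
                      double[] e, double[] x, double[] y, double[] z) {
    for (int row = 0; row < a.length; row++) {
      x[row] = a[row] + b[row];            // touches a, b, x
      y[row] = a[row] / c[row];            // touches a again, plus c, y
      z[row] = (a[row] - d[row]) * e[row]; // touches a again, plus d, e, z
    }
    // The working set is all eight vectors at once; with large
    // vectors, the values of `a` keep getting evicted and reloaded.
  }
}
```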
**Myth: Bigger Vectors are Better**: Drill went through a phase when everyone
bought into the "L1 cache" myth. To get better performance, everyone wanted ever
larger vectors. In the code, you'll see that we started with 1K-row batches,
then grew to 4K, then other code would create 64K-row batches. It got so bad
we'd allocate vectors larger than 16MB, which caused memory fragmentation and
OOM errors. (This is the original reason for what evolved to be "EVF": to
control vector sizes to prevent memory fragmentation - very basic DB stuff.)
Remember, the CPU L1 cache is only about 256K in size. A 4MB vector is
already 16x the L1 cache size. Combine that with real-world expressions and we
end up with a "working set" of 10s of MB in size: 20x or more the L1 cache
size. The result is lots of cache misses. (This stuff is really hard to
measure; it would be great for someone to do the experiments to show this
happening in practice.)
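The arithmetic, as a back-of-envelope check using the figures above (not a measurement):
```java
// Back-of-envelope check using the figures quoted above.
public class CacheMath {
  public static void main(String[] args) {
    long l1 = 256L * 1024;               // ~256K L1 cache, per the text
    long vector = 4L * 1024 * 1024;      // one 4MB vector
    long workingSet = 8 * vector;        // e.g. an eight-vector projection
    System.out.println(vector / l1);     // 16: one vector alone is 16x the L1
    System.out.println(workingSet / l1); // 128: the full working set is far worse
  }
}
```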
**Myth: Vectors are Efficient**: A related, minor myth is that writing to
vectors is more efficient than writing to rows. This is not true in general. In
Drill, it is especially false. Originally, vectors were allocated at some small
initial size (256K? Something like that.) Vectors grow as we add data. (That's
one of the things that make working with them difficult.) Fill the 256K vector?
We double it to 512K and copy across the data. We do it again at 1MB, 2MB, 4MB,
and so on. In the end, to grow to 4MB, we copy about 4MB of data. That is 4MB
of reads and 4MB of writes, in addition to the 4MB of writes needed to create
the data.
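A quick simulation of that doubling cost (the 256K starting size is the guess above, not a verified constant):
```java
// Simulates doubling growth and tallies the bytes copied along the way.
public class GrowthCostSketch {
  public static void main(String[] args) {
    long size = 256L * 1024;      // initial allocation (a guess, per the text)
    long copied = 0;
    while (size < 4L * 1024 * 1024) {
      copied += size;             // each realloc copies the old buffer
      size *= 2;                  // 256K -> 512K -> 1M -> 2M -> 4M
    }
    // 256K + 512K + 1M + 2M = 3932160 bytes: about 3.75MB of reads plus
    // the same in writes, on top of the 4MB of writes that create the data.
    System.out.println(copied);
  }
}
```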
Later, a bunch of ad-hoc "batch sizing" code was added to try to guess a
good initial size for vectors. Not too hard for fixed-width vectors (`INT`,
say), but rather tricky for variable-width vectors (`VARCHAR`).
Remember that each vector is sized and grows independently. So, to create a
batch, we have to track the size of every vector, grow those that need growing,
but not over-allocate all the vectors because the space is not fungible: vector
`a` can't "borrow" unused space from vector `b`.
The point is, Drill needs a large amount of complex, add-on code just to
work around the fact that every vector will grow, and copy data, if not sized
correctly, and, in general, we don't know ahead of time what the size should
be. The result is inefficiency.
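A sketch of what that bookkeeping amounts to (the `Vector` interface here is a hypothetical stand-in, not Drill's actual ValueVector API):
```java
import java.util.List;

// Hypothetical stand-in; these are not Drill's real ValueVector methods.
public class BatchSizerSketch {

  interface Vector {
    long capacity();     // bytes currently allocated
    long bytesUsed();    // bytes written so far
    void reallocate();   // double the buffer and copy it (costly)
  }

  // Every vector is checked and grown independently: free space in one
  // vector cannot be lent to another, so none of this work is shared.
  static void ensureCapacity(List<Vector> batch, long incomingBytes) {
    for (Vector v : batch) {
      while (v.capacity() - v.bytesUsed() < incomingBytes) {
        v.reallocate();
      }
    }
  }
}
```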
**Single Data Format for In-Memory and Over-the-Wire**: One of Drill's
claims to fame is that value vectors (and, later, Arrow vectors) use the same
layout in memory as over the wire, leading to efficient exchanges via RPC. This
myth is true, as far as it goes. But, if you look at the code, the truth is
much more complex. On the sender side, vectors are independent buffers (as
explained above.) On the receiver side, the whole message comes in as one big
buffer. Special code slices up that buffer to recover vectors. A vast amount of
complex code is needed to handle the accounting. (Thanks to Chris, a one-time
Drill contributor who wrote all that code well enough that it "just works". It
would be hell to fix otherwise.)
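The receive-side idea looks roughly like this (a simplified sketch; the real code handles metadata, nullability buffers, and far more accounting):
```java
import java.nio.ByteBuffer;

// Simplified sketch of receive-side slicing, not Drill's actual code.
public class ReceiverSketch {

  // The whole message arrives as one big buffer; each vector is
  // recovered as a zero-copy slice at a known offset and length.
  static ByteBuffer[] sliceVectors(ByteBuffer body, int[] vectorLengths) {
    ByteBuffer[] vectors = new ByteBuffer[vectorLengths.length];
    int offset = 0;
    for (int i = 0; i < vectorLengths.length; i++) {
      body.position(offset).limit(offset + vectorLengths[i]);
      vectors[i] = body.slice();  // a view into the big buffer, no copy
      body.clear();               // reset position/limit for the next slice
      offset += vectorLengths[i];
    }
    return vectors;
  }
}
```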
> Improve expression evaluation performance
> -----------------------------------------
>
> Key: DRILL-8088
> URL: https://issues.apache.org/jira/browse/DRILL-8088
> Project: Apache Drill
> Issue Type: Improvement
> Components: Execution - Codegen
> Reporter: wtf
> Assignee: wtf
> Priority: Minor
>
> Found an unnecessary map copy when doing expression evaluation; it slows
> down codegen when the query includes many "case when" expressions or
> avg/stddev (whose reduced expressions include "case when"). In our case,
> the query includes 314 avg calls and takes 3+ seconds to generate the
> projector expressions (Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz, 32 cores).