Hi Igor,
Thanks much for volunteering to create some POCs for our various options! It is
not entirely obvious what we want to test, so let's think about it a bit.
We want to identify those areas that are either the biggest risk or benefit to
performance. We want to do that without the cost of actually implementing the
target solutions. Instead, we want to find clever ways to test specific
components or scenarios.
Performance is notoriously hard to predict, but we have to start somewhere. One
place to start is with our knowledge of where vectors have advantages (network
transfers) and where they might have costs (partition exchanges, code gen.)
It would be great if others could suggest areas based on their experience.
First, we don't need to test the parts that would be the same between two
options. So, if both Drill vectors and Arrow are direct memory formats directly
consumable by Netty, we probably don't need to compare network performance
between the two options.
On the other hand, a plain Java solution would require
serialization/deserialization of objects, so we'd want to look into the many
serialization options available and compare them with, say, value vectors. We
could create a simple client and server that transfers batches of records of
various widths so we can see the cost of serialization. It could be that Netty
isn't even the right choice for plain Java; perhaps we want to compare value
vectors/Netty with plain Java/gRPC (or whatever is the most state-of-the-art
choice.) We might look at what Presto and Spark do in order to get some hints.
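As a rough sketch of what such a measurement could look like, here's a toy
benchmark using built-in Java serialization; the Row class and its field
shapes are invented for illustration, and a real test would use JMH and also
try Kryo, protobuf, and friends:

    import java.io.*;

    // Hypothetical fixed-shape row; a real test would vary width and count.
    class Row implements Serializable {
        int id;
        long amount;
        String name;
        Row(int id, long amount, String name) {
            this.id = id; this.amount = amount; this.name = name;
        }
    }

    public class SerializationBench {
        public static void main(String[] args) throws IOException {
            Row[] batch = new Row[4096];
            for (int i = 0; i < batch.length; i++) {
                batch[i] = new Row(i, i * 100L, "name-" + i);
            }
            long start = System.nanoTime();
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(batch);  // the serialization cost we want to measure
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("Serialized %d rows to %d bytes in %.2f ms%n",
                batch.length, bytes.size(), elapsed / 1e6);
        }
    }

The same harness, pointed at a socket instead of a byte array, gives the
client/server version.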
Next, we'd want to know how much faster Arrow is compared to value vectors for
common operations. Maybe simulate writing, then reading, batches of records of
various sizes. For apples-to-apples comparisons, both can be done at the vector
API level. Because Drill is an SQL engine, maybe write record-by-record. If
Arrow has optimizations for certain operations (filters maybe? WHERE x = 10,
say), try the Drill approach (row-by-row: read vector x, compare with 10, and
set a selection vector), and compare that with whatever Arrow does (LLVM code
gen?) We could then do the same operations with plain Java, expanding on the
little prototype we just did.
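To make the row-by-row approach concrete, here's a minimal sketch, with plain
int arrays standing in for the value vector and the selection vector:

    // Row-by-row filter (WHERE x = 10) that builds a selection vector.
    // Plain arrays stand in for the value vector and selection vector.
    public class FilterSketch {
        static int filter(int[] xVector, int rowCount, int[] selectionVector) {
            int outCount = 0;
            for (int row = 0; row < rowCount; row++) {
                if (xVector[row] == 10) {
                    selectionVector[outCount++] = row;  // record matching row index
                }
            }
            return outCount;  // number of rows selected
        }
    }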
Third, we should consider code gen. We can assume that Java code gen for Arrow
will be substantially similar to that for value vectors (though the detailed
holders, readers and the rest will differ.) If we choose to code gen to the
column accessor API, then code gen would ideally be nearly identical for either
implementation, and we would not need to run tests.
However, if we used plain Java, code gen would be far simpler (just directly
access class fields). So, we could maybe take a sizable chunk of generated code
(maybe a TPC-H query's Project operator, which has a number of calculations),
and hand-write the plain Java equivalent. (This is not as hard as it sounds;
most of our generated code is boilerplate not needed for plain Java.) Then we
could compare compile performance (compiler + byte code fix-up for value
vectors, compile only for Java), along with runtime performance (run the code
against, say, 100 canned, well-known batches.)
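To give a flavor of the hand-written version, here's a minimal sketch; the row
classes and the TPC-H-style expressions are invented for illustration:

    // Hand-written equivalent of generated Project code: a TPC-H-style
    // calculation using direct field access instead of vector holders.
    class LineItem {
        double extendedPrice;
        double discount;
        double tax;
    }

    class OutRow {
        double discPrice;
        double charge;
    }

    public class ProjectSketch {
        // Assumes the output rows are pre-allocated by the caller.
        static void project(LineItem[] in, OutRow[] out, int rowCount) {
            for (int i = 0; i < rowCount; i++) {
                double disc = in[i].extendedPrice * (1 - in[i].discount);
                out[i].discPrice = disc;
                out[i].charge = disc * (1 + in[i].tax);
            }
        }
    }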
If Arrow can do low-level code gen, we can perhaps compare the performance of
an Arrow implementation (with its code gen added to the Drill code gen) against
the value vector version.
Fourth, we want to understand how the memory allocators work under load. For
example, we might simulate a loaded Drillbit that is doing, say, buffered
exchanges, hash joins or sorts. A good simulation might be to create, buffer,
read and release many batches concurrently in many threads. This will tell us
how well the memory allocators behave when memory is heavily utilized (see how
close to 100% usage we can get) and concurrency is high (at what thread count
does throughput start dropping due to memory manager contention?)
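Here's a rough sketch of that kind of load test; direct ByteBuffers stand in
for whichever allocator (Drill, Arrow, or plain heap) we would actually
exercise:

    import java.nio.ByteBuffer;
    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicLong;

    // Many threads allocate, touch, and drop "batches" concurrently.
    // Direct ByteBuffers stand in for the allocator under test; a real
    // allocator would release each batch explicitly rather than rely on GC.
    public class AllocatorLoadTest {
        public static void main(String[] args) throws InterruptedException {
            int threads = args.length > 0 ? Integer.parseInt(args[0]) : 16;
            int batchSize = 1 << 20;  // 1 MB per simulated batch
            AtomicLong batches = new AtomicLong();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                pool.submit(() -> {
                    while (!Thread.currentThread().isInterrupted()) {
                        ByteBuffer buf = ByteBuffer.allocateDirect(batchSize);
                        buf.putLong(0, 42L);  // touch the memory
                        batches.incrementAndGet();
                    }
                });
            }
            Thread.sleep(10_000);  // run for 10 seconds
            pool.shutdownNow();
            System.out.printf("%d threads: %d batches%n", threads, batches.get());
        }
    }

Sweeping the thread count and batch size should show where contention starts.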
More broadly, we might simulate complex operations such as sort or join. As
above, we don't need the full operation: we just identify the most expensive
part (copying rows, say, or simulating a selection vector remover.)
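For instance, the expensive core of a selection vector remover is just a
gather copy, which we could isolate in a sketch like this (arrays again
standing in for vectors):

    // Core of a selection vector remover: gather the selected rows into a
    // dense output vector. Arrays stand in for value vectors.
    public class SvRemoverSketch {
        static void compact(long[] in, int[] selectionVector, int selCount,
                            long[] out) {
            for (int i = 0; i < selCount; i++) {
                out[i] = in[selectionVector[i]];  // copy only selected rows
            }
        }
    }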
Fifth, and probably more difficult, would be to understand the impact on
network exchanges. Simulate exchanges on a large cluster (the issue I mentioned
previously.) How much memory and CPU cost is needed for the vector-based slice &
buffer approach we use today? How might that change if we use plain Java and
can just shuffle & send individual Java rows? We don't actually need a large
cluster: we could use a smaller cluster and tell Drill to create 3x or 5x
fragments compared to CPUs. (Or, even better: a whole bunch of this kind of
testing was done by the previous Drill team at MapR; it would be great if we
could dig up those results.)
Another possible approach, if someone has access to a decent test cluster, is
to run the TPC-H queries on both Drill and Presto. Presto appears to mostly use
a heap-memory approach, so we could get a crude comparison there. If anyone
knows of an Arrow-based, open source tool that can do TPC-H, we could compare
that to Drill's value-vector approach to get another crude estimate. Such tests
are not ideal, since they will also measure things like Parquet read
efficiency, DAG pipelining and the rest. But they would provide useful data
points.
With these basics available, we can try variations. For example, I can try to
find the branch for that fixed-size-block prototype we did a few years back and
compare it to Arrow and value vectors. Plus whatever else we think up.
That's a quick initial outline of what we could do. I'm sure folks will suggest
ways to simplify and sharpen the set of tests. In any event, a very good first
step is to get familiar with Arrow to suggest tests that would best demonstrate
its advantages.
The ideal result out of all this would be to learn that 1) the difference in
performance between the options is too small to justify the cost of making a
switch, or 2) one of the options is so obviously better that it is clear what
we should do. Of course, the real answer is likely to be somewhere in the
middle.
Suggestions?
Thanks,
- Paul
On Monday, January 13, 2020, 10:52:14 AM PST, Igor Guzenko
<[email protected]> wrote:
Hi Paul and Volodymyr,
Thank you very much, Volodymyr and Paul, for defining a good migration
strategy. It really should make for a smooth migration.
What I also really like about this discussion is that excellent questions
have appeared:
- Aren't we just suffering from premature optimization?
- Was the whole complexity of vectors and buffers actually required to
achieve the performance improvements?
- What happens without the complex codegen? Is it possible that the JIT
lets us avoid it?
I think we have finally defined the three options to compare:
- plain Java with a few different GCs optimized for a large heap
- Arrow vectors
- fixed-size Drill vectors.
So, as a first step, I want to develop a POC and compare these options.
Since you're much more experienced with Drill than I am, could you please
help me define the minimal set of operators to compare?
Thanks,
Igor