paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1008510147


   @jnturton, I think you're starting down a slippery slope, one that will end up with you convinced to simply move to Arrow (or to enhance Drill's value vectors). You are assuming that you can get a big win from vectorization in selected compute or hash operations -- at no cost elsewhere. The gist of my argument was that, if that were true, Drill should already be in good shape: we'd only need to add some SIMD code and we'd rock. You're also assuming that SIMD hash functions exist; I'm not sure they do: a search turned up somewhat random results. (SIMD support for SHA does exist, however.)
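
   For what it's worth, here is roughly what a "SIMD hash" would have to look like on the JVM: the Murmur3 fmix32 finalizer applied lane-wise via the JDK's incubating Vector API (jdk.incubator.vector, JDK 16+, run with --add-modules jdk.incubator.vector). A sketch only, not Drill code, and note that it handles nothing but fixed-width int keys -- exactly the narrow case I keep saying is the exception:

   ```java
   import jdk.incubator.vector.IntVector;
   import jdk.incubator.vector.VectorOperators;
   import jdk.incubator.vector.VectorSpecies;

   public class LaneWiseHash {
     private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

     // Murmur3 fmix32 finalizer, one key per SIMD lane.
     public static void hash(int[] keys, int[] out) {
       int i = 0;
       for (; i < SPECIES.loopBound(keys.length); i += SPECIES.length()) {
         IntVector h = IntVector.fromArray(SPECIES, keys, i);
         h = h.lanewise(VectorOperators.XOR, h.lanewise(VectorOperators.LSHR, 16));
         h = h.mul(0x85ebca6b);
         h = h.lanewise(VectorOperators.XOR, h.lanewise(VectorOperators.LSHR, 13));
         h = h.mul(0xc2b2ae35);
         h = h.lanewise(VectorOperators.XOR, h.lanewise(VectorOperators.LSHR, 16));
         h.intoArray(out, i);
       }
       for (; i < keys.length; i++) { // scalar tail for the remainder
         int h = keys[i];
         h ^= h >>> 16;  h *= 0x85ebca6b;
         h ^= h >>> 13;  h *= 0xc2b2ae35;
         h ^= h >>> 16;
         out[i] = h;
       }
     }
   }
   ```

   The moment the keys become nullable VarChars, the lanes diverge and this collapses back to scalar code -- which is exactly my point about how narrow the win is.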
   
   The hard truth is that vectors are good in very limited cases: the ones you outlined. As I said above, Drill does far more than that. For all those other things, vectors are hugely complex and expensive.
   
   So, if we ignore the parts where vectors are inefficient, ignore the operations that don't benefit from vectors, ignore the complexity of vectors in client code and their horrible effect on exchanges... and instead focus on some ideal cases where vectors might be faster... then, yes, we end up back where we are today. This is the classic marketing position: vectors are great, if we ignore the bad stuff. The problem is that, in making all those assumptions, we ignore what we've actually learned over many years.
   
   Until I see some numbers, I'm not convinced that there is ever a case, other 
than in ideal lab conditions, where the gain from vectors in some operations 
makes up for the cost everywhere else. Still, if we wanted to find out, we 
could tinker a bit:
   
   * Limit vector batches to a decent number of records, say 1K. Avoid the 4K, 8K or 64K behemoths. That will be friendlier to memory usage and will deliver results to downstream fragments faster. And, since we don't actually have SIMD support, our SIMD features remain unaffected.
   * In exchanges, limit outgoing batches to a small row count. Add a timeout: if the batch is not filled within, say, x seconds, ship what we have to avoid near-deadlock at scale. (The first sketch after this list shows the shape of this, together with the row cap above.)
   * Since the Gandiva code from Arrow has to, at its core, work on a block of direct memory, use it for a few select operations. That is, if we see that a projection matches one of your ideal cases, generate Gandiva code instead of Java code. If the operations are real-world messy, then stick with the Java code we have. (The second sketch after this list shows one way to make that choice.)
   * Even if we moved toward a row-based approach, keep the ability for a row to comprise multiple "column groups", each stored either as individual columns or as a packed group of columns. For example, when reading from a file, one "column group" contains the file data and another holds the "implicit" fields. If we do this, we can split out single columns when we want to run the hash (or other) function efficiently. (See the third sketch after this list.)
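
   On the first two bullets, the mechanics are cheap to prototype. A hypothetical sketch (every class name below is invented for illustration; none of this is real Drill code) of a sender that ships a partial batch on whichever comes first, a row cap or a deadline:

   ```java
   import java.util.concurrent.TimeUnit;

   // Hypothetical sketch only: BatchBuilder, Row and Batch are invented
   // stand-ins, not real Drill classes.
   class BoundedBatchSender {
     private static final int MAX_ROWS = 1024; // "a decent number", not 64K
     private static final long MAX_WAIT_NANOS = TimeUnit.SECONDS.toNanos(1); // the "x seconds"

     private final BatchBuilder builder = new BatchBuilder();
     private long firstRowNanos = -1;

     void addRow(Row row) {
       if (firstRowNanos < 0) {
         firstRowNanos = System.nanoTime();
       }
       builder.add(row);
       if (builder.rowCount() >= MAX_ROWS
           || System.nanoTime() - firstRowNanos >= MAX_WAIT_NANOS) {
         flush();
       }
     }

     // A real version also needs a timer so a lone row still ships when no
     // further rows ever arrive; this sketch only checks on each addRow().
     void flush() {
       if (builder.rowCount() > 0) {
         send(builder.harvest()); // ship the partial batch; avoids near-deadlock at scale
       }
       firstRowNanos = -1;
     }

     private void send(Batch batch) { /* network send elided */ }

     // Minimal stubs so the sketch is self-contained.
     static class Row {}
     static class Batch {}
     static class BatchBuilder {
       private int rows;
       void add(Row r) { rows++; }
       int rowCount() { return rows; }
       Batch harvest() { rows = 0; return new Batch(); }
     }
   }
   ```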
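
   On the Gandiva bullet, the decision would live in the projection operator's setup: whitelist the "ideal" shapes and hand only those to Gandiva. Again, every name below is invented; the actual Gandiva call (building an expression tree and projector through Arrow's Java bindings) is deliberately elided:

   ```java
   // Hypothetical dispatch sketch -- every name here is invented for
   // illustration. The idea: only the "ideal" expression shapes go to
   // Gandiva; everything else stays on Drill's existing Java codegen path.
   class ProjectionPlanner {

     enum Engine { GANDIVA, JAVA_CODEGEN }

     static class Expr {
       final String op;          // e.g. "add", "multiply", "regexp_replace"
       final boolean nullable;   // nullable inputs complicate SIMD paths
       final boolean fixedWidth; // VarChar et al. are not fixed width
       Expr(String op, boolean nullable, boolean fixedWidth) {
         this.op = op; this.nullable = nullable; this.fixedWidth = fixedWidth;
       }
     }

     // Whitelist of cheap, fixed-width arithmetic we would trust to Gandiva.
     private static final java.util.Set<String> SIMPLE_OPS =
         java.util.Set.of("add", "subtract", "multiply", "divide");

     Engine choose(Expr expr) {
       boolean ideal = expr.fixedWidth && !expr.nullable && SIMPLE_OPS.contains(expr.op);
       return ideal ? Engine.GANDIVA : Engine.JAVA_CODEGEN;
     }
   }
   ```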
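
   And on the last bullet, a "column group" batch is really just a list of groups, each free to store its columns however it likes; a columnar group can then hand a whole column to a hash kernel directly. A rough, hypothetical shape:

   ```java
   import java.util.List;
   import java.util.Map;

   // Hypothetical sketch: a batch is a list of "column groups". None of
   // these are real Drill classes.
   interface ColumnGroup {
     List<String> columnNames();
     Object cellValue(String column, int rowIndex);
   }

   // A group held column-wise: cheap to hand one whole column to a hash kernel.
   // A row-packed group would implement the same interface, minus the fast path.
   class ColumnarGroup implements ColumnGroup {
     private final Map<String, int[]> columns; // int columns only, to keep the sketch small

     ColumnarGroup(Map<String, int[]> columns) { this.columns = columns; }

     public List<String> columnNames() { return List.copyOf(columns.keySet()); }
     public Object cellValue(String column, int row) { return columns.get(column)[row]; }

     int[] wholeColumn(String column) { return columns.get(column); } // fast path for hashing
   }

   // A batch: e.g. one group for file data, another for the "implicit" fields.
   class GroupedBatch {
     final List<ColumnGroup> groups;
     final int rowCount;

     GroupedBatch(List<ColumnGroup> groups, int rowCount) {
       this.groups = groups;
       this.rowCount = rowCount;
     }
   }
   ```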
   
   There is no free lunch. If we focus only on the ideal cases, then any gain we get in those cases is more than given back by the overall complexity and slowness of the other cases.
   
   At some point, someone's got to actually run some benchmarks and gather real facts...
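
   If anyone picks that up, JMH is the standard harness on the JVM. A bare-bones skeleton (the kernel is a placeholder; the point is parameterizing the batch size so the 1K-vs-64K question gets measured rather than argued -- remember to normalize per row when comparing sizes):

   ```java
   import java.util.concurrent.ThreadLocalRandom;
   import org.openjdk.jmh.annotations.Benchmark;
   import org.openjdk.jmh.annotations.Param;
   import org.openjdk.jmh.annotations.Scope;
   import org.openjdk.jmh.annotations.Setup;
   import org.openjdk.jmh.annotations.State;
   import org.openjdk.jmh.infra.Blackhole;

   @State(Scope.Thread)
   public class BatchSizeBenchmark {

     // The question on the table: do 1K batches beat the 64K behemoths?
     @Param({"1024", "4096", "65536"})
     int batchSize;

     int[] batch;

     @Setup
     public void setup() {
       batch = new int[batchSize];
       ThreadLocalRandom rnd = ThreadLocalRandom.current();
       for (int i = 0; i < batchSize; i++) {
         batch[i] = rnd.nextInt();
       }
     }

     @Benchmark
     public void processBatch(Blackhole bh) {
       // Placeholder kernel: stand-in for a real operator (hash, project, etc.).
       int acc = 0;
       for (int v : batch) {
         acc += v * 31;
       }
       bh.consume(acc);
     }
   }
   ```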

