paul-rogers commented on issue #2421:
URL: https://github.com/apache/drill/issues/2421#issuecomment-1005316421


   @jnturton, thanks for converting the discussion to a ticket. Thanks 
especially for dealing with the formatting translation!
    
   @luocooong, thanks for the very impressive analysis, and the wonderful 
images!
   
   As noted earlier, columnar is a huge win for storage. It is less so for tools 
that work a row at a time, such as Drill or most data pipelines. It is easy to 
see why. A Parquet file may have 100 or more columns. Your query uses 10 of 
them, so that's all we read. A huge win. But, inside Drill, we'll mess with all 
10 of your columns. You are probably doing filtering, calculations, 
aggregation, grouping, etc. There is a _reason_ you selected those columns: you 
want to work with them. And, Drill will do so a row at a time.
   
   Hence, the usage pattern _within Drill_ seems to favor a row-based layout: we 
get few of the advantages of a columnar format, but all of the cost and 
complexity.
   
   Most SQL operators work a row at a time. So, even if the SQL is messy and we 
have to touch the same column multiple times, the row itself can fit into the 
CPU cache and those redundant accesses are trivially fast. It would be very 
hard to do a similar optimization for columns.
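   
   To make the cache argument concrete, here is a minimal, self-contained Java 
sketch (not Drill code; the row shape, column names, and values are made up for 
illustration) of a row-at-a-time evaluator whose predicate reads the same 
column twice:

```java
public class RowAtATimeSketch {

  // A tiny "row" with a few fields, standing in for one record in a batch.
  static final class Row {
    double price;
    int qty;
    double discount;

    Row(double price, int qty, double discount) {
      this.price = price;
      this.qty = qty;
      this.discount = discount;
    }
  }

  // Hypothetical predicate that reads `price` twice, as messy SQL often does:
  //   WHERE price > 0 AND price * qty * (1 - discount) > 100
  static boolean matches(Row r) {
    return r.price > 0 && r.price * r.qty * (1 - r.discount) > 100;
  }

  public static void main(String[] args) {
    Row[] batch = {
        new Row(10.0, 20, 0.1),
        new Row(-1.0, 5, 0.0),
        new Row(2.0, 10, 0.5),
    };

    // Row-at-a-time loop: each row is fully evaluated before we move on, so the
    // second read of `price` hits data that is already in the CPU cache.
    for (Row r : batch) {
      if (matches(r)) {
        System.out.println("keep: price=" + r.price + ", qty=" + r.qty);
      }
    }
    // A column-at-a-time version would instead make separate passes over the
    // price, qty, and discount arrays, streaming each column through memory
    // again for every sub-expression that references it.
  }
}
```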
   
   Your OLAP analysis is correct, but mostly applies to old-school 
Mondrian-style cubes on top of relational tables where the aggregation is 
clearly defined. Most of the queries I've seen for Drill (or Impala or other 
tools) are pretty ugly, with lots of business logic. If the user writes the 
SQL, they won't limit themselves to a small set of standard aggregations; 
they'll do whatever they need to do to get the job done. That, in fact, is why 
SQL endures decade after decade: as awful as it is, you can get your task done 
if you try hard enough.
   
   As a result, although we could vectorize simple cube-rollup calculations, in 
practice, we don't get many of those. So, we need to optimize for what people 
actually do, not the ideal cases.
   
   Let's also remember this: if all you need are simple roll-up calculations, 
and you run those queries again and again (so they are worth optimizing), you 
have a better choice: have your data pipeline write pre-aggregated "cubes". 
Want to see sales-by-product-by-day over the last year on billions of 
transactions? Just build the roll-up cube. It's what Druid does, and it works 
well. With a Spark pipeline, you can create a detail table and a roll-up table, 
both in Parquet, and query both with Drill. So, if we look at the broader 
context, we see that the one place we'd benefit from vectorized operations might 
not even be within Drill; it would happen in Spark, in the data pipeline.
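   
   For the pipeline idea, here is a rough sketch of what the Spark side might 
look like, using Spark's Java Dataset API. The paths, table names, and columns 
are invented for illustration, not anything from the Drill project:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SalesRollupJob {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("sales-rollup")
        .getOrCreate();

    // Detail table: raw transactions, kept in Parquet for ad-hoc Drill queries.
    Dataset<Row> detail = spark.read().parquet("/warehouse/sales_detail");

    // Roll-up "cube": sales by product by day, pre-aggregated once in the
    // pipeline so the repeated dashboard-style queries never scan the billions
    // of detail rows.
    Dataset<Row> rollup = detail
        .groupBy(col("product_id"), to_date(col("sale_ts")).as("sale_date"))
        .agg(sum("amount").as("total_sales"),
             count(lit(1)).as("txn_count"));

    rollup.write().mode("overwrite").parquet("/warehouse/sales_by_product_day");

    spark.stop();
  }
}
```

   Drill then queries either Parquet table directly; the expensive aggregation 
was already vectorized (and parallelized) by Spark upstream.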
   
   That said, we should also remember the operators that might benefit from a 
columnar layout: those that compute a hash key or a sort key. In this case, we 
could use a hybrid approach: a "row group" with just the required fields (the 
keys), and a separate "row group" with the "payload" of the other fields. Some 
DBs use a similar approach (for storage). With a bit of pipelining, the 
preceding operator could produce its output rows in the format optimal for the 
next operator.
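   
   As a thought experiment, here is a hypothetical sketch of that key/payload 
split (plain Java, not Drill's actual EVF or batch format): the compare- and 
hash-heavy work runs over compact key tuples, and the wide payload is only 
dereferenced by index afterwards:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class HybridBatchSketch {

  // Key "row": just the fields the operator hashes or sorts on, plus an index
  // pointing back at this row's payload. Small, cache-friendly, cheap to move.
  static final class KeyRow {
    final String product;
    final int day;
    final int payloadIndex;

    KeyRow(String product, int day, int payloadIndex) {
      this.product = product;
      this.day = day;
      this.payloadIndex = payloadIndex;
    }

    int hashKey() {
      // Stand-in for whatever hash an exchange or hash-agg would compute.
      return 31 * product.hashCode() + day;
    }
  }

  public static void main(String[] args) {
    // Payload "row group": all the wide columns the sort itself never inspects.
    List<String> payload = List.of("row-0 details", "row-1 details", "row-2 details");

    // Key "row group": compact tuples for the compare/hash-heavy work.
    KeyRow[] keys = {
        new KeyRow("pears", 12, 0),
        new KeyRow("apples", 3, 1),
        new KeyRow("apples", 1, 2),
    };

    // Sort only the key rows; the payload is fetched by index afterwards, so
    // its wide fields never get shuffled during the comparison phase.
    Arrays.sort(keys, Comparator.comparing((KeyRow k) -> k.product)
        .thenComparingInt(k -> k.day));

    for (KeyRow k : keys) {
      System.out.println(k.product + "/" + k.day + " -> "
          + payload.get(k.payloadIndex) + " (hash " + k.hashKey() + ")");
    }
  }
}
```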
   
   It would be great if we could mock up some simple test cases and try out the 
ideas. Doing so is not simple at present, but it would become possible once we 
do the intermediate raw-vector-to-EVF conversion. Our experience tells us what we 
might expect to find, but we need to do the experiments to tell us what's 
actually true.

