clintropolis commented on issue #8822: optimize numeric column null value 
checking for low filter selectivity (more rows)
URL: https://github.com/apache/incubator-druid/pull/8822#issuecomment-549717755
 
 
   >The heatmaps look super cool! (although I don't think I fully understand 
them yet :| ) What did you use to build them?
   
   Hah, thanks, I used R with ggplot2 to make them. I'll try to clean up the 
code and attach it, if I have a chance, in case anyone else wants to do some 
benchmarking or tinker with the results. As for what they mean, I'll try my best 
to explain it as succinctly as possible 😅.
   
   The benchmark I added in this PR, 
`NullHandlingBitmapGetVsIteratorBenchmark`, is simulating approximately what 
happens during query processing on a historical for numeric columns with nulls when 
used with something like a `NullableAggregator`, which is a wrapper around 
another `Aggregator` to ignore `null` values or delegate aggregation to the 
wrapped aggregator for rows that have actual values.
   
   When SQL compatible null handling is enabled, numeric columns are stored 
with 2 parts if nulls are present: the column itself, and a bitmap that has a 
set bit for each null value. At query time, filters are evaluated to compute 
something called an `Offset`, which is basically just the set of rows that are 
taking part in the query, and is used to create a column value/vector selector 
for those rows from the underlying column. Selectors have an `isNull` method 
which can be used to determine if a particular row is a `null`, and for numeric 
columns this is checking if that row is set on the bitmap. So mechanically, 
`NullableAggregator` will check each row from the selector to see if it is null 
(through the underlying bitmap). If it is, the row is ignored; if not, it is 
delegated to the underlying `Aggregator` to do whatever it does to compute the result.
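
   As a rough illustration of the mechanics described above, here is a minimal 
sketch of a null-skipping wrapper. The `Selector` and `Aggregator` interfaces and 
all method names below are hypothetical simplifications, not Druid's actual API, 
and a `java.util.BitSet` stands in for both the `Offset` and the null bitmap.

```java
import java.util.BitSet;

final class NullableAggregatorSketch {
  // Hypothetical stand-in for a numeric column selector (not Druid's interface).
  interface Selector {
    boolean isNull(int row);
    long getLong(int row);
  }

  // Hypothetical stand-in for an aggregator (not Druid's interface).
  interface Aggregator {
    void aggregate(int row);
    long get();
  }

  // The column is stored in two parts: the values, and a bitmap with a set bit per null row.
  static Selector columnSelector(long[] values, BitSet nullBitmap) {
    return new Selector() {
      @Override public boolean isNull(int row) { return nullBitmap.get(row); }
      @Override public long getLong(int row) { return values[row]; }
    };
  }

  // A plain sum that knows nothing about nulls.
  static Aggregator longSum(Selector selector) {
    return new Aggregator() {
      private long sum = 0;
      @Override public void aggregate(int row) { sum += selector.getLong(row); }
      @Override public long get() { return sum; }
    };
  }

  // The wrapper: skip null rows, delegate everything else to the wrapped aggregator.
  static Aggregator nullable(Selector selector, Aggregator delegate) {
    return new Aggregator() {
      @Override public void aggregate(int row) {
        if (!selector.isNull(row)) {
          delegate.aggregate(row);
        }
      }
      @Override public long get() { return delegate.get(); }
    };
  }

  // Visit only the rows selected by the offset, as the filters would dictate.
  static long sumSelectedRows(long[] values, BitSet nullBitmap, BitSet offset) {
    Selector selector = columnSelector(values, nullBitmap);
    Aggregator agg = nullable(selector, longSum(selector));
    for (int row = offset.nextSetBit(0); row >= 0; row = offset.nextSetBit(row + 1)) {
      agg.aggregate(row);
    }
    return agg.get();
  }
}
```

   The point of the sketch is only that every selected row costs one null check 
against the bitmap before any aggregation happens.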
   
   The benchmark simplifies this concept into using a `BitSet` to simulate the 
`Offset`, an `ImmutableBitmap` for the null value bitmap, and a for loop that 
iterates over the "rows" selected by the `BitSet` to emulate the behavior of 
the aggregator on the selector, checking for set bits in the `ImmutableBitmap` 
for each index like `isNull` would be doing.
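
   The two strategies that the benchmark name refers to can be sketched roughly 
as follows. This is not the benchmark code from the PR: a `java.util.BitSet` is 
used in place of `ImmutableBitmap`, a `nextSetBit` cursor stands in for the 
bitmap's iterator, and the class and method names are made up for illustration.

```java
import java.util.BitSet;

final class GetVsIteratorSketch {
  // Strategy 1: a random-access get() on the null bitmap for every selected row.
  static int countNonNullWithGet(BitSet offset, BitSet nullBitmap) {
    int nonNull = 0;
    for (int row = offset.nextSetBit(0); row >= 0; row = offset.nextSetBit(row + 1)) {
      if (!nullBitmap.get(row)) {
        nonNull++;
      }
    }
    return nonNull;
  }

  // Strategy 2: advance a cursor over the null bitmap's set bits in step with the offset.
  static int countNonNullWithIterator(BitSet offset, BitSet nullBitmap) {
    int nonNull = 0;
    int nextNull = nullBitmap.nextSetBit(0); // -1 once the null bitmap is exhausted
    for (int row = offset.nextSetBit(0); row >= 0; row = offset.nextSetBit(row + 1)) {
      // Skip null positions that fall before the current selected row.
      while (nextNull >= 0 && nextNull < row) {
        nextNull = nullBitmap.nextSetBit(nextNull + 1);
      }
      if (nextNull != row) {
        nonNull++;
      }
    }
    return nonNull;
  }
}
```

   Both methods return the same count; the difference is in how much work is 
spent on the null bitmap. The `get` version pays once per selected row, while 
the cursor version pays once per null position it has to step past.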
   
   Translating this into the heatmaps, the y axis is showing the effects of 
differences in density of the null bitmap (bottom is a few null values, top is 
nearly all rows are null), the x axis is the differences in the number of rows 
that our selector will select (left side selects very few rows, right scans 
nearly all rows), and the z axis is the difference in benchmark operation time 
between using `bitmap.get` and using an iterator (or peekable iterator) from the 
null bitmap to move along with the iterator on the selectivity bitset. Further, 
some of the heatmaps have translated the raw benchmark times into the _time per 
row_ by scaling the time by how many rows are selected, to standardize 
measurement across the x axis, making it easier to compare the 2 strategies.
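
   The per-row scaling amounts to simple arithmetic, sketched below. The method 
names and the nanosecond unit are assumptions for illustration only; the actual 
heatmaps were produced in R, and no real benchmark numbers appear here.

```java
final class TimePerRowSketch {
  // Scale a raw operation time by the number of rows the selector visited.
  static double nanosPerRow(double totalNanos, int selectedRows) {
    if (selectedRows <= 0) {
      throw new IllegalArgumentException("selectedRows must be positive");
    }
    return totalNanos / selectedRows;
  }

  // The plotted quantity: per-row difference between the two strategies.
  // A positive result means get() was slower per row than the iterator.
  static double perRowDifference(double getNanos, double iteratorNanos, int selectedRows) {
    return nanosPerRow(getNanos, selectedRows) - nanosPerRow(iteratorNanos, selectedRows);
  }
}
```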
   
   Sorry, that didn't end up being so short... I .. hope this didn't make it 
more confusing 😜 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
