[GitHub] [incubator-druid] clintropolis commented on a change in pull request #6794: Query vectorization.

GitBox Sun, 07 Jul 2019 19:03:25 -0700

clintropolis commented on a change in pull request #6794: Query vectorization.
URL: https://github.com/apache/incubator-druid/pull/6794#discussion_r300890249


 ##########
 File path: docs/content/querying/query-context.md
 ##########
 @@ -60,3 +60,31 @@ In addition, some query types offer context parameters specific to that query ty
 ### GroupBy queries
 
 See [GroupBy query context](groupbyquery.html#query-context).
+
+### Vectorizable queries
+
+The GroupBy and Timeseries query types can run in _vectorized_ mode, which speeds up query execution by processing
+batches of rows at a time. Not all queries can be vectorized. In particular, vectorization currently has the following
+requirements:
+
+- All query-level filters must either be able to run on bitmap indexes or must offer vectorized row-matchers. These
+include "selector", "bound", "in", "like", "regex", "search", "and", "or", and "not".
+- All filters in filtered aggregators must offer vectorized row-matchers.
+- All aggregators must offer vectorized implementations. These include "count", "doubleSum", "floatSum", "longSum",
+"hyperUnique", and "filtered".
+- No virtual columns.
+- For GroupBy: All dimension specs must be "default" (no extraction functions or filtered dimension specs).
+- For GroupBy: No multi-value dimensions.
+- For Timeseries: No "descending" order.
+- Only immutable segments (not real-time).
+
+Other query types (like TopN, Scan, Select, and Search) ignore the "vectorize" parameter, and will execute without
+vectorization. These query types will ignore the "vectorize" parameter even if it is set to `"force"`.
+
+Vectorization is an alpha-quality feature as of Druid #{DRUIDVERSION}. We heartily welcome any feedback and testing
+from the community as we work to battle-test it.
+
+|property|default| description|
+|--------|-------|------------|
+|vectorize|`false`|Enables or disables vectorized query execution. Possible values are `false` (disabled), `true` (enabled if possible, disabled otherwise, on a per-segment basis), and `force` (enabled, and groupBy or timeseries queries that cannot be vectorized will fail). The `"force"` setting is meant to aid in testing, and is not generally useful in production (since real-time segments can never be processed with vectorized execution, any queries on real-time data will fail).|
 
 Review comment:
   What is the reason for `force` exploding on any query that touches real-time 
data, instead of just ignoring it and running non-vectorized? Is it a hassle to 
ignore it for incremental indexes only?
   
   I think I understand the utility in the context of historical servers, as a 
mechanism to ensure that the query being tested fully supports vectorization 
as we add more vectorized code paths. But if we don't anticipate vectorization 
being worth it for incremental indexes, then the case for exploding here seems 
weaker. Or do you imagine that realtime will someday be vectorized too, and 
want to leave this in as an option?
   
   This is totally a nit btw, and I'm not saying it should be changed. Since 
this option is just for testing, we can just control the interval being queried 
so that it doesn't hit any incremental indexes, and avoid the issue entirely.
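
   To make the workaround above concrete, here is a minimal sketch that is not 
taken from the PR. It models the three documented `vectorize` values in Python 
and builds a native Timeseries query whose interval is chosen to stay on 
historical segments. The helper names (`resolve_vectorize`, 
`build_timeseries_query`), the datasource name, and the interval are all 
assumptions for illustration; Druid's real decision logic is in Java and may 
differ in detail.

```python
import json


def resolve_vectorize(mode, can_vectorize):
    """Model the documented semantics of the "vectorize" context parameter.

    Returns True if a segment would be processed with vectorized execution.
    Raises ValueError when mode is "force" but the segment cannot be vectorized
    (for example, a real-time segment). Illustrative only, not Druid code.
    """
    mode = str(mode).lower()
    if mode == "false":
        return False
    if mode == "true":
        # Enabled if possible, disabled otherwise, on a per-segment basis.
        return can_vectorize
    if mode == "force":
        if not can_vectorize:
            raise ValueError("query cannot be vectorized")
        return True
    raise ValueError(f"unknown vectorize mode: {mode!r}")


def build_timeseries_query(datasource, interval, vectorize="force"):
    """Build a native Timeseries query dict with the vectorize context flag.

    The caller picks an interval that ends before any real-time data, so that
    "force" never encounters an incremental index.
    """
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "all",
        "intervals": [interval],
        "aggregations": [{"type": "count", "name": "rows"}],
        "context": {"vectorize": vectorize},
    }


if __name__ == "__main__":
    # Hypothetical datasource and interval, chosen only for the example.
    query = build_timeseries_query("wikipedia", "2019-01-01/2019-02-01")
    print(json.dumps(query, indent=2))
```

   With `true`, a segment that cannot be vectorized simply falls back 
(`resolve_vectorize("true", False)` returns `False`), whereas `force` raises 
for the same segment, which is the behavior being questioned here.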

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

