clintropolis commented on a change in pull request #6794: Query vectorization.
URL: https://github.com/apache/incubator-druid/pull/6794#discussion_r300890249
##########
File path: docs/content/querying/query-context.md
##########
@@ -60,3 +60,31 @@ In addition, some query types offer context parameters
specific to that query ty
### GroupBy queries
See [GroupBy query context](groupbyquery.html#query-context).
+
+### Vectorizable queries
+
+The GroupBy and Timeseries query types can run in _vectorized_ mode, which speeds up query execution by processing
+batches of rows at a time. Not all queries can be vectorized. In particular, vectorization currently has the following
+requirements:
+
+- All query-level filters must either be able to run on bitmap indexes or must offer vectorized row-matchers. These
+include "selector", "bound", "in", "like", "regex", "search", "and", "or", and "not".
+- All filters in filtered aggregators must offer vectorized row-matchers.
+- All aggregators must offer vectorized implementations. These include "count", "doubleSum", "floatSum", "longSum",
+"hyperUnique", and "filtered".
+- No virtual columns.
+- For GroupBy: All dimension specs must be "default" (no extraction functions or filtered dimension specs).
+- For GroupBy: No multi-value dimensions.
+- For Timeseries: No "descending" order.
+- Only immutable segments (not real-time).
+
+Other query types (like TopN, Scan, Select, and Search) ignore the "vectorize" parameter, and will execute without
+vectorization. These query types will ignore the "vectorize" parameter even if it is set to `"force"`.
+
+Vectorization is an alpha-quality feature as of Druid #{DRUIDVERSION}. We heartily welcome any feedback and testing
+from the community as we work to battle-test it.
+
+|property|default| description|
+|--------|-------|------------|
+|vectorize|`false`|Enables or disables vectorized query execution. Possible values are `false` (disabled), `true` (enabled if possible, disabled otherwise, on a per-segment basis), and `force` (enabled, and groupBy or timeseries queries that cannot be vectorized will fail). The `"force"` setting is meant to aid in testing, and is not generally useful in production (since real-time segments can never be processed with vectorized execution, any queries on real-time data will fail).|
Review comment:
What is the reason for `force` failing any query on real-time data instead of
ignoring the setting and running non-vectorized? Is it a hassle to ignore it
for incremental indexes only?

I think I understand the utility on historical servers: it is a mechanism to
ensure that the query being tested fully supports vectorization as we add more
vectorized code paths. But if we don't expect vectorization to be worth it for
incremental indexes, then failing here seems less useful. Or do you imagine
that real-time segments might also be vectorized someday, and want to leave
this in as an option?

This is totally a nit, by the way; I'm not saying it should be changed. Since
this option is just for testing, we can simply restrict the queried interval so
it doesn't hit any incremental indexes, and avoid the issue entirely.
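
To make the two behaviors being compared concrete, here is a minimal sketch of
the per-segment decision as the quoted docs describe it, next to the lenient
variant asked about above. It is an illustration only, not Druid's code: the
names `Vectorize`, `should_vectorize`, and `should_vectorize_lenient` are
hypothetical, and `segment_can_vectorize` stands in for all of the listed
requirements (filters, aggregators, immutable segment, and so on).

```python
from enum import Enum


class Vectorize(Enum):
    """Hypothetical stand-in for the three documented "vectorize" values."""
    FALSE = "false"
    TRUE = "true"
    FORCE = "force"


def should_vectorize(mode: Vectorize, segment_can_vectorize: bool) -> bool:
    """Documented behavior: decide per segment whether to run vectorized."""
    if mode is Vectorize.FALSE:
        return False
    if mode is Vectorize.TRUE:
        # Enabled if possible, silently disabled otherwise.
        return segment_can_vectorize
    # FORCE: a segment that cannot be vectorized makes the query fail.
    if not segment_can_vectorize:
        raise RuntimeError("query cannot be vectorized for this segment")
    return True


def should_vectorize_lenient(
    mode: Vectorize, segment_can_vectorize: bool, is_realtime: bool
) -> bool:
    """Variant from the question above: never fail on real-time segments."""
    if is_realtime:
        # Fall back to non-vectorized execution instead of raising.
        return False
    return should_vectorize(mode, segment_can_vectorize)
```

Under the documented behavior, `should_vectorize(Vectorize.FORCE, False)`
raises for a real-time segment, whereas the lenient variant returns `False`
for it and still fails on an immutable segment that cannot be vectorized.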