gianm commented on code in PR #18252: URL: https://github.com/apache/druid/pull/18252#discussion_r2244441789
########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. + +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. 
+ +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. Review Comment: Typo: "form" should be "from". ########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. 
+ +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. + +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. + +## Enable Dart + +To enable Dart, add the following line to your `_common/common.runtime.properties` files: + +``` +druid.msq.dart.enabled = true +``` + +### Configure resource consumption + +You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing. + +For Brokers, you can set the following configs: + +| Property name | Description | Default | +|---|---|---| +| `druid.msq.dart.controller.concurrentQueries` | Maximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. 
Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (`druid.msq.dart.worker.concurrentQueries` ).| 1 | +| `druid.msq.dart.query.context.targetPartitionsPerWorker` |Number of available threads on workers (`druid.processing.numThreads`) | 1 (Multithreading is turned off on Historicals) | + + +For Historicals, you can set the following configs: + +| Property name | Description | Default Value | +|---|---|---| +| `druid.msq.dart.worker.concurrentQueries` | Maximum number of query workers that can run concurrently on that Historical. Set this to a value equal to or larger than `druid.msq.dart.controller.concurrentQueries` on your Brokers. If you don't, queries can get stuck waiting for each other. | Equal to the number of merge buffers | +| `druid.msq.dart.worker.heapFraction` | Maximum amount of heap available for use across all Dart queries as a decimal. | 0.35 (35% of heap) | + + +## Run a Dart query + +Once enabled, you can use Dart in the Druid console or the SQL query API to issue queries. + +### Druid console + +In the **Query** view, select **Engine: SQL (Dart)** from the engine selector menu. + +### API + +Dart uses the SQL endpoint `/druid/v2/sql`. To use Dart, include the query context parameter `engine` and set it to `msq-dart`: + +<Tabs> +<TabItem value="SET" label="SET" default> + +As part of your query using `SET engine = 'msq-dart'`: + +```json +"query":"SET \"engine\"='msq-dart';\nSELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC" Review Comment: This example is not valid JSON. The `"query"` key-value pair needs to be enclosed in `{` and `}` to form a JSON object. ########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. 
+--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. + +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. + +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. 
+ +## Enable Dart + +To enable Dart, add the following line to your `_common/common.runtime.properties` files: + +``` +druid.msq.dart.enabled = true +``` + +### Configure resource consumption + +You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing. + +For Brokers, you can set the following configs: + +| Property name | Description | Default | +|---|---|---| +| `druid.msq.dart.controller.concurrentQueries` | Maximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (`druid.msq.dart.worker.concurrentQueries` ).| 1 | +| `druid.msq.dart.query.context.targetPartitionsPerWorker` |Number of available threads on workers (`druid.processing.numThreads`) | 1 (Multithreading is turned off on Historicals) | + + +For Historicals, you can set the following configs: + +| Property name | Description | Default Value | +|---|---|---| +| `druid.msq.dart.worker.concurrentQueries` | Maximum number of query workers that can run concurrently on that Historical. Set this to a value equal to or larger than `druid.msq.dart.controller.concurrentQueries` on your Brokers. If you don't, queries can get stuck waiting for each other. | Equal to the number of merge buffers | Review Comment: Important to note that this cannot be set higher than the number of merge buffers. The default in most cases should be left alone. 
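As an illustration of that constraint, a Historical's `runtime.properties` might look like the sketch below. The value 4 is assumed for the example and does not come from the PR; `druid.processing.numMergeBuffers` is the standard Druid setting for the merge buffer count.

```
# Illustrative value only
druid.processing.numMergeBuffers = 4

# Cannot be set higher than numMergeBuffers. Leaving it unset keeps the
# default (equal to the number of merge buffers), which is usually what you want.
# druid.msq.dart.worker.concurrentQueries = 4
```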
########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. + +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. 
+ +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. + +## Enable Dart + +To enable Dart, add the following line to your `_common/common.runtime.properties` files: + +``` +druid.msq.dart.enabled = true +``` + +### Configure resource consumption + +You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing. + +For Brokers, you can set the following configs: + +| Property name | Description | Default | +|---|---|---| +| `druid.msq.dart.controller.concurrentQueries` | Maximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (`druid.msq.dart.worker.concurrentQueries` ).| 1 | +| `druid.msq.dart.query.context.targetPartitionsPerWorker` |Number of available threads on workers (`druid.processing.numThreads`) | 1 (Multithreading is turned off on Historicals) | Review Comment: Some alternative wording I think makes this clearer: > To parallelize queries as much as possible on each Historical, set this to the value of `druid.processing.numThreads` on the Historicals. 
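For example, the suggested wording would correspond to something like the following sketch. The thread count of 8 is an assumed value, and the placement follows the table in the diff, which lists `targetPartitionsPerWorker` under the Broker configs.

```
# Historical runtime.properties (assumed value)
druid.processing.numThreads = 8

# Broker runtime.properties: match numThreads on the Historicals
druid.msq.dart.query.context.targetPartitionsPerWorker = 8
```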
########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. + +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. 
+ +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. + +## Enable Dart + +To enable Dart, add the following line to your `_common/common.runtime.properties` files: + +``` +druid.msq.dart.enabled = true +``` + +### Configure resource consumption + +You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing. + +For Brokers, you can set the following configs: + +| Property name | Description | Default | +|---|---|---| +| `druid.msq.dart.controller.concurrentQueries` | Maximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (`druid.msq.dart.worker.concurrentQueries` ).| 1 | +| `druid.msq.dart.query.context.targetPartitionsPerWorker` |Number of available threads on workers (`druid.processing.numThreads`) | 1 (Multithreading is turned off on Historicals) | + + +For Historicals, you can set the following configs: + +| Property name | Description | Default Value | +|---|---|---| +| `druid.msq.dart.worker.concurrentQueries` | Maximum number of query workers that can run concurrently on that Historical. Set this to a value equal to or larger than `druid.msq.dart.controller.concurrentQueries` on your Brokers. 
If you don't, queries can get stuck waiting for each other. | Equal to the number of merge buffers | +| `druid.msq.dart.worker.heapFraction` | Maximum amount of heap available for use across all Dart queries as a decimal. | 0.35 (35% of heap) | + + +## Run a Dart query + +Once enabled, you can use Dart in the Druid console or the SQL query API to issue queries. + +### Druid console + +In the **Query** view, select **Engine: SQL (Dart)** from the engine selector menu. + +### API + +Dart uses the SQL endpoint `/druid/v2/sql`. To use Dart, include the query context parameter `engine` and set it to `msq-dart`: + +<Tabs> +<TabItem value="SET" label="SET" default> + +As part of your query using `SET engine = 'msq-dart'`: + +```json +"query":"SET \"engine\"='msq-dart';\nSELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC" +``` + +</TabItem> + +<TabItem value="context_block" label="Context block"> + +As part of a `context` block: + +```json +{ + "query": "SELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC", + "context": { + "engine": "msq-dart" + } +} +``` + + + </TabItem> + </Tabs> + +## Query context parameters + +You can use query context parameters to control Dart's behavior. The following table lists the supported query context parameters: Review Comment: Please mention that Dart can also use any other SQL context parameters, except as otherwise noted. ########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. 
+--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. + +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. + +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. 
+ +## Enable Dart + +To enable Dart, add the following line to your `_common/common.runtime.properties` files: + +``` +druid.msq.dart.enabled = true +``` + +### Configure resource consumption + +You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing. + +For Brokers, you can set the following configs: + +| Property name | Description | Default | +|---|---|---| +| `druid.msq.dart.controller.concurrentQueries` | Maximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (`druid.msq.dart.worker.concurrentQueries` ).| 1 | +| `druid.msq.dart.query.context.targetPartitionsPerWorker` |Number of available threads on workers (`druid.processing.numThreads`) | 1 (Multithreading is turned off on Historicals) | + + +For Historicals, you can set the following configs: + +| Property name | Description | Default Value | +|---|---|---| +| `druid.msq.dart.worker.concurrentQueries` | Maximum number of query workers that can run concurrently on that Historical. Set this to a value equal to or larger than `druid.msq.dart.controller.concurrentQueries` on your Brokers. If you don't, queries can get stuck waiting for each other. | Equal to the number of merge buffers | +| `druid.msq.dart.worker.heapFraction` | Maximum amount of heap available for use across all Dart queries as a decimal. | 0.35 (35% of heap) | + + +## Run a Dart query + +Once enabled, you can use Dart in the Druid console or the SQL query API to issue queries. + +### Druid console + +In the **Query** view, select **Engine: SQL (Dart)** from the engine selector menu. 
+ +### API + +Dart uses the SQL endpoint `/druid/v2/sql`. To use Dart, include the query context parameter `engine` and set it to `msq-dart`: + +<Tabs> +<TabItem value="SET" label="SET" default> + +As part of your query using `SET engine = 'msq-dart'`: + +```json +"query":"SET \"engine\"='msq-dart';\nSELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC" +``` + +</TabItem> + +<TabItem value="context_block" label="Context block"> + +As part of a `context` block: + +```json +{ + "query": "SELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC", + "context": { + "engine": "msq-dart" + } +} +``` + + + </TabItem> + </Tabs> + +## Query context parameters + +You can use query context parameters to control Dart's behavior. The following table lists the supported query context parameters: + +| Parameter | Description | Default value | +|---|---|---| +| `finalizeAggregations` | Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | `true` | +| `includeSegmentSource` | Controls the sources that are queried for results in addition to the segments present on deep storage. Can be `NONE` or `REALTIME`. If this value is `NONE`, only non-realtime (published and used) segments will be downloaded from deep storage. If this value is `REALTIME`, results will also be included from realtime tasks.| `REALTIME` | +| `removeNullBytes` |The MSQ engine cannot process null bytes in strings and throws `InvalidNullByteFault` if it encounters them in the source data. 
If the parameter is set to true, The MSQ engine will remove the null bytes in string fields when reading the data. | `false` | +|`maxConcurrentStages`|Number of stages that can run concurrently for a query. A higher number can potentially improve pipelining but results in less memory available for each stage.|2| +|`maxNonLeafWorkers`|Number of workers to use for stages beyond the leaf stage| 1 (Scatter-gather style)| +| `sqlJoinAlgorithm` | Algorithm to use for JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for sort-merge join. Affects all JOIN operations in the query. This is a hint to the MSQ engine and the actual joins in the query may proceed in a different way than specified. See [Joins](#joins) for more details. | `broadcast` | +|`targetPartitionsPerWorker`|Number of partitions Druid generates for each worker. This number controls how much parallelism can be maintained throughout a query.|1| + + + ## Known issues and limitations + +- Dart doesn't do the following: + - verify that `druid.msq.dart.controller.concurrentQueries` is set properly. If set too high, queries can get stuck on each other. + - use the query cache. + - perform query prioritization or laning Review Comment: inconsistent punctuation; some lines end with `.` and some don't ########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +--- +id: dart +title: "SQL queries using the Dart query profile" +sidebar_label: "Dart query profile" +description: The Dart query profile for the MSQ engine is an alternative to the native query engine that offers better parallelism and better performance for certain types of queries. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. 
The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ License); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ AS IS BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +:::info[Experimental] + +Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines. + +::: + + +Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers. + +Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve: + +- large joins, which Dart performs using parallel sort-merges +- high-cardinality exact groupBys +- high-cardinality exact count distinct + +When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance. + +By default, Dart queries include results form published segments and realtime tasks. + +## Enable Dart + +To enable Dart, add the following line to your `_common/common.runtime.properties` files: + +``` +druid.msq.dart.enabled = true +``` + +### Configure resource consumption + +You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing. 
+ +For Brokers, you can set the following configs: + +| Property name | Description | Default | +|---|---|---| +| `druid.msq.dart.controller.concurrentQueries` | Maximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (`druid.msq.dart.worker.concurrentQueries` ).| 1 | +| `druid.msq.dart.query.context.targetPartitionsPerWorker` |Number of available threads on workers (`druid.processing.numThreads`) | 1 (Multithreading is turned off on Historicals) | + + +For Historicals, you can set the following configs: + +| Property name | Description | Default Value | +|---|---|---| +| `druid.msq.dart.worker.concurrentQueries` | Maximum number of query workers that can run concurrently on that Historical. Set this to a value equal to or larger than `druid.msq.dart.controller.concurrentQueries` on your Brokers. If you don't, queries can get stuck waiting for each other. | Equal to the number of merge buffers | +| `druid.msq.dart.worker.heapFraction` | Maximum amount of heap available for use across all Dart queries as a decimal. | 0.35 (35% of heap) | + + +## Run a Dart query + +Once enabled, you can use Dart in the Druid console or the SQL query API to issue queries. + +### Druid console + +In the **Query** view, select **Engine: SQL (Dart)** from the engine selector menu. + +### API + +Dart uses the SQL endpoint `/druid/v2/sql`. 
To use Dart, include the query context parameter `engine` and set it to `msq-dart`: + +<Tabs> +<TabItem value="SET" label="SET" default> + +As part of your query using `SET engine = 'msq-dart'`: + +```json +"query":"SET \"engine\"='msq-dart';\nSELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC" +``` + +</TabItem> + +<TabItem value="context_block" label="Context block"> + +As part of a `context` block: + +```json +{ + "query": "SELECT\n user,\n commentLength,\n COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC", + "context": { + "engine": "msq-dart" + } +} +``` + + + </TabItem> + </Tabs> + +## Query context parameters + +You can use query context parameters to control Dart's behavior. The following table lists the supported query context parameters: + +| Parameter | Description | Default value | +|---|---|---| +| `finalizeAggregations` | Determines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see [SQL aggregation functions](../querying/sql-aggregations.md). | `true` | +| `includeSegmentSource` | Controls the sources that are queried for results in addition to the segments present on deep storage. Can be `NONE` or `REALTIME`. If this value is `NONE`, only non-realtime (published and used) segments will be downloaded from deep storage. If this value is `REALTIME`, results will also be included from realtime tasks.| `REALTIME` | +| `removeNullBytes` |The MSQ engine cannot process null bytes in strings and throws `InvalidNullByteFault` if it encounters them in the source data. 
If the parameter is set to true, The MSQ engine will remove the null bytes in string fields when reading the data. | `false` | +|`maxConcurrentStages`|Number of stages that can run concurrently for a query. A higher number can potentially improve pipelining but results in less memory available for each stage.|2| +|`maxNonLeafWorkers`|Number of workers to use for stages beyond the leaf stage| 1 (Scatter-gather style)| +| `sqlJoinAlgorithm` | Algorithm to use for JOIN. Use `broadcast` (the default) for broadcast hash join or `sortMerge` for sort-merge join. Affects all JOIN operations in the query. This is a hint to the MSQ engine and the actual joins in the query may proceed in a different way than specified. See [Joins](#joins) for more details. | `broadcast` | Review Comment: Please move this to the top, since it's the most likely one that people will actually need to set
########## docs/querying/dart.md: ########## @@ -0,0 +1,140 @@ +|`targetPartitionsPerWorker`|Number of partitions Druid generates for each worker. This number controls how much parallelism can be maintained throughout a query.|1| + + + ## Known issues and limitations + +- Dart doesn't do the following: Review Comment: please also mention that `useApproximateTopN` is not implemented; Dart always does exact topNs
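
To illustrate the earlier comment that `sqlJoinAlgorithm` is the parameter readers will most likely need to set, here is a rough sketch of a request body for `/druid/v2/sql`. It combines the `engine` context parameter from the API section with `sqlJoinAlgorithm` set to `sortMerge`. The table and column names (`orders`, `customers`, `customer_id`, `region`) are hypothetical placeholders and do not come from the PR.

```json
{
  "query": "SELECT c.region, COUNT(*) AS order_count\nFROM orders o\nJOIN customers c ON o.customer_id = c.customer_id\nGROUP BY 1",
  "context": {
    "engine": "msq-dart",
    "sqlJoinAlgorithm": "sortMerge"
  }
}
```

Because the parameter table describes `sqlJoinAlgorithm` as a hint, the engine may still run the join differently from what the context requests.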
