vtlim commented on code in PR #14609: URL: https://github.com/apache/druid/pull/14609#discussion_r1272808644
########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. Review Comment: The line above uses a forward slash preceding the endpoint but this line and others don't include it ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). 
+ +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). Review Comment: ```suggestion Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). 
+ +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. The currently supported mode is `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). Review Comment: Include a general term of this like the context parameter above? ########## docs/design/architecture.md: ########## @@ -70,12 +70,20 @@ Druid uses deep storage to store any data that has been ingested into the system storage accessible by every Druid server. In a clustered deployment, this is typically a distributed object store like S3 or HDFS, or a network mounted filesystem. In a single-server deployment, this is typically local disk. -Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between -Druid processes. Druid stores data in files called _segments_. Historical processes cache data segments on -local disk and serve queries from that cache as well as from an in-memory cache. 
-This means that Druid never needs to access deep storage -during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space -both in deep storage and across your Historical servers for the data you plan to load. +Druid uses deep storage for the following purposes: + +- As a backup of your data, including those that get loaded onto Historical processes. +- As a way to transfer data in the background between +Druid processes. Druid stores data in files called _segments_. Review Comment: ```suggestion - As a way to transfer data in the background between Druid processes. Druid stores data in files called _segments_. ``` ########## docs/operations/rule-configuration.md: ########## @@ -167,7 +167,7 @@ Set the following properties: - the segment interval starts any time after the rule interval starts. You can use this property to load segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`. -- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. +- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. If you set the replicants for a period to 0 on all tiers, you can still [query the data from deep storage](../querying/query-from-deep-storage.md) Review Comment: ```suggestion - `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. If you set the replicants for a period to 0 on all tiers, you can still [query the data from deep storage](../querying/query-from-deep-storage.md). ``` ########## docs/operations/rule-configuration.md: ########## @@ -167,7 +167,7 @@ Set the following properties: - the segment interval starts any time after the rule interval starts. 
You can use this property to load segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`. -- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. +- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. If you set the replicants for a period to 0 on all tiers, you can still [query the data from deep storage](../querying/query-from-deep-storage.md) Review Comment: What does this mean? >If you set the replicants for a period to 0 on all tiers, ########## docs/querying/query-from-deep-storage.md: ########## @@ -0,0 +1,187 @@ +--- +id: query-deep-storage +title: "Query from deep storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +> Query from deep storage is an experimental feature. + +## Segments in deep storage + +Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. To take advantage of the space savings that querying from deep storage provides though, you need to make sure not all your segments get loaded onto Historical processes. 
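As a hypothetical illustration of the zero-replica case raised in the rule-configuration comment above (the tier name and period are placeholder values, and the exact zero-replica syntax may vary by Druid version), a rule chain like the following keeps only the last month of segments on Historicals while older segments remain only in deep storage:

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": { "_default_tier": 2 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": {}
  }
]
```

Data covered only by the second rule is not loaded onto any Historical tier, but it can still be reached through the `sql/statements` endpoint.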
+ +To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to load only the segments you do want on Historical processes. Review Comment: > only the segments you do want What criteria determine this? The segments corresponding to data you want to query with low latency? ########## docs/operations/durable-storage.md: @@ -0,0 +1,66 @@ +--- +id: durable-storage +title: "Durable storage for the multi-stage query engine" +sidebar_label: "Durable storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +You can use durable storage to improve querying from deep storage and SQL-based ingestion. + +> Note that only S3 is supported as a durable storage location. + +Durable storage for queries from deep storage provides a location to which you can write the results of deep storage queries. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability. 
+ +## Enable durable storage + +To enable durable storage, you need to set the following common service properties: + +``` +druid.msq.intermediate.storage.enable=true +druid.msq.intermediate.storage.type=s3 +druid.msq.intermediate.storage.bucket=YOUR_BUCKET +druid.msq.intermediate.storage.prefix=YOUR_PREFIX +druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir +``` + +For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). + + +## Use durable storage for SQL-based ingestion queries + +When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. + +For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. + +## Use durable storage for queries from deep storage + +When you run a query, include the context parameter `selectDestination` and set it to `DURABLE_STORAGE`. This context parameter configures queries from deep storage to write their results to durable storage. + +## Durable storage clean up + +To prevent durable storage from filling up with temporary files when tasks fail to clean them up, a periodic +cleaner can be scheduled to remove the directories for which there is no controller task running. The cleaner uses +the storage connector to operate on the durable storage. The durable storage location should only be used to store the output +of the cluster's MSQ tasks. If the location contains other files or directories, they will get cleaned up as well. + +Enabling durable storage also enables the use of local disk to store temporary files, such as the intermediate files produced +by the super sorter. 
Tasks will use whatever has been configured for their temporary usage as described in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes) Review Comment: ```suggestion by the super sorter. Tasks will use whatever has been configured for their temporary usage as described in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes). ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. Review Comment: comments here also apply to query-from-deep-storage.md ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. 
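For illustration, the request body for this endpoint can be assembled as in the sketch below. The helper name and the `ASYNC` default are assumptions layered on the context parameters discussed in this review (`executionMode`, `selectDestination`); only the `query` field itself is required.

```python
import json

def statements_request(query, execution_mode="ASYNC", select_destination=None):
    """Build the JSON body for POST /druid/v2/sql/statements.

    execution_mode and select_destination map to the context parameters
    discussed in this review; ASYNC is currently the only supported
    execution mode.
    """
    context = {"executionMode": execution_mode}
    if select_destination is not None:
        # "DURABLE_STORAGE" requires durable storage for MSQ to be enabled.
        context["selectDestination"] = select_destination
    return json.dumps({"query": query, "context": context})

# Example payload matching the docs' sample query:
body = statements_request("SELECT COUNT(*) FROM data_source WHERE foo = 'bar'")
```

A client would POST this body to `https://ROUTER:8888/druid/v2/sql/statements` with a `Content-Type: application/json` header.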
+ +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. The currently supported mode is `ASYNC`. Review Comment: ```suggestion - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`. ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. 
+ +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. The currently supported mode is `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported results format is JSON. Review Comment: ```suggestion - The only supported value for `resultFormat` is JSON. ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. 
+ +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. The currently supported mode is `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported results format is JSON. +- Only the user who submits a query can see the results. + + +### Get query status + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/{queryID} +``` + +Returns the same response as the post API if the query is accepted or running. 
The response for a completed query includes the same information as an in-progress query with several additions: + +- A `result` object that summarizes information about your results, such as the total number of rows and a sample record +- A `pages` object that includes the following information for each page of results: + - `numRows`: the number of rows in that page of results + - `sizeInBytes`: the size of the page + - `id`: the page number that you can use to reference a specific page when you get query results + + +### Get query results + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}/results?page=PAGENUMBER +``` + +Results are separated into pages, so you can use the optional `page` parameter to refine the results you get. When you retrieve the status of a completed query, Druid returns information about the composition of each page and its page number (`id`). + +When getting query results, keep the following in mind: + +- JSON is the only supported result format Review Comment: ```suggestion - JSON is the only supported result format. ``` ########## docs/design/architecture.md: ########## @@ -70,12 +70,20 @@ Druid uses deep storage to store any data that has been ingested into the system storage accessible by every Druid server. In a clustered deployment, this is typically a distributed object store like S3 or HDFS, or a network mounted filesystem. In a single-server deployment, this is typically local disk. -Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between -Druid processes. Druid stores data in files called _segments_. Historical processes cache data segments on -local disk and serve queries from that cache as well as from an in-memory cache. -This means that Druid never needs to access deep storage -during a query, helping it offer the best query latencies possible. 
It also means that you must have enough disk space -both in deep storage and across your Historical servers for the data you plan to load. +Druid uses deep storage for the following purposes: + +- As a backup of your data, including those that get loaded onto Historical processes. +- As a way to transfer data in the background between +Druid processes. Druid stores data in files called _segments_. +- As the source data for queries that run against segments stored only in deep storage and not in Historical processes as determined by your load rules. + +Historical processes cache data segments on +local disk and serve queries from that cache as well as from an in-memory cache. Segments on disk for Historical processes provide the low latency querying performance Druid is known for. You can query directly from deep storage though, which allows you to query segments that exist only in deep storage. This trades some performance to provide you with the ability to query more of your data without necessarily having to scale your Historical processes. + +When determining sizing for your storage, keep the following in mind: + +- Deep storage needs to be able to hold all the data that you ingest into Druid Review Comment: ```suggestion - Deep storage needs to be able to hold all the data that you ingest into Druid. ``` ########## docs/design/deep-storage.md: ########## @@ -25,7 +25,13 @@ title: "Deep storage" Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented. Review Comment: ```suggestion Deep storage is where segments are stored. 
It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data. As long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented. ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. 
For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. The currently supported mode is `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported results format is JSON. +- Only the user who submits a query can see the results. + + +### Get query status + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/{queryID} +``` + +Returns the same response as the post API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions: Review Comment: ```suggestion Returns information about the query associated with the given query ID. The response matches the response from the POST API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions: ``` ########## docs/design/deep-storage.md: ########## @@ -55,22 +61,28 @@ druid.storage.storageDirectory=/tmp/druid/localStorage The `druid.storage.storageDirectory` must be set to a different path than `druid.segmentCache.locations` or `druid.segmentCache.infoDir`. -## Amazon S3 or S3-compatible +### Amazon S3 or S3-compatible See [`druid-s3-extensions`](../development/extensions-core/s3.md). -## Google Cloud Storage +### Google Cloud Storage See [`druid-google-extensions`](../development/extensions-core/google.md). -## Azure Blob Storage +### Azure Blob Storage See [`druid-azure-extensions`](../development/extensions-core/azure.md). 
-## HDFS +### HDFS See [druid-hdfs-storage extension documentation](../development/extensions-core/hdfs.md). -## Additional options +### Additional options For additional deep storage options, please see our [extensions list](../configuration/extensions.md). + +## Querying from deep storage + +Although not as performant as querying segments stored on disk for Historicals processes, you can query from deep storage to access segments that you may not need frequently or with the extreme low latency Druid queries traditionally provide. You trade some performance for a total lower storage cost because you can access more of your data without the need to increase the number or capacity of your Historical processes. Review Comment: ```suggestion Although not as performant as querying segments stored on disk for Historical processes, you can query from deep storage to access segments that you may not need frequently or with the extreme low latency Druid queries traditionally provide. You trade some performance for a total lower storage cost because you can access more of your data without the need to increase the number or capacity of your Historical processes. ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. 
+ +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. The currently supported mode is `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported results format is JSON. +- Only the user who submits a query can see the results. + + +### Get query status + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/{queryID} +``` + +Returns the same response as the post API if the query is accepted or running. 
The response for a completed query includes the same information as an in-progress query with several additions: + +- A `result` object that summarizes information about your results, such as the total number of rows and a sample record +- A `pages` object that includes the following information for each page of results: + - `numRows`: the number of rows in that page of results + - `sizeInBytes`: the size of the page + - `id`: the page number that you can use to reference a specific page when you get query results + + +### Get query results + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}/results?page=PAGENUMBER +``` + +Results are separated into pages, so you can use the optional `page` parameter to refine the results you get. When you retrieve the status of a completed query, Druid returns information about the composition of each page and its page number (`id`). + +When getting query results, keep the following in mind: + +- JSON is the only supported result format +- If you attempt to get the results for an in-progress query, Druid returns an error. + +### Cancel a query + +``` +DELETE https://ROUTER:8888/druid/v2/sql/statements/{queryID} Review Comment: Do the DELETE and GET requests only work for queries that were POSTed using `/sql/statements`? ########## docs/design/architecture.md: ########## @@ -70,12 +70,20 @@ Druid uses deep storage to store any data that has been ingested into the system storage accessible by every Druid server. In a clustered deployment, this is typically a distributed object store like S3 or HDFS, or a network mounted filesystem. In a single-server deployment, this is typically local disk. -Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between -Druid processes. Druid stores data in files called _segments_. Historical processes cache data segments on -local disk and serve queries from that cache as well as from an in-memory cache. 
-This means that Druid never needs to access deep storage -during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space -both in deep storage and across your Historical servers for the data you plan to load. +Druid uses deep storage for the following purposes: + +- As a backup of your data, including those that get loaded onto Historical processes. +- As a way to transfer data in the background between +Druid processes. Druid stores data in files called _segments_. +- As the source data for queries that run against segments stored only in deep storage and not in Historical processes as determined by your load rules. + +Historical processes cache data segments on +local disk and serve queries from that cache as well as from an in-memory cache. Segments on disk for Historical processes provide the low latency querying performance Druid is known for. You can query directly from deep storage though, which allows you to query segments that exist only in deep storage. This trades some performance to provide you with the ability to query more of your data without necessarily having to scale your Historical processes. Review Comment: ```suggestion Historical processes cache data segments on local disk and serve queries from that cache as well as from an in-memory cache. Segments on disk for Historical processes provide the low latency querying performance Druid is known for. You can also query directly from deep storage. When you query segments that exist only in deep storage, you trade some performance in exchange for the ability to query more of your data without necessarily having to scale your Historical processes. ``` ########## docs/design/architecture.md: ########## @@ -70,12 +70,20 @@ Druid uses deep storage to store any data that has been ingested into the system storage accessible by every Druid server. 
In a clustered deployment, this is typically a distributed object store like S3 or HDFS, or a network mounted filesystem. In a single-server deployment, this is typically local disk. -Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between -Druid processes. Druid stores data in files called _segments_. Historical processes cache data segments on -local disk and serve queries from that cache as well as from an in-memory cache. -This means that Druid never needs to access deep storage -during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space -both in deep storage and across your Historical servers for the data you plan to load. +Druid uses deep storage for the following purposes: + +- As a backup of your data, including those that get loaded onto Historical processes. +- As a way to transfer data in the background between +Druid processes. Druid stores data in files called _segments_. +- As the source data for queries that run against segments stored only in deep storage and not in Historical processes as determined by your load rules. + +Historical processes cache data segments on +local disk and serve queries from that cache as well as from an in-memory cache. Segments on disk for Historical processes provide the low latency querying performance Druid is known for. You can query directly from deep storage though, which allows you to query segments that exist only in deep storage. This trades some performance to provide you with the ability to query more of your data without necessarily having to scale your Historical processes. 
+ +When determining sizing for your storage, keep the following in mind: + +- Deep storage needs to be able to hold all the data that you ingest into Druid +- On disk storage for Historical processes need to be able to accommodate the data you want to load onto them to run queries on data you access frequently and need low latency for Review Comment: Missing rest of sentence? ########## docs/design/deep-storage.md: ########## @@ -25,7 +25,13 @@ title: "Deep storage" Deep storage is where segments are stored. It is a storage mechanism that Apache Druid does not provide. This deep storage infrastructure defines the level of durability of your data, as long as Druid processes can see this storage infrastructure and get at the segments stored on it, you will not lose data no matter how many Druid nodes you lose. If segments disappear from this storage layer, then you will lose whatever data those segments represented. -## Local +In addition to being the backing store for segments, you can use [query from deep storage](#querying-from-deep-storage) and run queries against segments stored primarily in deep storage. Whether segments exist primarily in deep storage or in deep storage and on Historical processes, is determined by the [load rules](../operations/rule-configuration.md#load-rules) you configure. Review Comment: ```suggestion In addition to being the backing store for segments, you can use [query from deep storage](#querying-from-deep-storage) and run queries against segments stored primarily in deep storage. The [load rules](../operations/rule-configuration.md#load-rules) you configure determine whether segments exist primarily in deep storage or in a combination of deep storage and Historical processes. 
``` ########## docs/operations/durable-storage.md: ########## @@ -0,0 +1,66 @@ +--- +id: durable-storage +title: "Durable storage for the multi-stage query engine" +sidebar_label: "Durable storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +You can use durable storage to improve querying from deep storage and SQL-based ingestion. + +> Note that only S3 is supported as a durable storage location. + +Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries to. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability. + +## Enable durable storage + +To enable durable storage, you need to set the following common service properties: + +``` +druid.msq.intermediate.storage.enable=true +druid.msq.intermediate.storage.type=s3 +druid.msq.intermediate.storage.bucket=YOUR_BUCKET +druid.msq.intermediate.storage.prefix=YOUR_PREFIX +druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir +``` + +For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). 
+ + +## Use durable storage for SQL-based ingestion queries + +When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. + +For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. + +## Use durable storage for queries from deep storage + +When you run a query, include the context parameter `selectDestination` and set it to `DURABLE_STORAGE`. This context parameter configures queries from deep storage to write their results to durable storage. + +## Durable storage clean up + +To prevent durable storage from getting filled up with temporary files in case the tasks fail to clean them up, a periodic +cleaner can be scheduled to clean the directories corresponding to which there isn't a controller task running. It utilizes Review Comment: nit: passive voice. Also, how does one go about scheduling the periodic cleaner? ########## docs/querying/query-from-deep-storage.md: ########## @@ -0,0 +1,187 @@ +--- +id: query-deep-storage +title: "Query from deep storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +> Query from deep storage is an experimental feature. 
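Tying together the context parameters mentioned in the hunks above (`executionMode` and `selectDestination` for `sql/statements`, `durableShuffleStorage` for SQL-based ingestion), a request payload would carry them in a `context` object. A sketch of a deep-storage query that writes its results to durable storage (payload shape assumed from the quoted examples):

```json
{
  "query": "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'",
  "context": {
    "executionMode": "ASYNC",
    "selectDestination": "DURABLE_STORAGE"
  }
}
```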
+ +## Segments in deep storage + +Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. To take advantage of the space savings that querying from deep storage provides though, you need to make sure not all your segments get loaded onto Historical processes. Review Comment: ```suggestion Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. However, to take advantage of the space savings that querying from deep storage provides, make sure not all your segments get loaded onto Historical processes. ``` ########## docs/design/deep-storage.md: ########## @@ -55,22 +61,28 @@ druid.storage.storageDirectory=/tmp/druid/localStorage The `druid.storage.storageDirectory` must be set to a different path than `druid.segmentCache.locations` or `druid.segmentCache.infoDir`. -## Amazon S3 or S3-compatible +### Amazon S3 or S3-compatible See [`druid-s3-extensions`](../development/extensions-core/s3.md). -## Google Cloud Storage +### Google Cloud Storage See [`druid-google-extensions`](../development/extensions-core/google.md). -## Azure Blob Storage +### Azure Blob Storage See [`druid-azure-extensions`](../development/extensions-core/azure.md). -## HDFS +### HDFS See [druid-hdfs-storage extension documentation](../development/extensions-core/hdfs.md). -## Additional options +### Additional options For additional deep storage options, please see our [extensions list](../configuration/extensions.md). + +## Querying from deep storage + +Although not as performant as querying segments stored on disk for Historicals processes, you can query from deep storage to access segments that you may not need frequently or with the extreme low latency Druid queries traditionally provide. 
You trade some performance for a total lower storage cost because you can access more of your data without the need to increase the number or capacity of your Historical processes. + +For information about how to run queries, see [Query from deep storage](../querying/query-from-deep-storage.md) Review Comment: ```suggestion For information about how to run queries, see [Query from deep storage](../querying/query-from-deep-storage.md). ``` ########## docs/operations/durable-storage.md: ########## @@ -0,0 +1,66 @@ +--- +id: durable-storage +title: "Durable storage for the multi-stage query engine" +sidebar_label: "Durable storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +You can use durable storage to improve querying from deep storage and SQL-based ingestion. + +> Note that only S3 is supported as a durable storage location. + +Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries to. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability. 
+ +## Enable durable storage + +To enable durable storage, you need to set the following common service properties: + +``` +druid.msq.intermediate.storage.enable=true +druid.msq.intermediate.storage.type=s3 +druid.msq.intermediate.storage.bucket=YOUR_BUCKET +druid.msq.intermediate.storage.prefix=YOUR_PREFIX +druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir +``` + +For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). + + +## Use durable storage for SQL-based ingestion queries + +When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. + +For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. + +## Use durable storage for queries from deep storage + +When you run a query, include the context parameter `selectDestination` and set it to `DURABLE_STORAGE`. This context parameter configures queries from deep storage to write their results to durable storage. + +## Durable storage clean up + +To prevent durable storage from getting filled up with temporary files in case the tasks fail to clean them up, a periodic +cleaner can be scheduled to clean the directories corresponding to which there isn't a controller task running. It utilizes +the storage connector to work upon the durable storage. The durable storage location should only be utilized to store the output +for cluster's MSQ tasks. If the location contains other files or directories, then they will get cleaned up as well. + +Enabling durable storage also enables the use of local disk to store temporary files, such as the intermediate files produced +by the super sorter. 
Tasks will use whatever has been configured for their temporary usage as described in [Configuring task storage sizes](../ingestion/tasks.md#configuring-task-storage-sizes) +If the configured limit is too low, `NotEnoughTemporaryStorageFault` may be thrown. Review Comment: ```suggestion If the configured limit is too low, Druid may throw the error, `NotEnoughTemporaryStorageFault`. ``` ########## docs/querying/query-from-deep-storage.md: ########## @@ -0,0 +1,187 @@ +--- +id: query-deep-storage +title: "Query from deep storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +> Query from deep storage is an experimental feature. + +## Segments in deep storage + +Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. To take advantage of the space savings that querying from deep storage provides though, you need to make sure not all your segments get loaded onto Historical processes. + +To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to load only the segments you do want on Historical processes. 
+ +For example, use the `loadByInterval` load rule and set `tieredReplicants.YOUR_TIER` (such as `tieredReplicants._default_tier`) to 0 for a specific interval. If the default tier is the only tier in your cluster, this results in that interval only being available from deep storage. + +For example, the following interval load rule assigns 0 replicants for the specified interval to the tier `_default_tier`: + +``` + { + "interval": "2017-01-19T00:00:00.000Z/2017-09-20T00:00:00.000Z", + "tieredReplicants": { + "_default_tier": 0 + }, + "useDefaultTierForNull": true, + "type": "loadByInterval" + } +``` + +This means that any segments within that interval don't get loaded onto `_default_tier` . Then, create a corresponding drop rule so that Druid drops the segments from Historical tiers if they were previously loaded. Review Comment: Include an example of the corresponding drop rule? ########## docs/querying/query-from-deep-storage.md: ########## @@ -0,0 +1,187 @@ +--- +id: query-deep-storage +title: "Query from deep storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +> Query from deep storage is an experimental feature. 
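For the drop rule requested in the review comment above, a possible sketch uses Druid's `dropByInterval` rule type with the same interval as the load rule (the interval value is assumed from the example, not taken from the PR):

```json
{
  "type": "dropByInterval",
  "interval": "2017-01-19T00:00:00.000Z/2017-09-20T00:00:00.000Z"
}
```

Placed after the load rule in the retention rule chain, this drops any previously loaded segments in that interval from Historical tiers.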
+ +## Segments in deep storage + +Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. To take advantage of the space savings that querying from deep storage provides though, you need to make sure not all your segments get loaded onto Historical processes. + +To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to load only the segments you do want on Historical processes. + +For example, use the `loadByInterval` load rule and set `tieredReplicants.YOUR_TIER` (such as `tieredReplicants._default_tier`) to 0 for a specific interval. If the default tier is the only tier in your cluster, this results in that interval only being available from deep storage. + +For example, the following interval load rule assigns 0 replicants for the specified interval to the tier `_default_tier`: + +``` + { + "interval": "2017-01-19T00:00:00.000Z/2017-09-20T00:00:00.000Z", + "tieredReplicants": { + "_default_tier": 0 + }, + "useDefaultTierForNull": true, + "type": "loadByInterval" + } +``` + +This means that any segments within that interval don't get loaded onto `_default_tier` . Then, create a corresponding drop rule so that Druid drops the segments from Historical tiers if they were previously loaded. + +You can verify that a segment is not loaded on any Historical tiers by querying the Druid metadata table: + +```sql +SELECT "segment_id", "replication_factor" FROM sys."segments" WHERE "replication_factor" = 0 AND "datasource" = YOUR_DATASOURCE +``` + +Segments with a `replication_factor` of `0` are not assigned to any Historical tiers. Queries you run against these segments are run directly against the segment in deep storage. Review Comment: ```suggestion Segments with a `replication_factor` of `0` are not assigned to any Historical tiers. Queries against these segments are run directly against the segment in deep storage. 
``` ########## docs/querying/query-from-deep-storage.md: ########## @@ -0,0 +1,187 @@ +--- +id: query-deep-storage +title: "Query from deep storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +> Query from deep storage is an experimental feature. + +## Segments in deep storage + +Any data you ingest into Druid is already stored in deep storage, so you don't need to perform any additional configuration from that perspective. To take advantage of the space savings that querying from deep storage provides though, you need to make sure not all your segments get loaded onto Historical processes. + +To do this, configure [load rules](../operations/rule-configuration.md#load-rules) to load only the segments you do want on Historical processes. + +For example, use the `loadByInterval` load rule and set `tieredReplicants.YOUR_TIER` (such as `tieredReplicants._default_tier`) to 0 for a specific interval. If the default tier is the only tier in your cluster, this results in that interval only being available from deep storage. 
+ +For example, the following interval load rule assigns 0 replicants for the specified interval to the tier `_default_tier`: + +``` + { + "interval": "2017-01-19T00:00:00.000Z/2017-09-20T00:00:00.000Z", + "tieredReplicants": { + "_default_tier": 0 + }, + "useDefaultTierForNull": true, + "type": "loadByInterval" + } +``` + +This means that any segments within that interval don't get loaded onto `_default_tier` . Then, create a corresponding drop rule so that Druid drops the segments from Historical tiers if they were previously loaded. Review Comment: ```suggestion This means that any segments within that interval don't get loaded onto `_default_tier`. Then, create a corresponding drop rule so that Druid drops the segments from Historical tiers if they were previously loaded. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
