cryptoe commented on code in PR #14609: URL: https://github.com/apache/druid/pull/14609#discussion_r1275697116
########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. 
Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported value for `resultFormat` is JSON. +- Only the user who submits a query can see the results. Review Comment: The response when execution mode is async is this POJO: https://github.com/apache/druid/blob/master/extensions-core/multi-stage-query/src/main/java/org/apache/druid/msq/sql/entity/SqlStatementResult.java We might want to mention that in the response payload. ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences.
For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported value for `resultFormat` is JSON. +- Only the user who submits a query can see the results. + + +### Get query status + +``` +GET https://ROUTER:8888/druid/v2/sql/statements/{queryID} +``` + +Returns information about the query associated with the given query ID. The response matches the response from the POST API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions: Review Comment: The get query status response and the `postReq` response are the same when executionMode=async. ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules.
+
+Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. Review Comment: ```suggestion Note that at least one segment of a datasource must be available on a Historical process so that the Broker can plan your query. A quick way to check this is to confirm that the datasource is visible in the Druid console. ``` ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). 
+
+Keep the following in mind when submitting queries to the `sql/statements` endpoint:
+
+- There are additional context parameters for `sql/statements`:
+
+  - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`.
+  - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md).
+
+- The only supported value for `resultFormat` is JSON.
+- Only the user who submits a query can see the results.
+
+
+### Get query status
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}
+```
+
+Returns information about the query associated with the given query ID. The response matches the response from the POST API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions:
+
+- A `result` object that summarizes information about your results, such as the total number of rows and a sample record
+- A `pages` object that includes the following information for each page of results:
+  - `numRows`: the number of rows in that page of results
+  - `sizeInBytes`: the size of the page
+  - `id`: the page number that you can use to reference a specific page when you get query results
+
+
+### Get query results
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}/results?page=PAGENUMBER
+```
+
+Results are separated into pages, so you can use the optional `page` parameter to refine the results you get. When you retrieve the status of a completed query, Druid returns information about the composition of each page and its page number (`id`). Review Comment: Should we link the get query status API here? ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect.
- The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported value for `resultFormat` is JSON. 
+- Only the user who submits a query can see the results.
+
+
+### Get query status
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}
+```
+
+Returns information about the query associated with the given query ID. The response matches the response from the POST API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions:
+
+- A `result` object that summarizes information about your results, such as the total number of rows and a sample record
+- A `pages` object that includes the following information for each page of results:
+  - `numRows`: the number of rows in that page of results
+  - `sizeInBytes`: the size of the page
+  - `id`: the page number that you can use to reference a specific page when you get query results
+
+
+### Get query results
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}/results?page=PAGENUMBER
+```
+
+Results are separated into pages, so you can use the optional `page` parameter to refine the results you get. When you retrieve the status of a completed query, Druid returns information about the composition of each page and its page number (`id`).
+
+When getting query results, keep the following in mind:
+
+- JSON is the only supported result format.
+- If you attempt to get the results for an in-progress query, Druid returns an error.
+
+### Cancel a query
+
+```
+DELETE https://ROUTER:8888/druid/v2/sql/statements/{queryID}
+```
+
+Cancels a running or accepted query.
+
+Druid returns an HTTP 202 response for successful cancellation requests. If the query is already complete or can't be found, Druid returns an HTTP 500 error with an error message describing the issue. Review Comment: If the query is already completed, we return a 200. If the query cannot be found, we return a 404.
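The status codes described in the review comment above (202 for an accepted cancellation, 200 when the query is already complete, 404 when it cannot be found) can be sketched as a small client-side handler. This is an illustrative sketch only; the function name and message strings are hypothetical and not part of any Druid client library.

```python
# Interpret responses from DELETE /druid/v2/sql/statements/{queryID},
# per the status codes stated in the review: 202 = cancellation accepted,
# 200 = query already complete, 404 = query not found.
# Names and messages here are illustrative assumptions.

def interpret_cancel_response(status_code: int) -> str:
    """Map an HTTP status code from the cancel endpoint to an outcome."""
    outcomes = {
        202: "cancellation accepted",
        200: "query already complete; nothing to cancel",
        404: "query not found",
    }
    return outcomes.get(status_code, f"unexpected status: {status_code}")

print(interpret_cancel_response(202))  # cancellation accepted
```

A caller would issue the DELETE request with any HTTP client and pass the resulting status code through this mapping before deciding whether to retry or surface an error.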
########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed. -Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. 
Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md).
+
+- The only supported value for `resultFormat` is JSON.
+- Only the user who submits a query can see the results.
+
+
+### Get query status
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}
+```
+
+Returns information about the query associated with the given query ID. The response matches the response from the POST API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions:
+
+- A `result` object that summarizes information about your results, such as the total number of rows and a sample record
+- A `pages` object that includes the following information for each page of results:
+  - `numRows`: the number of rows in that page of results
+  - `sizeInBytes`: the size of the page
+  - `id`: the page number that you can use to reference a specific page when you get query results
+
+
+### Get query results
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}/results?page=PAGENUMBER
+```
+
+Results are separated into pages, so you can use the optional `page` parameter to refine the results you get. When you retrieve the status of a completed query, Druid returns information about the composition of each page and its page number (`id`).
+
+When getting query results, keep the following in mind:
+
+- JSON is the only supported result format.
+- If you attempt to get the results for an in-progress query, Druid returns an error.
+ Review Comment: If you attempt to get the results of a failed query, Druid returns a 404. If you attempt to get the results of an ingestion/replace query, Druid returns an empty response. ########## docs/api-reference/sql-api.md: ########## @@ -186,4 +186,79 @@ Druid returns an HTTP 404 response in the following cases: - `sqlQueryId` is incorrect. - The query completes before your cancellation request is processed.
-Druid returns an HTTP 403 response for authorization failure. \ No newline at end of file +Druid returns an HTTP 403 response for authorization failure. + +## Query from deep storage + +> The `/sql/statements` endpoint used to query from deep storage is currently experimental. + +You can use the `sql/statements` endpoint to query segments that exist only in deep storage and are not loaded onto your Historical processes as determined by your load rules. + +Note that at least part of a datasource must be available on a Historical process so that Druid can plan your query. + +For more information, see [Query from deep storage](../querying/query-from-deep-storage.md). + +### Submit a query + +To run a query from deep storage, send your query to the Router using the POST method: + +``` +POST https://ROUTER:8888/druid/v2/sql/statements +``` + +Submitting a query from deep storage uses the same syntax as any other Druid SQL query where the "query" field in the JSON object within the request payload contains your query. For example: + +```json +{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"} +``` + +Generally, the `sql` and `sql/statements` endpoints support the same response body fields with minor differences. For general information about the available fields, see [Submit a query to the `sql` endpoint](#submit-a-query). + +Keep the following in mind when submitting queries to the `sql/statements` endpoint: + +- There are additional context parameters for `sql/statements`: + + - `executionMode` determines how query results are fetched. Druid currently only supports `ASYNC`. + - `selectDestination` set to `DURABLE_STORAGE` instructs Druid to write the results from SELECT queries to durable storage. Note that this requires you to have [durable storage for MSQ enabled](../operations/durable-storage.md). + +- The only supported value for `resultFormat` is JSON. +- Only the user who submits a query can see the results. 
+
+
+### Get query status
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}
+```
+
+Returns information about the query associated with the given query ID. The response matches the response from the POST API if the query is accepted or running. The response for a completed query includes the same information as an in-progress query with several additions:
+
+- A `result` object that summarizes information about your results, such as the total number of rows and a sample record
+- A `pages` object that includes the following information for each page of results:
+  - `numRows`: the number of rows in that page of results
+  - `sizeInBytes`: the size of the page
+  - `id`: the page number that you can use to reference a specific page when you get query results
+
+
+### Get query results
+
+```
+GET https://ROUTER:8888/druid/v2/sql/statements/{queryID}/results?page=PAGENUMBER
+```
+
+Results are separated into pages, so you can use the optional `page` parameter to refine the results you get. When you retrieve the status of a completed query, Druid returns information about the composition of each page and its page number (`id`).
+ Review Comment: If no page number is passed, all data for that query is returned sequentially, in page order, in the same response. Note that if you have large result sets, your request can time out due to `druid.router.http.readTimeout`. ########## docs/design/architecture.md: ########## @@ -70,12 +70,20 @@ Druid uses deep storage to store any data that has been ingested into the system storage accessible by every Druid server. In a clustered deployment, this is typically a distributed object store like S3 or HDFS, or a network mounted filesystem. In a single-server deployment, this is typically local disk. -Druid uses deep storage only as a backup of your data and as a way to transfer data in the background between -Druid processes. Druid stores data in files called _segments_. 
Historical processes cache data segments on -local disk and serve queries from that cache as well as from an in-memory cache. -This means that Druid never needs to access deep storage -during a query, helping it offer the best query latencies possible. It also means that you must have enough disk space -both in deep storage and across your Historical servers for the data you plan to load. +Druid uses deep storage for the following purposes: + +- As a backup of your data, including those that get loaded onto Historical processes. +- As a way to transfer data in the background between Druid processes. Druid stores data in files called _segments_. +- As the source data for queries that run against segments stored only in deep storage and not in Historical processes as determined by your load rules. Review Comment: this line is a little confusing. ########## docs/operations/durable-storage.md: ########## @@ -0,0 +1,66 @@ +--- +id: durable-storage +title: "Durable storage for the multi-stage query engine" +sidebar_label: "Durable storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +You can use durable storage to improve querying from deep storage and SQL-based ingestion. 
+ +> Note that only S3 is supported as a durable storage location. + +Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries to. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability. + +## Enable durable storage + +To enable durable storage, you need to set the following common service properties: + +``` +druid.msq.intermediate.storage.enable=true +druid.msq.intermediate.storage.type=s3 +druid.msq.intermediate.storage.bucket=YOUR_BUCKET +druid.msq.intermediate.storage.prefix=YOUR_PREFIX +druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir +``` + +For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). + + +## Use durable storage for SQL-based ingestion queries + +When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. + +For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. + +## Use durable storage for queries from deep storage + +When you run a query, include the context parameter `selectDestination` and set it to `DURABLE_STORAGE`. This context parameter configures queries from deep storage to write their results to durable storage. + +## Durable storage clean up + +To prevent durable storage from getting filled up with temporary files in case the tasks fail to clean them up, a periodic +cleaner can be scheduled to clean the directories corresponding to which there isn't a controller task running. It utilizes +the storage connector to work upon the durable storage. The durable storage location should only be utilized to store the output +for cluster's MSQ tasks. If the location contains other files or directories, then they will get cleaned up as well. 
+ Review Comment: If we select the destination as `durableStorage` for query results, the results are cleaned up when the task is removed from the metadata store. ########## docs/operations/rule-configuration.md: ########## @@ -167,7 +167,7 @@ Set the following properties: - the segment interval starts any time after the rule interval starts. You can use this property to load segments with future start and end dates, where "future" is relative to the time when the Coordinator evaluates data against the rule. Defaults to `true`. -- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. +- `tieredReplicants`: a map of tier names to the number of segment replicas for that tier. If you set the replicants for a period to 0 on all tiers, you can still [query the data from deep storage](../querying/query-from-deep-storage.md). Review Comment: Another way to query the data from deep storage is to set `tieredReplicants` empty and set `useDefaultTierForNull` to false. I think we should push users toward this approach in the docs. cc @adarshsanjeev ########## docs/operations/durable-storage.md: ########## @@ -0,0 +1,66 @@ +--- +id: durable-storage +title: "Durable storage for the multi-stage query engine" +sidebar_label: "Durable storage" +--- + +<!-- + ~ Licensed to the Apache Software Foundation (ASF) under one + ~ or more contributor license agreements. See the NOTICE file + ~ distributed with this work for additional information + ~ regarding copyright ownership. The ASF licenses this file + ~ to you under the Apache License, Version 2.0 (the + ~ "License"); you may not use this file except in compliance + ~ with the License. You may obtain a copy of the License at + ~ + ~ http://www.apache.org/licenses/LICENSE-2.0 + ~ + ~ Unless required by applicable law or agreed to in writing, + ~ software distributed under the License is distributed on an + ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + ~ KIND, either express or implied. 
See the License for the + ~ specific language governing permissions and limitations + ~ under the License. + --> + +You can use durable storage to improve querying from deep storage and SQL-based ingestion. + +> Note that only S3 is supported as a durable storage location. + +Durable storage for queries from deep storage provides a location where you can write the results of deep storage queries to. Durable storage for SQL-based ingestion is used to temporarily house intermediate files, which can improve reliability. + +## Enable durable storage + +To enable durable storage, you need to set the following common service properties: + +``` +druid.msq.intermediate.storage.enable=true +druid.msq.intermediate.storage.type=s3 +druid.msq.intermediate.storage.bucket=YOUR_BUCKET +druid.msq.intermediate.storage.prefix=YOUR_PREFIX +druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir +``` + +For detailed information about the settings related to durable storage, see [Durable storage configurations](../multi-stage-query/reference.md#durable-storage-configurations). + + +## Use durable storage for SQL-based ingestion queries + +When you run a query, include the context parameter `durableShuffleStorage` and set it to `true`. + +For queries where you want to use fault tolerance for workers, set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`. + +## Use durable storage for queries from deep storage Review Comment: I think we also need to mention this: https://github.com/apache/druid/pull/14629/files#diff-bb668e1497f66d4430a7e3650bbdc18accaddc4bbcbf111c38298870fd9e7c06R380 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
