[druid] branch master updated: msq: add durable storage info (#14035)

karan Fri, 14 Apr 2023 00:58:35 -0700

This is an automated email from the ASF dual-hosted git repository.

karan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new 6c9b7b6efd msq: add durable storage info (#14035)
6c9b7b6efd is described below

commit 6c9b7b6efdcdc26210276951598f732cc6cd7f0e
Author: 317brian <53799971+317br...@users.noreply.github.com>
AuthorDate: Fri Apr 14 00:58:23 2023 -0700

    msq: add durable storage info (#14035)
    
    * msq: add durable storage info
    
    * fix duplicate row
    
    * Apply suggestions from code review
    
    Co-authored-by: Karan Kumar <karankumar1...@gmail.com>
    
    ---------
    
    Co-authored-by: Karan Kumar <karankumar1...@gmail.com>
---
 docs/multi-stage-query/reference.md | 46 +++++++++++++++++++++++++++++++++----
 docs/multi-stage-query/security.md  | 15 ++++++++++++
 2 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/docs/multi-stage-query/reference.md b/docs/multi-stage-query/reference.md
index 2781ca3a3e..de75eef071 100644
--- a/docs/multi-stage-query/reference.md
+++ b/docs/multi-stage-query/reference.md
@@ -330,19 +330,55 @@ CLUSTERED BY user
 
 ## Durable Storage
 
-This section enumerates the advantages and performance implications of enabling durable storage while executing MSQ tasks.
+Using durable storage with your SQL-based ingestions can improve their reliability by writing intermediate files to a storage location temporarily.
 
 To prevent durable storage from getting filled up with temporary files in case the tasks fail to clean them up, a periodic
 cleaner can be scheduled to clean the directories corresponding to which there isn't a controller task running. It utilizes
 the storage connector to work upon the durable storage. The durable storage location should only be utilized to store the output
 for cluster's MSQ tasks. If the location contains other files or directories, then they will get cleaned up as well.
-Following table lists the properties that can be set to control the behavior of the durable storage of the cluster.
+
+### Enable durable storage
+
+To enable durable storage, you need to set the following common service properties:
+
+```
+druid.msq.intermediate.storage.enable=true
+druid.msq.intermediate.storage.type=s3
+druid.msq.intermediate.storage.bucket=YOUR_BUCKET
+druid.msq.intermediate.storage.prefix=YOUR_PREFIX
+druid.msq.intermediate.storage.tempDir=/path/to/your/temp/dir
+```
+
+For detailed information about the settings related to durable storage, see [Durable storage configurations](#durable-storage-configurations).
+
+
+### Use durable storage for queries
+
+When you run a query,  include the context parameter `durableShuffleStorage` and set it to `true`.
+
+For queries where you want to use fault tolerance for workers,  set `faultTolerance` to `true`, which automatically sets `durableShuffleStorage` to `true`.
+
+## Durable storage configurations
+
+The following common service properties control how durable storage behaves:
+
+|Parameter          |Default                                 | Description          |
+|-------------------|----------------------------------------|----------------------|
+|`druid.msq.intermediate.storage.bucket` | n/a | The bucket in S3 where you want to store intermediate files.  |
+| `druid.msq.intermediate.storage.chunkSize` | n/a | Optional. Defines the size of each chunk to temporarily store in `druid.msq.intermediate.storage.tempDir`. The chunk size must be between 5 MiB and 5 GiB. Druid computes the chunk size automatically if no value is provided.|
+|`druid.msq.intermediate.storage.enable` | true | Required. Whether to enable durable storage for the cluster.|
+| `druid.msq.intermediate.storage.maxTriesOnTransientErrors` | 10 | Optional. Defines the max number times to attempt S3 API calls to avoid failures due to transient errors. |
+|`druid.msq.intermediate.storage.type` | `s3` if your deep storage is S3 | Required. The type of storage to use. You can either set this to `local` or `s3`.  |
+|`druid.msq.intermediate.storage.prefix` | n/a | S3 prefix to store intermediate stage results. Provide a unique value for the prefix. Don't share the same prefix between clusters. If the location  includes other files or directories, then they will get cleaned up as well.  |
+| `druid.msq.intermediate.storage.tempDir`|  | Required. Directory path on the local disk to temporarily store intermediate stage results.  |
+
+In addition to the common service properties, there are certain properties that you configure on the Overlord specifically to clean up intermediate files:
 
 |Parameter          |Default                                 | Description          |
 |-------------------|----------------------------------------|----------------------|
-|`druid.msq.intermediate.storage.enable` | true | Whether to enable durable storage for the cluster |
-|`druid.msq.intermediate.storage.cleaner.enabled`| false | Whether durable storage cleaner should be enabled for the cluster. This should be set on the overlord|
-|`druid.msq.intermediate.storage.cleaner.delaySeconds`| 86400 | The delay (in seconds) after the last run post which the durable storage cleaner would clean the outputs. This should be set on the overlord |
+|`druid.msq.intermediate.storage.cleaner.enabled`| false | Optional. Whether durable storage cleaner should be enabled for the cluster.  |
+|`druid.msq.intermediate.storage.cleaner.delaySeconds`| 86400 | Optional. The delay (in seconds) after the last run post which the durable storage cleaner would clean the outputs.  |
+
 
 ## Limits
 
diff --git a/docs/multi-stage-query/security.md b/docs/multi-stage-query/security.md
index 1d18080ca9..6422542890 100644
--- a/docs/multi-stage-query/security.md
+++ b/docs/multi-stage-query/security.md
@@ -50,3 +50,18 @@ To interact with a query through the Overlord API, users need the following perm
 
 - `INSERT` or `REPLACE` queries: Users must have READ DATASOURCE permission on the output datasource.
 - `SELECT` queries: Users must have read permissions on the `__query_select` datasource, which is a stub datasource that gets created.
+
+## S3
+
+The MSQ task engine can use S3 to store intermediate files when running queries. This can increase its reliability but requires certain permissions in S3.
+These permissions are required if you configure durable storage. 
+
+Permissions for pushing and fetching intermediate stage results to and from S3:
+
+- `s3:GetObject`
+- `s3:PutObject`
+- `s3:AbortMultipartUpload`
+
+Permissions for removing intermediate stage results:
+
+- `s3:DeleteObject`
\ No newline at end of file


