This is an automated email from the ASF dual-hosted git repository.
victoria pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 809bf161ce Add a note about setting the value of maxNumConcurrentSubTasks (#12772)
809bf161ce is described below
commit 809bf161ce6913c930362239ca079c800e0c317e
Author: Katya Macedo <[email protected]>
AuthorDate: Tue Jul 19 17:34:21 2022 -0500
Add a note about setting the value of maxNumConcurrentSubTasks (#12772)
* Add clarification for combining input source
* Update inputFormat note
* Update maxNumConcurrentSubTasks note
* Fix broken link
* Update docs/ingestion/native-batch-input-source.md
Co-authored-by: Charles Smith <[email protected]>
---
docs/ingestion/native-batch-input-source.md | 94 ++++++++++++++++-------------
docs/ingestion/native-batch.md | 2 +-
2 files changed, 52 insertions(+), 44 deletions(-)
diff --git a/docs/ingestion/native-batch-input-source.md b/docs/ingestion/native-batch-input-source.md
index f4b92bdfe7..62ae3a8a07 100644
--- a/docs/ingestion/native-batch-input-source.md
+++ b/docs/ingestion/native-batch-input-source.md
@@ -176,10 +176,9 @@ Sample specs:
...
```
-
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
-|type|This should be `s3`.|None|yes|
+|type|Set the value to `s3`.|None|yes|
|uris|JSON array of URIs where S3 objects to be ingested are located.|None|`uris` or `prefixes` or `objects` must be set|
|prefixes|JSON array of URI prefixes for the locations of S3 objects to be ingested. Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set|
|objects|JSON array of S3 Objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set|
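Putting the properties above together, a minimal `s3` input source using the `uris` form might look like the following sketch (the bucket and object names are illustrative; `prefixes` or `objects` would take the place of `uris`):

```json
"inputSource": {
  "type": "s3",
  "uris": ["s3://example-bucket/path/to/file.json"]
}
```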
@@ -193,23 +192,23 @@ Note that the S3 input source will skip all empty objects only when `prefixes` i
S3 Object:
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
|bucket|Name of the S3 bucket|None|yes|
|path|The path where data is located.|None|yes|
Properties Object:
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
-|accessKeyId|The [Password Provider](../operations/password-provider.md) or plain text string of this S3 InputSource's access key|None|yes if secretAccessKey is given|
-|secretAccessKey|The [Password Provider](../operations/password-provider.md) or plain text string of this S3 InputSource's secret key|None|yes if accessKeyId is given|
+|accessKeyId|The [Password Provider](../operations/password-provider.md) or plain text string of this S3 input source access key|None|yes if secretAccessKey is given|
+|secretAccessKey|The [Password Provider](../operations/password-provider.md) or plain text string of this S3 input source secret key|None|yes if accessKeyId is given|
|assumeRoleArn|AWS ARN of the role to assume [see](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html). **assumeRoleArn** can be used either with the ingestion spec AWS credentials or with the default S3 credentials|None|no|
|assumeRoleExternalId|A unique identifier that might be required when you assume a role in another account [see](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html)|None|no|
> **Note:** If `accessKeyId` and `secretAccessKey` are not given, the default
> [S3 credentials provider
> chain](../development/extensions-core/s3.md#s3-authentication-methods) is
> used.
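As an illustrative sketch of supplying credentials explicitly through the `properties` object described above (the key values are placeholders, not working credentials):

```json
"inputSource": {
  "type": "s3",
  "uris": ["s3://example-bucket/path/to/file.json"],
  "properties": {
    "accessKeyId": "EXAMPLE_ACCESS_KEY_ID",
    "secretAccessKey": "EXAMPLE_SECRET_ACCESS_KEY"
  }
}
```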
-## Google Cloud Storage Input Source
+## Google Cloud Storage input source
> You need to include the
> [`druid-google-extensions`](../development/extensions-core/google.md) as an
> extension to use the Google Cloud Storage input source.
@@ -276,9 +275,9 @@ Sample specs:
...
```
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
-|type|This should be `google`.|None|yes|
+|type|Set the value to `google`.|None|yes|
|uris|JSON array of URIs where Google Cloud Storage objects to be ingested are located.|None|`uris` or `prefixes` or `objects` must be set|
|prefixes|JSON array of URI prefixes for the locations of Google Cloud Storage objects to be ingested. Empty objects starting with one of the given prefixes will be skipped.|None|`uris` or `prefixes` or `objects` must be set|
|objects|JSON array of Google Cloud Storage objects to be ingested.|None|`uris` or `prefixes` or `objects` must be set|
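For instance, a minimal `google` input source using the `prefixes` form could be sketched as follows (the bucket and path are illustrative):

```json
"inputSource": {
  "type": "google",
  "prefixes": ["gs://example-bucket/path/to/dir/"]
}
```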
@@ -288,7 +287,7 @@ Note that the Google Cloud Storage input source will skip all empty objects only
Google Cloud Storage object:
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
|bucket|Name of the Google Cloud Storage bucket|None|yes|
|path|The path where data is located.|None|yes|
@@ -357,11 +356,11 @@ Sample specs:
...
```
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
-|type|This should be `azure`.|None|yes|
+|type|Set the value to `azure`.|None|yes|
|uris|JSON array of URIs where the Azure objects to be ingested are located, in the form "azure://\<container>/\<path-to-file\>"|None|`uris` or `prefixes` or `objects` must be set|
-|prefixes|JSON array of URI prefixes for the locations of Azure objects to ingest, in the form "azure://\<container>/\<prefix\>". Empty objects starting with one of the given prefixes are skipped.|None|`uris` or `prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to ingest, in the form `azure://\<container>/\<prefix\>`. Empty objects starting with one of the given prefixes are skipped.|None|`uris` or `prefixes` or `objects` must be set|
|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or `objects` must be set|
|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|None|no|
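A minimal `azure` input source using the documented `azure://<container>/<path-to-file>` URI form might be sketched as (container and path are illustrative):

```json
"inputSource": {
  "type": "azure",
  "uris": ["azure://example-container/path/to/file.json"]
}
```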
@@ -369,12 +368,12 @@ Note that the Azure input source skips all empty objects only when `prefixes` is
The `objects` property is:
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
|bucket|Name of the Azure Blob Storage or Azure Data Lake container|None|yes|
|path|The path where data is located.|None|yes|
-## HDFS Input Source
+## HDFS input source
> You need to include the
> [`druid-hdfs-storage`](../development/extensions-core/hdfs.md) as an
> extension to use the HDFS input source.
@@ -449,9 +448,9 @@ Sample specs:
...
```
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
-|type|This should be `hdfs`.|None|yes|
+|type|Set the value to `hdfs`.|None|yes|
|paths|HDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like `*` are supported in these paths. Empty files located under one of the given paths will be skipped.|None|yes|
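As a sketch of the `paths` property with a wildcard, assuming an illustrative namenode host and port:

```json
"inputSource": {
  "type": "hdfs",
  "paths": "hdfs://namenode:8020/path/to/data/*"
}
```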
You can also ingest from other storage using the HDFS input source if the HDFS
client supports that storage.
@@ -459,7 +458,7 @@ However, if you want to ingest from cloud storage, consider using the service-sp
If you want to use a non-hdfs protocol with the HDFS input source, include the
protocol
in `druid.ingestion.hdfs.allowedProtocols`. See [HDFS input source security
configuration](../configuration/index.md#hdfs-input-source) for more details.
-## HTTP Input Source
+## HTTP input source
The HTTP input source is to support reading files directly from remote sites
via HTTP.
@@ -534,9 +533,9 @@ You can also use the other existing Druid PasswordProviders. Here is an example
}
```
-|property|description|default|required?|
+|Property|Description|Default|Required|
|--------|-----------|-------|---------|
-|type|This should be `http`|None|yes|
+|type|Set the value to `http`.|None|yes|
|uris|URIs of the input files. See below for the protocols allowed for URIs.|None|yes|
|httpAuthenticationUsername|Username to use for authentication with specified URIs. Can be optionally used if the URIs specified in the spec require a Basic Authentication Header.|None|no|
|httpAuthenticationPassword|PasswordProvider to use with specified URIs. Can be optionally used if the URIs specified in the spec require a Basic Authentication Header.|None|no|
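As a sketch of an `http` input source with Basic Authentication, using the environment-variable password provider (the URI, username, and variable name are illustrative):

```json
"inputSource": {
  "type": "http",
  "uris": ["https://example.com/data.json"],
  "httpAuthenticationUsername": "exampleUser",
  "httpAuthenticationPassword": {
    "type": "environment",
    "variable": "HTTP_INPUT_SOURCE_PASSWORD"
  }
}
```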
@@ -544,7 +543,7 @@ You can also use the other existing Druid PasswordProviders. Here is an example
You can only use protocols listed in the
`druid.ingestion.http.allowedProtocols` property as HTTP input sources.
The `http` and `https` protocols are allowed by default. See [HTTP input
source security configuration](../configuration/index.md#http-input-source) for
more details.
-## Inline Input Source
+## Inline input source
The Inline input source can be used to read the data inlined in its own spec.
It can be used for demos or for quickly testing out parsing and schema.
@@ -567,12 +566,12 @@ Sample spec:
...
```
-|property|description|required?|
+|Property|Description|Required|
|--------|-----------|---------|
-|type|This should be "inline".|yes|
+|type|Set the value to `inline`.|yes|
|data|Inlined data to ingest.|yes|
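A minimal `inline` input source might be sketched as follows (the inlined record is illustrative; `data` holds the raw input exactly as the configured `inputFormat` expects it):

```json
"inputSource": {
  "type": "inline",
  "data": "{\"timestamp\": \"2013-08-31T01:02:33Z\", \"page\": \"Example\"}"
}
```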
-## Local Input Source
+## Local input source
The Local input source is to support reading files directly from local storage,
and is mainly intended for proof-of-concept testing.
@@ -599,14 +598,14 @@ Sample spec:
...
```
-|property|description|required?|
+|Property|Description|Required|
|--------|-----------|---------|
-|type|This should be "local".|yes|
+|type|Set the value to `local`.|yes|
|filter|A wildcard filter for files. See [here](http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/filefilter/WildcardFileFilter) for more information. Files matching the filter criteria are considered for ingestion. Files not matching the filter criteria are ignored.|yes if `baseDir` is specified|
|baseDir|Directory to search recursively for files to be ingested. Empty files under the `baseDir` will be skipped.|At least one of `baseDir` or `files` should be specified|
|files|File paths to ingest. Some files can be ignored to avoid ingesting duplicate files if they are located under the specified `baseDir`. Empty files will be skipped.|At least one of `baseDir` or `files` should be specified|
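A minimal `local` input source combining `baseDir` and `filter` could be sketched as (the directory is illustrative):

```json
"inputSource": {
  "type": "local",
  "filter": "*.json",
  "baseDir": "/data/ingest"
}
```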
-## Druid Input Source
+## Druid input source
The Druid input source is to support reading data directly from existing Druid
segments,
potentially using a new schema and changing the name, dimensions, metrics,
rollup, etc. of the segment.
@@ -614,9 +613,9 @@ The Druid input source is _splittable_ and can be used by the [Parallel task](./
This input source has a fixed input format for reading from Druid segments;
no `inputFormat` field needs to be specified in the ingestion spec when using
this input source.
-|property|description|required?|
+|Property|Description|Required|
|--------|-----------|---------|
-|type|This should be "druid".|yes|
+|type|Set the value to `druid`.|yes|
|dataSource|A String defining the Druid datasource to fetch rows from|yes|
|interval|A String representing an ISO-8601 interval, which defines the time range to fetch the data over.|yes|
|filter| See [Filters](../querying/filters.md). Only rows that match the filter, if specified, will be returned.|no|
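Combining the required properties above, a minimal `druid` input source might look like the following sketch (the datasource name and interval are illustrative):

```json
"inputSource": {
  "type": "druid",
  "dataSource": "wikipedia",
  "interval": "2013-01-01/2013-01-02"
}
```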
@@ -696,7 +695,7 @@ rolled-up datasource `wikipedia_rollup` by grouping on hour, "countryName", and
> [`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`](../configuration/index.md#indexer-general-configuration)
> to `true` to enable a compatibility mode where the timestampSpec is ignored.
-## SQL Input Source
+## SQL input source
The SQL input source is used to read data directly from RDBMS.
The SQL input source is _splittable_ and can be used by the [Parallel
task](./native-batch.md), where each worker task will read from one SQL query
from the list of queries.
@@ -704,14 +703,14 @@ This input source does not support Split Hint Spec.
Since this input source has a fixed input format for reading events, no
`inputFormat` field needs to be specified in the ingestion spec when using this
input source.
Please refer to the Recommended practices section below before using this
input source.
-|property|description|required?|
+|Property|Description|Required|
|--------|-----------|---------|
-|type|This should be "sql".|Yes|
+|type|Set the value to `sql`.|Yes|
|database|Specifies the database connection details. The database type corresponds to the extension that supplies the `connectorConfig` support. The specified extension must be loaded into Druid:<br/><br/><ul><li>[mysql-metadata-storage](../development/extensions-core/mysql.md) for `mysql`</li><li> [postgresql-metadata-storage](../development/extensions-core/postgresql.md) extension for `postgresql`.</li></ul><br/><br/>You can selectively allow JDBC properties in `connectURI`. See [JDBC [...]
|foldCase|Toggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.|No|
|sqls|List of SQL queries where each SQL query would retrieve the data to be indexed.|Yes|
-An example SqlInputSource spec is shown below:
+The following is an example of an SQL input source spec:
```json
...
@@ -738,7 +737,7 @@ Each of the SQL queries will be run in its own sub-task and thus for the above e
**Recommended practices**
-Compared to the other native batch InputSources, SQL InputSource behaves
differently in terms of reading the input data and so it would be helpful to
consider the following points before using this InputSource in a production
environment:
+Compared to the other native batch input sources, SQL input source behaves
differently in terms of reading the input data. Therefore, consider the
following points before using this input source in a production environment:
* During indexing, each sub-task would execute one of the SQL queries and the
results are stored locally on disk. The sub-tasks then proceed to read the data
from these local input files and generate segments. Presently, there isn’t any
restriction on the size of the generated files and this would require the
MiddleManagers or Indexers to have sufficient disk capacity based on the volume
of data being indexed.
@@ -749,18 +748,21 @@ Compared to the other native batch InputSources, SQL InputSource behaves differe
* Similar to file-based input formats, any updates to existing data will
replace the data in segments specific to the intervals specified in the
`granularitySpec`.
-## Combining input sources
+## Combining input source
-The Combining input source is used to read data from multiple InputSources. This input source should be only used if all the delegate input sources are _splittable_ and can be used by the [Parallel task](./native-batch.md). This input source will identify the splits from its delegates and each split will be processed by a worker task. Similar to other input sources, this input source supports a single `inputFormat`. Therefore, please note that delegate input sources requiring an `inputFormat` must have the same format for input data.
+The Combining input source lets you read data from multiple input sources.
+It identifies the splits from delegate input sources and uses a worker task to process each split.
+Use the Combining input source only if all the delegates are splittable and can be used by the [Parallel task](./native-batch.md).
-|property|description|required?|
-|--------|-----------|---------|
-|type|This should be "combining".|Yes|
-|delegates|List of _splittable_ InputSources to read data from.|Yes|
+Similar to other input sources, the Combining input source supports a single `inputFormat`.
+Delegate input sources that require an `inputFormat` must have the same format for input data.
-Sample spec:
+|Property|Description|Required|
+|--------|-----------|---------|
+|type|Set the value to `combining`.|Yes|
+|delegates|List of splittable input sources to read data from.|Yes|
+The following is an example of a Combining input source spec:
```json
...
@@ -790,3 +792,9 @@ Sample spec:
...
```
+The [secondary partitioning method](native-batch.md#partitionsspec) determines the requisite number of concurrent worker tasks that run in parallel to complete ingestion with the Combining input source.
+Set this value in `maxNumConcurrentSubTasks` in `tuningConfig` based on the secondary partitioning method:
+- `range` or `single_dim` partitioning: greater than or equal to 1
+- `hashed` or `dynamic` partitioning: greater than or equal to 2
+
+For more information on the `maxNumConcurrentSubTasks` field, see [Implementation considerations](native-batch.md#implementation-considerations).
\ No newline at end of file
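Following the rule above, a `tuningConfig` for `hashed` or `dynamic` partitioning with the Combining input source might be sketched as (the value 2 is the stated minimum for these methods; other fields are omitted):

```json
"tuningConfig": {
  "type": "index_parallel",
  "partitionsSpec": {
    "type": "hashed"
  },
  "maxNumConcurrentSubTasks": 2
}
```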
diff --git a/docs/ingestion/native-batch.md b/docs/ingestion/native-batch.md
index c441c39aeb..1ecba43741 100644
--- a/docs/ingestion/native-batch.md
+++ b/docs/ingestion/native-batch.md
@@ -717,6 +717,6 @@ For details on available input sources see:
- [Druid input Source](./native-batch-input-source.md#druid-input-source) (`druid`) reads data from a Druid datasource.
- [SQL input Source](./native-batch-input-source.md#sql-input-source) (`sql`) reads data from a RDBMS source.
-For information on how to combine input sources, see [Combining input sources](./native-batch-input-source.md#combining-input-sources).
+For information on how to combine input sources, see [Combining input source](./native-batch-input-source.md#combining-input-source).
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]