This is an automated email from the ASF dual-hosted git repository.
jihoonson pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new d326c681c1 Document config for ingesting null columns (#12389)
d326c681c1 is described below
commit d326c681c1c43724266f84d3ede31ac8299cafe8
Author: Victoria Lim <[email protected]>
AuthorDate: Tue Apr 5 09:15:42 2022 -0700
Document config for ingesting null columns (#12389)
* config for ingesting null columns
* add link
* edit .spelling
* what happens if storeEmptyColumns is disabled
---
docs/configuration/index.md | 2 ++
docs/ingestion/native-batch-simple-task.md | 12 +++++++-----
docs/ingestion/native-batch.md | 17 ++++++++++-------
docs/ingestion/tasks.md | 21 +++++++++++----------
website/.spelling | 2 +-
5 files changed, 31 insertions(+), 23 deletions(-)
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 6845161176..2999e9c10b 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -1432,6 +1432,7 @@ Additional peon configs include:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will
attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests
served by a task's chat handler. Set to 0 to disable limiting.|0|
If the peon is running in remote mode, there must be an Overlord up and
running. Peons in remote mode can set the following configurations:
@@ -1497,6 +1498,7 @@ then the value from the configuration below is used:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt
to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
|`druid.peon.taskActionClient.retry.minWait`|The minimum retry time to
communicate with Overlord.|PT5S|
|`druid.peon.taskActionClient.retry.maxWait`|The maximum retry time to
communicate with Overlord.|PT1M|
|`druid.peon.taskActionClient.retry.maxRetryCount`|The maximum number of
retries to communicate with Overlord.|60|
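As a sketch of how the newly documented peon property reads alongside the existing task settings, a `runtime.properties` fragment might look like the following (the values shown are illustrative, not recommendations):

```properties
# MiddleManager/peon task settings (illustrative values)
druid.indexer.task.restoreTasksOnRestart=false
# Skip persisting columns that receive no input data during ingestion
druid.indexer.task.storeEmptyColumns=false
```

With `storeEmptyColumns=false`, columns listed in the `dimensionsSpec` but absent from the input are not written to segments, which is what the truncated table rows above go on to describe.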
diff --git a/docs/ingestion/native-batch-simple-task.md
b/docs/ingestion/native-batch-simple-task.md
index 009f626c88..bb931a50ed 100644
--- a/docs/ingestion/native-batch-simple-task.md
+++ b/docs/ingestion/native-batch-simple-task.md
@@ -25,7 +25,7 @@ sidebar_label: "Simple task indexing"
The simple task (type `index`) is designed to ingest small data sets into
Apache Druid. The task executes within the indexing service. For general
information on native batch indexing and parallel task indexing, see [Native
batch ingestion](./native-batch.md).
-## Task syntax
+## Simple task example
A sample task is shown below:
@@ -94,12 +94,14 @@ A sample task is shown below:
}
```
+## Simple task configuration
+
|property|description|required?|
|--------|-----------|---------|
|type|The task type. This should always be `index`.|yes|
|id|The task ID. If this is not explicitly specified, Druid generates the task
ID using task type, data source name, interval, and date-time stamp. |no|
-|spec|The ingestion spec including the data schema, IOConfig, and
TuningConfig. See below for more details. |yes|
-|context|Context containing various task configuration parameters. See below
for more details.|no|
+|spec|The ingestion spec including the [data schema](#dataschema), [IO
config](#ioconfig), and [tuning config](#tuningconfig).|yes|
+|context|Context to specify various task configuration parameters. See [Task
context parameters](tasks.md#context-parameters) for more details.|no|
### `dataSchema`
@@ -175,11 +177,11 @@ For best-effort rollup, you should use `dynamic`.
|-----|----|-----------|--------|
|type|String|See [Additional Peon Configuration:
SegmentWriteOutMediumFactory](../configuration/index.md#segmentwriteoutmediumfactory)
for explanation and available options.|yes|
-### Segment pushing modes
+## Segment pushing modes
While ingesting data using simple task indexing, Druid creates segments
from the input data and pushes them. The simple task index supports the
following segment pushing modes, based on your type of [rollup](./rollup.md):
- Bulk pushing mode: Used for perfect rollup. Druid pushes every segment at
the very end of the index task. Until then, Druid stores created segments in
memory and local storage of the service running the index task. This mode can
cause problems if you have limited storage capacity, and is not recommended to
use in production.
To enable bulk pushing mode, set `forceGuaranteedRollup` in your TuningConfig.
You cannot use bulk pushing with `appendToExisting` in your IOConfig.
-- Incremental pushing mode: Used for best-effort rollup. Druid pushes segments
are incrementally during the course of the indexing task. The index task
collects data and stores created segments in the memory and disks of the
services running the task until the total number of collected rows exceeds
`maxTotalRows`. At that point the index task immediately pushes all segments
created up until that moment, cleans up pushed segments, and continues to
ingest the remaining data.
\ No newline at end of file
+- Incremental pushing mode: Used for best-effort rollup. Druid pushes segments
incrementally during the course of the indexing task. The index task
collects data and stores created segments in the memory and disks of the
services running the task until the total number of collected rows exceeds
`maxTotalRows`. At that point the index task immediately pushes all segments
created up until that moment, cleans up pushed segments, and continues to
ingest the remaining data.
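The two pushing modes described in this file are selected through the task's tuning config; as a hypothetical sketch (field values are illustrative), an incremental-pushing setup could look like:

```json
{
  "tuningConfig": {
    "type": "index",
    "forceGuaranteedRollup": false,
    "partitionsSpec": { "type": "dynamic" },
    "maxTotalRows": 20000000
  }
}
```

Setting `forceGuaranteedRollup` to true instead selects bulk pushing mode, which is incompatible with `appendToExisting` in the IOConfig.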
diff --git a/docs/ingestion/native-batch.md b/docs/ingestion/native-batch.md
index ce61af27c9..141a54c3c4 100644
--- a/docs/ingestion/native-batch.md
+++ b/docs/ingestion/native-batch.md
@@ -183,13 +183,16 @@ The following example illustrates the configuration for a
parallel indexing task
}
}
```
+
+## Parallel indexing configuration
+
The following table defines the primary sections of the input spec:
|property|description|required?|
|--------|-----------|---------|
-|type|The task type. For parallel task, set the value to `index_parallel`.|yes|
-|id|The task ID. If omitted, Druid generates the task ID using task type, data
source name, interval, and date-time stamp. |no|
-|spec|The ingestion spec that defines: the data schema, IO config, and tuning
config. See [`ioConfig`](#ioconfig) for more details. |yes|
-|context|Context to specify various task configuration parameters.|no|
+|type|The task type. For parallel task indexing, set the value to
`index_parallel`.|yes|
+|id|The task ID. If omitted, Druid generates the task ID using the task type,
data source name, interval, and date-time stamp. |no|
+|spec|The ingestion spec that defines the [data schema](#dataschema), [IO
config](#ioconfig), and [tuning config](#tuningconfig).|yes|
+|context|Context to specify various task configuration parameters. See [Task
context parameters](tasks.md#context-parameters) for more details.|no|
### `dataSchema`
@@ -409,7 +412,7 @@ Use "countryName" or both "countryName" and "cityName" in
the `WHERE` clause of
|maxRowsPerSegment|Soft max for the number of rows to include in a
partition.|none|either this or `targetRowsPerSegment`|
|assumeGrouped|Assume that input data has already been grouped on time and
dimensions. Ingestion will run faster, but may choose sub-optimal partitions if
this assumption is violated.|false|no|
-### HTTP status endpoints
+## HTTP status endpoints
The supervisor task provides some HTTP endpoints to get running status.
@@ -637,7 +640,7 @@ An example of the result is
Returns the task attempt history of the worker task spec of the given id, or
HTTP 404 Not Found error if the supervisor task is running in the sequential
mode.
-### Segment pushing modes
+## Segment pushing modes
While ingesting data using parallel task indexing, Druid creates segments
from the input data and pushes them. The parallel task index supports the
following segment pushing modes, based on your type of [rollup](./rollup.md):
@@ -645,7 +648,7 @@ the parallel task index supports the following segment
pushing modes based upon
To enable bulk pushing mode, set `forceGuaranteedRollup` in your TuningConfig.
You cannot use bulk pushing with `appendToExisting` in your IOConfig.
- Incremental pushing mode: Used for best-effort rollup. Druid pushes segments
incrementally during the course of the indexing task. The index task
collects data and stores created segments in the memory and disks of the
services running the task until the total number of collected rows exceeds
`maxTotalRows`. At that point the index task immediately pushes all segments
created up until that moment, cleans up pushed segments, and continues to
ingest the remaining data.
-### Capacity planning
+## Capacity planning
The supervisor task can create up to `maxNumConcurrentSubTasks` worker tasks
no matter how many task slots are currently available.
As a result, the total number of tasks that can run at the same time is
`(maxNumConcurrentSubTasks + 1)` (including the supervisor task).
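Putting the primary spec sections and the capacity-planning note together, a skeletal `index_parallel` payload might look like the following (a placeholder sketch, not a runnable spec; `example_datasource` and the values are assumptions):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { "dataSource": "example_datasource" },
    "ioConfig": { "type": "index_parallel" },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  },
  "context": { "taskLockTimeout": 300000 }
}
```

With `maxNumConcurrentSubTasks` set to 4, up to 5 task slots can be occupied at once: the 4 worker tasks plus the supervisor task.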
diff --git a/docs/ingestion/tasks.md b/docs/ingestion/tasks.md
index 91ee8d0261..54b7661b01 100644
--- a/docs/ingestion/tasks.md
+++ b/docs/ingestion/tasks.md
@@ -254,7 +254,7 @@ and Kinesis indexing services.
This section explains the task locking system in Druid. Druid's locking system
and versioning system are tightly coupled with each other to guarantee the
correctness of ingested data.
-## "Overshadowing" between segments
+### "Overshadowing" between segments
You can run a task to overwrite existing data. The segments created by an
overwriting task _overshadows_ existing segments.
Note that the overshadow relation holds only for the same time chunk and the
same data source.
@@ -277,9 +277,9 @@ Here are some examples.
- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor
version of `1` overshadows
another of the major version of `2019-01-01T00:00:00.000Z` and the minor
version of `0`.
-## Locking
+### Locking
-If you are running two or more [druid tasks](./tasks.md) which generate
segments for the same data source and the same time chunk,
+If you are running two or more [Druid tasks](./tasks.md) which generate
segments for the same data source and the same time chunk,
the generated segments could potentially overshadow each other, which could
lead to incorrect query results.
To avoid this problem, tasks will attempt to get locks prior to creating any
segment in Druid.
@@ -331,7 +331,7 @@ For example, Kafka indexing tasks of the same supervisor
have the same groupId a
<a name="priority"></a>
-## Lock priority
+### Lock priority
Each task type has a different default lock priority. The table below shows
the default priorities of different task types. The higher the number, the
higher the priority.
@@ -354,19 +354,20 @@ You can override the task priority by setting your
priority in the task context
## Context parameters
-The task context is used for various individual task configuration. The
following parameters apply to all task types.
+The task context is used for various individual task configurations.
+Specify task context configurations in the `context` field of the ingestion
spec.
+The following parameters apply to all task types.
|property|default|description|
|--------|-------|-----------|
-|`taskLockTimeout`|300000|task lock timeout in millisecond. For more details,
see [Locking](#locking).|
-|`forceTimeChunkLock`|true|_Setting this to false is still experimental_<br/>
Force to always use time chunk lock. If not set, each task automatically
chooses a lock type to use. If this set, it will overwrite the
`druid.indexer.tasklock.forceTimeChunkLock` [configuration for the
overlord](../configuration/index.md#overlord-operations). See
[Locking](#locking) for more details.|
+|`taskLockTimeout`|300000|Task lock timeout in milliseconds. For more details,
see [Locking](#locking).<br/><br/>When a task acquires a lock, it sends a
request via HTTP and waits until it receives a response containing the lock
acquisition result. As a result, an HTTP timeout error can occur if
`taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of Overlords.|
+|`forceTimeChunkLock`|true|_Setting this to false is still experimental._<br/>
Forces the task to always use a time chunk lock. If not set, each task
automatically chooses a lock type to use. If set, this parameter overwrites the
`druid.indexer.tasklock.forceTimeChunkLock` [configuration for the
overlord](../configuration/index.md#overlord-operations). See
[Locking](#locking) for more details.|
|`priority`|Varies by task type. See [Priority](#priority).|Task
priority|
|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or
later|Enable the new lineage-based segment allocation protocol for the native
Parallel task with dynamic partitioning. This option should be off during the
replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21
to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to
ensure data correctness.|
+|`storeEmptyColumns`|true|Boolean value for whether or not to store empty
columns during ingestion. When set to true, Druid stores every column specified
in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](ingestion-spec.md#dimensionsspec) for Druid to
store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid
SQL queries referencing empty colu [...]
-> When a task acquires a lock, it sends a request via HTTP and awaits until it
receives a response containing the lock acquisition result.
-> As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater
than `druid.server.http.maxIdleTime` of Overlords.
-## Task Logs
+## Task logs
Logs are created by ingestion tasks as they run. You can configure Druid to
push these into a repository for long-term storage after they complete.
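As a hypothetical illustration of the `context` field that the new paragraph in this file points to, parameters from the table above could be combined like this (the values are illustrative assumptions):

```json
{
  "type": "index_parallel",
  "spec": { },
  "context": {
    "taskLockTimeout": 300000,
    "storeEmptyColumns": false
  }
}
```

Context values set here apply only to this task; the corresponding `druid.indexer.task.*` properties set cluster-wide defaults.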
diff --git a/website/.spelling b/website/.spelling
index ade85e5966..e83d1b8598 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -396,6 +396,7 @@ rollups
rsync
runtime
schemas
+schemaless
searchable
secondaryPartitionPruning
seekable-stream
@@ -1957,7 +1958,6 @@ dimensionExclusions
expr
jackson-jq
missingValue
-schemaless
skipBytesInMemoryOverheadCheck
spatialDimensions
useFieldDiscovery
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]