This is an automated email from the ASF dual-hosted git repository.
jihoonson pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new d326c681c1 Document config for ingesting null columns (#12389)
d326c681c1 is described below
commit d326c681c1c43724266f84d3ede31ac8299cafe8
Author: Victoria Lim <[email protected]>
AuthorDate: Tue Apr 5 09:15:42 2022 -0700
Document config for ingesting null columns (#12389)
* config for ingesting null columns
* add link
* edit .spelling
* what happens if storeEmptyColumns is disabled
---
docs/configuration/index.md | 2 ++
docs/ingestion/native-batch-simple-task.md | 12 +++++++-----
docs/ingestion/native-batch.md | 17 ++++++++++-------
docs/ingestion/tasks.md | 21 +++++++++++----------
website/.spelling | 2 +-
5 files changed, 31 insertions(+), 23 deletions(-)
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 6845161176..2999e9c10b 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -1432,6 +1432,7 @@ Additional peon configs include:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will
attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests
served by a task's chat handler. Set to 0 to disable limiting.|0|
If the peon is running in remote mode, there must be an Overlord up and
running. Peons in remote mode can set the following configurations:
@@ -1497,6 +1498,7 @@ then the value from the configuration below is used:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt
to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
|`druid.peon.taskActionClient.retry.minWait`|The minimum retry time to
communicate with Overlord.|PT5S|
|`druid.peon.taskActionClient.retry.maxWait`|The maximum retry time to
communicate with Overlord.|PT1M|
|`druid.peon.taskActionClient.retry.maxRetryCount`|The maximum number of
retries to communicate with Overlord.|60|
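As a sketch of how the newly documented peon property reads alongside the existing task settings, a `runtime.properties` fragment might look like the following (the values shown are illustrative, not recommendations):

```properties
# MiddleManager/peon task settings (illustrative values)
druid.indexer.task.restoreTasksOnRestart=false
# Skip persisting columns that receive no input data during ingestion
druid.indexer.task.storeEmptyColumns=false
```

With `storeEmptyColumns=false`, columns listed in the `dimensionsSpec` but absent from the input are not written to segments, which is what the truncated table rows above go on to describe.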
diff --git a/docs/ingestion/native-batch-simple-task.md
b/docs/ingestion/native-batch-simple-task.md
index 009f626c88..bb931a50ed 100644
--- a/docs/ingestion/native-batch-simple-task.md
+++ b/docs/ingestion/native-batch-simple-task.md
@@ -25,7 +25,7 @@ sidebar_label: "Simple task indexing"
The simple task (type `index`) is designed to ingest small data sets into
Apache Druid. The task executes within the indexing service. For general
information on native batch indexing and parallel task indexing, see [Native
batch ingestion](./native-batch.md).
-## Task syntax
+## Simple task example
A sample task is shown below:
@@ -94,12 +94,14 @@ A sample task is shown below:
}
```
+## Simple task configuration
+
|property|description|required?|
|--------|-----------|---------|
|type|The task type. This should always be `index`.|yes|
|id|The task ID. If this is not explicitly specified, Druid generates the task
ID using task type, data source name, interval, and date-time stamp. |no|
-|spec|The ingestion spec including the data schema, IOConfig, and
TuningConfig. See below for more details. |yes|
-|context|Context containing various task configuration parameters. See below
for more details.|no|
+|spec|The ingestion spec including the [data schema](#dataschema), [IO
config](#ioconfig), and [tuning config](#tuningconfig).|yes|
+|context|Context to specify various task configuration parameters. See [Task
context parameters](tasks.md#context-parameters) for more details.|no|
### `dataSchema`
@@ -175,11 +177,11 @@ For best-effort rollup, you should use `dynamic`.
|-----|----|-----------|--------|
|type|String|See [Additional Peon Configuration:
SegmentWriteOutMediumFactory](../configuration/index.md#segmentwriteoutmediumfactory)
for explanation and available options.|yes|
-### Segment pushing modes
+## Segment pushing modes
While ingesting data using simple task indexing, Druid creates segments
from the input data and pushes them. The simple task index supports the
following segment pushing modes, based on your type of [rollup](./rollup.md):
- Bulk pushing mode: Used for perfect rollup. Druid pushes every segment at
the very end of the index task. Until then, Druid stores created segments in
memory and local storage of the service running the index task. This mode can
cause problems if you have limited storage capacity, and is not recommended to
use in production.
To enable bulk pushing mode, set `forceGuaranteedRollup` in your TuningConfig.
You cannot use bulk pushing with `appendToExisting` in your IOConfig.
-- Incremental pushing mode: Used for best-effort rollup. Druid pushes segments
are incrementally during the course of the indexing task. The index task
collects data and stores created segments in the memory and disks of the
services running the task until the total number of collected rows exceeds
`maxTotalRows`. At that point the index task immediately pushes all segments
created up until that moment, cleans up pushed segments, and continues to
ingest the remaining data.
\ No newline at end of file
+- Incremental pushing mode: Used for best-effort rollup. Druid pushes segments
incrementally during the course of the indexing task. The index task
collects data and stores created segments in the memory and disks of the
services running the task until the total number of collected rows exceeds
`maxTotalRows`. At that point the index task immediately pushes all segments
created up until that moment, cleans up pushed segments, and continues to
ingest the remaining data.
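The two pushing modes described in this file are selected through the task's tuning config; as a hypothetical sketch (field values are illustrative), an incremental-pushing setup could look like:

```json
{
  "tuningConfig": {
    "type": "index",
    "forceGuaranteedRollup": false,
    "partitionsSpec": { "type": "dynamic" },
    "maxTotalRows": 20000000
  }
}
```

Setting `forceGuaranteedRollup` to true instead selects bulk pushing mode, which is incompatible with `appendToExisting` in the IOConfig.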
diff --git a/docs/ingestion/native-batch.md b/docs/ingestion/native-batch.md
index ce61af27c9..141a54c3c4 100644
--- a/docs/ingestion/native-batch.md
+++ b/docs/ingestion/native-batch.md
@@ -183,13 +183,16 @@ The following example illustrates the configuration for a
parallel indexing task
}
}
```
+
+## Parallel indexing configuration
+
The following table defines the primary sections of the input spec:
|property|description|required?|
|--------|-----------|---------|
-|type|The task type. For parallel task, set the value to `index_parallel`.|yes|
-|id|The task ID. If omitted, Druid generates the task ID using task type, data
source name, interval, and date-time stamp. |no|
-|spec|The ingestion spec that defines: the data schema, IO config, and tuning
config. See [`ioConfig`](#ioconfig) for more details. |yes|
-|context|Context to specify various task configuration parameters.|no|
+|type|The task type. For parallel task indexing, set the value to
`index_parallel`.|yes|
+|id|The task ID. If omitted, Druid generates the task ID using the task type,
data source name, interval, and date-time stamp. |no|
+|spec|The ingestion spec that defines the [data schema](#dataschema), [IO
config](#ioconfig), and [tuning config](#tuningconfig).|yes|
+|context|Context to specify various task configuration parameters. See [Task
context parameters](tasks.md#context-parameters) for more details.|no|
### `dataSchema`
@@ -409,7 +412,7 @@ Use "countryName" or both "countryName" and "cityName" in
the `WHERE` clause of
|maxRowsPerSegment|Soft max for the number of rows to include in a
partition.|none|either this or `targetRowsPerSegment`|
|assumeGrouped|Assume that input data has already been grouped on time and
dimensions. Ingestion will run faster, but may choose sub-optimal partitions if
this assumption is violated.|false|no|
-### HTTP status endpoints
+## HTTP status endpoints
The supervisor task provides some HTTP endpoints to get running status.
@@ -637,7 +640,7 @@ An example of the result is
Returns the task attempt history of the worker task spec of the given id, or
HTTP 404 Not Found error if the supervisor task is running in the sequential
mode.
-### Segment pushing modes
+## Segment pushing modes
While ingesting data using parallel task indexing, Druid creates segments
from the input data and pushes them. The parallel task index supports the
following segment pushing modes, based on your type of [rollup](./rollup.md):
@@ -645,7 +648,7 @@ the parallel task index supports the following segment
pushing modes based upon
To enable bulk pushing mode, set `forceGuaranteedRollup` in your TuningConfig.
You cannot use bulk pushing with `appendToExisting` in your IOConfig.
- Incremental pushing mode: Used for best-effort rollup. Druid pushes segments
incrementally during the course of the indexing task. The index task
collects data and stores created segments in the memory and disks of the
services running the task until the total number of collected rows exceeds
`maxTotalRows`. At that point the index task immediately pushes all segments
created up until that moment, cleans up pushed segments, and continues to
ingest the remaining data.
-### Capacity planning
+## Capacity planning
The supervisor task can create up to `maxNumConcurrentSubTasks` worker tasks
no matter how many task slots are currently available.
As a result, the total number of tasks that can run at the same time is
`(maxNumConcurrentSubTasks + 1)` (including the supervisor task).
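Putting the primary spec sections and the capacity-planning note together, a skeletal `index_parallel` payload might look like the following (a placeholder sketch, not a runnable spec; `example_datasource` and the values are assumptions):

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": { "dataSource": "example_datasource" },
    "ioConfig": { "type": "index_parallel" },
    "tuningConfig": {
      "type": "index_parallel",
      "maxNumConcurrentSubTasks": 4
    }
  },
  "context": { "taskLockTimeout": 300000 }
}
```

With `maxNumConcurrentSubTasks` set to 4, up to 5 task slots can be occupied at once: the 4 worker tasks plus the supervisor task.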
diff --git a/docs/ingestion/tasks.md b/docs/ingestion/tasks.md
index 91ee8d0261..54b7661b01 100644
--- a/docs/ingestion/tasks.md
+++ b/docs/ingestion/tasks.md
@@ -254,7 +254,7 @@ and Kinesis indexing services.
This section explains the task locking system in Druid. Druid's locking system
and versioning system are tightly coupled with each other to guarantee the
correctness of ingested data.
-## "Overshadowing" between segments
+### "Overshadowing" between segments
You can run a task to overwrite existing data. The segments created by an
overwriting task _overshadows_ existing segments.
Note that the overshadow relation holds only for the same time chunk and the
same data source.
@@ -277,9 +277,9 @@ Here are some examples.
- A segment of the major version of `2019-01-01T00:00:00.000Z` and the minor
version of `1` overshadows
another of the major version of `2019-01-01T00:00:00.000Z` and the minor
version of `0`.
-## Locking
+### Locking
-If you are running two or more [druid tasks](./tasks.md) which generate
segments for the same data source and the same time chunk,
+If you are running two or more [Druid tasks](./tasks.md) which generate
segments for the same data source and the same time chunk,
the generated segments could potentially overshadow each other, which could
lead to incorrect query results.
To avoid this problem, tasks will attempt to get locks prior to creating any
segment in Druid.
@@ -331,7 +331,7 @@ For example, Kafka indexing tasks of the same supervisor
have the same groupId a
<a name="priority"></a>
-## Lock priority
+### Lock priority
Each task type has a different default lock priority. The table below shows
the default priorities of different task types. The higher the number, the
higher the priority.
@@ -354,19 +354,20 @@ You can override the task priority by setting your
priority in the task context
## Context parameters
-The task context is used for various individual task configuration. The
following parameters apply to all task types.
+The task context is used for various individual task configurations.
+Specify task context configurations in the `context` field of the ingestion
spec.
+The following parameters apply to all task types.
|property|default|description|
|--------|-------|-----------|
-|`taskLockTimeout`|300000|task lock timeout in millisecond. For more details,
see [Locking](#locking).|
-|`forceTimeChunkLock`|true|_Setting this to false is still experimental_<br/>
Force to always use time chunk lock. If not set, each task automatically
chooses a lock type to use. If this set, it will overwrite the
`druid.indexer.tasklock.forceTimeChunkLock` [configuration for the
overlord](../configuration/index.md#overlord-operations). See
[Locking](#locking) for more details.|
+|`taskLockTimeout`|300000|Task lock timeout in milliseconds. For more details,
see [Locking](#locking).<br/><br/>When a task acquires a lock, it sends a
request via HTTP and waits until it receives a response containing the lock
acquisition result. As a result, an HTTP timeout error can occur if
`taskLockTimeout` is greater than `druid.server.http.maxIdleTime` of Overlords.|
+|`forceTimeChunkLock`|true|_Setting this to false is still experimental._<br/>
Forces the task to always use a time chunk lock. If not set, each task
automatically chooses a lock type to use. If set, this parameter overwrites the
`druid.indexer.tasklock.forceTimeChunkLock` [configuration for the
overlord](../configuration/index.md#overlord-operations). See
[Locking](#locking) for more details.|
|`priority`|Varies by task type. See [Priority](#priority).|Task
priority|
|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or
later|Enable the new lineage-based segment allocation protocol for the native
Parallel task with dynamic partitioning. This option should be off during the
replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21
to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to
ensure data correctness.|
+|`storeEmptyColumns`|true|Boolean value for whether or not to store empty
columns during ingestion. When set to true, Druid stores every column specified
in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](ingestion-spec.md#dimensionsspec) for Druid to
store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid
SQL queries referencing empty colu [...]
-> When a task acquires a lock, it sends a request via HTTP and awaits until it
receives a response containing the lock acquisition result.
-> As a result, an HTTP timeout error can occur if `taskLockTimeout` is greater
than `druid.server.http.maxIdleTime` of Overlords.
-## Task Logs
+## Task logs
Logs are created by ingestion tasks as they run. You can configure Druid to
push these into a repository for long-term storage after they complete.
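As a hypothetical illustration of the `context` field that the new paragraph in this file points to, parameters from the table above could be combined like this (the values are illustrative assumptions):

```json
{
  "type": "index_parallel",
  "spec": { },
  "context": {
    "taskLockTimeout": 300000,
    "storeEmptyColumns": false
  }
}
```

Context values set here apply only to this task; the corresponding `druid.indexer.task.*` properties set cluster-wide defaults.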
diff --git a/website/.spelling b/website/.spelling
index ade85e5966..e83d1b8598 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -396,6 +396,7 @@ rollups
rsync
runtime
schemas
+schemaless
searchable
secondaryPartitionPruning
seekable-stream
@@ -1957,7 +1958,6 @@ dimensionExclusions
expr
jackson-jq
missingValue
-schemaless
skipBytesInMemoryOverheadCheck
spatialDimensions
useFieldDiscovery
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]