This is an automated email from the ASF dual-hosted git repository.
cwylie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new ceda1e98b9 docs: add docs for schema auto-discovery (#14065)
ceda1e98b9 is described below
commit ceda1e98b91c50f96e112ecb275a7acf3955cfb8
Author: 317brian <[email protected]>
AuthorDate: Wed May 17 01:36:02 2023 -0700
docs: add docs for schema auto-discovery (#14065)
* wip schemaless
* wip
* more cleanup
* update tuningconfig example
* updates based on feedback from clint
* remove errant comma
* update dimension object to include auto
* update to include string schemaless way
* fix spelling errors
* updates for type-aware and string-based changes
* Update docs/ingestion/schema-design.md
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* update spelling file
* Update docs/ingestion/schema-design.md
Co-authored-by: Clint Wylie <[email protected]>
* copyedits
* fix anchor
---------
Co-authored-by: Katya Macedo <[email protected]>
Co-authored-by: Clint Wylie <[email protected]>
---
docs/configuration/index.md | 4 +--
docs/ingestion/ingestion-spec.md | 30 +++++++++++++++++-----
docs/ingestion/schema-design.md | 54 ++++++++++++++++++++++++++++++++++------
docs/ingestion/tasks.md | 2 +-
website/.spelling | 1 +
5 files changed, 75 insertions(+), 16 deletions(-)
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 85437dc12b..42542c35ea 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -1526,7 +1526,7 @@ Additional peon configs include:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will
attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
the string-based schemaless ingestion and don't specify any dimensions to
ingest, you must also set
[`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyCo [...]
|`druid.indexer.task.tmpStorageBytesPerTask`|Maximum number of bytes per task
to be used to store temporary files on disk. This config is generally intended
for internal usage. Attempts to set it are very likely to be overwritten by
the TaskRunner that executes the task, so be sure of what you expect to happen
before directly adjusting this configuration parameter. The config is
documented here primarily to provide an understanding of what it means if/when
someone sees that it has been [...]
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests
served by a task's chat handler. Set to 0 to disable limiting.|0|
@@ -1595,7 +1595,7 @@ then the value from the configuration below is used:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt
to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). <br/><br/>If
you set `storeEmptyColumns` to false, Druid SQL queries referencing empty
columns will fail. If you intend to leave `storeEmptyColumns` disabled, you
should either ingest placeholder data for empty columns or else not query on
empty colu [...]
|`druid.peon.taskActionClient.retry.minWait`|The minimum retry time to
communicate with Overlord.|PT5S|
|`druid.peon.taskActionClient.retry.maxWait`|The maximum retry time to
communicate with Overlord.|PT1M|
|`druid.peon.taskActionClient.retry.maxRetryCount`|The maximum number of
retries to communicate with Overlord.|60|
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index 079f8e9f19..126a40ca92 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -24,7 +24,7 @@ description: Reference for the configuration options in the
ingestion spec.
~ under the License.
-->
-All ingestion methods use ingestion tasks to load data into Druid. Streaming
ingestion uses ongoing supervisors that run and supervise a set of tasks over
time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md).
All types of ingestion use an _ingestion spec_ to configure ingestion.
+All ingestion methods use ingestion tasks to load data into Druid. Streaming
ingestion uses ongoing supervisors that run and supervise a set of tasks over
time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md).
Except for SQL-based ingestion, all ingestion methods use an _ingestion spec_
to configure the ingestion.
Ingestion specs consist of three main components:
@@ -186,9 +186,19 @@ Treat `__time` as a millisecond timestamp: the number of
milliseconds since Jan
### `dimensionsSpec`
The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is
responsible for
-configuring [dimensions](./data-model.md#dimensions). An example
`dimensionsSpec` is:
+configuring [dimensions](./data-model.md#dimensions).
-```
+You can either manually specify the dimensions or take advantage of schema
auto-discovery, which lets Druid infer all or some of the schema for your data
so that you don't have to explicitly specify each dimension and its type.
+
+To use schema auto-discovery, set `useSchemaDiscovery` to `true`.
+
+Alternatively, you can use string-based schemaless ingestion, in which any
discovered dimensions are treated as strings. To do so, leave
`useSchemaDiscovery` set to `false` (the default), and either leave the
dimensions list empty or set the `includeAllDimensions` property to `true`.
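As an illustrative sketch (not part of this commit's diff), a string-based schemaless `dimensionsSpec` might look like the following; the field names follow the `dimensionsSpec` reference in this document:

```json
"dimensionsSpec" : {
  "dimensions": [],
  "dimensionExclusions": [],
  "includeAllDimensions": true
}
```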
+
+The following `dimensionsSpec` example uses schema auto-discovery
(`"useSchemaDiscovery": true`) in conjunction with explicitly defined
dimensions to have Druid infer some of the schema for the data:
+
+```json
"dimensionsSpec" : {
"dimensions": [
"page",
@@ -196,10 +206,12 @@ configuring [dimensions](./data-model.md#dimensions). An
example `dimensionsSpec
{ "type": "long", "name": "userId" }
],
"dimensionExclusions" : [],
- "spatialDimensions" : []
+ "spatialDimensions" : [],
+ "useSchemaDiscovery": true
}
```
+
> Conceptually, after input data records are read, Druid applies ingestion
> spec components in a particular order:
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then
> [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
> and finally [`dimensionsSpec`](#dimensionsspec) and
> [`metricsSpec`](#metricsspec). Keep this in mind when writing
@@ -212,7 +224,9 @@ A `dimensionsSpec` can have the following components:
| `dimensions` | A list of [dimension names or
objects](#dimension-objects). You cannot include the same column in both
`dimensions` and `dimensionExclusions`.<br /><br />If `dimensions` and
`spatialDimensions` are both null or empty arrays, Druid treats all columns
other than timestamp or metrics that do not appear in `dimensionExclusions` as
String-typed dimension columns. See [inclusions and
exclusions](#inclusions-and-exclusions) for details.<br /><br />As a best
practice, [...]
| `dimensionExclusions` | The names of dimensions to exclude from ingestion.
Only names are supported here, not objects.<br /><br />This list is only used
if the `dimensions` and `spatialDimensions` lists are both null or empty
arrays; otherwise it is ignored. See [inclusions and
exclusions](#inclusions-and-exclusions) below for details.
[...]
| `spatialDimensions` | An array of [spatial
dimensions](../development/geo.md).
[...]
-| `includeAllDimensions` | You can set `includeAllDimensions` to true to
ingest both explicit dimensions in the `dimensions` field and other dimensions
that the ingestion task discovers from input data. In this case, the explicit
dimensions will appear first in order that you specify them and the dimensions
dynamically discovered will come after. This flag can be useful especially with
auto schema discovery using [`flattenSpec`](./data-formats.md#flattenspec). If
this is not set and the [...]
+| `includeAllDimensions` | This field applies only to string-based schema
discovery, where Druid ingests discovered dimensions as strings; it does not
apply to type-aware schema auto-discovery, where Druid infers the type of the data.
You can set `includeAllDimensions` to true to ingest both explicit dimensions
in the `dimensions` field and other dimensions that the ingestion task
discovers from input data. In this case, the explicit dimensions will appear
first in the order that you speci [...]
+| `useSchemaDiscovery` | Configure Druid to use schema auto-discovery to
discover some or all of the dimensions and types for your data. For any
dimensions that aren't a uniform type, Druid ingests them as JSON. You can use
this for native batch or streaming ingestion. | false |
+
#### Dimension objects
@@ -223,7 +237,7 @@ Dimension objects can have the following components:
| Field | Description | Default |
|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, `double`, or `json`. | `string` |
+| type | Either `auto`, `string`, `long`, `float`, `double`, or `json`. For
the `auto` type, Druid determines the most appropriate type for the dimension
and assigns one of the following: STRING, ARRAY<STRING>, LONG, ARRAY<LONG>,
DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested'
format. When Druid infers the schema with schema auto-discovery, the type is
`auto`. | `string` |
| name | The name of the dimension. This will be used as the field name to
read from input records, as well as the column name stored in generated
segments.<br /><br />Note that you can use a [`transformSpec`](#transformspec)
if you want to rename columns during ingestion time. | none (required) |
| createBitmapIndex | For `string` typed dimensions, whether or not bitmap
indexes should be created for the column in generated segments. Creating a
bitmap index requires more storage, but speeds up certain kinds of filtering
(especially equality and prefix filtering). Only supported for `string` typed
dimensions. | `true` |
| multiValueHandling | For `string` typed dimensions, specifies the type of
handling for [multi-value fields](../querying/multi-value-dimensions.md).
Possible values are `array` (ingest string arrays as-is), `sorted_array` (sort
string arrays during ingestion), and `sorted_set` (sort and de-duplicate string
arrays during ingestion). This parameter is ignored for types other than
`string`. | `sorted_array` |
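For illustration (a sketch, not taken from this diff), a `dimensions` list can mix an `auto` typed object with explicitly typed entries; `userAttributes`, `userId`, and `page` below are hypothetical column names:

```json
"dimensions": [
  { "type": "auto", "name": "userAttributes" },
  { "type": "long", "name": "userId" },
  "page"
]
```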
@@ -234,6 +248,8 @@ Druid will interpret a `dimensionsSpec` in two possible
ways: _normal_ or _schem
Normal interpretation occurs when either `dimensions` or `spatialDimensions`
is non-empty. In this case, the combination of the two lists will be taken as
the set of dimensions to be ingested, and the list of `dimensionExclusions`
will be ignored.
+> The following description of schemaless refers to string-based schemaless
ingestion, where Druid treats discovered dimensions as strings. We recommend
that you use type-aware schema auto-discovery instead, where Druid infers the
type of each dimension. For more information, see [`dimensionsSpec`](#dimensionsspec).
+
Schemaless interpretation occurs when both `dimensions` and
`spatialDimensions` are empty or null. In this case, the set of dimensions is
determined in the following way:
1. First, start from the set of all root-level fields from the input record,
as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes
all fields at the top level of a data structure, but does not included fields
nested within maps or lists. To extract these, you must use a
[`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data
formats, such as CSV and delimited text, are considered root-level.
@@ -244,6 +260,8 @@ Schemaless interpretation occurs when both `dimensions` and
`spatialDimensions`
6. Any field with the same name as an aggregator from the
[metricsSpec](#metricsspec) is excluded.
7. All other fields are ingested as `string` typed dimensions with the
[default settings](#dimension-objects).
+Additionally, if you have empty columns that you want to include in
string-based schemaless ingestion, you must set the `storeEmptyColumns` context
parameter to `true`.
+
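As a hedged sketch, setting that context parameter in a task spec might look like the following (the surrounding task fields are omitted and assumed):

```json
"context": {
  "storeEmptyColumns": true
}
```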
> Note: Fields generated by a [`transformSpec`](#transformspec) are not
> currently considered candidates for
> schemaless dimension interpretation.
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
index f006e792bc..eaada3651b 100644
--- a/docs/ingestion/schema-design.md
+++ b/docs/ingestion/schema-design.md
@@ -107,7 +107,7 @@ to compute percentiles or quantiles, use Druid's
[approximate aggregators](../qu
row in your Druid datasource. This can be useful if you want to store data at
a different time granularity than it is
naturally emitted. It is also useful if you want to combine timeseries and
non-timeseries data in the same datasource.
* If you don't know ahead of time what columns you'll want to ingest, use an
empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+[automatic detection of dimension
columns](#schema-auto-discovery-for-dimensions).
### Log aggregation model
@@ -120,8 +120,7 @@ you must be more explicit. Druid columns have types
specific upfront.
Tips for modeling log data in Druid:
-* If you don't know ahead of time what columns you'll want to ingest, use an
empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+* If you don't know ahead of time what columns to ingest, you can have Druid
perform [schema auto-discovery](#schema-auto-discovery-for-dimensions).
* If you have nested data, you can ingest it using the [nested
columns](../querying/nested-columns.md) feature or flatten it using a
[`flattenSpec`](./ingestion-spec.md#flattenspec).
* Consider enabling [rollup](./rollup.md) if you have mainly analytical use
cases for your log data. This will
mean you lose the ability to retrieve individual events from Druid, but you
potentially gain substantial compression and
@@ -241,12 +240,53 @@ You should query for the number of ingested rows with:
]
```
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery)
where Druid infers the schema and type for your data. Type-aware schema
discovery is an experimental feature currently available for native batch and
streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all
the discovered columns are typed as either native string or multi-value string
columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools
depending on how they handle ARRAY typed columns.
+
+You can have Druid infer the schema and types for your data partially or fully
by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or
no dimensions in the dimensions list.
+
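For example (an illustrative sketch, not part of this diff), a spec that lets Druid discover and type all dimensions could be as minimal as:

```json
"dimensionsSpec" : {
  "dimensions": [],
  "useSchemaDiscovery": true
}
```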
+When performing type-aware schema discovery, Druid can discover all of the
columns of your input data (that aren't in
+the exclusion list). Druid automatically chooses the most appropriate native
Druid type among `STRING`, `LONG`,
+`DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>`
for nested data. For input formats with
+native boolean types, Druid ingests these values as strings if
`druid.expressions.useStrictBooleans` is set to `false`
+(the default), or longs if set to `true` (for more SQL compatible behavior).
Array typed columns can be queried using
+the [array functions](../querying/sql-array-functions.md) or
[UNNEST](../querying/sql-functions.md#unnest). Nested
+columns can be queried with the [JSON
functions](../querying/sql-json-functions.md).
+
+Mixed type columns are stored in the _least_ restrictive type that can
represent all values in the column. For example:
+
+- Mixed numeric columns are `DOUBLE`
+- If there are any strings present, then the column is a `STRING`
+- If there are arrays, then the column becomes an array with the least
restrictive element type
+- Any nested data or arrays of nested data become `COMPLEX<json>` nested
columns.
+
+If you're already using string-based schema discovery and want to migrate, see
[Migrating to type-aware schema
discovery](#migrating-to-type-aware-schema-discovery).
+
+#### String-based schema discovery
+
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can
still use string-based schema discovery for ingestion if either of the
following conditions is met:
+
+- The dimension list is empty
+- You set `includeAllDimensions` to `true`
+
+Druid coerces primitives and arrays of primitive types into the native Druid
string type. Nested data structures and arrays of nested data structures are
ignored and not ingested.
+
+#### Migrating to type-aware schema discovery
-If the `dimensions` field is left empty in your ingestion spec, Druid will
treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you previously used string-based schema discovery and want to migrate to
type-aware schema discovery, do the following:
-Note that when using schema-less ingestion, all dimensions will be ingested as
String-typed dimensions.
+- Update any queries that use multi-value dimensions (MVDs) to use UNNEST in
conjunction with other functions so that they no longer rely on MVD behavior.
Type-aware schema discovery generates ARRAY typed columns instead of MVDs, so
queries that use any MVD features will fail.
+- Be aware of mixed typed inputs and test how type-aware schema discovery
handles them. Druid attempts to cast them as the least restrictive type.
+- If you notice issues with numeric types, you may need to explicitly cast
them. Generally, Druid handles the coercion for you.
+- Update your dimension exclusion list and add any nested columns if you want
to continue to exclude them. String-based schema discovery automatically
ignores nested columns, but type-aware schema discovery will ingest them.
### Including the same column as a dimension and a metric
diff --git a/docs/ingestion/tasks.md b/docs/ingestion/tasks.md
index 979d6d88d7..95e61f88dc 100644
--- a/docs/ingestion/tasks.md
+++ b/docs/ingestion/tasks.md
@@ -392,7 +392,7 @@ The following parameters apply to all task types.
|`forceTimeChunkLock`|true|_Setting this to false is still experimental_<br/>
Force to always use time chunk lock. If not set, each task automatically
chooses a lock type to use. If set, this parameter overwrites
`druid.indexer.tasklock.forceTimeChunkLock` [configuration for the
overlord](../configuration/index.md#overlord-operations). See
[Locking](#locking) for more details.|
|`priority`|Different based on task types. See [Priority](#priority).|Task
priority|
|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or
later|Enable the new lineage-based segment allocation protocol for the native
Parallel task with dynamic partitioning. This option should be off during the
replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21
to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to
ensure data correctness.|
-|`storeEmptyColumns`|true|Boolean value for whether or not to store empty
columns during ingestion. When set to true, Druid stores every column specified
in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](ingestion-spec.md#dimensionsspec) for Druid to
store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid
SQL queries referencing empty colu [...]
+|`storeEmptyColumns`|true|Boolean value for whether or not to store empty
columns during ingestion. When set to true, Druid stores every column specified
in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec). <br/><br/>If you
set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns
will fail. If you intend to leave `storeEmptyColumns` disabled, you should
either ingest dummy data for empty columns or else not query on empty
columns.<br/><br/>When set in the tas [...]
## Task logs
diff --git a/website/.spelling b/website/.spelling
index 910912d46c..01b5f27ead 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -375,6 +375,7 @@ misconfigured
mostAvailableSize
multitenancy
multitenant
+MVDs
mysql
namespace
namespaced
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]