This is an automated email from the ASF dual-hosted git repository.
cwylie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new ceda1e98b9 docs: add docs for schema auto-discovery (#14065)
ceda1e98b9 is described below
commit ceda1e98b91c50f96e112ecb275a7acf3955cfb8
Author: 317brian <[email protected]>
AuthorDate: Wed May 17 01:36:02 2023 -0700
docs: add docs for schema auto-discovery (#14065)
* wip schemaless
* wip
* more cleanup
* update tuningconfig example
* updates based on feedback from clint
* remove errant comma
* update dimension object to include auto
* update to include string schemaless way
* fix spelling errors
* updates for type-aware and string-based changes
* Update docs/ingestion/schema-design.md
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* Apply suggestions from code review
Co-authored-by: Katya Macedo <[email protected]>
* update spelling file
* Update docs/ingestion/schema-design.md
Co-authored-by: Clint Wylie <[email protected]>
* copyedits
* fix anchor
---------
Co-authored-by: Katya Macedo <[email protected]>
Co-authored-by: Clint Wylie <[email protected]>
---
docs/configuration/index.md | 4 +--
docs/ingestion/ingestion-spec.md | 30 +++++++++++++++++-----
docs/ingestion/schema-design.md | 54 ++++++++++++++++++++++++++++++++++------
docs/ingestion/tasks.md | 2 +-
website/.spelling | 1 +
5 files changed, 75 insertions(+), 16 deletions(-)
diff --git a/docs/configuration/index.md b/docs/configuration/index.md
index 85437dc12b..42542c35ea 100644
--- a/docs/configuration/index.md
+++ b/docs/configuration/index.md
@@ -1526,7 +1526,7 @@ Additional peon configs include:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, MiddleManagers will
attempt to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
the string-based schemaless ingestion and don't specify any dimensions to
ingest, you must also set
[`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyCo [...]
|`druid.indexer.task.tmpStorageBytesPerTask`|Maximum number of bytes per task
to be used to store temporary files on disk. This config is generally intended
for internal usage. Attempts to set it are very likely to be overwritten by
the TaskRunner that executes the task, so be sure of what you expect to happen
before directly adjusting this configuration parameter. The config is
documented here primarily to provide an understanding of what it means if/when
someone sees that it has been [...]
|`druid.indexer.server.maxChatRequests`|Maximum number of concurrent requests
served by a task's chat handler. Set to 0 to disable limiting.|0|
@@ -1595,7 +1595,7 @@ then the value from the configuration below is used:
|`druid.indexer.task.hadoopWorkingPath`|Temporary working directory for Hadoop
tasks.|`/tmp/druid-indexing`|
|`druid.indexer.task.restoreTasksOnRestart`|If true, the Indexer will attempt
to stop tasks gracefully on shutdown and restore them on restart.|false|
|`druid.indexer.task.ignoreTimestampSpecForDruidInputSource`|If true, tasks
using the [Druid input source](../ingestion/native-batch-input-source.md) will
ignore the provided timestampSpec, and will use the `__time` column of the
input datasource. This option is provided for compatibility with ingestion
specs written before Druid 0.22.0.|false|
-|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](../ingestion/ingestion-spec.md#dimensionsspec) for
Druid to store empty columns.<br/><br/>If you set `storeEmptyColumns` to false,
[...]
+|`druid.indexer.task.storeEmptyColumns`|Boolean value for whether or not to
store empty columns during ingestion. When set to true, Druid stores every
column specified in the
[`dimensionsSpec`](../ingestion/ingestion-spec.md#dimensionsspec). <br/><br/>If
you set `storeEmptyColumns` to false, Druid SQL queries referencing empty
columns will fail. If you intend to leave `storeEmptyColumns` disabled, you
should either ingest placeholder data for empty columns or else not query on
empty colu [...]
|`druid.peon.taskActionClient.retry.minWait`|The minimum retry time to
communicate with Overlord.|PT5S|
|`druid.peon.taskActionClient.retry.maxWait`|The maximum retry time to
communicate with Overlord.|PT1M|
|`druid.peon.taskActionClient.retry.maxRetryCount`|The maximum number of
retries to communicate with Overlord.|60|
diff --git a/docs/ingestion/ingestion-spec.md b/docs/ingestion/ingestion-spec.md
index 079f8e9f19..126a40ca92 100644
--- a/docs/ingestion/ingestion-spec.md
+++ b/docs/ingestion/ingestion-spec.md
@@ -24,7 +24,7 @@ description: Reference for the configuration options in the
ingestion spec.
~ under the License.
-->
-All ingestion methods use ingestion tasks to load data into Druid. Streaming
ingestion uses ongoing supervisors that run and supervise a set of tasks over
time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md).
All types of ingestion use an _ingestion spec_ to configure ingestion.
+All ingestion methods use ingestion tasks to load data into Druid. Streaming
ingestion uses ongoing supervisors that run and supervise a set of tasks over
time. Native batch and Hadoop-based ingestion use a one-time [task](tasks.md).
Except for SQL-based ingestion, all ingestion methods use an _ingestion spec_
to configure the ingestion.
Ingestion specs consist of three main components:
@@ -186,9 +186,19 @@ Treat `__time` as a millisecond timestamp: the number of
milliseconds since Jan
### `dimensionsSpec`
The `dimensionsSpec` is located in `dataSchema` → `dimensionsSpec` and is
responsible for
-configuring [dimensions](./data-model.md#dimensions). An example
`dimensionsSpec` is:
+configuring [dimensions](./data-model.md#dimensions).
-```
+You can either manually specify the dimensions or take advantage of schema
auto-discovery, which lets Druid infer all or some of the schema for your data
so that you don't have to explicitly specify each dimension and its type.
+
+To use schema auto-discovery, set `useSchemaDiscovery` to `true`.
+
+Alternatively, you can use string-based schemaless ingestion, in which any
discovered dimensions are treated as strings. To do so, leave
`useSchemaDiscovery` set to `false` (the default), and either leave the
dimensions list empty or set the `includeAllDimensions` property to `true`.
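As an illustrative sketch (not part of this commit's diff), a string-based schemaless `dimensionsSpec` might look like the following; the field names follow the `dimensionsSpec` reference in this document:

```json
"dimensionsSpec" : {
  "dimensions": [],
  "dimensionExclusions": [],
  "includeAllDimensions": true
}
```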
+
+The following `dimensionsSpec` example uses schema auto-discovery
(`"useSchemaDiscovery": true`) in conjunction with explicitly defined
dimensions to have Druid infer some of the schema for the data:
+
+```json
"dimensionsSpec" : {
"dimensions": [
"page",
@@ -196,10 +206,12 @@ configuring [dimensions](./data-model.md#dimensions). An
example `dimensionsSpec
{ "type": "long", "name": "userId" }
],
"dimensionExclusions" : [],
- "spatialDimensions" : []
+ "spatialDimensions" : [],
+ "useSchemaDiscovery": true
}
```
+
> Conceptually, after input data records are read, Druid applies ingestion
> spec components in a particular order:
> first [`flattenSpec`](data-formats.md#flattenspec) (if any), then
> [`timestampSpec`](#timestampspec), then [`transformSpec`](#transformspec),
> and finally [`dimensionsSpec`](#dimensionsspec) and
> [`metricsSpec`](#metricsspec). Keep this in mind when writing
@@ -212,7 +224,9 @@ A `dimensionsSpec` can have the following components:
| `dimensions` | A list of [dimension names or
objects](#dimension-objects). You cannot include the same column in both
`dimensions` and `dimensionExclusions`.<br /><br />If `dimensions` and
`spatialDimensions` are both null or empty arrays, Druid treats all columns
other than timestamp or metrics that do not appear in `dimensionExclusions` as
String-typed dimension columns. See [inclusions and
exclusions](#inclusions-and-exclusions) for details.<br /><br />As a best
practice, [...]
| `dimensionExclusions` | The names of dimensions to exclude from ingestion.
Only names are supported here, not objects.<br /><br />This list is only used
if the `dimensions` and `spatialDimensions` lists are both null or empty
arrays; otherwise it is ignored. See [inclusions and
exclusions](#inclusions-and-exclusions) below for details.
[...]
| `spatialDimensions` | An array of [spatial
dimensions](../development/geo.md).
[...]
-| `includeAllDimensions` | You can set `includeAllDimensions` to true to
ingest both explicit dimensions in the `dimensions` field and other dimensions
that the ingestion task discovers from input data. In this case, the explicit
dimensions will appear first in order that you specify them and the dimensions
dynamically discovered will come after. This flag can be useful especially with
auto schema discovery using [`flattenSpec`](./data-formats.md#flattenspec). If
this is not set and the [...]
+| `includeAllDimensions` | This field applies only to string-based schema
discovery, where Druid ingests discovered dimensions as strings; it does not
apply to type-aware schema auto-discovery, where Druid infers the type of the data.
You can set `includeAllDimensions` to true to ingest both explicit dimensions
in the `dimensions` field and other dimensions that the ingestion task
discovers from input data. In this case, the explicit dimensions will appear
first in the order that you speci [...]
+| `useSchemaDiscovery` | Configure Druid to use schema auto-discovery to
discover some or all of the dimensions and types for your data. For any
dimensions that aren't a uniform type, Druid ingests them as JSON. You can use
this for native batch or streaming ingestion. | false |
+
#### Dimension objects
@@ -223,7 +237,7 @@ Dimension objects can have the following components:
| Field | Description | Default |
|-------|-------------|---------|
-| type | Either `string`, `long`, `float`, `double`, or `json`. | `string` |
+| type | Either `auto`, `string`, `long`, `float`, `double`, or `json`. For
the `auto` type, Druid determines the most appropriate type for the dimension
and assigns one of the following: STRING, ARRAY<STRING>, LONG, ARRAY<LONG>,
DOUBLE, ARRAY<DOUBLE>, or COMPLEX<json> columns, all sharing a common 'nested'
format. When Druid infers the schema with schema auto-discovery, the type is
`auto`. | `string` |
| name | The name of the dimension. This will be used as the field name to
read from input records, as well as the column name stored in generated
segments.<br /><br />Note that you can use a [`transformSpec`](#transformspec)
if you want to rename columns during ingestion time. | none (required) |
| createBitmapIndex | For `string` typed dimensions, whether or not bitmap
indexes should be created for the column in generated segments. Creating a
bitmap index requires more storage, but speeds up certain kinds of filtering
(especially equality and prefix filtering). Only supported for `string` typed
dimensions. | `true` |
| multiValueHandling | For `string` typed dimensions, specifies the type of
handling for [multi-value fields](../querying/multi-value-dimensions.md).
Possible values are `array` (ingest string arrays as-is), `sorted_array` (sort
string arrays during ingestion), and `sorted_set` (sort and de-duplicate string
arrays during ingestion). This parameter is ignored for types other than
`string`. | `sorted_array` |
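For illustration (a sketch, not taken from this diff), a `dimensions` list can mix an `auto` typed object with explicitly typed entries; `userAttributes`, `userId`, and `page` below are hypothetical column names:

```json
"dimensions": [
  { "type": "auto", "name": "userAttributes" },
  { "type": "long", "name": "userId" },
  "page"
]
```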
@@ -234,6 +248,8 @@ Druid will interpret a `dimensionsSpec` in two possible
ways: _normal_ or _schem
Normal interpretation occurs when either `dimensions` or `spatialDimensions`
is non-empty. In this case, the combination of the two lists will be taken as
the set of dimensions to be ingested, and the list of `dimensionExclusions`
will be ignored.
+> The following description of schemaless refers to string-based schemaless
ingestion, where Druid treats discovered dimensions as strings. We recommend
that you use type-aware schema auto-discovery instead, where Druid infers the
type of each dimension. For more information, see [`dimensionsSpec`](#dimensionsspec).
+
Schemaless interpretation occurs when both `dimensions` and
`spatialDimensions` are empty or null. In this case, the set of dimensions is
determined in the following way:
1. First, start from the set of all root-level fields from the input record,
as determined by the [`inputFormat`](./data-formats.md). "Root-level" includes
all fields at the top level of a data structure, but does not included fields
nested within maps or lists. To extract these, you must use a
[`flattenSpec`](./data-formats.md#flattenspec). All fields of non-nested data
formats, such as CSV and delimited text, are considered root-level.
@@ -244,6 +260,8 @@ Schemaless interpretation occurs when both `dimensions` and
`spatialDimensions`
6. Any field with the same name as an aggregator from the
[metricsSpec](#metricsspec) is excluded.
7. All other fields are ingested as `string` typed dimensions with the
[default settings](#dimension-objects).
+Additionally, if you have empty columns that you want to include in
string-based schemaless ingestion, you must set the `storeEmptyColumns` context
parameter to `true`.
+
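As a hedged sketch, setting that context parameter in a task spec might look like the following (the surrounding task fields are omitted and assumed):

```json
"context": {
  "storeEmptyColumns": true
}
```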
> Note: Fields generated by a [`transformSpec`](#transformspec) are not
> currently considered candidates for
> schemaless dimension interpretation.
diff --git a/docs/ingestion/schema-design.md b/docs/ingestion/schema-design.md
index f006e792bc..eaada3651b 100644
--- a/docs/ingestion/schema-design.md
+++ b/docs/ingestion/schema-design.md
@@ -107,7 +107,7 @@ to compute percentiles or quantiles, use Druid's
[approximate aggregators](../qu
row in your Druid datasource. This can be useful if you want to store data at
a different time granularity than it is
naturally emitted. It is also useful if you want to combine timeseries and
non-timeseries data in the same datasource.
* If you don't know ahead of time what columns you'll want to ingest, use an
empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+[automatic detection of dimension
columns](#schema-auto-discovery-for-dimensions).
### Log aggregation model
@@ -120,8 +120,7 @@ you must be more explicit. Druid columns have types
specific upfront.
Tips for modeling log data in Druid:
-* If you don't know ahead of time what columns you'll want to ingest, use an
empty dimensions list to trigger
-[automatic detection of dimension columns](#schema-less-dimensions).
+* If you don't know ahead of time what columns to ingest, you can have Druid
perform [schema auto-discovery](#schema-auto-discovery-for-dimensions).
* If you have nested data, you can ingest it using the [nested
columns](../querying/nested-columns.md) feature or flatten it using a
[`flattenSpec`](./ingestion-spec.md#flattenspec).
* Consider enabling [rollup](./rollup.md) if you have mainly analytical use
cases for your log data. This will
mean you lose the ability to retrieve individual events from Druid, but you
potentially gain substantial compression and
@@ -241,12 +240,53 @@ You should query for the number of ingested rows with:
]
```
-### Schema-less dimensions
+### Schema auto-discovery for dimensions
+
+Druid can infer the schema for your data in one of two ways:
+
+- [Type-aware schema discovery (experimental)](#type-aware-schema-discovery)
where Druid infers the schema and type for your data. Type-aware schema
discovery is an experimental feature currently available for native batch and
streaming ingestion.
+- [String-based schema discovery](#string-based-schema-discovery) where all
the discovered columns are typed as either native string or multi-value string
columns.
+
+#### Type-aware schema discovery
+
+> Note that using type-aware schema discovery can impact downstream BI tools
depending on how they handle ARRAY typed columns.
+
+You can have Druid infer the schema and types for your data partially or fully
by setting `dimensionsSpec.useSchemaDiscovery` to `true` and defining some or
no dimensions in the dimensions list.
+
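For example (an illustrative sketch, not part of this diff), a spec that lets Druid discover and type all dimensions could be as minimal as:

```json
"dimensionsSpec" : {
  "dimensions": [],
  "useSchemaDiscovery": true
}
```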
+When performing type-aware schema discovery, Druid can discover all of the
columns of your input data (that aren't in
+the exclusion list). Druid automatically chooses the most appropriate native
Druid type among `STRING`, `LONG`,
+`DOUBLE`, `ARRAY<STRING>`, `ARRAY<LONG>`, `ARRAY<DOUBLE>`, or `COMPLEX<json>`
for nested data. For input formats with
+native boolean types, Druid ingests these values as strings if
`druid.expressions.useStrictBooleans` is set to `false`
+(the default), or longs if set to `true` (for more SQL compatible behavior).
Array typed columns can be queried using
+the [array functions](../querying/sql-array-functions.md) or
[UNNEST](../querying/sql-functions.md#unnest). Nested
+columns can be queried with the [JSON
functions](../querying/sql-json-functions.md).
+
+Mixed type columns are stored in the _least_ restrictive type that can
represent all values in the column. For example:
+
+- Mixed numeric columns are `DOUBLE`
+- If there are any strings present, then the column is a `STRING`
+- If there are arrays, then the column becomes an array with the least
restrictive element type
+- Any nested data or arrays of nested data become `COMPLEX<json>` nested
columns.
+
+If you're already using string-based schema discovery and want to migrate, see
[Migrating to type-aware schema
discovery](#migrating-to-type-aware-schema-discovery).
+
+#### String-based schema discovery
+
+If you do not set `dimensionsSpec.useSchemaDiscovery` to `true`, Druid can
still use string-based schema discovery for ingestion if either of the
following conditions is met:
+
+- The dimension list is empty
+- You set `includeAllDimensions` to `true`
+
+Druid coerces primitives and arrays of primitive types into the native Druid
string type. Nested data structures and arrays of nested data structures are
ignored and not ingested.
+
+#### Migrating to type-aware schema discovery
-If the `dimensions` field is left empty in your ingestion spec, Druid will
treat every column that is not the timestamp column,
-a dimension that has been excluded, or a metric column as a dimension.
+If you previously used string-based schema discovery and want to migrate to
type-aware schema discovery, do the following:
-Note that when using schema-less ingestion, all dimensions will be ingested as
String-typed dimensions.
+- Update any queries that use multi-value dimensions (MVDs) to use UNNEST in
conjunction with other functions so that they no longer rely on MVD behavior.
Type-aware schema discovery generates ARRAY typed columns instead of MVDs, so
queries that use any MVD features will fail.
+- Be aware of mixed typed inputs and test how type-aware schema discovery
handles them. Druid attempts to cast them as the least restrictive type.
+- If you notice issues with numeric types, you may need to explicitly cast
them. Generally, Druid handles the coercion for you.
+- Update your dimension exclusion list and add any nested columns if you want
to continue to exclude them. String-based schema discovery automatically
ignores nested columns, but type-aware schema discovery will ingest them.
### Including the same column as a dimension and a metric
diff --git a/docs/ingestion/tasks.md b/docs/ingestion/tasks.md
index 979d6d88d7..95e61f88dc 100644
--- a/docs/ingestion/tasks.md
+++ b/docs/ingestion/tasks.md
@@ -392,7 +392,7 @@ The following parameters apply to all task types.
|`forceTimeChunkLock`|true|_Setting this to false is still experimental_<br/>
Force to always use time chunk lock. If not set, each task automatically
chooses a lock type to use. If set, this parameter overwrites
`druid.indexer.tasklock.forceTimeChunkLock` [configuration for the
overlord](../configuration/index.md#overlord-operations). See
[Locking](#locking) for more details.|
|`priority`|Different based on task types. See [Priority](#priority).|Task
priority|
|`useLineageBasedSegmentAllocation`|false in 0.21 or earlier, true in 0.22 or
later|Enable the new lineage-based segment allocation protocol for the native
Parallel task with dynamic partitioning. This option should be off during the
replacing rolling upgrade from one of the Druid versions between 0.19 and 0.21
to Druid 0.22 or higher. Once the upgrade is done, it must be set to true to
ensure data correctness.|
-|`storeEmptyColumns`|true|Boolean value for whether or not to store empty
columns during ingestion. When set to true, Druid stores every column specified
in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec). If you use
schemaless ingestion and don't specify any dimensions to ingest, you must also
set [`includeAllDimensions`](ingestion-spec.md#dimensionsspec) for Druid to
store empty columns.<br/><br/>If you set `storeEmptyColumns` to false, Druid
SQL queries referencing empty colu [...]
+|`storeEmptyColumns`|true|Boolean value for whether or not to store empty
columns during ingestion. When set to true, Druid stores every column specified
in the [`dimensionsSpec`](ingestion-spec.md#dimensionsspec). <br/><br/>If you
set `storeEmptyColumns` to false, Druid SQL queries referencing empty columns
will fail. If you intend to leave `storeEmptyColumns` disabled, you should
either ingest dummy data for empty columns or else not query on empty
columns.<br/><br/>When set in the tas [...]
## Task logs
diff --git a/website/.spelling b/website/.spelling
index 910912d46c..01b5f27ead 100644
--- a/website/.spelling
+++ b/website/.spelling
@@ -375,6 +375,7 @@ misconfigured
mostAvailableSize
multitenancy
multitenant
+MVDs
mysql
namespace
namespaced
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]