2bethere commented on code in PR #16491:
URL: https://github.com/apache/druid/pull/16491#discussion_r1622541432


##########
docs/ingestion/ingestion-spec.md:
##########
@@ -495,38 +508,58 @@ is:
 }
 ```
 
-|Field|Description|Default|
-|-----|-----------|-------|
-|type|Each ingestion method has its own tuning type code. You must specify the 
type code that matches your ingestion method. Common options are `index`, 
`hadoop`, `kafka`, and `kinesis`.||
-|maxRowsInMemory|The maximum number of records to store in memory before 
persisting to disk. Note that this is the number of rows post-rollup, and so it 
may not be equal to the number of input records. Ingested records will be 
persisted to disk when either `maxRowsInMemory` or `maxBytesInMemory` are 
reached (whichever happens first).|`1000000`|
-|maxBytesInMemory|The maximum aggregate size of records, in bytes, to store in 
the JVM heap before persisting. This is based on a rough estimate of memory 
usage. Ingested records will be persisted to disk when either `maxRowsInMemory` 
or `maxBytesInMemory` are reached (whichever happens first). `maxBytesInMemory` 
also includes heap usage of artifacts created from intermediary persists. This 
means that after every persist, the amount of `maxBytesInMemory` until the next 
persist will decrease. If the sum of bytes of all intermediary persisted 
artifacts exceeds `maxBytesInMemory` the task fails.<br /><br />Setting 
`maxBytesInMemory` to -1 disables this check, meaning Druid will rely entirely 
on `maxRowsInMemory` to control memory usage. Setting it to zero means the 
default value will be used (one-sixth of JVM heap size).<br /><br />Note that 
the estimate of memory usage is designed to be an overestimate, and can be 
especially high when using complex ingest-time aggregators, including sketches. If this causes your indexing workloads to persist to disk too often,
you can set `maxBytesInMemory` to -1 and rely on `maxRowsInMemory` 
instead.|One-sixth of max JVM heap size|
-|skipBytesInMemoryOverheadCheck|The calculation of maxBytesInMemory takes into 
account overhead objects created during ingestion and each intermediate 
persist. Setting this to true can exclude the bytes of these overhead objects 
from maxBytesInMemory check.|false|
-|indexSpec|Defines segment storage format options to use at indexing time.|See 
[`indexSpec`](#indexspec) for more information.|
-|indexSpecForIntermediatePersists|Defines segment storage format options to 
use at indexing time for intermediate persisted temporary segments.|See 
[`indexSpec`](#indexspec) for more information.|
-|Other properties|Each ingestion method has its own list of additional tuning 
properties. See the documentation for each method for a full list: [Kafka 
indexing service](../ingestion/kafka-ingestion.md#tuning-configuration), 
[Kinesis indexing 
service](../ingestion/kinesis-ingestion.md#tuning-configuration), [Native 
batch](native-batch.md#tuningconfig), and 
[Hadoop-based](hadoop.md#tuningconfig).||
-
 ### `indexSpec`
 
-The `indexSpec` object can include the following properties:
+The `indexSpec` object can include the following properties.
+For information on defining an `indexSpec` in a query context, see [SQL-based 
ingestion reference](../multi-stage-query/reference.md#context-parameters).
 
 |Field|Description|Default|
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`.|`{"type": "roaring"}`|
-|dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for STRING value dictionaries used 
by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front 
coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the 
number of values to place in a bucket to perform delta encoding. Must be a 
power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify 
older versions for backwards compatibility during rolling upgrades, valid 
options are `0` and `1`. Defaults to `0` for backwards compatibility.<br /><br 
/>See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
+|dimensionCompression|Compression format for dimension columns. One of `lz4`, 
`lzf`, `zstd`, or `uncompressed`.|`lz4`|
+|stringDictionaryEncoding|Encoding format for string value dictionaries used 
by STRING and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns. To 
enable front coding, set `stringDictionaryEncoding.type` to `frontCoded`. 
Optionally, you can specify the `bucketSize` and `formatVersion` properties. 
See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
 |metricCompression|Compression format for primitive type metric columns. 
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more 
efficient than `uncompressed`, but not supported by older versions of 
Druid).|`lz4`|
 |longEncoding|Encoding format for long-typed columns. Applies regardless of 
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` 
encodes the values using an offset or a lookup table depending on column cardinality, and stores them with variable size. `longs` stores the values as-is
with 8 bytes each.|`longs`|
 |jsonCompression|Compression format to use for nested column raw data. Options 
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
 
-##### Front coding
-
-Front coding is an experimental feature starting in version 25.0. Front coding 
is an incremental encoding strategy that Druid can use to store STRING and 
[COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns. It allows Druid 
to create smaller UTF-8 encoded segments with very little performance cost.
-
-You can enable front coding with all types of ingestion. For information on 
defining an `indexSpec` in a query context, see [SQL-based ingestion 
reference](../multi-stage-query/reference.md#context-parameters).
+#### Front coding
 
 :::info
- Front coding was originally introduced in Druid 25.0, and an improved 
'version 1' was introduced in Druid 26.0, with typically faster read speed and 
smaller storage size. The current recommendation is to enable it in a staging 
environment and fully test your use case before using in production. By 
default, segments created with front coding enabled in Druid 26.0 are backwards 
compatible with Druid 25.0, but those created with Druid 26.0 or 25.0 are not 
compatible with Druid versions older than 25.0. If using front coding in Druid 
25.0 and upgrading to Druid 26.0, the `formatVersion` defaults to `0` to keep 
writing out the older format to enable seamless downgrades to Druid 25.0, and 
then later is recommended to be changed to `1` once determined that rollback is 
not necessary.
+Front coding is an [experimental feature](../development/experimental.md).
 :::
 
-Beyond these properties, each ingestion method has its own specific tuning 
properties. See the documentation for each
-[ingestion method](./index.md#ingestion-methods) for details.
+Druid encodes string columns into dictionaries for better compression.
+Front coding is an incremental encoding strategy that lets you store STRING 
and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns in Druid with 
minimal performance impact.
+Front-coded dictionaries reduce storage and improve performance by optimizing 
for strings where the front part looks similar.
+For example, if you are tracking website visits, most URLs start with 
`https://domain.xyz/`, and front coding is able to exploit this pattern for 
more optimal compression when storing such datasets.
+
+With front coding enabled, Druid tracks the length of common prefixes of 
values so that only the suffix is stored.

Review Comment:
   I don't think we need to talk about lines 536/537 (the how-it-works internals).
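
   For reference, the options discussed in this hunk compose as in the following sketch (a minimal illustration, not a recommendation; the `index` type and all values are only examples, and `formatVersion: 1` assumes every service already runs Druid 26.0 or later):

   ```json
   {
     "type": "index",
     "maxRowsInMemory": 1000000,
     "maxBytesInMemory": -1,
     "indexSpec": {
       "bitmap": { "type": "roaring" },
       "dimensionCompression": "lz4",
       "stringDictionaryEncoding": {
         "type": "frontCoded",
         "bucketSize": 4,
         "formatVersion": 1
       },
       "metricCompression": "lz4",
       "longEncoding": "longs"
     }
   }
   ```

   Setting `maxBytesInMemory` to -1 here relies on `maxRowsInMemory` alone to control memory usage, as the table above describes.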



##########
docs/release-info/migr-front-coded-dict.md:
##########
@@ -0,0 +1,90 @@
+---
+id: migr-front-coded-dict
+title: "Migration guide: front-coded dictionaries"
+sidebar_label: Front-coded dictionaries
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+-->
+
+:::info
+Front coding is an [experimental feature](../development/experimental.md) 
introduced in Druid 25.0.0.
+:::
+
+Apache Druid encodes string columns into dictionaries for better compression.
+Front coding is an incremental encoding strategy that lets you store STRING 
and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns in Druid with 
minimal performance impact.
+Front-coded dictionaries reduce storage and improve performance by optimizing 
for strings where the front part looks similar.
+For example, if you are tracking website visits, most URLs start with 
`https://domain.xyz/`, and front coding is able to exploit this pattern for 
more optimal compression when storing such datasets.
+
+With front coding enabled, Druid tracks the length of common prefixes of 
values so that only the suffix is stored.

Review Comment:
   Same here, probably don't need to talk about those.
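
   For anyone who wants the intuition behind the sentence in question, it can be sketched schematically (this illustrates the idea only, not Druid's actual on-disk layout; showing each delta against the previous value is an assumption):

   ```json
   {
     "bucket": [
       "https://domain.xyz/",
       "https://domain.xyz/about",
       "https://domain.xyz/blog",
       "https://domain.xyz/blog/post-1"
     ],
     "storedAs": [
       { "value": "https://domain.xyz/" },
       { "prefixLength": 19, "suffix": "about" },
       { "prefixLength": 19, "suffix": "blog" },
       { "prefixLength": 23, "suffix": "/post-1" }
     ]
   }
   ```

   Only the first value is stored in full; the rest keep just a prefix length and a suffix, which is where the storage savings come from.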



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -495,38 +508,58 @@ is:
+For example, if you are tracking website visits, most URLs start with 
`https://domain.xyz/`, and front coding is able to exploit this pattern for 
more optimal compression when storing such datasets.

Review Comment:
   ```suggestion
   For example, if you are tracking website visits, most URLs start with `https://domain.xyz/`, and front coding is able to exploit this pattern for more optimal compression when storing such datasets. Druid applies the optimization automatically, so it usually doesn't negatively impact the performance of string columns that don't match the front-coded pattern. This allows you to turn the feature on universally without knowing the underlying columns' data shapes.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

