Re: [PR] Update front coding text (druid)

via GitHub Mon, 27 May 2024 19:25:20 -0700


ektravel commented on code in PR #16491:
URL: https://github.com/apache/druid/pull/16491#discussion_r1616503987



##########
docs/ingestion/ingestion-spec.md:
##########
@@ -507,26 +505,64 @@ is:
 
 ### `indexSpec`
 
-The `indexSpec` object can include the following properties:
+The `indexSpec` object can include the following properties.
+For information on defining an `indexSpec` in a query context, see [SQL-based 
ingestion reference](../multi-stage-query/reference.md#context-parameters).
 
 |Field|Description|Default|
 |-----|-----------|-------|
 |bitmap|Compression format for bitmap indexes. Should be a JSON object with 
`type` set to `roaring` or `concise`.|`{"type": "roaring"}`|
-|dimensionCompression|Compression format for dimension columns. Options are 
`lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
-|stringDictionaryEncoding|Encoding format for STRING value dictionaries used 
by STRING and COMPLEX&lt;json&gt; columns. <br /><br />Example to enable front 
coding: `{"type":"frontCoded", "bucketSize": 4}`<br />`bucketSize` is the 
number of values to place in a bucket to perform delta encoding. Must be a 
power of 2, maximum is 128. Defaults to 4.<br /> `formatVersion` can specify 
older versions for backwards compatibility during rolling upgrades, valid 
options are `0` and `1`. Defaults to `0` for backwards compatibility.<br /><br 
/>See [Front coding](#front-coding) for more information.|`{"type":"utf8"}`|
+|dimensionCompression|Compression format for dimension columns. One of `lz4`, 
`lzf`, `zstd`, or `uncompressed`.|`lz4`|
+|stringDictionaryEncoding|Encoding format for STRING value dictionaries used 
by STRING and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns. To 
enable front coding, set `stringDictionaryEncoding.type` to `frontCoded`. 
Optionally, you can specify the `bucketSize` and `formatVersion` properties. 
`bucketSize` is the number of values to place in a bucket to perform delta 
encoding. `bucketSize` defaults to 4. You can set it to any power of 2 less 
than or equal to 128. You use `formatVersion` to specify older versions for 
backwards compatibility during rolling upgrades. Options are `0` and `1`. 
`formatVersion` defaults to `0`. See [Front coding](#front-coding) for more 
information.|`{"type":"utf8"}`|
 |metricCompression|Compression format for primitive type metric columns. 
Options are `lz4`, `lzf`, `zstd`, `uncompressed`, or `none` (which is more 
efficient than `uncompressed`, but not supported by older versions of 
Druid).|`lz4`|
 |longEncoding|Encoding format for long-typed columns. Applies regardless of 
whether they are dimensions or metrics. Options are `auto` or `longs`. `auto` 
encodes the values using offset or lookup table depending on column 
cardinality, and stores them with variable size. `longs` stores the value as-is 
with 8 bytes each.|`longs`|
 |jsonCompression|Compression format to use for nested column raw data. Options 
are `lz4`, `lzf`, `zstd`, or `uncompressed`.|`lz4`|
 
 ##### Front coding
 
-Front coding is an experimental feature starting in version 25.0. Front coding 
is an incremental encoding strategy that Druid can use to store STRING and 
[COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns. It allows Druid 
to create smaller UTF-8 encoded segments with very little performance cost.
+:::info
+Front coding is an [experimental feature](../development/experimental.md) 
introduced in Druid 25.0.0.
+:::
 
-You can enable front coding with all types of ingestion. For information on 
defining an `indexSpec` in a query context, see [SQL-based ingestion 
reference](../multi-stage-query/reference.md#context-parameters).
+Front coding is an incremental encoding strategy that lets you store STRING 
and [COMPLEX&lt;json&gt;](../querying/nested-columns.md) columns in Druid.
+With front coding enabled, Druid creates smaller UTF-8 encoded segments with 
very little performance cost.
+It involves tracking the length of common prefixes of values so that only the 
suffix is stored.
+Druid divides values into buckets each with a fixed number of values. Because 
the buckets are fixed, Druid can quickly determine which bucket to store a 
dictionary ID in, without hindering its ability to perform a binary search.
 
-:::info
- Front coding was originally introduced in Druid 25.0, and an improved 
'version 1' was introduced in Druid 26.0, with typically faster read speed and 
smaller storage size. The current recommendation is to enable it in a staging 
environment and fully test your use case before using in production. By 
default, segments created with front coding enabled in Druid 26.0 are backwards 
compatible with Druid 25.0, but those created with Druid 26.0 or 25.0 are not 
compatible with Druid versions older than 25.0. If using front coding in Druid 
25.0 and upgrading to Druid 26.0, the `formatVersion` defaults to `0` to keep 
writing out the older format to enable seamless downgrades to Druid 25.0, and 
then later is recommended to be changed to `1` once determined that rollback is 
not necessary.
+To enable front coding, set `indexSpec.stringDictionaryEncoding.type` to 
`frontCoded`.
+You can specify the following optional properties:
+* `bucketSize`: Number of values to place in a bucket to perform delta 
encoding. Setting this property instructs indexing tasks to write segments 
using compressed dictionaries of the specified bucket size. You can set it to 
any power of 2 less than or equal to 128. `bucketSize` defaults to 4.
+* `formatVersion`: Specifies older versions for backwards compatibility during 
rolling upgrades. Valid options are `0` and `1`. `formatVersion` defaults to 
`0`.
+
+For example:
+
+```
+"indexSpec": {
+  "stringDictionaryEncoding": {
+    "type": "frontCoded",
+    "bucketSize": 4,
+    "formatVersion": 0
+  }
+}
+```
+
+A new `"formatVersion": 1` was introduced in Druid 26.0.0, offering typically 
faster read speeds and smaller storage sizes.
+When you upgrade from Druid 25.0.0 to Druid 26.0.0 while using front coding, 
the `formatVersion` defaults to `0` to keep writing out the older format. This 
default setting enables seamless downgrades to Druid 25.0.0 if needed.
+It is recommended to enable the new format in a staging environment and 
thoroughly test your use case before deploying it to production.
+
+:::caution
+By default, segments created with front coding enabled in Druid 26.0.0 are 
backwards compatible with Druid 25.0.0. Segments created with front coding in 
Druid 26.0.0 or 25.0.0 are not compatible with Druid versions older than 25.0.0.
 :::
 
-Beyond these properties, each ingestion method has its own specific tuning 
properties. See the documentation for each
-[ingestion method](./index.md#ingestion-methods) for details.
+The following example shows how to set `stringDictionaryEncoding` using format 
version 1 with bucket size 16:
+
+```
+"indexSpec": {
+  "stringDictionaryEncoding": {
+    "type": "frontCoded",
+    "bucketSize": 16,
+    "formatVersion": 1

Review Comment:
   Good catch!
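
For readers of this thread, here is a minimal Python sketch of the bucketed prefix encoding that the new "Front coding" text describes. It is an illustration only, not Druid's implementation: the function names (`front_code`, `lookup`) are hypothetical, and measuring each prefix against the first value in the bucket is an assumption made to keep the example short.

```python
from typing import List, Tuple

Bucket = List[Tuple[int, bytes]]  # (shared prefix length, remaining suffix)


def common_prefix_len(a: bytes, b: bytes) -> int:
    """Count the leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_code(sorted_values: List[str], bucket_size: int = 4) -> List[Bucket]:
    """Split a sorted dictionary into fixed-size buckets (hypothetical helper).

    The first value of each bucket is stored whole. Every other value stores
    only the length of the prefix it shares with that first value, plus the
    suffix that differs.
    """
    buckets: List[Bucket] = []
    for start in range(0, len(sorted_values), bucket_size):
        chunk = [v.encode("utf-8") for v in sorted_values[start:start + bucket_size]]
        first = chunk[0]
        bucket: Bucket = [(0, first)]
        for value in chunk[1:]:
            n = common_prefix_len(first, value)
            bucket.append((n, value[n:]))
        buckets.append(bucket)
    return buckets


def lookup(buckets: List[Bucket], dictionary_id: int, bucket_size: int = 4) -> str:
    """Rebuild the value for a dictionary ID.

    Because every bucket holds the same number of values, the bucket index
    is plain integer division, with no scan over earlier buckets.
    """
    bucket = buckets[dictionary_id // bucket_size]
    first = bucket[0][1]
    prefix_len, suffix = bucket[dictionary_id % bucket_size]
    return (first[:prefix_len] + suffix).decode("utf-8")


values = ["druid", "druidical", "drum", "drumstick", "dry"]
encoded = front_code(values, bucket_size=4)
decoded = [lookup(encoded, i, bucket_size=4) for i in range(len(values))]
```

With a larger `bucket_size`, more values share one fully stored entry, which in this sketch trades some decoding work per lookup for less stored data.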



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

