This is an automated email from the ASF dual-hosted git repository.
vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7c5a8803304 [DOCS] Fix up tech specs based on RC (#12483)
7c5a8803304 is described below
commit 7c5a8803304429cdccbce3f5b22baa69dec1fe0c
Author: vinoth chandar <[email protected]>
AuthorDate: Fri Dec 13 05:25:20 2024 -0800
[DOCS] Fix up tech specs based on RC (#12483)
---
website/src/pages/tech-specs-1point0.md | 362 +++++++++++++-------------------
1 file changed, 149 insertions(+), 213 deletions(-)
diff --git a/website/src/pages/tech-specs-1point0.md
b/website/src/pages/tech-specs-1point0.md
index 48487ba32c4..e6ad40f3a60 100644
--- a/website/src/pages/tech-specs-1point0.md
+++ b/website/src/pages/tech-specs-1point0.md
@@ -1,50 +1,55 @@
# Apache Hudi Technical Specification
| **Syntax** | **Description** |
-| ---| --- |
-| Last Updated | Nov 2023 |
-| [Table
Version](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableVersion.java#L29)
| 7 |
+| ---|-----------------|
+| Last Updated | Dec 2024 |
+| [Table
Version](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableVersion.java#L29)
| 8 |
:::note
-Hudi version 1.0 is under active development. This document is a work in
progress and is subject to change. Please check the specification for versions
prior to 1.0 [here](/tech-specs).
+Please check the specification for versions prior to 1.0 [here](/tech-specs).
:::
-Hudi brings database capabilities (tables, transactions, mutability, indexes,
storage layouts) along with an incremental stream processing model (incremental
merges, change streams, out-of-order data) over very large collections of
files/objects, turning immutable cloud/file storage systems into transactional
data lakes.
+Hudi brings database capabilities (tables, transactions, mutability, indexes,
storage layouts) along with an incremental stream processing model (incremental
merges, change streams, out-of-order data)
+over very large collections of files/objects, turning immutable cloud/file
storage systems into transactional data lakes.
## Overview
-
This document is a specification for the Hudi's storage format along with
protocols for correctly implementing readers and writers to accomplish the
following goals.
-
-
* **Unified Computation Model** - a unified way to combine large batch style
operations and frequent near real time incremental operations over large
datasets.
-* **Self-Optimized Storage** - automatically handle all the table storage
maintenance such as compaction, clustering, vacuuming asynchronously and in
most cases non-blocking to actual data changes.
+* **Self-Optimizing Storage** - automatically handle all the table
maintenance and optimizations such as compaction, clustering, cleaning
asynchronously and in most cases non-blocking to actual data changes.
* **Cloud Native Database** - abstracts Table/Schema from actual storage and
ensures up-to-date metadata and indexes unlocking multi-fold read and write
performance optimizations.
-* **Cross-Engine Compatibility** - designed to be neutral and compatible
with different computation engines. Hudi will manage metadata, and provide
common abstractions and pluggable interfaces to most/all common compute/query
engines.
-
-
+* **Cross-Engine Compatibility** - designed to be neutral and compatible
with different computation engines. Hudi will manage metadata, and provide
common abstractions and
+ pluggable interfaces to most/all common compute/query engines.
This document is intended as reference guide for any compute engines, that aim
to write/read Hudi tables, by interacting with the storage format directly.
## Storage Layout
+Hudi organizes a table as a collection of files (objects in cloud storage)
that can be spread across different paths on **_storage_** which can be local
filesystem, distributed filesystem or object storage.
+These different paths that contain a table's files are called
**_partitions_**. Some common ways of organizing files can be as follows
-Hudi organizes a table as a collection of files (objects in cloud storage)
that can be spread across different paths on **_storage_** which can be local
filesystem, distributed filesystem or object storage. These different paths
that contain a table's files are called **_partitions_**. Some common ways of
organizing files can be as follows
-
-* **Hierarchical folder tree:** files are organized under a folder path
structure like conventional filesystems, for ease of access and navigation.
Hive-style partitioning is a special case of this organization, where the
folder path names indicate the field values that partition the data. However,
note that, unlike Hive style partitioning, partition columns are not removed
from data files and partitioning is a mere organizational construct.
-* **Cloud-optimized with random-prefixes**: files are distributed across
different paths (of even varying depth) across cloud storage, to circumvent
scaling/throttling issues that plague cloud storage systems, at the cost of
losing easy navigation of storage folder tree using standard UI/CLI tools.
-* **Unpartitioned flat structure**: tables can also be totally
unpartitioned, where a single folder contains all the files that constitute a
table.
-
+* **Hierarchical folder tree:** files are organized under a folder path
structure like conventional filesystems, for ease of access and navigation.
Hive-style partitioning is a special case of
+ this organization, where the folder path names indicate the field values
that partition the data. However, note that, unlike Hive style partitioning,
partition columns are not removed from
+ data files and partitioning is a mere organizational construct.
+* **Cloud-optimized with random-prefixes**: files are distributed across
different paths (of even varying depth) across cloud storage, to circumvent
scaling/throttling issues that plague
+ cloud storage systems, at the cost of losing easy navigation of storage
folder tree using standard UI/CLI tools.
+
+* **Unpartitioned flat structure**: tables can also be totally unpartitioned,
where a single folder contains all the files that constitute a table.
-Metadata about the table is stored at a location on storage, referred to as
**_basepath_**, which contains a special reserved _.hoodie_ directory under the
base path is used to store transaction logs, metadata and indexes. A special
file [`hoodie.properties`](http://hoodie.properties/) under basepath persists
table level configurations, shared by writers and readers of the table. These
configurations are explained
[here](https://github.com/apache/hudi/blob/master/hudi-common/src/main/jav [...]
+Metadata about the table is stored at a location on storage, referred to as
**_basepath_**, which contains a special reserved _.hoodie_ directory under the
base path
+is used to store transaction logs, metadata and indexes. A special file
[`hoodie.properties`](http://hoodie.properties/) under basepath persists table
level configurations, shared by writers
+and readers of the table. These configurations are explained
[here](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java),
and any config without a default value needs to be specified during table
creation.
```plain
-/data/hudi_trips/ <-- Base Path
-├── .hoodie/ <-- Meta Path
-| └── hoodie.properties <-- Table Configs
-│ └── metadata/ <-- Metadata
-| └── files/ <-- Files that make up the table
-| └── col_stats/
+/data/hudi_trips/ <-- Base Path
+├── .hoodie/ <-- Meta Path
+| └── hoodie.properties <-- Table Configs
+│ └── metadata/ <-- Metadata
+| └── files/ <-- Files that make up the table
+| └── col_stats/ <-- Statistics on columns for each
file
+| └── record_index/ <-- Unique index on record key
+| └── [index_type]_[index_name]/ <-- secondary indexes of different
types
+| └── .index_defs/index.json <-- index definitions
│ └── timeline/ <-- active timeline
| └── history/ <-- timeline history
├── americas/ <-- Data stored as folder tree
@@ -60,21 +65,29 @@ Metadata about the table is stored at a location on
storage, referred to as **_b
├── [data_files]
```
-
-
-Data files can be either **_base files_** or **_log files_**. Base files
contain records stored in columnar or SST table like file formats depending on
use-cases. Log files store deltas (partial or complete), deletes, change logs
and other record level operations performed on the records in the base file.
Data files are organized into logical concepts called **_file groups_**,
uniquely identified by a **_file id_**. Each record in the table is identified
by an unique key and mapped to a [...]
+Data files can be either **_base files_** or **_log files_**. Base files
contain records stored in columnar or SST table like file formats depending on
use-cases.
+Log files store deltas (partial or complete), deletes, change logs and other
record level operations performed on the records in the base file. Data files
are organized
+into logical concepts called **_file groups_**, uniquely identified by a
**_file id_**. Each record in the table is identified by an unique key and
mapped to a single file group
+at any given time. Within a file group, data files are further split into
**_file slices_**, where each file slice contains an optional base file and a
list of log files, that
+constitute the state of all records in the file group at a given time. These
constructs allows Hudi to efficiently support incremental operations, as will
be evident later.
## Timeline
-Hudi stores all actions performed on a table into a Log Structured Merge
([LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree)) tree structure
called the **_Timeline_**. Unlike typical LSM implementations, the memory
component and the write-ahead-log are at once replaced by
[avro](https://avro.apache.org/) serialized files containing individual actions
(**_active timeline_**) for high durability and inter-process co-ordination.
Every action on the Hudi table creates a new entry [...]
+Hudi stores all actions performed on a table into a Log Structured Merge
([LSM](https://en.wikipedia.org/wiki/Log-structured_merge-tree)) tree structure
called the **_Timeline_**.
+Unlike typical LSM implementations, the memory component and the
write-ahead-log are at once replaced by [avro](https://avro.apache.org/)
serialized files containing individual
+actions (**_active timeline_**) for high durability and inter-process
co-ordination. Every action on the Hudi table creates a new entry
(**_instant_**) in the active timeline and periodically,
+actions move from the active timeline to the LSM structure. Each new actions
and state changes need to be atomically published to the timeline (see Appendix
for storage specific guidelines to achieve that).
### Time
-Each action on the timeline is stamped with a time at which it began and
completed. The notion of time can be logical or physical timestamps, but it's
required that each timestamp generated by a process is monotonically increasing
with respect to the previous time generated by the same or another process.
This requirement can be satisfied by implementing a
[TrueTime](https://research.google/pubs/pub45855/) generator with an external
time generator or rely on system epoch times with assum [...]
+Each action on the timeline is stamped with a time at which it began and
completed. The notion of time can be logical or physical timestamps, but it's
required that each timestamp
+generated by a process is monotonically increasing with respect to the
previous time generated by the same or another process. This requirement can be
satisfied by implementing
+a [TrueTime](https://research.google/pubs/pub45855/) generator with an
external time generator or rely on system epoch times with assumed bounds on
clock skews. See appendix for more.
### Actions
-Action have a plan (optional) and completion metadata associated with them
that capture how the action alters the table state. The metadata is serialized
as Json/Avro, and the schema for each of these actions is described in avro
[here](https://github.com/apache/hudi/tree/master/hudi-common/src/main/avro).
The following are the actions on the Hudi timeline.
+Actions have a plan (optional) and completion metadata associated with them
that capture how the action alters the table state. The metadata is serialized
as Avro, and the schema for
+each of these actions is described in avro
[here](https://github.com/apache/hudi/tree/master/hudi-common/src/main/avro).
The following are the key actions on the Hudi timeline.
| **Action type** | **Description** | **Action Metadata** |
| ---| ---| --- |
@@ -99,16 +112,15 @@ An action can be in any one of the following states on the
active timeline.
| inflight | A process has attempted to execute a requested action. Note that
this process could be still executing or failed midway. Actions are idempotent,
and could fail many times in this state. Stored in the active timeline as
**_\[begin\_instant\_time\].\[action\_type\].\[state\]_** |
| completed | completion time is generated and the action has completed
successfully by publishing a file with both begin and completion time on the
timeline. |
-All the actions in requested/inflight states are stored in the active timeline
as files named
\[**_begin\_instant\_time\].\[action\_type\].\[requested|inflight\]_**.
Completed actions are stored along with a time that denotes when the action was
completed, in a file named
\[**_begin\_instant\_time\]\_\[completion\_instant\_time\].\[action\_type\]._**
+All the actions in requested/inflight states are stored in the active timeline
as files named
\[**_begin\_instant\_time\].\[action\_type\].\[requested|inflight\]_**.
Completed actions are stored along
+with a time that denotes when the action was completed, in a file named
\[**_begin\_instant\_time\]\_\[completion\_instant\_time\].\[action\_type\]._**
### LSM Timeline
-
-Completed actions, their plans and completion metadata are stored in a more
scalable LSM tree based timeline organized in an **_archived_** storage
location under the .hoodie metadata path. It consists of Apache Parquet files
with action instant data and bookkeeping metadata files, in the following
manner.
-
-
+Completed actions, their plans and completion metadata are stored in a more
scalable LSM tree based timeline organized in an **_timeline/history_** storage
location under the .hoodie metadata path.
+It consists of Apache Parquet files with action instant data and bookkeeping
metadata files, in the following manner.
```bash
-/.hoodie/archived/
+/.hoodie/timeline/history
├── _version_ <-- stores the manifest version
that is current
├── manifest_1 <-- manifests store list of files
in timeline
├── manifest_2 <-- compactions, cleaning, writes
produce new manifest files
@@ -117,8 +129,6 @@ Completed actions, their plans and completion metadata are
stored in a more scal
├── [min_time]_[max_time]_[level].parquet <-- files storing actual action
details
```
-
-
The schema of the individual files are as follows.
| **File type** | **File naming** | **Description** |
@@ -148,8 +158,6 @@ Hudi storage format supports two table types offering
different trade-offs betwe
Readers need to then satisfy different query types on these tables.
-
-
* **Snapshot queries** - query the table for latest committed state of each
record.
* **Time-Travel queries** - snapshot query performed as of a point in
timeline or equivalent on an older snapshot of the table.
* **Incremental queries** - obtain the latest merged state of all records
that have been updated/inserted between two points in the timeline.
@@ -157,7 +165,8 @@ Readers need to then satisfy different query types on these
tables.
## Data Files
-Data Files have naming conventions that allows to easily track history of
files with respect to the timeline as well as serving practical operational
needs. For e.g it's quite possible to recover the table to a known good state,
when operational accidents cause the timeline and other metadata are deleted.
+Data Files have naming conventions that allows to easily track history of
files with respect to the timeline as well as serving practical operational
needs.
+For e.g it's quite possible to recover the table to a known good state, when
operational accidents cause the timeline and other metadata are deleted.
### Base Files
@@ -166,8 +175,9 @@ Base files are standard well-known file formats like Apache
Parquet, Apache Orc
where,
* **file\_id** \- Id of the file group that the base file belong to.
-* **write\_token** - Monotonically increasing token for every attempt to
write the base file within a given transaction. This should help uniquely
identifying the base file when there are failures and retries. Cleaning can
remove partial/uncommitted base files by comparing with the successful write
token.
-* **begin\_time** - Time indicating the begin time of the action that
generated this file.
+* **write\_token** - Monotonically increasing token for every attempt to
write the base file within a given transaction. This should help uniquely
identifying the base file when there are
+ failures and retries. Cleaning can remove partial/uncommitted base files
by comparing with the successful write token.
+* **requested\_time** - Time when this action was requested on the timeline.
* **extension** - base file extension matching the file format such as
.parquet, .orc.
Note that a single file group can contain base files with different extensions.
@@ -176,23 +186,23 @@ Note that a single file group can contain base files with
different extensions.
The log files contain different type of blocks, that encode delta updates,
deletes or change logs against the base file. They are named with the convention
-**_.\[file\_id\]\_\[begin\_instant\_time\].log.\[version\]\_\[write\_token\]_**.
+**_.\[file\_id\]\_\[requested\_instant\_time\].log.\[version\]\_\[write\_token\]_**.
* **file\_id** - File Id of the base file in the slice
-* **begin\_instant\_time** \- Time at which the write operation that
produced this log file was requested.
+* **requested\_instant\_time** \- Time at which the write operation that
produced this log file was requested.
* **version** - Current version of the log file format, to order deltas
against the base file.
-* **write\_token** - Monotonically increasing token for every attempt to
write the log file. This should help uniquely identifying the log file when
there are failures and retries. Cleaner can clean-up partial log files if the
write token is not the latest in the file slice.
+* **write\_token** - Monotonically increasing token for every attempt to
write the log file. This should help uniquely identifying the log file when
there are failures and retries.
+ Cleaner can clean-up partial log files if the write token is not the
latest in the file slice.
### Log Format
-The Log file format structure is a Hudi native format. The actual content
bytes are serialized into one of Apache Avro, Apache Parquet or Apache HFile
file formats based on configuration and the other metadata in the block is
serialized using primitive types and byte arrays.
+The Log file format structure is a Hudi native format. The actual content
bytes are serialized into one of Apache Avro, Apache Parquet or Apache HFile
file formats based
+on configuration and the other metadata in the block is serialized using
primitive types and byte arrays.
Hudi Log format specification is as follows.
-
-
| Section | #Bytes | Description |
| ---| ---| --- |
| **magic** | 6 | 6 Characters '#HUDI#' stored as a byte array. Sanity check
for block corruption to assert start 6 bytes matches the magic byte\[\]. |
@@ -243,12 +253,11 @@ Metadata key mapping from Integer to actual metadata is
as follows:
The following are the possible block types used in Hudi Log Format:
-
-
### Command Block (Id: 1)
-Encodes a command to the log reader. The Command block must be 0 byte content
block which only populates the metadata Command Block Type. Only possible
values in the current version of the log format is ROLLBACK\_PREVIOUS\_BLOCK,
which lets the reader to undo the previous block written in the log file. This
denotes that the previous action that wrote the log block was unsuccessful.
-
+Encodes a command to the log reader. The Command block must be 0 byte content
block which only populates the metadata Command Block Type.
+Only possible values in the current version of the log format is
ROLLBACK\_PREVIOUS\_BLOCK, which lets the reader to undo the previous block
written in the log file.
+This denotes that the previous action that wrote the log block was
unsuccessful.
### Delete Block (Id: 2)
@@ -260,12 +269,10 @@ Encodes a command to the log reader. The Command block
must be 0 byte content bl
| deleted keys | variable | Tombstone of the record to encode a delete. The
following 3 fields are serialized using the KryoSerializer. **Record Key** -
Unique record key within the partition to deleted **Partition Path** -
Partition path of the record deleted **Ordering Value** - In a particular batch
of updates, the delete block is always written after the data
(Avro/HFile/Parquet) block. This field would preserve the ordering of deletes
and inserts within the same batch. |
-
### Corrupted Block (Id: 3)
-This block type is never written to persistent storage. While reading a log
block, if the block is corrupted, then the reader gets an instance of the
Corrupted Block instead of a Data block. It is a reserved ID for handling cases
of corrupt or partially written blocks, for example on HDFS.
-
-
+This block type is never written to persistent storage. While reading a log
block, if the block is corrupted, then the reader gets an instance of the
+Corrupted Block instead of a Data block. It is a reserved ID for handling
cases of corrupt or partially written blocks, for example on HDFS.
### Avro Block (Id: 4)
@@ -279,18 +286,15 @@ Data block serializes the actual records written into the
log file
| record content | variable | Record represented as an Avro record serialized
using BinaryEncoder |
-
### HFile Block (Id: 5)
-The HFile data block serializes the records using the HFile file format. HFile
data model is a key value pair and both are encoded as byte arrays. Hudi record
key is encoded as Avro string and the Avro record serialized using
BinaryEncoder is stored as the value. HFile file format stores the records in
sorted order and with index to enable quick point reads and range scans.
-
-
+The HFile data block serializes the records using the HFile file format. HFile
data model is a key value pair and both are encoded as byte arrays. Hudi record
key is encoded
+as Avro string and the Avro record serialized using BinaryEncoder is stored as
the value. HFile file format stores the records in sorted order and with index
to enable quick point reads and range scans.
### Parquet Block (Id: 6)
-The Parquet Block serializes the records using the Apache Parquet file format.
The serialization layout is similar to the Avro block except for the byte array
content encoded in columnar Parquet format. This log block type enables
efficient columnar scans and better compression. Different data block types
offers different trade-offs and picking the right block is based on the
workload requirements and is critical for merge and read performance.
-
-
+The Parquet Block serializes the records using the Apache Parquet file format.
The serialization layout is similar to the Avro block except for the byte array
content encoded in
+columnar Parquet format. This log block type enables efficient columnar scans
and better compression. Different data block types offers different trade-offs
and picking the right block is based on the workload requirements and is
critical for merge and read performance.
### CDC Block (Id: 7)
@@ -303,9 +307,6 @@ The CDC block is used for change data capture and it
encodes the before and afte
| oldRecord | This is the before image
|
| newRecord | This is the after image
|
-
-
-
## Table Properties
As mentioned in the [storage layout](#storage-layout) section, the table
properties are stored in the `hoodie.properties` file under the `.hoodie`
directory in the table base path.
@@ -336,25 +337,17 @@ table.
## Metadata
-Hudi automatically extracts the physical data statistics and stores the
metadata along with the data to improve write and query performance. Hudi
Metadata is an internally-managed table which organizes the table metadata
under the base path _.hoodie/metadata._ The metadata is in itself a Hudi table,
organized with the Hudi merge-on-read storage format. Every record stored in
the metadata table is a Hudi record and hence has partitioning key and record
key specified. Following metadata is [...]
-
-
-
-* **files** - Partition path to file name index. Key for the Hudi record is
the partition path and the actual record is a map of file name to an instance
of `HoodieMetadataFileInfo`. Additionally, a special key `__all_partitions__`
holds the list of all partitions. The files index can be used to do file
listing and do filter based pruning of the scanset during query
-* **column\_stats** - contains statistics of columns for all the records in
the table. This enables fine-grained file pruning for filters and join
conditions in the query. The actual payload is an instance of
`HoodieMetadataColumnStats`.
-
+Hudi automatically extracts the physical data statistics and stores the
metadata along with the data to improve write and query performance. Hudi
Metadata is an internally-managed
+table which organizes the table metadata under the base path
_.hoodie/metadata._ The metadata is in itself a Hudi table, organized with the
Hudi merge-on-read storage format.
+Every record stored in the metadata table is a Hudi record and hence has
partitioning key and record key specified. Following metadata is stored in the
metadata table.
+* **files** - Partition path to file name index. Key for the Hudi record is
the partition path and the actual record is a map of file name to an instance
of `HoodieMetadataFileInfo`.
+ Additionally, a special key `__all_partitions__` holds the list of all
partitions. The files index can be used to do file listing and do filter based
pruning of the scanset during query
+* **column\_stats** - contains statistics of columns for all the records in
the table. This enables fine-grained file pruning for filters and join
conditions in the query.
+ The actual payload is an instance of `HoodieMetadataColumnStats`.
Apache Hudi platform employs HFile format, to store metadata and indexes, to
ensure high performance, though different implementations are free to choose
their own.
-
-
-### Schema
-
-\[WIP\] Tracking schema versions.
-
-
-
### File Listings
File listing is an index of partition path to file name stored under the
partition `files` in the metadata table.
@@ -362,8 +355,6 @@ The files index can be used to do file listing and do
filter based pruning of th
Key for the Hudi record is the partition path and the actual record is a map
of file name to an instance of `HoodieMetadataFileInfo` that encodes file group
ID and instant time along with file metadata.
Additionally, a special key `__all_partitions__` holds the list of all
partitions.
-
-
### Column Statistics
Column statistics is stored under the partition `column_stats` in the metadata
table. The actual payload is an instance of `HoodieMetadataColumnStats`.
@@ -373,11 +364,9 @@ The Hudi key is `str_concat(hash(column name),
str_concat(hash(partition name),
### Meta Fields
In addition to the fields specified by the table's schema, the following meta
fields are added to each record, to unlock incremental processing and ease of
debugging. These meta fields are part of the table schema and
-
stored with the actual record to avoid re-computation.
-
| Hudi meta-fields | Description |
| ---| --- |
| \_hoodie\_commit\_time | This field contains the commit timestamp in the
[timeline](http://#transaction-log-timeline) that created this record. This
enables granular, record-level history tracking on the table, much like
database change-data-capture. |
@@ -387,16 +376,13 @@ stored with the actual record to avoid re-computation.
| \_hoodie\_file\_name | The data file name this record belongs to. |
-
-Within a given file, all records share the same values for
`_hoodie_partition_path` and `_hoodie_file_name`, thus easily compressed away
without any overheads with columnar file formats. The other fields can also be
optional for writers depending on whether protection against key field changes
or incremental processing is desired. More on how to populate these fields in
the sections below.
-
-
+Within a given file, all records share the same values for
`_hoodie_partition_path` and `_hoodie_file_name`, thus easily compressed away
without any overheads with columnar file formats.
+The other fields can also be optional for writers depending on whether
protection against key field changes or incremental processing is desired. More
on how to populate these fields in the sections below.
## Indexes
### Naming
-
-
+Indexes are stored under `.hoodie/metadata` storage path, with separate
partitions of the form `<index_type>_<index_name>`.
### Bloom Filter Index
@@ -428,18 +414,12 @@ The record index is stored in Hudi metadata table under
the partition `record_in
| instantTime | A long that represents epoch time in millisecond
representing the commit time at which record was added.
|
-##
-
-### Secondary Indexes
-
-
+### Expression Indexes
-### Functional Indexes
-
-A [functional
index](https://github.com/apache/hudi/blob/00ece7bce0a4a8d0019721a28049723821e01842/rfc/rfc-63/rfc-63.md)
is an index on a function of a column.
-Hudi supports creating functional indexes for certain unary string and
timestamp functions supported by Apache Spark.
+A [expression
index](https://github.com/apache/hudi/blob/00ece7bce0a4a8d0019721a28049723821e01842/rfc/rfc-63/rfc-63.md)
is an index on a function of a column.
+Hudi supports creating expression indexes for certain unary string and
timestamp functions supported.
The index definition specified by the user is serialized to JSON format and
saved at a path specified by `hoodie.table.index.defs.path` in [table
properties](#table-properties).
-Index itself is stored in Hudi metadata table under the partition
`func_index_<user_specified_index_name>`.
+Index itself is stored in Hudi metadata table under the partition
`expr_index_<user_specified_index_name>`.
We covered different [storage layouts](#storage-layout) earlier. Functional
index aggregates stats by storage partitions and, as such, partitioning can be
absorbed into functional indexes.
From that perspective, some useful functions that can also be applied as
transforms on a field to extract and index partitions are listed below.
@@ -453,16 +433,11 @@ From that perspective, some useful functions that can
also be applied as transfo
| `hour` | Hour of the timestamp |
| `lower` | Lower case of the string |
-Full list of supported functions can be found in
[HoodieSparkFunctionalIndex](https://github.com/apache/hudi/blob/dcd5a8182a11faab8bfc1ce8aa7787fa590dd395/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/functional/HoodieSparkFunctionalIndex.java).
-
-
-
-
## Relational Model
-This section describes how to implement a traditional relational model, with
concurrent writers on top of the storage format described so far. Specifically,
it aims to employ optimistic concurrency control between writers to guarantee
serializable execution of write operations. To achieve this, Hudi assumes the
availability of a distributed lock service (see appendix for lock provider
implementations) to acquire an exclusive table level lock.
-
-
+This section describes how to implement a traditional relational model, with
concurrent writers on top of the storage format described so far.
+Specifically, it aims to employ optimistic concurrency control between writers
to guarantee serializable execution of write operations. To achieve this,
+Hudi assumes the availability of a distributed lock service (see appendix for
lock provider implementations) to acquire an exclusive table level lock.
There are three types of processes at play at any given time :
@@ -474,15 +449,17 @@ There are three types of processes at play at any given
time :
Writers generate a begin time for their actions, proceed to create new base
and log files and will finally transition the action state to completed
atomically on the timeline as follows.
-
-
-1. Writer requests a begin time for the write action as defined in the
Timeline section above (this may involve a lock, depending on TrueTime
generation mechanism used). Writer also records the latest completion time on
the timeline (let's call this **_snapshot write time_**).
-2. Writer produces new base and log files with updated, deleted, inserted
records, while ensuring updates/deletes reach the correct file group by looking
up an index as of the snapshot write time (see appendix for an alternative way
to encode updates as deletes to an existing file group and inserts into a new
file group, along with performance tradeoffs).
+1. Writer requests a begin time for the write action as defined in the
Timeline section above (this may involve a lock, depending on TrueTime
generation mechanism used).
+ Writer also records the latest completion time on the timeline (let's call
this **_snapshot write time_**).
+2. Writer produces new base and log files with updated, deleted, inserted
records, while ensuring updates/deletes reach the correct file group by looking
up an index as
+ of the snapshot write time (see appendix for an alternative way to encode
updates as deletes to an existing file group and inserts into a new file group,
along with performance tradeoffs).
3. Writer then grabs a distributed lock and proceeds to perform the following
steps within the critical section of the lock to finalize the write or
fail/abort.
- 1. Obtain metadata about all actions that have completed with completion
time greater than base snapshot time. This defines the **_concurrent set_** of
actions against which the writer needs to checks if any concurrent write
operations or table service actions conflict with it.
+ 1. Obtain metadata about all actions that have completed with completion
time greater than base snapshot time. This defines the **_concurrent set_** of
+ actions against which the writer needs to checks if any concurrent write
operations or table service actions conflict with it.
2. Writer checks for any overlapping file groups that have been written to
in the concurrent set and decides to abort if so.
3. Writer checks for any compactions planned or clustering actions
completed on overlapping file groups and decides to abort if so.
- 4. Writer reads out all record keys written by the concurrent set and
compares it against keys from files it is about to commit. If there are any
overlaps, the writer decides to abort (implementations could omit this check
for performance reasons with the ensuing tradeoffs/limitations, if deemed
acceptable)
+ 4. Writer reads out all record keys written by the concurrent set and
compares it against keys from files it is about to commit. If there are any
overlaps,
+ the writer decides to abort (implementations could omit this check for
performance reasons with the ensuing tradeoffs/limitations, if deemed
acceptable)
5. Writer checks
6. If any of the checks above are true, writer releases the lock and aborts
the write.
4. If checks in (3) pass, the writer proceeds to finalizing the write, while
holding the lock.
@@ -490,16 +467,15 @@ Writers generate a begin time for their actions, proceed
to create new base and
2. Writer generates a completion time for the write and records it to the
timeline atomically, along with necessary metadata
3. Optionally, writer plans any table services or even runs them. This is
again an implementation choice.
4. Writer releases the lock.
-
-
-
-Note that specific writer implementations can choose to even abort early to
improve efficiency and reduce resource wastage, if they can detect conflicts
using mechanisms like marker files described in the appendix.
+
+Note that specific writer implementations can choose to even abort early to
improve efficiency and reduce resource wastage,
+if they can detect conflicts using mechanisms like marker files described in
the appendix.
### Table Service/Writer Concurrency
-All table services acquire the same exclusive lock as writers above to plan
and finalize action, including updating the metadata/indexes. Table services
plans are serialized within the lock and once requested on the timeline are
idempotent and expected to complete eventually with potential retries on
failures. External table service management and built-in optional table
services within writer processes can seamlessly co-exist and merely a
deployment convenience.
-
-
+All table services acquire the same exclusive lock as writers above to plan
and finalize action, including updating the metadata/indexes.
+Table services plans are serialized within the lock and once requested on the
timeline are idempotent and expected to complete eventually with
+potential retries on failures. External table service management and built-in
optional table services within writer processes can seamlessly co-exist and
merely a deployment convenience.
Table services should follow these expectations to ensure they don't block
each other.
@@ -513,15 +489,12 @@ Table services should follow these expectations to ensure
they don't block each
Thus, Hudi does not treat table service actions as opaque modifications to the
table and thus supports running compaction without blocking writers helpful for
high update/concurrency workloads.
### Concurrent readers
-
-The model supports snapshot isolation for concurrent readers. A reader
constructs the state of the table based on the action with the greatest
completion time and proceeds to construct file slices out of file groups that
are of interest to the read, based on the type of query. Two concurrent readers
are never in contention even in the presence of concurrent writes happening.
+The model supports snapshot isolation for concurrent readers. A reader
constructs the state of the table based on the action with the greatest
completion time and proceeds
+to construct file slices out of file groups that are of interest to the read,
based on the type of query. Two concurrent readers are never in contention even
in the presence of concurrent writes happening.
### Reader Expectations
-
Readers further need to determine the correct file slice (an optional base
file plus an ordered list of log files) to read in order to construct the
snapshot state using the following steps
-
-
1. Reader picks an instant in the timeline - latest completion time on the
timeline or a specific time explicitly specified. Let's call this **_snapshot
read time_**.
2. Reader computes all file groups in the table, by first all files part of
the table, grouping them by file\_id and then eliminating files that don't
belong to any completed write action on the timeline
3. Reader then further eliminates replaced file groups, by removing any file
groups that are part of any **_replacecommit_** actions completed on the
timeline.
@@ -533,33 +506,31 @@ Readers further need to determine the correct file slice
(an optional base file
Obtaining the listings from storage could be slow or inefficient. It can be
further optimized by caching in memory or using the files metadata or with the
support of an external timeline serving system.
-
-
## Incremental Streaming Model
-Optimistic concurrency control yields poor performance with long running
transactions and even moderate levels of writer contention. While the relation
model is well-understood, serializability based on arbitrary system time (e.g
completion times in Hudi or SCNs in databases) may be an overkill and even
inadequate for many scenarios. Hudi storage also supports an alternative model
based on stream processing techniques, with the following changes.
+Optimistic concurrency control yields poor performance with long running
transactions and even moderate levels of writer contention. While the relation
model is well-understood,
+serializability based on arbitrary system time (e.g completion times in Hudi
or SCNs in databases) may be an overkill and even inadequate for many
scenarios. Hudi storage also supports
+an alternative model based on stream processing techniques, with the following
changes.
-
-
-1. **Event-time based ordering**: Instead of merging base and log files based
on completion times, latest value for a record is picked based on a
user-specified event field that denotes the latest version of the record.
-2. **Relaxed expectations to block/abort**: By tracking the span of actions
using both begin and completion times on the timeline, processes can detect
conflicting actions and adjust file slicing accordingly. For e.g this model
allows compaction to be even planned without blocking writer processes.
+1. **Event-time based ordering**: Instead of merging base and log files based
on completion times, latest value for a record is picked based on
+ a user-specified event field that denotes the latest version of the record.
+2. **Relaxed expectations to block/abort**: By tracking the span of actions
using both begin and completion times on the timeline, processes can
+ detect conflicting actions and adjust file slicing accordingly. For e.g
this model allows compaction to be even planned without blocking writer
processes.
3. **Custom merging** : Support implementation of custom merges using
RecordPayload and RecordMerger APIs.
### Non-blocking Concurrency Control
-
-\[WIP\] Please refer to
[RFC-66](https://github.com/apache/hudi/blob/master/rfc/rfc-66/rfc-66.md) for
more and a proof of correctness.
-
-
+Please refer to
[RFC-66](https://github.com/apache/hudi/blob/master/rfc/rfc-66/rfc-66.md) for
more and a proof of correctness.
## Table Management
-All table services can be run synchronous with the Table client that merges
modifications to the data or can be run asynchronously to the table client.
Asynchronous is default mode in the Apache Hudi platform. Any client can
trigger table management by registering a 'requested' state action in the Hudi
timeline. Process in charge of running the table management tasks
asynchronously looks for the presence of this trigger in the timeline.
+All table services can be run synchronous with the Table client that merges
modifications to the data or can be run asynchronously to the table client.
+Asynchronous is default mode in the Apache Hudi platform. Any client can
trigger table management by registering a 'requested' state action in the Hudi
timeline.
+Process in charge of running the table management tasks asynchronously looks
for the presence of this trigger in the timeline.
### Compaction
-Compaction is the process that efficiently updates a file slice (base and log
files) for efficient querying. It applies all the batched up updates in the log
files and writes a new file slice. The logic to apply the updates to the base
file follows the same set of rules listed in the Reader expectations.
-
-
+Compaction is the process that efficiently updates a file slice (base and log
files) for efficient querying. It applies all the batched up updates
+in the log files and writes a new file slice. The logic to apply the updates
to the base file follows the same set of rules listed in the Reader
expectations.
### Log Compaction
@@ -572,34 +543,28 @@ See
[RFC-48](https://github.com/apache/hudi/blob/master/rfc/rfc-48/rfc-48.md) fo
### Re-writing
-If the natural ingestion ordering does not match the query patterns, then data
skipping does not work efficiently. It is important for query efficiency to be
able to skip as much data on filter and join predicates with column level
statistics. Clustering columns need to be specified on the Hudi table. The goal
of the clustering table service, is to group data often accessed together and
consolidate small files to the optimum target file size for the table.
-
-
+If the natural ingestion ordering does not match the query patterns, then data
skipping does not work efficiently. It is important for query efficiency
+to be able to skip as much data on filter and join predicates with column
level statistics. Clustering columns need to be specified on the Hudi table.
+The goal of the clustering table service, is to group data often accessed
together and consolidate small files to the optimum target file size for the
table.
1. Identify file groups that are eligible for clustering - this is chosen
based on the clustering strategy (file size based, time based etc)
2. Identify clustering groups (file groups that should be clustered together)
and each group should expect data sizes in multiples of the target file size.
3. Persist the clustering plan in the Hudi timeline as a replacecommit, when
clustering is requested.
4. Clustering execution can then read the individual clustering groups, write
back new file groups with target size with base files sorted by the specified
clustering columns.
-
-
### Cleaning
-Cleaning is a process to free up storage space. Apache Hudi maintains a
timeline and multiple versions of the files written as file slices. It is
important to specify a cleaning protocol which deletes older versions and
reclaims the storage space. Cleaner cannot delete versions that are currently
in use or will be required in future. Snapshot reconstruction on a commit
instant which has been cleaned is not possible.
-
-
+Cleaning is a process to free up storage space. Apache Hudi maintains a
timeline and multiple versions of the files written as file slices.
+It is important to specify a cleaning protocol which deletes older versions
and reclaims the storage space. Cleaner cannot delete versions that
+are currently in use or will be required in future. Snapshot reconstruction on
a commit instant which has been cleaned is not possible.
For e.g, there are a couple of retention policies supported in Apache Hudi
platform
-
-
* **keep\_latest\_commits**: This is a temporal cleaning policy that ensures
the effect of having look-back into all the changes that happened in the last X
commits.
* **keep\_latest\_file\_versions**: This policy has the effect of keeping a
maximum of N number of file versions irrespective of time.
-
-
-Apache Hudi provides snapshot isolation between writers and readers by
managing multiple files with MVCC concurrency. These file versions provide
history and enable time travel and rollbacks, but it is important to manage how
much history you keep to balance your storage costs.
-
+Apache Hudi provides snapshot isolation between writers and readers by
managing multiple files with MVCC concurrency. These file versions provide
history
+and enable time travel and rollbacks, but it is important to manage how much
history you keep to balance your storage costs.
### Indexing
@@ -621,18 +586,17 @@ gracefully. Finally, when the indexing is complete, the
indexer writes the `[ins
only takes a lock while adding events to the timeline and not while writing
index files.
See [RFC-45](https://github.com/apache/hudi/blob/master/rfc/rfc-45/rfc-45.md)
for now.
-
## Compatibility
-Compatibility between different readers and writers is enforced through the
`hoodie.table.version` table config. Hudi storage format evolves in a backwards
compatible way for readers, where newer readers can read older table versions
correctly. However older readers may be required to upgrade in order to read
higher table versions. Hence, we recommend upgrading readers first, before
upgrading writers when the table version evolves.
-
-
+Compatibility between different readers and writers is enforced through the
`hoodie.table.version` table config. Hudi storage format evolves in a backwards
+compatible way for readers, where newer readers can read older table versions
correctly. However older readers may be required to upgrade in order to read
higher table versions.
+Hence, we recommend upgrading readers first, before upgrading writers when the
table version evolves.
## Change Log
-### version 7
+### version 8
-\[WIP\] This is a breaking format version. Users may need to first migrate
completely to the new timeline before resuming normal table operations.
+This is a breaking format version. Users may need to first migrate completely
to the new timeline before resuming normal table operations.
### version 6 & earlier
@@ -642,84 +606,56 @@ Please refer to the previous tech specs
[document](https://hudi.apache.org/tech-
### RecordMerger API
-Hudi data model ensures record key uniqueness constraint, to maintain this
constraint every single record merged into the table needs to be checked if the
same record key already exists in the table. If it does exist, the conflict
resolution strategy is applied to create a new merged record to be persisted.
This check is done at the file group level and every record merged needs to be
tagged to a single file group. By default, record merging is done during the
merge which makes it effici [...]
+Hudi data model ensures record key uniqueness constraint, to maintain this
constraint every single record merged into the table needs to be checked if the
+same record key already exists in the table. If it does exist, the conflict
resolution strategy is applied to create a new merged record to be persisted.
+This check is done at the file group level and every record merged needs to be
tagged to a single file group. By default, record merging is done during the
merge which
+makes it efficient for queries but for certain real-time streaming
requirements, it can also be deferred to the query as long as there is a
consistent way to mapping
+the record key to a certain file group using consistent hashing techniques.
### Indexing Functions
-\[WIP\] See
[RFC-63](https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md) for
design and direction.
-
-
-
-## \[WIP\] Appendix
-
-
-
-### Formal Proofs
-
-
-
-### Timeline Maintenance
-
-
-
-### Atomic writes to storage
-
-
-
-The requirement from the underlying storage system is to support an atomic-put
and read-after-write consistency.
-
-
-
-### TrueTime Implementations
+See [RFC-63](https://github.com/apache/hudi/blob/master/rfc/rfc-63/rfc-63.md)
for design and direction.
+## Appendix
### Lock Provider Implementations
-Apache Hudi platform uses optimistic locking and provides a pluggable
LockProvider interface and multiple implementations are available out of the
box (Apache Zookeeper, DynamoDB, Apache Hive and JVM Level in-process lock). It
is also worth noting that, if multiple writers originate from the same JVM
client, a simple locking at the client level would serialize the writes and no
external locking needs to be configured.
+Apache Hudi platform uses optimistic locking and provides a pluggable
LockProvider interface and multiple implementations are available out of the
box (Apache Zookeeper, DynamoDB, Apache Hive
+and JVM Level in-process lock). It is also worth noting that, if multiple
writers originate from the same JVM client, a simple locking at the client
level would serialize the writes
+and no external locking needs to be configured.
### Balancing write and read performance
-
-
-A critical design choice for any table is to pick the right trade-offs in the
data freshness and query performance spectrum. Hudi storage format lets the
users decide on this trade-off by picking the table type, record merging and
file sizing.
-
-
+A critical design choice for any table is to pick the right trade-offs in the
data freshness and query performance spectrum. Hudi storage format lets the
users decide
+on this trade-off by picking the table type, record merging and file sizing.
| | Merge Efficiency | Query Efficiency |
| ---| ---| --- |
-| Copy on Write (COW) | **Tunable:** COW table type creates a new File slice
in the file-group for every batch of updates. Write amplification can be quite
high when the update is spread across multiple file groups. The cost involved
can be high over a time period especially on tables with low data latency
requirements. | **Optimal:** COW table types create whole readable data files
in open source columnar file formats on each merge batch, there is minimal
overhead per record in the quer [...]
-| Merge on Read (MOR) | **Optimal:** MOR table type batches the updates to the
file slice in a separate optimized Log file, write amplification is amortized
over time when sufficient updates are batched. The merge cost involved will be
lower than COW since the churn on the records re-written for every update is
much lower. | **Tunable:** MOR Table type required record level merging during
query. Although there are techniques to make this merge as efficient as
possible, there is still a r [...]
-
+| Copy on Write (COW) | **Tunable:** COW table type creates a new File slice
in the file-group for every batch of updates. Write amplification can be quite
high <br/>when the update is spread across multiple file groups. <br/>The cost
involved can be high over a time period especially on tables with low data
latency requirements. | **Optimal:** COW table types create whole readable data
files in open source columnar file formats on each merge batch, there is
minimal overhead per record i [...]
+| Merge on Read (MOR) | **Optimal:** MOR table type batches the updates to the
file slice in a separate optimized Log file, write amplification is amortized
over time <br/>when sufficient updates are batched. The merge cost involved
will be lower than COW since the churn on the records re-written for every
update is much lower. | **Tunable:** MOR Table type required record level
merging during query. Although there are techniques to make this merge as
efficient as possible, there is stil [...]
> Interesting observation on the MOR table format is that, by providing a
> special view of the table which only serves the base files in the file slice
> (read optimized query of MOR table), query can pick between query efficiency
> and data freshness dynamically during query time. Compaction frequency
> determines the data freshness of the read optimized view. With this, the MOR
> has all the levers required to balance the merge and query performance
> dynamically.
-##
-
Sizing the file group is extremely critical to balance the merge and query
performance. Larger the file size, the more the write amplification when new
file slices are being created. So to balance the merge cost, compaction or
merge frequency should be tuned accordingly and this has an impact on the query
performance or data freshness.
-
-### Logging Updates as Deletes + Inserts
-
-
-
### Optimistic concurrency efficiency
-The efficiency of Optimistic concurrency is inversely proportional to the
possibility of a conflict, which in turn depends on the running time and the
files overlapping between the concurrent writers. Apache Hudi storage format
makes design choices that make it possible to configure the system to have a
low possibility of conflict with regular workloads
-
-
+The efficiency of Optimistic concurrency is inversely proportional to the
possibility of a conflict, which in turn depends on the running time and the
files overlapping between the concurrent writers.
+Apache Hudi storage format makes design choices that make it possible to
configure the system to have a low possibility of conflict with regular
workloads
* All records with the same record key are present in a single file group.
In other words, there is a 1-1 mapping between a record key and a file group
id, at all times.
-* Unit of concurrency is a single file group and this file group size is
configurable. If the table needs to be optimized for concurrent updates, the
file group size can be smaller than default which could mean lower collision
rates.
-* Merge-on-read storage engine has the option to store the contents in
record oriented file formats which reduces write latencies (often up to 10
times compared to columnar storage) which results in less collision with other
concurrent writers
+* Unit of concurrency is a single file group and this file group size is
configurable. If the table needs to be optimized for concurrent updates, the
file group size can be
+ smaller than default which could mean lower collision rates.
+* Merge-on-read storage engine has the option to store the contents in
record oriented file formats which reduces write latencies (often up to 10
times compared to columnar storage)
+ which results in less collision with other concurrent writers
* Merge-on-read storage engine combined with scalable metadata table ensures
that the system can handle frequent updates efficiently which means ingest jobs
can be frequent and quick, reducing the chance of conflicts
-
### Marker mechanism to remove uncommitted data
-See
[this](https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/) and
[RFC-56](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md) for
now.
+See
[this](https://hudi.apache.org/blog/2021/08/18/improving-marker-mechanism/) and
[RFC-56](https://github.com/apache/hudi/blob/master/rfc/rfc-56/rfc-56.md).