Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-07 Thread via GitHub


bhasudha merged PR #10624:
URL: https://github.com/apache/hudi/pull/10624


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


bhasudha commented on PR #10624:
URL: https://github.com/apache/hudi/pull/10624#issuecomment-1928592401

   Tested it locally; the diagrams may need to be reduced in size, since they feel a little disproportionate compared to other pages.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478878886


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a variety of systems, including HDFS for fast appends and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
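To make this abstraction concrete, here is a minimal sketch (plain Hadoop FileSystem API calls with a hypothetical bucket and table path, not code from the Hudi codebase) showing how the same listing logic works against any supported store, assuming the relevant filesystem connector is on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListTableFiles {
    public static void main(String[] args) throws Exception {
        // Hypothetical Hudi table base path; hdfs://, s3a://, gs:// or
        // abfs:// URIs all resolve to a concrete FileSystem via the same API.
        Path basePath = new Path("s3a://my-bucket/warehouse/my_hudi_table");
        FileSystem fs = basePath.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(basePath)) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}
```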
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure: the base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.

Review Comment:
   @dipankarmazumdar can you also fix all occurrences of File Group, File Slice, Base File, Log File, etc. to align the casing, indicating that these are Hudi-specific terms?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478877408


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a variety of systems, including HDFS for fast appends and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure: the base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
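To make the 'sequence of blocks' idea concrete, the following is a deliberately simplified model of a log file's contents (hypothetical types for illustration, not Hudi's actual classes):

```java
import java.util.List;

// Simplified sketch of the log-file layout described above: a log file is
// an ordered sequence of typed blocks appended over time, and replaying
// the blocks in order reconstructs the latest state of the records.
enum BlockType { DATA, DELETE, ROLLBACK }

record LogBlock(BlockType type, String instantTime, byte[] payload) {}

record LogFile(String fileId, List<LogBlock> blocks) {}
```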
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on lakehouse storage.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline**: Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs even after related actions are archived. This approach effectively limits metadata size based on the table's activity frequency, which is essential for managing tables with frequent updates.
+
+- **File Group and File Slice**: Within each partition, the data is physically stored as base and log files and organized into logical concepts known as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group is split into multiple file slices, and each file slice comprises a base file together with its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it (see the sketch after this list).
+
+- **Metadata Table** 
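The sketch below models the hierarchy described in the list above (hypothetical types for illustration only, assuming one base file plus a set of log files per file slice):

```java
import java.util.List;

// Illustrative model of the storage layout: a file group holds successive
// file slices, and each file slice pairs one base file with the log files
// holding updates made since that base file was written.
record FileSlice(String commitTimestamp,   // instant that created the slice
                 String baseFilePath,
                 List<String> logFilePaths) {}

record FileGroup(String fileId, String partitionPath, List<FileSlice> slices) {
    // The most recent slice reflects the table state after the latest
    // compaction plus any log files appended since.
    FileSlice latestSlice() {
        return slices.get(slices.size() - 1);
    }
}
```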

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478876581


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a variety of systems, including HDFS for fast appends and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure: the base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on lakehouse storage.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline**: Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs even after related actions are archived. This approach effectively limits metadata size based on the table's activity frequency, which is essential for managing tables with frequent updates (see the sketch after this list).
+
+- **File Group and File Slice**: Within each partition, the data is physically stored as base and log files and organized into logical concepts known as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group is split into multiple file slices, and each file slice comprises a base file together with its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table** 
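As referenced in the Timeline item above, here is a minimal sketch of the timeline as an ordered event log (hypothetical types and naming for illustration; real instants are stored as files under the /.hoodie folder):

```java
import java.util.List;

// Illustrative model of the timeline: an ordered log of table actions.
// Each instant roughly corresponds to a file under /.hoodie named
// <timestamp>.<action>[.<state>], e.g. "20240205093000123.commit".
enum Action { COMMIT, DELTACOMMIT, COMPACTION, CLEAN, ROLLBACK }

record Instant(String timestamp, Action action, boolean completed) {}

record Timeline(List<Instant> instants) {
    // Readers consider only completed instants, so every query observes a
    // consistent snapshot of the table.
    List<Instant> completedInstants() {
        return instants.stream().filter(Instant::completed).toList();
    }
}
```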

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478873515


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a variety of systems, including HDFS for fast appends and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure: the base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on lakehouse storage.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline**: Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs even after related actions are archived. This approach effectively limits metadata size based on the table's activity frequency, which is essential for managing tables with frequent updates.
+
+- **File Group and File Slice**: Within each partition, the data is physically stored as base and log files and organized into logical concepts known as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group is split into multiple file slices, and each file slice comprises a base file together with its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table** 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478869049


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a variety of systems, including HDFS for fast appends and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure: the base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
+
+# Transactional Database Layer
+The transactional database layer of Hudi comprises the core components responsible for the fundamental operations and services that enable Hudi to store, retrieve, and manage data efficiently on lakehouse storage.
+
+## Table Format
+![Table Format](/assets/images/blog/hudistack/table_format_1.png)
+_Figure: Apache Hudi's Table format_
+
+Drawing an analogy to file formats, a table format simply comprises the file 
layout of the table, the schema, and metadata tracking changes. Hudi organizes 
files within a table or partition into File Groups. Updates are captured in log 
files tied to these File Groups, ensuring efficient merges. There are three 
major components related to Hudi’s table format.
+
+- **Timeline**: Hudi's [timeline](https://hudi.apache.org/docs/timeline), stored in the /.hoodie folder, is a crucial event log recording all table actions in an ordered manner, with events kept for a specified period. Hudi uniquely designs each file group as a self-contained log, enabling record state reconstruction through delta logs even after related actions are archived. This approach effectively limits metadata size based on the table's activity frequency, which is essential for managing tables with frequent updates.
+
+- **File Group and File Slice**: Within each partition, the data is physically stored as base and log files and organized into logical concepts known as [File groups](https://hudi.apache.org/tech-specs-1point0/#storage-layout) and File slices. A file group is split into multiple file slices, and each file slice comprises a base file together with its associated log files. Each file slice within the file group is uniquely identified by the timestamp of the commit that created it.
+
+- **Metadata Table** 

Re: [PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


xushiyan commented on code in PR #10624:
URL: https://github.com/apache/hudi/pull/10624#discussion_r1478861210


##
website/docs/hudi_stack.md:
##
@@ -0,0 +1,99 @@
+---
+title: Apache Hudi Stack
+summary: "Explains the various layers of software components that make up Hudi"
+toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
+last_modified_at:
+---
+
+Apache Hudi is a Transactional Data Lakehouse Platform built around a database kernel. It brings core warehouse and database functionality directly to a data lake, thereby providing a table-level abstraction over open file formats like Apache Parquet/ORC (more recently known as the lakehouse architecture) and enabling transactional capabilities such as updates/deletes. Hudi also incorporates essential table services that are tightly integrated with the database kernel. These services can be executed automatically across both ingested and derived data to manage various aspects such as table bookkeeping, metadata, and storage layout. This integration, along with various platform-specific services, extends Hudi's role from being just a 'table format' to a comprehensive and robust data lakehouse platform.
+
+In this section, we will explore the Hudi stack and deconstruct the layers of 
software components that constitute Hudi. The features marked with an asterisk 
(*) represent work in progress, and the dotted boxes indicate planned future 
work. These components collectively aim to fulfill the 
[vision](https://github.com/apache/hudi/blob/master/rfc/rfc-69/rfc-69.md) for 
the project. 
+
+![Hudi Stack](/assets/images/blog/hudistack/hstck.png)
+_Figure: Apache Hudi Architectural stack_
+
+# Lake Storage
+The storage layer is where the data files (such as Parquet) are stored. Hudi interacts with the storage layer through the [Hadoop FileSystem API](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/FileSystem.html), enabling compatibility with a variety of systems, including HDFS for fast appends and cloud stores such as Amazon S3, Google Cloud Storage (GCS), and Azure Blob Storage. Additionally, Hudi offers its own storage APIs that can rely on Hadoop-independent file system implementations to simplify the integration of various file systems. Hudi adds a custom wrapper filesystem that lays the foundation for improved storage optimizations.
+
+# File Formats
+![File Format](/assets/images/blog/hudistack/file_format.png)
+_Figure: File format structure in Hudi_
+
+File formats hold the raw data and are physically stored on the lake storage. Hudi operates on a 'base file and log file' structure: the base files are compacted and optimized for reads, and are augmented with log files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., JSON, images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a log file as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.

Review Comment:
   ```suggestion
   File formats hold the raw data and are physically stored on the lake storage. Hudi operates on logical structures of File Groups and File Slices, which consist of a Base File and Log Files. Base Files are compacted and optimized for reads, and are augmented with Log Files for efficient appends. Future updates aim to support diverse formats, such as unstructured data (e.g., images), and compatibility with different storage layers in event-streaming, OLAP engines, and warehouses. Hudi's layout scheme encodes all changes to a Log File as a sequence of blocks (data, delete, rollback). By making data available in open file formats (such as Parquet), Hudi enables users to bring any compute engine for specific workloads.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [DOCS] Add a new Hudi Architectural Stack page [hudi]

2024-02-05 Thread via GitHub


dipankarmazumdar opened a new pull request, #10624:
URL: https://github.com/apache/hudi/pull/10624

   ### Change Logs
   
   This PR adds a new page to the Hudi documentation called 'Apache Hudi Stack'
   
   ### Impact
   
   Adds a new page for clarity around Hudi's platform & architecture
   
   ### Risk level (write none, low, medium or high below)
   
   None
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, 
config, or user-facing change_
   
   This PR is itself a documentation update
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Change Logs and Impact were stated clearly
   - [x] Adequate tests were added if applicable
   - CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org