This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a040a1d41759 chore: clean up and fix duplicate pages (#14189)
a040a1d41759 is described below
commit a040a1d4175920dd17ed0770baf2e14383083566
Author: Shiyan Xu <[email protected]>
AuthorDate: Thu Oct 30 10:39:43 2025 -0700
chore: clean up and fix duplicate pages (#14189)
---
.gitignore | 20 +++++
website/.gitignore | 28 -------
website/assets/images/powers/hudi-logo-page.png | Bin 406763 -> 0 bytes
website/learn/index.md | 16 ++++
website/learn/use_cases.md | 81 ---------------------
website/src/pages/faq/design_and_concepts.md | 2 +-
website/versioned_docs/version-0.14.0/faq.md | 2 +-
.../version-0.14.1/faq_design_and_concepts.md | 2 +-
.../version-0.15.0/faq_design_and_concepts.md | 2 +-
.../version-1.0.0/faq_design_and_concepts.md | 2 +-
.../version-1.0.1/faq_design_and_concepts.md | 2 +-
.../version-1.0.2/faq_design_and_concepts.md | 2 +-
12 files changed, 43 insertions(+), 116 deletions(-)
diff --git a/.gitignore b/.gitignore
index d2eb67c69ff8..88ce7aedfe17 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,23 @@ _site
*.iml
.DS_Store
node_modules/
+
+# Logs
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+
+# Website (Docusaurus)
+website/build/
+website/.docusaurus
+website/.cache-loader
+website/.node_modules/
+website/package-lock.json
+website/yarn.lock
+website/.env.local
+website/.env.development.local
+website/.env.test.local
+website/.env.production.local
+website/.idea/
+website/.vscode
+website/.changelog
diff --git a/website/.gitignore b/website/.gitignore
deleted file mode 100644
index dea10e483400..000000000000
--- a/website/.gitignore
+++ /dev/null
@@ -1,28 +0,0 @@
-# Dependencies
-/node_modules
-package-lock.json
-yarn.lock
-.node_modules/
-# Production
-/build
-
-# Generated files
-.docusaurus
-.cache-loader
-
-# Misc
-.DS_Store
-.env.local
-.env.development.local
-.env.test.local
-.env.production.local
-
-npm-debug.log*
-yarn-debug.log*
-yarn-error.log*
-
-# IDE
-.vscode
-.idea
-*.code-workspace
-.changelog
\ No newline at end of file
diff --git a/website/assets/images/powers/hudi-logo-page.png b/website/assets/images/powers/hudi-logo-page.png
deleted file mode 100644
index 7389b4315973..000000000000
Binary files a/website/assets/images/powers/hudi-logo-page.png and /dev/null differ
diff --git a/website/learn/index.md b/website/learn/index.md
new file mode 100644
index 000000000000..d307645e06eb
--- /dev/null
+++ b/website/learn/index.md
@@ -0,0 +1,16 @@
+---
+id: index
+title: Learning Hub
+sidebar_label: Overview
+description: Entry point for Hudi learning resources, including the Tutorial Series, Talks, Blogs, Videos, and FAQ.
+---
+
+Welcome to Apache Hudi's learning hub.
+
+- Start with the quick-start guide in the docs: [/docs/quick-start-guide](/docs/quick-start-guide)
+- Explore the tutorial series: [/learn/tutorial-series](/learn/tutorial-series)
+- Browse the blogs: [/blog](/blog)
+- Watch video guides: [/videos](/videos)
+- Check the FAQ: [/faq](/faq)
+
+If you're looking to contribute or join community events, see [/community/get-involved](/community/get-involved).
diff --git a/website/learn/use_cases.md b/website/learn/use_cases.md
deleted file mode 100644
index 124cabe160f8..000000000000
--- a/website/learn/use_cases.md
+++ /dev/null
@@ -1,81 +0,0 @@
----
-title: "Use Cases"
-keywords: [ hudi, data ingestion, etl, real time, use cases]
-summary: "Following are some sample use-cases for Hudi, which illustrate the benefits in terms of faster processing & increased efficiency"
-toc: true
-last_modified_at: 2019-12-30T15:59:57-04:00
----
-
-## Near Real-Time Ingestion
-
-Hudi offers some great benefits across ingestion of all kinds. Hudi helps __enforces a minimum file size on DFS__. This helps
-solve the ["small files problem"](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) for HDFS and Cloud Stores alike,
-significantly improving query performance. Hudi adds the much needed ability to atomically commit new data, shielding queries from
-ever seeing partial writes and helping ingestion recover gracefully from failures.
-
-Ingesting data from OLTP sources like (event logs, databases, external sources) into a [Data Lake](http://martinfowler.com/bliki/DataLake) is a common problem,
-that is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools. This "raw data" layer of the data lake often forms the bedrock on which
-more value is created.
-
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed costly & inefficient bulk loads. It's very common to use a change capture solution like
-[Debezium](http://debezium.io/) or [Kafka Connect](https://docs.confluent.io/platform/current/connect/index) or
-[Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports) and apply them to an
-equivalent Hudi table on DFS. For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/),
-even moderately big installations store billions of rows. It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches
-are needed if ingestion is to keep up with the typically high update volumes.
-
-Even for immutable data sources like [Kafka](https://kafka.apache.org), there is often a need to de-duplicate the incoming events against what's stored on DFS.
-Hudi achieves this by [employing indexes](http://hudi.apache.org/blog/hudi-indexing-mechanisms/) of different kinds, quickly and efficiently.
-
-All of this is seamlessly achieved by the Hudi DeltaStreamer tool, which is maintained in tight integration with rest of the code
-and we are always trying to add more capture sources, to make this easier for the users. The tool also has a continuous mode, where it
-can self-manage clustering/compaction asynchronously, without blocking ingestion, significantly improving data freshness.
-
-## Data Deletion
-
-Hudi also offers ability to delete the data stored in the data lake, and more so provides efficient ways of dealing with
-large write amplification, resulting from random deletes based on user_id (or any secondary key), by way of the `Merge On Read` table types.
-Hudi's elegant log based concurrency control, ensures that the ingestion/writing can continue happening,as a background compaction job
-amortizes the cost of rewriting data/enforcing deletes.
-
-Hudi also unlocks special capabilities like data clustering, which allow users to optimize the data layout for deletions. Specifically,
-users can cluster older event log data based on user_id, such that, queries that evaluate candidates for data deletion can do so, while
-more recent partitions are optimized for query performance and clustered on say timestamp.
-
-## Unified Storage For Analytics
-
-The world we live in is polarized - even on data analytics storage - into real-time and offline/batch storage. Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart)
-are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [Clickhouse](https://clickhouse.tech/), fed by event buses like
-[Kafka](https://kafka.apache.org) or [Pulsar](https://pulsar.apache.org). This model is prohibitively expensive, unless a small fraction of your data lake data
-needs sub-second query responses such as system monitoring or interactive real-time analysis.
-
-The same data gets ingested into data lake storage much later (say every few hours or so) and then runs through batch ETL pipelines, with intolerable data freshness
-to do any kind of near-realtime analytics. On the other hand, the data lakes provide access to interactive SQL engines like Presto/SparkSQL, which can horizontally scale
-easily and provide return even more complex queries, within few seconds.
-
-By bringing streaming primitives to data lake storage, Hudi opens up new possibilities by being able to ingest data within few minutes and also author incremental data
-pipelines that are orders of magnitude faster than traditional batch processing. By bringing __data freshness to a few minutes__, Hudi can provide a much efficient alternative,
-for a large class of data applications, compared to real-time datamarts. Also, Hudi has no upfront server infrastructure investments
-and thus enables faster analytics on much fresher analytics, without increasing the operational overhead. This external [article](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/)
-further validates this newer model.
-
-## Incremental Processing Pipelines
-
-Data Lake ETL typically involves building a chain of tables derived from each other via DAGs expressed as workflows. Workflows often depend on new data being output by
-multiple upstream workflows and traditionally, availability of new data is indicated by a new DFS Folder/Hive Partition.
-Let's take a concrete example to illustrate this. An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing effective freshness of 1 hour.
-Then, a downstream workflow `D`, kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.
-
-The above paradigm simply ignores late arriving data i.e when `processing_time` and `event_time` drift apart.
-Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors are the norm, not an anomaly__.
-In such cases, the only remedy to guarantee correctness is to reprocess the last few hours worth of data, over and over again each hour,
-which can significantly hurt the efficiency across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data every hour across hundreds of workflows.
-
-Hudi comes to the rescue again, by providing a way to consume new data (including late data) from an upstream Hudi table `HU` at a record granularity (not folders/partitions),
-apply the processing logic, and efficiently update/reconcile late data with a downstream Hudi table `HD`. Here, `HU` and `HD` can be continuously scheduled at a much more frequent schedule
-like 15 mins, and providing an end-end latency of 30 mins at `HD`.
-
-To achieve this, Hudi has embraced similar concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide#join-operations) , Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
-[Flink](https://flink.apache.org) or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
-For the more curious, a more detailed explanation of the benefits of Incremental Processing can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
-
diff --git a/website/src/pages/faq/design_and_concepts.md b/website/src/pages/faq/design_and_concepts.md
index faa062c9c95c..f21c553ec3af 100644
--- a/website/src/pages/faq/design_and_concepts.md
+++ b/website/src/pages/faq/design_and_concepts.md
@@ -13,7 +13,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
diff --git a/website/versioned_docs/version-0.14.0/faq.md b/website/versioned_docs/version-0.14.0/faq.md
index 74bf66d3ae7a..59984161e6a5 100644
--- a/website/versioned_docs/version-0.14.0/faq.md
+++ b/website/versioned_docs/version-0.14.0/faq.md
@@ -110,7 +110,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](https://hudi.apache.org/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](https://hudi.apache.org/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are b [...]
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](https://hudi.apache.org/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond sim [...]
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](https://hudi.apache.org/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously whic [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
diff --git a/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md b/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
diff --git a/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md b/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
diff --git a/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md b/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
diff --git a/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md b/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
diff --git a/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md b/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
-* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+* Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing). Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
* Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially. Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
* Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.