(hudi) branch asf-site updated: chore: clean up and fix duplicate pages (#14189)

xushiyan Thu, 30 Oct 2025 10:39:57 -0700

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new a040a1d41759 chore: clean up and fix duplicate pages (#14189)
a040a1d41759 is described below

commit a040a1d4175920dd17ed0770baf2e14383083566
Author: Shiyan Xu <[email protected]>
AuthorDate: Thu Oct 30 10:39:43 2025 -0700

    chore: clean up and fix duplicate pages (#14189)
---
 .gitignore                                         |  20 +++++
 website/.gitignore                                 |  28 -------
 website/assets/images/powers/hudi-logo-page.png    | Bin 406763 -> 0 bytes
 website/learn/index.md                             |  16 ++++
 website/learn/use_cases.md                         |  81 ---------------------
 website/src/pages/faq/design_and_concepts.md       |   2 +-
 website/versioned_docs/version-0.14.0/faq.md       |   2 +-
 .../version-0.14.1/faq_design_and_concepts.md      |   2 +-
 .../version-0.15.0/faq_design_and_concepts.md      |   2 +-
 .../version-1.0.0/faq_design_and_concepts.md       |   2 +-
 .../version-1.0.1/faq_design_and_concepts.md       |   2 +-
 .../version-1.0.2/faq_design_and_concepts.md       |   2 +-
 12 files changed, 43 insertions(+), 116 deletions(-)

diff --git a/.gitignore b/.gitignore
index d2eb67c69ff8..88ce7aedfe17 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,23 @@ _site
 *.iml
 .DS_Store
 node_modules/
+
+# Logs
+npm-debug.log*
+yarn-debug.log*
+yarn-error.log*
+
+# Website (Docusaurus)
+website/build/
+website/.docusaurus
+website/.cache-loader
+website/.node_modules/
+website/package-lock.json
+website/yarn.lock
+website/.env.local
+website/.env.development.local
+website/.env.test.local
+website/.env.production.local
+website/.idea/
+website/.vscode
+website/.changelog
diff --git a/website/.gitignore b/website/.gitignore
deleted file mode 100644
index dea10e483400..000000000000
--- a/website/.gitignore
+++ /dev/null
@@ -1,28 +0,0 @@
-# Dependencies
-/node_modules
-package-lock.json
-yarn.lock
-.node_modules/
-# Production
-/build
-
-# Generated files
-.docusaurus
-.cache-loader
-
-# Misc
-.DS_Store
-.env.local
-.env.development.local
-.env.test.local
-.env.production.local
-
-npm-debug.log*
-yarn-debug.log*
-yarn-error.log*
-
-# IDE
-.vscode
-.idea
-*.code-workspace
-.changelog
\ No newline at end of file
diff --git a/website/assets/images/powers/hudi-logo-page.png b/website/assets/images/powers/hudi-logo-page.png
deleted file mode 100644
index 7389b4315973..000000000000
Binary files a/website/assets/images/powers/hudi-logo-page.png and /dev/null differ
diff --git a/website/learn/index.md b/website/learn/index.md
new file mode 100644
index 000000000000..d307645e06eb
--- /dev/null
+++ b/website/learn/index.md
@@ -0,0 +1,16 @@
+---
+id: index
+title: Learning Hub
+sidebar_label: Overview
+description: Entry point for Hudi learning resources, including the Tutorial Series, Talks, Blogs, Videos, and FAQ.
+---
+
+Welcome to Apache Hudi's learning hub.
+
+- Start with the quick-start guide in the docs: [/docs/quick-start-guide](/docs/quick-start-guide)
+- Explore the tutorial series: [/learn/tutorial-series](/learn/tutorial-series)
+- Browse the blogs: [/blog](/blog)
+- Watch video guides: [/videos](/videos)
+- Check the FAQ: [/faq](/faq)
+
+If you're looking to contribute or join community events, see [/community/get-involved](/community/get-involved).
diff --git a/website/learn/use_cases.md b/website/learn/use_cases.md
deleted file mode 100644
index 124cabe160f8..000000000000
--- a/website/learn/use_cases.md
+++ /dev/null
@@ -1,81 +0,0 @@
----
-title: "Use Cases"
-keywords: [ hudi, data ingestion, etl, real time, use cases]
-summary: "Following are some sample use-cases for Hudi, which illustrate the benefits in terms of faster processing & increased efficiency"
-toc: true
-last_modified_at: 2019-12-30T15:59:57-04:00
----
-
-## Near Real-Time Ingestion
-
-Hudi offers some great benefits across ingestion of all kinds. Hudi helps __enforces a minimum file size on DFS__. This helps
-solve the ["small files problem"](https://blog.cloudera.com/blog/2009/02/the-small-files-problem/) for HDFS and Cloud Stores alike,
-significantly improving query performance. Hudi adds the much needed ability to atomically commit new data, shielding queries from
-ever seeing partial writes and helping ingestion recover gracefully from failures.
-
-Ingesting data from OLTP sources like (event logs, databases, external sources) into a [Data Lake](http://martinfowler.com/bliki/DataLake) is a common problem,
-that is unfortunately solved in a piecemeal fashion, using a medley of ingestion tools. This "raw data" layer of the data lake often forms the bedrock on which
-more value is created.
-
-For RDBMS ingestion, Hudi provides __faster loads via Upserts__, as opposed costly & inefficient bulk loads. It's very common to use a change capture solution like
-[Debezium](http://debezium.io/) or [Kafka Connect](https://docs.confluent.io/platform/current/connect/index) or 
-[Sqoop Incremental Import](https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide#_incremental_imports) and apply them to an
-equivalent Hudi table on DFS. For NoSQL datastores like [Cassandra](http://cassandra.apache.org/) / [Voldemort](http://www.project-voldemort.com/voldemort/) / [HBase](https://hbase.apache.org/), 
-even moderately big installations store billions of rows. It goes without saying that __full bulk loads are simply infeasible__ and more efficient approaches 
-are needed if ingestion is to keep up with the typically high update volumes.
-
-Even for immutable data sources like [Kafka](https://kafka.apache.org), there is often a need to de-duplicate the incoming events against what's stored on DFS.
-Hudi achieves this by [employing indexes](http://hudi.apache.org/blog/hudi-indexing-mechanisms/) of different kinds, quickly and efficiently.
-
-All of this is seamlessly achieved by the Hudi DeltaStreamer tool, which is maintained in tight integration with rest of the code 
-and we are always trying to add more capture sources, to make this easier for the users. The tool also has a continuous mode, where it
-can self-manage clustering/compaction asynchronously, without blocking ingestion, significantly improving data freshness.
-
-## Data Deletion
-
-Hudi also offers ability to delete the data stored in the data lake, and more so provides efficient ways of dealing with 
-large write amplification, resulting from random deletes based on user_id (or any secondary key), by way of the `Merge On Read` table types.
-Hudi's elegant log based concurrency control, ensures that the ingestion/writing can continue happening,as a background compaction job
-amortizes the cost of rewriting data/enforcing deletes.
-
-Hudi also unlocks special capabilities like data clustering, which allow users to optimize the data layout for deletions. Specifically,
-users can cluster older event log data based on user_id, such that, queries that evaluate candidates for data deletion can do so, while
-more recent partitions are optimized for query performance and clustered on say timestamp.
-
-## Unified Storage For Analytics
-
-The world we live in is polarized - even on data analytics storage - into real-time and offline/batch storage. Typically, real-time [datamarts](https://en.wikipedia.org/wiki/Data_mart) 
-are powered by specialized analytical stores such as [Druid](http://druid.io/) or [Memsql](http://www.memsql.com/) or [Clickhouse](https://clickhouse.tech/), fed by event buses like
-[Kafka](https://kafka.apache.org) or [Pulsar](https://pulsar.apache.org). This model is prohibitively expensive, unless a small fraction of your data lake data 
-needs sub-second query responses such as system monitoring or interactive real-time analysis.
-
-The same data gets ingested into data lake storage much later (say every few hours or so) and then runs through batch ETL pipelines, with intolerable data freshness
-to do any kind of near-realtime analytics. On the other hand, the data lakes provide access to interactive SQL engines like Presto/SparkSQL, which can horizontally scale 
-easily and provide return even more complex queries, within few seconds. 
-
-By bringing streaming primitives to data lake storage, Hudi opens up new possibilities by being able to ingest data within few minutes and also author incremental data
-pipelines that are orders of magnitude faster than traditional batch processing. By bringing __data freshness to a few minutes__, Hudi can provide a much efficient alternative, 
-for a large class of data applications, compared to real-time datamarts. Also, Hudi has no upfront server infrastructure investments
-and thus enables faster analytics on much fresher analytics, without increasing the operational overhead. This external [article](https://www.analyticsinsight.net/can-big-data-solutions-be-affordable/) 
-further validates this newer model.
-
-## Incremental Processing Pipelines
-
-Data Lake ETL typically involves building a chain of tables derived from each other via DAGs expressed as workflows. Workflows often depend on new data being output by 
-multiple upstream workflows and traditionally, availability of new data is indicated by a new DFS Folder/Hive Partition.
-Let's take a concrete example to illustrate this. An upstream workflow `U` can create a Hive partition for every hour, with data for that hour (event_time) at the end of each hour (processing_time), providing effective freshness of 1 hour.
-Then, a downstream workflow `D`, kicks off immediately after `U` finishes, and does its own processing for the next hour, increasing the effective latency to 2 hours.
-
-The above paradigm simply ignores late arriving data i.e when `processing_time` and `event_time` drift apart.
-Unfortunately, in today's post-mobile & pre-IoT world, __late data from intermittently connected mobile devices & sensors are the norm, not an anomaly__.
-In such cases, the only remedy to guarantee correctness is to reprocess the last few hours worth of data, over and over again each hour, 
-which can significantly hurt the efficiency across the entire ecosystem. For e.g; imagine reprocessing TBs worth of data every hour across hundreds of workflows.
-
-Hudi comes to the rescue again, by providing a way to consume new data (including late data) from an upstream Hudi table `HU` at a record granularity (not folders/partitions),
-apply the processing logic, and efficiently update/reconcile late data with a downstream Hudi table `HD`. Here, `HU` and `HD` can be continuously scheduled at a much more frequent schedule
-like 15 mins, and providing an end-end latency of 30 mins at `HD`.
-
-To achieve this, Hudi has embraced similar concepts from stream processing frameworks like [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide#join-operations) , Pub/Sub systems like [Kafka](http://kafka.apache.org/documentation/#theconsumer)
-[Flink](https://flink.apache.org) or database replication technologies like [Oracle XStream](https://docs.oracle.com/cd/E11882_01/server.112/e16545/xstrm_cncpt.htm#XSTRM187).
-For the more curious, a more detailed explanation of the benefits of Incremental Processing can be found [here](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)
-
diff --git a/website/src/pages/faq/design_and_concepts.md b/website/src/pages/faq/design_and_concepts.md
index faa062c9c95c..f21c553ec3af 100644
--- a/website/src/pages/faq/design_and_concepts.md
+++ b/website/src/pages/faq/design_and_concepts.md
@@ -13,7 +13,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
 
diff --git a/website/versioned_docs/version-0.14.0/faq.md b/website/versioned_docs/version-0.14.0/faq.md
index 74bf66d3ae7a..59984161e6a5 100644
--- a/website/versioned_docs/version-0.14.0/faq.md
+++ b/website/versioned_docs/version-0.14.0/faq.md
@@ -110,7 +110,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](https://hudi.apache.org/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](https://hudi.apache.org/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are b [...]
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](https://hudi.apache.org/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond sim [...]
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](https://hudi.apache.org/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously whic [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
 
diff --git a/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md b/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-0.14.1/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
 
diff --git a/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md b/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-0.15.0/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
 
diff --git a/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md b/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-1.0.0/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
 
diff --git a/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md b/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-1.0.1/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
 
diff --git a/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md b/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md
index c0fd9d105b38..f532dc27c3ca 100644
--- a/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md
+++ b/website/versioned_docs/version-1.0.2/faq_design_and_concepts.md
@@ -12,7 +12,7 @@ Hudi writers atomically move an inflight write operation to a "completed" state
 
 Hudi is very different from Hive in important aspects described below. However, based on practical considerations, it chooses to be compatible with Hive table layout by adopting partitioning, schema evolution and being queryable through Hive query engine. Here are the key aspect where Hudi differs:
 
-*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/learn/use_cases/#incremental-processing-pipelines).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
+*   Unlike Hive, Hudi does not remove the partition columns from the data files. Hudi in fact adds record level [meta fields](/tech-specs#meta-fields) including instant time, primary record key, and partition path to the data to support efficient upserts and [incremental queries/ETL](/docs/use_cases#efficient-data-lakes-with-incremental-processing).  Hudi tables can be non-partitioned and the Hudi metadata table adds rich indexes on Hudi tables which are beyond simple Hive extensions.
 *   Hive advocates partitioning as the main remedy for most performance-based issues. Features like partition evolution and hidden partitioning are primarily based on this Hive based principle of partitioning and aim to tackle the metadata problem partially.  Whereas, Hudi biases to coarse-grained partitioning and emphasizes [clustering](/docs/clustering) for more fine-grained partitioning. Further, users can strategize and evolve the clustering asynchronously which “actually” help users [...]
 *   Hudi considers partition evolution as an anti-pattern and avoids such schemes due to the inconsistent performance of queries that goes to depend on which part of the table is being queried. Hudi’s design favors consistent performance and is aware of the need to redesign to partitioning/tables to achieve the same.
