This is an automated email from the ASF dual-hosted git repository.

codope pushed a commit to branch revert-5998-asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit a0edab522abdb813b3b6e6f5f24e99d9c808d43a
Author: Sagar Sumit <sagarsumi...@gmail.com>
AuthorDate: Fri Jul 1 16:42:48 2022 +0530

    Revert "[DOCS] Remove duplicate faq page (#5998)"
    
    This reverts commit 68522fa006cb9eef6eb3a194802688c2473d038d.
---
 website/docs/faq.md            | 31 ++++---------------------------
 website/docusaurus.config.js   |  8 ++++++--
 website/{docs => learn}/faq.md | 36 ++++++++++++------------------------
 3 files changed, 22 insertions(+), 53 deletions(-)

diff --git a/website/docs/faq.md b/website/docs/faq.md
index 7d982dc7da..52137d1be5 100644
--- a/website/docs/faq.md
+++ b/website/docs/faq.md
@@ -68,7 +68,6 @@ When writing data into Hudi, you model the records like how 
you would on a key-v
 When querying/reading data, Hudi just presents itself as a json-like 
hierarchical table that everyone is used to querying using Hive/Spark/Presto 
over Parquet/Json/Avro. 
 
 ### Why does Hudi require a key field to be configured?
-
 Hudi was designed to support fast record level Upserts and thus requires a key 
to identify whether an incoming record is 
 an insert, update or delete, and process it accordingly. Additionally, Hudi 
automatically maintains indexes on this primary 
 key and for many use-cases like CDC, ensuring such primary key constraints is 
crucial for data quality. In this context, 
@@ -95,7 +94,7 @@ At a high level, Hudi is based on MVCC design that writes 
data to versioned parq
 
 ### What are some ways to write a Hudi dataset?
 
-Typically, you obtain a set of partial updates/inserts from your source and 
issue [write operations](https://hudi.apache.org/docs/write_operations/) 
against a Hudi dataset. If you are ingesting data from any of the standard sources 
like Kafka, or tailing DFS, the [delta 
streamer](https://hudi.apache.org/docs/hoodie_deltastreamer#deltastreamer) tool 
is invaluable and provides an easy, self-managed solution to getting data 
written into Hudi. You can also write your own code to capture data fr [...]
+Typically, you obtain a set of partial updates/inserts from your source and 
issue [write operations](https://hudi.apache.org/docs/write_operations/) 
against a Hudi dataset. If you are ingesting data from any of the standard sources 
like Kafka, or tailing DFS, the [delta 
streamer](https://hudi.apache.org/docs/hoodie_deltastreamer#deltastreamer) tool 
is invaluable and provides an easy, self-managed solution to getting data 
written into Hudi. You can also write your own code to capture data fr [...]
 
 ### How is a Hudi job deployed?
 
@@ -265,31 +264,9 @@ Simplest way to run compaction on MOR dataset is to run 
the [compaction inline](
 
 That said, for obvious reasons of not blocking ingestion for compaction, you 
may want to run it asynchronously as well. This can be done via a separate 
[compaction 
job](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java)
 that is scheduled independently by your workflow scheduler/notebook. If you 
are using delta streamer, then you can run in [continuous 
mode](https://github.com/apache/hudi/blob/d3edac4612bde2fa9dec [...]
 
-### What options do I have for asynchronous/offline compactions on MOR dataset?
-
-There are a couple of options depending on how you write to Hudi. But first, 
let us briefly understand what is involved. There are two parts to compaction:
-- Scheduling: In this step, Hudi scans the partitions and selects file slices 
to be compacted. A compaction plan is finally written to the Hudi timeline. 
Scheduling needs tighter coordination with other writers (regular ingestion is 
considered one of the writers). If scheduling is done inline with the ingestion 
job, this coordination is automatically taken care of. Otherwise, when 
scheduling happens asynchronously, a lock provider needs to be configured for 
this coordination among multiple writers.
-- Execution: In this step, the compaction plan is read and file slices are 
compacted. Execution doesn't need the same level of coordination with other 
writers as the scheduling step and can be decoupled from the ingestion job 
easily.
-
-Depending on how you write to Hudi, these are the possible options currently.
-- DeltaStreamer:
-   - In continuous mode, asynchronous compaction is achieved by default. 
Here, scheduling is done inline by the ingestion job and compaction execution 
is performed asynchronously by a separate parallel thread.
-   - In non-continuous mode, only inline compaction is possible.
-   - Please note that in either mode, passing --disable-compaction disables 
compaction completely.
-- Spark datasource:
-   - Async scheduling and async execution can be achieved by periodically 
running the offline Hudi Compactor Utility or Hudi CLI. However, this needs a 
lock provider to be configured.
-   - Alternatively, from 0.11.0, to avoid a dependency on lock providers, 
scheduling alone can be done inline by the regular writer using the config 
`hoodie.compact.schedule.inline`. Compaction execution can then be done 
offline by periodically triggering the Hudi Compactor Utility or Hudi CLI.
-- Spark structured streaming:
-   - Compactions are scheduled and executed asynchronously inside the 
streaming job. Async compactions are enabled by default for structured 
streaming jobs on Merge-On-Read tables.
-   - Please note it is not possible to disable async compaction for a MOR 
dataset with Spark structured streaming.
-- Flink:
-   - Async compaction is enabled by default for Merge-On-Read tables.
-   - Offline compaction can be achieved by setting ```compaction.async.enabled``` 
to ```false``` and periodically running the [Flink offline 
Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction). 
When running the offline compactor, one needs to ensure there are no active 
writes to the table.
-   - A third option (highly recommended over the second one) is to schedule 
compactions from the regular ingestion job and execute the compaction plans 
from an offline job. To achieve this, set ```compaction.async.enabled``` to ```false```, 
set ```compaction.schedule.enabled``` to ```true```, and then run the [Flink offline 
Compactor](https://hudi.apache.org/docs/next/compaction/#flink-offline-compaction) 
periodically to execute the plans.
-
 ### What performance/ingest latency can I expect for Hudi writing?
 
-The speed at which you can write into Hudi depends on the [write 
operation](https://hudi.apache.org/docs/write_operations/) and some trade-offs 
you make along the way like file sizing. Just like how databases incur overhead 
over direct/raw file I/O on disks, Hudi operations may have overhead from 
supporting database-like features compared to reading/writing raw DFS files. 
That said, Hudi implements advanced techniques from database literature to keep 
these minimal. User is encouraged t [...]
+The speed at which you can write into Hudi depends on the [write 
operation](https://hudi.apache.org/docs/write_operations) and some trade-offs 
you make along the way like file sizing. Just like how databases incur overhead 
over direct/raw file I/O on disks, Hudi operations may have overhead from 
supporting database-like features compared to reading/writing raw DFS files. 
That said, Hudi implements advanced techniques from database literature to keep 
these minimal. User is encouraged to [...]
 
 | Storage Type | Type of workload | Performance | Tips |
 |-------|--------|--------|--------|
@@ -398,7 +375,7 @@ spark.read.parquet("your_data_set/path/to/month").limit(n) 
// Limit n records
      .save(basePath);
 ```
 
-For a merge-on-read table, you may also want to try scheduling and running 
compaction jobs. You can run compaction directly using spark-submit on 
org.apache.hudi.utilities.HoodieCompactor or by using the [HUDI 
CLI](https://hudi.apache.org/docs/deployment/#cli).
+For a merge-on-read table, you may also want to try scheduling and running 
compaction jobs. You can run compaction directly using spark-submit on 
org.apache.hudi.utilities.HoodieCompactor or by using the [HUDI 
CLI](https://hudi.apache.org/docs/cli).
 
 ### If I keep my file versions at 1, with this configuration will I be able to 
roll back (to the last commit) when a write fails?
 
@@ -540,4 +517,4 @@ You can improve the FAQ by the following processes
 - Raise a PR to spot inaccuracies, typos on this page and leave suggestions.
 - Raise a PR to propose new questions with answers.
 - Lean towards making it very understandable and simple, and heavily link to 
parts of documentation as needed
-- One committer on the project will review new questions and incorporate them 
upon review.
+- One committer on the project will review new questions and incorporate them 
upon review.
\ No newline at end of file
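
For the Spark datasource route described in the reverted FAQ text above, a minimal
Scala sketch of a writer that only schedules compaction inline via
`hoodie.compact.schedule.inline`, leaving execution to an offline run of the Hudi
Compactor utility or the Hudi CLI. The table name, record key and pre-combine
columns are placeholders, not values taken from the docs:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: a regular Spark datasource writer on a MERGE_ON_READ table that only
// *schedules* compaction inline; compaction *execution* is expected to happen in
// a separate offline job (HoodieCompactor utility or Hudi CLI), as described in
// the FAQ entry on asynchronous/offline compaction.
def writeWithInlineScheduling(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")                   // hypothetical table name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "uuid") // assumed record key column
    .option("hoodie.datasource.write.precombine.field", "ts")  // assumed pre-combine column
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.compact.inline", "false")                  // do not block ingestion on compaction
    .option("hoodie.compact.schedule.inline", "true")          // only write a compaction plan to the timeline
    .mode(SaveMode.Append)
    .save(basePath)
}
```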
diff --git a/website/docusaurus.config.js b/website/docusaurus.config.js
index 35a845ece9..5556a21a24 100644
--- a/website/docusaurus.config.js
+++ b/website/docusaurus.config.js
@@ -104,6 +104,10 @@ module.exports = {
             from: ['/releases'],
             to: '/releases/release-0.11.1',
           },
+          {
+            from: ['/docs/learn'],
+            to: '/learn/faq',
+          },
         ],
       },
     ],
@@ -134,7 +138,7 @@ module.exports = {
             },
             {
               label: 'FAQ',
-              href: '/docs/faq',
+              href: '/learn/faq',
             },
             {
               label: 'Technical Wiki',
@@ -281,7 +285,7 @@ module.exports = {
             },
             {
               label: 'FAQ',
-              href: '/docs/faq',
+              href: '/learn/faq',
             },
             {
               label: 'Technical Wiki',
diff --git a/website/docs/faq.md b/website/learn/faq.md
similarity index 94%
copy from website/docs/faq.md
copy to website/learn/faq.md
index 7d982dc7da..3ab46888d5 100644
--- a/website/docs/faq.md
+++ b/website/learn/faq.md
@@ -67,18 +67,6 @@ When writing data into Hudi, you model the records like how 
you would on a key-v
 
 When querying/reading data, Hudi just presents itself as a json-like 
hierarchical table that everyone is used to querying using Hive/Spark/Presto 
over Parquet/Json/Avro. 
 
-### Why does Hudi require a key field to be configured?
-
-Hudi was designed to support fast record level Upserts and thus requires a key 
to identify whether an incoming record is 
-an insert, update or delete, and process it accordingly. Additionally, Hudi 
automatically maintains indexes on this primary 
-key and for many use-cases like CDC, ensuring such primary key constraints is 
crucial for data quality. In this context, 
-the pre-combine key helps reconcile multiple records with the same key in a 
single batch of input records. Even for append-only data 
-streams, Hudi supports key-based de-duplication before inserting records. For 
example, you may have at-least-once data integration 
-systems like Kafka MirrorMaker that can introduce duplicates during failures. 
Even for plain old batch pipelines, keys 
-help eliminate duplication that could be caused by backfill pipelines, where 
commonly it's unclear what set of records 
-need to be re-written. We are actively working on making keys easier by only 
requiring them for Upsert and/or automatically 
-generating the key internally (much like RDBMS row_ids).
-
 ### Does Hudi support cloud storage/object stores?
 
 Yes. Generally speaking, Hudi is able to provide its functionality on any 
Hadoop FileSystem implementation and thus can read and write datasets on [Cloud 
stores](https://hudi.apache.org/docs/cloud) (Amazon S3 or Microsoft Azure or 
Google Cloud Storage). Over time, Hudi has also incorporated specific design 
aspects that make building Hudi datasets on the cloud easy, such as 
[consistency checks for 
s3](https://hudi.apache.org/docs/configurations#hoodieconsistencycheckenabled), 
Zero moves/r [...]
@@ -95,7 +83,7 @@ At a high level, Hudi is based on MVCC design that writes 
data to versioned parq
 
 ### What are some ways to write a Hudi dataset?
 
-Typically, you obtain a set of partial updates/inserts from your source and 
issue [write operations](https://hudi.apache.org/docs/write_operations/) 
against a Hudi dataset. If you are ingesting data from any of the standard sources 
like Kafka, or tailing DFS, the [delta 
streamer](https://hudi.apache.org/docs/hoodie_deltastreamer#deltastreamer) tool 
is invaluable and provides an easy, self-managed solution to getting data 
written into Hudi. You can also write your own code to capture data fr [...]
+Typically, you obtain a set of partial updates/inserts from your source and 
issue [write operations](https://hudi.apache.org/docs/writing_data/) against a 
Hudi dataset. If you are ingesting data from any of the standard sources like 
Kafka, or tailing DFS, the [delta 
streamer](https://hudi.apache.org/docs/writing_data/#deltastreamer) tool is 
invaluable and provides an easy, self-managed solution to getting data written 
into Hudi. You can also write your own code to capture data from a custom [...]
 
 ### How is a Hudi job deployed?
 
@@ -237,7 +225,7 @@ set 
hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat
 
 ### Can I register my Hudi dataset with Apache Hive metastore?
 
-Yes. This can be performed either via the standalone [Hive Sync 
tool](https://hudi.apache.org/docs/syncing_metastore#hive-sync-tool) or using 
options in 
[deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50)
 tool or 
[datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
+Yes. This can be performed either via the standalone [Hive Sync 
tool](https://hudi.apache.org/docs/writing_data/#syncing-to-hive) or using 
options in 
[deltastreamer](https://github.com/apache/hudi/blob/d3edac4612bde2fa9deca9536801dbc48961fb95/docker/demo/sparksql-incremental.commands#L50)
 tool or 
[datasource](https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncenable).
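
As a companion to the datasource option linked above, a minimal Scala sketch of
enabling Hive sync from a regular Spark datasource write. The database, table and
partition field names are placeholders, and `hoodie.datasource.hive_sync.mode` set
to `hms` is an assumption (JDBC-based sync is the documented alternative):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: the same write that produces the Hudi table also registers/updates it
// in the Hive metastore via the hive_sync options. All identifiers below are
// placeholders.
def writeAndSyncToHive(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.database", "default")
    .option("hoodie.datasource.hive_sync.table", "my_table")
    .option("hoodie.datasource.hive_sync.partition_fields", "dt")
    .option("hoodie.datasource.hive_sync.mode", "hms")  // assumption: sync through the metastore client
    .mode(SaveMode.Append)
    .save(basePath)
}
```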
 
 ### How does the Hudi indexing work & what are its benefits? 
 
@@ -289,7 +277,7 @@ Depending on how you write to Hudi these are the possible 
options currently.
 
 ### What performance/ingest latency can I expect for Hudi writing?
 
-The speed at which you can write into Hudi depends on the [write 
operation](https://hudi.apache.org/docs/write_operations/) and some trade-offs 
you make along the way like file sizing. Just like how databases incur overhead 
over direct/raw file I/O on disks, Hudi operations may have overhead from 
supporting database-like features compared to reading/writing raw DFS files. 
That said, Hudi implements advanced techniques from database literature to keep 
these minimal. User is encouraged t [...]
+The speed at which you can write into Hudi depends on the [write 
operation](https://hudi.apache.org/docs/writing_data/) and some trade-offs you 
make along the way like file sizing. Just like how databases incur overhead 
over direct/raw file I/O on disks, Hudi operations may have overhead from 
supporting database-like features compared to reading/writing raw DFS files. 
That said, Hudi implements advanced techniques from database literature to keep 
these minimal. User is encouraged to ha [...]
 
 | Storage Type | Type of workload | Performance | Tips |
 |-------|--------|--------|--------|
@@ -514,20 +502,20 @@ With this understanding, if you want your DAG stage to 
run faster, *bring T as c
 
 
https://hudi.apache.org/docs/configurations#hoodiedatasourcehive_syncsupport_timestamp
 
-### How to convert an existing COW table to MOR? 
+### How to convert an existing COW table to MOR?
 
-All you need to do is to edit the table type property in hoodie.properties 
(located at hudi_table_path/.hoodie/hoodie.properties). 
-But manually changing it will result in checksum errors. So, we have to go via 
hudi-cli. 
+All you need to do is to edit the table type property in hoodie.properties 
(located at hudi_table_path/.hoodie/hoodie.properties).
+But manually changing it will result in checksum errors. So, we have to go via 
hudi-cli.
 
-1. Copy existing hoodie.properties to a new location. 
+1. Copy existing hoodie.properties to a new location.
 2. Edit table type to MERGE_ON_READ
-3. launch hudi-cli 
-   1. connect --path hudi_table_path
-   2. repair overwrite-hoodie-props --new-props-file new_hoodie.properties
+3. launch hudi-cli
+    1. connect --path hudi_table_path
+    2. repair overwrite-hoodie-props --new-props-file new_hoodie.properties
 
 ### Can I get notified when new commits happen in my Hudi table?
 
-Yes. Hudi provides the ability to post a callback notification about a write 
commit. You can use an http hook or choose to 
+Yes. Hudi provides the ability to post a callback notification about a write 
commit. You can use an http hook or choose to
 be notified via a Kafka/pulsar topic or plug in your own implementation to get 
notified. Please refer 
[here](https://hudi.apache.org/docs/next/writing_data/#commit-notifications)
 for details
 
@@ -540,4 +528,4 @@ You can improve the FAQ by the following processes
 - Raise a PR to spot inaccuracies, typos on this page and leave suggestions.
 - Raise a PR to propose new questions with answers.
 - Lean towards making it very understandable and simple, and heavily link to 
parts of documentation as needed
-- One committer on the project will review new questions and incorporate them 
upon review.
+- One committer on the project will review new questions and incorporate them 
upon review.
\ No newline at end of file
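
For the commit-notification entry in the hunk above, a minimal Scala sketch that
enables the HTTP callback on a Spark datasource write. The endpoint URL and column
names are placeholders; Kafka/Pulsar or custom notifications would use a different
callback implementation instead:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Sketch: post a callback notification to an HTTP endpoint after each successful
// write commit, per the hoodie.write.commit.callback.* options referenced above.
def writeWithCommitCallback(df: DataFrame, basePath: String): Unit = {
  df.write
    .format("hudi")
    .option("hoodie.table.name", "my_table")                    // hypothetical table name
    .option("hoodie.datasource.write.recordkey.field", "uuid")  // assumed record key column
    .option("hoodie.datasource.write.precombine.field", "ts")   // assumed pre-combine column
    .option("hoodie.write.commit.callback.on", "true")          // enable commit callbacks
    .option("hoodie.write.commit.callback.http.url", "https://example.com/hudi-commits") // placeholder hook
    .mode(SaveMode.Append)
    .save(basePath)
}
```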
