This is an automated email from the ASF dual-hosted git repository.
danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 20b733b5151 [DOCS] Update syncing DataHub docs (#12504)
20b733b5151 is described below
commit 20b733b5151d6a31abac33d71ff5bb5260ad9e1b
Author: Sergio Gómez Villamor <[email protected]>
AuthorDate: Wed Dec 18 05:19:47 2024 +0100
[DOCS] Update syncing DataHub docs (#12504)
---
website/docs/syncing_datahub.md | 27 +++++++++++++++++++++++++--
1 file changed, 25 insertions(+), 2 deletions(-)
diff --git a/website/docs/syncing_datahub.md b/website/docs/syncing_datahub.md
index 89cf9bf8799..55c0c0be601 100644
--- a/website/docs/syncing_datahub.md
+++ b/website/docs/syncing_datahub.md
@@ -9,8 +9,26 @@ obeservability, federated governance, etc.
Since Hudi 0.11.0, you can now sync to a DataHub instance by setting
`DataHubSyncTool` as one of the sync tool classes
for `HoodieStreamer`.
-The target Hudi table will be sync'ed to DataHub as a `Dataset`. The Hudi table's avro schema will be sync'ed, along
-with the commit timestamp when running the sync.
+The target Hudi table will be sync'ed to DataHub as a `Dataset`, which will be created with the following properties:
+
+* Hudi table properties and partitioning information
+* Spark-related properties
+* User-defined properties
+* The last commit and the last commit completion timestamps
+
+Additionally, the `Dataset` object will include the following metadata:
+
+* sub-type as `Table`
+* browse path
+* parent container
+* Avro schema
+* optionally, an attached `Domain` object
+
+Also, the parent database will be sync'ed to DataHub as a `Container`, including the following metadata:
+
+* sub-type as `Database`
+* browse paths
+* optionally, an attached `Domain` object
### Configurations
@@ -27,6 +45,11 @@ By default, the sync config's database name and table name
will be used to make
Subclass `HoodieDataHubDatasetIdentifier` and set it to
`hoodie.meta.sync.datahub.dataset.identifier.class` to customize
the URN creation.
+Optionally, sync'ed `Dataset` and `Container` objects can be attached to a `Domain` object. To do this, set
+`hoodie.meta.sync.datahub.domain.name` to a valid `Domain` URN. Also, a sync'ed `Dataset` can carry
+user-defined properties. To do this, set `hoodie.meta.sync.datahub.table.properties` to a comma-separated key-value
+string (_e.g._ `key1=val1,key2=val2`).
+
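As a rough illustration of the two optional settings described in this hunk, a properties fragment might look like the sketch below. This is not part of the commit. The property keys are taken from the text above; the domain identifier `engineering` is a hypothetical placeholder, and the table properties string simply reuses the sample from the text.

```properties
# Sketch only: the values below are assumptions, not taken from the commit.
# Attach sync'ed Dataset and Container objects to a DataHub Domain.
# "engineering" is a hypothetical domain id; substitute a Domain URN that exists in your DataHub instance.
hoodie.meta.sync.datahub.domain.name=urn:li:domain:engineering
# User-defined properties for the sync'ed Dataset, as a comma-separated key-value string.
hoodie.meta.sync.datahub.table.properties=key1=val1,key2=val2
```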
### Example
The following shows an example configuration to run `HoodieStreamer` with
`DataHubSyncTool`.