tengqm commented on code in PR #6946:
URL: https://github.com/apache/gravitino/pull/6946#discussion_r2049798188
##########
docs/lineage/gravitino-spark-lineage.md:
##########
@@ -0,0 +1,79 @@
+---
+title: "Gravitino Spark Lineage support"
+slug: /lineage/gravitino-spark-lineage
+keyword: Gravitino Spark OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+By leveraging the OpenLineage Spark plugin, Gravitino provides a separate Spark plugin to extract data lineage and transform dataset identifiers into Gravitino identifiers.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across different catalogs, such as fileset, Iceberg, Hudi, Paimon, Hive, Model, etc.
+- Supports extracting Gravitino datasets from GVFS paths.
+- Supports both the Gravitino Spark connector and non-Gravitino Spark connectors.
+
+## Gravitino dataset
+
+The Gravitino OpenLineage Spark plugin transforms the Gravitino metalake name into the dataset namespace. The dataset name varies by dataset type when generating lineage information.
+
+When using the [Gravitino Spark connector](/spark-connector/spark-connector.md) to access tables managed by Gravitino, the dataset name follows this format:
+
+| Dataset Type    | Dataset name                                   | Example                    | Since Version |
+|-----------------|------------------------------------------------|----------------------------|---------------|
+| Hive catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `hive_catalog.db.student`  | 0.9.0         |
+| Iceberg catalog | `$GravitinoCatalogName.$schemaName.$tableName` | `iceberg_catalog.db.score` | 0.9.0         |
+| Paimon catalog  | `$GravitinoCatalogName.$schemaName.$tableName` | `paimon_catalog.db.detail` | 0.9.0         |
+| JDBC catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `jdbc_catalog.db.score`    | 0.9.0         |
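+
+As a hedged illustration (the column names `id` and `score` are hypothetical, and the tables are the examples from the table above, not real objects), a cross-catalog query would produce lineage linking the Hive dataset to the Iceberg dataset:
+
+```shell
+# Hypothetical cross-catalog write: lineage records hive_catalog.db.student
+# as the input dataset and iceberg_catalog.db.score as the output dataset,
+# both under the metalake namespace.
+./bin/spark-sql -e "INSERT INTO iceberg_catalog.db.score SELECT id, score FROM hive_catalog.db.student"
+```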
+
+For datasets not managed by Gravitino, the dataset name is as follows:
+
+| Dataset Type | Dataset name                       | Example                    | Since Version |
+|--------------|------------------------------------|----------------------------|---------------|
+| Hive         | `spark_catalog.$dbName.$tableName` | `spark_catalog.db.table`   | 0.9.0         |
+| Iceberg      | `$catalogName.$dbName.$tableName`  | `iceberg_catalog.db.table` | 0.9.0         |
+| JDBC v2      | `$catalogName.$dbName.$tableName`  | `jdbc_catalog.db.table`    | 0.9.0         |
+| JDBC v1      | `spark_catalog.$dbName.$tableName` | `spark_catalog.db.table`   | 0.9.0         |
+
+When accessing datasets by location (e.g., `SELECT * FROM parquet.$dataset_path`), the name is derived from the physical path:
+
+| Location Type  | Dataset name                                     | Example                               | Since Version |
+|----------------|--------------------------------------------------|---------------------------------------|---------------|
+| GVFS location  | `$GravitinoCatalogName.$schemaName.$filesetName` | `fileset_catalog.schema.fileset_a`    | 0.9.0         |
+| Other location | location path                                    | `hdfs://127.0.0.1:9000/tmp/a/student` | 0.9.0         |
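+
+For example (a sketch; the paths are hypothetical, and the GVFS path is assumed to follow the `gvfs://fileset/$catalog/$schema/$fileset` convention), path-based reads would resolve as follows:
+
+```shell
+# Read via a GVFS path; the dataset is reported as fileset_catalog.schema.fileset_a
+./bin/spark-sql -e "SELECT * FROM parquet.\`gvfs://fileset/fileset_catalog/schema/fileset_a\`"
+# Read via a plain HDFS path; the dataset name is the location path itself
+./bin/spark-sql -e "SELECT * FROM parquet.\`hdfs://127.0.0.1:9000/tmp/a/student\`"
+```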
+
+For fileset datasets, the plugin adds a location facet that contains the location path.
+
+## How to use
+
+1. Download the Gravitino OpenLineage plugin jar and place it on the Spark classpath.
+2. Add configuration to Spark to enable lineage collection.
+
+Configuration example for the Spark SQL shell:
+
+```shell
+./bin/spark-sql -v \
+--jars /$path/openlineage-spark_2.12-$version.jar,/$path/gravitino-spark-connector-runtime-3.5_2.12-$version.jar \
+--conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
+--conf spark.sql.gravitino.uri=http://localhost:8090 \
+--conf spark.sql.gravitino.metalake=$metalakeName \
+--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
+--conf spark.openlineage.transport.type=http \
+--conf spark.openlineage.transport.url=http://localhost:8090 \
+--conf spark.openlineage.transport.endpoint=/api/lineage \
+--conf spark.openlineage.namespace=$metalakeName \
+--conf spark.openlineage.appName=$appName \
+--conf spark.openlineage.columnLineage.datasetLineageEnabled=true
+```
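+
+The same options can instead be set in `conf/spark-defaults.conf`. A sketch reusing the values above (`my_metalake` and `my_app` are placeholders):
+
+```properties
+spark.plugins                        org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin
+spark.sql.gravitino.uri              http://localhost:8090
+spark.sql.gravitino.metalake         my_metalake
+spark.extraListeners                 io.openlineage.spark.agent.OpenLineageSparkListener
+spark.openlineage.transport.type     http
+spark.openlineage.transport.url      http://localhost:8090
+spark.openlineage.transport.endpoint /api/lineage
+spark.openlineage.namespace          my_metalake
+spark.openlineage.appName            my_app
+spark.openlineage.columnLineage.datasetLineageEnabled true
+```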
+
+Please refer to the [OpenLineage Spark guides](https://openlineage.io/docs/guides/spark/) and the [Gravitino Spark connector](/spark-connector/spark-connector.md) documentation for more details. Additionally, Gravitino provides the following configurations for lineage.
+
+| Configuration item | Description | Default value | Required | Since Version |
Review Comment:
The reason is that Markdown tables do not allow line wrapping within a table row.
For wide tables this is very cumbersome, although for simple tables the Markdown syntax works well.
Yes, HTML tables are allowed in Markdown and in MDX/Docusaurus. I have verified this with each and every commit in
https://github.com/apache/gravitino-site/pull/50
and https://github.com/apache/gravitino/pull/6849.
I'm not advocating using HTML tables everywhere; for narrower ones, such as those comparing data types, a Markdown table is still preferred.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]