tengqm commented on code in PR #6946:
URL: https://github.com/apache/gravitino/pull/6946#discussion_r2049798188


##########
docs/lineage/gravitino-spark-lineage.md:
##########
@@ -0,0 +1,79 @@
+---
+title: "Gravitino Spark Lineage support"
+slug: /lineage/gravitino-spark-lineage
+keyword: Gravitino Spark OpenLineage
+license: "This software is licensed under the Apache License version 2."
+---
+
+## Overview
+
+By leveraging the OpenLineage Spark plugin, Gravitino provides a separate Spark plugin that extracts data lineage and transforms dataset identifiers into Gravitino identifiers.
+
+## Capabilities
+
+- Supports column lineage.
+- Supports lineage across different catalogs, such as fileset, Iceberg, Hudi, Paimon, Hive, Model, etc.
+- Supports extracting Gravitino datasets from GVFS paths.
+- Supports both the Gravitino Spark connector and non-Gravitino Spark connectors.
+
+## Gravitino dataset
+
+The Gravitino OpenLineage Spark plugin transforms the Gravitino metalake name into the dataset namespace. The dataset name varies by dataset type when generating lineage information.
+
+When using the [Gravitino Spark connector](/spark-connector/spark-connector.md) to access tables managed by Gravitino, the dataset name follows this format:
+
+
+| Dataset Type    | Dataset name                                   | Example                    | Since Version |
+|-----------------|------------------------------------------------|----------------------------|---------------|
+| Hive catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `hive_catalog.db.student`  | 0.9.0         |
+| Iceberg catalog | `$GravitinoCatalogName.$schemaName.$tableName` | `iceberg_catalog.db.score` | 0.9.0         |
+| Paimon catalog  | `$GravitinoCatalogName.$schemaName.$tableName` | `paimon_catalog.db.detail` | 0.9.0         |
+| JDBC catalog    | `$GravitinoCatalogName.$schemaName.$tableName` | `jdbc_catalog.db.score`    | 0.9.0         |
+
+For datasets not managed by Gravitino, the dataset name is as follows:
+
+| Dataset Type | Dataset name                       | Example                    | Since Version |
+|--------------|------------------------------------|----------------------------|---------------|
+| Hive         | `spark_catalog.$dbName.$tableName` | `spark_catalog.db.table`   | 0.9.0         |
+| Iceberg      | `$catalogName.$dbName.$tableName`  | `iceberg_catalog.db.table` | 0.9.0         |
+| JDBC v2      | `$catalogName.$dbName.$tableName`  | `jdbc_catalog.db.table`    | 0.9.0         |
+| JDBC v1      | `spark_catalog.$dbName.$tableName` | `spark_catalog.db.table`   | 0.9.0         |
+
+When accessing datasets by location (e.g., `SELECT * FROM parquet.$dataset_path`), the name is derived from the physical path:
+
+| Location Type  | Dataset name                                     | Example                               | Since Version |
+|----------------|--------------------------------------------------|---------------------------------------|---------------|
+| GVFS location  | `$GravitinoCatalogName.$schemaName.$filesetName` | `fileset_catalog.schema.fileset_a`    | 0.9.0         |
+| Other location | location path                                    | `hdfs://127.0.0.1:9000/tmp/a/student` | 0.9.0         |
+
+For fileset datasets, the plugin adds a location facet that contains the location path.
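
As an illustration of the GVFS mapping above, the sketch below derives a dataset name from a GVFS path. This is not part of the plugin; the helper name and the `gvfs://fileset/{catalog}/{schema}/{fileset}/...` path layout are assumptions based on Gravitino's Virtual File System conventions.

```python
from urllib.parse import urlparse


def gvfs_to_dataset_name(path: str) -> str:
    """Map a GVFS path to a `catalog.schema.fileset` dataset name.

    Hypothetical helper for illustration only; assumes paths of the form
    gvfs://fileset/{catalog}/{schema}/{fileset}/optional/sub/path.
    """
    parsed = urlparse(path)
    if parsed.scheme != "gvfs" or parsed.netloc != "fileset":
        raise ValueError(f"not a GVFS fileset path: {path}")
    # Everything after gvfs://fileset/ is catalog/schema/fileset/sub-path.
    parts = parsed.path.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError(f"path too short to identify a fileset: {path}")
    catalog, schema, fileset = parts[:3]
    return f"{catalog}.{schema}.{fileset}"


print(gvfs_to_dataset_name("gvfs://fileset/fileset_catalog/schema/fileset_a/part-0.parquet"))
```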
+
+## How to use
+
+1. Download the Gravitino OpenLineage plugin jar and place it in the Spark classpath.
+2. Add configurations to Spark to enable lineage collection.
+
+A configuration example for the Spark SQL shell:
+
+```shell
+./bin/spark-sql -v \
+--jars /$path/openlineage-spark_2.12-$version.jar,/$path/gravitino-spark-connector-runtime-3.5_2.12-$version.jar \
+--conf spark.plugins="org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin" \
+--conf spark.sql.gravitino.uri=http://localhost:8090 \
+--conf spark.sql.gravitino.metalake=$metalakeName \
+--conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
+--conf spark.openlineage.transport.type=http \
+--conf spark.openlineage.transport.url=http://localhost:8090 \
+--conf spark.openlineage.transport.endpoint=/api/lineage \
+--conf spark.openlineage.namespace=$metalakeName \
+--conf spark.openlineage.appName=$appName \
+--conf spark.openlineage.columnLineage.datasetLineageEnabled=true
+```
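
The same settings can also go into `conf/spark-defaults.conf` so they apply to every job. This fragment is a direct restatement of the flags above; `$metalakeName` and `$appName` remain placeholders:

```properties
spark.plugins                        org.apache.gravitino.spark.connector.plugin.GravitinoSparkPlugin
spark.sql.gravitino.uri              http://localhost:8090
spark.sql.gravitino.metalake         $metalakeName
spark.extraListeners                 io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.transport.type     http
spark.openlineage.transport.url      http://localhost:8090
spark.openlineage.transport.endpoint /api/lineage
spark.openlineage.namespace          $metalakeName
spark.openlineage.appName            $appName
spark.openlineage.columnLineage.datasetLineageEnabled true
```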
+
+Please refer to the [OpenLineage Spark guides](https://openlineage.io/docs/guides/spark/) and the [Gravitino Spark connector](/spark-connector/spark-connector.md) for more details. Additionally, Gravitino provides the following configurations for lineage.
+
+| Configuration item | Description | Default value | Required | Since Version |

Review Comment:
   The reason is that Markdown tables do not allow line wrapping within a table row.
   In many cases this is very cumbersome for wide tables, although for simple tables the Markdown flavor is pretty good.
   
   Yes, HTML tables are allowed in Markdown and in MDX/Docusaurus. I have verified this with each and every commit in these PRs:
   https://github.com/apache/gravitino-site/pull/50
   and https://github.com/apache/gravitino/pull/6849
   
   I'm not advocating using HTML tables everywhere. For narrower ones, such as those comparing data types, a Markdown table is still preferred.
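   
   For example, the wide configuration table in this doc could be expressed as an HTML table, where a long description wraps across source lines without breaking the row (a sketch only; the item name and cell values here are made-up placeholders):
   
   ```html
   <table>
     <thead>
       <tr>
         <th>Configuration item</th>
         <th>Description</th>
         <th>Default value</th>
         <th>Required</th>
         <th>Since Version</th>
       </tr>
     </thead>
     <tbody>
       <tr>
         <td><code>gravitino.example.item</code></td>
         <td>
           A long description can wrap across several
           source lines here without breaking the table layout.
         </td>
         <td>(none)</td>
         <td>Yes</td>
         <td>0.9.0</td>
       </tr>
     </tbody>
   </table>
   ```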
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
