yihua commented on code in PR #18876:
URL: https://github.com/apache/hudi/pull/18876#discussion_r3322269255
##########
website/docs/lance_file_format.md:
##########
@@ -91,35 +90,18 @@ export LANCE_BUNDLE_JAR=/path/to/lance-spark-bundle-3.5_2.12-0.4.0.jar
 spark-shell --jars $HUDI_BUNDLE_JAR,$LANCE_BUNDLE_JAR
 ```
 
-## How Hudi + Lance Work Together
+## Layering
 
-Hudi manages the table layer — transactions, schema, timeline, table services — while Lance handles the
-file-level storage:
+Hudi manages the table layer (timeline, metadata, schema, file groups, table services). Lance is the
+on-disk file format for base files. Log files for MOR tables remain Avro.
 
-```
-┌───────────────────────────────────┐
-│ Hudi Table Layer                  │
-│ Timeline, Metadata, Indexing      │
-│ Transactions, Schema Evolution    │
-├───────────────────────────────────┤
-│ File Group / File Slice           │
-│ (same Hudi concepts as Parquet)   │
-├───────────────────────────────────┤
-│ Lance Data Files (.lance)         │
-│ Columnar storage                  │
-│ Fragment-based layout             │
-├───────────────────────────────────┤
-│ Storage (S3, GCS, HDFS, FS)       │
-└───────────────────────────────────┘
-```
-
-All Hudi table services work with Lance-backed tables:
+Table-service behavior on Lance-backed tables:
 
-- **Compaction** — merges log files into Lance base files
-- **Clustering** — reorganizes Lance files for better data locality
-- **Cleaning** — removes old Lance file versions
-- **Metadata indexing** — bloom filters work across Lance files; column stats and partition stats are
-  **automatically disabled** for Lance tables
+- **Compaction** — merges Avro log files into Lance base files.
+- **Clustering** — reorganizes records into new Lance files.
+- **Cleaning** — removes obsolete Lance file slices.
+- **Metadata indexing** — bloom filter indexing is supported. Column-stats and partition-stats
+  indices are automatically disabled for Lance base files.

Review Comment:
   This is removed
