GitHub user rahil-c edited a discussion: RFC-100: Lance File Format support in Hudi
## ✅ Lance File Format Integration Tasks

See the following feature for more context: https://github.com/apache/hudi/issues/14127

This discussion tracks the new feature for supporting unstructured data in Hudi via formats like Lance that are focused on AI/ML use cases.

Here is the initial scope of what we are targeting (note that this list will continue to grow as we get deeper into the integration; for now it aims to first support the Hudi Spark client):

- [ ] Add base `HoodieFileWriter` for Lance with a Spark implementation, [PR](https://github.com/apache/hudi/pull/14131) (merged in feature branch)
- [ ] Add base `HoodieFileReader` for Lance with a Spark implementation, [PR](https://github.com/apache/hudi/pull/14132)
- [ ] Add `SparkColumnarFileReader` implementation for Spark Datasource integration with Lance, [PR](https://github.com/apache/hudi/pull/14135)
- [ ] Implement bulk insert / insert / upsert / delete validation for COW, [PR](https://github.com/apache/hudi/pull/14137)
- [ ] Implement bulk insert / insert / upsert / delete validation for MOR (with Lance base file and Avro log file), [PR](https://github.com/apache/hudi/pull/14154)
- [ ] Ensure that basic compaction and clustering work for Hudi tables with Lance, [PR](https://github.com/apache/hudi/pull/14156) (requires rebase from above)
- [ ] Ensure that schema evolution (add column) works for Hudi tables with Lance, [PR](https://github.com/apache/hudi/pull/14160)
- [ ] FileWriter improvements: support `canWrite()` and bloom filter, [comment1](https://github.com/apache/hudi/pull/14131#discussion_r2455261849), [comment2](https://github.com/apache/hudi/pull/14131#discussion_r2455244853)
- [ ] Integrate Lance as a log file format
- [ ] Add predicate (filter) push-down
- [ ] Support `ColumnarBatch` vectorized reading

Changes will be raised against the following open source feature branch: https://github.com/apache/hudi/tree/feature-branch-rfc100-unstructured-data

## Improving File Format Integration Points

Changes that modify the existing Hudi master (1.2.0) to allow for better integration with file formats in the future:

- [ ] Consolidate file writer side interfaces for new file format integrations, [PR](https://github.com/apache/hudi/pull/14173)

GitHub link: https://github.com/apache/hudi/discussions/14128

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]
