GitHub user rahil-c edited a discussion: RFC-100: Lance File Format support in 
Hudi

## ✅ Lance File Format Integration Tasks

See the following feature for more context: 
https://github.com/apache/hudi/issues/14127

This feature adds support for unstructured data in Hudi via formats such as 
Lance that are focused on AI/ML use cases. Below is the initial scope we are 
targeting. (Note: this list will continue to grow as we get deeper into the 
integration; for now it aims to first support the Hudi Spark client.)

- [ ] Add base `HoodieFileWriter` for Lance with a Spark implementation, 
[PR](https://github.com/apache/hudi/pull/14131) (merged in feature branch)
- [ ] Add base `HoodieFileReader` for Lance with a Spark implementation, 
[PR](https://github.com/apache/hudi/pull/14132)
- [ ] Add `SparkColumnarFileReader` implementation for Spark Datasource 
Integration with Lance, [PR](https://github.com/apache/hudi/pull/14135)
- [ ] Implement bulk insert / insert / upsert / delete validation for COW, 
[PR](https://github.com/apache/hudi/pull/14137)
- [ ] Implement bulk insert / insert / upsert / delete validation for MOR (with 
Lance base file and Avro log file), 
[PR](https://github.com/apache/hudi/pull/14154)
- [ ] Ensure that basic compaction and clustering work for Hudi tables with 
Lance, [PR](https://github.com/apache/hudi/pull/14156) (requires rebase from 
above)
- [ ] Ensure that schema evolution (add column) works for Hudi tables with 
Lance, [PR](https://github.com/apache/hudi/pull/14160)
- [ ] FileWriter improvements: support `canWrite()` and bloom filter, 
[comment1](https://github.com/apache/hudi/pull/14131#discussion_r2455261849), 
[comment2](https://github.com/apache/hudi/pull/14131#discussion_r2455244853)
- [ ] Integrate Lance as a log file format
- [ ] Add predicate (filter) push-down
- [ ] Support `ColumnarBatch` vectorized reading

Changes will be raised against the following open-source feature branch: 
https://github.com/apache/hudi/tree/feature-branch-rfc100-unstructured-data


## Improving File Format Integration Points

Changes to the existing Hudi master (1.2.0) that allow for better integration 
with file formats in the future.

- [ ] Consolidate file writer side interfaces, 
[PR](https://github.com/apache/hudi/pull/14173)

GitHub link: https://github.com/apache/hudi/discussions/14128
