vinishjail97 commented on PR #768:
URL: https://github.com/apache/incubator-xtable/pull/768#issuecomment-3695751441
@sapienza88 I'm adding a more detailed design and a class-level structure to
unblock this PR.
**Design Principle**
XTable operates at a metadata level only. The current PR approach of
writing new Parquet files with filtered data is incorrect. XTable should:
- Discover existing Parquet files from storage
- Generate table format metadata (Hudi, Iceberg, Delta) for those files
- NEVER write new Parquet files or transform data.
**Architecture**
```
┌────────────────────────────────────────────────────────────┐
│ ParquetConversionSource │
│ - Uses ParquetFileDiscovery to find files │
│ - Converts file metadata to InternalDataFile │
│ - Returns snapshots and table changes │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ ParquetFileDiscovery (new class) │
│ - Lists all .parquet files from filesystem │
│ - Filters files by modification time │
│ - Returns lightweight file metadata │
└────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ FileSystem (HDFS/S3/GCS/Azure) │
│ - fs.listFiles(basePath, recursive=true) │
└────────────────────────────────────────────────────────────┘
```
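A rough sketch of what the discovery layer in the diagram could look like. To keep it self-contained it uses `java.nio.file` as a stand-in for the Hadoop `FileSystem` listing, and the names `ParquetFileInfo`, `ParquetFileDiscoverySketch`, and `discover` are hypothetical, not existing XTable classes. It only reads file listings and attributes; it never opens or rewrites a Parquet file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical lightweight descriptor; a real source would map it to InternalDataFile. */
final class ParquetFileInfo {
  final Path path;
  final long sizeBytes;
  final long modificationTimeMs;

  ParquetFileInfo(Path path, long sizeBytes, long modificationTimeMs) {
    this.path = path;
    this.sizeBytes = sizeBytes;
    this.modificationTimeMs = modificationTimeMs;
  }
}

/** Sketch only: java.nio.file stands in for the Hadoop FileSystem listing (assumption). */
final class ParquetFileDiscoverySketch {

  /**
   * Recursively lists .parquet files under basePath whose modification time is
   * strictly greater than modifiedAfterMs, ordered by modification time.
   */
  static List<ParquetFileInfo> discover(Path basePath, long modifiedAfterMs) throws IOException {
    try (Stream<Path> paths = Files.walk(basePath)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".parquet"))
          .map(ParquetFileDiscoverySketch::describe)
          .filter(f -> f.modificationTimeMs > modifiedAfterMs)
          .sorted(Comparator.comparingLong((ParquetFileInfo f) -> f.modificationTimeMs))
          .collect(Collectors.toList());
    }
  }

  // Reads only filesystem attributes (size, mtime); file contents are never touched.
  private static ParquetFileInfo describe(Path p) {
    try {
      return new ParquetFileInfo(p, Files.size(p), Files.getLastModifiedTime(p).toMillis());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Calling `discover(basePath, 0L)` would return every Parquet file for a full snapshot, while passing the last synced modification time would return only the newer files.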
Use the file modification time as the commit identifier; this lets you
identify which files have already been synced and which have not. Metadata
needs to be generated only for the files that have not been synced. Future
work, such as optimizing discovery and handling Parquet files deleted from
storage, can be handled incrementally; I'm hoping to keep the scope small for
this PR.
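To make the modification-time idea concrete, here is a minimal in-memory sketch. `SyncPlanSketch` and `plan` are hypothetical names, and it assumes the last synced commit identifier is stored as a modification time in milliseconds. Files newer than that value still need metadata generated, and the largest modification time among them becomes the next commit identifier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Sketch only: decides which files still need metadata, keyed by modification time. */
final class SyncPlanSketch {
  final List<String> filesToSync; // paths not yet covered by a synced commit
  final long nextCommitId;        // modification time to record after this sync

  private SyncPlanSketch(List<String> filesToSync, long nextCommitId) {
    this.filesToSync = filesToSync;
    this.nextCommitId = nextCommitId;
  }

  /**
   * @param modificationTimeByPath file path mapped to modification time in ms
   * @param lastSyncedCommitId modification time of the last synced commit (assumed format)
   */
  static SyncPlanSketch plan(Map<String, Long> modificationTimeByPath, long lastSyncedCommitId) {
    List<Map.Entry<String, Long>> pending = new ArrayList<>();
    long nextCommitId = lastSyncedCommitId;
    for (Map.Entry<String, Long> entry : modificationTimeByPath.entrySet()) {
      // Anything modified after the last synced commit has not been synced yet.
      if (entry.getValue() > lastSyncedCommitId) {
        pending.add(entry);
        nextCommitId = Math.max(nextCommitId, entry.getValue());
      }
    }
    // Order by modification time, then path, so the plan is deterministic.
    pending.sort(
        Comparator.comparingLong((Map.Entry<String, Long> e) -> e.getValue())
            .thenComparing(Map.Entry::getKey));
    List<String> paths = new ArrayList<>();
    for (Map.Entry<String, Long> entry : pending) {
      paths.add(entry.getKey());
    }
    return new SyncPlanSketch(paths, nextCommitId);
  }
}
```

One caveat with this sketch: a file written with exactly the same modification time as the last synced commit would be skipped by the strict comparison, so a real implementation may need a tie-breaking rule.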