Re: [PR] Parquet Incremental Sync [incubator-xtable]

via GitHub Sun, 28 Dec 2025 23:46:16 -0800


vinishjail97 commented on PR #768:
URL: https://github.com/apache/incubator-xtable/pull/768#issuecomment-3695751441


   @sapienza88 I'm adding a more detailed design and a class-level structure to 
unblock this PR.
   
   **Design Principle**
     XTable operates at a metadata level only. The current PR approach of 
writing new Parquet files with filtered data is incorrect. XTable should:
     - Discover existing Parquet files from storage
     - Generate table format metadata (Hudi, Iceberg, Delta) for those files
     - NEVER write new Parquet files or transform data. 
     
   **Architecture**
   ```
     ┌────────────────────────────────────────────────────────────┐
     │                  ParquetConversionSource                   │
     │  - Uses ParquetFileDiscovery to find files                 │
     │  - Converts file metadata to InternalDataFile              │
     │  - Returns snapshots and table changes                     │
     └────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
     ┌────────────────────────────────────────────────────────────┐
     │              ParquetFileDiscovery (new class)              │
     │  - Lists all .parquet files from filesystem                │
     │  - Filters files by modification time                      │
     │  - Returns lightweight file metadata                       │
     └────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
     ┌────────────────────────────────────────────────────────────┐
     │            FileSystem (HDFS/S3/GCS/Azure)                  │
     │  - fs.listFiles(basePath, recursive=true)                  │
     └────────────────────────────────────────────────────────────┘
   ```
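
   A minimal sketch of the discovery layer in the diagram above, not the actual 
XTable implementation. It uses `java.nio.file` as a stand-in for the Hadoop 
`FileSystem` listing so that it runs without extra dependencies. The 
`ParquetFileInfo` type, its fields, and the method names are assumptions made 
for illustration; only the class name `ParquetFileDiscovery` comes from the 
design.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: lists existing files and reads their metadata only.
// It never opens, rewrites, or transforms Parquet data.
public class ParquetFileDiscovery {

  /** Lightweight file metadata (illustrative type, not an XTable class). */
  public static final class ParquetFileInfo {
    public final Path path;
    public final long sizeBytes;
    public final long modificationTimeMs;

    ParquetFileInfo(Path path, long sizeBytes, long modificationTimeMs) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.modificationTimeMs = modificationTimeMs;
    }
  }

  /** Recursively lists all .parquet files under basePath, oldest first. */
  public static List<ParquetFileInfo> listFiles(Path basePath) throws IOException {
    try (Stream<Path> paths = Files.walk(basePath)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".parquet"))
          .map(ParquetFileDiscovery::toInfo)
          .sorted(Comparator.comparingLong((ParquetFileInfo f) -> f.modificationTimeMs))
          .collect(Collectors.toList());
    }
  }

  /** Keeps only files modified strictly after sinceMs (epoch milliseconds). */
  public static List<ParquetFileInfo> listFilesModifiedAfter(Path basePath, long sinceMs)
      throws IOException {
    return listFiles(basePath).stream()
        .filter(f -> f.modificationTimeMs > sinceMs)
        .collect(Collectors.toList());
  }

  private static ParquetFileInfo toInfo(Path p) {
    try {
      return new ParquetFileInfo(p, Files.size(p), Files.getLastModifiedTime(p).toMillis());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

   In the real class the listing would presumably go through the Hadoop 
`FileSystem` named in the diagram; the filtering logic stays the same.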
   
   Use the file modification time as the commit identifier; this lets you 
identify which files have already been synced and which have not. Metadata 
needs to be generated only for the files that have not been synced. Future 
work, such as optimizing the listing and handling Parquet files deleted from 
storage, can be added incrementally; I'm hoping to keep the scope small for 
this PR. 
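
   To illustrate the modification-time idea, here is a hedged sketch of how a 
sync step could choose the files that still need metadata and advance the 
commit identifier. `IncrementalSyncPlanner`, `FileEntry`, and `SyncPlan` are 
hypothetical names for this example, not existing XTable classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the last synced modification time acts as the commit identifier.
public class IncrementalSyncPlanner {

  /** Minimal view of a discovered file (illustrative type). */
  public static final class FileEntry {
    public final String path;
    public final long modificationTimeMs;

    public FileEntry(String path, long modificationTimeMs) {
      this.path = path;
      this.modificationTimeMs = modificationTimeMs;
    }
  }

  /** Files that still need metadata, plus the commit identifier to record afterwards. */
  public static final class SyncPlan {
    public final List<FileEntry> pendingFiles;
    public final long newCheckpointMs;

    SyncPlan(List<FileEntry> pendingFiles, long newCheckpointMs) {
      this.pendingFiles = pendingFiles;
      this.newCheckpointMs = newCheckpointMs;
    }
  }

  /**
   * Files modified after lastSyncedMs are treated as not yet synced.
   * The new checkpoint is the largest modification time among them,
   * or lastSyncedMs unchanged when nothing is pending.
   */
  public static SyncPlan plan(List<FileEntry> discovered, long lastSyncedMs) {
    List<FileEntry> pending = new ArrayList<>();
    long checkpoint = lastSyncedMs;
    for (FileEntry f : discovered) {
      if (f.modificationTimeMs > lastSyncedMs) {
        pending.add(f);
        checkpoint = Math.max(checkpoint, f.modificationTimeMs);
      }
    }
    return new SyncPlan(pending, checkpoint);
  }
}
```

   The strict `>` comparison is an assumption of this sketch: a file that lands 
later with exactly the same timestamp as the checkpoint would be skipped, so a 
real implementation may need a tie-breaking rule.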
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
