sapienza88 commented on PR #768:
URL: https://github.com/apache/incubator-xtable/pull/768#issuecomment-3695997614

   > @sapienza88 I'm adding a more detailed design and a class-level structure 
to unblock this PR.
   > 
   > **Design Principle** XTable operates at a metadata level only. The current 
PR approach of writing new Parquet files with filtered data is incorrect. 
XTable should:
   > 
   > * Discover existing Parquet files from storage
   > * Generate table format metadata (Hudi, Iceberg, Delta) for those files
   > * NEVER write new Parquet files or transform data.
   > 
   > **Architecture**
   > 
   > ```
   >   ┌────────────────────────────────────────────────────────────┐
   >   │                  ParquetConversionSource                   │
   >   │  - Uses ParquetFileDiscovery to find files                 │
   >   │  - Converts file metadata to InternalDataFile              │
   >   │  - Returns snapshots and table changes                     │
   >   └────────────────────────────────────────────────────────────┘
   >                               │
   >                               ▼
   >   ┌────────────────────────────────────────────────────────────┐
   >   │              ParquetFileDiscovery (new class)              │
   >   │  - Lists all .parquet files from filesystem                │
   >   │  - Filters files by modification time                      │
   >   │  - Returns lightweight file metadata                       │
   >   └────────────────────────────────────────────────────────────┘
   >                               │
   >                               ▼
   >   ┌────────────────────────────────────────────────────────────┐
   >   │            FileSystem (HDFS/S3/GCS/Azure)                  │
   >   │  - fs.listFiles(basePath, recursive=true)                  │
   >   └────────────────────────────────────────────────────────────┘
   > ```
   > 
   > Use the file modification time as the commit identifier. This lets you 
identify which files have already been synced and which have not; the unsynced 
files need to have metadata generated. Future work, such as optimizations and 
handling Parquet files deleted from storage, can be added incrementally. The 
aim is to keep the scope of this PR small.
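
A minimal sketch of the discovery step described in the quoted design, to make the flow concrete. It is not the XTable implementation: `ParquetFileDiscoverySketch`, `FileInfo`, `listParquetFiles`, `filesModifiedAfter`, and `latestModificationTime` are hypothetical names, and `java.nio.file` stands in for the Hadoop `FileSystem` listing (`fs.listFiles(basePath, recursive=true)`) so that the example runs without extra dependencies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical sketch of the proposed ParquetFileDiscovery class. */
public class ParquetFileDiscoverySketch {

  /** Lightweight file metadata only; no Parquet data is read or written. */
  public static final class FileInfo {
    public final Path path;
    public final long sizeBytes;
    public final long modificationTimeMs;

    FileInfo(Path path, long sizeBytes, long modificationTimeMs) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.modificationTimeMs = modificationTimeMs;
    }
  }

  /** Recursively lists all .parquet files under basePath, oldest first. */
  public static List<FileInfo> listParquetFiles(Path basePath) throws IOException {
    try (Stream<Path> paths = Files.walk(basePath)) {
      return paths
          .filter(Files::isRegularFile)
          .filter(p -> p.getFileName().toString().endsWith(".parquet"))
          .map(ParquetFileDiscoverySketch::toFileInfo)
          .sorted(Comparator.comparingLong(f -> f.modificationTimeMs))
          .collect(Collectors.toList());
    }
  }

  /** Keeps files modified strictly after the last synced modification time. */
  public static List<FileInfo> filesModifiedAfter(List<FileInfo> files, long lastSyncedMs) {
    return files.stream()
        .filter(f -> f.modificationTimeMs > lastSyncedMs)
        .collect(Collectors.toList());
  }

  /** Largest modification time seen, usable as the next commit identifier. */
  public static long latestModificationTime(List<FileInfo> files, long fallbackMs) {
    return files.stream().mapToLong(f -> f.modificationTimeMs).max().orElse(fallbackMs);
  }

  private static FileInfo toFileInfo(Path path) {
    try {
      return new FileInfo(path, Files.size(path), Files.getLastModifiedTime(path).toMillis());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch a sync would call `listParquetFiles(basePath)`, pass the result and the previously stored commit value to `filesModifiedAfter` to get the files that still need metadata, and then store `latestModificationTime` as the new commit value. The strict `>` comparison is an assumption: a file written with exactly the same modification time as the last synced one would be skipped, so a real implementation may need a tie-breaking rule.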
   
   @vinishjail97 thanks. We are already implementing most of the suggested 
logic. Please look at:

   - https://github.com/apache/incubator-xtable/blob/22f4026f00b05069e952d7bfbefee7dda10d79c3/xtable-core/src/main/java/org/apache/xtable/parquet/ParquetConversionSource.java#L72
     for converting file metadata to InternalDataFile
   - https://github.com/apache/incubator-xtable/blob/22f4026f00b05069e952d7bfbefee7dda10d79c3/xtable-core/src/main/java/org/apache/xtable/parquet/ParquetConversionSource.java#L217
     for listing all .parquet files from the filesystem
   - https://github.com/apache/incubator-xtable/blob/22f4026f00b05069e952d7bfbefee7dda10d79c3/xtable-core/src/main/java/org/apache/xtable/parquet/ParquetConversionSource.java#L209
     for filtering files by modification time
   - https://github.com/apache/incubator-xtable/blob/22f4026f00b05069e952d7bfbefee7dda10d79c3/xtable-core/src/main/java/org/apache/xtable/parquet/ParquetConversionSource.java#L151C3-L151C68
     for returning snapshots and table changes

   Let me know if the highlighted implementation can be used in the current PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
