deniskuzZ commented on PR #6273: URL: https://github.com/apache/hive/pull/6273#issuecomment-3798225276
> Is this operation reversible? Like, if tomorrow I end up restoring my files, can I roll back to the previous snapshots and restore my table to its original state?

Yes, the operation is reversible. The repair operation creates a new snapshot with updated manifests that drop the references to missing data files. Iceberg maintains the complete snapshot history, so you can roll back to any previous snapshot.

> What happens if the Manifest* files go missing? How do we repair that?

This PR does not address missing manifest files. The current implementation follows the same basic repair functionality that Impala already implemented (commit fdad9d32041a736108b876704bd0354090a88d29), which focuses on detecting and removing references to missing data files. Missing manifest files are a more severe form of corruption that would require reconstructing metadata from the remaining manifests or data files, which is beyond the scope of this basic repair functionality.

> We handle the `DataFile` missing scenario, but what about `DeleteFiles`/DVs? Once a `DataFile` is dropped, what about the `DeleteFiles`/DVs associated with it?

The repair operation cannot proceed if there are missing delete files. This is a limitation of Iceberg's `DeleteFiles` API, which only allows removing data files, not delete files or deletion vectors. This aligns with the Impala implementation's scope and the fundamental constraints of the Iceberg `DeleteFiles` API.

> I just skimmed over the implementation; we are doing `planTask` on the entire table. Is that batched? I am doubtful whether at scale it will lead to OOM kind of stuff within the HS2.

The `planFiles()` method returns a `CloseableIterable<FileScanTask>`, which is a lazy iterator that does not load all files into memory at once. This design prevents OOM issues even for very large tables.

> Did you explore, rather than operating on the main table, getting the entries from the `ALL_FILES` metadata table or some other relevant one?

The repair operation needs to check the files referenced by the current table snapshot, which `planFiles()` provides directly. The metadata table approach would add unnecessary complexity without a performance benefit.

> Should we have a split batch as well? Like, you found 1K missing files: hold, commit, and start again, to avoid memory pressure.

Batching commits is not necessary for this implementation. The operation only stores file paths (strings) in memory, not file contents or metadata, so the memory footprint is minimal.

> Fundamentally, I am not sure whether this should be within `MSCK` or an independent command like `ALTER TABLE EXECUTE <SOME FANCY THING>`. MSCK, I believe, was meant to fix inconsistency between the metadata and the actual data, e.g. after you ingested data. This is more like fixing the metadata after a data loss. It is a bit debatable, different people see it differently, but we should think about it once. I know Spark handles such things via their `DeleteFile` action or some similar action, but it does not find the missing files on its own.

Integrating the repair functionality into `MSCK REPAIR TABLE` is the appropriate design choice for Hive, as it aligns with MSCK's core purpose of synchronizing metadata with the data files that actually exist. MSCK is designed to repair inconsistencies between the Hive Metastore and the actual data files in storage. For ACID tables, MSCK already handles missing writes and metadata synchronization (see `TestMSCKRepairOnAcid.java`). The repair operation fundamentally performs the same function: synchronizing table metadata with the actual state of files in storage.
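The memory argument above can be illustrated with a toy model in plain Java (this is a sketch of the pattern, not the actual Iceberg API or the PR's code): a lazy stream stands in for `planFiles()`, and only the paths of missing files, plain strings, are ever collected.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy model of the repair scan: stream referenced file paths lazily
 *  and collect only the paths that no longer exist in storage. */
public class RepairScanSketch {

    // Hypothetical stand-in for planFiles(): a lazy stream of referenced
    // paths; nothing is materialized up front.
    static Stream<String> planFiles(List<String> manifestEntries) {
        return manifestEntries.stream();
    }

    // Mirrors the claim in the comment: only the missing paths (strings)
    // are held in memory, never file contents or per-file metadata.
    static List<String> findMissing(List<String> referenced, Set<String> onStorage) {
        return planFiles(referenced)
                .filter(path -> !onStorage.contains(path))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> referenced = List.of("a.parquet", "b.parquet", "c.parquet");
        Set<String> onStorage = Set.of("a.parquet", "c.parquet");
        System.out.println(findMissing(referenced, onStorage)); // prints [b.parquet]
    }
}
```

In the real implementation the missing-path list would then be fed to a single `DeleteFiles` commit, which is why the footprint stays proportional to the number of missing files rather than the table size.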
