GitHub user suryaprasanna edited a discussion: Parquet Tool Interface for File-Level Operations in Clustering

### Context
I'd like to restart the discussion around adding a parquet tool interface for 
file-level operations during clustering. I previously opened PR #9006 which 
implements this capability, and I believe this feature would be valuable for 
the Hudi community.

### Problem Statement
Currently, Hudi's clustering strategies operate on a record-by-record basis. 
For use cases such as column pruning, encryption, or selective column 
preservation, this approach is inefficient: these operations don't require 
reading and deserializing individual records, and they can be performed much 
more efficiently at the file level using parquet-tools.

### Proposed Solution
The PR introduces a ParquetToolsExecutionStrategy that enables efficient 
file-level operations during clustering. The implementation:
  - Extends SingleSparkJobExecutionStrategy to provide a framework for 
file-level clustering operations
  - Introduces HoodieFileWriteHandle for file-level operations (as opposed to 
record-level write handles; a sketch of this idea follows the list)
  - Supports proper rollback via marker files
  - Enables efficient rewriting without record iteration
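
To make the file-level handle idea concrete, here is a minimal hypothetical 
sketch of the contract such a handle could expose, in contrast to a 
record-level handle. The names (`FileWriteHandle`, `write`, `createMarker`) 
are illustrative only and are not taken from PR #9006:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;

/**
 * Hypothetical contract for a file-level write handle. Unlike a
 * record-level handle, it consumes and produces whole files; the
 * names here are illustrative, not the classes from PR #9006.
 */
public interface FileWriteHandle {

  /**
   * Rewrite one source file into the target file group without
   * iterating records, e.g. by delegating to a parquet-tools style
   * file rewrite. Returns the path of the file that was produced.
   */
  Path write(Path sourceFile) throws IOException;

  /**
   * Create a marker before writing, so that a failed clustering
   * attempt can be rolled back by deleting the files named in any
   * leftover markers.
   */
  void createMarker(Path targetFile) throws IOException;
}
```

Keeping marker creation in the handle contract is what allows a failed 
clustering attempt to be rolled back: leftover markers identify the partially 
written files to delete.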

This interface would be particularly beneficial for:
  1. Column pruning - removing unnecessary columns to reduce storage costs 
without deserializing records (see the sketch after this list)
  2. Encryption - applying encryption at the file level
  3. Schema evolution - efficient column reordering or type changes
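
For instance, column pruning at the file level is already possible through 
parquet-mr's rewrite API. Below is a minimal sketch, assuming parquet-mr 1.13 
or newer (where `ParquetRewriter` and `RewriteOptions` are available); the 
paths and column names are placeholders:

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class ColumnPruneExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Placeholder input/output paths for illustration.
    Path input = new Path("/tmp/input.parquet");
    Path output = new Path("/tmp/pruned.parquet");

    // Drop the listed columns; all remaining column chunks are
    // copied over without deserializing individual records.
    RewriteOptions options = new RewriteOptions.Builder(conf, input, output)
        .prune(Arrays.asList("debug_payload", "raw_blob"))
        .build();

    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();
    rewriter.close();
  }
}
```

Because the retained column chunks are copied as-is rather than decoded 
record by record, this kind of rewrite avoids the per-record deserialization 
cost that the current clustering strategies incur.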

GitHub link: https://github.com/apache/hudi/discussions/17958
