suryaprasanna commented on PR #9006: URL: https://github.com/apache/hudi/pull/9006#issuecomment-1613853964
> > If there is a use case for pruning some columns to save storage, the current clustering approach will iterate over every record and remove the unused column, which is very time consuming.
>
> Thanks @suryaprasanna, can you clarify the relationship between column pruning and clustering? In the regular notion of Hudi clustering, it only merges small file groups into larger ones with optional sorting on columns; no pruning happens here. How does the user expect to improve efficiency with this patch overall?

Clustering was initially added to do sorting and stitching, but its framework is flexible enough to support a wide variety of rewriter use cases. The following are other rewriter use cases that can be handled by the clustering framework:

1. Encryption. Async encryption of data files can be done on demand by restricting the clustering group to 1, which then becomes an update of the file.
2. Column pruning. This change can be used to run the parquet_tools prune command on unused columns to reduce the storage footprint.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org