amrishlal opened a new pull request, #8645:
URL: https://github.com/apache/hudi/pull/8645

   ### Change Logs
   
   Calculates and prints file size stats for data files that were modified in the half-open interval [start date, end date), set with the --start-date and --end-date parameters. The --num-days parameter can be used instead to select data files modified over the last --num-days days. If --start-date is specified, --num-days is ignored. If none of the date parameters are set, stats are computed over all data files in all partitions of the table. By default, only table-level file size stats are printed. If the --partition-status option is used, partition-level file size stats are printed as well.
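
The parameter precedence described above can be sketched as follows. This is a minimal illustration, not the utility's code: `resolve_window` and `in_window` are hypothetical names, and the choice to anchor `--num-days` at "today" and leave a missing bound open is an assumption that the PR text does not confirm.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# (start, end) bounds; None means "unbounded on that side".
Window = Tuple[Optional[date], Optional[date]]


def resolve_window(start_date: Optional[date],
                   end_date: Optional[date],
                   num_days: Optional[int],
                   today: date) -> Window:
    """Hypothetical helper mirroring the precedence rules in the description."""
    if start_date is not None:
        # --start-date takes precedence; --num-days is ignored.
        return (start_date, end_date)
    if num_days is not None:
        # Assumption: "last N days" is counted back from `today`.
        return (today - timedelta(days=num_days), end_date)
    # No start bound: with no end date either, every data file qualifies.
    return (None, end_date)


def in_window(modified: date, window: Window) -> bool:
    """Half-open check: start is inclusive, end is exclusive."""
    start, end = window
    if start is not None and modified < start:
        return False
    if end is not None and modified >= end:
        return False
    return True
```

With a start date of 2023-05-01 and an end date of 2023-05-05, a file modified on 2023-05-01 is selected, while one modified on 2023-05-05 is not, because the end of the interval is exclusive.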
   
   The following stats are calculated:
    * Number of files
    * Total table size
    * Minimum file size
    * Maximum file size
    * Average file size
    * Median file size
    * p50 file size
    * p90 file size
    * p95 file size
    * p99 file size
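
For illustration, the stats listed above could be computed from a list of file sizes roughly like this. It is a sketch under stated assumptions: `file_size_stats` and `percentile` are hypothetical names, and the nearest-rank percentile definition is an assumption, since the PR text does not say which percentile method the utility uses.

```python
from statistics import mean, median


def percentile(sorted_sizes, p):
    """Nearest-rank percentile (assumed method): the smallest value such
    that at least p percent of the values are less than or equal to it."""
    n = len(sorted_sizes)
    rank = max(1, (p * n + 99) // 100)  # integer ceil(p * n / 100)
    return sorted_sizes[rank - 1]


def file_size_stats(sizes):
    """Summarize file sizes (in bytes) into the stats listed above."""
    if not sizes:
        raise ValueError("no data files selected")
    s = sorted(sizes)
    return {
        "num_files": len(s),
        "total_size": sum(s),
        "min": s[0],
        "max": s[-1],
        "avg": mean(s),
        "median": median(s),
        "p50": percentile(s, 50),
        "p90": percentile(s, 90),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
    }
```

Under the nearest-rank definition, the p50 value can differ from the median when the number of files is even, because the median averages the two middle values. Whether the utility's two numbers ever differ depends on its actual percentile method.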
   
    Sample spark-submit command:
    > ./bin/spark-submit \
    --class org.apache.hudi.utilities.TableSizeStats \
    $HUDI_DIR/packaging/hudi-utilities-bundle/target/hudi-utilities-bundle_2.11-0.14.0-SNAPSHOT.jar \
    --base-path <base-path> \
    --num-days <number-of-days>
   
   ### Impact
   
   Offline utility
   
   ### Risk level (write none, low, medium or high below)
   
   low
   
   ### Documentation Update
   
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

