[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sagar Sumit updated HUDI-2003:
------------------------------
    Fix Version/s: 0.12.1
                       (was: 0.12.0)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2003
>                 URL: https://issues.apache.org/jira/browse/HUDI-2003
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: writer-core
>            Reporter: Vinay
>            Priority: Minor
>              Labels: user-support-issues
>             Fix For: 0.12.1
>
>
> Context:
> Submitted a Spark job to read 3-4B ORC records and write them out in Hudi format. The table below summarizes the runs carried out with different options:
>
> ||CONFIG||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB, COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=1100000|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB, BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
>
> Based on these runs, it appears that the assumed compression ratio is off.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
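A minimal sketch of the arithmetic behind the report, using only the numbers from the table above: dividing the observed file size by the configured PARQUET_FILE_MAX_BYTES gives the effective input-to-parquet compression ratio each run actually achieved. The runs converge on roughly 0.165-0.174, which suggests the writer's assumed ratio (configurable in Hudi via a compression-ratio setting; the exact config key and its default are not stated in this issue) underestimates the real one, producing files far smaller than the configured maximum.

```python
# Effective compression ratio implied by each run in the table:
# observed file size / configured PARQUET_FILE_MAX_BYTES.
# Sizes are taken verbatim from the issue description, in MB.
runs = [
    ("PARQUET_FILE_MAX_BYTES=1GB", 1 * 1024, 178),
    ("PARQUET_FILE_MAX_BYTES=4GB", 4 * 1024, 675),
    ("PARQUET_FILE_MAX_BYTES=6GB", 6 * 1024, 1012),
]

for config, max_mb, actual_mb in runs:
    ratio = actual_mb / max_mb
    print(f"{config}: effective ratio ~ {ratio:.3f}")
```

The consistency of the ratio across the 1GB/4GB/6GB runs is what makes auto-computing it (as this issue proposes) attractive: a few sample record batches would be enough to estimate it per dataset instead of relying on a fixed guess.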