[ https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nishith Agarwal updated HUDI-2003:
----------------------------------
    Summary: Auto Compute Compression ratio for input data to output parquet/orc file size  (was: Auto Compute Compression)

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -----------------------------------------------------------------------------
>
>                 Key: HUDI-2003
>                 URL: https://issues.apache.org/jira/browse/HUDI-2003
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Writer Core
>            Reporter: Vinay
>            Priority: Major
>
> Context:
> Submitted a Spark job that read 3-4B ORC records and wrote them out in Hudi format. The table below summarizes the runs carried out with different options:
>
> ||CONFIG||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB, COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=1100000|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB, BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
>
> Based on these runs, it appears that the assumed compression ratio is off.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
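A minimal sketch of the observation behind the report: treating PARQUET_FILE_MAX_BYTES as the target file size, the ratio of actual on-disk file size to that target comes out roughly constant (about 0.16-0.17) across the 1GB, 4GB, and 6GB runs in the table, which is consistent with a fixed (and here, mismatched) compression-ratio assumption rather than one computed from the data. The numbers are taken from the table above; the function name is illustrative, not a Hudi API.

```python
# Illustrative only: compute the implied actual-size/target-size ratio
# for each run reported in the table above.
def implied_ratio(target_bytes: int, actual_bytes: int) -> float:
    """Ratio of the observed output file size to the configured max size."""
    return actual_bytes / target_bytes

runs = [
    ("1GB", 1 * 1024**3, 178 * 1024**2),
    ("4GB", 4 * 1024**3, 675 * 1024**2),
    ("6GB", 6 * 1024**3, 1012 * 1024**2),
]

for label, target, actual in runs:
    print(f"PARQUET_FILE_MAX_BYTES={label}: actual/target = "
          f"{implied_ratio(target, actual):.3f}")
```

If the ratio really is stable for a given dataset, it could be estimated from already-written files and fed back into file sizing, which is the gist of "auto compute compression ratio".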