W-I-D-EE opened a new issue, #9293: URL: https://github.com/apache/iceberg/issues/9293
### Query engine

Spark 3.2.3

### Question

When compacting small parquet files I noticed it seems slow. Writing a single ~256 MB parquet file to HDFS takes 4-5 minutes (total time, including reading and writing the files, is 15 minutes per file group). The sustained write throughput hovers around 1-1.5 MB/s for one executor. I am wondering if there is something I can do to improve this throughput.

My data scenario is that I have incredibly small parquet records: they end up being 8-10 bytes per row across ~20 columns, so one parquet file ends up with 100-250 million rows in it.

I have increased `max-concurrent-file-group-rewrites` to max out active cores and increased `shuffle-partitions-per-file` (I back-ported that change to 3.2.3) to work around OOM issues, because my in-memory footprint is so much larger than the parquet footprint. While these help me maximize CPU usage, nothing I have tried has increased IO throughput. Any suggestions?
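For context, this is roughly how those options are being passed to the rewrite action. A minimal sketch only, assuming the Scala `SparkActions` API and an existing `spark` session; the table name and option values are placeholders, not the actual job configuration:

```scala
import org.apache.iceberg.Table
import org.apache.iceberg.spark.Spark3Util
import org.apache.iceberg.spark.actions.SparkActions

// Sketch: load the Iceberg table and run a compaction with the options
// discussed above. "db.my_table" and the option values are placeholders.
val table: Table = Spark3Util.loadIcebergTable(spark, "db.my_table")

SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("max-concurrent-file-group-rewrites", "20")               // raised to keep all active cores busy
  .option("target-file-size-bytes", (256L * 1024 * 1024).toString)  // ~256 MB output files
  .option("shuffle-partitions-per-file", "4")                       // back-ported to 3.2.3 per the issue, to reduce per-task memory
  .execute()
```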