W-I-D-EE opened a new issue, #9293:
URL: https://github.com/apache/iceberg/issues/9293

   ### Query engine
   
   Spark 3.2.3
   
   ### Question
   
   When compacting small Parquet files, I noticed the job seems slow. Writing a single ~256 MB Parquet file to HDFS takes 4-5 minutes (the total time, including reading and writing the files, is 15 minutes per file group). The sustained write throughput hovers around 1-1.5 MB/s for one executor. I am wondering if there is something I can do to improve this throughput.
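   
   For context, the compaction is kicked off roughly like this (a minimal sketch; the catalog and table names are placeholders, and the target size shown is just the ~256 MB I am aiming for):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().appName("iceberg-compaction").getOrCreate()
   
   // Bin-pack small files into ~256 MB (268435456 bytes) output files.
   spark.sql(
     """CALL my_catalog.system.rewrite_data_files(
       |  table => 'db.my_table',
       |  strategy => 'binpack',
       |  options => map('target-file-size-bytes', '268435456')
       |)""".stripMargin)
   ```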
   
   My data scenario is that I have incredibly small Parquet records; they end up being 8-10 bytes per row across ~20 columns, so one Parquet file will end up with 100-250 million rows in it.
   
   I have increased max-concurrent-file-group-rewrites to max out the active cores, and increased shuffle-partitions-per-file (I back-ported that change to 3.2.3) to solve OOM issues, because my in-memory footprint is so much larger than the Parquet footprint. While these help me maximize CPU usage, nothing I have tried has increased IO throughput.
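   
   For reference, those knobs are being passed roughly like this (again a sketch with illustrative values; since shuffle-partitions-per-file only applies to the shuffling, sort-based rewriter, the strategy is 'sort' here and the sort key is a placeholder):
   
   ```scala
   // Sort-based rewrite with the tuning options mentioned above (values are illustrative).
   spark.sql(
     """CALL my_catalog.system.rewrite_data_files(
       |  table => 'db.my_table',
       |  strategy => 'sort',
       |  sort_order => 'id ASC',
       |  options => map(
       |    'max-concurrent-file-group-rewrites', '16',
       |    'shuffle-partitions-per-file', '4',
       |    'target-file-size-bytes', '268435456'
       |  )
       |)""".stripMargin)
   ```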
   
   Any suggestions?

