bvaradar commented on issue #2269:
URL: https://github.com/apache/hudi/issues/2269#issuecomment-731656591


   @AakashPradeep : I can quickly tell that the number of partitions is really 
high relative to the file size in each partition. It looks like each partition 
holds very few records (Parquet file size ~400 KB). S3 listing becomes a huge 
bottleneck in this case; I have seen S3 listing take a very long time for 
~100K partitions. With 0.7.0 (the next release), we are going to support 
zero-listing writes, which will avoid this bottleneck.
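   
   A minimal sketch of what that would look like from Spark, assuming the 
0.7.0 metadata table is enabled via the `hoodie.metadata.enable` key (the 
table name, path, and column names below are placeholders, not taken from 
your setup):
   
   ```scala
   import org.apache.spark.sql.SaveMode
   
   // `df` is assumed to be an existing DataFrame with columns
   // `uuid`, `ts`, and `event_date`.
   df.write
     .format("hudi")
     .option("hoodie.table.name", "events")
     .option("hoodie.datasource.write.recordkey.field", "uuid")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.datasource.write.partitionpath.field", "event_date")
     // 0.7.0+: maintain an internal metadata table so writes can
     // resolve file listings without issuing S3 LIST calls.
     .option("hoodie.metadata.enable", "true")
     .mode(SaveMode.Append)
     .save("s3a://bucket/warehouse/events")
   ```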
   
   But generally, you have too many partitions relative to your dataset size. 
If possible, partitioning on a lower-cardinality column would help, as in the 
sketch below.
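   
   For illustration (hypothetical column names again), one way to do that is 
to derive a coarse, low-cardinality partition column, e.g. a date truncated 
from a fine-grained timestamp, instead of partitioning by a high-cardinality 
field:
   
   ```scala
   import org.apache.spark.sql.SaveMode
   import org.apache.spark.sql.functions.{col, to_date}
   
   // `df` is an existing DataFrame as above. Derive `event_date`
   // (at most ~365 distinct values per year) from a fine-grained
   // `ts` column, then partition on it.
   val byDate = df.withColumn("event_date", to_date(col("ts")))
   
   byDate.write
     .format("hudi")
     .option("hoodie.table.name", "events")
     .option("hoodie.datasource.write.recordkey.field", "uuid")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.datasource.write.partitionpath.field", "event_date")
     .mode(SaveMode.Append)
     .save("s3a://bucket/warehouse/events")
   ```
   
   Fewer, larger partitions mean fewer S3 LIST calls and better-sized 
Parquet files.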
   
   
   

