[GitHub] [hudi] FelixKJose edited a comment on issue #4891: Clustering not working on large table and partitions

GitBox Mon, 07 Mar 2022 21:30:18 -0800


FelixKJose edited a comment on issue #4891:
URL: https://github.com/apache/hudi/issues/4891#issuecomment-1061421066



   @codope @suryaprasanna Thank you for the detailed information.
   
   A couple of questions:
   1. Suppose each of my (date) partitions is large (e.g. 6.5 TB of 
uncompressed data). Frequent async clustering is the suggested approach in that 
case, right? I am running on r5.4xlarge (meaning 37GB driver memory), so what 
would be the best clustering frequency, and what would be the best value for 
`hoodie.clustering.plan.strategy.small.file.limit`?
   2. Are there any other configurations I should be using, given the partition 
size mentioned above?
   3. Which lock provider is advised if I am running on AWS EMR?
   
   Note: Our requirement is to ingest data quickly while also supporting 
interactive workloads on the query side.
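
   For context, a minimal sketch of the writer options these questions refer 
to, written as a Python dict of the kind usually passed to a Spark writer. Every 
value is a placeholder chosen for illustration, not a recommendation, and the 
lock provider is deliberately left unset because that is what question 3 asks.

```python
# Hypothetical option set; all numeric values are placeholders, not tuned advice.
MB = 1024 * 1024

clustering_options = {
    # Schedule and run clustering asynchronously alongside ingestion.
    "hoodie.clustering.async.enabled": "true",
    # How many commits pass between clustering runs (the "frequency" in question 1).
    "hoodie.clustering.async.max.commits": "4",
    # Files smaller than this many bytes are candidates for clustering.
    "hoodie.clustering.plan.strategy.small.file.limit": str(600 * MB),
    # Upper bound on the size of each file that clustering writes.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * MB),
    # Concurrent writers and table services need a concurrency mode and a lock provider.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # Placeholder: the lock provider class to use on AWS EMR is the open question.
    "hoodie.write.lock.provider": "<lock-provider-class>",
}

for key, value in sorted(clustering_options.items()):
    print(f"{key}={value}")
```

   Such a dict would typically be applied with 
`df.write.format("hudi").options(**clustering_options)`, together with the 
usual table name, record key, and partition path options.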


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
