davehagman edited a comment on issue #3733:
URL: https://github.com/apache/hudi/issues/3733#issuecomment-933845183


   >  just partitioning on year, month and day did not work out for you and 
hence you have to go w/ hour as well?
   
   We tested multiple partitioning schemes and this gave us a good tradeoff 
between read and write performance (especially under multi-hour processing 
delays when we need to ingest large amounts of more recent data to catch up to 
real-time). Removing the hour partition _could_ be feasible now, though; I'm not 
sure how much testing we did originally with and without the hour specifically. 
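
For anyone comparing the two schemes, here is a minimal sketch of how the partition path changes with and without the hour. The Hive-style `year=/month=/day=/hour=` naming, the UTC normalization, and the `partition_path` helper are assumptions for illustration; the thread does not state the exact layout in use.

```python
from datetime import datetime, timezone


def partition_path(ts: datetime, include_hour: bool = True) -> str:
    """Build a Hive-style partition path from a timezone-aware timestamp.

    The key names (year, month, day, hour) are hypothetical; adjust them
    to match the actual table layout.
    """
    ts = ts.astimezone(timezone.utc)  # normalize so partitions are in UTC
    parts = [f"year={ts:%Y}", f"month={ts:%m}", f"day={ts:%d}"]
    if include_hour:
        parts.append(f"hour={ts:%H}")
    return "/".join(parts)


event_time = datetime(2021, 10, 4, 13, 27, tzinfo=timezone.utc)
hourly = partition_path(event_time)                      # four-level scheme
daily = partition_path(event_time, include_hour=False)   # hour dropped
```

With the hour included, each day fans out into up to 24 partitions, so a catch-up batch touches more, smaller partitions; without it, each day is a single larger partition.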
   
   > are you seeing spikes only in those batches where records are spread 
across older partitions? If you have regular traffic which updates only the 
last few partitions, is the perf back to normal?
   
   Yes, exactly. We ended up splitting these very old, sparse events out of the 
ingestion process, and this allowed performance to return to normal. 
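
As a rough illustration of that kind of split, the sketch below separates a batch into recent and old events before ingestion. It is not the actual job: the `event_time` field, the `split_by_age` helper, and the 7-day cutoff are all hypothetical, and the thread does not say how the split was implemented or where the old events were sent.

```python
from datetime import datetime, timedelta, timezone


def split_by_age(events, now, max_age=timedelta(days=7)):
    """Separate recent events from old ones by event timestamp.

    Events at or after (now - max_age) go to `recent`; everything older
    goes to `old`, which can then be handled outside the main ingestion.
    The 7-day default is an assumed value, not one from the thread.
    """
    cutoff = now - max_age
    recent, old = [], []
    for event in events:
        (recent if event["event_time"] >= cutoff else old).append(event)
    return recent, old


now = datetime(2021, 10, 4, 12, 0, tzinfo=timezone.utc)
events = [
    {"id": 1, "event_time": now - timedelta(hours=2)},
    {"id": 2, "event_time": now - timedelta(days=400)},  # very old, sparse
    {"id": 3, "event_time": now - timedelta(days=1)},
]
recent, old = split_by_age(events, now)
```

Only `recent` would go through the regular write path, so each batch touches just the last few partitions; `old` can be written separately on its own schedule.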


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

