hudi-bot opened a new issue, #14778:
URL: https://github.com/apache/hudi/issues/14778

   Situation: In ExternalSpillMap, we need to control the amount of data held 
in the in-memory map to avoid OOM. Currently, we do this by estimating the 
average payload size twice and computing total memory use as the average 
payload size multiplied by the number of payloads. The first estimate is taken 
when the first payload is inserted; the second is taken when 100 payloads are 
stored in memory. 
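
   To make the current behaviour concrete, here is a minimal Java sketch of the 
sampling scheme described above. It is not the actual ExternalSpillMap code: 
the class name SampledSizeEstimator, its methods, and the choice to average the 
first 100 payloads at the second sampling point are all assumptions made for 
illustration. The demo feeds 100 small payloads followed by 900 large ones to 
show how far the estimate can fall below the real total.

```java
// Hypothetical sketch of the two-point sampling described in this issue.
// None of these names are Hudi APIs.
final class SampledSizeEstimator {
    static final int RESAMPLE_AT = 100; // second sampling point from the description

    private long avgBytes;  // current estimate of the average payload size
    private long count;     // number of payloads held in memory
    private long sampleSum; // bytes of the payloads seen up to the second sample

    // payloadBytes stands in for the result of a per-payload size estimator.
    // The sketch only consults it up to the second sampling point.
    void onInsert(long payloadBytes) {
        count++;
        if (count <= RESAMPLE_AT) {
            sampleSum += payloadBytes;
        }
        if (count == 1) {
            avgBytes = payloadBytes;            // first estimate
        } else if (count == RESAMPLE_AT) {
            avgBytes = sampleSum / RESAMPLE_AT; // second and final estimate
        }
    }

    // Total memory use = average payload size * number of payloads.
    long estimatedBytes() {
        return avgBytes * count;
    }
}

final class SampledSizeDemo {
    public static void main(String[] args) {
        SampledSizeEstimator e = new SampledSizeEstimator();
        long actual = 0;
        for (int i = 0; i < 1000; i++) {
            long size = i < 100 ? 100L : 10_000L; // payloads grow after the sample
            e.onInsert(size);
            actual += size;
        }
        // prints: estimated=100000 actual=9010000
        System.out.println("estimated=" + e.estimatedBytes() + " actual=" + actual);
    }
}
```

   In this made-up workload the estimate is roughly 90 times lower than the 
real footprint, because every payload after the 100th is assumed to have the 
sampled average size.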
   
   Problem: If the second estimate is too low, total memory use is undercounted, 
the map keeps accepting payloads past its real budget, and an OOM can happen.
   
   Plan: Could we add a flag to control whether the size is evaluated accurately?
   
   Currently, I have several ideas, but I'm not sure which one is best or 
whether there is a better one.
    # Estimate each payload and store its length alongside its value. When an 
update or remove happens, subtract the old length and add the new length if 
needed, so that the sum of all payload sizes stays exact. This is the method I 
currently use in prod.
    # Do not store the length, but estimate the old payload again when it is 
updated or removed. Compared with method one, this saves space at the cost of 
extra estimation time. It may perform better when updates and removes are 
rare. I didn't adopt this because I profiled the ingestion process with arthas 
and the flame graph suggested that size estimation may be time consuming 
there. I'm not sure whether that also holds in compaction. My intuition is 
that HoodieRecordPayload has a fairly simple structure.
    # A more accurate sampling approach: evaluate the whole map when its size 
reaches 1, 100, 10,000 and one million. Underestimation should be less likely 
when samples are taken over such large amounts of data.
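
   As a rough illustration of the first idea, the sketch below keeps an exact 
running total by storing each payload's measured size next to its value. It is 
not the code I run in prod and not a Hudi API: ExactSizeTrackingMap, Sized and 
the sizer function are hypothetical names, and the sizer is assumed to return 
the payload size in bytes. Key and map-entry overhead are ignored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical sketch of idea 1: exact size accounting on put/remove.
final class ExactSizeTrackingMap<K, V> {
    // A value together with the size measured when it was stored.
    private static final class Sized<T> {
        final T value;
        final long bytes;

        Sized(T value, long bytes) {
            this.value = value;
            this.bytes = bytes;
        }
    }

    private final Map<K, Sized<V>> map = new HashMap<>();
    private final ToLongFunction<V> sizer; // assumed per-payload size estimator
    private long totalBytes;

    ExactSizeTrackingMap(ToLongFunction<V> sizer) {
        this.sizer = sizer;
    }

    // Insert or update: add the new size, subtract the old one if present.
    V put(K key, V value) {
        long bytes = sizer.applyAsLong(value);
        Sized<V> old = map.put(key, new Sized<>(value, bytes));
        totalBytes += bytes;
        if (old == null) {
            return null;
        }
        totalBytes -= old.bytes;
        return old.value;
    }

    // Remove: subtract the stored size without measuring the payload again.
    V remove(K key) {
        Sized<V> old = map.remove(key);
        if (old == null) {
            return null;
        }
        totalBytes -= old.bytes;
        return old.value;
    }

    V get(K key) {
        Sized<V> s = map.get(key);
        return s == null ? null : s.value;
    }

    long totalBytes() {
        return totalBytes;
    }

    int size() {
        return map.size();
    }
}
```

   Each payload is measured once, on put. The cost is one extra long per entry, 
which is the space that the second idea avoids by measuring again on update or 
remove.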
   
   I look forward to any advice, suggestions, or discussion.
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-1796
   - Type: Improvement


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.