maytasm opened a new pull request #10623:
URL: https://github.com/apache/druid/pull/10623


   
   Fix string byte calculation in StringDimensionIndexer
   
   ### Description
   
   Druid ingestion has a guardrail, `maxBytesInMemory`. Its default value is 
1/6 of the maximum Java memory, which should prevent the ingestion task (peon) 
from running out of memory (OOM). For this to work, however, it relies on an 
accurate calculation of the bytes in memory held by the `facts` in the 
`FactsHolder` of the `OnheapIncrementalIndex`. When the raw data has many 
String dimensions, the ingestion task (peon) can still OOM with the default 
value of `maxBytesInMemory`. This is because 
`estimateEncodedKeyComponentSize` in `StringDimensionIndexer` underestimates 
the memory footprint of a String. According to 
https://www.ibm.com/developerworks/java/library/j-codetoheap/index.html, a 
String has the following memory usage: 28 bytes for the String metadata (class 
pointer, flags, locks, hash, count, offset, reference to char array) + 16 bytes 
for the char array metadata (class pointer, flags, locks, size) + 2 bytes for 
every character of the string. (Note that previously we only accounted for the 
2 bytes per character.)
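   
   To make the arithmetic concrete, here is a minimal sketch of the old and 
new per-String estimates using the figures quoted above. The class, constant, 
and method names are hypothetical and are not taken from the Druid code base; 
the 28- and 16-byte overheads come from the linked article and can differ by 
JVM version and settings.
   
```java
// Illustrative only: names are hypothetical, not Druid's actual implementation.
public final class StringFootprintSketch {
    // Assumed overheads, taken from the article cited in the description.
    static final int STRING_OBJECT_OVERHEAD_BYTES = 28; // String metadata
    static final int CHAR_ARRAY_OVERHEAD_BYTES = 16;    // char[] metadata

    private StringFootprintSketch() {
    }

    // Previous behaviour as described: only 2 bytes per character.
    static long charsOnlyEstimate(String s) {
        return s == null ? 0L : (long) s.length() * Character.BYTES;
    }

    // Estimate described in this PR: object overheads plus 2 bytes per character.
    static long fullEstimate(String s) {
        if (s == null) {
            return 0L;
        }
        return STRING_OBJECT_OVERHEAD_BYTES
               + CHAR_ARRAY_OVERHEAD_BYTES
               + (long) s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String value = "druid";
        System.out.println("chars only: " + charsOnlyEstimate(value)); // 10
        System.out.println("full:       " + fullEstimate(value));      // 54
    }
}
```
   
   For a five-character value the estimate goes from 10 to 54 bytes, so the 
fixed overhead dominates for short strings, which is why data with many String 
dimensions was underestimated.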
   
   This PR has:
   - [x] been self-reviewed.
   - [x] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in 
[licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [x] been tested in a test Druid cluster.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



