maytasm opened a new pull request #10623: URL: https://github.com/apache/druid/pull/10623
Fix string byte calculation in StringDimensionIndexer

### Description

Druid ingestion has a guardrail, `maxBytesInMemory`. Its default value is 1/6 of the maximum Java memory, which should prevent the ingestion task (peon) from running out of memory (OOM). For this to work, however, the bytes in memory must be calculated accurately from the `facts` in the `FactsHolder` of the `OnheapIncrementalIndex`. When the raw data has many String dimensions, the ingestion task (peon) can still OOM with the default value of `maxBytesInMemory`. This is because `estimateEncodedKeyComponentSize` in `StringDimensionIndexer` underestimates the memory footprint of a String.

According to https://www.ibm.com/developerworks/java/library/j-codetoheap/index.html, a String has the following memory usage:

- 28 bytes for the String metadata (class pointer, flags, locks, hash, count, offset, reference to char array), plus
- 16 bytes for the char array metadata (class pointer, flags, locks, size), plus
- 2 bytes for every character of the string.

(Note that previously we only accounted for the 2 bytes per character.)

This PR has:

- [x] been self-reviewed.
- [x] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
- [x] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [x] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
- [ ] added integration tests.
- [x] been tested in a test Druid cluster.
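For illustration, the per-String arithmetic described above can be sketched as follows. This is a minimal sketch, not the code changed in this PR: the class, constant, and method names are hypothetical, and the byte counts are simply the figures quoted from the linked article (actual object layout varies by JVM version and flags).

```java
// Hypothetical sketch of the estimate described in the PR text.
// None of these names come from Druid; the constants are the figures
// quoted from the IBM article and are assumptions about the JVM layout.
final class StringFootprintSketch {
    static final int STRING_METADATA_BYTES = 28;     // class pointer, flags, locks, hash, count, offset, char[] reference
    static final int CHAR_ARRAY_METADATA_BYTES = 16; // class pointer, flags, locks, size
    static final int BYTES_PER_CHAR = 2;

    // Previous behaviour as described in the PR: only the characters are counted.
    static long charsOnlyEstimate(String s) {
        return (long) s.length() * BYTES_PER_CHAR;
    }

    // Estimate described in the PR: fixed per-String overhead plus the characters.
    static long fullEstimate(String s) {
        return STRING_METADATA_BYTES
             + CHAR_ARRAY_METADATA_BYTES
             + (long) s.length() * BYTES_PER_CHAR;
    }

    public static void main(String[] args) {
        String value = "druid"; // 5 characters
        System.out.println(charsOnlyEstimate(value)); // 10
        System.out.println(fullEstimate(value));      // 54
    }
}
```

For short dimension values the fixed 44 bytes dominate, which is why counting only the characters can leave the `maxBytesInMemory` check far below the real heap usage.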