itschrispeck opened a new pull request, #13308: URL: https://github.com/apache/pinot/pull/13308
This PR contains two changes:

1. Prevent duplicates in the realtime Lucene index to fix query-time `IndexOutOfBounds` exceptions.
2. Use `NRTCachingDirectory` for the realtime segment.

For the first, I had seen that the `IndexOutOfBounds` exception is thrown by `mappingBuffer.getInt(luceneDocId)`, while the mapping file is loaded only for the range `[0, numDocsFromSegment * 4 bytes]`. Therefore, if the Lucene index contains duplicates, it holds more documents than the segment does, and eventually we will call `getInt` for a `luceneDocId` that is `numDocsFromSegment` or larger, causing the exception.

For the second, NRT caching is beneficial when the refresh rate is high, since frequent refreshes result in many tiny files being written. `NRTCachingDirectory` provides a configurable in-memory buffer that caches these small writes and avoids many small files and high file descriptor (FD) usage.

Tested in an internal cluster.

suggested tags: `bugfix`, `enhancement`
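
To illustrate the first failure mode, here is a minimal sketch that uses a plain `java.nio.ByteBuffer` as a stand-in for the mapping file. It is not Pinot's actual code: the class `MappingBufferSketch` and its methods are hypothetical, and it assumes each entry is a 4-byte int read at byte offset `luceneDocId * 4`. The buffer only has entries for `numDocsFromSegment` documents, so any Lucene doc ID created by a duplicate falls past the end of the buffer.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the docId mapping file; not Pinot's actual classes.
class MappingBufferSketch {

  // Builds a mapping with numDocsFromSegment int entries (4 bytes each),
  // where entry i holds the segment docId for Lucene docId i.
  static ByteBuffer buildMapping(int numDocsFromSegment) {
    ByteBuffer buffer = ByteBuffer.allocate(numDocsFromSegment * Integer.BYTES);
    for (int i = 0; i < numDocsFromSegment; i++) {
      buffer.putInt(i * Integer.BYTES, i);
    }
    return buffer;
  }

  // Absolute read of one entry; throws IndexOutOfBoundsException
  // when the entry lies past the end of the buffer.
  static int lookup(ByteBuffer mapping, int luceneDocId) {
    return mapping.getInt(luceneDocId * Integer.BYTES);
  }

  // Returns true if the lookup fails with IndexOutOfBoundsException.
  static boolean lookupFails(ByteBuffer mapping, int luceneDocId) {
    try {
      lookup(mapping, luceneDocId);
      return false;
    } catch (IndexOutOfBoundsException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    ByteBuffer mapping = buildMapping(4);          // segment has 4 docs
    System.out.println(lookup(mapping, 3));        // last valid Lucene docId
    System.out.println(lookupFails(mapping, 4));   // docId produced by a duplicate
  }
}
```

With four segment documents, Lucene doc IDs 0 through 3 resolve normally, while ID 4 (which can only exist if the Lucene index holds an extra, duplicated document) is out of bounds.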