itschrispeck opened a new pull request, #13308:
URL: https://github.com/apache/pinot/pull/13308

   This PR contains two changes: 
   1. Prevent duplicates in the realtime Lucene index to fix query-time 
`IndexOutOfBounds` exceptions
   2. Use NRTCachingDirectory for the realtime segment
   
   For the first, I had seen that the `IndexOutOfBounds` exception is thrown by 
`mappingBuffer.getInt(luceneDocId)`, because the mapping file is only loaded for 
the range `[0, numDocsFromSegment * 4 bytes]`. If the Lucene index contains 
duplicates, it holds more documents than the segment, so eventually we try to 
`getInt` for a `luceneDocId` that is at or beyond `numDocsFromSegment`, causing 
the exception.
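
   The failure mode described above can be sketched in isolation. This is only 
an illustration, not Pinot's code: it assumes a plain `java.nio.ByteBuffer` as a 
stand-in for the mapping buffer, and the class `DocIdMappingSketch` and its 
methods are hypothetical names. The explicit multiplication by 4 bytes is also an 
assumption made so the sketch is self-contained.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for the Lucene docId -> segment docId mapping file.
// Not Pinot's actual classes; java.nio.ByteBuffer replaces the real buffer type.
class DocIdMappingSketch {

    // Allocates numDocsFromSegment * 4 bytes: one int entry per segment document.
    static ByteBuffer buildMapping(int numDocsFromSegment) {
        ByteBuffer buffer = ByteBuffer.allocate(numDocsFromSegment * Integer.BYTES);
        for (int docId = 0; docId < numDocsFromSegment; docId++) {
            buffer.putInt(docId * Integer.BYTES, docId);
        }
        return buffer;
    }

    // Absolute read of the entry for a Lucene docId (assumed 4-byte stride).
    static int lookup(ByteBuffer mapping, int luceneDocId) {
        return mapping.getInt(luceneDocId * Integer.BYTES);
    }

    // Returns true when the lookup runs past the end of the loaded range.
    static boolean lookupFails(ByteBuffer mapping, int luceneDocId) {
        try {
            lookup(mapping, luceneDocId);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int numDocsFromSegment = 3;
        ByteBuffer mapping = buildMapping(numDocsFromSegment);

        // Without duplicates, Lucene docIds stay within 0..2 and every lookup succeeds.
        System.out.println(lookup(mapping, 2));

        // With one duplicate, Lucene holds 4 documents, so docId 3 can be returned
        // by a query even though the mapping only covers 3 entries.
        System.out.println(lookupFails(mapping, 3));
    }
}
```

   Preventing duplicates keeps the Lucene document count equal to the segment 
document count, so every `luceneDocId` stays inside the loaded range.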
   
   For the second, a high refresh rate results in many tiny files being written. 
`NRTCachingDirectory` provides a configurable in-memory buffer that caches these 
small writes, avoiding many small files and high file descriptor (FD) usage. 
   
   Tested in an internal cluster. 
   
   suggested tags: `bugfix`, `enhancement`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


