itschrispeck opened a new pull request, #12744: URL: https://github.com/apache/pinot/pull/12744
**Problem:** We typically see long (7-10 min) segment build times when using the Lucene text index with 1-1.5 GB segments. 70-80% of this time is spent building the Lucene text index.

**Background:** In the existing implementation, the Lucene index stores Pinot docIds: for the mutable segment these are the mutable docIds, and for the immutable segment each row is stored with its new docId. Lucene queries return the matching Lucene docIds, which we translate to Pinot docIds on the fly for the mutable index, or through a mapping file for the immutable index.

**Change Summary:** This change copies and reuses the mutable Lucene index during realtime segment conversion instead of building a new Lucene index. To handle a potential docId change, `sortedDocIds` is added to `IndexCreationContext` and used to compute a temporary mapping between the mutable docId and the immutable segment's docId. This temporary mapping is then used to build the mapping file between the Lucene docId and the new immutable segment's docId. The mapping file is therefore built during segment conversion, rather than during segment load as in the traditional path.

Internally we've seen a roughly 40-60% improvement in overall segment build time. In the graph below, the lower ingestion delay peaks are from a table/tenant with this change; the higher peaks are from an identical table in a tenant without this change:

<img width="1017" alt="image" src="https://github.com/apache/pinot/assets/27231838/0ab23a4c-f7d3-4332-9c5b-e662925c6f9c">

**Testing:** Deployed internally, tested locally, and validated basic pause/restart/reload operations on a table to ensure no regression in the `TextIndexHandler` index build.

tags: ingestion `performance`
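
For readers unfamiliar with the remapping step, here is a minimal sketch of the idea, not the code in this PR. It assumes `sortedDocIds[i]` holds the mutable docId of the row written at immutable docId `i`, and that the copied Lucene index can be read as an array from Lucene docId to stored mutable docId. The class and method names (`DocIdRemapSketch`, `invert`, `buildLuceneMapping`) are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only; names and array layouts are assumptions, not Pinot APIs.
final class DocIdRemapSketch {

  // Temporary mapping: mutable docId -> immutable docId.
  // Assumes sortedDocIds[immutableDocId] == mutableDocId.
  static int[] invert(int[] sortedDocIds) {
    int[] mutableToImmutable = new int[sortedDocIds.length];
    for (int immutableDocId = 0; immutableDocId < sortedDocIds.length; immutableDocId++) {
      mutableToImmutable[sortedDocIds[immutableDocId]] = immutableDocId;
    }
    return mutableToImmutable;
  }

  // Final mapping: Lucene docId -> immutable docId.
  // luceneToMutable[luceneDocId] is the mutable docId stored in the copied index.
  static int[] buildLuceneMapping(int[] luceneToMutable, int[] sortedDocIds) {
    if (sortedDocIds == null) {
      // Assumed case: rows were not reordered, so the docIds carry over unchanged.
      return luceneToMutable.clone();
    }
    int[] mutableToImmutable = invert(sortedDocIds);
    int[] luceneToImmutable = new int[luceneToMutable.length];
    for (int luceneDocId = 0; luceneDocId < luceneToMutable.length; luceneDocId++) {
      luceneToImmutable[luceneDocId] = mutableToImmutable[luceneToMutable[luceneDocId]];
    }
    return luceneToImmutable;
  }

  public static void main(String[] args) {
    int[] sortedDocIds = {2, 0, 3, 1};    // immutable docId 0 is mutable doc 2, and so on
    int[] luceneToMutable = {0, 1, 2, 3}; // hypothetical contents of the copied index
    System.out.println(Arrays.toString(buildLuceneMapping(luceneToMutable, sortedDocIds)));
  }
}
```

Both passes are linear in the number of documents, which is consistent with building the mapping being much cheaper than re-indexing the text column.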
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org