[ https://issues.apache.org/jira/browse/LUCENE-8980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Guoqiang Jiang updated LUCENE-8980:
-----------------------------------
    Description:
*Description*
In Elasticsearch, which is built on Lucene, each document has an _id field that uniquely identifies it; the field is indexed so that documents can be looked up through Lucene. When users write data with self-generated _id values, Elasticsearch has to check _id uniqueness through the Lucene API for every document, even if the conflict rate is very low, which results in poor write performance.

*Solution*
Lucene stores minTerm and maxTerm statistics for each field in each segment, and we can use them to speed up the Lucene lookup API. When SegmentTermsEnum.seekExact() is called to look up a term in a segment, we can first check whether the term falls within the range [minTerm, maxTerm]; if it does not, the segment cannot contain the term and can be skipped immediately.

  was:
*Description*
In Elasticsearch, which is built on Lucene, each document has an _id field that uniquely identifies it; the field is indexed so that documents can be looked up through Lucene. When users write Elasticsearch with self-generated _id values, Elasticsearch has to check _id uniqueness through the Lucene API for every document, even if the conflict rate is very low, which results in poor write performance.

*Solution*
Lucene stores minTerm and maxTerm statistics for each field in each segment, and we can use them to speed up the Lucene lookup API. When SegmentTermsEnum.seekExact() is called to look up a term in a segment, we can first check whether the term falls within the range [minTerm, maxTerm]; if it does not, the segment cannot contain the term and can be skipped immediately.
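The range check proposed under *Solution* can be illustrated with a minimal, self-contained Java sketch. It is not the patch attached to this issue and does not use Lucene classes: SegmentTermRange, TermRangeSketch, candidateSegments and the sample segment names and _id values are all hypothetical. The only assumption carried over from Lucene is that terms are ordered as unsigned bytes, which is how the per-field minimum and maximum exposed by Terms.getMin() and Terms.getMax() are defined.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the min/max term statistics of one field in one segment.
final class SegmentTermRange {
    final String name;
    final byte[] minTerm;
    final byte[] maxTerm;

    SegmentTermRange(String name, String minTerm, String maxTerm) {
        this.name = name;
        this.minTerm = minTerm.getBytes(StandardCharsets.UTF_8);
        this.maxTerm = maxTerm.getBytes(StandardCharsets.UTF_8);
    }

    // True if the term lies in [minTerm, maxTerm] under unsigned byte order.
    // This is only a necessary condition: a term inside the range may still be absent.
    boolean mayContain(byte[] term) {
        return Arrays.compareUnsigned(minTerm, term) <= 0
            && Arrays.compareUnsigned(term, maxTerm) <= 0;
    }
}

class TermRangeSketch {
    // Returns the names of the segments that still need a real term lookup.
    static List<String> candidateSegments(List<SegmentTermRange> segments, String id) {
        byte[] term = id.getBytes(StandardCharsets.UTF_8);
        List<String> candidates = new ArrayList<>();
        for (SegmentTermRange segment : segments) {
            if (segment.mayContain(term)) {
                candidates.add(segment.name);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<SegmentTermRange> segments = List.of(
            new SegmentTermRange("_0", "doc-0001", "doc-0999"),
            new SegmentTermRange("_1", "doc-1000", "doc-1999"),
            new SegmentTermRange("_2", "doc-0500", "doc-1500"));
        // Segment _0 is skipped because "doc-1200" sorts after its maxTerm.
        System.out.println(candidateSegments(segments, "doc-1200")); // prints [_1, _2]
    }
}
```

The check only pays off when the _id ranges of different segments overlap little, for example with roughly increasing ids. If ids are spread randomly across the key space, most segments will contain the range and the full lookup is still needed.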
> Optimise SegmentTermsEnum.seekExact performance
> -----------------------------------------------
>
>                 Key: LUCENE-8980
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8980
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/codecs
>    Affects Versions: 8.2
>            Reporter: Guoqiang Jiang
>            Assignee: David Wayne Smiley
>            Priority: Major
>              Labels: performance
>             Fix For: master (9.0)
>
>          Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> *Description*
> In Elasticsearch, which is built on Lucene, each document has an _id field
> that uniquely identifies it; the field is indexed so that documents can be
> looked up through Lucene. When users write data with self-generated _id
> values, Elasticsearch has to check _id uniqueness through the Lucene API for
> every document, even if the conflict rate is very low, which results in poor
> write performance.
>
> *Solution*
> Lucene stores minTerm and maxTerm statistics for each field in each segment,
> and we can use them to speed up the Lucene lookup API. When
> SegmentTermsEnum.seekExact() is called to look up a term in a segment, we can
> first check whether the term falls within the range [minTerm, maxTerm]; if it
> does not, the segment cannot contain the term and can be skipped immediately.