jgq2008303393 opened a new pull request #884: LUCENE-8980: optimise 
SegmentTermsEnum.seekExact performance
URL: https://github.com/apache/lucene-solr/pull/884
 
 
   # Description
   In Elasticsearch, each document has an _id field that uniquely identifies 
it and is indexed so that documents can be looked up through Lucene. When users 
write to Elasticsearch with self-generated _id values, Elasticsearch has to 
check _id uniqueness through the Lucene API for every document, even if the 
conflict rate is very low, which results in poor write performance.
   
   # Solution
   1. Choose a better _id generator before writing to Elasticsearch
   Different _id formats have a great impact on write performance; we have 
verified this in a production cluster. Users can refer to the following blog 
post and choose a better _id generator.
   
http://blog.mikemccandless.com/2014/05/choosing-fast-unique-identifier-uuid.html
   2. Optimise with min/maxTerm metrics in Lucene
   Lucene stores the minTerm and maxTerm for each segment and field, so we can 
use them to speed up the Lucene lookup API. When SegmentTermsEnum.seekExact() 
is called to look up a term in a segment, we can first check whether the term 
falls within the range [minTerm, maxTerm]; if it does not, the segment cannot 
contain the term and can be skipped immediately.
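
The range check described in step 2 can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code in this pull request: the class `MinMaxTermCheck` and its methods are hypothetical names, terms are modelled as raw byte arrays, and they are compared as unsigned bytes to mirror Lucene's term ordering.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper (not part of Lucene): per-segment min/max term pre-check. */
final class MinMaxTermCheck {
    private final byte[] minTerm; // smallest term stored in the segment for a field
    private final byte[] maxTerm; // largest term stored in the segment for a field

    MinMaxTermCheck(byte[] minTerm, byte[] maxTerm) {
        this.minTerm = minTerm.clone();
        this.maxTerm = maxTerm.clone();
    }

    /**
     * Returns false only when the term is certainly absent from the segment.
     * A true result means the full (more expensive) lookup is still required.
     */
    boolean mayContain(byte[] term) {
        // Unsigned lexicographic comparison, matching how terms are ordered.
        return Arrays.compareUnsigned(term, minTerm) >= 0
            && Arrays.compareUnsigned(term, maxTerm) <= 0;
    }

    /** Convenience for the example: encode a string as UTF-8 bytes. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MinMaxTermCheck segment = new MinMaxTermCheck(utf8("doc-100"), utf8("doc-199"));
        System.out.println(segment.mayContain(utf8("doc-150"))); // true: must seek
        System.out.println(segment.mayContain(utf8("doc-250"))); // false: skip segment
    }
}
```

A caller would run this check once per segment before the real term lookup. The check can only rule a segment out, so a term inside the range may still be missing from the segment.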
   
   
   # Tests
   I ran a write benchmark using _id values in UUID V1 format; the results are 
as follows:
   
   | Branch | Write speed after 4h | CPU cost | Overall improvement | Write speed after 8h | CPU cost | Overall improvement |
   | ---------- | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
   | Original Lucene | 29.9w/s | 68.4% | N/A | 26.7w/s | 66.6% | N/A |
   | Optimised Lucene | 34.5w/s (+15.4%) | 63.8% (-6.7%) | +22.1% | 31.5w/s (+18.0%) | 61.5% (-7.7%) | +25.7% |
   
   As shown above, after 8 hours of continuous writing, write speed improves by 
18.0% and CPU cost decreases by 7.7%; the overall improvement of 25.7% matches 
the sum of these two figures. The Elasticsearch GET API and ids query should 
see similar performance improvements.
   
   Note that the benchmark needs to run continuously for several hours, 
because the improvement is not obvious when the data is completely cached or 
the number of segments is too small.
   

