gf2121 opened a new pull request, #12573:
URL: https://github.com/apache/lucene/pull/12573
### Description
Recently, we captured a flame graph in a scene with frequent updates, which
showed that sorting deleted terms occupied a high CPU ratio.
Currently, we use JDK sort to sort deleted terms, compare Term with
Term#field, and then compare Term#bytes :
```
@Override
public final int compareTo(Term other) {
if (field.equals(other.field)) {
return bytes.compareTo(other.bytes);
} else {
return field.compareTo(other.field);
}
}
```
In scenarios with many deleted terms, most deleted terms have the same field
name. So a data structure like `Map<String, Map<BytesRef, Integer>>` instead of
`Map<Term, Integer>` could be better here —— We can avoid field name compare
and use `MSBRadixSort` to sort the bytes for each field.
Further more, we can take advantage of `BytesRefHash` here to implement
Map<BytesRef, Integer>. This can help get a more efficient memory layout, and
there's already MSBRadixSort impl in `BytesRefHash` we can reuse.
### Benchmark
I run a benchmark that update on field 1,000,000 times, total took decreased
35%.
<!--StartFragment--><byte-sheet-html-origin data-id="1695204360121"
data-version="4" data-is-embed="false" data-grid-line-hidden="false"
data-importRangeRawData-spreadSource="https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh"
data-importRangeRawData-range="'Sheet1'!A1:C2">
<!--StartFragment--><byte-sheet-html-origin data-id="1695204710175"
data-version="4" data-is-embed="false" data-grid-line-hidden="false"
data-importRangeRawData-spreadSource="https://bytedance.feishu.cn/sheets/LcVzsNpiphJeqjtIGi5c738Jnyh"
data-importRangeRawData-range="'Sheet1'!A1:D2">
| Baseline(default iw config) | Candidate (default iw config) | Took Diff
-- | -- | -- | --
total took | 3402 | 2202 | -35.27%
</byte-sheet-html-origin><!--EndFragment-->
</byte-sheet-html-origin><!--EndFragment-->
**benchmark code**
```
Directory dir =
FSDirectory.open(Paths.get("./test_index"));
IndexWriter w =
new IndexWriter(
dir,
new IndexWriterConfig()
.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
.setMergePolicy(NoMergePolicy.INSTANCE)
);
long start = System.currentTimeMillis();
for (int i = 0; i < 1000000; i++) {
Document doc = new Document();
doc.add(new StringField("id", i + "", Field.Store.NO));
Term delTerm = new Term("id", "x" + i);
w.updateDocument(delTerm, doc);
}
long end = System.currentTimeMillis();
System.out.println(end - start);
w.commit();
w.close();
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]