rishabhmaurya commented on code in PR #13123:
URL: https://github.com/apache/lucene/pull/13123#discussion_r1513889998
##########
lucene/misc/src/java/org/apache/lucene/misc/index/BPIndexReorderer.java:
##########
@@ -341,116 +344,94 @@ protected void compute() {
      */
     private boolean shuffle(
         ForwardIndex forwardIndex,
-        IntsRef left,
-        IntsRef right,
+        IntsRef docIDs,
+        int midPoint,
         int[] leftDocFreqs,
         int[] rightDocFreqs,
-        float[] gains,
+        float[] biases,
         int iter)
         throws IOException {
-      assert left.ints == right.ints;
-      assert left.offset + left.length == right.offset;
-      // Computing gains is typically a bottleneck, because each iteration needs to iterate over all
-      // postings to recompute gains, and the total number of postings is usually one order of
+      // Computing biases is typically a bottleneck, because each iteration needs to iterate over
+      // all postings to recompute biases, and the total number of postings is usually one order of
       // magnitude or more larger than the number of docs. So we try to parallelize it.
-      ComputeGainsTask leftGainsTask =
-          new ComputeGainsTask(
-              left.ints,
-              gains,
-              left.offset,
-              left.offset + left.length,
+      new ComputeBiasTask(
+              docIDs.ints,
+              biases,
+              docIDs.offset,
+              docIDs.offset + docIDs.length,
               leftDocFreqs,
               rightDocFreqs,
               threadLocal,
-              depth);
-      ComputeGainsTask rightGainsTask =
-          new ComputeGainsTask(
-              right.ints,
-              gains,
-              right.offset,
-              right.offset + right.length,
-              rightDocFreqs,
-              leftDocFreqs,
-              threadLocal,
-              depth);
-      if (shouldFork(docIDs.length, docIDs.ints.length)) {
-        invokeAll(leftGainsTask, rightGainsTask);
-      } else {
-        leftGainsTask.compute();
-        rightGainsTask.compute();
+              depth)
+          .compute();
+
+      float maxLeftBias = Float.NEGATIVE_INFINITY;
+      for (int i = docIDs.offset; i < midPoint; ++i) {
+        maxLeftBias = Math.max(maxLeftBias, biases[i]);
+      }
+      float minRightBias = Float.POSITIVE_INFINITY;
+      for (int i = midPoint, end = docIDs.offset + docIDs.length; i < end; ++i) {
+        minRightBias = Math.min(minRightBias, biases[i]);
+      }
+      float gain = maxLeftBias - minRightBias;
+      // This uses the simulated annealing proposed by Mackenzie et al in "Tradeoff Options for
+      // Bipartite Graph Partitioning" by comparing the gain of swapping the doc from the left side
+      // that is most attracted to the right and the doc from the right side that is most attracted
+      // to the left against `iter` rather than zero.
+      if (gain <= iter) {
+        return false;
       }
-      class ByDescendingGainSorter extends IntroSorter {
+      new IntroSelector() {
         int pivotDoc;
-        float pivotGain;
+        float pivotBias;
         @Override
         protected void setPivot(int i) {
-          pivotDoc = left.ints[i];
-          pivotGain = gains[i];
+          pivotDoc = docIDs.ints[i];
+          pivotBias = biases[i];
         }
         @Override
         protected int comparePivot(int j) {
-          // Compare in reverse order to get a descending sort
-          int cmp = Float.compare(gains[j], pivotGain);
+          int cmp = Float.compare(pivotBias, biases[j]);
Review Comment:
@jpountz To be precise, this is what I tried in order to avoid swaps, as per the paper:
```java
int cmp = Float.compare(pivotBias - iter, biases[j]);
if (cmp > 0) {
  return cmp;
}
cmp = Float.compare(pivotBias, biases[j]);
if (cmp >= 0) {
  // Tie break on the doc ID to preserve doc ID ordering as much as possible
  cmp = pivotDoc - docIDs.ints[j];
}
return cmp;
```
It won't be a precise kth-largest selection, but it will be close to it, with the tolerance governed by `iter`.
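To make the effect of that tolerance more concrete, here is a minimal, self-contained sketch (purely illustrative, not the PR's code: the class and method names are made up, and it uses a plain sort instead of Lucene's `IntroSelector`). It only shows how raising `iter` shrinks the number of candidate swaps across the midpoint, which is the mechanism behind the lower swaps-to-docs ratio reported below:
```java
// Hypothetical sketch: counts how many doc pairs would still be swapped across the
// midpoint when a swap is only worthwhile if its combined bias gain exceeds `iter`.
// `biases`, `midPoint` and `iter` mirror the PR's variables; everything else is made up.
import java.util.Arrays;

public class SwapToleranceSketch {

  static int countSwaps(float[] biases, int midPoint, int iter) {
    // Left half: docs most attracted to the right side have the largest biases.
    float[] left = Arrays.copyOfRange(biases, 0, midPoint);
    // Right half: docs most attracted to the left side have the smallest biases.
    float[] right = Arrays.copyOfRange(biases, midPoint, biases.length);
    Arrays.sort(left); // ascending: walk from the end for the largest left biases
    Arrays.sort(right); // ascending: walk from the start for the smallest right biases

    int swaps = 0;
    for (int i = 0; i < Math.min(left.length, right.length); i++) {
      float gain = left[left.length - 1 - i] - right[i];
      if (gain <= iter) {
        break; // remaining pairs have an even smaller gain, so stop swapping
      }
      swaps++;
    }
    return swaps;
  }

  public static void main(String[] args) {
    float[] biases = {5f, 3f, 0.5f, -0.2f, /* midPoint */ -4f, -1f, 0.4f, 2f};
    System.out.println("iter=0 -> swaps: " + countSwaps(biases, 4, 0)); // prints 3
    System.out.println("iter=2 -> swaps: " + countSwaps(biases, 4, 2)); // prints 2
  }
}
```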
I used the [http_logs](https://github.com/opensearch-project/opensearch-benchmark-workloads/blob/main/http_logs) workload and I see results similar to those described in the paper with respect to the time taken to perform reordering.
For the benchmark setup, I use just a single shard and force-merge into one segment after indexing; the corpus used is `logs-231998`.
Here are the results:
### With this change
```shell
Final index size: 686.3mb
curl localhost:9200/_cat/indices
green open logs-231998 T-HpeIWARuSZFGa74PWOig 1 0 11961342 0 686.3mb 686.3mb
```
I'm publishing some additional metrics [here](https://github.com/rishabhmaurya/lucene/commit/fda8f19ac691af832e4bd7b4722324a4f14971bf#diff-43ddf233a75c4de53d99ec83e5b52a5eff0c39e8d601ae6816db49c7c14f785aR1154); you can see that the **ratio of swaps to docs is 8**:
```
Reorder stats
totalDocs: 11961342; numTerms: 964; totalPostings: 88726436; totalSwapsPerformed: 96188516; postingsDocRatio: 7; swapsDocsRatio: 8
12816206 9489609 12087839 12496439 9235544 9035264 8148933 6569703 5403994 3803447 2775037 1757719 1117468 668304 374775 216838 120914 70483 0 0 0 0 0 0 0
```
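(For reference: 96188516 swaps over 11961342 docs ≈ 8, which is where the reported `swapsDocsRatio` comes from.)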
| Metric | Task | Value | Unit |
| ---: | ---: | ---: | ---: |
| Cumulative merge time of primary shards | | 9.23488 | min |
| Cumulative merge count of primary shards | | 20 | |
### Without this change
```shell
Final index size: 840.2mb
curl localhost:9200/_cat/indices
green open logs-231998 wZ8LNHeDQaCiLgOkJTjKxg 1 0 11961342 0 840.2mb 840.2mb
```
The **ratio of swaps to docs is 47**:
```
Reorder stats
totalDocs: 11961342; numTerms: 964; totalPostings: 88726436; totalSwapsPerformed: 571132294; postingsDocRatio: 7; swapsDocsRatio: 47
47254463 55941681 48415352 38245831 39694052 36558063 37907697 33234795 32316838 30361191 27886028 26467715 24669276 22745648 21136997 17510104 16757532 14029031 0 0 0 0 0 0 0
```
| Metric | Task | Value | Unit |
| ---: | ---: | ---: | ---: |
| Cumulative merge time of primary shards | | 15.5215 | min |
| Cumulative merge count of primary shards | | 20 | |
So the overall merge time dropped from 15.5 min to 9.23 min, and the ratio of swaps performed to docs also dropped significantly.
### Without reordering
```
Final index size: 923.8mb
curl localhost:9200/_cat/indices
green open logs-231998 O0bppXZUREadOqVz80qJBw 1 0 11961342 0 923.8mb
```
| Metric | Task | Value | Unit |
| ---: | ---: | ---: | ---: |
| Cumulative merge time of primary shards | | 1.13032 | min |
| Cumulative merge count of primary shards | | 21 | |