zacharymorn commented on PR #12194:
URL: https://github.com/apache/lucene/pull/12194#issuecomment-1474743257
> Maybe we could try to leverage the geonames dataset (there's a few
benchmarks for it in lucene-util), which has a few low-cardinality fields like
the time zone or country. Then enable index sorting on these fields. And make
sure we're getting performance when using the time zone in required or excluded
clauses?
Thanks @jpountz for the suggestion! I actually did something similar by
modifying the benchmarking code to support sorting index by `month` field,
which is a `KeywordField`, and creating corresponding query with the following
syntax to exercise the `ReqExclScorer - Posting - DocValue` path that I have
changed:
`ReqExclWithDocValues: newbenchmarktest//names docvalues:-month:[Jan TO Aug]`
However, while I was able to run this query, the test actually failed when
verifying top matches & scores between baseline and modified code. Upon further
digging, I realized a potential issue to this API idea. As Lucene does a lot of
two phase iterations, and two phase iterator's approximation may provide a
superset of the actual matches. If we were to use this API to find and ignore /
skip over a bunch of doc ids from approximation, wouldn't the result be
inaccurate?
For example, in the below `skippingReqApproximation` disi I put inside
`ReqExclScorer`, `exclApproximation.peekNextNonMatchingDocID()` may provide an
inaccurate, longer run of matches as approximation provides superset matches.
If we were to further confirm the boundary by actually checking matches on its
iterator, then we basically resort to linear scan on the iterator, which
defeats the purpose of this new API?
```
final DocIdSetIterator skippingReqApproximation =
new DocIdSetIterator() {
@Override
public int docID() {}
@Override
public int nextDoc() throws IOException {}
@Override
public int advance(int target) throws IOException {
// this exclNonMatchingDoc could be inaccruate, as
exclApproximation provides a superset of matches than its underlying iterator
int exclNonMatchingDoc =
exclApproximation.peekNextNonMatchingDocID();
if (exclApproximation.docID() < target
&& target < exclNonMatchingDoc) {
return reqApproximation.advance(exclNonMatchingDoc);
} else {
return reqApproximation.advance(target);
}
}
@Override
public long cost() {}
};
```
Please let me know if you have any suggestion on this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]