jpountz commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1235332565
##########
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##########
@@ -86,19 +92,58 @@ public TermStates(
   * @param needsStats if {@code true} then all leaf contexts will be visited up-front to collect
   *     term statistics. Otherwise, the {@link TermState} objects will be built only when requested
   */
-  public static TermStates build(IndexReaderContext context, Term term, boolean needsStats)
+  public static TermStates build(
+      IndexSearcher indexSearcher, IndexReaderContext context, Term term, boolean needsStats)
       throws IOException {
     assert context != null && context.isTopLevel;
     final TermStates perReaderTermState = new TermStates(needsStats ? null : term, context);
     if (needsStats) {
-      for (final LeafReaderContext ctx : context.leaves()) {
-        // if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-        TermsEnum termsEnum = loadTermsEnum(ctx, term);
-        if (termsEnum != null) {
-          final TermState termState = termsEnum.termState();
-          // if (DEBUG) System.out.println("    found");
-          perReaderTermState.register(
-              termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
+      Executor executor = indexSearcher.getExecutor();
+      boolean isShutdown = false;
+      if (executor instanceof ExecutorService) {
+        isShutdown = ((ExecutorService) executor).isShutdown();
+      }
+      if (executor != null && isShutdown == false) {
+        // build term states concurrently
+        List<FutureTask<Integer>> tasks =
+            context.leaves().stream()
+                .map(
+                    ctx ->
+                        new FutureTask<>(
+                            () -> {
+                              TermsEnum termsEnum = loadTermsEnum(ctx, term);
+                              if (termsEnum != null) {
+                                final TermState termState = termsEnum.termState();
+                                perReaderTermState.register(
+                                    termState,
+                                    ctx.ord,
+                                    termsEnum.docFreq(),
+                                    termsEnum.totalTermFreq());
+                              }
+                              return 0;
+                            }))
+                .toList();
+        for (FutureTask<Integer> task : tasks) {
+          executor.execute(task);
Review Comment:
Actually, I think it is a good idea to ignore slices here and parallelize at the
segment level. Slices are computed with query processing in mind, where the cost
of processing a query is mostly a function of the number of docs, so
IndexSearcher tries to create slices that have approximately the same number of
docs. The cost of terms dictionary lookups, on the other hand, depends much less
on the number of docs than on the number of segments, so parallelizing per
segment rather than per slice makes sense to me.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]