jpountz commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1235420248
##########
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##########
@@ -86,19 +92,58 @@ public TermStates(
   * @param needsStats if {@code true} then all leaf contexts will be visited up-front to collect
   *     term statistics. Otherwise, the {@link TermState} objects will be built only when requested
   */
-  public static TermStates build(IndexReaderContext context, Term term, boolean needsStats)
+  public static TermStates build(
+      IndexSearcher indexSearcher, IndexReaderContext context, Term term, boolean needsStats)
       throws IOException {
     assert context != null && context.isTopLevel;
     final TermStates perReaderTermState = new TermStates(needsStats ? null : term, context);
     if (needsStats) {
-      for (final LeafReaderContext ctx : context.leaves()) {
-        // if (DEBUG) System.out.println("  r=" + leaves[i].reader);
-        TermsEnum termsEnum = loadTermsEnum(ctx, term);
-        if (termsEnum != null) {
-          final TermState termState = termsEnum.termState();
-          // if (DEBUG) System.out.println("    found");
-          perReaderTermState.register(
-              termState, ctx.ord, termsEnum.docFreq(), termsEnum.totalTermFreq());
+      Executor executor = indexSearcher.getExecutor();
+      boolean isShutdown = false;
+      if (executor instanceof ExecutorService) {
+        isShutdown = ((ExecutorService) executor).isShutdown();
+      }
+      if (executor != null && isShutdown == false) {
+        // build term states concurrently
+        List<FutureTask<Integer>> tasks =
+            context.leaves().stream()
+                .map(
+                    ctx ->
+                        new FutureTask<>(
+                            () -> {
+                              TermsEnum termsEnum = loadTermsEnum(ctx, term);
+                              if (termsEnum != null) {
+                                final TermState termState = termsEnum.termState();
+                                perReaderTermState.register(
+                                    termState,
+                                    ctx.ord,
+                                    termsEnum.docFreq(),
+                                    termsEnum.totalTermFreq());
+                              }
+                              return 0;
+                            }))
+                .toList();
+        for (FutureTask<Integer> task : tasks) {
+          executor.execute(task);
Review Comment:
I'm viewing it the other way around: ideally we wouldn't require users to
configure slicing; each segment would run in its own task, and we'd leave it to
the executor to schedule the work in a sensible way, e.g. by putting a limit on
the number of tasks that can run concurrently (i.e. the size of the thread
pool). But Lucene can do a few things more efficiently sequentially than in
parallel, so slices try to accommodate this by allowing users to configure a
trade-off between how much work should run sequentially vs. in parallel. Terms
dictionary lookups don't have this issue: as far as I can tell, there is no
inefficiency in running them in parallel rather than sequentially. Furthermore,
every terms dictionary lookup might block on I/O, so running them in parallel
rather than sequentially gives the OS more opportunities to schedule I/O in a
sensible way and to fully utilize I/O.
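
The scheduling model described above (one task per segment, with the thread pool size bounding concurrency) can be sketched as follows. This is a hypothetical, simplified illustration, not the PR's actual code: `PerSegmentLookup` and `lookupAll` are invented names, and the per-segment work is a stand-in for a terms-dictionary lookup that may block on I/O.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

// Hypothetical sketch: submit one task per segment and let the executor
// decide how many run concurrently (bounded by the pool size).
public class PerSegmentLookup {

  public static int lookupAll(int numSegments, int poolSize) throws Exception {
    ExecutorService executor = Executors.newFixedThreadPool(poolSize);
    AtomicInteger totalDocFreq = new AtomicInteger();
    try {
      // One task per segment; no user-configured slicing.
      List<Future<?>> futures =
          IntStream.range(0, numSegments)
              .mapToObj(
                  ord ->
                      executor.submit(
                          () -> {
                            // Stand-in for a per-segment terms-dictionary
                            // lookup that may block on I/O.
                            totalDocFreq.addAndGet(ord + 1);
                          }))
              .toList();
      // Wait for all tasks and propagate any failure.
      for (Future<?> f : futures) {
        f.get();
      }
    } finally {
      executor.shutdown();
    }
    return totalDocFreq.get();
  }

  public static void main(String[] args) throws Exception {
    // 8 segments, at most 4 lookups in flight at once.
    System.out.println(lookupAll(8, 4)); // 1 + 2 + ... + 8 = 36
  }
}
```

The point of the sketch is that the caller never chooses a slice size: the fixed-size pool is the only knob, and the executor schedules the per-segment tasks against it.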
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]