shubhamvishu commented on code in PR #12183:
URL: https://github.com/apache/lucene/pull/12183#discussion_r1325027898
##########
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##########
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be
built only when requested
*/
- public static TermStates build(IndexReaderContext context, Term term,
boolean needsStats)
+ public static TermStates build(IndexSearcher indexSearcher, Term term,
boolean needsStats)
throws IOException {
- assert context != null && context.isTopLevel;
+ IndexReaderContext context = indexSearcher.getTopReaderContext();
+ assert context != null;
final TermStates perReaderTermState = new TermStates(needsStats ? null :
term, context);
if (needsStats) {
- for (final LeafReaderContext ctx : context.leaves()) {
- // if (DEBUG) System.out.println(" r=" + leaves[i].reader);
- TermsEnum termsEnum = loadTermsEnum(ctx, term);
- if (termsEnum != null) {
- final TermState termState = termsEnum.termState();
- // if (DEBUG) System.out.println(" found");
- perReaderTermState.register(
- termState, ctx.ord, termsEnum.docFreq(),
termsEnum.totalTermFreq());
+ Executor executor = indexSearcher.getExecutor();
+ if (executor == null) {
+ executor = Runnable::run;
+ }
+ List<FutureTask<TermStateInfo>> tasks =
+ context.leaves().stream()
+ .map(
+ ctx ->
+ new FutureTask<>(
+ () -> {
+ TermsEnum termsEnum = loadTermsEnum(ctx, term);
+ if (termsEnum != null) {
+ return new TermStateInfo(
+ termsEnum.termState(),
+ ctx.ord,
+ termsEnum.docFreq(),
+ termsEnum.totalTermFreq());
+ }
+ return null;
+ }))
+ .toList();
+ for (FutureTask<TermStateInfo> task : tasks) {
+ if (executor instanceof ThreadPoolExecutor pool) {
+ if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+ task.run();
Review Comment:
> If you really want to do this use StackWalker. But take care that it might
require additional privileges if used in a wrong way.
I did give this a shot in the new revision(s). I tried to use
`Thread.currentThread().getStackTrace()` first and also `StackWalker` using
`StackWalker.Option.SHOW_HIDDEN_FRAMES` option. I tried using different
`StackWalker.Option`s and observed the tests failing while using
`StackWalker.Option.RETAIN_CLASS_REFERENCE` due to missing privileges. Hence, I
tried to keep a fail-safe/fallback approach i.e. incase of permission issues we
would fallback to running the tasks on the caller thread sequentially. Since
I'm using `StackWalker.Option.SHOW_HIDDEN_FRAMES` here maybe we don't need the
fail-safe approach and directly throw the exception instead but I reckon we
would not want to fail here due to permission issues?
PS: I'm not much aware of the privileges in `StackWalker` and the scenarios
it could cause issues.
> The main problem with instance of checks is that there are many other
implementations of Executor that may deadlock in similar ways. Think also about
virtual threads in java 21!
Totally agreed💯....in the new revision I have tried to only go with
concurrent approach if its a `ThreadPoolExecutor` and it's not already running
on executor thread, if not the tasks would run sequentially on the caller
thread. This means for any other executor implementations we would not make use
of concurrency and stick to current behaviour of `TermStates#build` which is
non concurrent atleast till we have the nice permanent fix to this.
> I think the only way is to go away with the stupid old Executor
abstraction and use the ForkJoin framework of JDK. It is capable of forking
from inside threads which were already forked and can handle that without
deadlocks.
Nice! I really liked this idea....it's time to bid bye to these old executor
implementations. I'll try to dig and see more into this Fork/Join strategy.
-------------
@javanna @uschindler In the new revision I have tried to work around and
made changes to `TaskExecutor` to handle this specific deadlock issue/situation
only if 1) its a `ThreadPoolExecutor` and 2) we are already not running
executor's thread. So this solves the deadlock issues but we only run
`TermStates#build` concurrently when the above 2 situations suffice. This seems
to be a fix to unblock this PR but indeed we need to get rid of this and make
use of Fork/Join strategy as permanent fix (also avoid such annoying
`instanceof` checks for executors). Please let me know if you think we should
hold on this PR until we get the ideal fix in and we could instead go with this
fix meanwhile?. Thanks!
##########
lucene/core/src/java/org/apache/lucene/index/TermStates.java:
##########
@@ -86,19 +93,58 @@ public TermStates(
* @param needsStats if {@code true} then all leaf contexts will be visited
up-front to collect
* term statistics. Otherwise, the {@link TermState} objects will be
built only when requested
*/
- public static TermStates build(IndexReaderContext context, Term term,
boolean needsStats)
+ public static TermStates build(IndexSearcher indexSearcher, Term term,
boolean needsStats)
throws IOException {
- assert context != null && context.isTopLevel;
+ IndexReaderContext context = indexSearcher.getTopReaderContext();
+ assert context != null;
final TermStates perReaderTermState = new TermStates(needsStats ? null :
term, context);
if (needsStats) {
- for (final LeafReaderContext ctx : context.leaves()) {
- // if (DEBUG) System.out.println(" r=" + leaves[i].reader);
- TermsEnum termsEnum = loadTermsEnum(ctx, term);
- if (termsEnum != null) {
- final TermState termState = termsEnum.termState();
- // if (DEBUG) System.out.println(" found");
- perReaderTermState.register(
- termState, ctx.ord, termsEnum.docFreq(),
termsEnum.totalTermFreq());
+ Executor executor = indexSearcher.getExecutor();
+ if (executor == null) {
+ executor = Runnable::run;
+ }
+ List<FutureTask<TermStateInfo>> tasks =
+ context.leaves().stream()
+ .map(
+ ctx ->
+ new FutureTask<>(
+ () -> {
+ TermsEnum termsEnum = loadTermsEnum(ctx, term);
+ if (termsEnum != null) {
+ return new TermStateInfo(
+ termsEnum.termState(),
+ ctx.ord,
+ termsEnum.docFreq(),
+ termsEnum.totalTermFreq());
+ }
+ return null;
+ }))
+ .toList();
+ for (FutureTask<TermStateInfo> task : tasks) {
+ if (executor instanceof ThreadPoolExecutor pool) {
+ if ((pool.getCorePoolSize() - pool.getActiveCount()) <= 1) {
+ task.run();
Review Comment:
> If you really want to do this use StackWalker. But take care that it might
require additional privileges if used in a wrong way.
I did give this a shot in the new revision(s). I tried to use
`Thread.currentThread().getStackTrace()` first and also `StackWalker` using
`StackWalker.Option.SHOW_HIDDEN_FRAMES` option. I tried using different
`StackWalker.Option`s and observed the tests failing while using
`StackWalker.Option.RETAIN_CLASS_REFERENCE` due to missing privileges. Hence, I
tried to keep a fail-safe/fallback approach i.e. incase of permission issues we
would fallback to running the tasks on the caller thread sequentially. Since
I'm using `StackWalker.Option.SHOW_HIDDEN_FRAMES` here maybe we don't need the
fail-safe approach and directly throw the exception instead but I reckon we
would not want to fail here due to permission issues?
PS: I'm not much aware of the privileges in `StackWalker` and the scenarios
it could cause issues.
> The main problem with instance of checks is that there are many other
implementations of Executor that may deadlock in similar ways. Think also about
virtual threads in java 21!
Totally agreed💯....in the new revision I have tried to only go with
concurrent approach if its a `ThreadPoolExecutor` and it's not already running
on executor thread, if not the tasks would run sequentially on the caller
thread. This means for any other executor implementations we would not make use
of concurrency and stick to current behaviour of `TermStates#build` which is
non concurrent atleast till we have the nice permanent fix to this.
> I think the only way is to go away with the stupid old Executor
abstraction and use the ForkJoin framework of JDK. It is capable of forking
from inside threads which were already forked and can handle that without
deadlocks.
Nice! I really liked this idea....it's time to bid bye to these old executor
implementations. I'll try to dig and see more into this Fork/Join strategy.
-------------
@javanna @uschindler In the new revision I have tried to work around and
made changes to `TaskExecutor` to handle this specific deadlock issue/situation
only if 1) its a `ThreadPoolExecutor` and 2) we are already not running
executor's thread. So this solves the deadlock issues but we only run
`TermStates#build` concurrently when the above 2 situations suffice. This seems
to be a fix to unblock this PR but indeed we need to get rid of this and make
use of Fork/Join strategy as permanent fix (also avoid such annoying
`instanceof` checks for executors). Please let me know if you think we should
hold on this PR until we get the ideal fix in and we could instead go with this
fix meanwhile?. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]