[jira] [Commented] (SOLR-15660) Remove universal 10 second test thread leak linger.

Michael Gibney (Jira) Mon, 09 May 2022 06:51:06 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533820#comment-17533820
 ]


Michael Gibney commented on SOLR-15660:
---------------------------------------

Thanks for your response, Mark! I'm not sure I understand what you're 
suggesting as your preferred course of action though.

For my own part, my main concern is actually not "most devs" (for whatever 
reason this doesn't actually present much of a practical issue for me in local 
development), but rather the automated builds, which fail several times a day 
with a stacktrace on, e.g. 
{{java.util.concurrent.ThreadPoolExecutor.tryTerminate}} (Btw apologies for 
spamming the list while I edited the above comment until I figured out that I 
had to percent-encode the pipe character in the lists.apache.org url; the 
relevant link is 
[this|https://lists.apache.org/list?bui...@solr.apache.org:dfr=2022-4-1%7Cdto=2022-5-31:java.base@11.0.12/java.util.concurrent.ThreadPoolExecutor.tryTerminate]).

To take the stacktrace discussed above blocked in 
{{ZkStateReader.createClusterStateWatchersAndUpdate}}: I actually think the 
threads are neither deadlocked _nor_ waiting on something slow that holds the 
lock. The leaked threads in some (most?) cases -- 1 from a build, and 1 that I 
was able to reproduce locally -- were all identical, blocked on that 
synchronized method. I think they just got caught by thread leak detection in 
the fraction of a second between test method return and when they would 
naturally have died, with no ill effect.

This is a good case in fact to illustrate the tradeoff, because it _was_ a 
threadleak, and arguably would indeed be better off fixed. (Again, I'm fairly 
certain the problem is that {{ZooKeeper.close()}} does not wait for connection 
threads to terminate; fixed with 
[apache/solr#842|https://github.com/apache/solr/pull/842]). I'm all for 
catching and fixing such cases, even if I also think the practical consequence 
(at least in this case) is likely insignificant. (caveat/tangent: introducing 
blocking ZooKeeper.close() in automatic Zk reconnection attempts seems to make 
a significant difference under the hood, sometimes blocking for upwards of 10s, 
with no ill effect on the overall test suite. It's possible that the change 
could be significant (and perhaps beneficial, in ways that I don't yet 
understand?), even if I truly think it is practically insignificant in the 
end-of-test "ThreadLeakDetection" context where it was initially detected).

But there is literally no way to "fix" the {{tryTerminate}} case (link to 
builds mentioned above) -- it's not broken. That was the point of the link to 
the MAHOUT issue. And the negative practical consequences of being so strict by 
default are significant:
# people tune out build failures because there are enough failures (many of the 
classMethod-level, inscrutable failures) that the noise drowns out more 
meaningful signals
# developers spend time troubleshooting issues that end up being either 
unfixable/non-issue ({{tryTerminate}}) or of questionable practical 
significance (async {{ZooKeeper.close()}} connection thread death).
# at a "meta" level, people eventually come to the conclusion that "tests are 
flaky", to the detriment of the project's reputation, externally and internally

I guess as a way forward I'd suggest: sure, 10s is probably too long. But 
something shorter (1s? 250ms?) probably makes sense as a default. It would be 
great if there were a way to have this be configurable by project properties.

I'd be really curious to know, if we zeroed out ThreadLeakLinger across the 
board, how many of the resulting issues would end up being of the completely 
spurious, {{tryTerminate}}, variety. I'm sure I'm missing something, but in 
principle one could even inspect leaked thread stack traces and automatically 
prune false positives?

> Remove universal 10 second test thread leak linger.
> ---------------------------------------------------
>
>                 Key: SOLR-15660
>                 URL: https://issues.apache.org/jira/browse/SOLR-15660
>             Project: Solr
>          Issue Type: Test
>          Components: Tests
>            Reporter: Mark Robert Miller
>            Assignee: Mark Robert Miller
>            Priority: Minor
>             Fix For: 9.0
>
>         Attachments: screenshot-1.png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

[jira] [Commented] (SOLR-15660) Remove universal 10 second test thread leak linger.

Reply via email to