[ https://issues.apache.org/jira/browse/SOLR-15660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533820#comment-17533820 ]
Michael Gibney commented on SOLR-15660: --------------------------------------- Thanks for your response, Mark! I'm not sure I understand what you're suggesting as your preferred course of action though. For my own part, my main concern is actually not "most devs" (for whatever reason this doesn't actually present much of a practical issue for me in local development), but rather the automated builds, which fail several times a day with a stacktrace on, e.g. {{java.util.concurrent.ThreadPoolExecutor.tryTerminate}} (Btw apologies for spamming the list while I edited the above comment until I figured out that I had to percent-encode the pipe character in the lists.apache.org url; the relevant link is [this|https://lists.apache.org/list?bui...@solr.apache.org:dfr=2022-4-1%7Cdto=2022-5-31:java.base@11.0.12/java.util.concurrent.ThreadPoolExecutor.tryTerminate]). To take the stacktrace discussed above blocked in {{ZkStateReader.createClusterStateWatchersAndUpdate}}: I actually think the threads are neither deadlocked _nor_ waiting on something slow that holds the lock. The leaked threads in some (most?) cases -- 1 from a build, and 1 that I was able to reproduce locally -- were all identical, blocked on that synchronized method. I think they just got caught by thread leak detection in the fraction of a second between test method return and when they would naturally have died, with no ill effect. This is a good case in fact to illustrate the tradeoff, because it _was_ a threadleak, and arguably would indeed be better off fixed. (Again, I'm fairly certain the problem is that {{ZooKeeper.close()}} does not wait for connection threads to terminate; fixed with [apache/solr#842|https://github.com/apache/solr/pull/842]). I'm all for catching and fixing such cases, even if I also think the practical consequence (at least in this case) is likely insignificant. (caveat/tangent: introducing blocking ZooKeeper.close() in automatic Zk reconnection attempts seems to make a significant difference under the hood, sometimes blocking for upwards of 10s, with no ill effect on the overall test suite. It's possible that the change could be significant (and perhaps beneficial, in ways that I don't yet understand?), even if I truly think it is practically insignificant in the end-of-test "ThreadLeakDetection" context where it was initially detected). But there is literally no way to "fix" the {{tryTerminate}} case (link to builds mentioned above) -- it's not broken. That was the point of the link to the MAHOUT issue. And the negative practical consequences of being so strict by default are significant: # people tune out build failures because there are enough failures (many of the classMethod-level, inscrutable failures) that the noise drowns out more meaningful signals # developers spend time troubleshooting issues that end up being either unfixable/non-issue ({{tryTerminate}}) or of questionable practical significance (async {{ZooKeeper.close()}} connection thread death). # at a "meta" level, people eventually come to the conclusion that "tests are flaky", to the detriment of the project's reputation, externally and internally I guess as a way forward I'd suggest: sure, 10s is probably too long. But something shorter (1s? 250ms?) probably makes sense as a default. It would be great if there were a way to have this be configurable by project properties. I'd be really curious to know, if we zeroed out ThreadLeakLinger across the board, how many of the resulting issues would end up being of the completely spurious, {{tryTerminate}}, variety. I'm sure I'm missing something, but in principle one could even inspect leaked thread stack traces and automatically prune false positives? > Remove universal 10 second test thread leak linger. > --------------------------------------------------- > > Key: SOLR-15660 > URL: https://issues.apache.org/jira/browse/SOLR-15660 > Project: Solr > Issue Type: Test > Components: Tests > Reporter: Mark Robert Miller > Assignee: Mark Robert Miller > Priority: Minor > Fix For: 9.0 > > Attachments: screenshot-1.png > > Time Spent: 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.7#820007) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org