This looks more like a memory leak than a thread issue.

On Wed, 13 Oct 2021, 04:33 Joel Bernstein, <[email protected]> wrote:
> There is a thread dump on the Solr admin. You can use that to determine
> what all those threads are doing and where they are getting stuck. You can
> post parts of the thread dump back to this email thread as well.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Oct 12, 2021 at 11:15 AM Dominic Humphries
> <[email protected]> wrote:
>
> > We run 8.3.1 in prod without any problems, but we're having issues with
> > trying to upgrade.
> >
> > I've created an 8.9.0 leader & follower, imported our live data into it,
> > and am testing it via replaying requests made to prod. We're seeing a
> > big problem where fairly moderate request rates are causing the instance
> > to become so slow it fails healthcheck. The logs showed a lot of errors
> > around creating threads:
> >
> > solr[4507]: [124136.511s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 256k, guardsize: 0k, detached.
> >
> > WARN (qtp178604517-3891) [ ] o.e.j.i.ManagedSelector => java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
> >
> > So I monitored thread count for the process whilst running the test
> > suite and saw a persistent pattern: Threads increased until maxed out,
> > the logs flooded with errors as it tried to create still more threads,
> > and the instance slowed down until terminated as unhealthy.
> >
> > The DefaultTasksMax is set to 4915, I've tried raising and lowering it
> > but regardless of value the result is the same: it gets maxed and
> > everything slows down.
> >
> > Is there anything I can do to stop solr spinning up so many threads it
> > ceases to function? There have been a few test passes where it
> > spontaneously dropped threadcount from thousands to hundreds and stayed
> > up longer, but there seems no pattern to when this happens. Running the
> > same tests on 8.3.1 results in a much slower increase in threads and it
> > never quite maxes them so things continue to function.
> >
> > See below for the thread count and healthcheck times seen on a (fairly
> > harsh) test run of 100 requests/sec
> >
> > Thanks
> >
> > Dominic
> >
> >
> > Threadcount:
> >
> > ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; ps -eLF | grep 'start.jar' | wc -l; sleep 10s; done
> > Tue Oct 12 14:27:33 UTC 2021
> > 52
> > Tue Oct 12 14:27:43 UTC 2021
> > 52
> > Tue Oct 12 14:27:54 UTC 2021
> > 52
> > Tue Oct 12 14:28:04 UTC 2021
> > 52
> > Tue Oct 12 14:28:14 UTC 2021
> > 569
> > Tue Oct 12 14:28:24 UTC 2021
> > 899
> > Tue Oct 12 14:28:34 UTC 2021
> > 1198
> > Tue Oct 12 14:28:44 UTC 2021
> > 1589
> > Tue Oct 12 14:28:54 UTC 2021
> > 2016
> > Tue Oct 12 14:29:05 UTC 2021
> > 2451
> > Tue Oct 12 14:29:15 UTC 2021
> > 2851
> > Tue Oct 12 14:29:26 UTC 2021
> > 2934
> > Tue Oct 12 14:29:36 UTC 2021
> > 3249
> > Tue Oct 12 14:29:46 UTC 2021
> > 3501
> > Tue Oct 12 14:29:57 UTC 2021
> > 3734
> > Tue Oct 12 14:30:07 UTC 2021
> > 4128
> > Tue Oct 12 14:30:18 UTC 2021
> > 4374
> > Tue Oct 12 14:30:29 UTC 2021
> > 4637
> > Tue Oct 12 14:30:39 UTC 2021
> > 4693
> > Tue Oct 12 14:30:50 UTC 2021
> > 4807
> > Tue Oct 12 14:31:01 UTC 2021
> > 4916
> > Tue Oct 12 14:31:11 UTC 2021
> > 4916
> > Tue Oct 12 14:31:22 UTC 2021
> > Connection to 10.40.22.166 closed by remote host.
> >
> >
> > Healthcheck:
> >
> > ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; curl -v localhost:8983/solr/ 2>&1 | grep HTTP; date; echo '----'; sleep 10s; done
> > Tue Oct 12 14:27:34 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:27:34 UTC 2021
> > ----
> > Tue Oct 12 14:27:44 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:27:44 UTC 2021
> > ----
> > Tue Oct 12 14:27:54 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:27:54 UTC 2021
> > ----
> > Tue Oct 12 14:28:04 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:28:04 UTC 2021
> > ----
> > Tue Oct 12 14:28:14 UTC 2021
> > > GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:28:16 UTC 2021
> > ----
> > Tue Oct 12 14:28:26 UTC 2021
> > > GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:12 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:28:39 UTC 2021
> > ----
> > Tue Oct 12 14:28:49 UTC 2021
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0> GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:23 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:29:13 UTC 2021
> > ----
> > Tue Oct 12 14:29:23 UTC 2021
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0> GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:29:25 UTC 2021
> > ----
> > Tue Oct 12 14:29:35 UTC 2021
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0> GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:09 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:29:44 UTC 2021
> > ----
> > Tue Oct 12 14:29:54 UTC 2021
> > > GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:11 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:30:06 UTC 2021
> > ----
> > Tue Oct 12 14:30:16 UTC 2021
> > > GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:30:20 UTC 2021
> > ----
> > Tue Oct 12 14:30:30 UTC 2021
> > > GET /solr/ HTTP/1.1
> > 0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0< HTTP/1.1 200 OK
> > Tue Oct 12 14:30:33 UTC 2021
> > ----
> > Tue Oct 12 14:30:43 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:30:43 UTC 2021
> > ----
> > Tue Oct 12 14:30:53 UTC 2021
> > > GET /solr/ HTTP/1.1
> > Tue Oct 12 14:30:55 UTC 2021
> > ----
> > Tue Oct 12 14:31:05 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:31:05 UTC 2021
> > ----
> > Tue Oct 12 14:31:15 UTC 2021
> > > GET /solr/ HTTP/1.1
> > < HTTP/1.1 200 OK
> > Tue Oct 12 14:31:15 UTC 2021
> > ----
> > Connection to 10.40.22.166 closed by remote host.
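
On Joel's thread dump suggestion: with several thousand threads the raw dump is hard to read, so it may help to count threads per pool first. Below is a rough sketch of one way to do that, not something taken from this thread. It assumes the dump is in the usual HotSpot text format, where each thread entry starts with a line like "qtp178604517-3891" #123 ..., and the function name summarize_threads is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads a text thread dump on stdin and prints
# "count pool-name" pairs, largest pool first.
# Assumption: thread header lines start with the thread name in double quotes.
summarize_threads() {
    grep '^"' |
        # Keep only the quoted thread name.
        sed -E 's/^"([^"]*)".*/\1/' |
        # Drop a trailing "-<number>" so pool members collapse to one name,
        # e.g. qtp178604517-3891 becomes qtp178604517.
        sed -E 's/-[0-9]+$//' |
        sort |
        uniq -c |
        sort -rn
}
```

Assuming the JDK's jstack is installed on the box and SOLR_PID holds the Solr process ID, something like `jstack "$SOLR_PID" | summarize_threads | head` would show which pools hold most of the threads. Posting that summary plus a few full stack traces from the largest pool would probably be more useful to the list than the entire dump.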
