We run Solr 8.3.1 in prod without any problems, but we're having issues
trying to upgrade.

I've created an 8.9.0 leader & follower, imported our live data into it,
and am testing it via replaying requests made to prod. We're seeing a big
problem where fairly moderate request rates are causing the instance to
become so slow it fails healthcheck. The logs showed a lot of errors around
creating threads:

solr[4507]: [124136.511s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 256k, guardsize: 0k, detached.

WARN  (qtp178604517-3891) [   ] o.e.j.i.ManagedSelector  => java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

So I monitored the thread count for the process whilst running the test
suite and saw a consistent pattern: threads increased until they hit the
limit, the logs flooded with errors as it tried to create still more
threads, and the instance slowed down until it was terminated as unhealthy.
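
For anyone reproducing this, a minimal sketch of an alternative way to read the thread count, assuming a Linux host with /proc mounted. The function name thread_count is made up for illustration. It reads the kernel's own per-process counter, so unlike a `ps -eLF | grep` pipeline it cannot also count the grep process itself.

```shell
# Print the number of threads (lightweight processes) owned by a PID,
# using the "Threads:" line the kernel exposes in /proc/<pid>/status.
# Assumption: Linux with /proc mounted; thread_count is a hypothetical helper.
thread_count() {
  awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Example use (not run here), assuming the Solr JVM is the only process
# whose command line contains start.jar:
#   pid=$(pgrep -f start.jar | head -n1)
#   while true; do date; thread_count "$pid"; sleep 10; done
```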

DefaultTasksMax is set to 4915. I've tried raising and lowering it, but
regardless of the value the result is the same: the limit gets reached and
everything slows down.
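
Since several different limits can produce the EAGAIN from pthread_create, here is a rough sketch of how the per-process limit could be read back from the running process, to confirm a changed setting actually took effect. It assumes Linux with /proc; max_processes is a hypothetical helper name, and the unit name solr.service in the comments is an assumption about how the service is installed.

```shell
# Print the soft "Max processes" limit (RLIMIT_NPROC) that applies to a PID,
# as reported by /proc/<pid>/limits. The value may be the word "unlimited".
# Assumption: Linux with /proc mounted; max_processes is a hypothetical helper.
max_processes() {
  awk '/^Max processes/ {print $3}' "/proc/$1/limits"
}

# Other limits worth comparing against the observed thread count
# (not run here; solr.service is an assumed unit name):
#   systemctl show solr.service -p TasksMax
#   sysctl kernel.threads-max kernel.pid_max vm.max_map_count
```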

Is there anything I can do to stop Solr spinning up so many threads that it
ceases to function? There have been a few test passes where it
spontaneously dropped the thread count from thousands to hundreds and stayed
up longer, but there seems to be no pattern to when this happens. Running
the same tests on 8.3.1 results in a much slower increase in threads, and it
never quite reaches the limit, so things continue to function.
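
To narrow down which pool the extra threads belong to, one option is to group thread names from a thread dump. The sketch below is only an illustration: summarise_threads is a hypothetical helper, and it assumes the dump uses the usual jstack layout where each thread header line starts with the quoted thread name.

```shell
# Read a jstack-style thread dump on stdin and print a count per thread-name
# prefix, largest first. Trailing digits and hyphens are stripped so that
# pool members such as "qtp178604517-3891" group together under "qtp".
# Assumption: thread header lines begin with the thread name in double quotes.
summarise_threads() {
  grep '^"' \
    | sed -E 's/^"([^"]*)".*/\1/; s/[-0-9]+$//' \
    | sort | uniq -c | sort -rn
}

# Example use (not run here), with $pid set to the Solr JVM's PID:
#   jstack "$pid" | summarise_threads | head
```

Comparing the output from an 8.3.1 and an 8.9.0 run at the same point in the test should show whether the growth is in the Jetty request pool (the qtp prefix seen in the log above) or somewhere else.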

See below for the thread count and healthcheck times seen on a (fairly
harsh) test run of 100 requests/sec.

Thanks

Dominic


Threadcount:

ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; ps -eLF | grep 'start.jar' | wc -l; sleep 10s; done
Tue Oct 12 14:27:33 UTC 2021
52
Tue Oct 12 14:27:43 UTC 2021
52
Tue Oct 12 14:27:54 UTC 2021
52
Tue Oct 12 14:28:04 UTC 2021
52
Tue Oct 12 14:28:14 UTC 2021
569
Tue Oct 12 14:28:24 UTC 2021
899
Tue Oct 12 14:28:34 UTC 2021
1198
Tue Oct 12 14:28:44 UTC 2021
1589
Tue Oct 12 14:28:54 UTC 2021
2016
Tue Oct 12 14:29:05 UTC 2021
2451
Tue Oct 12 14:29:15 UTC 2021
2851
Tue Oct 12 14:29:26 UTC 2021
2934
Tue Oct 12 14:29:36 UTC 2021
3249
Tue Oct 12 14:29:46 UTC 2021
3501
Tue Oct 12 14:29:57 UTC 2021
3734
Tue Oct 12 14:30:07 UTC 2021
4128
Tue Oct 12 14:30:18 UTC 2021
4374
Tue Oct 12 14:30:29 UTC 2021
4637
Tue Oct 12 14:30:39 UTC 2021
4693
Tue Oct 12 14:30:50 UTC 2021
4807
Tue Oct 12 14:31:01 UTC 2021
4916
Tue Oct 12 14:31:11 UTC 2021
4916
Tue Oct 12 14:31:22 UTC 2021
Connection to 10.40.22.166 closed by remote host.


Healthcheck:

ubuntu@ip-10-40-22-166:~$ while [ 1 ]; do date; curl -v localhost:8983/solr/ 2>&1 | grep HTTP; date; echo '----'; sleep 10s; done
Tue Oct 12 14:27:34 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:27:34 UTC 2021
----
Tue Oct 12 14:27:44 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:27:44 UTC 2021
----
Tue Oct 12 14:27:54 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:27:54 UTC 2021
----
Tue Oct 12 14:28:04 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:28:04 UTC 2021
----
Tue Oct 12 14:28:14 UTC 2021
> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:28:16 UTC 2021
----
Tue Oct 12 14:28:26 UTC 2021
> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:12 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:28:39 UTC 2021
----
Tue Oct 12 14:28:49 UTC 2021
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--
  0> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:23 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:29:13 UTC 2021
----
Tue Oct 12 14:29:23 UTC 2021
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--
  0> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:29:25 UTC 2021
----
Tue Oct 12 14:29:35 UTC 2021
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--
  0> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:09 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:29:44 UTC 2021
----
Tue Oct 12 14:29:54 UTC 2021
> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:11 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:30:06 UTC 2021
----
Tue Oct 12 14:30:16 UTC 2021
> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:03 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:30:20 UTC 2021
----
Tue Oct 12 14:30:30 UTC 2021
> GET /solr/ HTTP/1.1
  0     0    0     0    0     0      0      0 --:--:--  0:00:02 --:--:--
  0< HTTP/1.1 200 OK
Tue Oct 12 14:30:33 UTC 2021
----
Tue Oct 12 14:30:43 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:30:43 UTC 2021
----
Tue Oct 12 14:30:53 UTC 2021
> GET /solr/ HTTP/1.1
Tue Oct 12 14:30:55 UTC 2021
----
Tue Oct 12 14:31:05 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:31:05 UTC 2021
----
Tue Oct 12 14:31:15 UTC 2021
> GET /solr/ HTTP/1.1
< HTTP/1.1 200 OK
Tue Oct 12 14:31:15 UTC 2021
----
Connection to 10.40.22.166 closed by remote host.
