Gehel added a comment.
In T192759#4176404, @Smalyshev wrote:
Lots of threads in executor service seems ok, that's how the queries are served IIRC and this can require a number of them. Not sure about management - but it may be cheaper to leave some to hang around than to collect them all.
All
Smalyshev added a comment.
Lots of threads in executor service seems ok, that's how the queries are served IIRC and this can require a number of them. Not sure about management - but it may be cheaper to leave some to hang around than to collect them all.
A lot of HTTP client threads are really
Gehel added a comment.
The thread dumps are interesting!
F17602509: threads-2018-05-02-16:05:59.log
at 16:05, I can see
479 threads waiting in thread pools. Most of them from the com.bigdata.journal.Journal.executorService thread pool and a few in an unnamed pool. That's at least a very
Gehel added a comment.
Stupid monitoring script running on wdqs1003 to capture large stack traces:
#!/bin/sh
while true
do
blazegraph_pid=$(cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/cgroup.procs)
pids_current=$(cat
Gehel added a comment.
Thread dumps can now be collected correctly with sudo -u blazegraph jcmd Thread.print
Gehel added a comment.
Empirical measurements show that the number of threads reported by the JVM and the number reported by cgroup differ by a fairly stable 74. The monitoring of JVM threads shows peaks up to 2k over the last 24h.
Looking at blazegraph code, I can find
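The comparison described above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the exporter emits a standard Prometheus text sample line of the form "jvm_threads_current <value>", which the thread does not show in full.

```shell
#!/bin/sh
# Hypothetical helper: extract the jvm_threads_current sample from
# Prometheus-style text on stdin. Comment lines such as "# HELP ..."
# are skipped because their first field is "#".
jvm_threads_from_metrics() {
    awk '$1 == "jvm_threads_current" { printf "%d\n", $2; exit }'
}

# Hypothetical helper: difference between the cgroup task count ($1)
# and the JVM thread count ($2).
thread_gap() {
    echo $(( $1 - $2 ))
}

# Example wiring, using the path and port quoted in the thread:
#   cgroup=$(cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.current)
#   jvm=$(curl -s localhost:9102 | jvm_threads_from_metrics)
#   thread_gap "$cgroup" "$jvm"
```

Sampling both values close together matters here, since the thread count can change between the two reads.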
Gehel added a comment.
Strangely, the number of threads reported by the JVM is significantly lower than pids.current:
gehel@wdqs2006:~$ cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.current ; curl -s localhost:9102 | grep jvm_threads_current
186
# HELP jvm_threads_current
Gehel added a comment.
The number of Java threads is now collected and available on the grafana dashboard.
Smalyshev added a comment.
Maybe, 4915 seems reasonable. I don't think there are internal limits in Blazegraph, so theoretically we could go over 4915. Maybe we should add this metric to the graphs. It seems easy to take it from /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.current -
Gehel added a comment.
gehel@wdqs1004:~$ cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.max
4915
The max number of tasks seems to be 4915 (which is 15% of /proc/sys/kernel/pid_max - thanks @Volans). If blazegraph is really trying to start almost 5k threads, it seems reasonable
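As a rough check of the arithmetic above, the 15% figure can be reproduced as follows. This is a sketch under assumptions: the function name is hypothetical, pid_max is taken to be 32768 (the value consistent with 4915, though not stated in the thread), and the result is truncated to an integer.

```shell
#!/bin/sh
# Hypothetical helper: compute a task limit as a percentage of pid_max,
# truncating to an integer. The percentage defaults to 15, the figure
# mentioned in the thread.
tasks_max_from_pid_max() {
    pid_max=$1
    percent=${2:-15}
    echo $(( pid_max * percent / 100 ))
}

# Assumed pid_max of 32768 (not stated in the thread).
tasks_max_from_pid_max 32768   # prints 4915
```

On a live host the input could instead be read from /proc/sys/kernel/pid_max, the file named above.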
Smalyshev added a comment.
Interestingly enough, ps -eT for the Blazegraph PID shows 673 threads, which is more than 512, so maybe we're running with a different default? The Updater seems to use 83 threads.
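One way to get a per-process thread count like the one above, without filtering ps output, is to count the entries under /proc/<pid>/task. This is a sketch that assumes a Linux host; the function name is hypothetical and this is not necessarily how the figure in the comment was obtained.

```shell
#!/bin/sh
# Hypothetical helper: count the threads of a process by listing its
# task directory (Linux only). Each entry under /proc/<pid>/task is
# one thread of that process.
count_threads() {
    ls "/proc/$1/task" | wc -l | tr -d ' '
}

# Example: thread count of the current shell.
count_threads $$
```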
Smalyshev added a comment.
From https://github.com/systemd/systemd/blob/master/NEWS#L2518:
* There's a new system.conf setting DefaultTasksMax= to
control the default TasksMax= setting for services and
scopes running on the system. (TasksMax= is the primary
setting
Gehel added a comment.
It looks like cgroup is preventing the fork:
Apr 23 07:39:09 wdqs1004 kernel: [3861112.854423] cgroup: fork rejected by pids controller in /system.slice/wdqs-blazegraph.service
Not sure what the limit is. Time to learn about cgroups...
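To see how often the kernel rejected forks for the service, log lines like the one above can be counted with a small filter. This is a sketch: the function name is hypothetical, and it assumes the message format shown in the kernel log line quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count "fork rejected by pids controller" kernel
# messages for a given cgroup path ($1), reading log lines from stdin.
# grep -c exits non-zero when there are no matches, hence the "|| true".
count_fork_rejections() {
    grep -cF -- "fork rejected by pids controller in $1" || true
}

# Example wiring (assumes kernel messages are available via journalctl -k):
#   journalctl -k | count_fork_rejections /system.slice/wdqs-blazegraph.service
```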
Gehel added a comment.
In T192759#4151679, @Smalyshev wrote:
Looking at the logs on wdqs1003, I see a string of java.lang.OutOfMemoryError: unable to create new native thread starting with:
Apr 23 08:02:57 wdqs1003 bash[25917]: java.lang.OutOfMemoryError: unable to create new native thread
I am
Smalyshev added a comment.
Notably, the OOME is also about creating new threads, not memory allocation. Maybe we need to change the stack size for Java? Or maybe we are hitting some other OS limitation?
Smalyshev added a comment.
Looking at the logs on wdqs1003, I see a string of java.lang.OutOfMemoryError: unable to create new native thread starting with:
Apr 23 08:02:57 wdqs1003 bash[25917]: java.lang.OutOfMemoryError: unable to create new native thread
I am not sure why at this point Java
Gehel added a comment.
Queries in error at the time of the issue: https://logstash.wikimedia.org/goto/a84c11d438e757265d6d53d4cb833797
Nothing looks more crazy than usual to me (but SPARQL always looks somewhat crazy to me)