[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-03 Thread Gehel
Gehel added a comment. In T192759#4176404, @Smalyshev wrote: Lots of threads in executor service seems ok, that's how the queries are served IIRC and this can require a number of them. Not sure about management - but it may be cheaper to leave some to hang around than to collect them all. All

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-02 Thread Smalyshev
Smalyshev added a comment. Lots of threads in executor service seems ok, that's how the queries are served IIRC and this can require a number of them. Not sure about management - but it may be cheaper to leave some to hang around than to collect them all. A lot of HTTP client threads are really

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-02 Thread Gehel
Gehel added a comment. The thread dumps are interesting! F17602509: threads-2018-05-02-16:05:59.log at 16:05, I can see 479 threads waiting in thread pools. Most of them from the com.bigdata.journal.Journal.executorService thread pool and a few in an unnamed pool. That's at least a very
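One way to get the per-pool counts described above is to group the thread names in a dump by pool. The following is a minimal sketch, assuming the dump uses the usual HotSpot layout in which each thread starts with a quoted name, and assuming pool threads are named as a pool prefix followed by a number. The function name and the sample excerpt are hypothetical and are not taken from the dump attached to the task.

```python
import re
from collections import Counter

# Matches the header line of each thread in a HotSpot-style thread dump, e.g.
#   "com.bigdata.journal.Journal.executorService12" #57 daemon prio=5 ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')


def count_threads_by_pool(dump_text):
    """Count threads per pool, where the pool is the thread name minus its trailing number."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            # Strip a trailing counter such as "12" or "-7" to get the pool name.
            pool = re.sub(r'[-#\s]*\d+$', '', match.group("name"))
            counts[pool] += 1
    return counts


# Hypothetical excerpt for illustration only.
sample = '''"com.bigdata.journal.Journal.executorService1" #31 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
"com.bigdata.journal.Journal.executorService2" #32 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
"pool-3-thread-7" #40 prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
'''

for pool, n in count_threads_by_pool(sample).most_common():
    print(n, pool)
```

Sorting by count puts the largest pools first, which is usually where to start looking when the thread total climbs.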

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-02 Thread Gehel
Gehel added a comment. Stupid monitoring script running on wdqs1003 to capture large stack traces: #!/bin/sh while true do blazegraph_pid=$(cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/cgroup.procs) pids_current=$(cat

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-02 Thread Gehel
Gehel added a comment. Thread dumps can now be collected correctly with sudo -u blazegraph jcmd Thread.print TASK DETAIL https://phabricator.wikimedia.org/T192759 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Gehel Cc: Smalyshev, Gehel, Aklapper,

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-02 Thread Gehel
Gehel added a comment. Empirical measurements show that the number of threads reported by the JVM and the number reported by cgroup differ by a fairly stable 74. The monitoring of JVM threads shows peaks of up to 2k over the last 24h. Looking at blazegraph code, I can find

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-05-01 Thread Gehel
Gehel added a comment. Strangely, the number of threads reported by the JVM is significantly lower than pids.current: gehel@wdqs2006:~$ cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.current ; curl -s localhost:9102 | grep jvm_threads_current 186 # HELP jvm_threads_current
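The comparison above can be scripted by parsing both outputs. This is a minimal sketch, assuming the exporter on port 9102 serves the Prometheus text format and exposes jvm_threads_current without labels. The function names and the sample numbers are hypothetical, chosen only to show the calculation.

```python
def read_metric(exposition_text, metric_name):
    """Return the value of an unlabelled metric from Prometheus text output, or None."""
    for line in exposition_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP and TYPE comment lines
        parts = line.split()
        if len(parts) >= 2 and parts[0] == metric_name:
            return float(parts[1])
    return None


def thread_gap(pids_current_text, exposition_text):
    """Difference between the cgroup task count and the JVM thread count."""
    jvm_threads = read_metric(exposition_text, "jvm_threads_current")
    if jvm_threads is None:
        raise ValueError("jvm_threads_current not found")
    return int(pids_current_text.strip()) - int(jvm_threads)


# Hypothetical values for illustration only, not measurements from wdqs2006.
sample_exporter_output = """# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 120.0
"""
print(thread_gap("200\n", sample_exporter_output))  # prints 80
```

In practice the first argument would be the contents of pids.current and the second the body returned by the exporter.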

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-30 Thread Gehel
Gehel added a comment. The number of Java threads is now collected and available on the grafana dashboard.

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-26 Thread Smalyshev
Smalyshev added a comment. Maybe, 4915 seems reasonable. I don't think there are internal limits in Blazegraph, so theoretically we could go over 4915. Maybe we should add this metric to the graph. It seems to be easy to take it from /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.current -

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-26 Thread Gehel
Gehel added a comment. gehel@wdqs1004:~$ cat /sys/fs/cgroup/pids/system.slice/wdqs-blazegraph.service/pids.max 4915 The max number of tasks seems to be 4915 (which is 15% of /proc/sys/kernel/pid_max - thanks @Volans). If blazegraph is really trying to start almost 5k threads, it seems reasonable
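The 15% figure can be checked with a short calculation. This sketch assumes pid_max is 32768, a common Linux default; the comment above does not state the actual value on wdqs1004, and the function name is hypothetical.

```python
def tasks_max_from_percentage(pid_max, percent=15):
    """Integer task limit equal to `percent` percent of pid_max, rounded down."""
    return pid_max * percent // 100


# 32768 is an assumed pid_max, not a value read from wdqs1004.
print(tasks_max_from_percentage(32768))  # prints 4915
```

Under that assumption the result matches the pids.max value shown above.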

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-26 Thread Smalyshev
Smalyshev added a comment. Interestingly enough, ps -eT for the Blazegraph PID shows 673 threads, which is more than 512, so maybe we're running with a different default? The Updater seems to use 83 threads.

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-26 Thread Smalyshev
Smalyshev added a comment. From https://github.com/systemd/systemd/blob/master/NEWS#L2518: * There's a new system.conf setting DefaultTasksMax= to control the default TasksMax= setting for services and scopes running on the system. (TasksMax= is the primary setting

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-26 Thread Gehel
Gehel added a comment. It looks like cgroup is preventing the fork: Apr 23 07:39:09 wdqs1004 kernel: [3861112.854423] cgroup: fork rejected by pids controller in /system.slice/wdqs-blazegraph.service Not sure what the limit is. Time to learn about cgroups...

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-26 Thread Gehel
Gehel added a comment. In T192759#4151679, @Smalyshev wrote: Looking at the logs on wdqs1003, I see a string of java.lang.OutOfMemoryError: unable to create new native thread starting with: Apr 23 08:02:57 wdqs1003 bash[25917]: java.lang.OutOfMemoryError: unable to create new native thread I am

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-23 Thread Smalyshev
Smalyshev added a comment. Notably, the OOME is about creating new threads, not memory allocations. Maybe we need to change the stack size for Java? Or maybe we are hitting some other OS limitation?

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-23 Thread Smalyshev
Smalyshev added a comment. Looking at the logs on wdqs1003, I see a string of java.lang.OutOfMemoryError: unable to create new native thread starting with: Apr 23 08:02:57 wdqs1003 bash[25917]: java.lang.OutOfMemoryError: unable to create new native thread I am not sure why at this point Java

[Wikidata-bugs] [Maniphest] [Commented On] T192759: WDQS endpoint timeout

2018-04-23 Thread Gehel
Gehel added a comment. Queries in error at the time of the issue: https://logstash.wikimedia.org/goto/a84c11d438e757265d6d53d4cb833797 Nothing looks more crazy than usual to me (but SPARQL always looks somewhat crazy to me)