[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Smalyshev added a comment.
Do we have anything left to do here?
TASK DETAIL
https://phabricator.wikimedia.org/T198042
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: Smalyshev
Cc: Smalyshev, Framawiki, Stashbot, Gehel, Aklapper, AndyTan, Davinaclare77, Qtn1293, Lahi, Gq86, Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, EBjune, merbst, LawExplorer, Avner, Zppix, Jonas, FloNight, Xmlizer, Wong128hk, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, fgiunchedi
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Stashbot added a comment. Mentioned in SAL (#wikimedia-operations) [2018-06-25T12:00:11Z] repooling wdqs1005 it has catched up on updates - T198042
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Stashbot added a comment. Mentioned in SAL (#wikimedia-operations) [2018-06-25T07:46:25Z] depooling wdqs1005 to allow it to catch up on updates - T198042
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Gehel added a comment. The pattern of banned / throttled requests seen on wdqs matches a pattern of HTTP 500 responses seen on varnish: same user agent, same IP. I was expecting all of those banned / throttled requests to get a 403 / 429, but that does not seem to be the case. Something is wrong...
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Gehel added a comment. Looking at thread dumps on wdqs1005, more than 5000 threads are waiting on logging (see the stack trace below). We could improve the situation with an AsyncAppender (probably a good idea anyway), but that only treats the symptom, not the root cause.

com.bigdata.journal.Journal.executorService20427 - priority:5 - threadId:0x7f5c8013f800 - nativeId:0x350c - state:WAITING
stackTrace:
java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x7f5d69220380> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at ch.qos.logback.core.OutputStreamAppender.writeBytes(OutputStreamAppender.java:197)
    at ch.qos.logback.core.OutputStreamAppender.subAppend(OutputStreamAppender.java:231)
    at ch.qos.logback.core.OutputStreamAppender.append(OutputStreamAppender.java:102)
    at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
    at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
    at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
    at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
    at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
    at ch.qos.logback.classic.Logger.log(Logger.java:765)
    at org.apache.log4j.Category.differentiatedLog(Category.java:193)
    at org.apache.log4j.Category.warn(Category.java:257)
    at com.bigdata.util.concurrent.Haltable.logCause(Haltable.java:474)
    at com.bigdata.util.concurrent.Haltable.halt(Haltable.java:197)
    at com.bigdata.bop.join.PipelineJoin$JoinTask$BindingSetConsumerTask.call(PipelineJoin.java:1022)
    at com.bigdata.bop.join.PipelineJoin$JoinTask.consumeSource(PipelineJoin.java:739)
    at com.bigdata.bop.join.PipelineJoin$JoinTask.call(PipelineJoin.java:623)
    at com.bigdata.bop.join.PipelineJoin$JoinTask.call(PipelineJoin.java:382)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
    at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTask.call(ChunkedRunningQuery.java:1346)
    at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkTaskWrapper.run(ChunkedRunningQuery.java:926)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at com.bigdata.concurrent.FutureTaskMon.run(FutureTaskMon.java:63)
    at com.bigdata.bop.engine.ChunkedRunningQuery$ChunkFutureTask.run(ChunkedRunningQuery.java:821)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
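To illustrate the AsyncAppender idea mentioned above: in the trace, every query thread parks on the single lock taken in OutputStreamAppender.writeBytes. An asynchronous appender lets callers drop their log line into a queue and return, so that only one writer thread touches the slow output. The following is a minimal sketch of that principle using only the JDK. It is not logback's AsyncAppender implementation, and the class name `QueueingAppender`, its methods and the drop-when-full policy are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: hands log lines to a single writer thread through a bounded queue. */
final class QueueingAppender implements AutoCloseable {
    // Sentinel compared by identity, so no real log line can be mistaken for it.
    private static final String STOP = new String("<stop>");

    private final BlockingQueue<String> queue;
    private final Thread writer;

    QueueingAppender(int capacity, Consumer<String> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.writer = new Thread(() -> {
            try {
                for (String line = queue.take(); line != STOP; line = queue.take()) {
                    sink.accept(line); // only this thread ever calls the (slow) sink
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Never blocks the caller; returns false if the queue is full and the line was dropped. */
    boolean append(String line) {
        return queue.offer(line);
    }

    /** Lets the writer drain what is already queued, then stops it. */
    @Override
    public void close() {
        try {
            queue.put(STOP);
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The trade-off is that a full queue loses log lines instead of stalling query threads. As noted above, this would only relieve the symptom: whatever produces that many Haltable warnings is still there.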
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Stashbot added a comment. Mentioned in SAL (#wikimedia-operations) [2018-06-24T22:26:46Z] restarting wdqs1004 - T198042
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Stashbot added a comment. Mentioned in SAL (#wikimedia-operations) [2018-06-24T21:51:23Z] restarting wdqs1005 after taking a few thread dumps - T198042
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Gehel added a comment. wdqs1005 was lagging on updates. I took a few thread dumps for further analysis before restarting it: F22597390: threads.tar.gz
[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster
Gehel added a comment. The situation is better, but still not entirely stable (I just restarted Blazegraph on wdqs1004). Looking at logstash, the number of Haltable errors is high, but it has been just as high over the last 7 days without major issues. Side note: we might want a log of long-running queries, to make it easier to spot something...
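A log of long-running queries, as suggested in the side note, could look roughly like the sketch below: time each query and record it only when it exceeds a threshold. This is an illustration under assumptions, not how WDQS or Blazegraph is instrumented. The class `SlowQueryLog`, the `timed` method and the output format are hypothetical names chosen for the example.

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Hypothetical sketch: records queries whose execution time reaches a threshold. */
final class SlowQueryLog {
    private final Duration threshold;
    private final Consumer<String> sink;

    SlowQueryLog(Duration threshold, Consumer<String> sink) {
        this.threshold = threshold;
        this.sink = sink;
    }

    /** Runs the query and logs it if it took at least the threshold, even when it fails. */
    <T> T timed(String query, Supplier<T> execution) {
        long start = System.nanoTime();
        try {
            return execution.get();
        } finally {
            Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
            if (elapsed.compareTo(threshold) >= 0) {
                sink.accept(elapsed.toMillis() + " ms: " + query);
            }
        }
    }
}
```

Logging in the `finally` block means queries that time out or are halted are recorded too, and those are probably the interesting ones here.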