Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION: As described in the parent task, we need to reset data on wdqs100[78].
Gehel added a comment.
In T213134#4861540, @Smalyshev wrote:
Same happening with wdq8, 3 hours later. Something spooky is going on... Will talk tomorrow morning with Bryan from Blazegraph, not sure if it's possible to do anything till then.
I'm not touching wdqs100[78] yet, so that you have
Gehel added a comment.
What we should probably do in this case is define default values for the hiera calls in profile::wdqs, and override only what needs to be different. At least for parameters where a default would make sense.
TASK DETAIL: https://phabricator.wikimedia.org/T210431
Gehel added a comment.
We already have the git-commit-id plugin configured, which creates a properties file and adds it to the jars. So we should be able to load it and output whatever we need. There is probably a jar somewhere with the logic required to parse that properties file, but it's
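For illustration, a sketch of reading that file with standard tools rather than an extra jar. The `git.properties` file name and its key names are the plugin's defaults; the jar name and commit hash below are purely illustrative:

```shell
# Simulate the git.properties file the git-commit-id plugin adds to the jar.
# In a real deployment it could be pulled out with e.g.:
#   unzip -p wdqs-updater.jar git.properties
cat > git.properties <<'EOF'
git.commit.id=0123456789abcdef0123456789abcdef01234567
git.branch=master
EOF

# Print the commit the build was made from
grep '^git.commit.id=' git.properties | cut -d= -f2
```

From Java code, the same file can be loaded from the classpath with `Properties.load()`, so no dedicated parsing jar is strictly required.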
Gehel added a comment.
https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/474266 has been merged and deployed, but the test queries are still not available on the wdqs servers. There is something I don't understand about the packaging. @Smalyshev could you have a look and point me to what I
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint.
TASK DETAILhttps://phabricator.wikimedia.org/T207665EMAIL PREFERENCEShttps://phabricator.wikimedia.org/settings/panel/emailpreferences/To: GehelCc: gerritbot, Aklapper, Mathew.onipe, Smalyshev, Gehel, CucyNoiD, Nandana, NebulousIris
Gehel closed this task as "Resolved". Gehel claimed this task.
TASK DETAIL: https://phabricator.wikimedia.org/T210169
Gehel closed subtask T210169: Create an exim alias for wdqs administrator as "Resolved".
TASK DETAIL: https://phabricator.wikimedia.org/T207665
Gehel created this task. Gehel triaged this task as "Normal" priority. Gehel added projects: Wikidata-Query-Service, Wikidata. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION: This will be used as part of the parent task to notify wdqs admins of issues. It should contain:
Gehel added a comment.
Looks like the new metric is flowing to Prometheus.
TASK DETAIL: https://phabricator.wikimedia.org/T206123
Gehel claimed this task.
TASK DETAIL: https://phabricator.wikimedia.org/T199228
Gehel moved this task from Needs review to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
@Smalyshev we're good on this from my point of view. Could you check that running updater manually (with the -S option to output to console, or with -v for verbose) works
Gehel removed a project: Patch-For-Review.
TASK DETAIL: https://phabricator.wikimedia.org/T199228
Gehel added a comment.
In T199228#4715898, @Pintoch wrote:
The search interface can also be used for that thanks to the haswbstatement command. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface.
The lag
Gehel added a comment.
In T199228#4715863, @Magnus wrote:
Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks for a large number of scientific publications if they exist on Wikidata or not. If not, it creates them. Now, if people have
Gehel added a comment.
Thanks for the feedback!
In T199228#4715815, @Jheald wrote:
This requires WDQS to be reasonably up to date most of the time. A lag of 5 minutes isn't such a problem. An occasional longer lag, if clearly signposted as the WDQS GUI does, also isn't such a problem
Gehel added a comment.
In T199228#4710863, @Pintoch wrote:
What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern.
I'd be most interested in how well this is going at the moment! The open for all and widely varying cost
Gehel added a comment.
For context, T202765 is about a bot sending annoying and somewhat expensive requests. That specific issue is now resolved.
TASK DETAIL: https://phabricator.wikimedia.org/T202764
Gehel added a comment.
New configuration deployed, but it is raising some deprecation warnings and needs some tuning.
TASK DETAIL: https://phabricator.wikimedia.org/T207834
Gehel added a comment.
There are 3 issues here, and maybe they should be addressed on different tickets:
isolating updater from blazegraph: this is about reducing the interactions between the 2 components to what is essential, increasing robustness and simplifying investigation into any failure
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION: Since wdqs1003 is acting differently from oth
Gehel added a comment.
In T206636#4690384, @Smalyshev wrote:
@Andrew Also looks like there is some puppet issue there:
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error
Gehel added a comment.
My current patch is trying to put all that logic into logback.xml, but it is definitely starting to be unreadable. And coding ifs in XML just seems wrong :/
I think that instead we should have different static logback.xml files and have a flag to switch between them
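A sketch of that flag, assuming we rely on logback's standard `logback.configurationFile` system property to pick one of the static files (the paths and jar name below are hypothetical):

```shell
# Select one of several static logback configs at JVM startup; the
# conditional logic lives here (or in puppet), not inside the XML.
LOGBACK_CONFIG="${LOGBACK_CONFIG:-/etc/wdqs/logback-prod.xml}"

# Sketch only: print the invocation instead of running it
echo java -Dlogback.configurationFile="$LOGBACK_CONFIG" -jar wdqs-updater.jar
```

Switching behaviour then means pointing the variable at a different file, e.g. a `logback-debug.xml`, with no conditionals in the XML itself.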
Gehel added a comment.
In T207817#4691885, @Ottomata wrote:
Interesting! I checked Jodatime stuff to make sure one of our Java based pipeline handled the timestamp format change, I'm surprised that Jackson can't parse this!
We can definitely parse that format if we wanted to, but we do have
Gehel added a comment.
In T207817#4691569, @mmodell wrote:
Do we have a patch or should we roll back group0?
We have a workaround on the WDQS side (switching back to recent changes instead of kafka events). But the root cause isn't fixed, and it is unclear to me what change caused that issue
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Operations. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION: wdqs updater is expected to exit on a number of fail
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata-Query-Service, Discovery-Search (Current work), Operations. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: We've ha
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata-Query-Service, Discovery-Search (Current work), Operations. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: At the mo
Gehel added a subtask: T207656: WDQS logging to logstash should be rate limited.
TASK DETAIL: https://phabricator.wikimedia.org/T207817
Gehel added a comment.
Some minimal packet drop is still seen (< 100 packets / 24h), so the situation is much better. More work needs to be done on limiting CPU usage on the Blazegraph side.
TASK DETAIL: https://phabricator.wikimedia.org/T206105
Gehel created this task. Gehel added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: Now that we automatically deploy code on wdqs test servers, we should also validate that this code works
Gehel added a comment.
A few wishes I have from an operations point of view for any replacement. Those are not necessarily mandatory, but we should evaluate them at some point:
ability to scale both read and write load across multiple nodes
ability to limit resource consumption to fail
Gehel added a comment.
@Smalyshev if you could take a heap dump of blazegraph under load, we might be able to trace more precisely where this unnamed thread pool is coming from. Feel free to send me the dump for analysis.
TASK DETAIL: https://phabricator.wikimedia.org/T206880
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint.
TASK DETAIL: https://phabricator.wikimedia.org/T206423
Gehel added a subscriber: Mathew.onipe. Gehel added a comment.
In T199228#4655321, @Smalyshev wrote:
I think update lag is not the biggest issue. Endpoint availability and response times is more important for most of the users, at least short-term. If there's a lag spike that goes away, most users
Gehel added a comment.
The main contention point for WDQS (or for any alternative we investigate) seems to be IOPS. We tried setting up a wdqs test instance on WMCS, but IO contention meant that we were not able to keep up with the update flow. Our production instances consume ~3-4K IOPS just for updates
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations, cloud-services-team. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: Some testing needs resources that are not easily available on WMCS. Having
Gehel added a comment.
Coming back to this discussion, I'll try to make my point clearer:
The wdqs public endpoint is by nature more fragile than most of our other services. The update lag is a good example of a problem we don't seem to be able to get under control on the public endpoint
Gehel added a comment.
With some trial and error, it looks like smp_affinity = 00ff00ff would allow the IRQ to be managed by any CPU, but it is still handled by the first one (in this case, any == CPU0). Setting each IRQ on a specific CPU (and one only) will spread them. It looks like puppet
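To make the masks concrete: smp_affinity is a hexadecimal CPU bitmask, so pinning an IRQ to exactly one CPU means writing a one-bit mask per IRQ. A small sketch (the 8-CPU count and the IRQ number in the comment are assumptions, not taken from the hosts in question):

```shell
# One-bit affinity mask per CPU; compare with 00ff00ff, which allows a
# whole set of CPUs but in practice leaves everything on the first one.
for cpu in 0 1 2 3 4 5 6 7; do
  printf 'CPU%d mask: %02x\n' "$cpu" $((1 << cpu))
done

# Applying it needs root, and IRQ numbers are host-specific, e.g.:
#   echo 08 > /proc/irq/42/smp_affinity   # pin IRQ 42 to CPU3
```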
Gehel added a comment.
Looking at dropped packets, it looks like we did not have any over the last few days. So we have another cause for our lag. Also note that while the issue still seems more present on wdqs2003, we also see issues with other nodes.
Gehel added a comment.
Looking at Grafana I can see spikes in batch progress that correlate with drops in lag. Zooming in, I can even see negative drops in batch progress, which should not happen. I suspect our metrics are skewed by the non-monotonic nature of kafka updates (just a guess). Since
Gehel added a comment.
@mobrovac thanks for the fast response!
I was wondering if we had a cleaner way to declare that a scap::target manages multiple services, but it seems that's not the case.
TASK DETAIL: https://phabricator.wikimedia.org/T206303
Gehel claimed this task. Gehel moved this task from Backlog to In progress on the Discovery-Wikidata-Query-Service-Sprint board.
TASK DETAIL: https://phabricator.wikimedia.org/T206105
Gehel added a subscriber: BBlack. Gehel added a comment.
My current understanding of the issue:
All IRQs from the NIC are handled by a single CPU. Under load, Blazegraph saturates this CPU (and others); this creates CPU contention with the NIC IRQs and leads to packets being dropped. Note that we also
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
Actionable tasks have been created, the investigation itself is done.
TASK DETAIL: https://phabricator.wikimedia.org/T200563
Gehel triaged this task as "Normal" priority.
TASK DETAIL: https://phabricator.wikimedia.org/T206121
Gehel created this task. Gehel added projects: Discovery-Wikidata-Query-Service-Sprint, Operations, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: Logging config of WDQS is somewhat of a mess. The goal
Gehel triaged this task as "High" priority.
TASK DETAIL: https://phabricator.wikimedia.org/T206105
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint, Operations. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: While investigating T200563, it was found that wdqs[12]003
Gehel closed this task as "Resolved". Gehel claimed this task. Gehel added a comment.
The specifics of this task are being addressed in T205607 and T200594 (most specifically in https://github.com/kartotherian/geoshapes/pull/1). I'm closing this task as the actual work is being tr
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint.
TASK DETAIL: https://phabricator.wikimedia.org/T200563
Gehel added a comment.
In T200563#4623531, @Smalyshev wrote:
Great work!
Thanks (I'll forward to @Volans)
I am not sure though why logging would be that much of an issue, shouldn't the log code take care of batching it, etc.? As for not logging nginx - do we have these logs somewhere else
Gehel added a subscriber: Volans. Gehel added a comment.
All credit for the findings below goes to @Volans:
we have some dropped packets on the NICs, both on wdqs[12]003 and other servers, but higher on wdqs[12]003.
NIC interrupts are processed only by CPU0 (see /proc/interrupts), we could spread
Gehel added a comment.
I'm late to the party, so a few notes in no particular order:
WDQS queries from Kartotherian are arbitrary, and it is not really possible to restrict them without heavily impacting functionality. In most cases they will come from a user editing a tag, so we have some
Gehel created this task. Gehel added projects: Operations, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: Currently, we don't have cumin aliases for each individual wdqs cluster (see aliases.yaml.erb
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations, Wikimedia-Logstash. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata.
TASK DESCRIPTION: We recently had cases of wdqs sending >10K logs per second to logst
Gehel added a comment.
It looks like there is a correlation between bot activity on wikidata query service (T202765) and the rate of those errors. This would tend to indicate that the cause of this issue is load on wdqs and not a slowdown on wikidata. I don't have any explanation of the causality chain
Gehel added a comment.
The issue as seen from WDQS can be followed on logstash.
TASK DETAIL: https://phabricator.wikimedia.org/T202764
Gehel moved this task from Backlog to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
New SSD in place, server reimaged and data reimported. We're all good!
TASK DETAIL: https://phabricator.wikimedia.org/T202777
Gehel added a comment.
New SSD in place, server reimaged and data reimported. We're all good!
TASK DETAIL: https://phabricator.wikimedia.org/T202779
Gehel closed this task as "Resolved". Gehel claimed this task. Gehel added a comment.
New SSD in place, server reimaged and data reimported. We're all good!
TASK DETAIL: https://phabricator.wikimedia.org/T196485
Gehel added a comment.
New SSD in place, server reimaged and data reimported. We're all good!
TASK DETAIL: https://phabricator.wikimedia.org/T202778
Gehel moved this task from Backlog to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
New SSD in place, server reimaged and data reimported. We're all good!
TASK DETAIL: https://phabricator.wikimedia.org/T202780
Gehel added a comment.
Looking a bit into this, it does not look like blazegraph has a deep integration with Jetty (why it even has any dependency on Jetty is a mystery to me). So repackaging with a more recent jetty-http (or the whole Jetty stack) might not be that hard (well, it is trivial
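As a sketch, forcing a newer jetty-http during repackaging could be done with Maven dependency management (the version placeholder below is illustrative, not a tested recommendation):

```xml
<!-- Sketch: override the jetty-http Blazegraph pulls in transitively.
     Replace 9.4.x with a concrete, tested version. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>jetty-http</artifactId>
      <version>9.4.x</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```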
Gehel added a comment.
error during reimage of wdqs2001:
[!!] Partition disks
Error while setting up RAID
An unexpected
Gehel added a comment.
@Cmjohnson: wdqs1004 is back in rotation, ping me when you have time for the next one (we also have T202780).
TASK DETAIL: https://phabricator.wikimedia.org/T202779
Gehel added a comment.
@Papaul: I'm ready to reimage wdqs2002 today. Ping me when you're around and I'll shut it down.
TASK DETAIL: https://phabricator.wikimedia.org/T202777
Gehel added a comment.
Digging into this a bit more from the WDQS side, we see a few interesting things:
The NoHttpResponseException does not seem to be a client-side timeout, but an empty response (not even headers), with a state transition. It looks similar to what we would see if an intermediate
Gehel added a comment.
Note that data import after reimage can be done by copying over data from wdqs1010, which has been reimported recently. The procedure is documented at https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_transfer_procedure.
Gehel added a subscriber: Mathew.onipe. Gehel added a comment.
@Papaul: we'll start by reimaging wdqs2003 (wdqs200[12] to follow). We'll reimage them one by one, to ensure that we have at most 1 host down in the cluster at any time.
@Papaul: ping me when you are around, and I'll depool / shutdown
Gehel added a subscriber: Mathew.onipe. Gehel added a comment.
To avoid duplicating information on each of the child tasks, I'll add anything that is common to all of them on this task.
We'll take this opportunity to reimage the systems, so that we can validate that we have a working partman configuration with the new
Gehel added a comment.
wdqs1003 is a bit older, purchased on 2016-12-02 vs 2017-06-29 for wdqs100[45].
Looking at CPU info, it seems that wdqs1003 does have the same CPU model (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz) as wdqs100[45], but CPU max MHz: 2100., vs CPU max MHz
Gehel added a comment.
Damn, we already set minReleaseAge=1 in RWStore.properties. We need to be looking for something else.
TASK DETAIL: https://phabricator.wikimedia.org/T200202
Gehel added a comment.
It looks like there is some configuration around the release of historical data. Setting com.bigdata.service.AbstractTransactionService.minReleaseAge=1 might allow reclaiming space.
TASK DETAIL: https://phabricator.wikimedia.org/T200202
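For reference, the corresponding RWStore.properties fragment would be a single line (my assumption is that the value is an age in milliseconds, so 1 means old commit points can be released almost immediately; treat that as an assumption, not a verified fact):

```properties
# Allow Blazegraph to recycle space from old commit points almost
# immediately (value assumed to be in milliseconds)
com.bigdata.service.AbstractTransactionService.minReleaseAge=1
```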
Gehel added a comment.
Looking at http://localhost:/bigdata/#namespaces it seems that the categories namespaces are deleted. But maybe the disk space is not recovered on deletion?
TASK DETAIL: https://phabricator.wikimedia.org/T200202
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations. Herald added a subscriber: Aklapper. Herald added a project: Wikidata.
TASK DESCRIPTION: We are getting low on disk for WDQS servers. This is being addressed in T196485. In the meantime, while looking at graphs, we see
Gehel added a comment.
I think response times and number of timeouts are not a good metric for this type of thing
To echo what @Smalyshev is saying, yes, I agree that we don't have good measures of either the reliability or the performance of WDQS. And it is somewhat related to the fact that we
Gehel added a comment.
You can have a look at the historical values we have for
update lag: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m=8=1=now-30d=now
response times: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service-frontend?panelId=13=1=now-30d
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint, Operations. Herald added a subscriber: Aklapper. Herald added a project: Wikidata.
TASK DESCRIPTION: As noted in a number of other places, a public SPARQL endpoint is fragile in nature. We
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint. Herald added a subscriber: Aklapper. Herald added a project: Wikidata.
TASK DESCRIPTION: As discovered in T199146, WDQS uses the external endpoint (www.wikidata.org) through a proxy to talk
Gehel added a comment.
In T199146#4409514, @Smalyshev wrote:
Yeah looks like ipblocks table for wikidata has block on 2620:0:862:101:0:0:0:0/96 by user "Merlissimo" with comment 'Toolserver Range - no anon edits'.
That block goes from 2620::0862:0101:::: to
2620
Gehel added a comment.
It looks to me that the block is done by mediawiki itself (see P7355 for details):
< x-cache: cp1066 pass, cp1054 pass
< x-cache-status: pass
That looks like varnish just lets it through. (Note that I have no idea how those blocks work, just trying to guess
Gehel added a comment.
In T199146#4409455, @BBlack wrote:
This raises some questions that are probably unrelated to the problem at hand, but might affect things indirectly:
Why is an internal service (wdqs) querying a public endpoint? It should probably use private internal endpoints like
Gehel added a comment.
Looks good, we can close this.
TASK DETAIL: https://phabricator.wikimedia.org/T198055
Gehel closed this task as "Resolved". Gehel claimed this task.
TASK DETAIL: https://phabricator.wikimedia.org/T198055
Gehel closed subtask T198055: Investigate HTTP 500 on POST request to WDQS as "Resolved".
TASK DETAIL: https://phabricator.wikimedia.org/T198042
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
I think we're all done here.
TASK DETAIL: https://phabricator.wikimedia.org/T198042
Gehel updated the task description.
CHANGES TO TASK DESCRIPTION: While investigating T198042, we realized that there is a [[ https://logstash.wikimedia.org/goto/9fcb0f1cb5485506523fc10e61a9094c | high number of HTTP 500 errors ]] on POST requests to https://query.wikidata.org/sparql
Gehel added a comment.
The pattern of banned / throttled requests as seen on wdqs matches a pattern of HTTP 500 seen on varnish. It is the same user agent / IP. I was expecting all those banned / throttled requests to be 403 / 429, but it looks like this is not the case. Something is wrong...
Gehel created this task. Gehel added projects: Wikidata, Wikidata-Query-Service, Operations, Discovery, Discovery-Wikidata-Query-Service-Sprint. Herald added a subscriber: Aklapper.
TASK DESCRIPTION: As seen in T198042, WDQS has a number of threads stuck on logging. We should use an async logger
Gehel added a comment.
Looking at thread dumps on wdqs1005, there are > 5000 threads waiting on logging (see stack trace below). We could improve the situation with an AsyncAppender (probably a good idea anyway), but that's only treating the symptoms, not the root ca
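A minimal logback sketch of that AsyncAppender mitigation, wrapping whichever appender currently blocks (the appender name "logstash" is hypothetical, and the queue sizing would need tuning):

```xml
<!-- AsyncAppender decouples application threads from a slow logging sink;
     neverBlock drops events instead of blocking when the queue fills up. -->
<appender name="async-logstash" class="ch.qos.logback.classic.AsyncAppender">
  <queueSize>1024</queueSize>
  <neverBlock>true</neverBlock>
  <appender-ref ref="logstash"/>
</appender>
```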
Gehel added a comment.
wdqs1005 was lagging on updates. A few thread dumps for further analysis before restarting it: F22597390: threads.tar.gz
TASK DETAIL: https://phabricator.wikimedia.org/T198042
Gehel added a comment.
Situation is better, but still not entirely stable (I just restarted blazegraph on wdqs1004). Looking at logstash, the number of Haltable errors is high, but it has been just as high in the last 7 days without major issues.
Side note, we might want to have a log of long
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations, Discovery-Wikidata-Query-Service-Sprint. Herald added a subscriber: Aklapper. Herald added projects: Wikidata, Discovery.
TASK DESCRIPTION: WDQS has paged for the eqiad public cluster. Symptoms are:
high response times
Gehel added a comment.
We have a "sleeping" task to order new disks: T186526
TASK DETAIL: https://phabricator.wikimedia.org/T196485
Gehel added a comment.
@RobH: thanks!
TASK DETAIL: https://phabricator.wikimedia.org/T195797
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
Data load is complete, this can be closed.
TASK DETAIL: https://phabricator.wikimedia.org/T194184
Gehel added subscribers: Pnorman, Mholloway. Gehel added a comment.
Looking at kartotherian configuration, I can't find a reference to wdqs, so I presume this is hardcoded.
We do configure a proxy in the kartotherian and tilerator configs. I'm not sure if it is used for anything else
Gehel added a comment.
Investigation on T192759 led to some interesting discoveries.
Blazegraph Journal uses an unbounded executor service. Under high load (either because of more queries or more expensive queries), this executor creates a large number of threads for a short duration. We find
Gehel moved this task from In progress to Needs review on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment.
cgroup limits have been bumped to 10'000 pids, which seems to be enough so far. We have not received any alerts about blazegraph timeouts since then, so that's probably