[Wikidata-bugs] [Maniphest] [Created] T213217: Copy database from wdq[345] to wdq7 and wdq8

2019-01-08 Thread Gehel
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION As described in parent task, we need to reset data on wdqs100[78]. TASK D

[Wikidata-bugs] [Maniphest] [Commented On] T213134: wdqs1007 database corruption

2019-01-08 Thread Gehel
Gehel added a comment. In T213134#4861540, @Smalyshev wrote: Same happening with wdq8, 3 hours later. Something spooky is going on... Will talk tomorrow morning with Bryan from Blazegraph, not sure if it's possible to do anything till then. I'm not touching wdqs100[78] yet, so that you have

[Wikidata-bugs] [Maniphest] [Commented On] T210431: WDQS puppet/hiera configs are too distributed

2018-12-03 Thread Gehel
Gehel added a comment. What we should probably do in this case is define default values for the hiera calls in profile::wdqs, and override only what needs to be different, at least for parameters where a default would make sense. TASK DETAIL https://phabricator.wikimedia.org/T210431
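The default-plus-override idea in the comment above can be sketched as follows. This is a minimal Python illustration of the lookup logic only, not Puppet code; the parameter names (`heap_size`, `use_kafka`, `log_level`) and their values are hypothetical and do not come from the actual profile::wdqs.

```python
# Hypothetical profile defaults; names and values are illustrative only.
PROFILE_DEFAULTS = {
    "heap_size": "16g",
    "use_kafka": True,
    "log_level": "INFO",
}


def resolve_params(overrides, defaults=PROFILE_DEFAULTS):
    """Return the defaults with per-host or per-cluster overrides applied."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Reject overrides for parameters that have no declared default.
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


# A host that differs in a single setting only declares that setting.
print(resolve_params({"use_kafka": False}))
```

With this shape, each host or cluster only carries the values that differ, which is the reduction in scattered configuration the task asks for.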

[Wikidata-bugs] [Maniphest] [Commented On] T210903: Report build ID when launching WDQS/Updater

2018-12-03 Thread Gehel
Gehel added a comment. We already have the git-commit-id plugin configured, which creates a properties file and adds it to the jars. So we should be able to load it and output whatever we need. There is probably a jar somewhere with the logic required to parse that properties file, but it's
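As a rough illustration of "load it and output whatever we need", the sketch below parses a properties-style file and builds a one-line build banner. It is written in Python for brevity (the service itself is Java), it handles only plain `key=value` lines, and the key names `git.commit.id.abbrev` and `git.build.time` as well as the sample values are assumptions, not taken from the actual WDQS jars.

```python
def parse_properties(text):
    """Parse simple Java-style .properties text into a dict.

    Only comments and key=value pairs are handled; line continuations
    and unicode escapes are deliberately out of scope for this sketch.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # Java escapes ':' in values as '\:'; undo that escape.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def build_banner(props):
    """Format the build ID line to log at startup."""
    commit = props.get("git.commit.id.abbrev", "unknown")
    built = props.get("git.build.time", "unknown")
    return f"build {commit} ({built})"


# Made-up sample content; the key names are assumed, not verified.
SAMPLE = """\
# sample properties file (made-up values)
git.commit.id.abbrev=abc1234
git.build.time=2018-12-03T10\\:00\\:00+0000
"""

print(build_banner(parse_properties(SAMPLE)))
```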

[Wikidata-bugs] [Maniphest] [Commented On] T207665: Run test queries automatically on wdqs autodeployed servers

2018-11-29 Thread Gehel
Gehel added a comment. https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/474266 has been merged and deployed, but the test queries are still not available on the wdqs servers. There is something I don't understand about the packaging. @Smalyshev could you have a look and point me to what I

[Wikidata-bugs] [Maniphest] [Updated] T207665: Run test queries automatically on wdqs autodeployed servers

2018-11-23 Thread Gehel
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint. TASK DETAIL https://phabricator.wikimedia.org/T207665 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Gehel Cc: gerritbot, Aklapper, Mathew.onipe, Smalyshev, Gehel, CucyNoiD, Nandana, NebulousIris

[Wikidata-bugs] [Maniphest] [Closed] T210169: Create an exim alias for wdqs administrator

2018-11-22 Thread Gehel
Gehel closed this task as "Resolved". Gehel claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T210169

[Wikidata-bugs] [Maniphest] [Unblock] T207665: Run test queries automatically on wdqs autodeployed servers

2018-11-22 Thread Gehel
Gehel closed subtask T210169: Create an exim alias for wdqs administrator as "Resolved". TASK DETAIL https://phabricator.wikimedia.org/T207665

[Wikidata-bugs] [Maniphest] [Created] T210169: Create an exim alias for wdqs administrator

2018-11-22 Thread Gehel
Gehel created this task. Gehel triaged this task as "Normal" priority. Gehel added projects: Wikidata-Query-Service, Wikidata. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION This will be used as part of parent task to notify wdqs admins of issues. It should contain:

[Wikidata-bugs] [Maniphest] [Commented On] T206123: Monitor query / request concurrency on Blazegraph

2018-11-15 Thread Gehel
Gehel added a comment. Looks like the new metric is flowing to Prometheus. TASK DETAIL https://phabricator.wikimedia.org/T206123

[Wikidata-bugs] [Maniphest] [Claimed] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-09 Thread Gehel
Gehel claimed this task. TASK DETAIL https://phabricator.wikimedia.org/T199228

[Wikidata-bugs] [Maniphest] [Changed Project Column] T207834: Cleanup Wikidata Query Service logging configuration

2018-11-07 Thread Gehel
Gehel moved this task from Needs review to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. @Smalyshev we're good on this from my point of view. Could you check that running updater manually (with the -S option to output to console, or with -v for verbose) works

[Wikidata-bugs] [Maniphest] [Updated] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel removed a project: Patch-For-Review. TASK DETAIL https://phabricator.wikimedia.org/T199228

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment. In T199228#4715898, @Pintoch wrote: The search interface can also be used for that thanks to the haswbstatement command. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface. The lag

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment. In T199228#4715863, @Magnus wrote: Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks for a large number of scientific publications if they exist on Wikidata or not. If not, it creates them. Now, if people have

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment. Thanks for the feedback! In T199228#4715815, @Jheald wrote: This requires WDQS to be reasonably up to date most of the time. A lag of 5 minutes isn't such a problem. An occasional longer lag, if clearly signposted as the WDQS GUI does, also isn't such a problem

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment. In T199228#4710863, @Pintoch wrote: What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern. I'd be most interested in how well this is going at the moment! The open for all and widely varying cost

[Wikidata-bugs] [Maniphest] [Commented On] T202764: Wikidata produces a lot of failed requests for recentchanges API

2018-11-01 Thread Gehel
Gehel added a comment. For context, T202765 is about a bot sending annoying and somewhat expensive requests. That specific issue is now resolved. TASK DETAIL https://phabricator.wikimedia.org/T202764

[Wikidata-bugs] [Maniphest] [Commented On] T207834: Cleanup Wikidata Query Service logging configuration

2018-10-30 Thread Gehel
Gehel added a comment. New configuration deployed, but it raises some deprecations and needs some tuning. TASK DETAIL https://phabricator.wikimedia.org/T207834

[Wikidata-bugs] [Maniphest] [Commented On] T207837: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers

2018-10-25 Thread Gehel
Gehel added a comment. There are 3 issues here, and maybe they should be addressed on different tickets: isolating updater from blazegraph: this is about reducing the interactions between the 2 components to what is essential, increasing robustness and simplifying investigation into any failure

[Wikidata-bugs] [Maniphest] [Created] T207947: https://phabricator.wikimedia.org/T200563 Switch wdqs1003 with one of the internal wdqs cluster

2018-10-25 Thread Gehel
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION Since wdqs1003 is acting differently from oth

[Wikidata-bugs] [Maniphest] [Commented On] T206636: Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service

2018-10-25 Thread Gehel
Gehel added a comment. In T206636#4690384, @Smalyshev wrote: @Andrew Also looks like there is some puppet issue there: Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error

[Wikidata-bugs] [Maniphest] [Commented On] T207834: Cleanup Wikidata Query Service logging configuration

2018-10-24 Thread Gehel
Gehel added a comment. My current patch is trying to put all that logic into logback.xml, but it is definitely starting to be unreadable. And coding ifs in XML just seems wrong :/ I think that instead we should have different static logback.xml files and have a flag to switch between them

[Wikidata-bugs] [Maniphest] [Commented On] T207817: WDQS Updater ran into issue and stopped working

2018-10-24 Thread Gehel
Gehel added a comment. In T207817#4691885, @Ottomata wrote: Interesting! I checked Jodatime stuff to make sure one of our Java based pipeline handled the timestamp format change, I'm surprised that Jackson can't parse this! We can definitely parse that format if we wanted to, but we do have

[Wikidata-bugs] [Maniphest] [Commented On] T207817: WDQS Updater ran into issue and stopped working

2018-10-24 Thread Gehel
Gehel added a comment. In T207817#4691569, @mmodell wrote: Do we have a patch or should we roll back group0? We have a workaround on the WDQS side (switching back to recent changes instead of kafka events). But the root cause isn't fixed, and it is unclear to me what change caused that issue

[Wikidata-bugs] [Maniphest] [Created] T207843: increase restart interval of wdqs updater

2018-10-24 Thread Gehel
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata, Wikidata-Query-Service, Discovery-Search (Current work), Operations. Restricted Application added a subscriber: Aklapper. TASK DESCRIPTION wdqs updater is expected to exit on a number of fail

[Wikidata-bugs] [Maniphest] [Created] T207837: wdqs updater should be better isolated from blazegraph and common workload should be shared between servers

2018-10-24 Thread Gehel
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata-Query-Service, Discovery-Search (Current work), Operations. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION We've ha

[Wikidata-bugs] [Maniphest] [Created] T207834: Cleanup Wikidata Query Service logging configuration

2018-10-24 Thread Gehel
Gehel created this task. Gehel triaged this task as "High" priority. Gehel added projects: Wikidata-Query-Service, Discovery-Search (Current work), Operations. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION At the mo

[Wikidata-bugs] [Maniphest] [Updated] T207817: WDQS Updater ran into issue and stopped working

2018-10-24 Thread Gehel
Gehel added a subtask: T207656: WDQS logging to logstash should be rate limited. TASK DETAIL https://phabricator.wikimedia.org/T207817

[Wikidata-bugs] [Maniphest] [Commented On] T206105: Optimize networking configuration for WDQS

2018-10-22 Thread Gehel
Gehel added a comment. Some minimal packet drop is still seen (< 100 packets / 24h), so the situation is much better. More work needs to be done on limiting CPU usage on the blazegraph side. TASK DETAIL https://phabricator.wikimedia.org/T206105

[Wikidata-bugs] [Maniphest] [Created] T207665: Run test queries automatically on wdqs autodeployed servers

2018-10-22 Thread Gehel
Gehel created this task. Gehel added a project: Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION Now that we automatically deploy code on wdqs test servers, we should also validate that this code works

[Wikidata-bugs] [Maniphest] [Commented On] T206560: [Epic] Evaluate alternatives to Blazegraph

2018-10-19 Thread Gehel
Gehel added a comment. A few wishes I have from an operations point of view for any replacement. Those are not necessarily mandatory, but we should evaluate them at some point: ability to scale both read and write load across multiple nodes; ability to limit resource consumption to fail

[Wikidata-bugs] [Maniphest] [Commented On] T206880: Investigate runaway Blazegraph threads

2018-10-17 Thread Gehel
Gehel added a comment. @Smalyshev if you could take a heap dump of blazegraph under load, we might be able to trace more precisely where this unnamed thread pool is coming from. Feel free to send me the dump for analysis. TASK DETAIL https://phabricator.wikimedia.org/T206880

[Wikidata-bugs] [Maniphest] [Updated] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn

2018-10-15 Thread Gehel
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint. TASK DETAIL https://phabricator.wikimedia.org/T206423

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-11 Thread Gehel
Gehel added a subscriber: Mathew.onipe. Gehel added a comment. In T199228#4655321, @Smalyshev wrote: I think update lag is not the biggest issue. Endpoint availability and response times is more important for most of the users, at least short-term. If there's a lag spike that goes away, most users

[Wikidata-bugs] [Maniphest] [Commented On] T206636: Provide a way to have test servers on real hardware, isolated from production

2018-10-11 Thread Gehel
Gehel added a comment. The main contention point for WDQS (or investigating alternatives) seems to be IOPS. We tried setting up a wdqs test instance on WMCS, but IO contention meant that we were not able to keep up with the update flow. Our production instances consume ~3-4K IOPS just for updates

[Wikidata-bugs] [Maniphest] [Created] T206636: Provide a way to have test servers on real hardware, isolated from production

2018-10-10 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations, cloud-services-team. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION Some testing needs resources that are not easily available on WMCS. Having

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-10 Thread Gehel
Gehel added a comment. Coming back to this discussion, I'll try to make my point more clear: wdqs public endpoint is by nature a service more fragile than most of our other services. The update lag is a good example of a problem we don't seem to be able to get under control on the public endpoint

[Wikidata-bugs] [Maniphest] [Commented On] T206105: Optimize networking configuration for WDQS

2018-10-10 Thread Gehel
Gehel added a comment. With some trial and error, it looks like smp_affinity = 00ff00ff would allow the IRQ to be managed by any CPU, but it is still managed by the first one (in this case, any == CPU0). Setting each IRQ on a specific CPU (and one only) will spread them. It looks like puppet
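The "one IRQ, one CPU" approach described above can be sketched as a small mask calculation. This is a hedged Python illustration that only computes the values one might write to `/proc/irq/<n>/smp_affinity`; the IRQ numbers are hypothetical, nothing is written to the system, and the 8-digit format assumes at most 32 CPUs.

```python
def cpus_in_mask(mask):
    """List the CPU indices whose bit is set in an affinity bitmask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def single_cpu_mask(cpu):
    """Hex bitmask selecting exactly one CPU (valid for CPUs 0-31)."""
    return format(1 << cpu, "08x")


def spread_irqs(irqs, allowed_mask):
    """Pin each IRQ to a single allowed CPU, assigned round-robin."""
    cpus = cpus_in_mask(allowed_mask)
    if not cpus:
        raise ValueError("affinity mask selects no CPU")
    return {irq: single_cpu_mask(cpus[i % len(cpus)])
            for i, irq in enumerate(irqs)}


# Hypothetical IRQ numbers; real ones would come from /proc/interrupts.
plan = spread_irqs([100, 101, 102, 103], 0x00FF00FF)
for irq, mask in plan.items():
    print(f"/proc/irq/{irq}/smp_affinity <- {mask}")
```

A mask with several bits set (such as 00ff00ff) only says which CPUs are permitted; giving each IRQ a single-bit mask is what forces the spread.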

[Wikidata-bugs] [Maniphest] [Commented On] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn

2018-10-09 Thread Gehel
Gehel added a comment. Looking at dropped packets, it looks like we did not have any over the last few days. So we have another cause for our lag. Also note that while the issue still seems more present on wdqs2003, we also see issues with other nodes. TASK DETAIL https://phabricator.wikimedia.org

[Wikidata-bugs] [Maniphest] [Commented On] T206423: The usual Lag pattern for wdqs2003 seems to be taking another turn

2018-10-08 Thread Gehel
Gehel added a comment. Looking at Grafana I can see spikes in batch progress that correlate with drops in lag. Zooming in, I can even see negative drops in batch progress, which should not happen. I suspect our metrics are skewed by the non-monotonic nature of Kafka updates (just a guess). Since

[Wikidata-bugs] [Maniphest] [Commented On] T206303: Add sudo rules for wdqs-updater in puppet

2018-10-05 Thread Gehel
Gehel added a comment. @mobrovac thanks for the fast response! I was wondering if we had a cleaner way to declare that a scap::target manages multiple services, but it seems that's not the case. TASK DETAIL https://phabricator.wikimedia.org/T206303

[Wikidata-bugs] [Maniphest] [Claimed] T206105: Optimize networking configuration for WDQS

2018-10-04 Thread Gehel
Gehel claimed this task. Gehel moved this task from Backlog to In progress on the Discovery-Wikidata-Query-Service-Sprint board. TASK DETAIL https://phabricator.wikimedia.org/T206105 WORKBOARD https://phabricator.wikimedia.org/project/board/1239/

[Wikidata-bugs] [Maniphest] [Updated] T206105: Optimize networking configuration for WDQS

2018-10-04 Thread Gehel
Gehel added a subscriber: BBlack. Gehel added a comment. My current understanding of the issue: All IRQs from the NIC are handled by a single CPU. Under load, Blazegraph saturates this CPU (and others), which creates CPU contention with the NIC IRQ and leads to packets being dropped. Note that we also

[Wikidata-bugs] [Maniphest] [Changed Project Column] T200563: wdq1003 is anomalous

2018-10-03 Thread Gehel
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. Actionable tasks have been created, the investigation itself is done. TASK DETAIL https://phabricator.wikimedia.org/T200563 WORKBOARD https://phabricator.wikimedia.org/project

[Wikidata-bugs] [Maniphest] [Triaged] T206121: Cleanup WDQS logging configuration

2018-10-03 Thread Gehel
Gehel triaged this task as "Normal" priority. TASK DETAIL https://phabricator.wikimedia.org/T206121

[Wikidata-bugs] [Maniphest] [Created] T206121: Cleanup WDQS logging configuration

2018-10-03 Thread Gehel
Gehel created this task. Gehel added projects: Discovery-Wikidata-Query-Service-Sprint, Operations, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION Logging config of WDQS is somewhat of a mess. The goal

[Wikidata-bugs] [Maniphest] [Triaged] T206105: Optimize networking configuration for WDQS

2018-10-03 Thread Gehel
Gehel triaged this task as "High" priority. TASK DETAIL https://phabricator.wikimedia.org/T206105

[Wikidata-bugs] [Maniphest] [Created] T206105: Optimize networking configuration for WDQS

2018-10-03 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint, Operations. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION While investigating T200563, it was found that wdqs[12]003

[Wikidata-bugs] [Maniphest] [Closed] T195438: WDQS requests from aluminium.wikimedia.org being throttled

2018-10-03 Thread Gehel
Gehel closed this task as "Resolved". Gehel claimed this task. Gehel added a comment. The specifics of this task are being addressed in T205607 and T200594 (most specifically in https://github.com/kartotherian/geoshapes/pull/1). I'm closing this task as the actual work is being tr

[Wikidata-bugs] [Maniphest] [Updated] T200563: wdq1003 is anomalous

2018-10-02 Thread Gehel
Gehel added a project: Discovery-Wikidata-Query-Service-Sprint. TASK DETAIL https://phabricator.wikimedia.org/T200563

[Wikidata-bugs] [Maniphest] [Commented On] T200563: wdq1003 is anomalous

2018-09-28 Thread Gehel
Gehel added a comment. In T200563#4623531, @Smalyshev wrote: Great work! Thanks (I'll forward to @Volans) I am not sure though why logging would be that much of an issue, shouldn't the log code take care of batching it, etc.? As for not logging nginx - do we have these logs somewhere else

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T200563: wdq1003 is anomalous

2018-09-27 Thread Gehel
Gehel added a subscriber: Volans. Gehel added a comment. All credit for the findings below goes to @Volans: we have some dropped packets on the NICs, both on wdqs[12]003 and other servers, but higher on wdqs[12]003. NIC interrupts are processed only by CPU0 (see /proc/interrupts), we could spread

[Wikidata-bugs] [Maniphest] [Commented On] T200594: Add client identifier to requests sent from Kartotherian to WDQS

2018-09-27 Thread Gehel
Gehel added a comment. I'm late to the party, so a few notes in no particular order: WDQS queries from Kartotherian are arbitrary, and it is not really possible to restrict them without heavily impacting functionality. In most cases they will come from a user editing a tag, so we have some

[Wikidata-bugs] [Maniphest] [Created] T205542: Add cumin aliases for each wdqs clusters

2018-09-26 Thread Gehel
Gehel created this task. Gehel added projects: Operations, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION Currently, we don't have cumin aliases for each individual wdqs cluster (see aliases.yaml.erb

[Wikidata-bugs] [Maniphest] [Created] T204364: Rate limit wdqs logs

2018-09-14 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations, Wikimedia-Logstash. Restricted Application added a subscriber: Aklapper. Restricted Application added a project: Wikidata. TASK DESCRIPTION We recently had cases of wdqs sending >10K logs per second to logst

[Wikidata-bugs] [Maniphest] [Updated] T202764: Wikidata produces a lot of failed requests for recentchanges API

2018-09-11 Thread Gehel
Gehel added a comment. It looks like there is a correlation between bot activity on wikidata query service (T202765) and the rate of those errors. This would tend to indicate that the cause of this issue is load on wdqs and not slowdown on wikidata. I don't have any explanation of the causality chain

[Wikidata-bugs] [Maniphest] [Commented On] T202764: Wikidata produces a lot of failed requests for recentchanges API

2018-09-11 Thread Gehel
Gehel added a comment. The issue as seen from WDQS can be followed on logstash. TASK DETAIL https://phabricator.wikimedia.org/T202764

[Wikidata-bugs] [Maniphest] [Changed Project Column] T202777: add SSDs to wdqs200[12]

2018-09-06 Thread Gehel
Gehel moved this task from Backlog to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. New SSD in place, server reimaged and data reimported. We're all good! TASK DETAIL https://phabricator.wikimedia.org/T202777 WORKBOARD https://phabricator.wikimedia.org/project/board

[Wikidata-bugs] [Maniphest] [Commented On] T202779: add SSDs to wdqs100[45]

2018-09-06 Thread Gehel
Gehel added a comment. New SSD in place, server reimaged and data reimported. We're all good! TASK DETAIL https://phabricator.wikimedia.org/T202779

[Wikidata-bugs] [Maniphest] [Closed] T196485: WDQS diskspace is low

2018-09-06 Thread Gehel
Gehel closed this task as "Resolved". Gehel claimed this task. Gehel added a comment. New SSD in place, server reimaged and data reimported. We're all good! TASK DETAIL https://phabricator.wikimedia.org/T196485

[Wikidata-bugs] [Maniphest] [Commented On] T202778: add ssds to wdqs2003

2018-09-06 Thread Gehel
Gehel added a comment. New SSD in place, server reimaged and data reimported. We're all good! TASK DETAIL https://phabricator.wikimedia.org/T202778

[Wikidata-bugs] [Maniphest] [Changed Project Column] T202780: add SSDs to wdqs1003

2018-09-06 Thread Gehel
Gehel moved this task from Backlog to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. New SSD in place, server reimaged and data reimported. We're all good! TASK DETAIL https://phabricator.wikimedia.org/T202780 WORKBOARD https://phabricator.wikimedia.org/project/board

[Wikidata-bugs] [Maniphest] [Commented On] T202785: Federation request to https://ld.stadt-zuerich.ch/query fails

2018-08-31 Thread Gehel
Gehel added a comment. Looking a bit into this, it does not look like blazegraph has a deep integration with Jetty (why it even has any dependency on Jetty is a mystery to me). So repackaging with a more recent jetty-http (or the whole jetty stack) might not be that hard (well, it is trivial

[Wikidata-bugs] [Maniphest] [Commented On] T202777: add SSDs to wdqs200[12]

2018-08-29 Thread Gehel
Gehel added a comment. Error during reimage of wdqs2001: [!!] Partition disks / Error while setting up RAID / An unexpected

[Wikidata-bugs] [Maniphest] [Updated] T202779: add SSDs to wdqs100[45]

2018-08-29 Thread Gehel
Gehel added a comment. @Cmjohnson wdqs1004 is back into rotation, ping me when you have time for the next one (we also have T202780). TASK DETAIL https://phabricator.wikimedia.org/T202779

[Wikidata-bugs] [Maniphest] [Commented On] T202777: add SSDs to wdqs200[12]

2018-08-28 Thread Gehel
Gehel added a comment. @Papaul: I'm ready to reimage wdqs2002 today. Ping me when you're around and I'll shut it down. TASK DETAIL https://phabricator.wikimedia.org/T202777

[Wikidata-bugs] [Maniphest] [Commented On] T202764: Wikidata produces a lot of failed requests for recentchanges API

2018-08-27 Thread Gehel
Gehel added a comment. Digging into this a bit more from the WDQS side, we see a few interesting things: The NoHttpResponseException seems to not be a timeout client side, but an empty response (not even headers), with a state transition. It looks similar to what we would see if an intermediate

[Wikidata-bugs] [Maniphest] [Commented On] T196485: WDQS diskspace is low

2018-08-27 Thread Gehel
Gehel added a comment. Note that data import after reimage can be done by copying over data from wdqs1010, which has been reimported recently. Procedure is documented on https://wikitech.wikimedia.org/wiki/Wikidata_query_service#Data_transfer_procedure. TASK DETAIL https://phabricator.wikimedia.org

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T202778: add ssds to wdqs2003

2018-08-27 Thread Gehel
Gehel added a subscriber: Mathew.onipe. Gehel added a comment. @Papaul: we'll start by reimaging wdqs2003 (wdqs200[12] to follow). We'll reimage them one by one, to ensure that we have at most 1 host down in the cluster at any time. @Papaul: ping me when you are around, and I'll depool / shutdown

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T196485: WDQS diskspace is low

2018-08-27 Thread Gehel
Gehel added a subscriber: Mathew.onipe. Gehel added a comment. To avoid duplicating information on each of the child tasks, I'll add anything that is common to all on this task. We'll take this occasion to reimage the systems, so that we can validate that we have a working partman configuration with the new

[Wikidata-bugs] [Maniphest] [Updated] T200563: wdq1003 is anomalous

2018-07-31 Thread Gehel
Gehel added a comment. wdqs1003 is a bit older, purchased on 2016-12-02 vs 2017-06-29 for wdqs100[45]. Looking at CPU infos, it seems that wdqs1003 does have the same CPU model (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz) as wdqs100[45], but CPU max MHz: 2100., vs CPU max MHz

[Wikidata-bugs] [Maniphest] [Commented On] T200202: WDQS disk usage increase is correlated with reloading of categories

2018-07-23 Thread Gehel
Gehel added a comment. Damn, we already set minReleaseAge=1 in RWStore.properties. We need to be looking for something else. TASK DETAIL https://phabricator.wikimedia.org/T200202

[Wikidata-bugs] [Maniphest] [Commented On] T200202: WDQS disk usage increase is correlated with reloading of categories

2018-07-23 Thread Gehel
Gehel added a comment. It looks like there is some configuration around the release of historical data. Setting com.bigdata.service.AbstractTransactionService.minReleaseAge=1 might allow us to reclaim space. TASK DETAIL https://phabricator.wikimedia.org/T200202

[Wikidata-bugs] [Maniphest] [Commented On] T200202: WDQS disk usage increase is correlated with reloading of categories

2018-07-23 Thread Gehel
Gehel added a comment. Looking at http://localhost:/bigdata/#namespaces it seems that categories namespaces are deleted. But maybe the disk space is not recovered on deletion? TASK DETAIL https://phabricator.wikimedia.org/T200202

[Wikidata-bugs] [Maniphest] [Created] T200202: WDQS disk usage increase is correlated with reloading of categories

2018-07-23 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations. Herald added a subscriber: Aklapper. Herald added a project: Wikidata. TASK DESCRIPTION We are getting low on disk for WDQS servers. This is being addressed in T196485. In the meantime, while looking at graphs, we see

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-12 Thread Gehel
Gehel added a comment. I think response times and number of timeouts are not a good metric for this type of thing To echo what @Smalyshev is saying, yes, I agree that we don't have good measures of either the reliability or the performances of WDQS. And it is somewhat related to the fact that we

[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-12 Thread Gehel
Gehel added a comment. You can have a look at the historical values we have for update lag: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m=8=1=now-30d=now response times: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service-frontend?panelId=13=1=now-30d

[Wikidata-bugs] [Maniphest] [Created] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-10 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint, Operations. Herald added a subscriber: Aklapper. Herald added a project: Wikidata. TASK DESCRIPTION: As noted in a number of other places, a public SPARQL endpoint is fragile in nature. We

[Wikidata-bugs] [Maniphest] [Created] T199219: WDQS should use internal endpoint to communicate to Wikidata

2018-07-10 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint. Herald added a subscriber: Aklapper. Herald added a project: Wikidata. TASK DESCRIPTION: As discovered in T199146, WDQS uses the external endpoint (www.wikidata.org) through a proxy to talk

[Wikidata-bugs] [Maniphest] [Commented On] T199146: "Blocked" response when trying to access constraintsrdf action from production host

2018-07-09 Thread Gehel
Gehel added a comment. In T199146#4409514, @Smalyshev wrote: Yeah looks like ipblocks table for wikidata has block on 2620:0:862:101:0:0:0:0/96 by user "Merlissimo" with comment 'Toolserver Range - no anon edits'. That block goes from 2620::0862:0101:::: to 2620

[Wikidata-bugs] [Maniphest] [Updated] T199146: "Blocked" response when trying to access constraintsrdf action from production host

2018-07-09 Thread Gehel
Gehel added a comment. It looks to me that the block is done by mediawiki itself (see P7355 for details): < x-cache: cp1066 pass, cp1054 pass < x-cache-status: pass That looks like varnish just lets it through. (Note that I have no idea how those blocks are working, just trying to guess

[Wikidata-bugs] [Maniphest] [Commented On] T199146: "Blocked" response when trying to access constraintsrdf action from production host

2018-07-09 Thread Gehel
Gehel added a comment. In T199146#4409455, @BBlack wrote: This raises some questions that are probably unrelated to the problem at hand, but might affect things indirectly: Why is an internal service (wdqs) querying a public endpoint? It should probably use private internal endpoints like

[Wikidata-bugs] [Maniphest] [Commented On] T198055: Investigate HTTP 500 on POST request to WDQS

2018-06-29 Thread Gehel
Gehel added a comment. Looks good, we can close this. TASK DETAIL: https://phabricator.wikimedia.org/T198055

[Wikidata-bugs] [Maniphest] [Closed] T198055: Investigate HTTP 500 on POST request to WDQS

2018-06-29 Thread Gehel
Gehel closed this task as "Resolved". Gehel claimed this task. TASK DETAIL: https://phabricator.wikimedia.org/T198055

[Wikidata-bugs] [Maniphest] [Unblock] T198042: WDQS timeout on the public eqiad cluster

2018-06-29 Thread Gehel
Gehel closed subtask T198055: Investigate HTTP 500 on POST request to WDQS as "Resolved". TASK DETAIL: https://phabricator.wikimedia.org/T198042

[Wikidata-bugs] [Maniphest] [Changed Project Column] T198042: WDQS timeout on the public eqiad cluster

2018-06-29 Thread Gehel
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. I think we're all done here. TASK DETAIL: https://phabricator.wikimedia.org/T198042 WORKBOARD: https://phabricator.wikimedia.org/project/board/1239/

[Wikidata-bugs] [Maniphest] [Edited] T198055: Investigate HTTP 500 on POST request to WDQS

2018-06-25 Thread Gehel
Gehel updated the task description. CHANGES TO TASK DESCRIPTION: While investigating T198042, we realized that there is a [[ https://logstash.wikimedia.org/goto/9fcb0f1cb5485506523fc10e61a9094c | high number of HTTP 500 errors ]] on POST requests to https://query.wikidata.org/sparql

[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster

2018-06-25 Thread Gehel
Gehel added a comment. The pattern of banned / throttled requests as seen on wdqs matches a pattern of HTTP 500 seen on varnish. It is the same user agent / IP. I was expecting all those banned / throttled requests to be 403 / 429, but it looks like this is not the case. Something is wrong...

[Wikidata-bugs] [Maniphest] [Created] T198051: Enable async logging on Wikidata Query Service

2018-06-25 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata, Wikidata-Query-Service, Operations, Discovery, Discovery-Wikidata-Query-Service-Sprint. Herald added a subscriber: Aklapper. TASK DESCRIPTION: As seen in T198042, WDQS has a number of threads stuck on logging. We should use an async logger

[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster

2018-06-25 Thread Gehel
Gehel added a comment. Looking at thread dumps on wdqs1005, there are > 5000 threads waiting on logging (see stack trace below). We could improve the situation with an AsyncAppender (probably a good idea anyway), but that's only treating the symptoms, not the root ca

[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster

2018-06-24 Thread Gehel
Gehel added a comment. wdqs1005 was lagging on updates. A few thread dumps for further analysis before restarting it: F22597390: threads.tar.gz TASK DETAIL: https://phabricator.wikimedia.org/T198042

[Wikidata-bugs] [Maniphest] [Commented On] T198042: WDQS timeout on the public eqiad cluster

2018-06-24 Thread Gehel
Gehel added a comment. Situation is better, but still not entirely stable (I just restarted blazegraph on wdqs1004). Looking at logstash, the number of Haltable errors is high, but it has been just as high in the last 7 days without major issues. Side note, we might want to have a log of long

[Wikidata-bugs] [Maniphest] [Created] T198042: WDQS timeout on the public eqiad cluster

2018-06-24 Thread Gehel
Gehel created this task. Gehel added projects: Wikidata-Query-Service, Operations, Discovery-Wikidata-Query-Service-Sprint. Herald added a subscriber: Aklapper. Herald added projects: Wikidata, Discovery. TASK DESCRIPTION: WDQS has paged for the eqiad public cluster. Symptoms are: high response times

[Wikidata-bugs] [Maniphest] [Updated] T196485: WDQS diskspace is low

2018-06-05 Thread Gehel
Gehel added a comment. We have a "sleeping" task to order new disks: T186526. TASK DETAIL: https://phabricator.wikimedia.org/T196485

[Wikidata-bugs] [Maniphest] [Commented On] T195797: Stas needs root access on WDQS test cluster

2018-06-04 Thread Gehel
Gehel added a comment. @RobH: thanks! TASK DETAIL: https://phabricator.wikimedia.org/T195797

[Wikidata-bugs] [Maniphest] [Changed Project Column] T194184: rack/setup/install wdqs10[09|10].eqiad.wmnet

2018-06-04 Thread Gehel
Gehel moved this task from In progress to Done on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. Data load is complete, this can be closed. TASK DETAIL: https://phabricator.wikimedia.org/T194184 WORKBOARD: https://phabricator.wikimedia.org/project/board/1239/

[Wikidata-bugs] [Maniphest] [Changed Subscribers] T195438: WDQS requests from aluminium.wikimedia.org being throttled

2018-05-24 Thread Gehel
Gehel added subscribers: Pnorman, Mholloway. Gehel added a comment. Looking at kartotherian configuration, I can't find a reference to wdqs, so I presume this is hardcoded. We do configure a proxy in the kartotherian and tilerator configs. I'm not sure if it is used for anything else

[Wikidata-bugs] [Maniphest] [Updated] T181988: Investigate and improve memory allocation rates of WDQS

2018-05-18 Thread Gehel
Gehel added a comment. Investigation on T192759 led to some interesting discoveries. Blazegraph Journal uses an unbounded executor service. Under high load (either because of more queries or more expensive queries), this executor creates a large number of threads for a short duration. We find

[Wikidata-bugs] [Maniphest] [Changed Project Column] T192759: WDQS endpoint timeout

2018-05-18 Thread Gehel
Gehel moved this task from In progress to Needs review on the Discovery-Wikidata-Query-Service-Sprint board. Gehel added a comment. cgroup limits have been bumped to 10'000 pids, which seems to be enough so far. We have not received any alerts about blazegraph timeout since then, so that's probably
