[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2019-06-29 Thread Multichill
Multichill added a comment.


  @Smalyshev: Maybe do it like jsub on Toollabs: give an option to declare the 
expected runtime? Based on this, the load balancer in front of the different 
SPARQL services can assign you a node.
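
To make the suggestion concrete, here is a minimal sketch of how a load balancer might map a declared runtime to a node pool. The pool names and limits are purely hypothetical and do not describe the actual WDQS setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    max_runtime_s: float  # longest declared runtime this pool accepts


# Hypothetical pools, ordered from the shortest to the longest limit.
POOLS = [
    Pool("fast", 5.0),
    Pool("medium", 30.0),
    Pool("slow", 60.0),
]


def assign_pool(expected_runtime_s: float, pools=POOLS) -> Pool:
    """Return the first pool whose limit covers the declared runtime."""
    if expected_runtime_s <= 0:
        raise ValueError("expected runtime must be positive")
    for pool in pools:
        if expected_runtime_s <= pool.max_runtime_s:
            return pool
    raise ValueError("declared runtime exceeds every pool limit")
```

For example, `assign_pool(10).name` gives `"medium"` under these assumed limits. A real balancer would presumably also have to enforce the declared runtime, otherwise clients could under-declare to get onto the fast nodes.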

TASK DETAIL
  https://phabricator.wikimedia.org/T199228

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Gehel, Multichill
Cc: Esc3300, Mholloway, Alexsdutton, WMDE-leszek, Multichill, agray, Jheald, 
Magnus, Pintoch, gerritbot, Mathew.onipe, Stashbot, Lydia_Pintscher, EBjune, 
debt, Joe, Smalyshev, Gehel, Aklapper, darthmon_wmde, Legado_Shulgin, Nandana, 
thifranc, AndyTan, Davinaclare77, Qtn1293, Techguru.pc, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, Th3d3v1ls, Hfbn0, QZanden, merbst, 
LawExplorer, Zppix, _jensen, rosalieper, Jonas, Xmlizer, Wong128hk, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, faidon, Mbch331, Jay8g, 
fgiunchedi
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2019-06-26 Thread Esc3300
Esc3300 added a comment.


  I'm trying to figure out what the volume of queries on WQS may be:
  
  If I get 
https://grafana.wikimedia.org/d/00489/wikidata-query-service?panelId=18&fullscreen&orgId=1
  correctly, it's < 50 queries a second on each of some 12 servers?
  
  Personally, I don't think some lag is problematic if one is aware of it and 
if an hour later it's not just an hour more.



[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-16 Thread Smalyshev
Smalyshev added a comment.
ensuring that the data in the WDQS nodes accurately reflects the data upstream of the service, or at least that the data is consistent between query nodes

I am not sure how you would propose ensuring that. Given the database of almost 7 billion triples, it is not feasible to compare two of them or to verify the whole database against Wikidata.

as far as I can tell these events aren't actively monitored

How would you propose to actively monitor these?

for each node monitored to spot when data has gotten lost on the way

If we had a process to know in advance which data has gotten lost, we could use the same process to recover the data. The whole problem is we don't know when the data is lost - this happens when the existing process of data synchronization does not work for some reason.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread agray
agray added a comment.
To follow up on @Jheald and @Lydia_Pintscher's comments here - for a lot of use-cases, I agree that a lag measured in a few hours isn't much of a problem, because the underlying data is fairly static. Most property values don't change minute-to-minute or even month-to-month, most ontologies and hierarchies are reasonably stable, and so on. Queries which are "tell me something interesting about the underlying data" will tend to return reasonably good results whatever the update lag, assuming that it's not something that changes frequently or has been recently worked on.

But maintenance work often presumes a much quicker response rate, and people have built workflows around that expectation  - it has been, historically, pretty reliable at maintaining an update lag of a minute or so. Maybe we've just got soft and lazy because it's been so good :-)

To give a practical example, I have a regular data-cleaning workflow which uses a series of related queries to look at a class of people, with one set of queries identifying if the group is complete, another set bringing up various bits of metadata so it can be manually checked, and a third set looking for inconsistencies between them. Corrections at one stage can feed into another, and cleaning a set of data often means running the reports two or three times to check everything fits together accurately after changes have been made. Once I'm confident it's all complete and comprehensive, I mark it as validated and move on to the next one.

In this sort of situation, a few minutes lag is no real problem, but a few hours lag makes it very challenging to complete a single batch of data cleaning in one session. If that then means having to do it over several days, it leads to extra work as well as an increased chance of mistakes through human error (eg losing track of which bits I've done).

I don't want to say this is the biggest problem there is, or anything like that, and I would completely agree that ultimately accuracy of the results and the system being reliably up are very much priorities #1 and #2, with lag behind them at #3. However, increasing lag does still have noticeable impacts on maintenance work, and doing that work becomes more difficult once lag gets sufficiently long.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment.

In T199228#4715898, @Pintoch wrote:
The search interface can also be used for that thanks to the haswbstatement command. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface.


The lag on elasticsearch is typically lower than what we have on WDQS. But that's still an asynchronous process, with lag expected to climb occasionally.

Retrieving items by identifiers is quite crucial in many tools so it would be useful to have a solid interface for that instead of relying on SPARQL (which feels indeed like using a sledgehammer to crack a nut).

Again, I'm missing context / knowledge, but accessing an item by identifier sounds like an operation that should be exposed by a Wikidata API directly, not by a secondary datastore like WDQS/Blazegraph or Search/elasticsearch.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Pintoch
Pintoch added a comment.
The search interface can also be used for that thanks to the haswbstatement command. That only gets you one id per query, so it might not be suited for all tools. I don't know if the lag is lower in this interface.
Retrieving items by identifiers is quite crucial in many tools so it would be useful to have a solid interface for that instead of relying on SPARQL (which feels indeed like using a sledgehammer to crack a nut).
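
For illustration, a minimal sketch of what such a search-based lookup could look like: it only builds the request URL for the standard MediaWiki search API with a `haswbstatement` query and performs no network call. The helper name is made up for this example, and the DOI value used with P356 (DOI) is a placeholder.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"


def haswbstatement_search_url(property_id: str, value: str, limit: int = 5) -> str:
    """Build a search API URL that looks up items by one statement value.

    One (property, value) pair per query, matching the limitation noted above.
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f"haswbstatement:{property_id}={value}",
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


# Placeholder DOI, for illustration only.
url = haswbstatement_search_url("P356", "10.1000/EXAMPLE")
```

Since the search index is also updated asynchronously, a lookup like this does not remove the lag problem; it only moves it to a different backend.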


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment.

In T199228#4715863, @Magnus wrote:

Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks for a large number of scientific publications if they exist on Wikidata or not. If not, it creates them. Now, if people have accidentally listed the same paper twice, or if two different batch jobs check/create the same paper, then the first create is sometimes "invisible" due to SPARQL lag, and a duplicate item is created. This has apparently happened a lot in the last few days.



Interesting. I'm probably lacking context, so apologies if I'm completely wrong here.

It sounds very much like an asynchronous process (WDQS updates) is used as part of an operation that should be transactional (creating an item if it does not exist). Even if we somewhat improve the lag, it still seems like the wrong tool for the job. Sadly, I don't have a proposal for a better tool for that job.
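
The failure mode can be reproduced with a toy model. The sketch below is an assumption-laden simulation, not the sourcemd code: `LaggingIndex` stands in for a query service that only sees writes after a fixed number of ticks, and two jobs run a check-then-create for the same paper.

```python
class LaggingIndex:
    """Toy stand-in for a query service that sees writes only after `lag` ticks."""

    def __init__(self, lag: int):
        self.lag = lag
        self.now = 0
        self.items = []  # (creation_tick, key) pairs in the primary store

    def tick(self, n: int = 1) -> None:
        self.now += n

    def create(self, key: str) -> None:
        self.items.append((self.now, key))

    def visible(self, key: str) -> bool:
        # The index only reflects items created at least `lag` ticks ago.
        return any(k == key and self.now - t >= self.lag for t, k in self.items)

    def count(self, key: str) -> int:
        return sum(1 for _, k in self.items if k == key)


def create_if_missing(index: LaggingIndex, key: str) -> bool:
    """Check-then-create through the lagging index; True if an item was created."""
    if index.visible(key):
        return False
    index.create(key)
    return True


def run_batch(lag: int, gap: int) -> int:
    """Two jobs handle the same paper `gap` ticks apart; return the item count."""
    index = LaggingIndex(lag)
    create_if_missing(index, "paper-A")
    index.tick(gap)
    create_if_missing(index, "paper-A")
    return index.count("paper-A")
```

In this model `run_batch(lag=60, gap=5)` ends with two items for the same paper, while `run_batch(lag=60, gap=60)` ends with one. Reducing the lag only shrinks the window in which the duplicate can happen; it never closes it, which is why the check would need to go against the primary store.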


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment.
Thanks for the feedback!


In T199228#4715815, @Jheald wrote:
This requires WDQS to be reasonably up to date most of the time.  A lag of 5 minutes isn't such a problem.  An occasional longer lag, if clearly signposted as the WDQS GUI does, also isn't such a problem -- if the server is having a slow moment, one can go away and work on something else for a while.


In the current situation, we are doing far worse than only occasional lag higher than 5 minutes (see graph of the last 7 days). Being able to support this kind of SLO would require a significant amount of work and rearchitecture.

I'm not saying we should not do it, just that we're not there at the moment. And that if we put a strong constraint on updater lag, we should review the way we manage this public endpoint.

However, what is even worse for this workflow is if edits get missed -- ie leading to anomalies that should be reported getting missed, or anomalies that should have been cleared failing to disappear.  Accurate eventual synchronisation remains the #1 priority.

@Smalyshev is doing great work in tracking the synchronisation issues! I don't think we have a good measurement of the miss rate (having that measure is probably almost as hard as fixing the actual issues).

I note that if we manage to formally define an SLO, missed edits should be part of it.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Magnus
Magnus added a comment.
In many cases, especially bot/background tasks (e.g. Listeria), a lag of hours is not critical. This is also true for many interactive tools, where the user gets some items matching certain criteria.

However, two situations I see as problematic:


1. "Instant gratification" - You see a problem via one of my tools, go fix it, reload to see how nicely it's fixed now - except it isn't. Frustrating, but tolerable.
2. Batch jobs. For example, there is an issue with the sourcemd tool at the moment. Essentially, it checks for a large number of scientific publications if they exist on Wikidata or not. If not, it creates them. Now, if people have accidentally listed the same paper twice, or if two different batch jobs check/create the same paper, then the first create is sometimes "invisible" due to SPARQL lag, and a duplicate item is created. This has apparently happened a lot in the last few days.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Jheald
Jheald added a comment.
To support what Smalyshev said: occasional temporary update lag may not be such a high-priority issue; but prolonged or repeated update lag would rapidly become one.

A common workflow for editors making manual edits to fix problems on Wikidata is to use WDQS to generate a list of anomalies; then to investigate and manually fix the first 'n' of those anomalies; then to re-run the WDQS query to get an updated list of anomalies remaining, with luck now excluding anomalies involving the 'n' items that have been edited.

This requires WDQS to be reasonably up to date most of the time.  A lag of 5 minutes isn't such a problem.  An occasional longer lag, if clearly signposted as the WDQS GUI does, also isn't such a problem -- if the server is having a slow moment, one can go away and work on something else for a while.

But if update lags become sustained or repeated, then this breaks the workflow and does become a problem.

However, what is even worse for this workflow is if edits get missed -- ie leading to anomalies that should be reported getting missed, or anomalies that should have been cleared failing to disappear.  Accurate eventual synchronisation remains the #1 priority.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Pintoch
Pintoch added a comment.
@Gehel my service has been quite unstable for some time, but I haven't found the time yet to find out exactly where the problem is coming from - it could be SPARQL, the Wikidata API, redis or the webservice itself. I will add a few more metrics to understand what is going on and report back here.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-11-02 Thread Gehel
Gehel added a comment.

In T199228#4710863, @Pintoch wrote:
What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern.


I'd be most interested in how well this is going at the moment! The open-to-all nature and widely varying cost of requests on WDQS make it really hard to guarantee the same level of service as you might expect from other services. But we don't have a good measurement of the current quality of service. So your subjective observations are very valuable!


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-31 Thread Pintoch
Pintoch added a comment.
Thanks for the ping Lydia! Off the top of my head, the only uses of SPARQL in the tools I maintain are in the openrefine-wikidata interface:


- queries to retrieve the list of subclasses of a given class - lag is not critical at all for this as the ontology is assumed to be stable. (These results are cached on my side for 24 hours, for any root class.)
- queries to retrieve items by external identifiers or sitelinks - lag can be more of an issue for this but I would not consider it critical. (These results are not cached.)


What matters much more for this tool is getting quick results and as little downtime as possible - lag is not really a concern.
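
As a rough sketch of the first kind of query and its 24-hour caching: the class and function names below are invented for illustration and are not taken from openrefine-wikidata, and the `fetch` callable that actually runs the SPARQL query is left to the caller. The query uses the `wdt:P279*` property path (subclass of, transitively).

```python
import time
from typing import Callable, Dict, List, Tuple

TTL_SECONDS = 24 * 60 * 60  # the 24-hour cache lifetime mentioned above


def subclass_query(root_class: str) -> str:
    """SPARQL for all direct and indirect subclasses of a root class."""
    return f"SELECT ?subclass WHERE {{ ?subclass wdt:P279* wd:{root_class} }}"


class SubclassCache:
    """Cache query results per root class for a fixed time to live."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        ttl: float = TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch  # runs a SPARQL query string, returns item ids
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, root_class: str) -> List[str]:
        now = self._clock()
        cached = self._entries.get(root_class)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no query is sent
        result = self._fetch(subclass_query(root_class))
        self._entries[root_class] = (now, result)
        return result
```

With a cache like this, an hour of update lag on the query service is invisible next to the 24-hour staleness the tool already accepts, which matches the point that lag is not the concern here.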


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-31 Thread gerritbot
gerritbot added a comment.
Change 470819 merged by Gehel:
[operations/puppet@production] wdqs: raise alerting threshold on updater lag for public cluster

https://gerrit.wikimedia.org/r/470819


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-31 Thread gerritbot
gerritbot added a comment.
Change 470819 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: raise alerting threshold on updater lag for public cluster

https://gerrit.wikimedia.org/r/470819


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-11 Thread Smalyshev
Smalyshev added a comment.
If update lag is not a big issue for our users, then we should make it clear.

A more precise statement would be that a temporary update lag is not the highest-priority issue. I.e., getting hour-old data for a short time is not going to matter for most users, while, e.g., not getting any response at all would be a problem for everybody.

Note that long-term lag (e.g. one happening for hours) is still an issue, so if the server is not catching up quickly, it still needs to be depooled, etc.
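
One way to express the difference between a temporary spike and long-term lag is to look at how long the lag has stayed above a threshold. The sketch below is only an illustration; the one-hour threshold and three-hour duration are assumed values, not agreed numbers.

```python
from typing import Iterable, Tuple


def should_depool(
    samples: Iterable[Tuple[float, float]],
    lag_threshold_s: float = 3600.0,      # assumed: lag above one hour counts
    sustained_for_s: float = 3 * 3600.0,  # assumed: tolerated for three hours
) -> bool:
    """Decide from (timestamp_s, lag_s) samples, given in time order.

    True only if the lag has been above the threshold continuously up to
    the latest sample for at least `sustained_for_s` seconds.
    """
    breach_start = None
    last_timestamp = None
    for timestamp, lag in samples:
        last_timestamp = timestamp
        if lag > lag_threshold_s:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # the server caught up; reset the window
    if breach_start is None:
        return False
    return last_timestamp - breach_start >= sustained_for_s
```

A spike that recovers, such as `[(0, 10), (600, 4000), (1200, 20)]`, does not trigger a depool, while lag that stays above the threshold for the whole window does.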


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-10 Thread Smalyshev
Smalyshev added a comment.
I think update lag is not the biggest issue. Endpoint availability and response times are more important for most of the users, at least short-term. If there's a lag spike that goes away, most users won't even notice (persistent lag is different of course). If however the user's queries time out, that is different.

The problem that I see here is how to quantify what we want. We probably can reasonably promise endpoint availability, as in "can run trivial queries" (even that I would not be sure how to quantify). However, if we get to the "interesting" queries, the variety is so large that I am not sure how to express any guarantees in any certain terms. Maybe p95/p99? But that can be influenced by any random bot...
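
To illustrate the concern about p95/p99, here is a small sketch using the nearest-rank percentile definition (one of several common definitions). The latency figures are invented: 100 user queries at 0.2 s each, then a hypothetical bot adding 10 queries at 55 s each.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]


# Invented latencies in seconds, for illustration only.
user_queries = [0.2] * 100
bot_queries = [55.0] * 10

p95_users_only = percentile(user_queries, 95)
p95_with_bot = percentile(user_queries + bot_queries, 95)
```

In this made-up case the bot contributes under 10% of the requests, yet it moves p95 from 0.2 s to 55 s. Any percentile-based objective would therefore probably need to be computed per client class or per query class rather than over all traffic.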

Do you have any SLOs in mind that we could look at to get an impression of what that should look like?


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-10 Thread Gehel
Gehel added a comment.
Coming back to this discussion, I'll try to make my point more clear:

The WDQS public endpoint is by nature a more fragile service than most of our other services. The update lag is a good example of a problem we don't seem to be able to get under control on the public endpoint. The consequence of that is that we are starting to ignore those lag alerts. And we don't see any major consequences to this lag on the public cluster, which is an indication that our current alerting threshold does not match the reality of what is needed by the clients of that service. I might be wrong here and maybe this lag is an important issue, and if it is, we need to address it with a high priority.


In T199228#4420685, @Smalyshev wrote:
WDQS public endpoint is not expected to have high availability / stability guarantees.

Well, this sounds a bit like giving up on availability (even if it's not the intention), so I think we want to have something. Let's think/brainstorm on what this something could be and how we could measure it.


Yes, it is giving up on at least some level of availability. Or matching expectations with the reality of that availability. Or taking a strong product decision that this public sparql endpoint is expected to have higher availability than it has and acting on that decision.

I'm not sure how much we should formalize an SLO on this endpoint, but expecting the same level of service as from other endpoints does not match the reality and never will unless we take drastic actions. This influences how we should react to failures on this endpoint, so it should be defined in some way.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-10-09 Thread Stashbot
Stashbot added a comment.
Mentioned in SAL (#wikimedia-operations) [2018-10-09T12:54:29Z] silencing wdqs-public lag alerts (service still functional, and SLO unclear) - T199228


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-09-11 Thread Smalyshev
Smalyshev added a comment.
Is something happening on this, or was it shelved for now?


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-12 Thread Smalyshev
Smalyshev added a comment.
( ASK{ ?x ?y ?z };) does time out from time to time.

This is definitely something that should not be happening, but I wonder how we can build a metric around it beyond "this should never happen".

We could solve those issues by throwing loads of hardware at the problem.

I guess. But before that, we should define what the issues are :) I.e. do we want to get p95 or p50 into some range? Which range? Do we even care if p95 is 0.5s or 10s or 30s? If p99 is 40s, is it good or bad? Right now I am not sure I know how to answer these.
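To make the p50/p95 question concrete, here is a minimal sketch of checking observed percentiles against candidate ranges. It uses the nearest-rank percentile definition; the sample response times and the target values are hypothetical placeholders, not proposed numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_latency_targets(samples, targets):
    """Map each percentile in targets to (observed, limit, within_limit)."""
    report = {}
    for p, limit in targets.items():
        observed = percentile(samples, p)
        report[p] = (observed, limit, observed <= limit)
    return report


# Hypothetical response times in seconds.
samples = [0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 0.8, 1.2, 9.0, 40.0]

# Hypothetical targets: p50 under 1s, p95 under 30s.
report = check_latency_targets(samples, {50: 1.0, 95: 30.0})
```

With data like this, a single very slow query is enough to push p95 past its limit while p50 stays comfortable, which is part of why choosing the range is harder than computing it.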

WDQS public endpoint is not expected to have high availability / stability guarantees.

Well, this sounds a bit like giving up on availability (even if it's not the intention), so I think we want to have something. Let's think/brainstorm on what this something could be and how we could measure it.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-12 Thread Gehel
Gehel added a comment.
I think response times and number of timeouts are not a good metric for this type of thing

To echo what @Smalyshev is saying, yes, I agree that we don't have good measures of either the reliability or the performance of WDQS. And it is somewhat related to the fact that we don't have a good definition of what those should be. We could measure the response time of a standard query and see how it evolves over time. I know that the query we use to check availability from LVS ( ASK{ ?x ?y ?z };) does time out from time to time.
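Measuring a standard query over time could look roughly like the sketch below. This is not how the LVS check is implemented. The endpoint URL, the User-Agent string, the 10s timeout, and the function names are assumptions made for illustration; the query is the ASK query mentioned above, without the trailing semicolon.

```python
import time
import urllib.parse
import urllib.request

CANARY_QUERY = "ASK{ ?x ?y ?z }"
# Assumption: the public SPARQL endpoint URL; adjust for the cluster under test.
ENDPOINT = "https://query.wikidata.org/sparql"


def http_runner(query, timeout):
    """Send the query over HTTP GET and read the whole response."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={"User-Agent": "wdqs-canary-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


def probe(runner=http_runner, timeout=10.0, clock=time.monotonic):
    """Run the canary query once and return (succeeded, elapsed_seconds).

    URLError, HTTPError and socket timeouts are all OSError subclasses,
    so any of them counts as a failed probe rather than a crash.
    """
    start = clock()
    try:
        runner(CANARY_QUERY, timeout)
        succeeded = True
    except OSError:
        succeeded = False
    return succeeded, clock() - start
```

Calling `probe()` on a schedule and recording both values would give a time series of canary latency and canary failures. The `runner` parameter is injectable so the timing logic can be exercised without touching the network.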

We could look at p95 or p50, but we will be measuring factors which are outside of our control as much as anything. Lag is mostly under our control (or should be) except for rare cases of massive change dump, but those are rare.

Yes and no. We could solve those issues by throwing loads of hardware at the problem. Or changing the architecture completely. Or requiring authentication on the service and having drastic usage limits. All those are probably things we don't want to do.

What I would like, even if we don't have precise metrics, is that we agree on a statement that WDQS public endpoint is not expected to have high availability / stability guarantees.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-12 Thread Smalyshev
Smalyshev added a comment.
I think response times and number of timeouts are not a good metric for this type of thing - people run a lot of heavy queries, for which it's OK to time out, and the query response time is largely dependent on the type of queries (and bots) running. We could look at p95 or p50, but we will be measuring factors which are outside of our control as much as anything. Lag is mostly under our control (or should be) except for rare cases of massive change dump, but those are rare. So I think we do need to think some about how exactly we define what we want to achieve.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-12 Thread Gehel
Gehel added a comment.
You can have a look at the historical values we have for:

update lag: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&panelId=8&fullscreen&orgId=1&from=now-30d&to=now
response times: https://grafana.wikimedia.org/dashboard/db/wikidata-query-service-frontend?panelId=13&fullscreen&orgId=1&from=now-30d&to=now
number of timeouts: https://logstash.wikimedia.org/goto/6e9c89cc32924ceeeacca6b394bfc84b
documented wdqs incidents: https://wikitech.wikimedia.org/wiki/Incident_documentation


I don't think we have SLOs defined for other services. For most services, the approach is "the service should always be available". We can discuss whether that is a good idea in general, but for the public WDQS endpoint, this is not appropriate, or at least does not reflect the reality.

@Lydia_Pintscher if this does not answer your question, it is probably because I did not understand it correctly. Ping me for a chat if need be.


[Wikidata-bugs] [Maniphest] [Commented On] T199228: Define an SLO for Wikidata Query Service public endpoint and communicate it

2018-07-11 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.
Is there a place where I can look at the current thresholds for this and other services? Then I could probably give some more meaningful input.