Hi Stas,

One thing that I've been wondering about is whether we could take a little bit of load off via caching.
At the moment, if you run the same query again within a minute or two, it uses the cached results. But after a few minutes, anyone who follows the link triggers a new run. If a query is embedded somewhere, or it does the rounds on Twitter or in the newsletter, it might get a long stream of visitors spread out enough to miss the cache window, meaning the same query gets recalculated over and over.

For a lot of queries, of course, this is a good thing - we want people to have the newest data, especially for maintenance queries. But for a lot of others, either the data isn't going to change in the next day (eg maps of cities), or it's so high-level that slightly stale results won't change the picture (eg counts over groups of items where all the figures are in the tens of thousands anyway).

So the suggestion: would it be possible to have some kind of comment/command (similar to #defaultView:Map) that keeps the results cached for a day or two? This would make it an opt-in approach, and if it's done as a comment then the user could remove it, or tweak the query, to force an update. There's a rough sketch of what I have in mind below.

It certainly wouldn't solve the underlying load issues - bots aren't likely to want longer cache times - but it might help take a little bit of the load off.

It might also improve the user experience in some circumstances - if I email someone a query which I can force to be cached, then I know that when they open it they'll get something promptly rather than waiting a long time, and (if it's a complex query) I can be sure it'll run rather than timing out.
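To make that concrete, here's the sort of thing I have in mind. To be clear, the #cacheMaxAge comment and its one-day value are purely invented for illustration - only #defaultView:Map exists today - but the query itself is an ordinary map-of-cities query:

    #defaultView:Map
    #cacheMaxAge:86400    (hypothetical: serve cached results for up to a day)
    SELECT ?city ?cityLabel ?coords WHERE {
      ?city wdt:P31 wd:Q515 ;      # instance of: city
            wdt:P17 wd:Q145 ;      # country: United Kingdom
            wdt:P625 ?coords .     # coordinate location
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

The exact name and format don't matter much - anything along these lines that the cache layer could read and honour would do.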
Andrew.

On Tue, 23 Jul 2019 at 08:43, Stas Malyshev <smalys...@wikimedia.org> wrote:
>
> Hello all!
>
> Here is (at last!) an update on what we are doing to protect the stability of Wikidata Query Service.
>
> For 4 years we have been offering Wikidata users the Query Service, a powerful tool that allows anyone to query the content of Wikidata, without any identification needed. This means that anyone can use the service using a script and make heavy or very frequent requests. However, this freedom has led to the service being overloaded by too large a volume of queries, causing the issues or lag that you may have noticed.
>
> A reminder about the context:
>
> We have had a number of incidents where the public WDQS endpoint was overloaded by bot traffic. We don't think that any of that activity was intentionally malicious, but rather that the bot authors most probably don't understand the cost of their queries and the impact they have on our infrastructure. We've recently seen more distributed bots, coming from multiple IP addresses on cloud providers. This kind of pattern makes it harder and harder to filter or throttle an individual bot. The impact has ranged from increased update lag to full service interruption.
>
> What we have been doing:
>
> While we would love to allow anyone to run any query they want at any time, we're not able to sustain that load, and we need to be more aggressive in how we throttle clients. We want to be fair to our users and allow everyone to use the service productively. We also want the service to be available to the casual user and provide up-to-date access to the live Wikidata data. And while we would love to throttle only abusive bots, to be able to do that we need to be able to identify them.
>
> We have two main means of identifying bots:
>
> 1) their user agent and IP address
> 2) the pattern of their queries
>
> Identifying patterns in queries is done manually, by a person inspecting the logs. It takes time and can only be done after the fact. We can only start our identification process once the service is already overloaded. This is not going to scale.
>
> IP addresses are starting to be problematic. We see bots running on cloud providers and spreading their workloads across multiple instances, with multiple IP addresses.
>
> We are left with user agents. But here, we have a problem again. To block only abusive bots, we would need those bots to use a clearly identifiable user agent, so that we can throttle or block them and contact the author to work together on a solution. It is unlikely that an intentionally abusive bot will voluntarily provide a way to be blocked. So we need to be more aggressive about bots which use a generic user agent. We are not blocking those, but we are limiting the number of requests coming from generic user agents. This is a large bucket, with a lot of bots falling into the same "generic user agent" category. Sadly, it is also the bucket that contains many small bots generating only a very reasonable load. And so we are also impacting the bots that play fair.
>
> At the moment, if your bot is affected by our restrictions, configure a custom user agent that identifies you; this should be sufficient to give you enough bandwidth. If you are still running into issues, please contact us; we'll find a solution together.
>
> What's coming next:
>
> First, it is unlikely that we will be able to remove the current restrictions in the short term. We're sorry for that, but the alternative - the service being unresponsive or severely lagged for everyone - is worse.
>
> We are exploring a number of alternatives: adding authentication to the service, and allowing higher quotas to bots that authenticate; and creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines. We are also in the process of hiring another engineer to work on these ideas.
>
> Thanks for your patience!
>
> WDQS Team
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

--
- Andrew Gray
  and...@generalist.org.uk

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata