Hi Stas,

One thing that I've been wondering about is whether we could take a little bit of load off via caching.
At the moment, if you run the same query again within a minute or two, it uses the cached results. But after a few minutes, anyone who follows the link triggers a new run. If a query is embedded somewhere, or it does the rounds on Twitter or in the newsletter, it might get a long stream of visitors spread out enough to miss the cache window, meaning the same query gets recalculated over and over.

For a lot of queries, of course, this is a good thing - we want people to have the newest data, especially for maintenance queries. But for a lot of others, either the data isn't going to change in the next day (eg maps of cities), or it's so high-level that slightly stale results won't change the picture (eg counts over groups of items where all the figures are in the tens of thousands anyway).

So the suggestion: would it be possible to have some kind of comment/command (similar to #defaultView:Map) that keeps the results cached for a day or two? This would make it an opt-in approach, and if it's done as a comment then the user could remove it, or tweak the query, to force an update. There's a rough sketch of what I have in mind below.

It certainly wouldn't solve the underlying load issues - bots aren't likely to want longer cache times - but it might help take a little bit of the load off.

It might also improve the user experience in some circumstances - if I email someone a query which I can force to be cached, then I know that when they open it they'll get something promptly rather than waiting a long time, and (if it's a complex query) I can be sure it'll run rather than timing out.
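To make that concrete, here's the sort of thing I have in mind. To be clear, the #cacheMaxAge comment and its one-day value are purely invented for illustration - only #defaultView:Map exists today - but the query itself is an ordinary map-of-cities query:

    #defaultView:Map
    #cacheMaxAge:86400    (hypothetical: serve cached results for up to a day)
    SELECT ?city ?cityLabel ?coords WHERE {
      ?city wdt:P31 wd:Q515 ;      # instance of: city
            wdt:P17 wd:Q145 ;      # country: United Kingdom
            wdt:P625 ?coords .     # coordinate location
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }

The exact name and format don't matter much - anything along these lines that the cache layer could read and honour would do.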
Andrew.

On Tue, 23 Jul 2019 at 08:43, Stas Malyshev <smalys...@wikimedia.org> wrote:
>
> Hello all!
>
> Here is (at last!) an update on what we are doing to protect the stability of Wikidata Query Service.
>
> For 4 years we have been offering Wikidata users the Query Service, a powerful tool that allows anyone to query the content of Wikidata, without any identification needed. This means that anyone can use the service using a script and make heavy or very frequent requests. However, this freedom has led to the service being overloaded by too large a volume of queries, causing the issues or lag that you may have noticed.
>
> A reminder about the context:
>
> We have had a number of incidents where the public WDQS endpoint was overloaded by bot traffic. We don't think that any of that activity was intentionally malicious, but rather that the bot authors most probably don't understand the cost of their queries and the impact they have on our infrastructure. We've recently seen more distributed bots, coming from multiple IP addresses on cloud providers. This kind of pattern makes it harder and harder to filter or throttle an individual bot. The impact has ranged from increased update lag to full service interruption.
>
> What we have been doing:
>
> While we would love to allow anyone to run any query they want at any time, we're not able to sustain that load, and we need to be more aggressive in how we throttle clients. We want to be fair to our users and allow everyone to use the service productively. We also want the service to be available to the casual user and provide up-to-date access to the live Wikidata data. And while we would love to throttle only abusive bots, to be able to do that we need to be able to identify them.
>
> We have two main means of identifying bots:
>
> 1) their user agent and IP address
> 2) the pattern of their queries
>
> Identifying patterns in queries is done manually, by a person inspecting the logs. It takes time and can only be done after the fact. We can only start our identification process once the service is already overloaded. This is not going to scale.
>
> IP addresses are starting to be problematic. We see bots running on cloud providers and spreading their workloads across multiple instances, with multiple IP addresses.
>
> We are left with user agents. But here, we have a problem again. To block only abusive bots, we would need those bots to use a clearly identifiable user agent, so that we can throttle or block them and contact the author to work together on a solution. It is unlikely that an intentionally abusive bot will voluntarily provide a way to be blocked. So we need to be more aggressive about bots which use a generic user agent. We are not blocking those, but we are limiting the number of requests coming from generic user agents. This is a large bucket, with a lot of bots falling into the same "generic user agent" category. Sadly, it is also the bucket that contains many small bots generating only a very reasonable load. And so we are also impacting the bots that play fair.
>
> At the moment, if your bot is affected by our restrictions, configure a custom user agent that identifies you; this should be sufficient to give you enough bandwidth. If you are still running into issues, please contact us; we'll find a solution together.
>
> What's coming next:
>
> First, it is unlikely that we will be able to remove the current restrictions in the short term. We're sorry for that, but the alternative - the service being unresponsive or severely lagged for everyone - is worse.
>
> We are exploring a number of alternatives: adding authentication to the service, and allowing higher quotas to bots that authenticate; and creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines. We are also in the process of hiring another engineer to work on these ideas.
>
> Thanks for your patience!
>
> WDQS Team
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata

--
- Andrew Gray
  and...@generalist.org.uk

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata