Hey,
Forgive my ignorance; I don't know much about the infrastructure of WDQS or
how it works. I just want to mention how the application servers handle this.
For the appservers, there are dedicated nodes, both for Apache and for the
replica database. So if a bot overdoes things on Wikipedia (which happens
quite a lot), users won't feel anything; the other bots take the hit instead.
Routing based on UA seems hard, though, whereas it's easy in MediaWiki (if you
hit api.php, we assume it's a bot).
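
A minimal sketch of that kind of routing, in Python. The pool names and the
list of generic user-agent prefixes are made up for illustration; none of
this is actual WDQS or MediaWiki configuration.

```python
# Hypothetical sketch: pick a backend pool per request so that heavy bot
# traffic lands on its own nodes and cannot starve interactive users.
# Pool names and UA prefixes below are assumptions, not real settings.
GENERIC_UA_PREFIXES = ("python-requests", "curl", "java", "go-http-client")


def choose_pool(path: str, user_agent: str) -> str:
    """Return the name of the server pool that should handle the request."""
    # Drop the query string so "/w/api.php?action=query" matches too.
    bare_path = path.split("?", 1)[0].rstrip("/")
    # MediaWiki-style rule: anything hitting api.php is assumed to be a bot.
    if bare_path.endswith("api.php"):
        return "bot-pool"
    ua = (user_agent or "").strip().lower()
    # A missing or generic library user agent is treated as a bot as well.
    if not ua or ua.startswith(GENERIC_UA_PREFIXES):
        return "bot-pool"
    return "user-pool"
```

The path rule is cheap and reliable; the user-agent rule is only a guess,
which is why UA-based routing is the harder part.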

Did you consider this as a longer-term solution?
Best,

On Tue, 23 Jul 2019 at 09:43, Stas Malyshev <smalys...@wikimedia.org> wrote:

> Hello all!
>
> Here is (at last!) an update on what we are doing to protect the
> stability of Wikidata Query Service.
>
> For 4 years we have been offering to Wikidata users the Query Service, a
> powerful tool that allows anyone to query the content of Wikidata,
> without any identification needed. This means that anyone can use the
> service using a script and make heavy or very frequent requests.
> However, this freedom has led to the service being overloaded by too
> many queries, causing the issues or lag that you may have noticed.
>
> A reminder about the context:
>
> We have had a number of incidents where the public WDQS endpoint was
> overloaded by bot traffic. We don't think that any of that activity was
> intentionally malicious, but rather that the bot authors most probably
> don't understand the cost of their queries and the impact they have on
> our infrastructure. We've recently seen more distributed bots, coming
> from multiple IPs from cloud providers. This kind of pattern makes it
> harder and harder to filter or throttle an individual bot. The impact
> has ranged from increased update lag to full service interruption.
>
> What we have been doing:
>
> While we would love to allow anyone to run any query they want at any
> time, we're not able to sustain that load, and we need to be more
> aggressive in how we throttle clients. We want to be fair to our users
> and allow everyone to use the service productively. We also want the
> service to be available to the casual user and provide up-to-date access
> to the live Wikidata data. And while we would love to throttle only
> abusive bots, to be able to do that we need to be able to identify them.
>
> We have two main means of identifying bots:
>
> 1) their user agent and IP address
> 2) the pattern of their queries
>
> Identifying patterns in queries is done manually, by a person inspecting
> the logs. It takes time and can only be done after the fact. We can only
> start our identification process once the service is already overloaded.
> This is not going to scale.
>
> IP addresses are starting to be problematic. We see bots running on
> cloud providers and running their workloads on multiple instances, with
> multiple IP addresses.
>
> We are left with user agents. But here, we have a problem again. To
> block only abusive bots, we would need those bots to use a clearly
> identifiable user agent, so that we can throttle or block them and
> contact the author to work together on a solution. It is unlikely that
> an intentionally abusive bot will voluntarily provide a way to be
> blocked. So we need to be more aggressive about bots which are using a
> generic user agent. We are not blocking those, but we are limiting the
> number of requests coming from generic user agents. This is a large
> bucket, with a lot of bots that are in this same category of "generic
> user agent". Sadly, this is also the bucket that contains many small
> bots that generate only a very reasonable load. And so we are also
> impacting the bots that play fair.
>
> At the moment, if your bot is affected by our restrictions, configure a
> custom user agent that identifies you; this should be sufficient to give
> you enough bandwidth. If you are still running into issues, please
> contact us; we'll find a solution together.
>
> What's coming next:
>
> First, it is unlikely that we will be able to remove the current
> restrictions in the short term. We're sorry for that, but the
> alternative - service being unresponsive or severely lagged for everyone
> - is worse.
>
> We are exploring a number of alternatives: adding authentication to the
> service and allowing higher quotas for bots that authenticate, and creating
> an asynchronous queue, which could allow running more expensive queries
> with longer deadlines. We are also in the process of hiring another
> engineer to work on these ideas.
>
> Thanks for your patience!
>
> WDQS Team
>
> _______________________________________________
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>


-- 
Amir Sarabadani (he/him)
Software engineer

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Tel. (030) 219 158 26-0
https://wikimedia.de

Our vision is a world in which all people can share in, use, and add to the
knowledge of humanity. Help us make it happen!
https://spenden.wikimedia.de

Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V.
Registered in the register of associations of the Amtsgericht
Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit by the
Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.