[Analytics] Wikipedia throttling

John Bohannon Wed, 27 Feb 2019 06:01:55 -0800

Hello!

I'm hoping to get advice on how we should approach the following challenge...


I am building a public website that will provide information that is 
automatically harvested from online news articles about the work of scientists. 
The goal is to make it easier to create and maintain scientific content on 
Wikipedia.

Here's some news about the project: 
https://www.theverge.com/2018/8/8/17663544/ai-scientists-wikipedia-primer 

And here is the prototype of the site: https://quicksilver.primer.ai

What I am working on now is a self-updating version of this site. 

The goal is to provide daily refreshed information for scientists most likely 
to be missing from Wikipedia. 

For now I am focusing on English-language news and English-language Wikipedia. 
Eventually this will expand to other languages.

The ~100 scientists shown on any given day are selected from the ~100k 
scientists that the system tracks for news updates.

So here's the challenge:  

To choose the 100 scientists most in need of an update on Wikipedia, we need to 
query Wikipedia each day for all 100k scientists to see whether each has an 
article yet and, if so, to get its content (to check whether we have new 
information).
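
To make the job concrete, here is a minimal sketch of the kind of batched 
lookup this implies. It assumes the MediaWiki Action API on English Wikipedia 
accepts up to 50 titles per query for ordinary clients; the helper names 
(batches, build_params, missing_titles) are made up for illustration. The 
sketch only builds request parameters and reads a response. Sending the 
requests is left out.

```python
from typing import Dict, Iterator, List

# English Wikipedia Action API endpoint.
API_URL = "https://en.wikipedia.org/w/api.php"

# Assumption: ordinary (non-bot) clients may pass up to 50 titles per query.
BATCH_SIZE = 50


def batches(titles: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` titles."""
    for start in range(0, len(titles), size):
        yield titles[start:start + size]


def build_params(batch: List[str]) -> Dict[str, str]:
    """Build query parameters asking whether each title exists.

    prop=info also returns 'lastrevid' for existing pages, so a later run
    can skip articles whose latest revision has not changed.
    """
    return {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "prop": "info",
        "titles": "|".join(batch),
        # maxlag asks the server to reject the request when replicas are lagging.
        "maxlag": "5",
    }


def missing_titles(response: dict) -> List[str]:
    """Return titles that the response marks as having no article.

    With formatversion=2, 'pages' is a list and absent pages carry
    'missing': true.
    """
    pages = response.get("query", {}).get("pages", [])
    return [page["title"] for page in pages if page.get("missing")]
```

Under that assumption, 100k titles at 50 per request comes to 2,000 requests 
per day rather than 100,000, and article content would only need to be fetched 
for pages whose 'lastrevid' changed since the previous run.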

I am getting throttled by the Wikipedia servers. 100k is a lot of queries.

What is the most polite, sanctioned method for programmatic access to Wikipedia 
for a daily job on this scale?

Many thanks for help/advice!

John Bohannon
http://johnbohannon.org

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
