Thank you, Michael. That was helpful. I'll reach out to the ops team.

On Thu, May 9, 2019 at 6:24 AM Michael Holloway <mhollo...@wikimedia.org> wrote:
> Aadithya,
>
> About title batching, you're not missing anything — unlike the action
> API (/w/api.php), the REST API (/api/rest_v1) page content endpoints take
> only a single title at a time.
>
> It sounds like you may indeed be running into some periodic rate limit.
> The best source of info on current rate limits is the Traffic engineers on
> the Site Reliability Engineering
> <https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering>
> team; I'm not sure whether any of them are subscribed to this list. You may
> have better luck asking on the Operations mailing list
> (o...@lists.wikimedia.org) or the #wikimedia-operations channel on IRC
> (irc://irc.freenode.net/wikimedia-operations).
>
> On Wed, May 8, 2019 at 5:20 PM Aadithya C Udupa <udupa.adit...@gmail.com>
> wrote:
>
>> Hi,
>> I am making queries to get the latest HTML content for a title, using
>> the API documented here:
>> https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_html__title_
>> I may be missing something, but I do not see an option to send a list of
>> titles.
>> I am also working on a project to do some semi-structured and
>> unstructured data extraction from Wikipedia HTML.
>>
>> On Wed, May 8, 2019 at 1:23 PM Betacommand <betacomm...@gmail.com> wrote:
>>
>>> Why are you making so many queries? Have you tried batching pages
>>> together? What kind of project needs a real-time copy of a large data
>>> set?
>>>
>>> On Wed, May 8, 2019 at 2:49 PM Aadithya C Udupa
>>> <udupa.adit...@gmail.com> wrote:
>>>
>>>> Thank you for the quick response, Michael.
>>>> I was making close to 10 requests per second previously, but would hit
>>>> HTTP 429 errors frequently. The etiquette document here
>>>> <https://www.mediawiki.org/wiki/API:Etiquette> suggests making requests
>>>> serially rather than in parallel, so I switched to serial requests at
>>>> one per second, as I did not want to abuse the API.
>>>> But as you can imagine, this takes a lot of time, especially when
>>>> trying to expand to multiple languages.
>>>> I also send a valid User-Agent header as described here
>>>> <https://meta.wikimedia.org/wiki/User-Agent_policy>.
>>>> What do you think could be other reasons why I hit HTTP 429 errors?
>>>> Is there a cap on the total number of requests per day/week, etc.?
>>>>
>>>> On Wed, May 8, 2019 at 10:43 AM Michael Holloway
>>>> <mhollo...@wikimedia.org> wrote:
>>>>
>>>>> Hi Aadithya,
>>>>>
>>>>> According to the information at the top of the REST API docs page
>>>>> <https://wikimedia.org/api/rest_v1/>, you should in general be able to
>>>>> make up to 200 read requests per second to the REST API without any
>>>>> trouble. As far as I know, that information is accurate. Are you
>>>>> hitting 429s at a lower request rate than that?
>>>>>
>>>>> To answer your question, sending requests in parallel to multiple
>>>>> language subdomains should not be a problem so long as your overall
>>>>> request rate remains lower than ~200/s.
>>>>>
>>>>> On Tue, May 7, 2019 at 8:27 PM Aadithya C Udupa
>>>>> <udupa.adit...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> For one of my projects, I need to keep the most up-to-date version of
>>>>>> Wikipedia HTML pages for a few languages (en, zh, de, es, fr, etc.).
>>>>>> This is currently done in two steps:
>>>>>> 1. Listen for changes on the stream API documented here
>>>>>> <https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams> and
>>>>>> extract the page titles.
>>>>>> 2. For each title, get the latest HTML using the Wikipedia REST API
>>>>>> <https://en.wikipedia.org/api/rest_v1/#/Page%20content/get_page_title__title_>
>>>>>> and persist the HTML.
>>>>>>
>>>>>> I understand that in order to avoid 429 (Too Many Requests) errors,
>>>>>> we need to limit API requests to 1 per second.
>>>>>> Just wanted to check whether we can make requests to different
>>>>>> languages (en.wikipedia.org, fr.wikipedia.org, etc.) in parallel, or
>>>>>> whether those requests also need to be made serially (1 per second)
>>>>>> in order not to hit HTTP 429 errors.
>>>>>>
>>>>>> Please let me know if you need more information.
>>>>>>
>>>>>> --
>>>>>> Regards,
>>>>>> Aadithya
>>>>>> --
>>>>>> Sent from my iPad3
>>>>>> _______________________________________________
>>>>>> Mediawiki-api mailing list
>>>>>> Mediawiki-api@lists.wikimedia.org
>>>>>> https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
>>>>>
>>>>> --
>>>>> Michael Holloway
>>>>> Software Engineer, Reading Infrastructure

--
Regards,
Aadithya
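[Editor's note: as a minimal sketch of the single-title fetch step discussed in this thread, assuming Python and the documented /page/html/{title} REST endpoint. The User-Agent string, the backoff policy, and the helper names are placeholders of this sketch, not official client code or documented limits.]

```python
import time
from typing import Optional
from urllib.parse import quote
from urllib.request import Request, urlopen

# Placeholder User-Agent: per the policy linked above, identify your
# project and include contact information.
USER_AGENT = "ExampleHtmlMirror/0.1 (https://example.org; contact@example.org)"

def rest_html_url(lang: str, title: str) -> str:
    """Build the single-title REST API page-HTML URL.

    Spaces become underscores and the title is percent-encoded
    (including '/', which is significant in subpage titles)."""
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/html/{encoded}"

def retry_delay(attempt: int, retry_after: Optional[str] = None) -> float:
    """Seconds to wait after an HTTP 429: honor a Retry-After header if
    the server sent one, else back off exponentially, capped at 60s."""
    if retry_after is not None:
        return float(retry_after)
    return float(min(2 ** attempt, 60))

def fetch_html(lang: str, title: str, max_attempts: int = 5) -> str:
    """Fetch the latest HTML for one title, retrying politely on 429."""
    for attempt in range(max_attempts):
        req = Request(rest_html_url(lang, title),
                      headers={"User-Agent": USER_AGENT})
        try:
            with urlopen(req) as resp:
                return resp.read().decode("utf-8")
        except Exception as err:
            if getattr(err, "code", None) != 429:
                raise  # only 429s are retried here
            time.sleep(retry_delay(attempt, err.headers.get("Retry-After")))
    raise RuntimeError(f"giving up on {title!r} after {max_attempts} attempts")
```

Honoring Retry-After when present, rather than retrying on a fixed schedule, follows the spirit of the etiquette page cited earlier in the thread.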
_______________________________________________
Mediawiki-api mailing list
Mediawiki-api@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api
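[Editor's note: for the stream-listening step of the pipeline described above, a sketch of extracting page titles from `data:` lines of the mediawiki.recentchange SSE stream. The sample events below are trimmed and hypothetical; real events carry many more fields, and the set of watched wikis is this sketch's assumption.]

```python
import json

# Wikis whose pages we mirror (the languages named in the thread);
# events from any other wiki are ignored.
WATCHED = {"en.wikipedia.org", "zh.wikipedia.org", "de.wikipedia.org",
           "es.wikipedia.org", "fr.wikipedia.org"}

def titles_from_sse(lines):
    """Yield (domain, title) pairs from the 'data:' lines of an SSE
    stream of mediawiki.recentchange events, keeping only page edits
    and creations on the watched wikis."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip event:/id: framing and keep-alive comments
        event = json.loads(line[len("data: "):])
        if (event.get("server_name") in WATCHED
                and event.get("type") in ("edit", "new")):
            yield event["server_name"], event["title"]

# Usage with trimmed, hypothetical sample events:
sample = [
    'data: {"title": "Douglas Adams", "server_name": "en.wikipedia.org", "type": "edit"}',
    ': keep-alive',
    'data: {"title": "Foo", "server_name": "nl.wikipedia.org", "type": "edit"}',
]
print(list(titles_from_sse(sample)))  # only the en.wikipedia.org edit survives
```

Filtering by event type here drops log and categorize entries, so the downstream fetcher (rate-limited as discussed above) only sees titles whose HTML actually changed.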