A memorable piece of research in this area sampled articles using the API. https://arxiv.org/abs/1904.08139
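If the Action API remains on the table despite the cost concern, batching titles cuts the request count considerably. A minimal sketch in Python: the endpoint and the 50-titles-per-query limit for non-bot accounts come from the Action API documentation, but the helper names are mine, and the actual HTTP fetch is deliberately left out.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def batches(titles, size=50):
    """Yield chunks of up to `size` titles; the Action API accepts
    at most 50 titles per query for non-bot accounts."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def build_query_url(batch):
    """Build (but do not send) one Action API query asking for basic
    page info on a whole batch of titles at once."""
    params = {
        "action": "query",
        "titles": "|".join(batch),   # many pages, one request
        "prop": "info",
        "format": "json",
        "formatversion": "2",
    }
    return API + "?" + urlencode(params)

# One request instead of one per page:
urls = [build_query_url(b) for b in batches(["Alan Turing", "Ada Lovelace"])]
```

Fetching each URL with any HTTP client (and a descriptive User-Agent, per Wikimedia etiquette) then returns metadata for the whole batch in a single response.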
Regards,
Will Avery

On Thu, 16 Sep 2021, 19:21 Risker, <risker...@gmail.com> wrote:

> Mike's suggestion is good. You would likely get better responses by asking
> this question to the Wikimedia developers, so I am forwarding it to that
> list.
>
> Risker
>
> On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l
> <wikimedi...@lists.wikimedia.org> wrote:
>
>> Hello everyone,
>>
>> This is my first time interacting on this mailing list, so I would be
>> happy to receive feedback on how to better interact with the community :)
>>
>> I am trying to access Wikipedia metadata in a streaming and
>> time/resource-sustainable manner. By metadata I mean many of the items
>> that can be found in the statistics of a wiki article, such as edits,
>> editor lists, page views, etc.
>>
>> I would like to do this for an online-classifier type of structure:
>> retrieve the data from a large number of wiki pages at regular intervals
>> and use it as input for predictions.
>>
>> I tried to use the Wiki API; however, it is time- and resource-expensive,
>> both for me and for Wikipedia.
>>
>> My preferred choice now would be to query the specific tables in the
>> Wikipedia database, in the same way this is done through the Quarry tool.
>> The problem with Quarry is that I would like to build a standalone
>> script, without having to depend on a user interface like Quarry. Do you
>> think this is possible? I am still fairly new to all of this and I don't
>> know exactly which is the best direction.
>>
>> I saw [1] that I could access the wiki replicas both through Toolforge
>> and PAWS, but I didn't understand which one would serve me better; could
>> I ask you for some feedback?
>>
>> Also, as far as I understood [2], directly accessing the DB through Hive
>> is too technical for what I need, right? Especially because it seems that
>> I would need an account with production shell access, and I honestly
>> don't think I would be granted it. Also, I am not interested in accessing
>> sensitive or private data.
>>
>> The last resort is parsing the analytics dumps, but this seems a less
>> organic way of retrieving and cleaning the data. It would also be
>> strongly decentralised and dependent on a physical machine, unless I
>> upload the cleaned data online every time.
>>
>> Sorry for the long message, but I thought it was better to give you a
>> clearer picture (hoping this is clear enough). Any hints would be highly
>> appreciated.
>>
>> Best,
>>
>> Cristina
>>
>> [1] https://meta.wikimedia.org/wiki/Research:Data
>> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
>> _______________________________________________
>> Wikimedia-l mailing list -- wikimedi...@lists.wikimedia.org, guidelines
>> at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
>> https://meta.wikimedia.org/wiki/Wikimedia-l
>> Public archives at
>> https://lists.wikimedia.org/hyperkitty/list/wikimedi...@lists.wikimedia.org/message/6OZE7WIRDCMRA7TESD6XVCVB6ZQV4OFP/
>> To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org
>
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/