A memorable piece of research in this area sampled articles using the API. https://arxiv.org/abs/1904.08139
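If the Action API remains on the table despite the cost concern, batching titles cuts the request count considerably. A minimal sketch in Python: the endpoint and the 50-titles-per-query limit for non-bot accounts come from the Action API documentation, but the helper names are mine, and the actual HTTP fetch is deliberately left out.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def batches(titles, size=50):
    """Yield chunks of up to `size` titles; the Action API accepts
    at most 50 titles per query for non-bot accounts."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def build_query_url(batch):
    """Build (but do not send) one Action API query asking for basic
    page info on a whole batch of titles at once."""
    params = {
        "action": "query",
        "titles": "|".join(batch),   # many pages, one request
        "prop": "info",
        "format": "json",
        "formatversion": "2",
    }
    return API + "?" + urlencode(params)

# One request instead of one per page:
urls = [build_query_url(b) for b in batches(["Alan Turing", "Ada Lovelace"])]
```

Fetching each URL with any HTTP client (and a descriptive User-Agent, per Wikimedia etiquette) then returns metadata for the whole batch in a single response.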
Regards,
Will Avery

On Thu, 16 Sep 2021, 19:21 Risker, <risker...@gmail.com> wrote:

> Mike's suggestion is good. You would likely get better responses by asking
> this question to the Wikimedia developers, so I am forwarding it to that
> list.
>
> Risker
>
> On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l
> <wikimedi...@lists.wikimedia.org> wrote:
>
>> Hello everyone,
>>
>> This is my first time interacting on this mailing list, so I would be
>> happy to receive feedback on how to better interact with the community :)
>>
>> I am trying to access Wikipedia metadata in a streaming and
>> time/resource-sustainable manner. By metadata I mean many of the items
>> that can be found in the statistics of a wiki article, such as edits,
>> editor lists, page views, etc.
>>
>> I would like to do this for an online-classifier type of structure:
>> retrieve the data from a large number of wiki pages at regular intervals
>> and use it as input for predictions.
>>
>> I tried to use the Wiki API; however, it is time- and resource-expensive,
>> both for me and for Wikipedia.
>>
>> My preferred choice now would be to query the specific tables in the
>> Wikipedia database, in the same way this is done through the Quarry tool.
>> The problem with Quarry is that I would like to build a standalone
>> script, without having to depend on a user interface like Quarry. Do you
>> think this is possible? I am still fairly new to all of this and I don't
>> know exactly which is the best direction.
>>
>> I saw [1] that I could access the wiki replicas both through Toolforge
>> and PAWS, but I didn't understand which one would serve me better; could
>> I ask you for some feedback?
>>
>> Also, as far as I understood [2], directly accessing the DB through Hive
>> is too technical for what I need, right? Especially because it seems that
>> I would need an account with production shell access, and I honestly
>> don't think I would be granted it. Also, I am not interested in accessing
>> sensitive or private data.
>>
>> The last resort is parsing the analytics dumps, but this seems a less
>> organic way of retrieving and cleaning the data. It would also be
>> strongly decentralised and dependent on a physical machine, unless I
>> upload the cleaned data online every time.
>>
>> Sorry for the long message, but I thought it was better to give you a
>> clearer picture (hoping this is clear enough). Any hints would be highly
>> appreciated.
>>
>> Best,
>>
>> Cristina
>>
>> [1] https://meta.wikimedia.org/wiki/Research:Data
>> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
>> _______________________________________________
>> Wikimedia-l mailing list -- wikimedi...@lists.wikimedia.org, guidelines
>> at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
>> https://meta.wikimedia.org/wiki/Wikimedia-l
>> Public archives at
>> https://lists.wikimedia.org/hyperkitty/list/wikimedi...@lists.wikimedia.org/message/6OZE7WIRDCMRA7TESD6XVCVB6ZQV4OFP/
>> To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org
>
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/