Hi Ariel,

Thank you for the details! That's helpful context.

Is it possible to disambiguate completely new pages vs. pages with new
revisions in the "Adds/changes dumps"? Looking at the nodes in the XML, I'm
not sure there's a way to do this.

In the interim, I wrote a Go script that parses a meta-history file as
previously described, with various filters: it excludes redirect pages
(wow, there are so many), user pages, etc. It worked out rather well. It's
a bit sloppy, but here is the script
<https://gist.github.com/ericandrewlewis/2fce39aff70b78b8316cfc87cd10a3eb> for
reference.

Eric Andrew Lewis
ericandrewlewis.com
+1 610 715 8560


On Wed, Jan 18, 2023 at 12:49 PM Ariel Glenn WMF <ar...@wikimedia.org>
wrote:

> Eric,
>
> We don't produce dumps of the revision table in SQL format because some of
> those revisions may be hidden from public view, and even metadata about
> them should not be released. We do, however, publish so-called Adds/Changes
> dumps once a day for each wiki, providing stubs and content files in XML of
> just the new pages and revisions since the last such dump. They lag about
> 12 hours behind to allow vandalism and such to be filtered out by wiki
> admins, but hopefully that's good enough for your needs. You can find those
> here: https://dumps.wikimedia.org/other/incr/
>
> Ariel Glenn
> ar...@wikimedia.org
>
> On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <
> eric.andrew.le...@gmail.com> wrote:
>
>> Hi,
>>
>> I am interested in performing analysis on recently created pages on
>> English Wikipedia.
>>
>> One way to find recently created pages is to download a meta-history file
>> for the English language and filter through the XML, looking for pages
>> whose oldest revision is within the desired timespan.
>>
>> Since this requires a library to parse through XML string data, I would
>> imagine this is much slower than a database query. Is page revision data
>> available in one of the SQL dumps which I could query for this use case?
>> Looking at the exported tables list
>> <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables>,
>> it does not look like it is. Maybe this is intentional?
>>
>> Thanks,
>> Eric Andrew Lewis
>> ericandrewlewis.com
>> +1 610 715 8560
>> _______________________________________________
>> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
>> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
>>
>
