Hello Eric, we don’t have any tag or field in the adds/changes dump that
differentiates between new pages and pages that only have new revisions.
However, the page table has a column called “page_is_new”, and we dump that
table twice a month in SQL format. You might want to cross-check the page
table dump against the adds/changes dump from the same day.


Please note that the two dumps are not produced at exactly the same time,
so the data in the two files may not be fully consistent.
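
For illustration, a cross-check could look roughly like the sketch below.
It is untested; the file names are placeholders, and the tuple layout
assumed by the regex should be verified against the CREATE TABLE statement
at the top of the page table dump:

    package main

    import (
        "bufio"
        "compress/gzip"
        "encoding/xml"
        "fmt"
        "log"
        "os"
        "regexp"
        "strconv"
    )

    // Naive matcher for page table tuples:
    // (page_id, page_namespace, 'page_title', page_is_redirect, page_is_new, ...
    // Captures page_id and page_is_new; adjust if the column order differs.
    var tupleRe = regexp.MustCompile(`\((\d+),\d+,'(?:[^'\\]|\\.)*',[01],([01]),`)

    // loadNewPages returns the IDs of pages whose page_is_new column is 1.
    func loadNewPages(path string) (map[int]bool, error) {
        f, err := os.Open(path)
        if err != nil {
            return nil, err
        }
        defer f.Close()
        gz, err := gzip.NewReader(f)
        if err != nil {
            return nil, err
        }
        newPages := make(map[int]bool)
        sc := bufio.NewScanner(gz)
        sc.Buffer(make([]byte, 1<<20), 1<<30) // INSERT lines are very long
        for sc.Scan() {
            for _, m := range tupleRe.FindAllStringSubmatch(sc.Text(), -1) {
                if m[2] == "1" {
                    id, _ := strconv.Atoi(m[1])
                    newPages[id] = true
                }
            }
        }
        return newPages, sc.Err()
    }

    func main() {
        newPages, err := loadNewPages("enwiki-page.sql.gz") // placeholder name
        if err != nil {
            log.Fatal(err)
        }
        // Stubs file from the adds/changes run of (roughly) the same day.
        f, err := os.Open("enwiki-stubs-incr.xml.gz") // placeholder name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        gz, err := gzip.NewReader(f)
        if err != nil {
            log.Fatal(err)
        }
        dec := xml.NewDecoder(gz)
        type page struct {
            Title string `xml:"title"`
            ID    int    `xml:"id"`
        }
        for {
            tok, err := dec.Token()
            if err != nil {
                break // io.EOF once the file is exhausted
            }
            if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
                var p page
                if dec.DecodeElement(&p, &se) == nil && newPages[p.ID] {
                    fmt.Println("flagged new:", p.ID, p.Title)
                }
            }
        }
    }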

On Tue, Jan 24, 2023 at 4:31 AM Eric Andrew Lewis <
eric.andrew.le...@gmail.com> wrote:

> Hi Ariel,
>
> Thank you for the detail! Helpful to understand.
>
> Is it possible to distinguish completely new pages from pages that only
> have new revisions in the "Adds/changes dumps"? Looking at the nodes in
> the XML, I'm not sure there's a way to do this.
>
> In the interim, I wrote a Go script that parses a meta-history file as
> previously described, with various filters – it excludes redirect pages
> (wow, there are so many), user pages, etc. It worked out rather well. It's
> a bit sloppy, but here is the script
> <https://gist.github.com/ericandrewlewis/2fce39aff70b78b8316cfc87cd10a3eb>
> for reference.
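>
> The core of it boils down to roughly the following sketch (the file name,
> cutoff date, and filters are placeholders; the real script does a bit
> more):
>
>     package main
>
>     import (
>         "compress/bzip2"
>         "encoding/xml"
>         "fmt"
>         "log"
>         "os"
>     )
>
>     type revision struct {
>         Timestamp string `xml:"timestamp"`
>     }
>
>     type page struct {
>         Title    string `xml:"title"`
>         NS       int    `xml:"ns"`
>         Redirect *struct {
>             Title string `xml:"title,attr"`
>         } `xml:"redirect"`
>         Revisions []revision `xml:"revision"`
>     }
>
>     func main() {
>         const cutoff = "2023-01-01T00:00:00Z" // ISO 8601 sorts lexicographically
>         f, err := os.Open("enwiki-pages-meta-history.xml.bz2") // placeholder
>         if err != nil {
>             log.Fatal(err)
>         }
>         defer f.Close()
>         dec := xml.NewDecoder(bzip2.NewReader(f))
>         for {
>             tok, err := dec.Token()
>             if err != nil {
>                 break // io.EOF when done
>             }
>             se, ok := tok.(xml.StartElement)
>             if !ok || se.Name.Local != "page" {
>                 continue
>             }
>             var p page
>             if err := dec.DecodeElement(&p, &se); err != nil {
>                 continue
>             }
>             // Skip non-article namespaces (user pages etc.) and redirects.
>             if p.NS != 0 || p.Redirect != nil || len(p.Revisions) == 0 {
>                 continue
>             }
>             // The page was newly created if its oldest revision falls
>             // after the cutoff.
>             oldest := p.Revisions[0].Timestamp
>             for _, r := range p.Revisions[1:] {
>                 if r.Timestamp < oldest {
>                     oldest = r.Timestamp
>                 }
>             }
>             if oldest >= cutoff {
>                 fmt.Println(oldest, p.Title)
>             }
>         }
>     }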
>
> Eric Andrew Lewis
> ericandrewlewis.com
> +1 610 715 8560
>
>
> On Wed, Jan 18, 2023 at 12:49 PM Ariel Glenn WMF <ar...@wikimedia.org>
> wrote:
>
>> Eric,
>>
>> We don't produce dumps of the revision table in SQL format because some
>> of those revisions may be hidden from public view, and even metadata
>> about them should not be released. We do, however, publish so-called
>> Adds/Changes dumps once a day for each wiki, providing stubs and content
>> files in XML of just the new pages and revisions since the last such
>> dump. They lag about 12 hours behind to allow vandalism and the like to
>> be filtered out by wiki admins, but hopefully that's good enough for
>> your needs. You can find those here:
>> https://dumps.wikimedia.org/other/incr/
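>>
>> For example, a single day's stubs file can be fetched with something like
>> this (sketch only; the file name pattern here is an assumption, so check
>> the directory listing for the exact names):
>>
>>     package main
>>
>>     import (
>>         "fmt"
>>         "io"
>>         "log"
>>         "net/http"
>>         "os"
>>     )
>>
>>     func main() {
>>         date := "20230117" // YYYYMMDD of the run you want
>>         base := "https://dumps.wikimedia.org/other/incr/enwiki/"
>>         name := fmt.Sprintf("enwiki-%s-stubs-meta-hist-incr.xml.gz", date)
>>         url := base + date + "/" + name
>>         resp, err := http.Get(url)
>>         if err != nil {
>>             log.Fatal(err)
>>         }
>>         defer resp.Body.Close()
>>         if resp.StatusCode != http.StatusOK {
>>             log.Fatalf("GET %s: %s", url, resp.Status)
>>         }
>>         out, err := os.Create(name)
>>         if err != nil {
>>             log.Fatal(err)
>>         }
>>         defer out.Close()
>>         if _, err := io.Copy(out, resp.Body); err != nil {
>>             log.Fatal(err)
>>         }
>>     }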
>>
>> Ariel Glenn
>> ar...@wikimedia.org
>>
>> On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <
>> eric.andrew.le...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am interested in performing analysis on recently created pages on
>>> English Wikipedia.
>>>
>>> One way to find recently created pages is to download a meta-history
>>> file for English Wikipedia and filter through the XML, looking for pages
>>> whose oldest revision falls within the desired timespan.
>>>
>>> Since this requires a library to parse the XML, I would imagine it is
>>> much slower than a database query. Is page revision data available in
>>> one of the SQL dumps that I could query for this use case?
>>> Looking at the exported tables list
>>> <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables>,
>>> it does not look like it is. Maybe this is intentional?
>>>
>>> Thanks,
>>> Eric Andrew Lewis
>>> ericandrewlewis.com
>>> +1 610 715 8560
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org
