[Xmldatadumps-l] Re: Querying for recently created pages

2023-02-03 Thread Eric Andrew Lewis
I was able to compile a list of new Wikipedia pages, November 2022 -
January 2023.
Thanks again for the help!

Eric Andrew Lewis
ericandrewlewis.com
+1 610 715 8560


On Wed, Jan 25, 2023 at 6:39 AM Peter Bowman wrote:

> Hello. New pages lack the `<parentid>` field within `<revision>`, so you
> should be able to tell whether a revision creates a new page using this
> information.
___
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org


[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-25 Thread Adhittya Ramadhan
Sorry

On 22 Jan 2023 at 22:28, "John" wrote:

> How far back do you need to go?
>


[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-25 Thread Adhittya Ramadhan



[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-25 Thread Peter Bowman
Hello. New pages lack the `<parentid>` field within `<revision>`, so you
should be able to tell whether a revision creates a new page using this
information.
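
A minimal sketch of this check in Python, using a streaming parser so a multi-gigabyte dump never has to fit in memory. The inline sample is a made-up fragment in the shape of the export format, not real dump data:

```python
import io
import xml.etree.ElementTree as ET

def first_revision_is_page_creation(xml_stream):
    """Yield (page_title, is_new_page) pairs from a MediaWiki export stream.

    A revision with no <parentid> child has no parent revision, i.e. it is
    the revision that created the page.
    """
    results = []
    title = None
    seen_first_revision = False
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
        if tag == "title":
            title = elem.text
            seen_first_revision = False
        elif tag == "revision" and not seen_first_revision:
            seen_first_revision = True  # only the oldest revision matters
            has_parent = any(
                child.tag.rsplit("}", 1)[-1] == "parentid" for child in elem
            )
            results.append((title, not has_parent))
            elem.clear()  # free memory as we stream
    return results

# Made-up two-page sample: the first page is created in this dump, the
# second page's revision continues an existing page.
SAMPLE = """<mediawiki>
  <page>
    <title>Brand new article</title>
    <revision><id>10</id><timestamp>2023-01-20T00:00:00Z</timestamp></revision>
  </page>
  <page>
    <title>Existing article</title>
    <revision><id>11</id><parentid>7</parentid></revision>
  </page>
</mediawiki>"""

print(first_revision_is_page_creation(io.StringIO(SAMPLE)))
```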


[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-24 Thread Hannah Okwelum
Hello Eric. We don't have any tag or field in the Adds/Changes dump that
differentiates new pages from pages with new revisions. But the page table
has a column called "page_is_new", and we dump that table twice a month in
SQL format. You might want to cross-check the page table dump against the
Adds/Changes dump of the same day.

Please note that the two dumps are not produced at exactly the same time,
so the data in the two files might not be consistent.
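
The cross-check could be sketched roughly as below. The SQL line is a toy fabrication in the shape of a `page.sql` dump row; the column order assumed here (`page_id, page_namespace, page_title, page_is_redirect, page_is_new, ...`) must be verified against the `CREATE TABLE` statement at the top of the actual dump file:

```python
import re

# Toy INSERT in the shape of a page table dump (not a real dump row).
SAMPLE_SQL = (
    "INSERT INTO `page` VALUES "
    "(1001,0,'Brand_new_article',0,1,0.5),"
    "(1002,0,'Existing_article',0,0,0.7);"
)

# Capture the first five columns of each value tuple.
TUPLE_RE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)',(\d+),(\d+)")

def new_page_titles(sql_text):
    """Titles of main-namespace (ns 0) pages whose page_is_new flag is set."""
    titles = []
    for m in TUPLE_RE.finditer(sql_text):
        page_ns, title, is_new = m.group(2), m.group(3), m.group(5)
        if page_ns == "0" and is_new == "1":
            titles.append(title)
    return titles

print(new_page_titles(SAMPLE_SQL))
```

The resulting title set can then be intersected with the page titles seen in the same day's Adds/Changes XML.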



[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-23 Thread Eric Andrew Lewis
Hi Ariel,

Thank you for the detail! Helpful to understand.

Is it possible to disambiguate completely new pages from pages with new
revisions in the Adds/Changes dumps? Looking at the nodes in the XML, I'm
not sure there's a way to do this.

In the interim I wrote a Go script that parses a meta-history file as
previously described, applying various filters: it excludes redirect pages
(wow, there are so many), user pages, etc. It worked out rather well. A bit
sloppy, but here is the script for reference.
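
That filtering approach can be sketched as follows. This is a simplified Python stand-in for the Go script, not the script itself, and the sample XML is made up. In the export format, redirect pages carry a `<redirect/>` element and the namespace is in `<ns>` (0 = articles, 2 = user pages):

```python
import io
import xml.etree.ElementTree as ET

def recent_article_creations(xml_stream, start, end):
    """Titles of non-redirect, main-namespace pages created in [start, end)."""
    out = []
    for _, page in ET.iterparse(xml_stream, events=("end",)):
        if page.tag.rsplit("}", 1)[-1] != "page":
            continue
        kids = {c.tag.rsplit("}", 1)[-1]: c for c in page}
        if "redirect" in kids or kids["ns"].text != "0":
            page.clear()
            continue
        # Revisions appear oldest-first, so the first <revision> is the creation.
        first_rev = next(c for c in page if c.tag.rsplit("}", 1)[-1] == "revision")
        ts = next(
            g.text for g in first_rev if g.tag.rsplit("}", 1)[-1] == "timestamp"
        )
        if start <= ts < end:  # ISO 8601 strings compare chronologically
            out.append(kids["title"].text)
        page.clear()
    return out

SAMPLE = """<mediawiki>
  <page><title>Fresh article</title><ns>0</ns>
    <revision><timestamp>2022-12-05T12:00:00Z</timestamp></revision>
  </page>
  <page><title>Old article</title><ns>0</ns>
    <revision><timestamp>2019-03-01T00:00:00Z</timestamp></revision>
  </page>
  <page><title>Some redirect</title><ns>0</ns><redirect/>
    <revision><timestamp>2022-12-06T12:00:00Z</timestamp></revision>
  </page>
  <page><title>User:Someone</title><ns>2</ns>
    <revision><timestamp>2022-12-07T12:00:00Z</timestamp></revision>
  </page>
</mediawiki>"""

print(recent_article_creations(
    io.StringIO(SAMPLE), "2022-11-01T00:00:00Z", "2023-02-01T00:00:00Z"
))
```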

Eric Andrew Lewis
ericandrewlewis.com
+1 610 715 8560




[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-22 Thread John
How far back do you need to go?



[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-22 Thread Adhittya Ramadhan


[Xmldatadumps-l] Re: Querying for recently created pages

2023-01-18 Thread Ariel Glenn WMF
Eric,

We don't produce dumps of the revision table in SQL format because some of
those revisions may be hidden from public view, and even metadata about
them should not be released. We do, however, publish so-called Adds/Changes
dumps once a day for each wiki, providing stub and content files in XML of
just the new pages and revisions since the last such dump. They lag about
12 hours behind, to allow vandalism and the like to be filtered out by wiki
admins, but hopefully that's good enough for your needs. You can find them
here:
https://dumps.wikimedia.org/other/incr/
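
The daily files live under per-wiki, per-date directories. A small helper to build candidate URLs for a date range might look like this; the file-naming convention assumed here (`<wiki>-<YYYYMMDD>-pages-meta-hist-incr.xml.bz2`) should be verified against the directory listing before use:

```python
from datetime import date, timedelta

BASE = "https://dumps.wikimedia.org/other/incr"

def incr_dump_urls(wiki, start, end):
    """Candidate Adds/Changes dump URLs for each day in [start, end]."""
    urls = []
    day = start
    while day <= end:
        stamp = day.strftime("%Y%m%d")
        # Assumed naming convention; confirm against the site's listing.
        urls.append(
            f"{BASE}/{wiki}/{stamp}/{wiki}-{stamp}-pages-meta-hist-incr.xml.bz2"
        )
        day += timedelta(days=1)
    return urls

for url in incr_dump_urls("enwiki", date(2023, 1, 15), date(2023, 1, 17)):
    print(url)
```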

Ariel Glenn
ar...@wikimedia.org

On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <
eric.andrew.le...@gmail.com> wrote:

> Hi,
>
> I am interested in performing analysis on recently created pages on
> English Wikipedia.
>
> One way to find recently created pages is to download a meta-history file
> for English Wikipedia and filter through the XML, looking for pages whose
> oldest revision falls within the desired timespan.
>
> Since this requires a library to parse XML string data, I would imagine
> this is much slower than a database query. Is page revision data available
> in one of the SQL dumps that I could query for this use case? Looking at
> the exported tables list, it does not look like it is. Maybe this is
> intentional?
>
> Thanks,
> Eric Andrew Lewis
> ericandrewlewis.com
> +1 610 715 8560