Hello Eric, so we don’t have any tag or field in the add/changes dump that differentiates between “new pages” and “pages with a new revision”. But we do have a column in the page table called “page_is_new”, and we dump that table twice a month in SQL format. You might want to cross-check the page table dump against the add/changes dump of the same day.
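A minimal sketch of such a cross-check in Go (the language of the golang script mentioned further down the thread): the file names here are hypothetical, and the page-table tuple layout assumed by the regexp must be checked against the CREATE TABLE statement at the top of the real page.sql dump.

// crosscheck.go: rough sketch that flags pages in an adds/changes stub
// file whose page_is_new bit is set in the page table SQL dump.
package main

import (
	"bufio"
	"compress/gzip"
	"encoding/xml"
	"fmt"
	"log"
	"os"
	"regexp"
	"strconv"
)

// loadNewPageIDs collects the page_id of every row whose page_is_new flag
// is 1. ASSUMPTION: tuples look like (page_id,page_namespace,'page_title',
// page_is_redirect,page_is_new,...); verify against the CREATE TABLE
// statement in your copy of the dump and adjust the regexp if needed.
func loadNewPageIDs(path string) (map[int]bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	gz, err := gzip.NewReader(f)
	if err != nil {
		return nil, err
	}
	defer gz.Close()

	row := regexp.MustCompile(`\((\d+),-?\d+,'(?:[^'\\]|\\.)*',[01],([01]),`)
	ids := make(map[int]bool)
	sc := bufio.NewScanner(gz)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<30) // extended INSERT lines are huge
	for sc.Scan() {
		for _, m := range row.FindAllStringSubmatch(sc.Text(), -1) {
			if m[2] == "1" {
				id, _ := strconv.Atoi(m[1])
				ids[id] = true
			}
		}
	}
	return ids, sc.Err()
}

// stubPage captures just the bits of a <page> element we need.
type stubPage struct {
	Title string `xml:"title"`
	ID    int    `xml:"id"`
}

func main() {
	// Hypothetical file names, for illustration only.
	newIDs, err := loadNewPageIDs("enwiki-20230101-page.sql.gz")
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("enwiki-20230124-stubs-meta-hist-incr.xml.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	gz, err := gzip.NewReader(f)
	if err != nil {
		log.Fatal(err)
	}
	defer gz.Close()

	// Stream the stubs file and report pages the page table marks as new.
	dec := xml.NewDecoder(gz)
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF when the file is exhausted
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
			var p stubPage
			if dec.DecodeElement(&p, &se) == nil && newIDs[p.ID] {
				fmt.Printf("probably new: %d\t%s\n", p.ID, p.Title)
			}
		}
	}
}

Loading the page.sql dump into MySQL and querying page_is_new there works just as well; the point is only that the flag lives in the page table, while the adds/changes files supply the page IDs of interest.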
Please note that the two dumps are not generated at exactly the same time, so the data in the two files may not be consistent.

On Tue, Jan 24, 2023 at 4:31 AM Eric Andrew Lewis <eric.andrew.le...@gmail.com> wrote:

> Hi Ariel,
>
> Thank you for the detail! Helpful to understand.
>
> Is it possible to disambiguate completely new pages vs. pages with new
> revisions in the "Adds/changes dumps"? Looking at nodes in the XML I'm not
> sure there's a way to do this.
>
> In the interim I wrote a golang script that parses a meta-history file as
> previously described, with various filters – it excludes redirect pages
> (wow, there are so many), user pages, etc. It worked out rather well. A bit
> sloppy, but here is the script
> <https://gist.github.com/ericandrewlewis/2fce39aff70b78b8316cfc87cd10a3eb>
> for reference.
>
> Eric Andrew Lewis
> ericandrewlewis.com
> +1 610 715 8560
>
> On Wed, Jan 18, 2023 at 12:49 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:
>
>> Eric,
>>
>> We don't produce dumps of the revision table in SQL format because some
>> of those revisions may be hidden from public view, and even metadata about
>> them should not be released. We do, however, publish so-called Adds/Changes
>> dumps once a day for each wiki, providing stubs and content files in XML of
>> just the new pages and revisions since the last such dump. They lag about
>> 12 hours behind, to allow vandalism and such to be filtered out by wiki
>> admins, but hopefully that's good enough for your needs. You can find those
>> here:
>> https://dumps.wikimedia.org/other/incr/
>>
>> Ariel Glenn
>> ar...@wikimedia.org
>>
>> On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <eric.andrew.le...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am interested in performing analysis on recently created pages on
>>> English Wikipedia.
>>>
>>> One way to find recently created pages is to download a meta-history
>>> file for the English language and filter through the XML, looking for
>>> pages whose oldest revision is within the desired timespan.
>>>
>>> Since this requires a library to parse through XML string data, I would
>>> imagine this is much slower than a database query. Is page revision data
>>> available in one of the SQL dumps that I could query for this use case?
>>> Looking at the exported tables list
>>> <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables>,
>>> it does not look like it is. Maybe this is intentional?
>>>
>>> Thanks,
>>> Eric Andrew Lewis
>>> ericandrewlewis.com
>>> +1 610 715 8560
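For reference, the meta-history filtering approach described in the quoted thread condenses to a Go sketch like the one below. This is not Eric's gist; the element names follow the standard dump XML schema, and the oldest-first ordering of revisions within a page is an assumption worth verifying against your file.

// newpages.go: condensed sketch of the filtering approach described above.
// Stream a (stubs- or pages-)meta-history file and keep pages whose oldest
// revision falls inside a time window, skipping redirects and everything
// outside the article namespace.
package main

import (
	"compress/bzip2"
	"encoding/xml"
	"fmt"
	"log"
	"os"
	"time"
)

type page struct {
	Title     string    `xml:"title"`
	Ns        int       `xml:"ns"`
	ID        int       `xml:"id"`
	Redirect  *struct{} `xml:"redirect"` // element present only on redirect pages
	Revisions []struct {
		Timestamp string `xml:"timestamp"`
	} `xml:"revision"`
}

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: newpages <dump.xml.bz2>")
	}
	cutoff := time.Now().AddDate(0, -1, 0) // e.g. pages created in the last month

	f, err := os.Open(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	dec := xml.NewDecoder(bzip2.NewReader(f))
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF when the file is exhausted
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != "page" {
			continue
		}
		var p page
		if dec.DecodeElement(&p, &se) != nil || len(p.Revisions) == 0 {
			continue
		}
		if p.Ns != 0 || p.Redirect != nil {
			continue // skip User: pages etc. and redirects
		}
		// Revisions appear oldest-first within a page, so the first
		// timestamp is the page creation time.
		created, err := time.Parse(time.RFC3339, p.Revisions[0].Timestamp)
		if err == nil && created.After(cutoff) {
			fmt.Printf("%s\t%d\t%s\n", created.Format("2006-01-02"), p.ID, p.Title)
		}
	}
}

Run against the stubs variant, this avoids decoding full revision text and is much faster. The same loop parses the daily adds/changes stub files, but with the caveat discussed above: there the oldest revision in the file is not necessarily the page's first revision, since those files mix brand-new pages with new revisions to existing ones.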
_______________________________________________
Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org