Hi Ariel,

Thank you for the detail! That's helpful to understand.
Is it possible to disambiguate completely new pages vs. pages with new
revisions in the Adds/Changes dumps? Looking at the nodes in the XML, I'm
not sure there's a way to do this.

In the interim I wrote a Go script that parses a meta-history file as
previously described, with various filters: it excludes redirect pages
(wow, there are so many), user pages, etc. It worked out rather well. A bit
sloppy, but here is the script
<https://gist.github.com/ericandrewlewis/2fce39aff70b78b8316cfc87cd10a3eb>
for reference.

Eric Andrew Lewis
ericandrewlewis.com
+1 610 715 8560

On Wed, Jan 18, 2023 at 12:49 PM Ariel Glenn WMF <ar...@wikimedia.org> wrote:

> Eric,
>
> We don't produce dumps of the revision table in SQL format because some of
> those revisions may be hidden from public view, and even metadata about
> them should not be released. We do, however, publish so-called Adds/Changes
> dumps once a day for each wiki, providing stubs and content files in XML of
> just the new pages and revisions since the last such dump. They lag about
> 12 hours behind to allow vandalism and such to be filtered out by wiki
> admins, but hopefully that's good enough for your needs. You can find
> those here:
> https://dumps.wikimedia.org/other/incr/
>
> Ariel Glenn
> ar...@wikimedia.org
>
> On Tue, Jan 17, 2023 at 6:22 AM Eric Andrew Lewis <
> eric.andrew.le...@gmail.com> wrote:
>
>> Hi,
>>
>> I am interested in performing analysis on recently created pages on
>> English Wikipedia.
>>
>> One way to find recently created pages is to download a meta-history file
>> for the English language and filter through the XML, looking for pages
>> whose oldest revision is within the desired timespan.
>>
>> Since this requires a library to parse XML string data, I would imagine
>> this is much slower than a database query. Is page revision data
>> available in one of the SQL dumps that I could query for this use case?
>> Looking at the exported tables list
>> <https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables>,
>> it does not look like it is. Maybe this is intentional?
>>
>> Thanks,
>> Eric Andrew Lewis
>> ericandrewlewis.com
>> +1 610 715 8560
>> _______________________________________________
>> Xmldatadumps-l mailing list -- xmldatadumps-l@lists.wikimedia.org
>> To unsubscribe send an email to xmldatadumps-l-le...@lists.wikimedia.org