Dear Ariel,

0) Context.

I am trying to understand how the XML dumps are split (as seen for enwiki, frwiki, dewiki, etc.), because I would like to write a script that recognizes when a complete set of, say, `pages-articles' split dumps has been posted (even if the `pages-meta-history' split dumps are not yet complete). To that end, I have some questions.

1) Naming.

Most wikis with split files (`dewiki', `frwiki', `wikidatawiki', and six others) are split into four pieces, and there is a one-to-one correspondence between the `pages' and `stub' split files. It is easy to write code for this case.

How are the split dumps for `enwiki' (and soon `frwiki' and `dewiki') named? I notice that the page range of the last `pages' split file changes every month, and there are no page ranges on the `stub' files. There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' split files, which is harder to write code for. It is also not possible to use the `mwxml2sql' transform tool unless there is a one-to-one correspondence between `pages' and `stub' files.

2) Splitting.

How are the dumps split? There seems to be a one-to-one correspondence between `pages-articles' and `stub-articles' files, yet the `enwiki-20151002' dumps are split in an anomalous way: the `pages-articles' dumps are split into 28 files, while the `stub-articles' dumps are split into 27. Likewise with `pages-meta-current' (28 files) and `stub-meta-current' (27 files). Should my code treat this as valid, or flag it as a bug? There is a many-to-one correspondence between `pages-meta-history' and `stub-meta-history' files. How do we understand this well enough to write code for it?

3) Posting.

When split dumps are generated, are the files posted one by one, or atomically as a complete set? In other words, how do we recognize when a `pages-articles' dump set is complete, even if the `pages-meta-history' dump set is missing?
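For concreteness, here is a minimal sketch of the completeness check I have in mind for the one-to-one case. The filename patterns are my assumptions, modeled on what I see for `enwiki-20151002' (`pages' parts carry a page range, `stub' parts do not); the actual naming rules are exactly what I am asking about above.

```python
import re

# Assumed filename patterns (modeled on enwiki-20151002; may not hold generally):
# pages files carry a page range suffix, stub files do not.
PAGES_RE = re.compile(r"^\w+-\d{8}-pages-articles(\d+)\.xml-p\d+p\d+\.bz2$")
STUB_RE = re.compile(r"^\w+-\d{8}-stub-articles(\d+)\.xml\.gz$")

def part_numbers(filenames, pattern):
    """Collect the split-part numbers matched by `pattern` in a file listing."""
    return {int(m.group(1)) for name in filenames if (m := pattern.match(name))}

def articles_set_complete(filenames):
    """Guess whether a pages-articles split set is complete: every stub part
    has a matching pages part, and the part numbers run 1..N with no gaps.
    Returns False for the many-to-one / mismatched-count cases, since those
    are the cases I do not yet know how to validate."""
    pages = part_numbers(filenames, PAGES_RE)
    stubs = part_numbers(filenames, STUB_RE)
    if not pages or pages != stubs:
        return False
    return pages == set(range(1, max(pages) + 1))
```

This deliberately treats the 28-vs-27 anomaly as "not complete", which is one of the two behaviors I am asking whether to adopt.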
Sincerely yours,
Kent

On Fri, Dec 4, 2015 at 4:24 AM, Ariel T. Glenn <agl...@wikimedia.org> wrote:
> On Thursday, 03-12-2015, at 15:30 -0700, Bryan White wrote:
> > I see where almost all the dumps have "Dump complete" next to them
> > and the data has been transferred to labs. Problem is, the dumps are
> > not complete. Is this the new paradigm?... After each stage of the
> > dump, label them done and then transfer what files were generated?
> > Wash, rinse and repeat?
> >
> > Bryan
> > _______________________________________________
>
> Transferring each file that is complete when the rsync runs is the new
> paradigm, which has been happening since sometime last month. The
> marking of all dumps as 'Dump complete' is a bug from my last deploy 2
> days ago; I have to track that down. It should be listing them as
> 'Partial Dump'.
>
> Ariel
>
> _______________________________________________
> Xmldatadumps-l mailing list
> Xmldatadumps-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l