[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-04-30 Thread gerritbot
gerritbot added a comment. Change 507268 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn): [operations/dumps@master] split up page content jobs with max bytes per page range https://gerrit.wikimedia.org/r/507268

[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-04-30 Thread gerritbot
gerritbot added a comment. Change 507268 **merged** by ArielGlenn: [operations/dumps@master] split up page content jobs with max bytes per page range https://gerrit.wikimedia.org/r/507268

[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-05-01 Thread ArielGlenn
ArielGlenn added a comment. The above change was deployed last night and will take effect for the new run starting today. We should see results for the wikidata page-meta-history dump, with files being a lot smaller than the 40GB of some files last month. With numerous small jobs we can at l
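The idea behind the change, as a minimal sketch rather than the actual operations/dumps implementation (the byte cap and the per-page size estimates are assumed inputs here): pages are grouped into jobs until the running total of estimated content bytes would exceed the cap, so no single output file grows without bound.

    # Sketch: group pages into dump jobs so that each job's estimated
    # content stays under a byte cap.  In production the sizes would come
    # from the page/revision tables; here they are just input data.
    def split_by_max_bytes(page_sizes, max_bytes):
        """page_sizes: iterable of (page_id, est_bytes), ordered by page_id.
        Yields (first_page_id, last_page_id) ranges whose total estimated
        bytes stay within max_bytes (a single oversized page still gets
        its own range)."""
        start = end = None
        total = 0
        for page_id, est_bytes in page_sizes:
            if start is not None and total + est_bytes > max_bytes:
                yield (start, end)
                start, end, total = None, None, 0
            if start is None:
                start = page_id
            end = page_id
            total += est_bytes
        if start is not None:
            yield (start, end)

    # Example: cap each job at ~10 MB of estimated content.
    jobs = list(split_by_max_bytes(
        [(1, 4_000_000), (2, 7_000_000), (3, 2_000_000)],
        max_bytes=10_000_000))
    # -> [(1, 1), (2, 3)]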

[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-05-02 Thread Smalyshev
Smalyshev added a comment. I agree that a more efficient format is probably needed, but unfortunately I don't have any immediate ideas - all the dumps I've worked with were RDF etc., and this one is completely different. In general, I'm not even sure Wikidata is a good fit for storing data lik

[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-05-14 Thread ArielGlenn
ArielGlenn added a comment. Old revisions are re-read from previous dumps. But even that takes plenty of time: the old file must be decompressed and the already existing content recompressed so that it can be written out to the new file. Dumps are done by page, so the likelihood of a batch of pages h
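A rough sketch of why the prefetch path is still expensive even when revision text is reused; the file names and the plain pass-through below are illustrative only, not the actual dumps code - the point is that every byte still goes through bzip2 decompression and recompression:

    # Sketch: reusing old revision text still costs CPU, because the
    # previous dump must be decompressed and the unchanged content
    # recompressed into the new output file.
    import bz2

    def recompress_passthrough(prev_dump_path, new_dump_path, chunk_size=1 << 20):
        with bz2.open(prev_dump_path, "rb") as old, \
             bz2.open(new_dump_path, "wb") as new:
            while True:
                chunk = old.read(chunk_size)   # CPU cost: bzip2 decompression
                if not chunk:
                    break
                new.write(chunk)               # CPU cost: bzip2 recompression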

[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-04-21 Thread Mahir256
Mahir256 added a comment. @ArielGlenn It appears that particle physics is a massively collaborative enterprise, so that the results presented in a single paper can have thousands of people behind them, all of whom are credited (hence the particularly large revision size).

[Wikidata-bugs] [Maniphest] [Commented On] T221504: investigate why content history dump of certain wikidata page ranges is so slow

2019-04-22 Thread ArielGlenn
ArielGlenn added a comment. I looked at the author list. But even with around 2000 authors, if we gave each one of them 80 bytes (plenty for a first name, a last name and an id), we'd have 160k of data, not 1.5 megabytes. But one author is represented this way: {"mainsnak": {
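For scale, here is a sketch of what a single author statement looks like in the standard Wikidata JSON layout, and how its serialized size compares with the 80-bytes-per-author estimate above; the property IDs (P50 for author, P1545 for series ordinal), hashes, and item IDs below are invented for illustration:

    # Rough illustration of the size difference between a naive per-author
    # estimate and one Wikidata statement in JSON form.  Hashes, item IDs
    # and qualifier values are made up for the example.
    import json

    author_statement = {
        "mainsnak": {
            "snaktype": "value",
            "property": "P50",      # author (as an item)
            "hash": "0123456789abcdef0123456789abcdef01234567",
            "datavalue": {
                "value": {"entity-type": "item", "numeric-id": 12345678,
                          "id": "Q12345678"},
                "type": "wikibase-entityid",
            },
            "datatype": "wikibase-item",
        },
        "type": "statement",
        "id": "Q56478729$ABCDEF12-3456-7890-ABCD-EF1234567890",
        "rank": "normal",
        "qualifiers": {
            "P1545": [{             # series ordinal (author position)
                "snaktype": "value",
                "property": "P1545",
                "hash": "fedcba9876543210fedcba9876543210fedcba98",
                "datavalue": {"value": "42", "type": "string"},
                "datatype": "string",
            }],
        },
        "qualifiers-order": ["P1545"],
    }

    print(2000 * 80)                          # 160000 bytes: the naive estimate
    print(len(json.dumps(author_statement)))  # several hundred bytes for one author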