daniel added a comment.

Quick addendum to @Tgr's last point:

in theory a lot of resources could be saved if identical slot contents are only written out once (they will be a very frequent occurrence due to reverts)

Reverts are not a new problem, and not the largest one. Inherited slots are: if a page has three slots, but only one of them is frequently edited, the content of the other two will be copied into the output over and over, for every revision. In the database, this is deduplicated using the "inheritance" mechanism, but in the dump, no such mechanism exists yet.

  • not so much in the actual dump files, as the compression there takes care of duplicate content anyway, but it would mean less text to process, both in the dump infrastructure and for reusers. Text IDs / blob URLs can be assumed to be deduplicated, but they also cannot be published without checking that they correspond to a (visible) revision. That seems like a hard problem; can we do something about it?

One approach I could imagine is:

  • always output the location attribute (instead of only using it in stub mode and omitting it in full mode)
  • when generating output for a given page, keep a ledger of the content locations already emitted for that page.
  • when encountering a content location that was already emitted, only emit a stub version of the <text> tag, instead of the full content.
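
To make the three steps above concrete, here is a minimal sketch in Python. It is not the dump code: the function name emit_page_texts, the (location, content) input shape, the "tt:1"-style location strings and the exact form of the stub tag are all assumptions made for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def emit_page_texts(slot_contents):
    """Return the <text> elements for one page.

    slot_contents is a hypothetical iterable of (location, content) pairs,
    one per slot per revision, in output order.
    """
    seen = set()  # per-page ledger of locations already written out
    out = []
    for location, content in slot_contents:
        attr = quoteattr(location)
        if location in seen:
            # Already emitted for this page: stub only, no content.
            out.append(f"<text location={attr} />")
        else:
            seen.add(location)
            out.append(f"<text location={attr}>{escape(content)}</text>")
    return out


# Two revisions of a page with a frequently edited slot and an inherited one.
lines = emit_page_texts([
    ("tt:1", "main v1"), ("tt:2", "unchanged slot"),
    ("tt:3", "main v2"), ("tt:2", "unchanged slot"),
])
```

In this sketch the inherited slot's content appears once, and later revisions only carry a stub that points at the same location. A consumer would need to keep its own map from location to content while reading a page.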

For a single page, the total number of different locations should be small enough to not cause a problem with the size of the ledger. But if we want to make sure, the ledger can be made "leaky", allowing for false negatives: a location that was already emitted may occasionally be treated as new, so its content is written out again. This way, deduplication may not be perfect, but duplicate output would still be rare, while enforcing a cap on the ledger size. This approach is used by Wikibase when emitting RDF dumps; it has an implementation of such a "leaky ledger" in HashDedupeBag, using truncated hashes as keys.
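
A rough sketch of such a leaky ledger, in Python. This only illustrates the truncated-hash idea and is not a port of HashDedupeBag, whose actual interface may differ; the class name LeakyLedger, the already_seen method and the key_length parameter are assumptions.

```python
import hashlib


class LeakyLedger:
    """Size-capped 'have I seen this?' set that allows false negatives.

    The key is a truncated hash, so there are at most 16**key_length
    slots. Each slot remembers the full hash of the last value stored in
    it; a colliding value overwrites the slot, which is where the
    'leak' comes from.
    """

    def __init__(self, key_length=4):
        self.key_length = key_length  # number of hex digits used as key
        self._slots = {}

    def already_seen(self, value):
        digest = hashlib.sha1(value.encode("utf-8")).hexdigest()
        key = digest[: self.key_length]
        if self._slots.get(key) == digest:
            return True
        # New value, or an older value was evicted by a collision.
        self._slots[key] = digest
        return False


ledger = LeakyLedger(key_length=2)   # at most 256 slots
first = ledger.already_seen("tt:2")  # not seen yet, gets recorded
again = ledger.already_seen("tt:2")  # found in its slot
```

Because the full hash is compared, a location that was never recorded is not reported as seen (short of a full hash collision), so content is not skipped by mistake. The cost of the cap is only that an evicted location gets emitted in full a second time.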


TASK DETAIL
https://phabricator.wikimedia.org/T199121

To: ArielGlenn, daniel
Cc: kchapman, tstarling, awight, JAllemandou, hoo, pmiazga, Nemo_bis, brion, Tgr, Aklapper, Fjalapeno, ArielGlenn, daniel, Nandana, kostajh, Lahi, Gq86, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, JJMC89, Agabi10, D3r1ck01, SBisson, gnosygnu, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm