Mitar added a comment.

  I learned today that Wikipedia has a nice approach with a multistream bz2 
archive <https://dumps.wikimedia.org/enwiki/> and an additional index file, 
which tells you the offset into the bz2 archive at which you have to decompress 
a chunk to access a particular page. Wikidata could do the same, just for items 
and properties. This would allow one to extract only those entities one cares 
about. Multistream also enables one to decompress parts of the file in parallel 
on multiple machines, by distributing offsets between them. Wikipedia also 
provides the same multistream archive split into multiple files, so the whole 
dump can be distributed over multiple machines even more easily. I like that 
approach.
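
  To illustrate, here is a minimal Python sketch of how a client could use 
such an index to pull out a single chunk. It assumes the index gives a byte 
offset per entity (the enwiki multistream index uses offset:page_id:title 
lines); the file name and offset below are placeholders, not real values.

    import bz2

    def read_stream_at(dump_path, offset):
        """Decompress the single bz2 stream starting at `offset` in a
        multistream archive and return its decompressed bytes."""
        with open(dump_path, "rb") as f:
            f.seek(offset)
            decompressor = bz2.BZ2Decompressor()
            chunks = []
            # Feed the file in blocks until this stream's end-of-stream
            # marker is reached; remaining streams are left untouched.
            while not decompressor.eof:
                block = f.read(64 * 1024)
                if not block:
                    break
                chunks.append(decompressor.decompress(block))
            return b"".join(chunks)

    # Usage: the offset would come from the index file; placeholder shown.
    # data = read_stream_at("wikidata-multistream.json.bz2", 123456789)

  Because each stream is self-contained, a set of offsets can also be split 
across workers and decompressed in parallel, which is the parallelism benefit 
mentioned above.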

TASK DETAIL
  https://phabricator.wikimedia.org/T115223
