Mitar added a comment.

  I learned today that Wikipedia has a nice approach with a multistream bz2 
archive <https://dumps.wikimedia.org/enwiki/> and an additional index file, 
which tells you the offset into the bz2 archive at which you have to decompress 
a chunk to access a particular page. Wikidata could do the same, just for items 
and properties. This would allow one to extract only those entities one cares 
about. Multistream also enables one to decompress parts of the file in parallel 
on multiple machines, by distributing offsets between them. Wikipedia also 
provides the same multistream archive split into multiple files, so the whole 
dump can be distributed over multiple machines even more easily. I like that 
approach.
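
  To illustrate, here is a minimal Python sketch of how a client could use 
such an index to pull out a single chunk. It assumes the index gives a byte 
offset per entity (the enwiki multistream index uses offset:page_id:title 
lines); the file name and offset below are placeholders, not real values.

    import bz2

    def read_stream_at(dump_path, offset):
        """Decompress the single bz2 stream starting at `offset` in a
        multistream archive and return its decompressed bytes."""
        with open(dump_path, "rb") as f:
            f.seek(offset)
            decompressor = bz2.BZ2Decompressor()
            chunks = []
            # Feed the file in blocks until this stream's end-of-stream
            # marker is reached; remaining streams are left untouched.
            while not decompressor.eof:
                block = f.read(64 * 1024)
                if not block:
                    break
                chunks.append(decompressor.decompress(block))
            return b"".join(chunks)

    # Usage: the offset would come from the index file; placeholder shown.
    # data = read_stream_at("wikidata-multistream.json.bz2", 123456789)

  Because each stream is self-contained, a set of offsets can also be split 
across workers and decompressed in parallel, which is the parallelism benefit 
mentioned above.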

TASK DETAIL
  https://phabricator.wikimedia.org/T115223
