bennofs created this task.
bennofs added projects: Wikidata, Dumps-Generation.

TASK DESCRIPTION
  At this time, Wikidata provides JSON dumps compressed with gzip or bzip2.
However, neither is optimal:
  
  - the gzip dump is quite big (about 100% larger than the bzip2 one)
  - the bzip2 dump takes a long time to decompress (an estimated 7 hours on my 
laptop; a rough timing sketch follows this list)
  
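  For reference, here is a minimal sketch of how such a decompression time can 
be measured: a single streaming pass over the bzip2 dump using Python's 
standard `bz2` module. The filename assumes the usual `latest-all.json.bz2` 
dump; adjust as needed.

    import bz2
    import time

    start = time.monotonic()
    n_bytes = 0
    # Stream the dump in 1 MiB chunks; decompression dominates the runtime.
    with bz2.open("latest-all.json.bz2", "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            n_bytes += len(chunk)

    elapsed = time.monotonic() - start
    print(f"decompressed {n_bytes / 1e9:.1f} GB in {elapsed / 3600:.2f} h")
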
  As a consumer of these dumps, I would find it very useful to have a format 
that compresses well but also decompresses quickly. I tested Zstandard 
<https://facebook.github.io/zstd/>, and it performs much better than either of 
the two current variants:
  
  - decompression (with default compression level settings) is //much// 
faster: about 15 minutes on my laptop, CPU bound (this might even be faster 
than gzip; I didn't have enough SSD space to test how gzip performs)
  - the size at default settings is very close to bzip2's: 37.7 GB, compared 
to the ~35 GB that bzip2 produces
  
  This directly affects the processing speed of tools that operate on these 
dumps.
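
  To illustrate what consuming a Zstandard dump could look like, here is a 
minimal sketch that streams entities out of a hypothetical 
`latest-all.json.zst` file with the third-party `zstandard` Python package 
(`pip install zstandard`), assuming the same one-entity-per-line JSON array 
layout as the current dumps:

    import io
    import json

    import zstandard

    with open("latest-all.json.zst", "rb") as raw:
        # stream_reader decompresses lazily, so the uncompressed JSON
        # never has to touch the disk.
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        text = io.TextIOWrapper(reader, encoding="utf-8")
        for line in text:
            line = line.rstrip().rstrip(",")  # each entity line ends with ","
            if line in ("[", "]", ""):
                continue  # skip the array brackets
            entity = json.loads(line)
            # ... process the entity dict here ...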

TASK DETAIL
  https://phabricator.wikimedia.org/T222985
