bennofs created this task. bennofs added projects: Wikidata, Dumps-Generation. Restricted Application added a subscriber: Liuxinyu970226.
TASK DESCRIPTION

At this time, Wikidata provides JSON dumps compressed with gzip or bzip2. However, neither is optimal:

- the gzip dump is quite big (about 100% larger than bzip2)
- the bzip2 dump takes a long time to decompress (an estimated 7 hours on my laptop)

As a consumer of these dumps, it would be nice to have a format that compresses well but also decompresses quickly. I tested Zstandard <https://facebook.github.io/zstd/> and it performs much better than either of those two variants:

- decompression (at the default compression level) is //much// faster: about 15 minutes on my laptop (CPU bound). This might even beat gzip, but I didn't have enough SSD space to test how well gzip performs.
- the size at default settings is very close to bzip2 (37.7 GB compared to the ~35 GB that bzip2 produces)

This directly affects the processing speed of tools operating on these dumps.

TASK DETAIL
https://phabricator.wikimedia.org/T222985
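The size-versus-decompression-speed trade-off described above can be sketched with a small round-trip benchmark. This is a minimal sketch using Python's standard-library gzip and bz2 modules on a hypothetical stand-in payload (the figures in the task come from the real multi-gigabyte dumps, not this toy data); the zstd line in the comment assumes the third-party `zstandard` package.

```python
import bz2
import gzip
import json
import time

# Hypothetical stand-in for Wikidata entity JSON lines (not the real dump).
sample = "\n".join(
    json.dumps({"id": f"Q{i}", "labels": {"en": {"language": "en", "value": f"item {i}"}}})
    for i in range(5000)
).encode("utf-8")

def roundtrip(name, compress, decompress):
    """Compress the sample, time decompression, and report the compressed size."""
    blob = compress(sample)
    start = time.perf_counter()
    out = decompress(blob)
    elapsed = time.perf_counter() - start
    assert out == sample  # sanity check: lossless round trip
    print(f"{name}: {len(blob)} bytes compressed, decompressed in {elapsed:.4f}s")
    return len(blob)

gz_size = roundtrip("gzip", gzip.compress, gzip.decompress)
bz_size = roundtrip("bzip2", bz2.compress, bz2.decompress)

# Zstandard would plug into the same harness via the `zstandard` package:
#   import zstandard
#   roundtrip("zstd", zstandard.compress, zstandard.decompress)
```

On repetitive JSON like this, bzip2 usually compresses tighter than gzip but decompresses more slowly, which is the trade-off the task is asking to escape.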
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs