[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2024-04-08 Thread Lydia_Pintscher
Lydia_Pintscher added a parent task: T88991: improve Wikidata dumps [tracking]. TASK DETAIL https://phabricator.wikimedia.org/T222985 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: Lydia_Pintscher Cc: Sascha, Mitar, ImreSamu, hoo, Smalyshev,

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-13 Thread Mitar
Mitar added a comment. Awesome! Thanks. This looks really amazing. I am not too convinced that we should introduce a different dump format, but changing compression seems to really be a low hanging fruit.

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-12 Thread Sascha
Sascha added a comment. @Mitar Done.

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-08 Thread Mitar
Mitar added a comment. I think it would be useful to have a benchmark with more options: JSON with gzip, bzip2 (decompressed with lbzip2), and zstd. And then for QuickStatements the same. Could you do that?
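
The kind of comparison requested here can be sketched in a few lines of Python. This is only an illustration, not the benchmark from the linked repository: the sample records are made up, the helper names (`make_sample`, `load_codecs`, `benchmark`) are hypothetical, and it uses Python's in-process codecs rather than the lbzip2 or zstd command-line tools. The zstd entry is included only if the third-party `zstandard` package happens to be installed.

```python
import bz2
import gzip
import json
import time


def make_sample(n_entities=2000):
    """Build newline-delimited JSON loosely shaped like entity records (made-up data)."""
    lines = []
    for i in range(n_entities):
        entity = {
            "id": f"Q{i}",
            "type": "item",
            "labels": {"en": {"language": "en", "value": f"sample item {i}"}},
        }
        lines.append(json.dumps(entity))
    return "\n".join(lines).encode("utf-8")


def load_codecs():
    """Return {name: (compress, decompress)}; zstd is added only when available."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    try:
        import zstandard  # third-party package, optional (assumption)
    except ImportError:
        return codecs
    codecs["zstd"] = (
        zstandard.ZstdCompressor().compress,
        zstandard.ZstdDecompressor().decompress,
    )
    return codecs


def benchmark(data):
    """Return {name: (compressed_size_in_bytes, decompression_seconds)}."""
    results = {}
    for name, (compress, decompress) in load_codecs().items():
        packed = compress(data)
        start = time.perf_counter()
        unpacked = decompress(packed)
        elapsed = time.perf_counter() - start
        assert unpacked == data  # round trip must be lossless
        results[name] = (len(packed), elapsed)
    return results


if __name__ == "__main__":
    sample = make_sample()
    for name, (size, seconds) in benchmark(sample).items():
        print(f"{name}: {size} bytes, decompressed in {seconds:.4f} s")
```

Numbers from such a small synthetic sample say little about a full dump; the sketch only shows the shape of the measurement (compressed size plus decompression wall time per codec).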

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-08 Thread Sascha
Sascha added a comment. I’ve tried Wikidata dumps in QuickStatements format with Zstd compression, and benchmarked it: https://github.com/brawer/wikidata-qsdump File size shrinks to one third, and decompression is 150 times faster (on a typical modern cloud server) compared to pbzip2.

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-21 Thread bennofs
bennofs added a comment. In T222985#7163999, @ArielGlenn wrote: > lbzip2 decompresses in parallel as well. We use that for compression of the SQL/XML dumps. Yes, the problem is that bzip2 is just really slow to decompress in

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread Mitar
Mitar added a comment. OK, so it seems the problem is in pbzip2. It is not able to decompress in parallel unless compression was made with pbzip2, too. But lbzip2 can decompress all of them in parallel. See: $ time bunzip2 -c -k latest-lexemes.json.bz2 > /dev/null real

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread ArielGlenn
ArielGlenn added a comment. In T222985#7164049, @Mitar wrote: > Are you saying that existing wikidata json dumps can be decompressed in parallel if using lbzip2, but not pbzip2? lbzip2 is format-compatible with bzip2 and can read

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread Mitar
Mitar added a comment. Are you saying that existing wikidata json dumps can be decompressed in parallel if using lbzip2, but not pbzip2?

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread ArielGlenn
ArielGlenn added a comment. lbzip2 decompresses in parallel as well. We use that for compression of the SQL/XML dumps.

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-19 Thread Mitar
Mitar added a comment. As a reference see also this discussion. I think the problem with bzip2 is that it is currently single-stream, so one cannot really decompress it in parallel.
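
To make the single-stream versus multistream distinction concrete, here is a minimal Python sketch. It assumes the data has already been split into chunks and that each compressed stream is kept separately (a real tool would need an index or some other way to locate stream boundaries in a file). The helper names are hypothetical, and the sketch says nothing about how the actual dumps are produced. Later comments in this thread also note that lbzip2 can decompress the existing dumps in parallel.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_multistream(chunks):
    """Compress every chunk as its own independent bzip2 stream."""
    return [bz2.compress(chunk) for chunk in chunks]


def decompress_parallel(streams, workers=4):
    """Decompress independent streams concurrently and join them in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the output matches the original layout
        return b"".join(pool.map(bz2.decompress, streams))


if __name__ == "__main__":
    # Made-up sample data: eight chunks of repeated JSON lines
    chunks = [f'{{"id": "Q{i}"}}\n'.encode() * 1000 for i in range(8)]
    streams = compress_multistream(chunks)

    # Concatenated streams still form a valid .bz2 file for sequential readers
    multistream_file = b"".join(streams)
    assert bz2.decompress(multistream_file) == b"".join(chunks)

    # With known stream boundaries, each stream can be handled by a separate worker
    assert decompress_parallel(streams) == b"".join(chunks)
    print("round trip OK")
```

Threads are used here for brevity; a process pool would be an alternative if the workers turn out not to run concurrently in a given environment.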

[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-04-29 Thread ImreSamu
Restricted Application added a project: wdwb-tech.