ArielGlenn added a comment.
We need some timing tests on these: is there a happy medium between 'best
settings for compression' and 'best settings for speed'? What are we looking at
in terms of execution time and space if we add this step? We'd continue to
provide bz2s, I guess, since those …
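(One way to get at that trade-off is a level sweep over a sample of the dump,
measuring both wall-clock time and compressed size. A minimal sketch; the
sample filename and the chosen levels are assumptions, not from this thread:

$ for lvl in 3 9 15 19; do echo "level $lvl:"; time zstd -$lvl -T0 -c sample.json | wc -c; done

Here wc -c reports the compressed size in bytes, so no output file and no
extra disk space are needed.)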
ArielGlenn added a comment.
I don't want to replace existing compression formats; this would be in
addition to what we have.
I'll have to look at the graphs to see where we stand on CPU usage.
Let's just do the JSON dump for now, if we do this.
Smalyshev added a comment.
I tried zstd some time ago and found that with default settings it's bigger
than bz2, and with max settings it's rather slow, so I did not proceed. But I
agree that decompression speed might matter too; I did not consider that.
So if there's no problem storing …
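(zstd's middle levels combined with multithreading and long-range matching are
one candidate for a middle ground between those two extremes. A minimal
sketch; the level chosen here is an assumption, and a window set via --long
must be passed again at decompression time:

$ zcat wikidata-20190513-all.json.gz | zstd -15 -T0 --long=27 > wikidata-20190513-all.json.zst
$ zstd -d --long=27 -c wikidata-20190513-all.json.zst | wc -c)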
ArielGlenn added a comment.
I've run some tests using the (NFS-mounted) filesystem to which our dumps are
written in production.
ariel@snapshot1008:/mnt/dumpsdata/otherdumps/wikibase/wikidatawiki/20190513$
time (zcat wikidata-20190513-all.json.gz | gzip > …
bennofs added a comment.
$ time zstdcat -v -d wikidata-20190506-all.json.zst | zstd > /dev/null
real    4m5.341s
user    2m22.452s
bennofs added a comment.
But I can do a zstd decompression -> zstd compression test.
bennofs added a comment.
I don't have enough disk space for a compression test; that's correct.
bennofs added a comment.
Now the same with zstd:
$ time zstdcat -v -d wikidata-20190506-all.json.zst | cat > /dev/null
real    3m48.657s
user    0m3.792s
sys     0m58.768s
Here are the sizes:
35G wikidata-20190506-all.json.bz2
39G …
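(If a .zst file was written with the content size recorded in its frame
header, zstd can report both compressed and decompressed sizes directly; a
sketch, with the filename assumed to match the listing above:

$ zstd -l wikidata-20190506-all.json.zst)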
ArielGlenn added a comment.
Impressive. Would you be willing to do a compression timing test too, or is
that prohibitive given your available disk space?
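(A compression timing test can sidestep the disk-space limit by discarding the
compressor's output, so only CPU time and throughput are measured. A minimal
sketch; the compression level here is an assumption:

$ time lbzip2 -d -c wikidata-20190506-all.json.bz2 | zstd -19 -T0 > /dev/null)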
bennofs added a comment.
So I tried lbzip2; here's the result (on a VM server with 2 cores at 2.1 GHz;
the decompression is CPU bound):
$ time lbzip2 -n2 -v -d -c wikidata-20190506-all.json.bz2 | cat > /dev/null
ArielGlenn added a comment.
Have you tried lbzip2? You can specify a number of threads and get some
speedup for compression or decompression, even from pipes.
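(A sketch of lbzip2 on the compression side of a pipe; the thread count and
filenames here are assumptions:

$ zcat wikidata-20190513-all.json.gz | lbzip2 -n4 > wikidata-20190513-all.json.bz2)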
TASK DETAIL: https://phabricator.wikimedia.org/T222985