[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2024-04-08 Thread Lydia_Pintscher
Lydia_Pintscher added a parent task: T88991: improve Wikidata dumps [tracking].

TASK DETAIL
  https://phabricator.wikimedia.org/T222985

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Lydia_Pintscher
Cc: Sascha, Mitar, ImreSamu, hoo, Smalyshev, ArielGlenn, Liuxinyu970226, 
bennofs, Ullasoff, Amayus, Danny_Benjafield_WMDE, JEbe-WMF, Stevemunene, 
S8321414, xcollazo, Busfault, Astuthiodit_1, Atieno, karapayneWMDE, Invadibot, 
maantietaja, jannee_e, ItamarWMDE, Akuckartz, holger.knust, Nandana, Lahi, 
Gq86, GoranSMilovanovic, Lunewa, QZanden, KimKelting, LawExplorer, _jensen, 
rosalieper, Scott_WUaS, gnosygnu, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-13 Thread Mitar
Mitar added a comment.


  Awesome! Thanks. This looks really amazing. I am not too convinced that we 
should introduce a different dump format, but changing the compression really 
seems to be low-hanging fruit.



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-12 Thread Sascha
Sascha added a comment.


  @Mitar Done.



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-08 Thread Mitar
Mitar added a comment.


  I think it would be useful to have a benchmark with more options: JSON with 
gzip, bzip2 (decompressed with lbzip2), and zstd. And then the same for 
QuickStatements. Could you do that?
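
A benchmark matrix like the one requested here could be sketched as follows. This is a minimal illustration, not the benchmark that was actually run: it times in-memory, single-threaded decompression through Python library bindings rather than command-line tools such as lbzip2, the function names `available_codecs` and `benchmark` are hypothetical, and zstd is only included when the interpreter ships the `compression.zstd` module (Python 3.14 and later).

```python
import bz2
import gzip
import time


def available_codecs():
    """Return {name: (compress, decompress)} for codecs usable in this interpreter."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    try:
        # Assumption: only present on Python 3.14+; skipped silently otherwise.
        from compression import zstd
        codecs["zstd"] = (zstd.compress, zstd.decompress)
    except ImportError:
        pass
    return codecs


def benchmark(payload, codecs=None, repeats=3):
    """Compress payload once per codec, then time decompression (best of repeats)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    results = {}
    for name, (compress, decompress) in (codecs or available_codecs()).items():
        blob = compress(payload)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            restored = decompress(blob)
            best = min(best, time.perf_counter() - start)
        if restored != payload:
            raise ValueError(f"{name} round trip did not reproduce the input")
        results[name] = {"bytes": len(blob), "seconds": best}
    return results


if __name__ == "__main__":
    # Hypothetical stand-in for a slice of a JSON dump.
    sample = b'{"id": "Q42", "type": "item"}\n' * 50000
    for name, row in benchmark(sample).items():
        print(f"{name:6s} {row['bytes']:>10d} bytes  {row['seconds']:.4f} s")
```

Running the same function on a slice of the JSON dump and on the corresponding QuickStatements text would give the two sets of rows asked for above.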



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2023-05-08 Thread Sascha
Sascha added a comment.


  I’ve tried Wikidata dumps in QuickStatements format with Zstd compression, 
and benchmarked it: https://github.com/brawer/wikidata-qsdump 
  File size shrinks to one third, and decompression is 150 times faster (on a 
typical modern cloud server) compared to pbzip2.



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-21 Thread bennofs
bennofs added a comment.


  In T222985#7163999, @ArielGlenn wrote:
  
  > lbzip2 decompresses in parallel as well. We use that for compression of the 
SQL/XML dumps.
  
  Yes, the problem is that bzip2 is just really slow to decompress in general. 
You need to use a lot of cores before it gets faster than single-thread gzip.



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread Mitar
Mitar added a comment.


  OK, so it seems the problem is in pbzip2. It is not able to decompress in 
parallel unless compression was made with pbzip2, too. But lbzip2 can 
decompress all of them in parallel.
  
  See:
  
$ time bunzip2 -c -k latest-lexemes.json.bz2 > /dev/null

real    1m0.101s
user    0m59.912s
sys     0m0.180s
$ time pbzip2 -d -k -c latest-lexemes.json.bz2 > /dev/null

real    0m57.662s
user    0m57.792s
sys     0m0.180s
$ time lbunzip2 -c -k latest-lexemes.json.bz2 > /dev/null

real    0m13.346s
user    1m35.951s
sys     0m2.342s
$ lbunzip2 -c -k latest-lexemes.json.bz2 > serial.json
$ pbzip2 -z < serial.json > parallel.json.bz2
$ time lbunzip2 -c -k parallel.json.bz2 > /dev/null

real    0m16.270s
user    1m43.004s
sys     0m2.262s
$ time pbzip2 -d -c -k parallel.json.bz2 > /dev/null

real    0m17.324s
user    1m52.946s
sys     0m0.659s
  
  Size is very similar:
  
$ ll parallel.json.bz2 latest-lexemes.json.bz2 
-rw-rw-r-- 1 mitar mitar 168657719 Jun 15 20:36 latest-lexemes.json.bz2
-rw-rw-r-- 1 mitar mitar 168840138 Jun 20 07:35 parallel.json.bz2
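
To read the timings above at a glance, the `time` values can be converted to seconds and compared. The following is a minimal sketch and not part of the original comment: the helper names `parse_time` and `summarize` are hypothetical, and the numbers are copied from the bunzip2 and lbunzip2 runs on latest-lexemes.json.bz2.

```python
import re


def parse_time(text):
    """Convert a `time` value such as '1m0.101s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized time value: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def summarize(serial, parallel):
    """Compare two runs given as dicts with 'real', 'user' and 'sys' strings."""
    s = {key: parse_time(value) for key, value in serial.items()}
    p = {key: parse_time(value) for key, value in parallel.items()}
    return {
        # How much sooner the parallel run finished (wall clock).
        "wall_speedup": s["real"] / p["real"],
        # How much more total CPU time the parallel run consumed.
        "cpu_overhead": (p["user"] + p["sys"]) / (s["user"] + s["sys"]),
    }


# Values copied from the transcript above.
bunzip2_run = {"real": "1m0.101s", "user": "0m59.912s", "sys": "0m0.180s"}
lbunzip2_run = {"real": "0m13.346s", "user": "1m35.951s", "sys": "0m2.342s"}

summary = summarize(bunzip2_run, lbunzip2_run)
print(f"wall-clock speedup: {summary['wall_speedup']:.1f}x")
print(f"total CPU time ratio: {summary['cpu_overhead']:.1f}x")
```

On these numbers lbunzip2 finishes about 4.5 times sooner than bunzip2 while spending roughly 1.6 times the CPU time.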



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread ArielGlenn
ArielGlenn added a comment.


  In T222985#7164049, @Mitar wrote:
  
  > Are you saying that existing wikidata json dumps can be decompressed in 
parallel if using lbzip2, but not pbzip2?
  
  lbzip2 is format-compatible with bzip2 and can read bzip2 or lbzip2 
compressed files and use multiple cores to decompress, indeed. pbzip2 should 
also work, for that matter.



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread Mitar
Mitar added a comment.


  Are you saying that existing wikidata json dumps can be decompressed in 
parallel if using lbzip2, but not pbzip2?



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-20 Thread ArielGlenn
ArielGlenn added a comment.


  lbzip2 decompresses in parallel as well. We use that for compression of the 
SQL/XML dumps.



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-06-19 Thread Mitar
Mitar added a comment.


  As a reference see also this discussion.
  
  I think the problem with bzip2 is that it is currently single-stream, so one 
cannot really decompress it in parallel. Based on this answer, it seems that 
this was done on purpose, but since 2016 maybe we do not have to worry about 
compatibility anymore and can just change bzip2 to be multistream? For example, 
by using this tool.
  
  But from my experience (from other contexts), zstd is really good. +1 on 
providing that as well, if possible from disk space perspective.
  
  I think that by supporting parallel decompression, issue 
https://phabricator.wikimedia.org/T115223 could be addressed as well.
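
Whether a given .bz2 file is single-stream or multistream can be checked by decompressing it and counting how many times the end of a stream is reached. The following is a minimal sketch using Python's standard `bz2` module; the function name `count_bzip2_streams` is hypothetical, and because it decompresses the whole file it will be slow on a full dump.

```python
import bz2


def count_bzip2_streams(path, chunk_size=64 * 1024):
    """Count the concatenated bzip2 streams in the file at path."""
    streams = 0
    decompressor = bz2.BZ2Decompressor()
    pending = b""  # bytes left over after the end of the previous stream
    with open(path, "rb") as handle:
        while True:
            data = pending or handle.read(chunk_size)
            pending = b""
            if not data:
                break
            # The decompressed output is discarded; only stream ends matter.
            decompressor.decompress(data)
            if decompressor.eof:
                streams += 1
                # Anything after the stream end belongs to the next stream,
                # which needs a fresh decompressor object.
                pending = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
    return streams


if __name__ == "__main__":
    import sys

    print(count_bzip2_streams(sys.argv[1]))
```

A result of 1 means the file is a single stream; a larger number means it consists of several streams written back to back.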



[Wikidata-bugs] [Maniphest] T222985: Provide wikidata JSON dumps compressed with zstd

2021-04-29 Thread ImreSamu
Restricted Application added a project: wdwb-tech.
