ArielGlenn added a comment.

It does indeed look like the specific compression implementation.

I grabbed about 30 GB of uncompressed data from a wikidata pages-meta-history file and an enwiki file, bz2-compressed both, checked that the compressed sizes were also nearly the same, and ran some tests.

ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 en-pmh.bz2 | head -171828490 | bzip2 > en-partial.bz2
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 wd-pmh.bz2 | head -2391222 | bzip2 > wd-partial.bz2
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 wd-pmh.bz2 | head -2391222 | wc -c
30424472866
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 en-pmh.bz2 | head -171828490 | wc -c
30161804631
ariel@dumpsdata1002:/data/arieltemp$ ls -l *partial*
-rw-r--r-- 1 ariel wikidev 2930976277 Jan 22 22:51 en-partial.bz2
-rw-r--r-- 1 ariel wikidev 2876199800 Jan 22 20:09 wd-partial.bz2
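The essential property the prep above relies on is that a bzip2 round trip is lossless, so byte counts of the uncompressed streams are directly comparable. A minimal sanity check of that, on small synthetic data rather than the real 30 GB dumps:

```shell
# Round-trip check mirroring the prep above, using seq output as a
# stand-in for dump text (the real runs used pages-meta-history files).
orig=$(seq 1 100000 | wc -c)
round=$(seq 1 100000 | bzip2 | bzcat | wc -c)
echo "original=$orig roundtrip=$round"
[ "$orig" -eq "$round" ] && echo "sizes match"
# last line printed: sizes match
```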

Here are the timing tests. I'd already run fuller tests on more data, but running them against the original files was taking too long.
bzcat isn't the bottleneck: there is a difference (about 16 seconds out of roughly 13 minutes), but it's small compared to the total time.

ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat wd-partial.bz2  > /dev/null )
real    12m39.037s
user    12m36.040s
sys     0m2.960s
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat en-partial.bz2  > /dev/null )
real    12m54.762s
user    12m51.664s
sys     0m3.056s

Now let's time decompression followed by recompression.

ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat en-partial.bz2  | bzip2 >  /dev/null )
real    93m51.660s
user    97m35.768s
sys     0m43.056s
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat wd-partial.bz2  | bzip2 >  /dev/null )
real    140m51.785s
user    144m15.584s
sys     0m40.860s
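To put those wall-clock times in perspective: dividing the uncompressed byte counts from above by the real times gives the effective single-threaded recompression throughput. This is just arithmetic on the figures already shown:

```shell
# Effective throughput of the bzcat | bzip2 runs: uncompressed bytes
# per second of wall-clock time, from the sizes and real times above.
awk 'BEGIN {
  en = 30161804631 / (93*60 + 51.660)   # enwiki partial file
  wd = 30424472866 / (140*60 + 51.785)  # wikidata partial file
  printf "enwiki: %.1f MB/s  wikidata: %.1f MB/s\n", en/1e6, wd/1e6
}'
# -> enwiki: 5.4 MB/s  wikidata: 3.6 MB/s
```

So stock bzip2 chews through the wikidata content about a third slower than enwiki content of nearly the same size.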

Horrible. Let's try another bzip2 implementation that happens to be lying around:

ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat en-partial.bz2  | lbzip2 -n 1  >  /dev/null )
real    34m59.430s
user    56m28.592s
sys     1m30.448s
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat wd-partial.bz2  | lbzip2 -n 1  >  /dev/null )
real    30m52.658s
user    50m17.488s
sys     1m28.340s
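The speedup from switching implementations (again just arithmetic on the real times above) is much larger for the wikidata content, which is exactly where stock bzip2 was slowest:

```shell
# Wall-clock speedup of lbzip2 -n 1 over bzip2, per input file,
# computed from the real times reported in the runs above.
awk 'BEGIN {
  printf "enwiki speedup: %.1fx  wikidata speedup: %.1fx\n",
         (93*60 + 51.660) / (34*60 + 59.430),
         (140*60 + 51.785) / (30*60 + 52.658)
}'
# -> enwiki speedup: 2.7x  wikidata speedup: 4.6x
```

Note these runs pinned lbzip2 to one worker thread (-n 1), so the gain is per-core efficiency, not parallelism.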

Next up: hack MediaWiki to add an lbzip2 stream handler, then time the dump of a single history file for a given page range with the bz2 stream vs. the lbzip2 stream.


TASK DETAIL
https://phabricator.wikimedia.org/T214293
