ArielGlenn added a comment.
It does indeed look like the culprit is the specific compression implementation.
I grabbed about 30gb uncompressed from a wikidata pages-meta-history file and an enwiki file, bz2 compressed them, checked that compressed sizes were nearly the same too, and did some tests.
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 en-pmh.bz2 | head -171828490 | bzip2 > en-partial.bz2
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 wd-pmh.bz2 | head -2391222 | bzip2 > wd-partial.bz2
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 wd-pmh.bz2 | head -2391222 | wc -c
30424472866
ariel@dumpsdata1002:/data/arieltemp$ lbzcat -n 3 en-pmh.bz2 | head -171828490 | wc -c
30161804631
ariel@dumpsdata1002:/data/arieltemp$ ls -l *partial*
-rw-r--r-- 1 ariel wikidev 2930976277 Jan 22 22:51 en-partial.bz2
-rw-r--r-- 1 ariel wikidev 2876199800 Jan 22 20:09 wd-partial.bz2
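As a quick check that the two samples really are comparable, the byte counts above can be turned into compression ratios. This is only a sketch: the `ratio` helper is a hypothetical name, and the numbers are copied from the `wc -c` and `ls -l` output.

```shell
#!/bin/sh
# ratio COMPRESSED_BYTES UNCOMPRESSED_BYTES
# Prints the compressed size as a percentage of the uncompressed size.
# (Hypothetical helper for illustration; not part of any dump tooling.)
ratio() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.2f\n", 100 * c / u }'
}

ratio 2930976277 30161804631   # en-partial.bz2, prints 9.72
ratio 2876199800 30424472866   # wd-partial.bz2, prints 9.45
```

Both samples compress to a little under 10% of their original size, so the difference in recompression time is not explained by one input simply being more compressible than the other.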
Here are the timing tests. I'd already run fuller tests on more data, but running on the original files was taking too long.
bzcat isn't the bottleneck, there's a difference but it's small compared to the total time.

ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat wd-partial.bz2 > /dev/null )
real 12m39.037s
user 12m36.040s
sys 0m2.960s
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat en-partial.bz2 > /dev/null )
real 12m54.762s
user 12m51.664s
sys 0m3.056s

Now let's time these for decompression - recompression.
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat en-partial.bz2 | bzip2 > /dev/null )
real 93m51.660s
user 97m35.768s
sys 0m43.056s
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat wd-partial.bz2 | bzip2 > /dev/null )
real 140m51.785s
user 144m15.584s
sys 0m40.860s

Horrible. Let's try another bzip2 implementation that happens to be lying around:
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat en-partial.bz2 | lbzip2 -n 1 > /dev/null )
real 34m59.430s
user 56m28.592s
sys 1m30.448s
ariel@dumpsdata1002:/data/arieltemp$ time ( bzcat wd-partial.bz2 | lbzip2 -n 1 > /dev/null )
real 30m52.658s
user 50m17.488s
sys 1m28.340s

Next up: hack mediawiki to add an lbzip2 stream handler, and time the dumps of a single history file for a given page range with the bz2 stream vs the lbzip2 stream.
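To make the four decompress-and-recompress timings above easier to compare, they can be expressed as end-to-end pipeline throughput over the uncompressed byte counts. This is a rough sketch: `to_seconds` and `throughput` are hypothetical helper names, MB here means 10^6 bytes, and the figures cover the whole `bzcat | compressor` pipeline rather than the compressor alone.

```shell
#!/bin/sh
# to_seconds REAL
# Converts a bash `time`-style value such as 93m51.660s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# throughput UNCOMPRESSED_BYTES REAL
# Prints MB/s (10^6 bytes per second) of uncompressed data pushed through the pipeline.
throughput() {
    secs=$(to_seconds "$2")
    awk -v b="$1" -v s="$secs" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

throughput 30161804631 93m51.660s    # enwiki,   bzip2:       5.4
throughput 30424472866 140m51.785s   # wikidata, bzip2:       3.6
throughput 30161804631 34m59.430s    # enwiki,   lbzip2 -n 1: 14.4
throughput 30424472866 30m52.658s    # wikidata, lbzip2 -n 1: 16.4
```

On these numbers, bzip2 is about a third slower on the wikidata sample than on the enwiki one, while lbzip2 with a single thread handles both at roughly the same rate and is several times faster overall.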
Cc: hoo, ArielGlenn, Nandana, Lahi, Gq86, Darkminds3113, GoranSMilovanovic, Lunewa, QZanden, LawExplorer, Vali.matei, _jensen, Volker_E, gnosygnu, Wikidata-bugs, aude, GWicke, Dinoguy1000, Mbch331, Jay8g