hoo added a comment.

I ran the above-mentioned tool (gz-sort) on a slow-ish VM over the latest truthy dump:

$ time ~/gz-sort/gz-sort -u -S 100M wikidata-20170927-truthy-BETA.nt.gz ~/wikidata-20170927-truthy-BETA.nt.sort.gz
 line count: 1924967162
 presort: 219.15 minutes
 merge 396083: 186.55 minutes
 merge 792167: 183.47 minutes
 merge 1584335: 182.98 minutes
 merge 3166064: 183.37 minutes
 merge 6332128: 183.77 minutes
 merge 12664257: 183.28 minutes
 merge 25328515: 183.90 minutes
 merge 50657030: 185.42 minutes
 merge 101314061: 217.00 minutes
 merge 192496716: 219.67 minutes
 merge 384993432: 218.62 minutes
 merge 641655720: 217.23 minutes
 merge 962483581: 224.97 minutes
removed 303419 non-unique lines

real    2789m32.668s
user    2598m34.233s
sys     18m21.880s
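
As a sanity check, the per-pass timings can be summed and compared with the wall-clock time reported by `time`. This is only a rough sketch: the timings are copied by hand from the output above, and the helper name `total_minutes` is made up for illustration.

```python
import re

# Per-pass timings copied by hand from the gz-sort output above (minutes).
LOG = """\
presort: 219.15 minutes
merge 396083: 186.55 minutes
merge 792167: 183.47 minutes
merge 1584335: 182.98 minutes
merge 3166064: 183.37 minutes
merge 6332128: 183.77 minutes
merge 12664257: 183.28 minutes
merge 25328515: 183.90 minutes
merge 50657030: 185.42 minutes
merge 101314061: 217.00 minutes
merge 192496716: 219.67 minutes
merge 384993432: 218.62 minutes
merge 641655720: 217.23 minutes
merge 962483581: 224.97 minutes
"""


def total_minutes(log: str) -> float:
    """Sum every '<label>: <number> minutes' entry in the log text."""
    return sum(float(m) for m in re.findall(r":\s*([\d.]+) minutes", log))


total = total_minutes(LOG)
print(f"{total:.2f} minutes = {total / 60:.1f} hours")
```

The sum is about 2789.4 minutes (roughly 46.5 hours), within about ten seconds of the `real` time, so the presort and the 13 merge passes account for essentially the whole run.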

The resulting gzipped file was about 4% larger than the input, but that is probably because it was not compressed with -9. Sadly, I accidentally deleted the sorted dump, so I can't check how large it would be with gzip -9 or other compression formats… but I kind of doubt that's worth it.
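
If someone wants to test the compression-level guess without re-running the full sort, one option would be to compress a sample of lines at several gzip levels and compare the sizes. The following is a minimal sketch using Python's standard `gzip` module. The function name `gzip_sizes` and the sample lines are made up for illustration (they are not taken from the dump), and a small sample will not necessarily behave like the full file.

```python
import gzip
from typing import Dict, Iterable, Sequence


def gzip_sizes(lines: Iterable[bytes], levels: Sequence[int] = (1, 6, 9)) -> Dict[int, int]:
    """Return the gzip-compressed size in bytes of the joined lines, per level."""
    data = b"".join(lines)
    return {level: len(gzip.compress(data, compresslevel=level)) for level in levels}


# Made-up N-Triples-like lines, for illustration only (not real dump content).
sample = [
    f"<http://example.org/entity/Q{i}> <http://example.org/prop/P31> "
    f"<http://example.org/entity/Q{i % 7}> .\n".encode("utf-8")
    for i in range(10_000)
]

for level, size in gzip_sizes(sample).items():
    print(f"gzip -{level}: {size} bytes")
```

To try this on real data, the sample could be read from the dump itself, for example with `gzip.open(path, "rb")` and `itertools.islice` to take the first few million lines, once from the original file and once from a sorted copy.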


TASK DETAIL
https://phabricator.wikimedia.org/T177533
