I'm exploring various ways of working with the XML data dumps on 
/public/dumps/public/enwiki.  I've got a process which runs through all of the 
enwiki-20210301-pages-articles[123456789]*.xml* files in about 6 hours.  If 
I've done the math right, that's just about 18 GB of data, or 3 GB/h, or a bit 
under 1 MB/s that I'm slurping off NFS.
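
For concreteness, here's a rough sketch of the kind of single-node pass I 
mean, assuming Python with the stock glob/bz2 modules and that the parts are 
the usual .bz2 files under the 20210301 directory (the real job does more 
than count pages):

    import bz2
    import glob

    # Stream each dump part off NFS and count <page> elements, without
    # ever holding a whole file in memory.  Adjust the directory to
    # wherever the 20210301 parts actually live.
    PATTERN = ("/public/dumps/public/enwiki/20210301/"
               "enwiki-20210301-pages-articles[123456789]*.xml*")

    pages = 0
    for path in sorted(glob.glob(PATTERN)):
        with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if "<page>" in line:
                    pages += 1

    print(f"{pages} pages")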

If I were to spin up 8 VPS nodes and run 8 jobs in parallel, in theory I could 
process around 7 MB/s (roughly 55 Mb/s).  Is that realistic?  Or am I just 
going to beat the 
hell out of the poor NFS server, or peg some backbone network link, or hit some 
other rate limiting bottleneck long before I run out of CPU?  Hitting a 
bottleneck doesn't bother me so much as not wanting to trash a shared resource 
by doing something stupid to it.
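
To be concrete about the parallel version: the idea is just to shard the same 
file list across the nodes, e.g. each VPS takes every 8th file by index and 
feeds those paths to the existing per-file job.  The node index/count passed 
on the command line below are only for illustration:

    import glob
    import sys

    # Each of the N nodes runs this with its own index (0..N-1) and gets a
    # disjoint subset of the dump parts, so the aggregate NFS read rate is
    # roughly N times the single-node rate, which is exactly the question.
    PATTERN = ("/public/dumps/public/enwiki/20210301/"
               "enwiki-20210301-pages-articles[123456789]*.xml*")

    node_index = int(sys.argv[1])   # 0 .. node_count-1
    node_count = int(sys.argv[2])   # e.g. 8

    for i, path in enumerate(sorted(glob.glob(PATTERN))):
        if i % node_count == node_index:
            print(path)   # pipe these into the existing per-file job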

Putting it another way, would trying this be a bad idea?

