Daryl C. W. O'Shea writes:
> Justin Mason wrote:
> > Daryl C. W. O'Shea writes:
> > > The zone's nightly-mc corpus (uploaded corpora) are this big (in KB):
> >
> >       2 /export/home/bbmass/rawcor/doc
> >   19760 /export/home/bbmass/rawcor/fredt
> > 6764040 /export/home/bbmass/rawcor/jm (mostly spam, since May 2007)
> >  209393 /export/home/bbmass/rawcor/zmi
> >
> > so that's pretty big. In terms of disk space usage, that probably
> > wouldn't take much space to cp -al; but it'd take a fair bit of time,
> > esp on the zone, which has serious I/O bottleneck problems.
> >
> >> As an aside, if bandwidth is free, the whole mass-check will run quite a
> >> bit faster if you rsync the corpus to each of the slaves. Of course
> >> that assumes you've got the disk space and i/o to spare (i/o you may
> >> already have if /tmp isn't a ramdisk).
> >
> > yeah, rsyncing about 7GB of corpora, nightly, would definitely be slow ;)
>
> Not really, it's probably less than 100MB change a day. My current
> personal spam corpus is 2.1 GB over the last 60 days. Rsync'ing it
> nightly with my 128kbit upload speed doesn't take very long. If the
> objective is to get our disk i/o usage down on the zone this would make
> a serious difference.

I think the changes I made to the "scan stage" take care of that... the
mass-checks limit themselves to just the most recent 50k messages, which
should be a good bit less than the full 7GB. (I haven't measured it.
I've been meaning to modify mass-check to track that...)

--j.
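
For what it's worth, the "haven't measured it" part could be checked from outside mass-check with something like the rough sketch below. It is not what mass-check does internally: it assumes one message per file and uses file mtime as a stand-in for message age, and the function name `recent_messages_size` is made up for illustration.

```python
import heapq
import os


def recent_messages_size(corpus_dir, limit=50_000):
    """Return (count, total_bytes) for the `limit` most recently modified files.

    Assumptions (not taken from mass-check itself): each file under
    corpus_dir holds exactly one message, and mtime approximates how
    recent the message is.
    """
    def entries():
        for root, _dirs, files in os.walk(corpus_dir):
            for name in files:
                path = os.path.join(root, name)
                try:
                    st = os.stat(path)
                except OSError:
                    # File vanished or is unreadable; skip it.
                    continue
                yield st.st_mtime, st.st_size

    # nlargest keeps only `limit` entries in memory at any time,
    # ordered by mtime (newest first).
    newest = heapq.nlargest(limit, entries())
    return len(newest), sum(size for _mtime, size in newest)
```

Running it as `recent_messages_size("/export/home/bbmass/rawcor/jm")` and comparing the byte total against the full directory size would show how much the 50k-message limit actually trims.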
