Daryl C. W. O'Shea writes:
> Justin Mason wrote:
> > Daryl C. W. O'Shea writes:
> 
> > The zone's nightly-mc corpus (uploaded corpora) are this big (in KB):
> > 
> >   2       /export/home/bbmass/rawcor/doc
> >   19760   /export/home/bbmass/rawcor/fredt
> >   6764040 /export/home/bbmass/rawcor/jm (mostly spam, since May 2007)
> >   209393  /export/home/bbmass/rawcor/zmi
> > 
> > so that's pretty big.  In terms of disk space usage, that probably
> > wouldn't take much space to cp -al; but it'd take a fair bit of time,
> > esp on the zone, which has serious I/O bottleneck problems.
> > 
> >> As an aside, if bandwidth is free, the whole mass-check will run quite a 
> >> bit faster if you rsync the corpus to each of the slaves.  Of course 
> >> that assumes you've got the disk space and i/o to spare (i/o you may 
> >> already have if /tmp isn't a ramdisk).
> > 
> > yeah, rsyncing about 7GB of corpora, nightly, would definitely be slow ;)
> 
> Not really, it's probably less than 100MB change a day.  My current 
> personal spam corpus is 2.1 GB over the last 60 days.  Rsync'ing it 
> nightly with my 128kbit upload speed doesn't take very long.  If the 
> objective is to get our disk i/o usage down on the zone this would make 
> a serious difference.

I think the changes I made to the "scan stage" take care of that: the
mass-checks limit themselves to just the most recent 50k messages,
which should add up to a good bit less than the full 7GB.  (I haven't
measured it; I've been meaning to modify mass-check to track that.)
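
For a rough sense of the figures quoted above, here's a back-of-the-envelope
sketch in Python.  It's only an illustration: the helper names are made up,
the sizes are copied from the listing above, and the transfer estimate
assumes 1 MB = 10^6 bytes, 1 kbit = 1000 bits, and no rsync compression or
protocol overhead.  The 35 MB/day figure is just Daryl's 2.1 GB spread over
60 days (his personal corpus, not the zone's), and 100 MB/day is his guess
for the zone.

```python
# Sizes in KB, copied from the corpus listing quoted in this thread.
corpus_kb = {"doc": 2, "fredt": 19760, "jm": 6764040, "zmi": 209393}


def total_gib(sizes_kb):
    """Total size in GiB, given a mapping of name -> size in KB."""
    return sum(sizes_kb.values()) / (1024 * 1024)


def transfer_minutes(megabytes, kbit_per_s):
    """Minutes to push `megabytes` over a `kbit_per_s` link.

    Assumption: decimal units, no compression, no protocol overhead.
    """
    bits = megabytes * 1e6 * 8
    return bits / (kbit_per_s * 1000) / 60


if __name__ == "__main__":
    print(f"total corpora: {total_gib(corpus_kb):.2f} GiB")
    # 2.1 GB over 60 days is about 35 MB/day of new mail.
    print(f"35 MB at 128 kbit/s:  {transfer_minutes(35, 128):.0f} min")
    print(f"100 MB at 128 kbit/s: {transfer_minutes(100, 128):.0f} min")
```

These are upper-ish bounds for an uncompressed transfer of the daily delta;
none of it measures the disk I/O on the zone, which is the real question.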

--j.
