Hah, do not be too hard on yourself. There are like 4 people on the planet that have really dug into these scripts so I appreciate you working on it.

On Sat, May 25, 2019, 16:09 Paul Stead <paul.st...@gmail.com> wrote:

> I'm chasing my tail here....
>
> OF COURSE files are "disappearing" from the corpus directory, they get
> updated with today's/this week's content, they don't get renamed/deleted they
> get changed to logs from today - I've been looking in the wrong place.
>
> Looks like corpus-hourly shouldn't be working from the corpus directory
> when re-calculating the class files for previous days but I clearly need to
> have a break and relax
>
>
> Paul
>
> On Sat, 25 May 2019 at 18:05, Paul Stead <paul.st...@gmail.com> wrote:
>
> > The 14:05 run has finished, here's the before and after in terms of
> > output on ruleqa (attached)
> >
> > I saw files disappear in the /usr/local/spamassassin/automc/rsync/corpus
> > from 18 May but still can't find the trigger that is removing these
> > files.
> >
> > Will come back to this later if no one has any ideas
> >
> > On Sat, 25 May 2019 at 17:54, Paul Stead <paul.st...@gmail.com> wrote:
> >
> >> TLDR;
> >> Any pointers on what might be clearing up the old or "invalid" files in
> >> /usr/local/spamassassin/automc/rsync/corpus?
> >>
> >> ----
> >>
> >> I'm going on the opinion that some function is cleaning up the
> >>
> >> /usr/local/spamassassin/automc/rsync/corpus
> >>
> >> directory underneath the corpus-hourly script - though I've so far been
> >> unable to distinguish what. There seems to be a lot of superfluous
> >> scripts hanging around in the svn directories.
> >>
> >> As far as I can tell it isn't the corpus-hourly cron, nor the
> >> /usr/local/bin/checkMasscheckContribs.sh script.
> >>
> >> During my investigations I've noticed that the hourly does seem to take
> >> more than an hour to run, thus two processes can run at the same time
> >>
> >> automc 7749 13.9 0.1 40632 19040 ? RN 15:05 3:27
> >> /usr/bin/perl -w
> >> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
> >> --dir=/usr/local/spamassassin/automc/rsync/corpus
> >> automc 8708 99.7 0.8 164560 145008 ? RN 15:09 20:10
> >> /usr/bin/perl -w ./hit-frequencies -TxpaP -o
> >> /usr/local/spamassassin/automc/tmp/spam.log.25383
> >> /usr/local/spamassassin/automc/tmp/ham.log.25383
> >> automc 25383 9.3 0.1 38880 17480 ? SN 14:05 7:56
> >> /usr/bin/perl -w
> >> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
> >> --dir=/usr/local/spamassassin/automc/rsync/corpus
> >>
> >> I'm not 100% sure that this is causing a problem, I see some protection
> >> against this for the running files, but I'm not sure about the resulting
> >> class files that are output.
> >>
> >> Paul
> >>
> >> On Sat, 25 May 2019 at 13:00, Paul Stead <paul.st...@gmail.com> wrote:
> >>
> >>> I'm investigating the problem with disappearing corpus - see the bug
> >>> report here -
> >>>
> >>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
> >>>
> >>> Whilst that is an issue, I've realised this might not be everything
> >>> involved.
> >>>
> >>> I'm on the system but I can't find the process that is "cleaning" up
> >>> the directory at
> >>>
> >>> /usr/local/spamassassin/automc/rsync/corpus
> >>>
> >>> At first I thought it was the hourly script but I don't think this is
> >>> true.
> >>>
> >>> I've checked through cron.d run scripts and just can't seem to find it -
> >>> I've a feeling something is deleting logs from the corpus directory
> >>> prematurely, which then stops it being captured during the hourly when
> >>> it should - it's a case of < 1 hour.
> >>>
> >>> It's possible this script has code to figure out if it's running at UTC
> >>> or needs an offset similar to the one in the bug.
> >>>
> >>> It seems that the script is aware if it is running a nightly or weekly
> >>> and doesn't run the nightly on a Saturday.
> >>>
> >>> Hope you might have an idea of which script I'm referring to?
> >>>
> >>> I've "fixed" my problem by moving my corpus check to make sure it
> >>> completes after 10:00 UTC - this will likely fix everyone's but I'd like
> >>> to make sure that when we say mass check after 09:00 UTC we mean it.
> >>>
> >>> Paul