Re: Disappearing corpus
Thanks for working on this. I'm +1 without a technical review.

On 5/30/2019 3:27 PM, Paul Stead wrote:
> I'm at the root of the issue and ready to commit changes around this:
>
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>
> The changes will not affect how ruleqa works or how submissions should
> be done - please continue to submit *after 0900 UTC*
>
> Any feedback appreciated, will be applying after 1st June unless
> feedback received.
>
> Paul

--
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171
Re: Disappearing corpus
I'm at the root of the issue and ready to commit changes around this:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715

The changes will not affect how ruleqa works or how submissions should be done - please continue to submit *after 0900 UTC*.

Any feedback is appreciated; I will apply the changes after 1st June unless I receive feedback before then.

Paul
Re: Disappearing corpus
I'm chasing my tail here.

OF COURSE files are "disappearing" from the corpus directory: they get updated with today's/this week's content. They don't get renamed or deleted, they get replaced with logs from today - I've been looking in the wrong place.

It looks like corpus-hourly shouldn't be working from the corpus directory when re-calculating the class files for previous days, but I clearly need to have a break and relax.

Paul
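To illustrate the point about logs being overwritten in place, here is a minimal Python sketch, not taken from corpus-hourly itself. It assumes a run for a given day selects logs by modification date; the function name `logs_for_day` and the file names are hypothetical. A log that has since been rewritten with today's content simply no longer matches the earlier day.

```python
import os
import tempfile
from datetime import datetime, timezone


def logs_for_day(directory, day):
    """Return names of files whose modification time (in UTC) falls on `day`."""
    keep = []
    for name in sorted(os.listdir(directory)):
        mtime = os.path.getmtime(os.path.join(directory, name))
        if datetime.fromtimestamp(mtime, tz=timezone.utc).date() == day:
            keep.append(name)
    return keep


# Demo with hypothetical file names in a throwaway directory.
with tempfile.TemporaryDirectory() as corpus:
    stamps = {
        "ham-old.log": datetime(2019, 5, 18, 9, 30, tzinfo=timezone.utc),
        "ham-new.log": datetime(2019, 5, 25, 9, 30, tzinfo=timezone.utc),
    }
    for name, stamp in stamps.items():
        path = os.path.join(corpus, name)
        open(path, "w").close()
        # Set access and modification times to the chosen timestamp.
        os.utime(path, (stamp.timestamp(), stamp.timestamp()))
    selected = logs_for_day(corpus, datetime(2019, 5, 18).date())

print(selected)
```

If "ham-old.log" were later overwritten by a new upload, its modification time would move forward and a re-run for 18 May would find nothing, which would look like a disappearing corpus even though nothing was deleted.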
Re: Disappearing corpus
The 14:05 run has finished; here's the before and after in terms of output on ruleqa (attached).

I saw files disappear in /usr/local/spamassassin/automc/rsync/corpus from 18 May but still can't find the trigger that is removing these files.

I will come back to this later if no one has any ideas.
Re: Disappearing corpus
TL;DR: Any pointers on what might be clearing up the old or "invalid" files in /usr/local/spamassassin/automc/rsync/corpus?

I'm going on the opinion that some function is cleaning up the

/usr/local/spamassassin/automc/rsync/corpus

directory underneath the corpus-hourly script - though I've so far been unable to work out what. There seem to be a lot of superfluous scripts hanging around in the svn directories.

As far as I can tell it isn't the corpus-hourly cron, nor the /usr/local/bin/checkMasscheckContribs.sh script.

During my investigations I've noticed that the hourly does seem to take more than an hour to run, thus two processes can run at the same time:

    automc    7749 13.9  0.1  40632  19040 ?  RN  15:05   3:27 /usr/bin/perl -w /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly --dir=/usr/local/spamassassin/automc/rsync/corpus
    automc    8708 99.7  0.8 164560 145008 ?  RN  15:09  20:10 /usr/bin/perl -w ./hit-frequencies -TxpaP -o /usr/local/spamassassin/automc/tmp/spam.log.25383 /usr/local/spamassassin/automc/tmp/ham.log.25383
    automc   25383  9.3  0.1  38880  17480 ?  SN  14:05   7:56 /usr/bin/perl -w /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly --dir=/usr/local/spamassassin/automc/rsync/corpus

I'm not 100% sure that this is causing a problem. I see some protection against this for the running files, but I'm not sure about the resulting class files that are output.

Paul
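On the overlapping hourly runs: one common guard is a non-blocking exclusive lock taken at start-up, so a second run exits instead of working alongside the first. The sketch below is Python rather than the Perl the real script uses, and it says nothing about what protection corpus-hourly already has. `try_lock` and `LOCK_PATH` are hypothetical names made up for illustration.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock, or None if it is already held."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle


# Hypothetical lock location, not a path used by the real scripts.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "corpus-hourly-example.lock")

first = try_lock(LOCK_PATH)   # stands in for the 14:05 run, which takes the lock
second = try_lock(LOCK_PATH)  # stands in for the 15:05 run, which finds it held
print(first is not None, second is None)

if first is not None:
    first.close()  # closing the file releases the lock for the next run
```

The lock is released automatically when the holding process exits, so a crashed run would not block later ones. This relies on `flock`, so it assumes a Unix-like system.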
Disappearing corpus
I'm investigating the problem with disappearing corpus - see the bug report here -

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715

Whilst that is an issue, I've realised this might not be everything involved.

I'm on the system but I can't find the process that is "cleaning" up the directory at

/usr/local/spamassassin/automc/rsync/corpus

At first I thought it was the hourly script but I don't think this is true.

I've checked through the cron.d run scripts and just can't seem to find it - I've a feeling something is deleting logs from the corpus directory prematurely, which then stops them being captured during the hourly when they should be - it's a case of < 1 hour.

It's possible this script has code to figure out if it's running at UTC or needs an offset similar to the one in the bug.

It seems that the script is aware of whether it is running a nightly or a weekly, and doesn't run the nightly on a Saturday.

Hope you might have an idea of which script I'm referring to?

I've "fixed" my problem by moving my corpus check to make sure it completes after 10:00 UTC - this will likely fix everyone's, but I'd like to make sure that when we say mass check after 09:00 UTC we mean it.

Paul
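As a small illustration of the "after 09:00 UTC" rule and why a local-time offset matters, here is a Python sketch. It is not the logic from the bug report or from any SpamAssassin script; `is_after_cutoff` and `CUTOFF_UTC` are hypothetical names, and treating exactly 09:00 as acceptable is an assumption.

```python
from datetime import datetime, time, timedelta, timezone

CUTOFF_UTC = time(9, 0)  # the "after 0900 UTC" rule mentioned in this thread


def is_after_cutoff(stamp):
    """True if a timezone-aware timestamp falls at or after 09:00 UTC on its day."""
    if stamp.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Convert to UTC first so the local offset cannot skew the comparison.
    return stamp.astimezone(timezone.utc).time() >= CUTOFF_UTC


uk_summer = timezone(timedelta(hours=1))  # UTC+1
early = is_after_cutoff(datetime(2019, 5, 25, 9, 30, tzinfo=uk_summer))   # 08:30 UTC
late = is_after_cutoff(datetime(2019, 5, 25, 10, 30, tzinfo=uk_summer))   # 09:30 UTC
print(early, late)
```

A submission at 09:30 local time on a UTC+1 host is still before the cut-off, which is the kind of under-an-hour discrepancy that comparing in local time would hide.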
Re: Disappearing Corpus
Paul Stead wrote on 16.5.2019 15:25:
> Also mentioned yesterday again was the jarif corpus submitting net-only rules
> during the nightly masscheck?
>
> Paul

Someone claimed that, yes, but I have verified my dailies and they do not post -net.

Regards, jarif
--
ja...@iki.fi
Re: Disappearing Corpus
This seems to have happened again today (it does seem to have been happening for a little while).

The first attachment is from around 0600 GMT and the second from just a few minutes ago.

For ease, the first screenshot has:

    axb-coi-bulk axb-generic axb-ham-misc darxus ena-week0 ena-week1
    ena-week2 ena-week3 ena-week4 giovanni-ham giovanni-spam
    giovanni-spammy grenier hege jarif jbrooks pds sihde spamsponge
    thendrikx

The second has:

    axb-coi-bulk axb-generic axb-ham-misc ena-week2 ena-week4
    giovanni-ham giovanni-spam giovanni-spammy jarif jbrooks llanga
    mmiroslaw-mails-ham mmiroslaw-mails-spam sihde

Also mentioned yesterday again was the jarif corpus submitting net-only rules during the nightly masscheck?

Paul
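For anyone comparing the two lists by eye, a quick set difference makes the change explicit. This is just a Python sketch over the corpus names quoted in the message above; the variable names are arbitrary.

```python
# Corpus names as listed for the first (around 0600 GMT) and second screenshots.
before = set("""
axb-coi-bulk axb-generic axb-ham-misc darxus ena-week0 ena-week1 ena-week2
ena-week3 ena-week4 giovanni-ham giovanni-spam giovanni-spammy grenier hege
jarif jbrooks pds sihde spamsponge thendrikx
""".split())

after = set("""
axb-coi-bulk axb-generic axb-ham-misc ena-week2 ena-week4 giovanni-ham
giovanni-spam giovanni-spammy jarif jbrooks llanga mmiroslaw-mails-ham
mmiroslaw-mails-spam sihde
""".split())

missing = sorted(before - after)  # present earlier, gone later
added = sorted(after - before)    # only present later

print("missing:", missing)
print("added:", added)
```

On these two lists, nine corpora drop out (darxus, ena-week0, ena-week1, ena-week3, grenier, hege, pds, spamsponge, thendrikx) and three appear (llanga, mmiroslaw-mails-ham, mmiroslaw-mails-spam).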
Fwd: Disappearing Corpus
FYI

Forwarded Message
Subject: Disappearing Corpus
Date: Wed, 15 May 2019 13:42:39 +0100
From: Paul Stead
Reply-To: rul...@spamassassin.apache.org
To: rul...@spamassassin.apache.org

Hiya,

Over the last few days I thought I'd noticed nightlies going missing from the QA website.

Attached are two images - the first taken last night at around 21:44 - notice we have quite a few corpora in the list.

The second attachment shows the same date-rev as of a few minutes ago - many corpora have gone missing - including my pds one - can anyone think what might be going on here?

I've a feeling this might be causing some "tromboning" of some of the rules - they are popping in and out of the active.list daily.

Paul