Re: Disappearing corpus

2019-05-30 Thread Kevin A. McGrail
Thanks for working on this.  I'm +1 without a technical review.

On 5/30/2019 3:27 PM, Paul Stead wrote:
> I'm at the root of the issue and ready to commit changes around this:
>
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>
> The changes will not affect how ruleqa works or how submissions should
> be done - please continue to submit *after 0900 UTC*
>
> Any feedback appreciated, will be applying after 1st June unless
> feedback received.
>
> Paul


-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171



Re: Disappearing corpus

2019-05-30 Thread Paul Stead
I'm at the root of the issue and ready to commit changes around this:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715

The changes will not affect how ruleqa works or how submissions should be
done - please continue to submit *after 0900 UTC*

Any feedback appreciated, will be applying after 1st June unless feedback
received.

Paul


Re: Disappearing corpus

2019-05-25 Thread Paul Stead
I'm chasing my tail here

OF COURSE files are "disappearing" from the corpus directory: they get
updated with today's/this week's content. They don't get renamed or deleted;
they get overwritten with logs from today - I've been looking in the wrong place.
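A minimal sketch of how one might tell an overwritten log from a deleted one, assuming two snapshots of the corpus directory taken some time apart. It is written in Python purely for illustration (corpus-hourly itself is Perl), and the function names `snapshot` and `classify_changes` are hypothetical, not part of any SpamAssassin script:

```python
import os


def snapshot(directory):
    """Map each regular file name in `directory` to its modification time."""
    return {
        entry.name: entry.stat().st_mtime
        for entry in os.scandir(directory)
        if entry.is_file()
    }


def classify_changes(before, after):
    """Compare two snapshots (name -> mtime) and report what changed."""
    return {
        # Present earlier, absent now: genuinely deleted or renamed.
        "removed": sorted(name for name in before if name not in after),
        # Absent earlier, present now: newly uploaded.
        "added": sorted(name for name in after if name not in before),
        # Same name, different mtime: overwritten in place.
        "rewritten": sorted(
            name for name in before
            if name in after and after[name] != before[name]
        ),
    }
```

A log that lands under "rewritten" rather than "removed" would match the behaviour described above: same file name, newer content.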

Looks like corpus-hourly shouldn't be working from the corpus directory
when re-calculating the class files for previous days, but I clearly need to
have a break and relax.


Paul

On Sat, 25 May 2019 at 18:05, Paul Stead  wrote:

> The 14:05 run has finished, here's the before and after in terms of output
> on ruleqa (attached)
>
> I saw files disappear in the /usr/local/spamassassin/automc/rsync/corpus
> from 18 May but still can't find the trigger that is removing these files.
>
> Will come back to this later if no one has any ideas
>
> On Sat, 25 May 2019 at 17:54, Paul Stead  wrote:
>
>> TLDR;
>> Any pointers on what might be clearing up the old or "invalid" files in
>> /usr/local/spamassassin/automc/rsync/corpus?
>>
>> 
>>
>> I'm working on the assumption that some function is cleaning up the
>>
>> /usr/local/spamassassin/automc/rsync/corpus
>>
>> directory underneath the corpus-hourly script - though I've so far been
>> unable to determine what. There seem to be a lot of superfluous scripts
>> hanging around in the svn directories.
>>
>> As far as I can tell it isn't the corpus-hourly cron, nor the
>> /usr/local/bin/checkMasscheckContribs.sh script.
>>
>> During my investigations I've noticed that the hourly does seem to take
>> more than an hour to run, thus two processes can run at the same time
>>
>> automc    7749 13.9  0.1  40632  19040 ?        RN   15:05   3:27
>>     /usr/bin/perl -w
>>     /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
>>     --dir=/usr/local/spamassassin/automc/rsync/corpus
>> automc    8708 99.7  0.8 164560 145008 ?        RN   15:09  20:10
>>     /usr/bin/perl -w ./hit-frequencies -TxpaP -o
>>     /usr/local/spamassassin/automc/tmp/spam.log.25383
>>     /usr/local/spamassassin/automc/tmp/ham.log.25383
>> automc   25383  9.3  0.1  38880  17480 ?        SN   14:05   7:56
>>     /usr/bin/perl -w
>>     /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
>>     --dir=/usr/local/spamassassin/automc/rsync/corpus
>>
>> I'm not 100% sure that this is causing a problem. I see some protection
>> against this for the running files, but I'm not sure about the resulting
>> class files that are output.
>>
>> Paul
>>
>> On Sat, 25 May 2019 at 13:00, Paul Stead  wrote:
>>
>>> I'm investigating the problem with the disappearing corpus - see the bug
>>> report here -
>>>
>>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>>>
>>> Whilst that is an issue, I've realised this might not be everything
>>> involved.
>>>
>>> I'm on the system but I can't find the process that is "cleaning" up the
>>> directory at
>>>
>>> /usr/local/spamassassin/automc/rsync/corpus
>>>
>>> At first I thought it was the hourly script but I don't think this is
>>> true.
>>>
>>> I've checked through cron.d run scripts and just can't seem to find it -
>>> I've a feeling something is deleting logs from the corpus directory
>>> prematurely, which then stops it being captured during the hourly when it
>>> should - it's a case of < 1 hour.
>>>
>>> It's possible this script has code to figure out if it's running at UTC
>>> or needs an offset similar to the one in the bug.
>>>
>>> It seems that the script is aware if it is running a nightly or weekly
>>> and doesn't run the nightly on a Saturday.
>>>
>>> Hope you might have an idea of which script I'm referring to?
>>>
>>> I've "fixed" my problem by moving my corpus check to make sure it
>>> completes after 10:00 UTC - this will likely fix everyone's, but I'd like to
>>> make sure that when we say mass check after 09:00 UTC we mean it.
>>>
>>> Paul
>>>
>>


Re: Disappearing corpus

2019-05-25 Thread Paul Stead
The 14:05 run has finished, here's the before and after in terms of output
on ruleqa (attached)

I saw files disappear in the /usr/local/spamassassin/automc/rsync/corpus
from 18 May but still can't find the trigger that is removing these files.

Will come back to this later if no one has any ideas

On Sat, 25 May 2019 at 17:54, Paul Stead  wrote:

> TLDR;
> Any pointers on what might be clearing up the old or "invalid" files in
> /usr/local/spamassassin/automc/rsync/corpus?
>
> 
>
> I'm working on the assumption that some function is cleaning up the
>
> /usr/local/spamassassin/automc/rsync/corpus
>
> directory underneath the corpus-hourly script - though I've so far been
> unable to determine what. There seem to be a lot of superfluous scripts
> hanging around in the svn directories.
>
> As far as I can tell it isn't the corpus-hourly cron, nor the
> /usr/local/bin/checkMasscheckContribs.sh script.
>
> During my investigations I've noticed that the hourly does seem to take
> more than an hour to run, thus two processes can run at the same time
>
> automc    7749 13.9  0.1  40632  19040 ?        RN   15:05   3:27
>     /usr/bin/perl -w
>     /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
>     --dir=/usr/local/spamassassin/automc/rsync/corpus
> automc    8708 99.7  0.8 164560 145008 ?        RN   15:09  20:10
>     /usr/bin/perl -w ./hit-frequencies -TxpaP -o
>     /usr/local/spamassassin/automc/tmp/spam.log.25383
>     /usr/local/spamassassin/automc/tmp/ham.log.25383
> automc   25383  9.3  0.1  38880  17480 ?        SN   14:05   7:56
>     /usr/bin/perl -w
>     /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
>     --dir=/usr/local/spamassassin/automc/rsync/corpus
>
> I'm not 100% sure that this is causing a problem. I see some protection
> against this for the running files, but I'm not sure about the resulting
> class files that are output.
>
> Paul
>
> On Sat, 25 May 2019 at 13:00, Paul Stead  wrote:
>
>> I'm investigating the problem with the disappearing corpus - see the bug
>> report here -
>>
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>>
>> Whilst that is an issue, I've realised this might not be everything
>> involved.
>>
>> I'm on the system but I can't find the process that is "cleaning" up the
>> directory at
>>
>> /usr/local/spamassassin/automc/rsync/corpus
>>
>> At first I thought it was the hourly script but I don't think this is
>> true.
>>
>> I've checked through cron.d run scripts and just can't seem to find it -
>> I've a feeling something is deleting logs from the corpus directory
>> prematurely, which then stops it being captured during the hourly when it
>> should - it's a case of < 1 hour.
>>
>> It's possible this script has code to figure out if it's running at UTC
>> or needs an offset similar to the one in the bug.
>>
>> It seems that the script is aware if it is running a nightly or weekly
>> and doesn't run the nightly on a Saturday.
>>
>> Hope you might have an idea of which script I'm referring to?
>>
>> I've "fixed" my problem by moving my corpus check to make sure it
>> completes after 10:00 UTC - this will likely fix everyone's, but I'd like to
>> make sure that when we say mass check after 09:00 UTC we mean it.
>>
>> Paul
>>
>


Re: Disappearing corpus

2019-05-25 Thread Paul Stead
TLDR;
Any pointers on what might be clearing up the old or "invalid" files in
/usr/local/spamassassin/automc/rsync/corpus?



I'm working on the assumption that some function is cleaning up the

/usr/local/spamassassin/automc/rsync/corpus

directory underneath the corpus-hourly script - though I've so far been
unable to determine what. There seem to be a lot of superfluous scripts
hanging around in the svn directories.

As far as I can tell it isn't the corpus-hourly cron, nor the
/usr/local/bin/checkMasscheckContribs.sh script.

During my investigations I've noticed that the hourly does seem to take
more than an hour to run, thus two processes can run at the same time

automc    7749 13.9  0.1  40632  19040 ?        RN   15:05   3:27
    /usr/bin/perl -w
    /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
    --dir=/usr/local/spamassassin/automc/rsync/corpus
automc    8708 99.7  0.8 164560 145008 ?        RN   15:09  20:10
    /usr/bin/perl -w ./hit-frequencies -TxpaP -o
    /usr/local/spamassassin/automc/tmp/spam.log.25383
    /usr/local/spamassassin/automc/tmp/ham.log.25383
automc   25383  9.3  0.1  38880  17480 ?        SN   14:05   7:56
    /usr/bin/perl -w
    /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
    --dir=/usr/local/spamassassin/automc/rsync/corpus

I'm not 100% sure that this is causing a problem. I see some protection against
this for the running files, but I'm not sure about the resulting class
files that are output.
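One common way to keep overlapping cron runs apart is a non-blocking exclusive lock taken at start-up. The sketch below is in Python for illustration only; it is not how corpus-hourly protects itself (that script is Perl, and its existing protection is not shown in this thread). The helper name `try_lock` and any lock file path passed to it are hypothetical, and `fcntl.flock` is POSIX-only:

```python
import fcntl


def try_lock(path):
    """Return an open file holding an exclusive lock on `path`,
    or None if some other holder already has the lock."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    # The lock is held for as long as this handle stays open.
    return handle
```

A run that gets None back would exit straight away, so a second hourly job never starts writing class files while the previous one is still busy.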

Paul

On Sat, 25 May 2019 at 13:00, Paul Stead  wrote:

> I'm investigating the problem with the disappearing corpus - see the bug
> report here -
>
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>
> Whilst that is an issue, I've realised this might not be everything
> involved.
>
> I'm on the system but I can't find the process that is "cleaning" up the
> directory at
>
> /usr/local/spamassassin/automc/rsync/corpus
>
> At first I thought it was the hourly script but I don't think this is true.
>
> I've checked through cron.d run scripts and just can't seem to find it -
> I've a feeling something is deleting logs from the corpus directory
> prematurely, which then stops it being captured during the hourly when it
> should - it's a case of < 1 hour.
>
> It's possible this script has code to figure out if it's running at UTC or
> needs an offset similar to the one in the bug.
>
> It seems that the script is aware if it is running a nightly or weekly and
> doesn't run the nightly on a Saturday.
>
> Hope you might have an idea of which script I'm referring to?
>
> I've "fixed" my problem by moving my corpus check to make sure it
> completes after 10:00 UTC - this will likely fix everyone's, but I'd like to
> make sure that when we say mass check after 09:00 UTC we mean it.
>
> Paul
>


Disappearing corpus

2019-05-25 Thread Paul Stead
I'm investigating the problem with the disappearing corpus - see the bug
report here -

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715

Whilst that is an issue, I've realised this might not be everything
involved.

I'm on the system but I can't find the process that is "cleaning" up the
directory at

/usr/local/spamassassin/automc/rsync/corpus

At first I thought it was the hourly script but I don't think this is true.

I've checked through cron.d run scripts and just can't seem to find it -
I've a feeling something is deleting logs from the corpus directory
prematurely, which then stops it being captured during the hourly when it
should - it's a case of < 1 hour.

It's possible this script has code to figure out if it's running at UTC or
needs an offset similar to the one in the bug.

It seems that the script is aware if it is running a nightly or weekly and
doesn't run the nightly on a Saturday.

Hope you might have an idea of which script I'm referring to?

I've "fixed" my problem by moving my corpus check to make sure it completes
after 10:00 UTC - this will likely fix everyone's, but I'd like to make sure
that when we say mass check after 09:00 UTC we mean it.
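As a rough illustration of the "after 09:00 UTC" rule, a submitter-side check could convert the local time to UTC before comparing, so that a local offset (BST, for example) cannot shift the cutoff by an hour. This is a sketch in Python under that assumption only; the names `CUTOFF_UTC` and `is_after_cutoff` are hypothetical and do not come from the ruleqa scripts or from the bug report:

```python
from datetime import datetime, time, timedelta, timezone

CUTOFF_UTC = time(9, 0)  # the 09:00 UTC submission cutoff mentioned above


def is_after_cutoff(now):
    """True if the timezone-aware datetime `now` is at or past 09:00 UTC
    on its own UTC day."""
    if now.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Normalise to UTC first; comparing local wall-clock time would be
    # wrong whenever the local zone is not UTC.
    return now.astimezone(timezone.utc).time() >= CUTOFF_UTC


# 09:30 in a UTC+1 zone is still only 08:30 UTC.
bst = timezone(timedelta(hours=1))
print(is_after_cutoff(datetime(2019, 5, 25, 9, 30, tzinfo=bst)))
```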

Paul


Re: Disappearing Corpus

2019-05-20 Thread Jari Fredriksson
Paul Stead wrote on 16.5.2019 15:25:

> Also mentioned yesterday again was the jarif corpus submitting net-only rules 
> during the nightly masscheck? 
> 
> Paul

Someone claimed that yes, but I have verified my dailies and they do not
post -net. 

Regards, jarif

-- 
ja...@iki.fi

Re: Disappearing Corpus

2019-05-16 Thread Paul Stead
This seems to have happened again today (it does seem to have been
happening for a little while).

First attachment from around 0600 GMT and second just a few minutes ago.

For ease, the first screenshot has:

axb-coi-bulk axb-generic axb-ham-misc darxus ena-week0 ena-week1 ena-week2
ena-week3 ena-week4 giovanni-ham giovanni-spam giovanni-spammy grenier hege
jarif jbrooks pds sihde spamsponge thendrikx

the second is

axb-coi-bulk axb-generic axb-ham-misc ena-week2 ena-week4 giovanni-ham
giovanni-spam giovanni-spammy jarif jbrooks llanga mmiroslaw-mails-ham
mmiroslaw-mails-spam sihde
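For reference, the difference between the two lists above can be worked out with a plain set comparison. This is a small Python sketch using only the corpus names quoted in this message; the function name `compare_corpora` is made up for the example:

```python
first = """
axb-coi-bulk axb-generic axb-ham-misc darxus ena-week0 ena-week1 ena-week2
ena-week3 ena-week4 giovanni-ham giovanni-spam giovanni-spammy grenier hege
jarif jbrooks pds sihde spamsponge thendrikx
""".split()

second = """
axb-coi-bulk axb-generic axb-ham-misc ena-week2 ena-week4 giovanni-ham
giovanni-spam giovanni-spammy jarif jbrooks llanga mmiroslaw-mails-ham
mmiroslaw-mails-spam sihde
""".split()


def compare_corpora(before, after):
    """Return (missing, new): names only in `before`, names only in `after`."""
    missing = sorted(set(before) - set(after))
    new = sorted(set(after) - set(before))
    return missing, new


missing, new = compare_corpora(first, second)
print("missing:", " ".join(missing))
print("new:", " ".join(new))
# missing: darxus ena-week0 ena-week1 ena-week3 grenier hege pds spamsponge thendrikx
# new: llanga mmiroslaw-mails-ham mmiroslaw-mails-spam
```

So nine corpora dropped out between the two screenshots and three appeared.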


Also mentioned yesterday again was the jarif corpus submitting net-only
rules during the nightly masscheck?

Paul


On Wed, 15 May 2019 at 13:42, Paul Stead  wrote:

> Hiya,
>
> Over the last few days I thought I'd noticed nightlies going missing from
> the QA website.
>
> Attached are two images - the first taken last night at around 21:44 -
> notice we have quite a few corpora in the list.
>
> The second attachment shows the same date-rev as of a few minutes ago -
> many corpora have gone missing - including my pds one - can anyone think
> what might be going on here?
>
> I've a feeling this might be causing some "tromboning" of some of the
> rules - they are popping in and out of the active.list daily
>
> Paul
>


Fwd: Disappearing Corpus

2019-05-15 Thread Kevin A. McGrail
FYI


-------- Forwarded Message --------
Subject: Disappearing Corpus
Date: Wed, 15 May 2019 13:42:39 +0100
From: Paul Stead
Reply-To: rul...@spamassassin.apache.org
To: rul...@spamassassin.apache.org



Hiya,

Over the last few days I thought I'd noticed nightlies going missing
from the QA website.

Attached are two images - the first taken last night at around 21:44 -
notice we have quite a few corpora in the list.

The second attachment shows the same date-rev as of a few minutes ago -
many corpora have gone missing - including my pds one - can anyone think
what might be going on here?

I've a feeling this might be causing some "tromboning" of some of the
rules - they are popping in and out of the active.list daily

Paul