subjectFrequencyCache has nothing to do with maxAllowedDups!
subjectFrequency is a blocking feature with high priority. maxAllowedDups
relates to the collection and has low priority.
>Question: this cache seems to start anew at restart, at least it did for
me.
subjectFrequencyCache is written to tmpDB/files at shutdown and reloaded
from there at startup.
> I assume assp is removing non-printable characters in the normalization
process?
Subject normalization is done as follows:

substitute:
  # any sequence of numbers . ,
  $sub =~ s/\.?\d+[\d.,]*(?: +\.?\d+[\d.,]*)*/ randnumber /go;
  # any sequence of currency symbols
  $sub =~ s/\p{Currency_Symbol}+/ currency /goi;
  # any sequence of unicode symbols
  $sub =~ s/(?:\p{Symbol})+/ symbol /go;
  # any sequence of punctuation and separators
  $sub =~ s/[_\[\]\~\@\%\$\&\{\}<>#(),.'";:=!?*+\/\\\-]+/ /go;

find words:
  any word at least two characters long (unicode)

stem words:
  detect the language of the subject
  stem all words
Normalization is done in sub normSubject.
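
To make the substitution steps above easier to experiment with, here is a rough Python approximation of them. It is only a sketch and not the code of normSubject: the function names are made up for illustration, the Unicode classes \p{Currency_Symbol} and \p{Symbol} are emulated with unicodedata categories (Sc and S*), the word search is assumed to be a plain "two or more word characters" match, and the language detection and stemming steps are left out entirely.

```python
import re
import unicodedata

# Same patterns as the first and fourth Perl substitutions above.
NUMBER_RUN = re.compile(r"\.?\d+[\d.,]*(?: +\.?\d+[\d.,]*)*")
PUNCT_RUN = re.compile(r"""[_\[\]~@%$&{}<>#(),.'";:=!?*+/\\\-]+""")


def replace_category_runs(text, prefix, token):
    """Replace each run of characters whose Unicode general category
    starts with `prefix` by ' token ' (stand-in for \\p{...}+ in Perl)."""
    out = []
    in_run = False
    for ch in text:
        if unicodedata.category(ch).startswith(prefix):
            if not in_run:
                out.append(f" {token} ")
                in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


def normalize_subject(subject):
    """Hypothetical approximation of the substitute + find-words steps.
    Language detection and stemming are NOT implemented here."""
    s = NUMBER_RUN.sub(" randnumber ", subject)
    s = replace_category_runs(s, "Sc", "currency")  # currency symbols
    s = replace_category_runs(s, "S", "symbol")     # any remaining symbols
    s = PUNCT_RUN.sub(" ", s)
    # Assumption: a "word" is two or more Unicode word characters.
    return " ".join(re.findall(r"\w{2,}", s))


if __name__ == "__main__":
    print(normalize_subject("Ringing Ears? Eat This for Breakfast & Destroy Tinnitus Fast."))
    print(normalize_subject("Save $50 now!!!"))
```

With this approximation, a subject with and without its punctuation ends up as the same string, which is the behaviour one would expect if punctuation alone does not cause duplicates to be missed.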
http://your.assp:55555/edit?file=DB-SpamfileNames&note=1h
will show all normalized filenames and the related file numbers.
Open the hash, pick some file numbers from the 50+ duplicates, find those
numbers in the hash values (browser search), and compare the normalized
subjects (hash keys).
Please show some examples.
Thomas
From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 05.04.2018 03:10
Subject: Re: [Assp-test] maxAllowedDups not working when punctuation is in subject???
The new sorting function in 18094 will make reviewing
DB-subjectFrequencyCache much easier. Thank you.
Question: this cache seems to start anew at restart, at least it did for
me. If a normalized subject filename isn't in the cache will the file be
initially written to the corpus regardless of how many there already are
there?
Of the messages where I have 50+ seemingly identical filenames (except for
the number at the end that ASSP inserts), I see NO difference in subject.
I assume assp is removing non-printable characters in the normalization
process? It's not many sets of 50+ duplicates relative to my corpus size,
but there's a bunch, and the files date back several months. Weird right?
On Tue, Apr 3, 2018 at 4:50 AM, Thomas Eckardt <thomas.ecka...@thockar.com> wrote:
If this is the case, there is at least one character different in the
normalized filenames.
Normalization is done in sub normSubject.
http://your.assp:55555/edit?file=DB-SpamfileNames&note=1h
will show all normalized filenames and the related file numbers.
Thomas
From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 31.03.2018 19:58
Subject: Re: [Assp-test] maxAllowedDups not working when punctuation is in subject???
The vast majority of messages with identical subjects are only in the
corpus max 3 times (as I have it set), but there's a bunch there 50+
times, spanning multiple weeks/months.
On Wed, Mar 28, 2018 at 3:49 AM, Thomas Eckardt <
thomas.ecka...@thockar.com> wrote:
maxAllowedDups has a very low priority in assp, because maintaining a
large number of files may consume a large amount of system resources.
The spam folder is maintained completely (and correctly) at startup
and at the end of the 'MaintBayesCollection' task.
You'll see a log line like
Mar-27-18 10:53:06 [Worker_10000] Info: MaxAllowedDups - 20,823 files registered in spam folder - 0 files moved to folder C:/assp/discarded
If the same (or a similar) subject is received multiple times within a
short time frame (a few minutes), all of these mails may be collected. If
the same subject is received again some time later, while the workload is
low enough, the files for this subject will be maintained by the SMTP
worker. If the same subject is never seen again in a received mail, the
duplicate files (filenames) will be corrected by the next regular
maintenance task (startup/MaintBayesCollection).
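
The end result of such a maintenance pass can be pictured with a small Python sketch. This is only an illustration of the idea "keep at most maxAllowedDups files per normalized subject and move the rest away", not ASSP's code: the function name, the input shape, the example filenames, and the choice to keep the first files seen are all assumptions.

```python
from collections import defaultdict


def plan_duplicate_cleanup(files, max_allowed_dups):
    """Decide which files to keep and which to move to the discard folder.

    `files` is an iterable of (normalized_subject, filename) pairs.
    Assumption: the first `max_allowed_dups` files seen per subject are
    kept; which ones ASSP actually keeps is not stated in this thread.
    """
    kept = defaultdict(list)
    to_discard = []
    for subject, filename in files:
        if len(kept[subject]) < max_allowed_dups:
            kept[subject].append(filename)
        else:
            to_discard.append(filename)
    return dict(kept), to_discard


if __name__ == "__main__":
    # Hypothetical filenames, for illustration only.
    spam_files = [
        ("shark tank diet", "shark-1.eml"),
        ("shark tank diet", "shark-2.eml"),
        ("ring ear", "ears-1.eml"),
        ("shark tank diet", "shark-3.eml"),
        ("shark tank diet", "shark-4.eml"),
    ]
    kept, discard = plan_duplicate_cleanup(spam_files, max_allowed_dups=3)
    print(kept)
    print(discard)
```

Because this cleanup only runs when the workload allows it or during regular maintenance, the folder can temporarily hold more than the configured number of duplicates between passes.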
The log line above (at startup) shows that this feature is working as
expected. There were no unmaintained subjects at the last shutdown.
Having maxAllowedDups ignored for a specific subject at a given point in
time is normal. Over a longer time range, this feature works as expected.
Thomas
From: "K Post" <nntp.p...@gmail.com>
To: "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date: 27.03.2018 18:29
Subject: Re: [Assp-test] maxAllowedDups not working when punctuation is in subject???
Here's one that contains no punctuation that I have 50+ times in the
corpus
Subject: Shark Tank Loves This New Diet Product
Of the ones that didn't have punctuation that I spot checked, they all
were sent to an address that matched a Collect Address. Are collect
addresses exempt from the duplicate file name limit?
Here's an example of one with punctuation that is there many times and is
not sent to collect addresses
Subject: Ringing Ears? Eat This for Breakfast & Destroy Tinnitus Fast.
Anyone else seeing anything like this?
On Tue, Mar 27, 2018 at 12:13 PM, K Post <nntp.p...@gmail.com> wrote:
I use subject names to store messages.
maxSubjectLength: 0
maxAllowedDups: 3
It's been set this exact way for years and generally seems to work.
However, having just completed going through the 15,000 spam messages in
/spam (that was fun), I see that some subjects are repeated many, many
times.
Most of these with more than 3 duplicates (but I don't think all) seem to
have punctuation in the subject: an extra colon, a trailing exclamation
mark, a period, etc. Those punctuation marks are (rightfully) ignored in
the file name, but might they be used in the comparison, so the subjects
are not coming up as already there? I really don't know, I just know what
I'm seeing.
Thanks
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test
DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally
privileged and protected in law and are intended solely for the use of the
individual to whom it is addressed.
This email was multiple times scanned for viruses. There should be no
known virus in this email!
*******************************************************