subjectFrequencyCache has nothing to do with maxAllowedDups!

subjectFrequency is a blocking feature with high priority; maxAllowedDups
relates to the collection and has low priority.

>Question: this cache seems to start anew at restart, at least it did for 

subjectFrequencyCache is stored in tmpDB/files at shutdown and reloaded
from there at startup.

> I assume assp is removing non-printable characters in the normalization 

Subject normalization is done as follows:

    $sub =~ s/\.?\d+[\d.,]*(?: +\.?\d+[\d.,]*)*/ randnumber /go;  # any sequence of numbers . ,
    $sub =~ s/\p{Currency_Symbol}+/ currency /goi;  # any sequence of currency symbols
    $sub =~ s/(?:\p{Symbol})+/ symbol /go;  # any sequence of Unicode symbols
    $sub =~ s/[_\[\]\~\@\%\$\&\{\}<>#(),.'";:=!?*+\/\\\-]+/ /go;  # any sequence of punctuation and separators

find words:
  any word at least two characters long (Unicode)

stem words:
  detect the language of the subject
  stem all words
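
For illustration only, here is a rough Python sketch of the substitution and word-finding steps listed above. It is not the code from normSubject: the function names are made up, the Unicode property tests are approximated with unicodedata categories (Sc for currency symbols, S* for symbols), the word pattern \w{2,} is an assumption, and language detection and stemming are left out entirely.

```python
import re
import unicodedata

NUMBERS = re.compile(r'\.?\d+[\d.,]*(?: +\.?\d+[\d.,]*)*')
PUNCT = re.compile(r'''[_\[\]~@%$&{}<>#(),.'";:=!?*+/\\\-]+''')


def _replace_runs(text, predicate, token):
    """Replace each run of characters matching predicate with one token."""
    out = []
    in_run = False
    for ch in text:
        if predicate(ch):
            if not in_run:
                out.append(token)
                in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


def normalize_subject(sub):
    """Hypothetical approximation of the normalization steps (no stemming)."""
    # any sequence of numbers . ,
    sub = NUMBERS.sub(" randnumber ", sub)
    # any sequence of currency symbols (Unicode category Sc)
    sub = _replace_runs(sub, lambda c: unicodedata.category(c) == "Sc", " currency ")
    # any sequence of remaining Unicode symbols (categories Sm, Sk, So)
    sub = _replace_runs(sub, lambda c: unicodedata.category(c).startswith("S"), " symbol ")
    # any sequence of punctuation and separators
    sub = PUNCT.sub(" ", sub)
    # words that are at least two characters long (assumed pattern)
    return re.findall(r"\w{2,}", sub)
```

With this sketch, two subjects that differ only in punctuation, such as "Shark Tank: Loves This!" and "Shark Tank Loves This", produce the same word list, which is the property the duplicate comparison relies on.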

Normalization is done in sub normSubject. 


will show all normalized filenames and the related file numbers 

Open the hash, pick up some file numbers of the 50+, find the numbers in 
the hash values (browser search), compare the normalized subject (hash 
Show some examples.


From:    "K Post" <>
To:      "ASSP development mailing list" <>
Date:    05.04.2018 03:10
Subject: Re: [Assp-test] maxAllowedDups not working when punctuation is in subject???

The new sorting function in 18094 will make reviewing 
DB-subjectFrequencyCache much easier.   Thank you.

Question: this cache seems to start anew at restart, at least it did for 
me.  If a normalized subject filename isn't in the cache will the file be 
initially written to the corpus regardless of how many there already are 

Of the messages where I have 50+ seemingly identical filenames (except for 
the number at the end that ASSP inserts), I see NO difference in subject.  
I assume assp is removing non-printable characters in the normalization 
process?  It's not many sets of 50+ duplicates relative to my corpus size, 
but there's a bunch, and the files date back several months.  Weird right?

On Tue, Apr 3, 2018 at 4:50 AM, Thomas Eckardt <> wrote:
If this is the case, there is at least one character different in the 
normalized filenames. 

Normalization is done in sub normSubject. 


will show all normalized filenames and the related file numbers 


From:    "K Post" <>
To:      "ASSP development mailing list" <>
Date:    31.03.2018 19:58
Subject: Re: [Assp-test] maxAllowedDups not working when punctuation is in subject???

The vast majority of messages with identical subjects are only in the 
corpus max 3 times (as I have it set), but there's a bunch there 50+ 
times, spanning multiple weeks/months.   

On Wed, Mar 28, 2018 at 3:49 AM, Thomas Eckardt <> wrote: 
maxAllowedDups has a very, very low priority in assp, because maintaining a
large number of files may consume a large amount of system resources.
The spam folder will be maintained completely (and correctly) at startup
and at the end of the 'MaintBayesCollection' task.

You'll see a log line like:
Mar-27-18 10:53:06 [Worker_10000] Info: MaxAllowedDups - 20,823 files registered in spam folder - 0 files moved to folder C:/assp/discarded

If the same (or a similar) subject is received multiple times within a short
time frame (a few minutes), it may happen that all these mails are
collected. If the same subject is received again some time later, while
the workload is low enough, the files for this subject will be maintained
by the SMTP worker. If the same subject is never seen again in a received
mail, the duplicate files (filenames) will be corrected by the next
regular maintenance task (startup/MaintBayesCollection).
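
As a sketch only, the effect of such a maintenance pass could be pictured like this in Python. None of this is taken from the assp source: the function name, the (subject, file id) input shape, and the choice to keep the first files seen are all assumptions made for the example, and the priority and workload handling described above is not modeled.

```python
from collections import defaultdict


def plan_dup_cleanup(files, max_allowed_dups):
    """Split files into kept and discarded lists (hypothetical helper).

    files is an iterable of (normalized_subject, file_id) pairs.
    At most max_allowed_dups files are kept per normalized subject;
    which copies survive (here: the first ones seen) is an assumption.
    """
    seen = defaultdict(int)
    kept, discarded = [], []
    for subject, file_id in files:
        seen[subject] += 1
        if seen[subject] <= max_allowed_dups:
            kept.append((subject, file_id))
        else:
            discarded.append((subject, file_id))
    return kept, discarded


files = [("shark tank", 1), ("shark tank", 2), ("shark tank", 3),
         ("shark tank", 4), ("ring ear", 5)]
kept, discarded = plan_dup_cleanup(files, max_allowed_dups=3)
print(len(kept), "files registered -", len(discarded), "files moved")
```

The point of the sketch is that the limit is enforced per normalized subject whenever the pass runs, so a subject can temporarily exceed the limit between two passes.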

The log line above (at startup) shows that this feature is working as
expected. There were no unmaintained subjects at the last shutdown.

Having maxAllowedDups ignored for a specific subject at a point in time is
normal. Over a long time range, this feature works as expected.


From:    "K Post" <>
To:      "ASSP development mailing list" <>
Date:    27.03.2018 18:29
Subject: Re: [Assp-test] maxAllowedDups not working when punctuation is in subject???

Here's one that contains no punctuation that I have 50+ times in the 
    Subject: Shark Tank Loves This New Diet Product 

Of the ones that didn't have punctuation that I spot checked, they all 
were sent to an address that matched a Collect Address.  Are collect 
addresses exempt from the duplicate file name limit? 

Here's an example of one with punctuation that is there many times and is 
not sent to collect addresses 
     Subject: Ringing Ears? Eat This for Breakfast & Destroy Tinnitus 

Anyone else seeing anything like this? 

On Tue, Mar 27, 2018 at 12:13 PM, K Post <> wrote: 
I use subject names to store messages. 
maxSubjectLength: 0 
maxAllowedDups: 3 

It's been set this exact way for years and generally seems to work.  
However, having just completed going through the 15,000 spam messages in
/spam (that was fun), I see that some subjects are repeated many many

Most of those with more than 3 duplicates (but I don't think all) seem to have
punctuation in the subject: an extra colon, a trailing exclamation mark,
a period, etc.  Those punctuation marks are (rightfully) ignored in the
file name, but might they be used in the comparison so they're not coming 
up as already there?  I really don't know, I just know what I'm seeing. 


Check out the vibrant tech community on one of the world's most
engaging tech sites,! 
Assp-test mailing list

This email and any files transmitted with it may be confidential, legally 
privileged and protected in law and are intended solely for the use of the 

individual to whom it is addressed.
This email was multiple times scanned for viruses. There should be no 
known virus in this email!
