Ken, it would be nice if you would consider signing off from this list, or 
at least no longer posting here.

Thank you.

Thomas





From:    "K Post" <nntp.p...@gmail.com>
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net>
Date:    Nov 12, 2021 22:46
Subject: Re: [Assp-test] Concept Question: Scan entire message for 
Bombs, regardless of MaxBytes setting? New MaxBytes recommendation?



First off, WOW.  Our rebuild times are in no way similar.  At first I 
thought it was because you have fancy SSDs and lots of horsepower, but I'm 
seeing now that you have both useDB4Rebuild off and RebuildUseFileModel on, 
the opposite of my settings.  I have useDB4Rebuild on and never enabled 
RebuildUsedFileModel after my initial attempts failed (early on with that 
feature).  useDB4Rebuild is the default, and I was always worried about RAM 
when I started using ASSP 10+ years ago, so I never looked back.  

A long rebuild time doesn't bother me, but seeing how fast you can do one 
has got me back to needing to test the settings on my end again.  Thanks 
for that encouragement.


I'm worried because going up to 50k MaxBytes on my system seemed to cause a 
lot of false positives.  I don't understand how that's possible, but it's 
what happened.  I would have expected the opposite: too much spam getting 
through rather than too much legitimate mail being blocked.  Plus, I don't 
think that using that much for Bayesian is generally necessary (or maybe 
it's even detrimental?).  Accuracy was very high for me at 6k and 10k, but 
I was missing the bombs. 


The question remains for me about the >CONCEPT< of optionally scanning 
more of a message at the time of attempted delivery for bombs.  ClamAV 
uses its own maximum size setting.  Why not also give us that option for 
Bombs?  For the case I explained where bombs are late in the email body 
and likely other scenarios, don't you think it would be helpful to have a 
BombAddlBytes variable in the GUI? 

You know there's no way that I could ever code a plugin and that there's 
even less of a chance of this charity paying for one to be built!  I still 
have duct tape holding my desk chair together.  

Modifying getbody seems pretty straightforward.  Add a new variable 
called $bombdataref that would be used in place of $dataref for all bomb 
comparisons, similar to the way that $clamavbytes is used for the ClamAV 
stuff:
    # scan MaxBytes plus the optional extra amount (0 if BombAddlBytes is unset)
    my $bombdataref = $maxbytes + ( $BombAddlBytes ? $BombAddlBytes : 0 );
Then, instead of
    if ( ! BombOK( $fh, $dataref ) ) {
use
    if ( ! BombOK( $fh, $bombdataref ) ) {
and the like everywhere that there's a bomb or script check in getbody.

There would also need to be changes in analyze and anywhere else that the 
bomb checks are done.
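
The idea of two separate scan windows can be illustrated outside of ASSP. 
The following is a minimal, language-neutral sketch in Python, not ASSP's 
Perl code: scan_windows and bomb_ok are hypothetical helpers, the 
parameter names only mirror MaxBytes and the proposed BombAddlBytes, the 
phone number is fictional, and header handling is ignored entirely.

```python
import re

def scan_windows(body: bytes, max_bytes: int, bomb_addl_bytes: int = 0):
    """Split one message body into the two windows discussed above.

    Returns (bayes_window, bomb_window): the Bayesian/HMM window is the
    first max_bytes bytes; the bomb window extends it by bomb_addl_bytes.
    """
    extra = max(bomb_addl_bytes, 0)  # treat a negative value as "no extra"
    return body[:max_bytes], body[:max_bytes + extra]

def bomb_ok(window: bytes, bomb_re: re.Pattern) -> bool:
    """True when the bomb expression does not match inside the window."""
    return bomb_re.search(window) is None

# Example: a ~30 kB HTML-heavy body with a scam phone number at the very end.
body = b"<td>filler</td>" * 2000 + b" call 800-555-0100"
bomb_re = re.compile(rb"800-555-0100")  # fictional number, hypothetical bomb
bayes_win, bomb_win = scan_windows(body, max_bytes=20000, bomb_addl_bytes=20000)

print(bomb_ok(bayes_win, bomb_re))  # number lies beyond the 20000-byte window
print(bomb_ok(bomb_win, bomb_re))   # extended window reaches the number
```

In this sketch the Bayesian window stays at 20000 bytes, so the corpus and 
rebuild would be unaffected; only the bomb check reads further.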

I'm more than willing to try to modify ASSP as described above, give it a 
go, and report back.  It won't be easy for me to make the changes and have 
it work, but I'm game.  Before I do though, I'm concerned that you don't 
think that scanning more for bombs is a sound concept.  Or maybe you just 
don't think it's necessary?  I'm most interested in your opinion on that 
before I move forward.




On Fri, Nov 12, 2021 at 1:08 PM Thomas Eckardt <thomas.ecka...@thockar.com> wrote:
Nov-12-21 04:00:20 RebuildSpamDB-thread rebuildspamdb-version 8.14 started 
in ASSP version 2.6.6(21314) 

Nov-12-21 04:00:20 detection of local disclaimers is enabled 

Nov-12-21 04:00:20 info: 'useDB4Rebuild' is NOT set to on - the rebuild 
spamdb process will possibly require a large amount of memory - but it 
will run very fast! 

Nov-12-21 04:00:20 RebuildSpamDB reloaded and uses the internal FileModel 
(with 39917 entries) to speedup processing 

Nov-12-21 04:00:20 RebuildSpamDB allocated 963.08 MByte of RAM to load the 
internal FileModel 

Nov-12-21 04:00:20 RebuildSpamDB will create a Hidden Markov Model 

Nov-12-21 04:00:20 RebuildSpamDB will include attachment-database-entries 
in to spamdb 

Nov-12-21 04:00:20 RebuildSpamDB will create unicode enabled databases 

Nov-12-21 04:00:20 RebuildSpamDB will process all words as Sequence of UAX 
#29 Grapheme Clusters 

Nov-12-21 04:00:20 RebuildSpamDB will normalize unicode characters 

Nov-12-21 04:00:20 RebuildSpamDB will use the ASSP_WordStem engine 

Nov-12-21 04:00:20 ---ASSP Settings--- 

Nov-12-21 04:00:20 RebuildSpamDB will create private spamdb entries for 
users email addresses and each local domain. 

Nov-12-21 04:00:20 Do Not Collect RedRe Messages: Enabled 
**Messages matching the RedRe will be removed from the corpus!** 

Nov-12-21 04:00:20 Use Subject as Maillog Names: True 
Nov-12-21 04:00:20 Maxbytes: 25,000 
Nov-12-21 04:00:20 Maxfiles: 31,000 
Nov-12-21 04:00:20 RebuildFileTimeLimit: 1 5 
Nov-12-21 04:00:20 RebuildFileTimeLimit: files will be moved away from the 
corpus if their processing takes longer than 5 second(s) 

processing ~40,000 corpus files in ~4 minutes 
building 15,500 spamdb.helo records in 2 seconds 
building 3,200,000 spamdb records in 25 seconds 
building 7,200,000 hmmdb records in 1:33 minutes 

complete processing time is 6 minutes. 

populating the records to the mysql database takes some minutes longer 


So, MaxBytes := 100,000 seems to be a possible setting (but IMHO this will 
not improve detection rates). 

If you need to process complete mails for bombs - you'll need to write 
your own level 2 assp-plugin. 

Thomas 








From:    "K Post" <nntp.p...@gmail.com> 
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net> 
Date:    Nov 12, 2021 16:56 
Subject: Re: [Assp-test] Concept Question: Scan entire message for 
Bombs, regardless of MaxBytes setting? New MaxBytes recommendation? 




Absolutely I've thought about this.  I consider everything I post prior to 
posting. 

Can you briefly explain why the ability to scan (MaxBytes + some 
additional amount) of incoming mail for bombs, while using only MaxBytes 
for Bayesian and the rebuild, would be such a bad idea? 

Since you questioned if I ever thought about this, here's what the thought 
process is and the reason for the request.  Maybe I didn't explain myself 
well enough in the previous messages: 

The MaxBytes "documentation" says to lower it to 3000 for a mature 
installation, but 10x larger than that if you can handle it. 

How many bytes of the message body will ASSP look at - the message header 
is always included in all checks. Mails stored in the collecting folders 
will be truncated to this size, if StoreCompleteMail is disabled. The 
average of Ham messages (message body) is 6K, the average of Spam messages 
is 3K. Usually the spam folder will be filled quicker than the notspam 
folder, therefore set this value to 4000 to get more wordpairs per Ham 
Message. When both folders are close to the maxfiles limit, reduce it to 
3000. 
If your system is fast enough and has enough RAM multiply all the above 
recommendations and the default value by ten.


The GUI doesn't say "IF the average is 6k ham, 3k spam," it says that it 
IS 6k ham / 3k spam.  That's not true of my installation.  My spam, as 
I've mentioned before, has a median size of about 20kb because of all of 
the HTML in it, and my not-spam has a median size of 40kb.  Using the 
logic in your GUI, I believe I should set my MaxBytes to 20kb, the median 
size of my spam corpus.   

But, if I set my MaxBytes to 20kb (which it appears to be able to handle 
okay, rebuilding in an hour and change), then bombs after 20kb aren't 
detected when a message is attempting delivery.   

Why does this matter to me? 
We're seeing messages from @gmail.com and @whatever.onmicrosoft.com 
addresses that copy legitimate-looking order receipts from vendors like 
Amazon.com, BestBuy (a US-based big-box electronics store), and Norton.  
Many look identical to a legitimate message.  Ultimately, they want you to 
call them on the phone and give them your credit card number, under the 
guise that they're going to refund the charge.  Classic scam. 

These messages will always pass Bayesian because they read identically to 
real messages.  BUT, I can detect some of them by the phone numbers that 
they direct people to.  The email addresses change frequently, but the 
scam phone numbers remain pretty constant.  I could maintain a list of 
known bad phone numbers (also available online) to capture these messages 
before they're delivered.  Simple.  If the message has one of these phone 
numbers, score it such that it'll get blocked. 
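
Matching a list of known phone numbers regardless of how they are 
punctuated can be sketched as follows. This is a minimal illustration in 
Python, not an ASSP bomb file: build_phone_bomb_re and phone_pattern are 
hypothetical helpers, the numbers are fictional, and the choice to allow 
up to three separator characters between digits is an assumption.

```python
import re

SEPARATORS = r"[\s().\-]{0,3}"  # assumed: spaces, parentheses, dots, dashes

def phone_pattern(number: str) -> str:
    """Turn one phone number into a pattern that tolerates separators."""
    digits = re.sub(r"\D", "", number)  # keep only the digits
    return SEPARATORS.join(digits)      # optional separators between digits

def build_phone_bomb_re(numbers):
    """Compile one expression that matches any number from the list."""
    patterns = [p for p in (phone_pattern(n) for n in numbers) if p]
    if not patterns:
        raise ValueError("no usable phone numbers given")
    # Lookarounds keep the match from starting or ending inside a longer number.
    return re.compile(r"(?<!\d)(?:" + "|".join(patterns) + r")(?!\d)")

bad_numbers = ["800-555-0100", "(888) 555-0142"]  # fictional examples
bomb_re = build_phone_bomb_re(bad_numbers)

print(bool(bomb_re.search("Call (800) 555-0100 for your refund")))
print(bool(bomb_re.search("Order total: 800-555-0199")))
```

Keeping the list as plain numbers and generating the expression from it 
means new numbers can be added without hand-writing a regular expression 
for each formatting variant.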

The problem with many of these emails is that the phone number is way past 
the 3k mark, and past the 20k mark too.  The scammers have a bunch of HTML 
in the "confirmation" email, just like real stores tend to do.  I tried 
increasing MaxBytes up to 50kb, which easily caught messages with bombs 
later in the body, but that then seemed to cause a lot of false positives 
and, obviously, a much longer rebuild process.   

If there could be a "continue scanning for bombs for ___kb after MaxBytes" 
setting, that would let bombs later in the body be detected.  I don't know 
what the downside to having such a feature would be. 


Based on your reaction to my question, I'm obviously missing something 
important. 
  




On Thu, Nov 11, 2021 at 1:38 AM Thomas Eckardt <thomas.ecka...@thockar.com> wrote: 
>Is there logic to having a separate MaxBytes setting like 
MaxBytesForBombs that's used only during message delivery?  That way, the 
entire message can be scanned for bombs, but the rebuild could use a lower 
number to better balance the differential between the average sized spam 
and average sized not-spam message. 

DID YOU EVER think about that ??????????????? Or do you only write 
something to fill up the community mailing list? 

No - no way! 

Thomas 







From:    "K Post" <nntp.p...@gmail.com> 
To:      "ASSP development mailing list" <assp-test@lists.sourceforge.net> 
Date:    Nov 10, 2021 20:22 
Subject: Re: [Assp-test] Concept Question: Scan entire message for 
Bombs, regardless of MaxBytes setting? New MaxBytes recommendation? 



After about 12 weeks of going from MaxBytes of 4k to MaxBytes of 50k, I've 
seen: 
1) Rebuild go from just over an hour (with 30k MaxFiles) to just over 2 
hours.  I'm fine with that, there's more to scan. 
2) Bomb detections improve, as a lot of what's detected is beyond the 20k 
or 30k mark. 
3) But Bayesian false positives go way up.  Lots of mail that would have 
(correctly) been delivered now gets too high a score and is blocked. 

Surely #3 is specific to the types of messages my users are getting and I 
can tweak settings.  BUT, it makes me raise this question again: 
Is there logic to having a separate MaxBytes setting like MaxBytesForBombs 
that's used only during message delivery?  That way, the entire message 
can be scanned for bombs, but the rebuild could use a lower number to 
better balance the differential between the average sized spam and average 
sized not-spam message. 



On Mon, Nov 1, 2021 at 2:43 PM K Post <nntp.p...@gmail.com> wrote: 
When looking at the "Use this HTML Parser" section on the GUI, I found 
this line: 
it is recommended to set MaxBytes to 50000 (be careful on heavy load 
systems - spam bomb regular expressions will take longer using 50000!). 
I'm going to change my settings and see how bad the rebuild time is.  I've 
got enough processing power and RAM now, but the disks aren't SSD.  Just a 
4 disk Raid 1+0 traditional HDD setup.  We'll see... 

Since HTML email accounts for a big percentage of all mail, might it be a 
good idea to update/expand the guidance in the MaxBytes section of the 
GUI?    



On Fri, Oct 29, 2021 at 8:40 PM K Post <nntp.p...@gmail.com> wrote: 
Summary: 
Should/could any consideration be given to having ASSP scan the entire 
message at the time it is received for Bombs (only), while still using 
MaxBytes for Bayesian/HMM? 

We've been having some cleverly crafted messages slip through all filters 
that would be easy to catch with Bombs if only the catchable content came 
before MaxBytes.  These messages are 20kb+ and have a scam phone number at 
the very end, beyond MaxBytes.  I want/need to use bombs to catch the scam 
phone numbers. 

With MaxBytes set to 3000, which is useful for faster RebuildSpamDB, these 
BombDataRE matches just aren't being caught.  If I increase MaxBytes, my 
BombDataRE catches them, but then rebuildspamdb is (probably? see below) 
longer than it needs to be. 

So, is there any value in considering a MaxBytesAdditionalForBombs 
variable which would be added to MaxBytes and only used when scanning for 
bombs as messages arrive?   Would that kill performance??  Other 
downsides? 

We could still only look at MaxBytes for Bayesian/HMM since it's only 
MaxBytes used when building those databases. 

What do you think? 

And while we're talking MaxBytes: 
I've asked this before, is the guidance for 3kb for MaxBytes once there's 
a mature corpus still a valid recommendation?  With unlimited horsepower 
and ram, sure, why not, do 30kb or 100kb.  That's not my reality, so I 
want to see where to best allocate resources. If 3kb is still the 
guidance, even though the spam files I'm seeing have a median size around 
20kb, so be it.  I feel like when that guidance was written, html wasn't 
used as prolifically in spam.  The median size of notspam in my corpus is 
about 40kb.  That's determined unscientifically by sorting by size and 
scrolling to approximately half way down. 
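
The median can also be measured directly instead of estimated by 
scrolling. A minimal sketch in Python, assuming the corpus is a folder 
with one file per message; median_file_size is a hypothetical helper and 
the folder names in the usage comment are assumptions, not verified ASSP 
paths.

```python
from pathlib import Path
from statistics import median

def median_file_size(folder: str) -> float:
    """Return the median size in bytes of the regular files in folder."""
    sizes = [p.stat().st_size for p in Path(folder).iterdir() if p.is_file()]
    if not sizes:
        raise ValueError(f"no files found in {folder}")
    return median(sizes)

# Usage (folder names are assumptions):
# print(median_file_size("spam"), median_file_size("notspam"))
```

Comparing the two medians gives a concrete basis for judging whether a 
MaxBytes value covers a typical message in each folder.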

Thanks.  Have a good weekend. 
Ken 
_______________________________________________
Assp-test mailing list
Assp-test@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/assp-test




DISCLAIMER:
*******************************************************
This email and any files transmitted with it may be confidential, legally 
privileged and protected in law and are intended solely for the use of the 

individual to whom it is addressed.
This email was multiple times scanned for viruses. There should be no 
known virus in this email!
*******************************************************
