Re: Rule HK_SCAM is triggered by standard business email

2020-07-02 Thread @lbutlr
On 01 Jul 2020, at 14:20, Aner Perez wrote:
> we have the spam threshold set very low (2.4)

This is a terrible idea and exposes a fundamental misunderstanding of how SA 
works.

If SA scores an email as 3.3 then the message is not considered spam by SA. If 
you ignore this and mark it as spam anyway, you have no one to blame but 
yourself. Reducing the threshold increases the number of non-spam messages that 
are marked as spam. It will also have very little effect on actual spam 
messages. The only exception to this is if you have a badly trained Bayes, as 
that can swing the scoring quite a lot.

Set your threshold back to 5.0 and train your Bayes with actual spam you 
receive and actual ham you receive. The best spam to train on is spam that SA 
did not tag as spam (ignoring the Bayes portion of the score). So, a message 
scored at 5.5 with BAYES_50 is a prime candidate for training, as it would have 
scored 4.7 without the BAYES_50.
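
As a rough illustration of that selection rule, here is a minimal Python sketch. It is not part of SA; the helper names and the OTHER_RULES placeholder are hypothetical, and the 0.8 value for BAYES_50 is simply the difference implied by the 5.5 and 4.7 figures above.

```python
# Hypothetical helpers, not SA code: pick spam worth feeding to Bayes training.
THRESHOLD = 5.0  # the stock spam threshold discussed in this thread


def score_without_bayes(hits):
    """Sum the rule scores, skipping every rule whose name starts with BAYES_."""
    return sum(score for name, score in hits.items()
               if not name.startswith("BAYES_"))


def is_training_candidate(hits, threshold=THRESHOLD):
    """True when the non-Bayes rules alone would not have tagged the message."""
    return score_without_bayes(hits) < threshold


# OTHER_RULES is a placeholder standing in for all non-Bayes hits combined.
example_hits = {"OTHER_RULES": 4.7, "BAYES_50": 0.8}
print(is_training_candidate(example_hits))
```

This only makes sense for messages a human has already confirmed as spam; the score is used to rank them, not to decide what they are.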

It would have been better, I think, had SA designed the system to score 
anything over 0 as spam and anything under 0 as ham, as I suspect very few 
people would then make this mistake, but it's a bit late for that now.

Just think of it this way: when you set the threshold below 5, you are saying 
to SA "please mark legitimate mail that I want to receive as spam."



-- 
'Oh, them as makes the endings don't get them,' said Granny.
--Maskerade



Re: Rule HK_SCAM is triggered by standard business email

2020-07-01 Thread Henrik K
On Wed, Jul 01, 2020 at 01:29:51PM -0700, John Hardin wrote:
>
> Agreed, that's why I want Henrik to comment. I don't have the corpus he used
> to develop that rule.

Those are really old rules, I don't have it either. ;-)

__HK_SCAM_S7 seems to have regressed FP wise, just gonna drop it..



Re: Rule HK_SCAM is triggered by standard business email

2020-07-01 Thread Martin Gregorie
On Wed, 2020-07-01 at 16:20 -0400, Aner Perez wrote:
> > It looks to me like the logic in __HK_SCAM_S7 is a little off...
> > 
> > /(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture) (?:enterprise|propos(?:al|ition)))/i
> > 
> > seems like it should be:
> > 
> > /(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture|enterprise) propos(?:al|ition))/i
> > 
> 
IME using a meta-rule that ANDs two rules of that type works well. 

The key is to put words or phrases that often occur in spam in each of
the sub-rules, for instance having selling jargon ("lowest prices",
"unbeatable value") in one rule and product names ("flip flops",
"vodka", "power packs") in the other. As a benefit, if the lists are
well-chosen from words and phrases from spam you've received, it will
also hit on sales spam using combinations you've not previously seen
while being surprisingly good at not giving FPs on business or personal
letters.

The only disadvantage is that the subrules get a bit unwieldy and hard
to edit once their definitions get much longer than 80 characters. That
aside, they're easy to understand and maintain.
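
To make the idea concrete, here is a minimal Python sketch of the AND logic. It only mimics what a meta rule over two sub-rules would do; the pattern names are made up for illustration, and the word lists are just the examples mentioned above, not a tested rule set.

```python
import re

# Illustrative sub-rule patterns; real lists would come from received spam.
SELLING_JARGON = re.compile(r"\b(?:lowest prices|unbeatable value)\b", re.I)
PRODUCT_NAMES = re.compile(r"\b(?:flip flops|vodka|power packs)\b", re.I)


def meta_hit(body):
    """Fire only when both sub-rules match, like a meta rule ANDing them."""
    jargon = SELLING_JARGON.search(body) is not None
    product = PRODUCT_NAMES.search(body) is not None
    return jargon and product


print(meta_hit("Unbeatable value on vodka this week"))
print(meta_hit("We offer the lowest prices on consulting"))
```

Either list matching on its own does nothing, which is what keeps ordinary business or personal mail from hitting.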

Martin





Re: Rule HK_SCAM is triggered by standard business email

2020-07-01 Thread John Hardin

On Wed, 1 Jul 2020, Aner Perez wrote:


On 7/1/20 3:52 PM, John Hardin wrote:

On Wed, 1 Jul 2020, Aner Perez wrote:

I opened a bug (7832) about this but was told to report on the SA users 
mailing list instead.


The attached email is an example which triggers the HK_SCAM rule.  Looks 
like __HK_SCAM_S7 is the culprit here since it matches the words 
"business" and "enterprise" when they are found one after the other (even 
on different lines).


In the real world this was triggered by a business email that had the 
following in the signature:


FirstName LastName
Altice Business
Enterprise Account Executive


What was the *overall* score of that message? Was this rule enough to push 
the message over the spam threshold (5 points)? Or was the message still 
scored as ham?


In our case it was marked as spam but only because we have the spam 
threshold set very low (2.4). The message scored a 3.357 when the 
BAYES_50 was added in.


Yeah, that's why doing that blindly is a bad idea. Masscheck sets the base 
rule scores so that spams score 5 points. If you reduce the spam 
threshold, you increase FPs. You need to compensate for that if you do it.
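
A tiny Python sketch of that effect, using the numbers from this thread. The function is hypothetical, and the "score at or above the threshold means spam" comparison is my assumption about how the threshold is applied.

```python
def is_spam(score, required_score=5.0):
    """Assumed rule: a message is tagged when its score reaches the threshold."""
    return score >= required_score


message_score = 3.357  # the score reported for the FP in this thread

print(is_spam(message_score, required_score=2.4))  # lowered threshold
print(is_spam(message_score))                      # stock threshold of 5.0
```

The same message is tagged at 2.4 and passes at 5.0, with nothing about the message or the rules having changed.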



It looks to me like the logic in __HK_SCAM_S7 is a little off...

/(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture) (?:enterprise|propos(?:al|ition)))/i


seems like it should be:

/(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture|enterprise) propos(?:al|ition))/i




That makes more sense but the rule still seems like it would be easily 
triggered by standard business talk (e.g. business proposal).  I guess that's 
the nature of business emails... they're naturally spammy.


Agreed, that's why I want Henrik to comment. I don't have the corpus he 
used to develop that rule.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Of the twenty-two civilizations that have appeared in history,
  nineteen of them collapsed when they reached the moral state the
  United States is in now.  -- Arnold Toynbee
---
 3 days until the 244th anniversary of the Declaration of Independence

Re: Rule HK_SCAM is triggered by standard business email

2020-07-01 Thread Aner Perez

On 7/1/20 3:52 PM, John Hardin wrote:

On Wed, 1 Jul 2020, Aner Perez wrote:

I opened a bug (7832) about this but was told to report on the SA users mailing list 
instead.


The attached email is an example which triggers the HK_SCAM rule.  Looks like 
__HK_SCAM_S7 is the culprit here since it matches the words "business" and "enterprise" 
when they are found one after the other (even on different lines).


In the real world this was triggered by a business email that had the following in the 
signature:


FirstName LastName
Altice Business
Enterprise Account Executive


What was the *overall* score of that message? Was this rule enough to push the message 
over the spam threshold (5 points)? Or was the message still scored as ham?


In our case it was marked as spam but only because we have the spam threshold set very low 
(2.4).  The message scored a 3.357 when the BAYES_50 was added in.




It looks to me like the logic in __HK_SCAM_S7 is a little off...

/(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture) (?:enterprise|propos(?:al|ition)))/i


seems like it should be:

/(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture|enterprise) propos(?:al|ition))/i




That makes more sense but the rule still seems like it would be easily triggered by 
standard business talk (e.g. business proposal).  I guess that's the nature of business 
emails... they're naturally spammy.



...but I'll let Henrik comment.


Potentially, making it a rawbody rule might avoid this FP without affecting its 
performance against the targeted spams...



For future reference: sending a sample email to the list as a bare attachment is 
problematic, as it may be altered en-route and thus invalidate any meaningful analysis. 
It's better to attach it as a zip/gzip, or to upload it to someplace like Pastebin and 
just post the URL to it here. (In this case, your description should probably be enough to 
figure it out without the sample so you shouldn't need to do that unless someone 
explicitly asks you to do so.)




Thanks I'll keep that in mind.

- Aner


Re: Rule HK_SCAM is triggered by standard business email

2020-07-01 Thread John Hardin

On Wed, 1 Jul 2020, Aner Perez wrote:

I opened a bug (7832) about this but was told to report on the SA users 
mailing list instead.


The attached email is an example which triggers the HK_SCAM rule.  Looks like 
__HK_SCAM_S7 is the culprit here since it matches the words "business" and 
"enterprise" when they are found one after the other (even on different 
lines).


In the real world this was triggered by a business email that had the 
following in the signature:


FirstName LastName
Altice Business
Enterprise Account Executive


What was the *overall* score of that message? Was this rule enough to push 
the message over the spam threshold (5 points)? Or was the message still 
scored as ham?


It looks to me like the logic in __HK_SCAM_S7 is a little off...

/(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture) (?:enterprise|propos(?:al|ition)))/i

seems like it should be:

/(?:(?:investment|proposed|lucrative) (?:business|venture)|(?:business|venture|enterprise) propos(?:al|ition))/i

...but I'll let Henrik comment.
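
For anyone who wants to compare the two patterns outside SA, here is a minimal Python sketch. The normalize() helper is an assumption on my part: it approximates body rendering by collapsing line breaks and whitespace runs into single spaces, which is why words on adjacent lines can match. Python's re module is used here in place of Perl regexes; these particular patterns behave the same in both.

```python
import re

# Current __HK_SCAM_S7 pattern as quoted in this thread.
CURRENT = re.compile(
    r"(?:(?:investment|proposed|lucrative) (?:business|venture)"
    r"|(?:business|venture) (?:enterprise|propos(?:al|ition)))",
    re.I,
)

# Suggested replacement from this message.
SUGGESTED = re.compile(
    r"(?:(?:investment|proposed|lucrative) (?:business|venture)"
    r"|(?:business|venture|enterprise) propos(?:al|ition))",
    re.I,
)


def normalize(text):
    """Assumed approximation of body rendering: whitespace runs become one space."""
    return " ".join(text.split())


signature = normalize(
    "FirstName LastName\nAltice Business\nEnterprise Account Executive"
)

print(bool(CURRENT.search(signature)))    # hits on "Business Enterprise"
print(bool(SUGGESTED.search(signature)))  # no longer hits on the signature
```

Note that both patterns still match plain phrases like "business proposal", so the change only removes the "business enterprise" pairing.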


Potentially, making it a rawbody rule might avoid this FP without 
affecting its performance against the targeted spams...



For future reference: sending a sample email to the list as a bare 
attachment is problematic, as it may be altered en-route and thus 
invalidate any meaningful analysis. It's better to attach it as a 
zip/gzip, or to upload it to someplace like Pastebin and just post the URL 
to it here. (In this case, your description should probably be enough to 
figure it out without the sample so you shouldn't need to do that unless 
someone explicitly asks you to do so.)




--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174    pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The philosophy of gun control: Teenagers are roaring through
  town at 90MPH, where the speed limit is 25. Your solution is to
  lower the speed limit to 20.   -- Sam Cohen
---
 3 days until the 244th anniversary of the Declaration of Independence


Rule HK_SCAM is triggered by standard business email

2020-07-01 Thread Aner Perez

I opened a bug (7832) about this but was told to report on the SA users mailing 
list instead.

The attached email is an example which triggers the HK_SCAM rule.  Looks like __HK_SCAM_S7 
is the culprit here since it matches the words "business" and "enterprise" when they are 
found one after the other (even on different lines).


In the real world this was triggered by a business email that had the following in the 
signature:


FirstName LastName
Altice Business
Enterprise Account Executive

- Aner
--- Begin Message ---

Let's list some

Business
Enterprise

Sounds simple
--- End Message ---