Multiple embedded images rule

2013-08-09 Thread Pavel Bazika
Hello,

attached is a spam email that contains a few images embedded via img tags in 
the HTML part. 


SpamAssassin 3.3.1 on isnotspam.org reports these matches: 


0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked. See 
http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more 
information. [URIs: list-manage2.com]
1.0 DK_SIGNED DK_SIGNED 
0.4 HTML_IMAGE_RATIO_02 BODY: HTML has a low ratio of text to image area 
0.1 HTML_MESSAGE BODY: HTML included in message 
0.0 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5665] 
0.0 MIME_QP_LONG_LINE RAW: Quoted-printable line longer than 76 chars 
0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid 
-0.1 DKIM_VALID Message has at least one valid DKIM or DK signature 
0.0 LOTS_OF_MONEY Huge... sums of money 


X-Spam-Status: Yes, hits=1.5 required=-20.0 tests=BAYES_50,DKIM_SIGNED,
DKIM_VALID,DK_SIGNED,HTML_IMAGE_RATIO_02,HTML_MESSAGE,LOTS_OF_MONEY, 
MIME_QP_LONG_LINE,URIBL_BLOCKED autolearn=no version=3.3.1 
X-Spam-Score: 1.5 


I was wondering if there is some rule that will match mails with many embedded 
images. There is already T_REMOTE_IMAGE in 72_active.cf, but with no score 
assigned and it also doesn't take into account the number of images in the 
messages.

Regards

Pavel Bazika
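
A rule of the kind asked about here could be sketched as a counting sub-rule plus a meta rule. This is an illustration only: the rule names, the threshold of five img tags and the score are assumptions rather than stock rules or tested values, and the maxhits option requires a SpamAssassin version whose tflags supports it.

```
# Hypothetical local rules (e.g. for local.cf); names, threshold and score
# are illustrative assumptions, not part of the stock rule set.

# Count <img ...> tags in the raw body, where HTML tags are still present.
# "multiple" lets the sub-rule hit more than once per message; "maxhits"
# stops counting once the threshold below can no longer be affected.
rawbody  __LOCAL_IMG_TAG    /<img\s/i
tflags   __LOCAL_IMG_TAG    multiple maxhits=6

# Fire when at least five img tags were counted.
meta     LOCAL_MANY_IMAGES  (__LOCAL_IMG_TAG >= 5)
describe LOCAL_MANY_IMAGES  HTML part contains five or more img tags
score    LOCAL_MANY_IMAGES  0.5
```

The score is kept low on purpose, since legitimate newsletters also carry many images; it would need checking against local ham before being raised.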





RE: Multiple embedded images rule

2013-08-09 Thread emailitis.com
We too get a lot of these emails, which are largely image-only.  
T_REMOTE_IMAGE comes up a lot - but it is not listed in 
http://spamassassin.apache.org/tests_3_3_x.html and the previous email suggests 
there is no score attached to it.

I found somewhere before (cannot find it again) that if we put a refined score 
in brackets into local.cf, it ADDS that score to the Spamassassin default - is 
that correct?  So like this:
score T_REMOTE_IMAGE (3.5)

or do I just have to give it a new score like:
score T_REMOTE_IMAGE 3.5

Can someone remind me how to turn on verbose logging so that, for a short 
monitoring period, we can see the scores given to all our rules in maillog?

Many thanks, as ever, in advance.

Kind Regards,

Christoph Kuhle
-Original Message-
From: Pavel Bazika [mailto:pavel.baz...@icewarp.com] 
Sent: 05 August 2013 13:03
To: users@spamassassin.apache.org
Subject: Multiple embedded images rule

Hello,

attached is a spam email that contains a few images embedded via img tags in 
the HTML part. SpamAssassin 3.3.1 on isnotspam.org reports these matches:

0.0 URIBL_BLOCKED ADMINISTRATOR NOTICE: The query to URIBL was blocked. See 
http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block for more 
information. [URIs: list-manage2.com]
1.0 DK_SIGNED DK_SIGNED
0.4 HTML_IMAGE_RATIO_02 BODY: HTML has a low ratio of text to image area
0.1 HTML_MESSAGE BODY: HTML included in message
0.0 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5665]
0.0 MIME_QP_LONG_LINE RAW: Quoted-printable line longer than 76 chars
0.1 DKIM_SIGNED Message has a DKIM or DK signature, not necessarily valid
-0.1 DKIM_VALID Message has at least one valid DKIM or DK signature
0.0 LOTS_OF_MONEY Huge... sums of money

X-Spam-Status: Yes, hits=1.5 required=-20.0 tests=BAYES_50,DKIM_SIGNED,
DKIM_VALID,DK_SIGNED,HTML_IMAGE_RATIO_02,HTML_MESSAGE,LOTS_OF_MONEY,
MIME_QP_LONG_LINE,URIBL_BLOCKED autolearn=no version=3.3.1
X-Spam-Score: 1.5

I was wondering if there is some rule that will match mails with many embedded 
images. There is already T_REMOTE_IMAGE in 72_active.cf, but with no score 
assigned and it also doesn't take into account the number of images in the 
messages. What about a ruleset matching multiple images in the HTML mail part?

Regards


Pavel Bazika





RE: Multiple embedded images rule

2013-08-09 Thread Benny Pedersen

emailitis.com wrote on 2013-08-09 11:07:


score T_REMOTE_IMAGE (3.5)


soft score adjustment: 3.5 is added to the rule's existing score, which is still adjusted from the corpus on apache.org


score T_REMOTE_IMAGE 3.5


hard score enforcement: the score will not be adjusted but kept as you set it

note that the T_ prefix marks a testing rule that is not meant to be published :(
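
The two forms discussed in this message can be put side by side. According to the SpamAssassin configuration documentation, a score in parentheses is relative to the score already set for the rule, while a bare number replaces it. The value 3.5 is simply the figure used in this thread, not a recommendation.

```
# Absolute: T_REMOTE_IMAGE scores exactly 3.5, whatever its default was.
score T_REMOTE_IMAGE 3.5

# Relative: the parentheses mean "add 3.5 to the score already set".
# Shown commented out, since only one of the two forms would normally be
# used for a given rule.
# score T_REMOTE_IMAGE (3.5)
```

Because T_ rules carry only a very low default score, the two forms end up close to each other for this particular rule; the difference matters more for rules whose default score is significant.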


Re: SPF failure very low score (DKIM whitelisting and ADSP rules)

2013-08-09 Thread Mark Martinec
On Friday 09 August 2013 00:26:09 Quanah Gibson-Mount wrote:
 Ok, so I imagine I want to do something like:
 
  header DKIM_ADSP_DISCARD eval:check_dkim_adsp('D')
 
 but only for facebook.com... I don't see exactly how I tie those two
 together?


==
To add POSITIVE spam score points to mail with a From from specific
domains but with no valid DKIM signature, see 60_adsp_override_dkim.cf .
Protected domains there include ebay, paypal, bankofamerica,
amazon, linkedin, facebookmail, ...

To add domains protected from forgery (the following are already
in the default 60_adsp_override_dkim.cf set of rules):
  adsp_override birthdayalarm.com  all
  adsp_override astrology.com      all
  adsp_override linkedin.com       all
  adsp_override *.linkedin.com     all
  adsp_override facebookmail.com   all
  adsp_override *.greenpeace.org   all
  ...
These are default scores for forgery (i.e. for ADSP failures):
  score DKIM_ADSP_ALL        0 1.1 0 0.8
  score DKIM_ADSP_DISCARD    0 1.8 0 1.8
  score DKIM_ADSP_NXDOMAIN   0 0.8 0 0.9

and equivalent scores but permissive on failed mail that went through
some mailing list:
  score NML_ADSP_CUSTOM_LOW  0 0.7 0 0.7
  score NML_ADSP_CUSTOM_MED  0 1.2 0 0.9
  score NML_ADSP_CUSTOM_HIGH 0 2.6 0 2.5

If there is a need to assign a non-default score for mail from specific
domains with no valid DKIM signature, instead of adsp_override one can
add a specific rule for such domains:

  header DKIM_ADSP_ALL_YG1 eval:check_dkim_adsp('*', gmail.com, yahoo.com)
  score  DKIM_ADSP_ALL_YG1 0.1

  header DKIM_ADSP_ALL_YG2 eval:check_dkim_adsp('*', .gmail.com, .yahoo.com)
  score  DKIM_ADSP_ALL_YG2 0.1
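
Applied to the question quoted at the top of this message (the same check, but only for facebook.com), the pattern might look like the sketch below. The rule name LOCAL_ADSP_FACEBOOK and the score are hypothetical, and whether facebook.com or facebookmail.com is the right domain depends on what actually appears in the From header of the mail in question.

```
# Hypothetical local rule modelled on DKIM_ADSP_ALL_YG1 above;
# the name and the score are placeholders.
header LOCAL_ADSP_FACEBOOK eval:check_dkim_adsp('*', facebook.com)
score  LOCAL_ADSP_FACEBOOK 1.5
```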


==
To add NEGATIVE score points assigned to mail from specific domains
with valid DKIM signatures, see 60_whitelist_dkim.cf .
Benefiting domains there include ebay, paypal, cisco, hotels.com,
lufthansa, skype, several scientific newsletters, ...

Add further domains like:
  whitelist_from_dkim  *@uu.se
  whitelist_from_dkim  *@uni-bremen.de
  whitelist_from_dkim  *@tugraz.at
  whitelist_from_dkim  *@tu-graz.ac.at
  whitelist_from_dkim  *@univie.ac.at
  whitelist_from_dkim  *@univ-tours.fr
  whitelist_from_dkim  *@cern.ch
  whitelist_from_dkim  *@amazon.com
  whitelist_from_dkim  *@springer.delivery.net
  whitelist_from_dkim  *@cisco.com
  whitelist_from_dkim  *@info.hp.com
  whitelist_from_dkim  *@alert.bankofamerica.com
  whitelist_from_dkim  *@cnn.com
  whitelist_from_dkim  *@*.cnn.com
  whitelist_from_dkim  serv...@youtube.com
  whitelist_from_dkim  *@*paypal.com
  def_whitelist_from_dkim   *@yousendit.com
  def_whitelist_from_dkim   *@meetup.com
  def_whitelist_from_dkim   dailyhorosc...@astrology.com
  def_whitelist_from_dkim   *@twitter.com
  def_whitelist_from_dkim   *@*.twitter.com
  def_whitelist_from_dkim   *@*.twitter.com  twitter.com
  def_whitelist_from_dkim   *@email.creativepro.com
  def_whitelist_from_dkim   *@publicservice-mailer.co.uk

and adjust scores if necessary:
  score USER_IN_DEF_DKIM_WL -1.5
  score USER_IN_DKIM_WHITELIST -12

If there is a need to assign a non-default score for valid DKIM-signed
mail from specific domains, instead of whitelist_from_dkim one can add
a specific rule for such domains:

  full   DKIM_VALID_WEGAME eval:check_dkim_valid(email.wegame.com)
  score  DKIM_VALID_WEGAME -8


Mark





Re: uridnsbl does not work with idn domains

2013-08-09 Thread Mark Martinec
On Friday 09 August 2013 01:13:38 Benny Pedersen wrote:
 seen idn spamming urls here that is not tested in uridnsbl, have
 spamassassin 3.4.0 not idn support yet ?
 
 is it just missing tld defines for idn domains ?
 
 should it be filled a bug ?

There is currently (3.4.0) no specific IDN support,
mainly because not many of these have been observed in the wild.

If the domain found in a mail body is encoded in punycode,
I see no reason for not being subject to uridnsbl rules, and
if it really isn't it's probably a bug.

If the domain found in a mail body is in Unicode (not encoded
into punycode), such conversion is not yet implemented in
SpamAssassin. Eventually it probably should as these become
widespread, so this would be a feature request.

It would be most welcome to see concrete samples from the wild.
There may already be some IDN-related problem report,
but please do open a new one and attach your samples,
at least it can get a conversation going.

  Mark


Re: uridnsbl does not work with idn domains

2013-08-09 Thread Benny Pedersen

Mark Martinec wrote on 2013-08-09 13:49:


There is currently (3.4.0) no specific IDN support,
mainly because not many of these have been observed in the wild.


okay, created 
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6966



If the domain found in a mail body is encoded in punycode,
I see no reason for not being subject to uridnsbl rules, and
if it really isn't it's probably a bug.


i have not seen a unicode example yet


If the domain found in a mail body is in Unicode (not encoded
into punycode), such conversion is not yet implemented in
SpamAssassin. Eventually it probably should as these become
widespread, so this would be a feature request.


yes, i also have no idn perl modules installed yet on gentoo, so it is 
still missing and needs to be implemented



It would be most welcome to see concrete samples from the wild.
There may already be some IDN-related problem report,
but please do open a new one and attach your samples,
at least it can get a conversation going.


i have added sample url in the bug


Re: FSL_HELO_BARE_IP_2 rule?

2013-08-09 Thread Thomas Harold

On 8/8/2013 5:32 AM, Steve Freegard wrote:


Sure - I wrote both rules.

It's to identify hosts that HELO with a 'raw' IP e.g.

HELO 1.2.3.4

Which is not syntactically correct as per the RFC.  IP addresses used in
the HELO should be in an IP literal format:

HELO [1.2.3.4]

FSL_HELO_BARE_IP_1 looks at only the last external IP address, whereas
FSL_HELO_BARE_IP_2 looks at all external received hops.

These were supposed just to be sandbox rules, but they've been
autopromoted by the masschecker and I hadn't noticed until now.

FSL_HELO_BARE_IP_2 should probably be meta'd to only hit if
FSL_HELO_BARE_IP_1 doesn't hit to prevent a double hit if the last external
is a raw IP.

I'll create an FSL_HELO_BARE_IP_3 rule as a meta and see what the
results are tomorrow, and then I'll remove FSL_HELO_BARE_IP_2 provided
the results are satisfactory.



We have a client who is hitting these (yes we're working with them to 
try and fix it).  I haven't seen the _1 rule hit, but it is hitting the 
following rules:


X-Spam-Status: Yes, score=6.904 tagged_above=-999 required=4.5
tests=[BAYES_50=0.8, FSL_HELO_BARE_IP_2=2.699,
RCVD_IN_BRBL_LASTEXT=1.449, RCVD_NUMERIC_HELO=1.164, RDNS_NONE=0.793,
SPF_PASS=-0.001] autolearn=no

Hop #1 in their mailing output is emitting a HELO with a bare IP address 
of the style 1.2.3.4.  Hop #2 has a valid HELO, but they don't have a 
reverse DNS record.
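
The meta rule described in the quoted message, hitting on FSL_HELO_BARE_IP_2 only when FSL_HELO_BARE_IP_1 does not, might be sketched as follows. The name LOCAL_HELO_BARE_IP_EARLIER and its score are hypothetical; this is not the FSL_HELO_BARE_IP_3 rule itself, and it assumes both FSL rules are present and enabled.

```
# Hypothetical local meta: a bare-IP HELO was seen on some external hop,
# but not on the last external one.
meta     LOCAL_HELO_BARE_IP_EARLIER  (FSL_HELO_BARE_IP_2 && !FSL_HELO_BARE_IP_1)
describe LOCAL_HELO_BARE_IP_EARLIER  Bare-IP HELO on an earlier external hop only
score    LOCAL_HELO_BARE_IP_EARLIER  0.5
```

This only shows the logic of the meta. It does not change the scores of the two existing rules, so any double counting between them is unaffected.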




Re: SPF failure very low score

2013-08-09 Thread Thomas Harold

On 8/8/2013 4:49 PM, John Hardin wrote:

On Thu, 8 Aug 2013, Quanah Gibson-Mount wrote:

SPF is _by itself_ not useful as a spam sign.

If you're seeing a lot of facebook spam that fails SPF because it's
being forged, then a rule that checks SPF_FAIL *IF* the mail claims to
be from Facebook, and adds a point or two, would be more reasonable.



In our setup, we get good results from outright blocking any SPF fails 
using policyd-spf (python version) during the SMTP transaction and we've 
only had to whitelist a handful of badly configured servers.  We block 
about 4% of all inbound messages by blocking on SPF FAIL.


So I'd argue that SPF FAIL is a pretty good indicator that the message 
is very likely to be spam.  But in our setup, those messages never get 
that far.


SPF PASS, however, is not a good indicator either way.
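
For sites that score rather than block, the combination suggested in the quoted text (SPF_FAIL counted only when the mail claims to be from Facebook, adding a point or two) could be sketched like this. The rule names, the regular expression and the score are assumptions. Note also that SPF evaluates the envelope sender, while this sub-rule looks at the From header.

```
# Hypothetical local rules; names, pattern and score are illustrative.

# From header address in facebook.com or facebookmail.com, or a subdomain.
header   __LOCAL_FROM_FACEBOOK    From:addr =~ /\@(?:[a-z0-9-]+\.)*facebook(?:mail)?\.com$/i

# Only count the SPF failure when the message claims to be from Facebook.
meta     LOCAL_FACEBOOK_SPF_FAIL  (__LOCAL_FROM_FACEBOOK && SPF_FAIL)
describe LOCAL_FACEBOOK_SPF_FAIL  Claims to be from Facebook but fails SPF
score    LOCAL_FACEBOOK_SPF_FAIL  1.5
```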




Re: DHL From Russia

2013-08-09 Thread Thomas Harold

On 8/8/2013 6:12 PM, Benny Pedersen wrote:


show sample on pastebin



We see a few of these each week, not sure if they are from Russia:

http://pastebin.com/iBmELtSh
http://pastebin.com/qpxhkJbB

Sometimes they score high enough to flag as spam, other times they are 
just below the threshold.


I've debated writing a local rule to flag them as spam if the from 
address does not match what DHL uses, except I have no good samples from 
DHL.
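
A local rule of the kind being considered might start from the sketch below. Everything specific in it is an assumption: that the display name mentions DHL, that genuine mail would come from dhl.com or a subdomain (no genuine samples were available to confirm this), and the rule names and score.

```
# Hypothetical local rules; the dhl.com domain is an unverified assumption.

# Display name in the From header mentions DHL.
header   __LOCAL_FROM_NAME_DHL    From:name =~ /\bDHL\b/i

# From address is in the assumed genuine domain or one of its subdomains.
header   __LOCAL_FROM_ADDR_DHL    From:addr =~ /\@(?:[a-z0-9-]+\.)*dhl\.com$/i

# Name says DHL, address does not.
meta     LOCAL_DHL_FROM_MISMATCH  (__LOCAL_FROM_NAME_DHL && !__LOCAL_FROM_ADDR_DHL)
describe LOCAL_DHL_FROM_MISMATCH  From name mentions DHL, address not in assumed DHL domain
score    LOCAL_DHL_FROM_MISMATCH  1.0
```

The score is deliberately modest until the pattern has been checked against real DHL mail.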




Re: DHL From Russia

2013-08-09 Thread Benny Pedersen

Thomas Harold wrote on 2013-08-09 15:16:


We see a few of these each week, not sure if they are from Russia:

http://pastebin.com/iBmELtSh



Content analysis details:   (8.9 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.6 RCVD_IN_BRBL_LASTEXT   RBL: No description available.
                            [31.24.139.73 listed in bb.barracudacentral.org]
 0.1 RELAY_IT               Relayed through IT
 3.3 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                            [URIs: slppoa.org]
 0.5 SPF_NONE               SPF: sender does not publish an SPF Record
 0.0 T_HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail
                            domains are different
 0.1 STARS_ON_FORTY_FOOR    URI: contains 4 chars url at end
 0.1 STARS_ON_FORTY_SIX     URI: contains 6 chars url at end
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.5 HTML_TITLE_MISSING     Meta: !__HTML_TITLE_BEGIN && !__HTML_TITLE_END &&
                            HTML_MESSAGE
 1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS
 0.1 HTML_DOCTYPE_MISSING   Meta: !__DOCTYPE_ALL && HTML_MESSAGE
 1.3 SAGREY                 Adds score to spam from first-time senders


http://pastebin.com/qpxhkJbB



Content analysis details:   (8.9 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 1.6 RCVD_IN_BRBL_LASTEXT   RBL: No description available.
                            [62.109.30.143 listed in bb.barracudacentral.org]
 1.5 RELAY_RU               Relayed through RU
-0.0 SPF_PASS               SPF: sender matches SPF record
 2.4 DATE_IN_FUTURE_03_06   Date: is 3 to 6 hours after Received: date
 0.0 T_HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail
                            domains are different
 0.1 STARS_ON_FORTY_FOOR    URI: contains 4 chars url at end
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.5 HTML_TITLE_MISSING     Meta: !__HTML_TITLE_BEGIN && !__HTML_TITLE_END &&
                            HTML_MESSAGE
 1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS
 0.1 HTML_DOCTYPE_MISSING   Meta: !__DOCTYPE_ALL && HTML_MESSAGE
 1.3 SAGREY                 Adds score to spam from first-time senders



Sometimes they score high enough to flag as spam, other times they
are just below the threshold.


last one was over



I've debated writing a local rule to flag them as spam if the from
address does not match what DHL uses, except I have no good samples
from DHL.


could be a start, but neither example showed a forged sender here


Re: DHL From Russia

2013-08-09 Thread Neil Schwartzman

On Aug 9, 2013, at 6:16 AM, Thomas Harold thomas-li...@nybeta.com wrote:

 We see a few of these each week, not sure if they are from Russia:
 
 http://pastebin.com/iBmELtSh


Not really that difficult to block.

31.24.139.73

Senderscore of '3'(out of 100)
https://senderscore.org/lookup.php?lookup=31.24.139.73&ipLookup=Go

Email Reputation Poor
http://www.senderbase.org/lookup?search_string=31.24.139.73

Re: DHL From Russia

2013-08-09 Thread Matus UHLAR - fantomas

Thomas Harold wrote on 2013-08-09 15:16:

We see a few of these each week, not sure if they are from Russia:
http://pastebin.com/iBmELtSh


On 09.08.13 16:05, Benny Pedersen wrote:

1.6 RCVD_IN_BRBL_LASTEXT   RBL: No description available.
                           [31.24.139.73 listed in bb.barracudacentral.org]
0.1 RELAY_IT               Relayed through IT
3.3 URIBL_BLACK            Contains an URL listed in the URIBL blacklist
                           [URIs: slppoa.org]
0.5 SPF_NONE               SPF: sender does not publish an SPF Record
0.0 T_HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail
                           domains are different
0.1 STARS_ON_FORTY_FOOR    URI: contains 4 chars url at end
0.1 STARS_ON_FORTY_SIX     URI: contains 6 chars url at end
0.0 HTML_MESSAGE           BODY: HTML included in message
0.5 HTML_TITLE_MISSING     Meta: !__HTML_TITLE_BEGIN && !__HTML_TITLE_END &&
                           HTML_MESSAGE
1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS
0.1 HTML_DOCTYPE_MISSING   Meta: !__DOCTYPE_ALL && HTML_MESSAGE
1.3 SAGREY                 Adds score to spam from first-time senders



http://pastebin.com/qpxhkJbB



 1.6 RCVD_IN_BRBL_LASTEXT   RBL: No description available.
                            [62.109.30.143 listed in bb.barracudacentral.org]
 1.5 RELAY_RU               Relayed through RU
-0.0 SPF_PASS               SPF: sender matches SPF record
 2.4 DATE_IN_FUTURE_03_06   Date: is 3 to 6 hours after Received: date
 0.0 T_HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail
                            domains are different
 0.1 STARS_ON_FORTY_FOOR    URI: contains 4 chars url at end
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.5 HTML_TITLE_MISSING     Meta: !__HTML_TITLE_BEGIN && !__HTML_TITLE_END &&
                            HTML_MESSAGE
 1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS
 0.1 HTML_DOCTYPE_MISSING   Meta: !__DOCTYPE_ALL && HTML_MESSAGE
 1.3 SAGREY                 Adds score to spam from first-time senders


unfortunately RELAY_IT, RELAY_RU, STARS_ON_FORTY_FOOR, STARS_ON_FORTY_SIX and
SAGREY are not stock rules.  the RCVD_IN_BRBL_LASTEXT and URIBL_BLACK may
not apply for early recipients. 

you also seem to have modified the score for URIBL_BLACK, at least compared 
to what I have locally:


50_scores.cf:score URIBL_BLACK 0 1.775 0 1.725 # n=0 n=2

... and I have quite current scores:
-rw-r--r-- 1 debian-spamd debian-spamd 44575 Aug  9 02:23 50_scores.cf

just noticing...
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Chernobyl was an Windows 95 beta test site.


Re: DHL From Russia

2013-08-09 Thread Alex
Hi,

  1.6 RCVD_IN_BRBL_LASTEXT   RBL: No description available.
                             [62.109.30.143 listed in bb.barracudacentral.org]
  1.5 RELAY_RU               Relayed through RU
 -0.0 SPF_PASS               SPF: sender matches SPF record
  2.4 DATE_IN_FUTURE_03_06   Date: is 3 to 6 hours after Received: date
  0.0 T_HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail
                             domains are different
  0.1 STARS_ON_FORTY_FOOR    URI: contains 4 chars url at end
  0.0 HTML_MESSAGE           BODY: HTML included in message
  0.5 HTML_TITLE_MISSING     Meta: !__HTML_TITLE_BEGIN && !__HTML_TITLE_END &&
                             HTML_MESSAGE
  1.3 RDNS_NONE              Delivered to internal network by a host with no rDNS
  0.1 HTML_DOCTYPE_MISSING   Meta: !__DOCTYPE_ALL && HTML_MESSAGE
  1.3 SAGREY                 Adds score to spam from first-time senders

 unfortunately RELAY_IT, RELAY_RU, STARS_ON_FORTY_FOOR, STARS_ON_FORTY_SIX and
 SAGREY are not stock rules.  the RCVD_IN_BRBL_LASTEXT and URIBL_BLACK may
 not apply for early recipients.
 you also seem to have modified the score for URIBL_BLACK, at least compared
 to what I have locally:

 50_scores.cf:score URIBL_BLACK 0 1.775 0 1.725 # n=0 n=2

 ... and I have quite current scores:
 -rw-r--r-- 1 debian-spamd debian-spamd 44575 Aug  9 02:23 50_scores.cf

 just noticing...

... and no BAYES?

These look like the types of messages where either a specific body
pattern would be necessary, or block the IP with postfix.

Regards,
Alex


Re: DHL From Russia

2013-08-09 Thread Benny Pedersen

Alex wrote on 2013-08-09 17:27:


... and no BAYES?


yep no bayes, privacy concern


These look like the types of messages where either a specific body
pattern would be necessary, or block the IP with postfix.


well ip is not content


New spam rule for specific content

2013-08-09 Thread Amir 'CG' Caspi

Hi all,

	A number of my users have been receiving spam formatted in a 
very specific way which seems to very often miss Bayes... I don't 
know why, whether it's because of the HTML gibberish flooding Bayes 
with useless tokens (to reduce the relative strength of the spammy 
tokens), or if it's just the specific content isn't sufficiently 
spammy (or has sufficient ham to balance) to pop.
	Either way, this spam appears to be generated from a specific 
template, and I've created a rule to hit that template.  Within the 
last couple of weeks, I've had only true positives and negatives... 
no FPs, no FNs.


For your perusal, here is the rule:

# Spammy URI pattern: both a /outl and a /outi link must be present
uri      __OUTL_URI           /\/outl\b/
uri      __OUTI_URI           /\/outi\b/
meta     OUTL_OUTI_IS_SPAMMY  (__OUTL_URI && __OUTI_URI)
describe OUTL_OUTI_IS_SPAMMY  /outl + /outi link combo is highly spammy
score    OUTL_OUTI_IS_SPAMMY  3

If you don't specifically trust URI rules to not have FPs, I have a 
rawbody version of this which works identically... in all cases, both 
rules pop together, so I think there's no specific need to use the 
rawbody version, but I can provide it if needed.


I recommend this rule be added to the general distribution.

(Like many other users here, I've also increased the Bayes scores for 
Bayes99, and created a Bayes999 with even higher scoring... it might 
be time to add that to the general distribution, too.)


Hope this helps...

--- Amir


Re: New spam rule for specific content

2013-08-09 Thread John Hardin

On Fri, 9 Aug 2013, Amir 'CG' Caspi wrote:

	A number of my users have been receiving spam formatted in a very 
specific way which seems to very often miss Bayes...


Can you provide a spample or two?


I recommend this rule be added to the general distribution.


They can be added but unless such spams appear in the masscheck corpora 
the rules won't be scored and distributed.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  The first time I saw a bagpipe, I thought the player was torturing
  an octopus. I was amazed they could scream so loudly.
-- cat_herder_5263 on Y! SCOX
---
 6 days until the 68th anniversary of the end of World War II


Re: New spam rule for specific content

2013-08-09 Thread RW
On Fri, 9 Aug 2013 11:19:08 -0600
Amir 'CG' Caspi wrote:

   A number of my users have been receiving spam formatted in a 
 very specific way which seems to very often miss Bayes... I don't 
 know why, whether it's because of the HTML gibberish flooding Bayes 
 with useless tokens (to reduce the relative strength of the spammy 
 tokens), or if it's just the specific content isn't sufficiently 
 spammy (or has sufficient ham to balance) to pop.

BAYES works on rendered text; it doesn't see the HTML.


 (Like many other users here, I've also increased the Bayes scores for 
 Bayes99, and created a Bayes999 with even higher scoring... it might 
 be time to add that to the general distribution, too.)

Do you actually get a significant amount of ham between 0.99 and 0.999?
Personally I only get 1 in 1000 above 0.55, and nothing above 0.65.


Re: New spam rule for specific content

2013-08-09 Thread Amir 'CG' Caspi
On Fri, August 9, 2013 1:01 pm, RW wrote:
 BAYES works on rendered text; it doesn't see the HTML.

Hmmm.  It doesn't see HTML comments, which would appear in rendered HTML
source even though they are invisible?  OK, in that case, I have NO idea
why the spam isn't hitting Bayes, because it looks pretty damn spammy to
me.  I wonder if it's the heavy use of images, but I don't know.

 Do you actually get a significant amount of ham between 0.99 and 0.999?
 Personally I only get 1 in 1000 above 0.55, and nothing above 0.65.

Ham, absolutely not.  So yes, I suppose I could just treat all Bayes99 as
if it were Bayes999 and score it more highly than I do.  Right now I have
Bayes99 at 4, Bayes999 at 4.5.  I could eliminate Bayes999 and make
Bayes99 score 4.5... but I do worry a little bit about FPs, even though I
guess I shouldn't, statistically speaking.

On the other hand, one could consider making Bayes999 a poison pill. 
Generally spam will only rank there if you've learned something nearly
identical to it.  At that point, perhaps it might be worth just scoring it
with 5 or higher (assuming your threshold is 5, as mine is).
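
One way to express this kind of split in local configuration is sketched below. It is an assumption about how such rules could be laid out, not the actual local rules described in this message: the ranges are made non-overlapping so that only one of the two rules hits, and the scores 4.0 and 4.5 are the figures mentioned above.

```
# Hypothetical local override: BAYES_99 narrowed to 0.99 .. 0.999,
# BAYES_999 covering 0.999 .. 1.00, so the two never hit together.
body     BAYES_99   eval:check_bayes('0.99', '0.999')
tflags   BAYES_99   learn
describe BAYES_99   BODY: Bayes spam probability is 99 to 99.9%
score    BAYES_99   4.0

body     BAYES_999  eval:check_bayes('0.999', '1.00')
tflags   BAYES_999  learn
describe BAYES_999  BODY: Bayes spam probability is 99.9 to 100%
score    BAYES_999  4.5

# "Poison pill" variant: score at or above a required threshold of 5.
# score  BAYES_999  5.0
```

If BAYES_99 is instead left with its original range up to 1.00, both rules hit on the same message and their scores add, so BAYES_999 would then need only the difference.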

--- Amir