Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

Hi all,


as a result of the recent "2+2 != 4" discussion on the list, here is a 
new plugin, which tries to learn ham/spam classification only by knowing 
which rules triggered and which did not. This is, so to say, an 
automatic meta rule.


The plugin is currently experimental and can only be checked out from 
SVN at:


   https://svn.own-hero.net/sysadmin/MetaSVM/trunk


For now I recommend not using it in a production environment, as it is 
still untested (apart from my own tests).
In order to use the plugin, you need to train your own model, which 
requires a certain amount of ham/spam.


I evaluated the plugin with my own ham/spam corpus (roughly 5000 spam, 
3000 ham) and the resulting model did not produce false positives with 
respect to the default scoring, but it caught approx. 30% of the mails 
that were not caught by SA itself. I'll probably release more detailed 
numbers in some whitepaper soon :)



Best regards,


Chris






Re: SPF_NEUTRAL scoring?

2009-03-13 Thread Kai Schaetzl
Rw wrote on Thu, 12 Mar 2009 13:59:56 +:

> You get the neutral result if you don't get a match in any of the terms,
> so won't adding ~all or -all on the end simply turn neutral into
> [soft]fail?

No. I assume you get that neutral because of ~all. And you get that ~all 
because it is the default in case it's missing. -all is *very* different 
from that.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: SPF_NEUTRAL scoring?

2009-03-13 Thread RW
On Fri, 13 Mar 2009 14:31:17 +0100
Kai Schaetzl  wrote:

> Rw wrote on Thu, 12 Mar 2009 13:59:56 +:
> 
> > You get the neutral result if you don't get a match in any of the
> > terms, so won't adding ~all or -all on the end simply turn neutral
> > into [soft]fail?
> 
> No. I assume you get that neutral because of ~all. And you get that
> ~all because it is the default in case it's missing. -all is *very*
> different from that.

According to RFC 4408: 

   If none of the mechanisms match and there is no "redirect" modifier,
   then the check_host() returns a result of "Neutral", just as if
   "?all" were specified as the last directive. 

There are two distinct problems here. One is that the spf record was
not producing a proper fail on servers that aren't authorised to send,
the other is that his local mail is not passing the spf test and causing
FPs. 

My point was that just fixing the first problem is likely to exacerbate
the second since neutral scores less than softfail.
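
The fall-through behaviour quoted from RFC 4408 above can be illustrated with a minimal sketch. It only models how qualifiers map to results and what happens when nothing matches; the function names and the toy matcher are assumptions made for illustration, and no DNS or IP matching is attempted.

```python
# Qualifier -> result mapping from RFC 4408; "+" is the implicit default.
QUALIFIER_RESULT = {"+": "Pass", "-": "Fail", "~": "SoftFail", "?": "Neutral"}


def spf_result(directives, matches):
    """Return the SPF result for an ordered list of directives.

    `directives` is a list of strings such as "mx", "-all" or "~all".
    `matches` is a hypothetical callback that says whether a mechanism
    matches the sending host; real DNS and IP checks are out of scope.
    """
    for directive in directives:
        if directive[0] in QUALIFIER_RESULT:
            qualifier, mechanism = directive[0], directive[1:]
        else:
            qualifier, mechanism = "+", directive
        if matches(mechanism):
            return QUALIFIER_RESULT[qualifier]
    # Nothing matched and there is no "redirect": same as a trailing "?all".
    return "Neutral"


def toy_matches(mechanism):
    # Toy matcher for an unauthorised sender: only "all" matches.
    return mechanism == "all"


print(spf_result(["mx"], toy_matches))          # record without any "all"
print(spf_result(["mx", "~all"], toy_matches))  # record ending in ~all
print(spf_result(["mx", "-all"], toy_matches))  # record ending in -all
```

Under these assumptions an unauthorised sender gets Neutral from a record with no "all", and SoftFail or Fail once ~all or -all is appended, which is the shift in scoring discussed above.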


Re: SPF_NEUTRAL scoring?

2009-03-13 Thread Matus UHLAR - fantomas
> Rw wrote on Thu, 12 Mar 2009 13:59:56 +:
> > You get the neutral result if you don't get a match in any of the terms,
> > so won't adding ~all or -all on the end simply turn neutral into
> > [soft]fail?

On 13.03.09 14:31, Kai Schaetzl wrote:
> No. I assume you get that neutral because of ~all. And you get that ~all 
> because it is the default in case it's missing. -all is *very* different 
> from that.

There is no ~all in his spf record.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Remember half the people you know are below average. 


Re: spamassassin replacetags

2009-03-13 Thread Matus UHLAR - fantomas
On 13.03.09 16:38, Schwaller Remo wrote:
> in the post i found under
> http://mail-archives.apache.org/mod_mbox/spamassassin-users/200804.mbox/%3c20080424105404.gb5...@fantomas.sk%3e
> you describe the false positives which can happen with the replacetags
> rules in spamassassin. in german we suffer from the same problem.
> may i ask what kind of workaround you finally implemented?

none yet. I of course CAN lower the score, or create meta score for SA score
and correct word (which can misfire too), but I was searching for more
general resolution, like combining with languages guessed, or comparing the
text that matched with words in slovak that can match.

I'm Cc:-ing to SA-users because I think someone can still help us with this
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Eagles may soar, but weasels don't get sucked into jet engines. 


Re: Experimental Plugin: MetaSVM

2009-03-13 Thread BChasm
This sounds like excellent work.  Please do keep us informed about release
to the public and such.

On Fri, Mar 13, 2009 at 8:33 AM, decoder  wrote:

> Hi all,
>
>
> as a result of the recent "2+2 != 4" discussion on the list, here is a new
> plugin, which tries to learn ham/spam classification only by knowing which
> rules triggered and which did not. This is, so to say, an automatic meta
> rule.
>
> The plugin is currently experimental and can only be checked out from SVN
> at:
>
>   https://svn.own-hero.net/sysadmin/MetaSVM/trunk
>
>
> For now I recommend to not use it in production environment, as it is still
> untested (except that I tested it).
> In order to use the plugin, you need to train your own model, which
> requires a certain amount of ham/spam.
>
> I evaluated the plugin with my own ham/spam corpus (roughly 5000 spam, 3000
> ham) and the resulting model did not produce false positives with respect to
> the default scoring, but it caught approx. 30% of the mails that were not
> caught by SA itself. I'll probably release more detailed numbers in some
> whitepaper soon :)
>
>
> Best regards,
>
>
> Chris
>
>
>


-- 
http://beckoningchasm.com


RE: Spamd still running as root?

2009-03-13 Thread RobertH
 
> 
> I suggested to read up on "sitewide bayes". Did you?
> 
> > ls -axl /usr/local/virtual/ash...@example.com/
> 
> This stuff is not of interest to SA at all. The bayes db and 
> the AWL is. 
> If you cannot change ownership of that directory or of the db 
> files, you have to move them elsewhere. Cut the connection 
> between spamd and your virtual users that hangs in your mind.
> 
> Again: I suggested to read up on "sitewide bayes". Did you?
> 
> Kai
> 
> --
> Kai Schätzl

kai,

yes, it is of interest.

in using sa-learn, it should be called by the proper SA processing account
i.e. UID/GID so that sa-learn can process the files and save the results in
the proper place for the system SA files, right?

it appeared that he was using the vpopmail user (and whatever GID) to run
sa-learn and get it to function.

i am guessing that he does not run SA as the vpopmail user, although i could
be wrong.

even so, UID and GID matter when running sa-learn, and if it cannot read the
files it is processing because they do not have the required perms or
UID/GID, then it will fail.

that is what i was addressing.

 - rh



Re: SPF_NEUTRAL scoring?

2009-03-13 Thread Kai Schaetzl
Matus UHLAR - fantomas wrote on Fri, 13 Mar 2009 16:17:29 +0100:

> There is no ~all in his spf record.

I was assuming that a missing "all" might trigger this NEUTRAL (I haven't 
seen a single example without it yet). That's wrong, it seems.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com





Re: spamassassin replacetags

2009-03-13 Thread John Hardin

On Fri, 13 Mar 2009, Matus UHLAR - fantomas wrote:

> On 13.03.09 16:38, Schwaller Remo wrote:
> > in the post i found under
> > http://mail-archives.apache.org/mod_mbox/spamassassin-users/200804.mbox/%3c20080424105404.gb5...@fantomas.sk%3e
> > you describe the false positives which can happen with the replacetags
> > rules in spamassassin. in german we suffer from the same problem.
> > may i ask what kind of workaround you finally implemented?
>
> none yet. I of course CAN lower the score, or create meta score for SA 
> score and correct word (which can misfire too), but I was searching for 
> more general resolution, like combining with languages guessed, or 
> comparing the text that matched with words in slovak that can match.
>
> I'm Cc:-ing to SA-users because I think someone can still help us with 
> this

Right at the moment it would be a meta rule or adding some exclusions to 
the replacetags rule itself.


Long-term suggestion: a TFLAGS option to specify which languages a rule is 
valid for? Being able to put something like "TFLAGS xx languages=en" on 
replacetags rules sounds like a Good Idea to me - assuming, of course, 
that there's a fairly reliable way to determine the language a message 
uses...


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Ignorance doesn't make stuff not exist.   -- Bucky Katt
---
 Tomorrow: Albert Einstein's 130th Birthday


bayes_toks.expire problem

2009-03-13 Thread Savoy, Jim
 

Hi all,

I am running SA 3.2.5 and exim 4.69 on a RedHat Enterprise Linux box
(release 4, Nahant Update 6).

I noticed today that our /var/log/maillog was spewing out a lot of:

"cannot open bayes databases /var/spool/spamassassin/bayes_* R/W:
lock failed: Interrupted system call"

errors. I checked the /var/spool/spamassassin directory and saw that it
had over 300 large (10-to-20 meg) bayes_toks.expire files in it. I
google-searched this and found that several others had this problem, so I
attempted to resolve it the same way they did. Only it didn't work for
me, and now I have had to shut down bayes.

Please tell me where I went wrong!

First, I did an "ls -l -t" on the directory (to sort it by modification
time), and then manually deleted all of the bayes_toks.expirennn files,
keeping only the ones that were created in the last half hour. That
left me with the bayes_seen, bayes_journal and bayes_toks files (the
latter of which is 84 megs in size) and 5 of the bayes_toks.expire files.

Next I did:   sa-learn  -D  --force-expire.

This ran rather quickly (wasn't expecting that, as I was led to believe
that my 84 meg bayes_toks file would slow it down) and produced these
results:

 

[clip]
[23409] dbg: bayes: tie-ing to DB file R/O /var/spool/spamassassin/bayes_toks
[23409] dbg: bayes: tie-ing to DB file R/O /var/spool/spamassassin/bayes_seen
[23409] dbg: bayes: found bayes db version 3
[23409] dbg: bayes: opportunistic call attempt skipped, found fresh running expire magic token
[23409] dbg: config: score set 3 chosen.
[23409] dbg: learn: initializing learner
[23409] dbg: bayes: bayes journal sync starting
[23409] dbg: locker: safe_lock: created /var/spool/spamassassin/bayes.mutex
[23409] dbg: locker: safe_lock: trying to get lock on /var/spool/spamassassin/bayes with 300 timeout
[23409] dbg: locker: safe_lock: timed out after 300 seconds
bayes: cannot open bayes databases /var/spool/spamassassin/bayes_* R/W: lock failed: Interrupted system call
[23409] dbg: bayes: bayes journal sync completed
[23409] dbg: bayes: expiry starting
[23409] dbg: locker: safe_lock: created /var/spool/spamassassin/bayes.mutex
[23409] dbg: locker: safe_lock: trying to get lock on /var/spool/spamassassin/bayes with 300 timeout
[23409] dbg: locker: safe_lock: timed out after 300 seconds
bayes: cannot open bayes databases /var/spool/spamassassin/bayes_* R/W: lock failed: Interrupted system call
[23409] dbg: bayes: expiry completed
[23409] dbg: bayes: untie-ing

 

 

(it sat for 3 minutes at each of the two "300 timeout" warnings).

I don't think it did anything though, and I am still getting bayes errors
galore in /var/log/maillog. Plus the 5 bayes_toks.expire files I left
behind are still there.

Also, I read that the 300 second timeout might be the problem: that it
is not giving bayes enough time to complete an expiry. It was recommended
that it be raised to 3000 or more. But I cannot find that 300 value
anywhere (I looked in spamd, all of the .cf files for SpamAssassin (in
/var/share and the local.cf) and also in the exim config files that call
SpamAssassin). Where is it? Maybe if I changed that to 3000 and re-ran
sa-learn, all would be well?

Any advice would be greatly appreciated. We are running bayesless for
now. Thanks!

 - jim -



Re: bayes_toks.expire problem

2009-03-13 Thread John Hardin

On Fri, 13 Mar 2009, Savoy, Jim wrote:


[23409] dbg: locker: safe_lock: trying to get lock on
/var/spool/spamassassin/bayes with 300 timeout
[23409] dbg: locker: safe_lock: timed out after 300 seconds

(it sat for 3 minutes at each of the two "300 timeout" warnings).


Disable auto-expiry in your config, _restart spamd_, and try the manual 
expire run again. The manual expire run was locked out by an 
already-running auto-expire (which no doubt failed).


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  When designing software, any time you think to yourself "a user
  would never be stupid enough to do *that*", you're wrong.
---
 Tomorrow: Albert Einstein's 130th Birthday


RE: bayes_toks.expire problem

2009-03-13 Thread Savoy, Jim
On Fri, 13 Mar 2009, Savoy, Jim wrote:

>> [23409] dbg: locker: safe_lock: trying to get lock on
>> /var/spool/spamassassin/bayes with 300 timeout
>> [23409] dbg: locker: safe_lock: timed out after 300 seconds
>>
>> (it sat for 3 minutes at each of the two "300 timeout" warnings).

> John Hardin wrote:
>
> Disable auto-expiry in your config, _restart spamd_, and try the manual
> expire run again. The manual expire run was locked out by an
> already-running auto-expire (which no doubt failed).

Brilliant John! It ran in about 10 minutes, produced much interesting
output and reduced the bayes_toks file from 84 meg to 19 meg. Thanks!
All is well now.

When you said "disable auto_expiry" I wasn't sure what you meant. I
googled that and then tried adding "bayes_auto_expire 0" to my local.cf
file (even though I could not find any command like that anywhere in the
other .cf files or in /usr/share's .cf files). So I hope that is what you
meant. I will also assume that by adding that command, the
"bayes_expiry_max_db_size 30" that I added previously will no longer
work, right?

To avoid this big mess happening again in the future, I will leave that
command in the local.cf file forever, and run an "sa-learn --force-expire"
by cron each night.

Thanks again!

 - jim -



whitelist pattern problem

2009-03-13 Thread Linda Walsh

I get many emails addressed to internal sendmail 's.
 123...@mydomain
 1abd56.ef7...@mydomain


(seem to fit a basic pattern but don't know how to specify the
pattern (or I don't have it right):
 <(start of an email-address)>[0-9][0-9a-fa-f\@mydomain

by start of an email, addr, I mean inside or outside literal '<>'.
I try matching to '<' as a start char to look for anything starting
with a number, but that fails if they don't use the "name "
format, but just use "x...@yy".  Don't know how to root at beginning
of any email address looking thing.

I know the pattern matcher in the userprefs file is primitive though
-- like DOS level file matching, so I don't know how to write
it in userprefs...

any hints would be appreciated...
running slightly older SA 3.1.7 on perl 5.8.8

intending to update ... eventually but don't know that this would
solve any pattern help

Thanks,
-linda


spamassassin: hosting service/cpanel problems user_prefs partially ignored -updated-

2009-03-13 Thread Dennis German
Updated, Thought you all might be interested  ( see  updates)

My intention is to observe false negatives (i.e. spam seen as ham) and
increase the score of one or more of the tests in an effort to cause  
additional spam to be detected.

I am using a hosting service where spamassassin configuration 
is  updatable by the cPanel system.
I can also modify ~/.spamassassin/user_prefs directly.
When I list /etc there is no mail directory 
(however I believe I am not looking at the true /etc )
...
When I modify ~/.spamassassin/user_prefs to include:

report_contact postmas...@real-world-systems.com
report_hostname Real-World-Systems.com
required_score 4
score URIBL_JP_SURBL 5 #was 1.5 
score URIBL_SBL 5  #was 1.5 
score URIBL_SC_SURBL 5 #was 1.5 
score URIBL_WS_SURBL 5 #was 1.5 


spam messages subject are correctly modified to indicate *SPAM* and
the X-SPAM-Report is correctly inserted with the revised hostname and  
contact and
scores for URIBL_* are increased to 5  
and
includes the message preview and  ((note 4.0 required))

"  Content analysis details:   (4.0 points, 4.0 required)
pts rule name  description
 --  
0.9 RCVD_IN_SORBS_DUL  RBL: SORBS: sent directly from
dyna
  ...
  X-Spam-Flag: YES

The report is preceded by:
X-Spam-Status: Yes, score=4.0
X-Spam-Score: 40
X-Spam-Bar: 

There is no X-Spam-Checker-Version header which the documentation at
http://spamassassin.apache.org/full/3.1.x/doc/Mail_SpamAssassin_Conf.html
says cannot be removed.

THE PROBLEMS:

1)Messages that are not flagged as spam have
X-Spam-Status: No, score=-0.7
X-Spam-Score: -6
X-Spam-Bar: /
X-Spam-Flag: NO


Apparently these messages are added by a module in cPanel which uses
SpamAssassin APIs to process the email.

2) adding
 add_header all _TESTS(,)_
  has no effect on ham or spam.


3) adding
add_header all  DGG DGG
add_header ham  DGG DGG
add_header spam DGG DGG
has no effect on either spam or ham


Attempting to add headers via cpanel produces only
add_header all
add_header ham
add_header spam

Is my syntax for 3) correct?



RE: bayes_toks.expire problem

2009-03-13 Thread John Hardin

On Fri, 13 Mar 2009, Savoy, Jim wrote:

> On Fri, 13 Mar 2009, Savoy, Jim wrote:
>
> > > [23409] dbg: locker: safe_lock: trying to get lock on
> > > /var/spool/spamassassin/bayes with 300 timeout
> > > [23409] dbg: locker: safe_lock: timed out after 300 seconds
> > >
> > > (it sat for 3 minutes at each of the two "300 timeout" warnings).
>
> > John Hardin wrote:
> >
> > Disable auto-expiry in your config, _restart spamd_, and try the manual 
> > expire run again. The manual expire run was locked out by an 
> > already-running auto-expire (which no doubt failed).
>
> Brilliant John! It ran in about 10 minutes, produced much interesting
> output and reduced the bayes_toks file from 84 meg to 19 meg. Thanks!
> All is well now.

Glad to hear it.

> When you said "disable auto_expiry" I wasn't sure what you meant. I 
> googled that and then tried adding "bayes_auto_expire 0" to my local.cf 
> file (even though I could not find any command like that anywhere in the 
> other .cf files or in /usr/share's .cf files). So I hope that is what 
> you meant.

Yes.

> I will also assume that by adding that command, the 
> "bayes_expiry_max_db_size 30" that I added previously will no longer 
> work, right?

I'm not sure, but I would expect that the manual expiry uses it as well. 
Leave it in if you want that size limit.

> To avoid this big mess happening again in the future, I will leave that 
> command in the local.cf file forever, and run an "sa-learn 
> --force-expire" by cron each night.

That's the general consensus for large databases.

> Thanks again!

Pleased to be of assistance.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Vista: because the audio experience is *far* more important than
  network throughput.
---
 Tomorrow: Albert Einstein's 130th Birthday


RE: whitelist pattern problem

2009-03-13 Thread Bowie Bailey
Linda Walsh wrote:
> I get many emails addressed to internal sendmail 's.
>   123...@mydomain
>   1abd56.ef7...@mydomain
> 
> 
> (seem to fit a basic pattern but don't know how to specify the
> pattern (or I don't have it right):
>   <(start of an email-address)>[0-9][0-9a-fa-f\@mydomain
> 
> by start of an email, addr, I mean inside or outside literal '<>'.
> I try matching to '<' as a start char to look for anything starting
> with a number, but that fails if they don't use the "name "
> format, but just use "x...@yy".  Don't know how to root at beginning
> of any email address looking thing.

I think this is what you are looking for (untested):

header MY_NUMBER_EMAIL To:addr =~ /^\d[0-9a-f.]+\@mydomain/i

Look in the "Rule Definition" section of the man page for
Mail::SpamAssassin::Conf for more info on the ':addr' option.

> I know the pattern matcher in the userprefs file is primitive though
> -- like DOS level file matching, so I don't know how to write
> it in userprefs...

user_prefs uses the exact same pattern matching as the rest of SA (Perl
regexps).  It is anything but primitive.

The caveat being that rule definitions are not allowed in user_prefs
files unless you allow it by putting this in your local.cf:

allow_user_rules 1

> any hints would be appreciated...
> running slightly older SA 3.1.7 on perl 5.8.8
> 
> intending to update ... eventually but don't know that this would
> solve any pattern help

Shouldn't make any difference for this.

-- 
Bowie
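
For anyone who wants to try the pattern outside SA first, here is a small sketch in Python rather than Perl. The exact character class is an assumption (a leading digit followed by hex digits or dots), "mydomain" is the placeholder from the thread, and the sample addresses are made up because the ones quoted above are truncated. It only checks a bare address, which is roughly what the ':addr' option is described as handing to the rule.

```python
import re

# Assumed pattern: a leading digit, then hex digits or dots, then the domain.
# "mydomain" is the placeholder domain used in the thread.
NUMERIC_LOCAL_PART = re.compile(r"^\d[0-9a-f.]+@mydomain$", re.IGNORECASE)


def is_numeric_address(addr):
    """Return True if a bare address (no display name, no <>) fits the pattern."""
    return NUMERIC_LOCAL_PART.match(addr) is not None


# Hypothetical sample addresses, not taken from the thread.
for sample in ["12345@mydomain", "1ABD56.ef70@mydomain", "linda@mydomain"]:
    print(sample, is_numeric_address(sample))
```

Anchoring on the start of the bare address avoids having to guess whether the header used the "name <addr>" form or just the address.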


Re: Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

AlexB wrote:

> Chris
>
> From the README it's not quite clear: will this work in "autolearn"?

If you mean that the plugin can automatically learn with the autolearn 
setting, the answer is no.

> would it be enough to create the model.* files or is it a must to feed 
> it?

You create one model file once by feeding it a large corpus of ham+spam. 
Once you have done that and evaluated it as described in the README, the 
model should work accurately enough for your mail gateway, and I expect 
it to work for a long time, mainly because it doesn't depend that much 
on the type of spam (i.e. the results that the model produces are 
assumed to be more generalizable than, for example, your bayes db).

> In cases of busy gateways, where manual training is highly impractical, 
> it would need to feed itself with headers from SA report's score >X

The problem is that feeding does not work with an SVM algorithm. You 
have to train on the _whole_ set _always_, so feeding mails is impractical.

That's why you do this process _once_ with a lot of ham and spam. You 
can repeat this process any time but it isn't necessary to do this 
permanently.

It is to be expected that the model accuracy will decrease with time 
(a) because your rules change and b) because spam changes), but I think 
this is a slow process.

It has yet to be evaluated how well the model performs over time :)



Best regards,


Chris




Re: SPF_NEUTRAL scoring?

2009-03-13 Thread Matus UHLAR - fantomas
On 13.03.09 14:31, Kai Schaetzl wrote:
> No. I assume you get that neutral because of ~all. And you get that ~all
> because it is the default in case it's missing. -all is *very* different
> from that.


> Matus UHLAR - fantomas wrote on Fri, 13 Mar 2009 16:17:29 +0100:
> > There is no ~all in his spf record.

On 13.03.09 18:31, Kai Schaetzl wrote:
> I was assuming that a missing "all" might trigger this NEUTRAL (I haven't 
> seen a single example without it yet). That's wrong, it seems.

Well, in such case you probably meant something different than you wrote.

missing "all" is understood as ?all which means neutral.

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I intend to live forever - so far so good. 


Re: Experimental Plugin: MetaSVM

2009-03-13 Thread John Hardin

On Fri, 13 Mar 2009, decoder wrote:

> You create one model file once by feeding it a large corpus of ham+spam.
>
> The problem is that feeding does not work with an SVM algorithm. You 
> have to train on the _whole_ set _always_, so feeding mails is 
> unpractical.
>
> That's why you do this process _once_ with a lot of ham and spam. You 
> can repeat this process any time but it isn't necessary to do this 
> permanently.

I assume it learns from full message corpora? And all it cares about is the 
rules that hit?

Per my earlier suggestion of learning off the logs + corpora to fix FP/FN, 
could there be an option to learn off generated minimal corpora files, with 
their structure being just the rules hit per message (msgid + hits on 
one possibly very long line)? e.g.:

 
BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL

Then an external tool could generate and maintain these files from the SA 
log and the maintained training corpora, omitting FP/FN from the log data.

This is just intended to include in training the high- and low-scoring 
(obviously spam/ham) messages, which may not appear in the training corpora 
if training is mostly exception-based.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  It is not the place of government to make right every tragedy and
  woe that befalls every resident of the nation.
---
 Tomorrow: Albert Einstein's 130th Birthday


Re: Experimental Plugin: MetaSVM

2009-03-13 Thread Marc Perkel
I'm going to bet that there will be static meta rules that will be 
discovered that can be just added to spamassassin. I'm interested in how 
this plays out. I'm very optimistic.


Re: Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

John Hardin wrote:

> I assume it learns from full message corpora? And all it cares about is 
> the rules that hit?
>
> Per my earlier suggestion of learning off the logs + corpora to fix 
> FP/FN, could there be an option to learn off generated minimal corpora 
> files, with their structure being just the rules hit per message 
> (msgid + hits on one possibly very long line)? e.g.:
>
> BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL

Yes, this is certainly possible. Basically, all the algorithm requires for 
the SVM is the rules that hit and the classification (ham or spam) 
(actually the rules that did not hit are fed into the SVM as well, but 
they are taken from the global rules file underlying the model). The 
tool additionally requires the score to evaluate FP/FN properly when 
testing the model, and the message id would be helpful to find false 
positives if one wants to investigate. So you are right, all this info 
would be enough and I can easily modify the tool to use this kind of 
format. I'll try to come up with a code modification to switch the input 
format :)

> Then an external tool could generate and maintain these files from the 
> SA log and the maintained training corpora, omitting FP/FN from the log 
> data.

Yes, that's a good idea, certainly better than learning directly from 
the mail, which might be scattered around several mailboxes. However, how 
do you want to exclude FP/FNs? The log certainly doesn't provide this 
information. On the other hand, having some false positives in the 
training data did not spoil my results. The algorithm even predicted 
these correctly as spam later on :)




Cheers,


Chris
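
To make the input described above concrete, here is a minimal sketch of how one line of rule hits could be turned into a feature vector in which rules that did not hit appear explicitly as zeros. The line layout (label, message id, comma-separated rules), the function names and the short rule list are assumptions for illustration only; the plugin's real file format and code are not shown in this thread.

```python
def parse_line(line):
    """Split 'label msgid RULE_A,RULE_B' into (label, msgid, set of rules)."""
    label, msgid, rules = line.split(None, 2)
    return label, msgid, set(rules.strip().split(","))


def to_vector(hits, all_rules):
    """Encode hits as 1/0 over the global rule list, so misses are explicit."""
    return [1 if rule in hits else 0 for rule in all_rules]


# Hypothetical stand-in for the global rules file underlying the model.
ALL_RULES = ["BAYES_99", "RAZOR2_CHECK", "URIBL_BLACK", "URIBL_JP_SURBL"]

label, msgid, hits = parse_line("spam msg-0001 BAYES_99,URIBL_BLACK")
vector = to_vector(hits, ALL_RULES)
print(label, msgid, vector)
```

A list of such vectors plus their ham/spam labels is the kind of input an SVM trainer would take; the training step itself is left out here.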




Re: whitelist pattern problem in userpref-whitelisting

2009-03-13 Thread Linda Walsh

Does the below apply to the
~/.spamassassin/user_prefs
   whitelisting (command, keyword or feature)...

Sorry...it was the whitelisting in the user_prefs file that I was talking
about with the "primitive pattern matching".

At one point it was limited to DOS-like file-matching patterns,
not the full Perl regexp set (of which the example you gave
me below would be an excellent example!) ...

I don't see 'header' as a usable line in "user_prefs".


thanks,
-linda


Bowie Bailey wrote:

Linda Walsh wrote:
> I get many emails addressed to internal sendmail 's.
>   123...@mydomain,  1abd56.ef7...@mydomain
> (seem to fit a basic pattern but don't know how to specify the
> pattern (or I don't have it right)):
>   <(start of an email-address)>[0-9][0-9a-fa-f\@mydomain
>
> by start of an email, addr, I mean inside or outside literal '<>'.
> I try matching to '<' as a start char to look for anything starting
> with a number, but that fails if they don't use the "name "
> format, but just use "x...@yy".  Don't know how to root at beginning
> of any email address looking thing.

I think this is what you are looking for (untested):

header MY_NUMBER_EMAIL To:addr =~ /^\d[0-9a-f.]+\@mydomain/i

Look in the "Rule Definition" section of the man page for
Mail::SpamAssassin::Conf for more info on the ':addr' option.

> I know the pattern matcher in the userprefs file is primitive though
> -- like DOS level file matching, so I don't know how to write
> it in userprefs...

user_prefs uses the exact same pattern matching as the rest of SA (Perl
regexps).  It is anything but primitive.

The caveat being that rule definitions are not allowed in user_prefs
files unless you allow it by putting this in your local.cf:

allow_user_rules 1

> any hints would be appreciated...
> running slightly older SA 3.1.7 on perl 5.8.8
>
> intending to update ... eventually but don't know that this would
> solve any pattern help

Shouldn't make any difference for this.



Re: Experimental Plugin: MetaSVM

2009-03-13 Thread John Hardin

On Fri, 13 Mar 2009, decoder wrote:

> John Hardin wrote:
>
> > BAYES_99,FORGED_RCVD_HELO,L_SOME_STD_PROBS,RAZOR2_CF_RANGE_51_100,RAZOR2_CF_RANGE_E4_51_100,RAZOR2_CF_RANGE_E8_51_100,RAZOR2_CHECK,RBL_PSBL_01,RCVD_IN_BRBL,RCVD_IN_NJABL_SPAM,SARE_FROM_SPAM_MONEY2,STOX_30,URIBL_BLACK,URIBL_JP_SURBL,URIBL_WS_SURBL
>
> Yes this is certainly possible. Basically all the algorithm requires for the 
> SVM is the rules that hit and the classification (ham or spam) (actually the 
> rules that did not hit are fed into the SVM as well, but they are taken from 
> the global rules file underlying the model). The tool additionally requires 
> the score to evaluate FP/FN properly when testing the model,

It needs the score, and not just Y/N Spam/Ham (i.e. from which corpus file 
it came)?

> > Then an external tool could generate and maintain these files from the SA
> > log and the maintained training corpora, omitting FP/FN from the log data.
>
> Yes, that's a good idea, certainly better than learning directly from the 
> mail which might be scattered around several mailboxes. However, how do you 
> want to exclude FP/FNs? The log certainly doesn't provide this information.

I was thinking you'd generate a ham file and a spam file from the log, 
possibly dynamically appending rows as messages are processed. Naturally 
this would contain FPs and FNs.

You'd have a routine to extract the ham file from your full ham 
corpus/corpora, and likewise for spam. The assumption is any FP or FN would 
be placed into these corpora for normal bayes training.

The tool would then combine them, omitting from the log-generated files 
any msgid that appears in the training corpora files. You'd end up with one 
clean spam file and one clean ham file.

I do note this would be a simpler and faster operation in a relational 
database, but I don't want to throw _that_ curve into the mix quite yet. 
Perl hashes might be sufficient.

> On the other side, having some false positives in the training data did 
> not spoil my results. The algorithm did even predict these correctly as 
> spam later on :)


Er, don't you mean it predicted them as ham (FP = ham scored as spam)? It 
would be great if it was smart enough to recognize a near-boundary false 
result as what it *should* have been...
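
The combine step described above (log-generated rows, minus any msgid that also appears in the hand-sorted training corpora) can be sketched with plain dictionaries. This is written in Python instead of Perl hashes, and the 'msgid RULES' line format, the function names and the sample data are all assumptions for illustration.

```python
def read_rows(lines):
    """Map msgid -> rule string for lines of the form 'msgid RULE_A,RULE_B'."""
    rows = {}
    for line in lines:
        line = line.strip()
        if line:
            msgid, rules = line.split(None, 1)
            rows[msgid] = rules
    return rows


def merge_clean(log_ham, log_spam, corpus_ham, corpus_spam):
    """Return (ham, spam); any msgid found in a corpus overrides the log."""
    known = set(corpus_ham) | set(corpus_spam)
    ham = {m: r for m, r in log_ham.items() if m not in known}
    spam = {m: r for m, r in log_spam.items() if m not in known}
    ham.update(corpus_ham)
    spam.update(corpus_spam)
    return ham, spam


# Hypothetical data: "b2" was scored as spam in the log but is really ham (FP).
log_spam = read_rows(["a1 BAYES_99,URIBL_BLACK", "b2 RCVD_IN_BRBL"])
log_ham = read_rows(["c3 BAYES_00"])
corpus_ham = read_rows(["b2 RCVD_IN_BRBL"])
corpus_spam = read_rows([])

ham, spam = merge_clean(log_ham, log_spam, corpus_ham, corpus_spam)
print(sorted(ham), sorted(spam))
```

The design choice is that the corpora are treated as the authority: a message that was manually sorted always ends up in the file matching its corpus, regardless of how the log classified it.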


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  One difference between a liberal and a pickpocket is that if you
  demand your money back from a pickpocket he will not question your
  motives.  -- William Rusher
---
 Tomorrow: Albert Einstein's 130th Birthday


spamassassin: attempt to process a single message fails at PerMsgStatus.pm line 164.

2009-03-13 Thread Dennis German

Attempting to see how spamassassin would score a message
I tried
 spamassassin < lottery.msg

[32179] warn: config: could not find site rules directory
check: no loaded plugin implements 'check_main': cannot scan! at
	/usr/lib/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/PerMsgStatus.pm  
line 164.


message can be found at

http://real-world-systems.com/mail/lottery.msg



Re: spamassassin: attempt to process a single message fails at PerMsgStatus.pm line 164.

2009-03-13 Thread Bill Landry
Dennis German wrote:
> Attempting to see how spamassassin would score a message
> I tried
>  spamassassin < lottery.msg
> 
> [32179] warn: config: could not find site rules directory
> check: no loaded plugin implements 'check_main': cannot scan! at
> /usr/lib/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/PerMsgStatus.pm
> line 164.
> 
> message can be found at
> 
> http://real-world-systems.com/mail/lottery.msg

Don't forget to add the "-t" flag if you want to see how SA will score a
message.  I tested the message and got:

Content analysis details:   (35.1 points, 10.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.5 SB_NSP_VOLUME_SPIKE    SenderBase: Sender IP hosted at NSP has a volume
                            spike
 3.5 RCVD_IN_NERDS_BR       RBL: Received from Brazil
                            [189.20.215.138 listed in zz.countries.nerd.dk]
 1.0 RCVD_IN_JMF_BL         RBL: Sender listed in JMF-BLACK
                            [189.20.215.138 listed in
                            hostkarma.junkemailfilter.com]
 2.0 RCVD_IN_UCEPROTECT_1   RBL: Sender listed in UCEPROTECT_1
                            [189.20.215.138 listed in
                            dnsbl-1.uceprotect.net]
 1.5 RCVD_IN_BARRACUDA      RBL: Sender listed in Barracuda Relay Black List
                            [189.20.215.138 listed in
                            b.barracudacentral.org]
 0.5 RCVD_IN_LASHBACK       RBL: Sender listed in LashBack Unsubscribe
                            Blacklist
                            [189.20.215.138 listed in ubl.unsubscore.com]
 3.5 HK_LOTTO               BODY: Lottery or games mentioned
 1.0 RELAY_BR               Relayed through Brazil
 0.6 SPF_SOFTFAIL           SPF: sender does not match SPF record (softfail)
-0.5 BOTNET_SERVERWORDS     Hostname contains server-like substrings
                            [botnet_serverwords,ip=189.20.215.138,rdns=mail.usinavale.com.br]
 1.3 MISSING_HEADERS        Missing To: header
 0.0 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
                            [score: 0.5101]
 2.2 DCC_CHECK              Listed in DCC
                            (http://rhyolite.com/anti-spam/dcc/)
 0.5 RAZOR2_CHECK           Listed in Razor2 (http://razor.sf.net/)
 2.0 KARMA_CONNECT_NEGATIVE KARMA_CONNECT_NEGATIVE
 1.0 KAM_LOTTO2             Highly Likely to be a e-Lotto Scam Email
 1.0 DIGEST_MULTIPLE        Message hits more than one network digest check
 7.0 L_YOU_WON              L_YOU_WON
 1.0 HK_MUCHMONEY           Message refers to hundreds of thousands or
                            millions
 4.0 HK_PRIZEWIN            You won lot of money or prizes
 0.5 KAM_LOTTO1             Likely to be a e-Lotto Scam Email
 1.0 SAGREY                 Adds 1.0 to spam from first-time senders

Bill



Re: spamassassin: attempt to process a single message fails at PerMsgStatus.pm line 164.

2009-03-13 Thread Matt Kettler
Dennis German wrote:
> Attempting to see how spamassassin would score a message
> I tried
>  spamassassin < lottery.msg
>
> [32179] warn: config: could not find site rules directory
> check: no loaded plugin implements 'check_main': cannot scan! at
> /usr/lib/perl5/vendor_perl/5.8.8/Mail/SpamAssassin/PerMsgStatus.pm
> line 164.
>
> message can be found at
>
> http://real-world-systems.com/mail/lottery.msg

Does your spamassassin work properly when you run spamassassin --lint?

Just from the look of the error you got, it sounds like your install is
borked and /etc/mail/spamassassin is missing.




Re: Experimental Plugin: MetaSVM

2009-03-13 Thread decoder

John Hardin wrote:


> It needs the score, and not just Y/N Spam/Ham (i.e. from which corpus 
> file it came)?

The SVM does not need the score. However, the evaluation tool needs the 
score because it uses it to calculate the FP/FN rate.
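
A rough sketch of that distinction, not taken from the MetaSVM source: the 
function names, the input layout, and the way the score is used are all 
assumptions. The SVM input is a 0/1 vector over the global rule list, while 
the evaluation compares SA's own verdict (score against a threshold, here 
SA's default required score of 5.0) with the known label.

```python
def encode(hits, all_rules):
    """Binary feature vector: 1 if the rule hit, 0 if it did not.

    `all_rules` is the global rule list, so rules that did not hit are
    represented explicitly as zeros.
    """
    hit_set = set(hits)
    return [1 if rule in hit_set else 0 for rule in all_rules]


def fp_fn_rates(messages, threshold=5.0):
    """Return (fp_rate, fn_rate) of plain score-based classification.

    `messages` is a list of (is_spam, score) pairs, where is_spam is the
    known label and score is the SA score. FP = ham scored as spam,
    FN = spam scored as ham.
    """
    ham_scores = [score for is_spam, score in messages if not is_spam]
    spam_scores = [score for is_spam, score in messages if is_spam]

    fp = sum(1 for score in ham_scores if score >= threshold)
    fn = sum(1 for score in spam_scores if score < threshold)

    fp_rate = fp / len(ham_scores) if ham_scores else 0.0
    fn_rate = fn / len(spam_scores) if spam_scores else 0.0
    return fp_rate, fn_rate


# Hypothetical three-rule model and a tiny labelled sample.
vector = encode(["HK_LOTTO"], ["BAYES_50", "HK_LOTTO", "SPF_SOFTFAIL"])
rates = fp_fn_rates([(False, 1.0), (False, 6.0), (True, 7.0), (True, 2.0)])
```

Only `encode` needs the rule hits and the label; the score enters solely in 
`fp_fn_rates`, which is why the SVM can be trained without it.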


> I was thinking you'd generate a ham file and a spam file from the log, 
> possibly dynamically appending rows as messages are processed. 
> Naturally this would contain FPs and FNs.

If you want it to be dynamic, then the plugin could do the appending. 
However, the model cannot be extended incrementally: to incorporate new 
lines, the whole model must be recalculated. So this can't be done per 
message, only periodically, perhaps on a daily basis.




> You'd have a routine to extract the ham file from your full ham 
> corpus/corpora, and likewise for spam. The assumption is any FP or FN 
> would be placed into these corpora for normal bayes training.
>
> The tool would then combine them, omitting from the log-generated 
> files any msgid that appears in the training corpus files. You'd end up 
> with one clean spam file and one clean ham file.

That implies that people are indeed using bayes training, but it might 
be a suitable idea. However, I don't think that FPs and FNs spoil the SVM 
result anyway. SVMs are quite robust to outliers (which FPs and FNs 
essentially are), and if their number is low compared to the total amount 
of mail, the algorithm will have no problem predicting them properly :)





> Er, don't you mean it predicted them as ham (FP = ham scored as spam)? 
> It would be great if it was smart enough to recognize a near-boundary 
> false result as what it *should* have been...

I mean that I had some unrecognized spam left in my inbox, and the 
algorithm did identify it as spam :) The SVM generally tries to find a 
separating hyperplane. However, if the wrongly labeled points (FPs and 
FNs) are few, the SVM will most likely produce a result where the FPs and 
FNs do not match the label they were trained with. The C-SVM uses a cost 
constraint (each label violation costs a certain value) and tries to 
minimize an objective that includes this cost. So if the dataset is 
sufficiently large but has _some_ wrongly labeled points, the chances 
that the result is still what you wanted are high :)
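
For reference, the cost term described above corresponds to the textbook 
soft-margin (C-SVM) problem. This is the standard formulation, not 
something taken from the MetaSVM code. Here x_i would be the rule-hit 
vector of message i, y_i its label (+1 spam, -1 ham, an assumed 
convention), xi_i the slack that measures how far message i violates its 
label, and C the cost per unit of violation.

```latex
\min_{w,\,b,\,\xi}\;\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{subject to}\quad
y_i\,\bigl(w^{\top}x_i + b\bigr) \ge 1 - \xi_i,
\qquad \xi_i \ge 0,\quad i = 1,\dots,n .
```

A few mislabeled points can be absorbed by paying their slack cost, which 
is cheaper than bending the hyperplane around them when the correctly 
labeled mail vastly outnumbers them.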



-- Chris


