Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson
Apologies for the rapid-fire here folks, but I wanted to correct something.

I had these backwards:

>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>> 
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>> 
>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>> 
>> bayes_sql_override_username amavis

I meant to say that I have *always* had

bayes_path /var/lib/amavis/.spamassassin/bayes

in local.cf, and using the SQL setup, I added

bayes_sql_override_username amavis

Sorry for the confusion!

-Ben



On 4/19/2013 11:02 PM, Ben Johnson wrote:
> 
> 
> On 4/19/2013 1:54 PM, Benny Pedersen wrote:
>> Ben Johnson skrev den 2013-04-19 18:02:
>>
>>> Still stumped here...
>>
>> for amavisd-new, put spamassassin sql setup into user_prefs file for the
>> user amavisd-new runs as might be working better then have insecure sql
>> settings in /etc/mail/spamassassin :)
>>
>> i dont know if this is really that you have another user for amavisd,
>> and test spamassassin -t msg with another user that uses another sql user ?
>>
>> make sure both users is really using same sql user as intended
>>
> 
> Benny, thanks for the suggestion regarding moving the SA SQL setup into
> user_prefs. I will look into that soon.
> 
> Yes, I believe that me and the system always execute SA commands as the
> "amavis" user. When I was using the SQL setup, I had the following in
> local.cf:
> 
> bayes_path /var/lib/amavis/.spamassassin/bayes
> 
> With the DBM setup, I had the following (I have since commented it out,
> while attempting to debug this Bayes issue):
> 
> bayes_sql_override_username amavis
> 
> Is something more required to ensure that my mail system, which runs
> under the "amavis" user, is always reading from and writing to the same DB?
> 
> Best regards,
> 
> -Ben
> 
> 


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/19/2013 1:54 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-19 18:02:
> 
>> Still stumped here...
> 
> for amavisd-new, put spamassassin sql setup into user_prefs file for the
> user amavisd-new runs as might be working better then have insecure sql
> settings in /etc/mail/spamassassin :)
> 
> i dont know if this is really that you have another user for amavisd,
> and test spamassassin -t msg with another user that uses another sql user ?
> 
> make sure both users is really using same sql user as intended
> 

Benny, thanks for the suggestion regarding moving the SA SQL setup into
user_prefs. I will look into that soon.

Yes, I believe that both the system and I always execute SA commands as the
"amavis" user. When I was using the SQL setup, I had the following in
local.cf:

bayes_path /var/lib/amavis/.spamassassin/bayes

With the DBM setup, I had the following (I have since commented it out,
while attempting to debug this Bayes issue):

bayes_sql_override_username amavis

Is something more required to ensure that my mail system, which runs
under the "amavis" user, is always reading from and writing to the same DB?

Best regards,

-Ben




Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
> 
>> Still stumped here...
> 
> do a bayes sa-learn --backup
> 
> switch to file based in SDBM format (which is fast)
> 
> do a
> 
> sa-learn --restore
> 
> feed it a few thousand NEW spams
> 
> see what happens
> 
> 
> 
> 
> 
> 

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training
corpora that I've hand-sorted in favor of starting over? Or do you mean
to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will
really be stumped. When I hand-classify 12 messages with a subject and
body about a retractable garden hose as spam, I expect the 13th message
about the same hose to score high on the Bayes tests. Is this an
unreasonable expectation?

I commented-out all of the DB-related lines in my SA configuration file
(local.cf) and restarted amavis-new.

I also cleared the existing DB tokens (with "sa-learn --clear") after
amavis had restarted, and then executed my normal training script
against my ham and spam corpora.

I'll keep an eye on incoming messages to see if those that "slip
through" and score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a
better term, had been introduced by using utf8 to store the token data
(Benny Pedersen mentioned that Unicode is overkill, and he is probably
right). Performance aside, could using utf8_bin (instead of ascii)
introduce a problem for SA (despite no errors during "sa-learn" training
or --restore commands)?

The strange thing is that Bayes seems to work fine most of the time. But
as I've stated previously, almost all "obvious to a human" spam that
scores below 4.0 lacks evidence of Bayes testing.

Since switching back to a DBM Bayes setup, the results look pretty much
as expected (wrapped), and this is the type of thing I expect to see on
every message:

---
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'
dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5
(0.2%), tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%),
check_bayes: 26 (0.9%), tests_pri_0: 836 (28.6%), dkim_load_modules: 27
(0.9%), check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
tests_pri_500: 988 (33.8%)
---

I'll wait and see if I receive messages without Bayes results and report
back.
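
To spot the messages that skip Bayes without reading each debug dump by eye, the marker lines shown in this thread can be checked mechanically. The following is only a sketch: the function name and sample strings are made up for illustration, and it assumes the debug output has not been re-wrapped by a mail client.

```python
import re

# Marker text as it appears in the debug output quoted in this thread.
SCORED = re.compile(r"bayes: score = (\S+)")
SKIPPED = "cannot use bayes"


def bayes_outcome(debug_text):
    """Classify one debug dump as scored, skipped, or lacking Bayes lines."""
    for line in debug_text.splitlines():
        match = SCORED.search(line)
        if match:
            return ("scored", float(match.group(1)))
        if SKIPPED in line:
            return ("skipped", None)
    return ("absent", None)


# Hypothetical samples, cut down from the output pasted in this thread.
sample_scored = (
    "dbg: bayes: corpus size: nspam = 6203, nham = 2479\n"
    "dbg: bayes: score = 5.55111512312578e-17\n"
    "dbg: bayes: untie-ing"
)
sample_skipped = (
    "dbg: bayes: tok_get_all: token count: 176\n"
    "dbg: bayes: cannot use bayes on this message; not enough usable tokens found"
)

print(bayes_outcome(sample_scored))
print(bayes_outcome(sample_skipped))
```

Feeding it the captured stderr of each test run would make it easy to tally how many messages fall into the "skipped" bucket.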

Even if using DBM "works", I don't see this as a long-term solution --
only as a troubleshooting step. I would really like to keep my Bayes
data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben


Reporting matched Rules and Scores (was: Re: sa-exim Terse Rules)

2013-04-19 Thread Karsten Bräckelmann
On Thu, 2013-04-18 at 19:24 -0500, John Traweek CCNA, Sec+ wrote:
> I’m new to the list, so if there are web archives that are easily
> searchable where I can find this info please point me to it.  I am
> running sa-exim with SA 3.3.1.  I am trying for the life of me to turn
> on the Terse report options, so that in the email headers I can see
> what points are being attributed to each rule.

See M::SA::Conf docs [1] for all options outlined below. You've been
slightly vague, so it isn't clear which you're actually after.

The report_safe option, set to 0, will add an X-Spam-Report header
listing all triggered rules, their score and brief description.

This actually is the more verbose one, even though the respective
Template Tag's description calls it being the "terse" report. Probably
based on the report_safe default of 1, which wraps spam unaltered as
attachment to another MIME message. Which makes the Report header more
terse than wrapping, yet verbose on info.

If you want that header, regardless of the report_safe option, or maybe
regardless of the mail's spammyness (defaults to spam only), you can
enforce that header.

  add_header  all Report _REPORT_

The other option, and basically as terse as it can get, is the Template
Tag _TESTSSCORES_. The same as _TESTS_, the list of all rules hit,
though also including each rule's score. The latter is used by default
for the Status header.

You can easily overwrite the default and customize it using the more
verbose (yet really terse) variant with scores, by adding an add_header
option to your site config, similar to the one in 10_default_prefs.cf.


> It seems this has changed somewhat from version to version so I can’t
> seem to find anything specifically related to version and sa-exim when
> googling.  TIA.

Nope, this didn't change in a long time.

Please note though, that the above is about vanilla SA configuration. I
don't know sa-exim, and whether it requires specific options.


[1] http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: bayes - large message

2013-04-19 Thread Joe Acquisto-j4
>>> On 4/19/2013 at 8:26 PM, "Joe Acquisto-j4"  wrote:
> I thought I had corrected this issue, with someone's assistance, a while ago:
> 
> Apr 19 20:21:02.477 [23670] dbg: bayes: expiry completed
> Apr 19 20:21:02.477 [23670] info: archive-iterator: skipping large message
> Learned tokens from 0 message(s) (0 message(s) examined)

Please ignore.  As much as possible.   I was testing manually and forgot --mbox 
on the command line.

However, I can see something is amiss as it is happily accepting spam I thought 
had been previously submitted.

joe a.



bayes - large message

2013-04-19 Thread Joe Acquisto-j4
I thought I had corrected this issue, with someone's assistance, a while ago:

Apr 19 20:21:02.477 [23670] dbg: bayes: expiry completed
Apr 19 20:21:02.477 [23670] info: archive-iterator: skipping large message
Learned tokens from 0 message(s) (0 message(s) examined)




Re: local score ignored

2013-04-19 Thread Joe Acquisto-j4
>>> On 4/19/2013 at 10:41 AM, John Hardin  wrote:
> On Fri, 19 Apr 2013, Joe Acquisto-j4 wrote:
> 
>>> What output does the command "sa-learn --dump magic" produce?
>>
>> 0.000  0   1872  0  non-token data: nspam
>> 0.000  0   9184  0  non-token data: nham
> 
> Generally you want the ratio of trained messages to reflect the ratio of 
> mail you're seeing. Most people get a lot more spam than ham, so it looks 
> like you need a lot more spam trained in.
> 
> I try to maintain at least a 2:1 spam:ham ratio.
> 
> -- 
>   John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ 
>   jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org 
>   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>Ten-millimeter explosive-tip caseless, standard light armor
>piercing rounds. Why?
> ---
>   Today: the 238th anniversary of The Shot Heard 'Round The World


Interesting.  I had not paid attention.   From my personal experience, the 
totals seem reversed.   I will have to check
how others are feeding.   I suspect a certain other party may have their 
signals crossed on what to send where.

In which case, I may have to clear bayes and re-feed.

joe a



Re: Need rule to catch lots of font changes

2013-04-19 Thread Alex
Hi,

>> I'm trying to adapt this to work with multiple <br> tags, but I must be
>> doing something wrong. I've tried changing it to match just 10
>> instances of <br>, just for testing. Here's what I have:

>
>> rawbody  __LOC_BR  //
>> tflags  __LOC_BR  multiple maxhits=11
>> meta  LOC_MULT_BR > 10
>> score  LOC_MULT_BR 2.0
>> describe LOC_MULT_BR At least 10 br tags found
>>
>
> You forgot to refer the meta back to your __LOC_BR rule.  It should
> generate an error.  You did run a lint check on the rules after you added
> this, right?
>
> Try this:
>
> meta LOC_MULT_BR __LOC_BR > 10
>


Ah, that was it, thanks. Don't know how I missed that.

Alex


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Benny Pedersen

Ben Johnson wrote on 2013-04-19 18:02:

> Still stumped here...

For amavisd-new, putting the SpamAssassin SQL setup into the user_prefs
file of the user amavisd-new runs as might work better than keeping
insecure SQL settings in /etc/mail/spamassassin :)

I don't know whether this is what is happening here, but is amavisd
running as one user while you test "spamassassin -t msg" as another
user that uses a different SQL user?

Make sure both users really use the same SQL user, as intended.

--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Benny Pedersen

John Hardin wrote on 2013-04-18 04:15:

> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Unicode is overkill, since Bayes is just ASCII.

If Unicode is used, it creates a bigger DB, which will be slower than
ASCII.

> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

I don't know how Bayes looks in 3.4.x nowadays; it has been a long time
since I looked at the source. I made some changes to the Bayes MySQL
code so that it can be cleaned up with timed expiry of data. This was
probably lost in the transition to 3.4.x :(

> It's possible that there's a good reason the default script still
> uses myISAM. If so, the documentation for this fix should at least be
> easier to find.

Was it documented?

--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: local score ignored

2013-04-19 Thread Benny Pedersen

Joe Acquisto-j4 wrote on 2013-04-19 13:10:

> 0.000  0   1872  0  non-token data: nspam
> 0.000  0   9184  0  non-token data: nham

Any use of whitelist_from? If so, consider:

score USER_IN_WHITELIST 0.001

Why? whitelist_from can be forged, and it will poison Bayes if you are
not careful with scores. (whitelist_from hits are scored through the
USER_IN_WHITELIST rule.)

The default score is -100 :(

--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: sa-exim Terse Rules

2013-04-19 Thread Benny Pedersen

John Traweek CCNA, Sec+ wrote on 2013-04-19 02:24:

> I'm new to the list, so if there are web archives that are easily
> searchable where I can find this info please point me to it. I am
> running sa-exim with SA 3.3.1.

http://spamassassin.apache.org/

Don't rely on mailing list archives, use the website :=)

Could you stop posting HTML to a public mailing list, by the way?

--
senders that put my email into body content will deliver it to my own 
trashcan, so if you like to get reply, dont do it


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Axb

On 04/19/2013 06:02 PM, Ben Johnson wrote:

> Still stumped here...

Do a Bayes backup:

sa-learn --backup

Switch to a file-based store in SDBM format (which is fast).

Do a

sa-learn --restore

Feed it a few thousand NEW spams.

See what happens.

Re: local score ignored

2013-04-19 Thread Bowie Bailey

On 4/19/2013 7:10 AM, Joe Acquisto-j4 wrote:


0.000  0  3  0  non-token data: bayes db version
0.000  0   1872  0  non-token data: nspam
0.000  0   9184  0  non-token data: nham
0.000  0 140303  0  non-token data: ntokens
0.000  0 1364766063  0  non-token data: oldest atime
0.000  0 1366368683  0  non-token data: newest atime
0.000  0 1366367890  0  non-token data: last journal sync atime
0.000  0 1366146116  0  non-token data: last expiry atime
0.000  0    1382400  0  non-token data: last expire atime delta
0.000  0  26360  0  non-token data: last expire reduction count


You are learning to the same DB that's being used by SA, right?

--
Bowie


Re: Need rule to catch lots of font changes

2013-04-19 Thread Bowie Bailey

On 4/18/2013 7:32 PM, Alex wrote:


I'm trying to adapt this to work with multiple <br> tags, but I must
be doing something wrong. I've tried changing it to match just 10
instances of <br>, just for testing. Here's what I have:


rawbody  __LOC_BR  //
tflags  __LOC_BR  multiple maxhits=11
meta  LOC_MULT_BR > 10
score  LOC_MULT_BR 2.0
describe LOC_MULT_BR At least 10 br tags found


You forgot to refer the meta back to your __LOC_BR rule.  It should 
generate an error.  You did run a lint check on the rules after you 
added this, right?


Try this:

meta LOC_MULT_BR __LOC_BR > 10

--
Bowie


Re: Need rule to catch lots of font changes

2013-04-19 Thread Alexandre Boyer
Hi,

your meta is wrong.

It should be:

meta  LOC_MULT_BR  __LOC_BR > 10

Note that it will not match "just" 10 instances of this tag. It will
match more than ten of them (that is, at least 11).

If you want exactly 10, you have to do something like:

meta  LOC_MULT_BR  __LOC_BR == 10

Never done that, maybe you need to do "greater than 9 smaller than 11"
instead.
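
To make the counting logic concrete, here is a rough model of how a capped hit count interacts with the meta comparison. It is not how SpamAssassin is implemented; the function names are invented for this sketch, and the `<br>` pattern is an assumption because the regex in the original rule did not survive in the archive.

```python
import re


def count_hits(raw_body, pattern, maxhits):
    """Count regex matches, stopping once maxhits is reached.

    This mimics the effect of "tflags ... multiple maxhits=N":
    the rule's value never exceeds N.
    """
    hits = 0
    for _ in re.finditer(pattern, raw_body, flags=re.IGNORECASE):
        hits += 1
        if hits >= maxhits:
            break
    return hits


def meta_more_than(hit_count, threshold):
    """Model of "meta RULE __SUBRULE > threshold"."""
    return hit_count > threshold


# Hypothetical bodies with 10, 11 and 50 <br> tags.
for tag_total in (10, 11, 50):
    body = "text<br>" * tag_total
    hits = count_hits(body, r"<br>", maxhits=11)
    print(tag_total, hits, meta_more_than(hits, 10))
```

With maxhits=11 the count saturates at 11, so "> 10" fires from the 11th tag onward, while a body with exactly 10 tags does not trigger it.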

Alex, from prypiat.
Yes, I recycle.


On 13-04-18 07:32 PM, Alex wrote:
> Hi all,
>
>
> just write a single detection rule for FONT face= (rawbody or
> uri_detail) and use tflag multiple.
>
> Then meta this with a counter.
>
> eg:
> rawbody  __BLAH  / tflags  __BLAH  multiple maxhits=21
> meta  MULTPL_FONTS  __BLAH > 20
> score  MULTPL_FONTS  5.0
> describe MULTPL_FONTS  At least 20 FONT tags found
>
>
> I'm trying to adapt this to work with multiple  tags, but I must
> be doing something wrong. I've tried changing it to match just 10
> instances of , just for testing. Here's what I have:
>
> rawbody  __LOC_BR  //
> tflags  __LOC_BR  multiple maxhits=11
> meta  LOC_MULT_BR > 10
> score  LOC_MULT_BR 2.0
> describe LOC_MULT_BR At least 10 br tags found
>
> Here is the body example I'm working with:
>
>  href=3D"http://www.paren=
> ts-partage.org/components/com_content/bestinfo.php?tkogwruam714qhdgbfo
> ">htt=
> p://www.parents-partage.org/components/com_content/bestinfo.php?tkogwruam71=
> 
> 4qhdgbfo > r><=
> br>=
>  >__=
> __The stresses.. They just don't care. They're like you on
> Sunday m=
> orning. -- Jerry Griffin
> 
>
> Any idea why this doesn't work as expected? I've pasted an example here:
>
> http://pastebin.com/qprT2Rze
>
> Thanks for any ideas.
> Alex
>
>
>
>  
>
>
>
>
>
> Best regards,
>
> Alex, from prypiat.
> Yes, I recycle.
>
>
> On 13-04-14 08:46 PM, Marc Perkel wrote:
> > Anyone want to write a rule to catch this? Lots of font and color
> > changes.
> >
> > 
> > treatment for the summer holidays.
> > http://jmb.tw/16xul";>Achieve all your goals and this
> video
> > will
> > help you.
> >  color="#e4f4f2">One
> >  color="#e4fcf9">day
> >  > color="#e0fffb">a  > size="+2" color="#e8fffc">younger colleague,  face="Tahoma,
> > Geneva, sans-serif" size="-3" color="#f0fffd">one  > face="Courier, monospace"
> > size="5" color="#ecfbf9">of  > size="3" color="#e0fefa">my most  > color="#e0fdf9">intimate
> >  > color="#f8fffe">friends,  > size="-3" color="#f6fdfc">who had visited  > face="Arial, Helvetica, sans-serif" size="1"
> color="#f0fefc">the
> >  color="#ecfaf8">patient-  > face="Century Gothic, Times New Roman"
> > size="1" color="#e8f6f4">Irma-  size="1"
> > color="#e4f2f0">and  > size="-2" color="#e8fdfa">her
> >  color="#e4f9f6">
> > 
> > 
> >
> >
>
>




Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
> 
>> Is this normal? If so, what is the explanation for this behavior? I have
> 
> marked dozens of nearly-identical messages with the subject
> "Garden hose
> expands up to three times its length" as SPAM (over the course of
> several weeks) as SPAM, and yet SA reports "not enough usable
> tokens found".
> 
> 
> If they are identical, I don't believe it will create new tokens,
> per se.
> 
>  
> 
> Is SA referring to the number of tokens in the message? Or the
> Bayes DB?
> 
> 
> I should also mention that while training a message, use "--progress",
> as such (assuming you're running it on an mbox or message that's in mbox
> format):
> 
> # sa-learn --progress --spam --mbox mymboxfile
> 
> It will show you how many tokens have been learned during that run. It
> might also be a good idea to add the token summary flag to your config:
> 
> add_header all Tok-Stat _TOKENSUMMARY_
> 
> If you run spamassassin on a message directly, and add the -t option, it
> will show you the number of different types of tokens found in the message:
> 
> X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.
> 
> Regards,
> Alex
> 

Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system) and the corpus size, as
well as the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands-out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB??? Or only the tokens that apply to the particular
message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.

The concern I expressed around bug 6624 relates to Mark's comment, which
seems to imply that while SA will not insert a token twice, it *will*
increase the token "count". Here's an excerpt from Mark's comment from
that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able
to be inserted once, but their counts cannot increase, leading to
terrible bayes results if the bug is not noticed. Also the conversion
form db fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?

My "bayes_vars" table looks like this (one column per line, since the
row is too wide to paste without wrapping):

id                  1
username            amavis
spam_count          6185
ham_count           2427
token_count         120092
last_expire         1366364379
last_atime_delta    8380417
last_expire_reduce  14747
oldest_token_age    1357985848
newest_token_age    1366386865

The SQL query:

SELECT count( * )
FROM `bayes_token`

returns 120092 rows, so the above value is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).
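
The same cross-check can be scripted. The sketch below uses an in-memory SQLite database as a stand-in for the real MySQL server, with a cut-down version of the two tables; the helper name, the reduced column set and the sample rows are assumptions for illustration only.

```python
import sqlite3


def token_count_matches(conn, username):
    """Compare bayes_vars.token_count with the real bayes_token row count."""
    row = conn.execute(
        "SELECT id, token_count FROM bayes_vars WHERE username = ?",
        (username,),
    ).fetchone()
    if row is None:
        raise LookupError("no bayes_vars row for user " + username)
    user_id, recorded = row
    (actual,) = conn.execute(
        "SELECT COUNT(*) FROM bayes_token WHERE id = ?", (user_id,)
    ).fetchone()
    return recorded, actual, recorded == actual


# Stand-in database with simplified tables (not the full SA schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bayes_vars (id INTEGER, username TEXT, token_count INTEGER)"
)
conn.execute(
    "CREATE TABLE bayes_token "
    "(id INTEGER, token BLOB, spam_count INTEGER, ham_count INTEGER, atime INTEGER)"
)
conn.execute("INSERT INTO bayes_vars VALUES (1, 'amavis', 3)")
conn.executemany(
    "INSERT INTO bayes_token VALUES (1, ?, 1, 0, 0)",
    [(b"aaaaa",), (b"bbbbb",), (b"ccccc",)],
)

print(token_count_matches(conn, "amavis"))
```

Restricting the count to the user's id matters once more than one Bayes user shares the tables; with a single "amavis" user it gives the same figure as the unfiltered query.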

Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" output is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Alex
Hi,

>> Is this normal? If so, what is the explanation for this behavior? I have
>> marked dozens of nearly-identical messages with the subject "Garden hose
>> expands up to three times its length" as SPAM (over the course of
>> several weeks) as SPAM, and yet SA reports "not enough usable tokens
>> found".
>
> If they are identical, I don't believe it will create new tokens, per se.
>
>> Is SA referring to the number of tokens in the message? Or the Bayes DB?

I should also mention that while training a message, use "--progress", as
such (assuming you're running it on an mbox or message that's in mbox
format):

# sa-learn --progress --spam --mbox mymboxfile

It will show you how many tokens have been learned during that run. It
might also be a good idea to add the token summary flag to your config:

add_header all Tok-Stat _TOKENSUMMARY_

If you run spamassassin on a message directly, and add the -t option, it
will show you the number of different types of tokens found in the message:

X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.

Regards,
Alex


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Alex
Hi,

> Might anyone be in a position to offer an authoritative response to
> these questions?
>
> I continue to see messages that are very similar to dozens of messages
> that have been marked as SPAM slipping through with *no Bayes scoring*
> (this is *after* fixing the SQL syntax error issue):
>
> bayes: cannot use bayes on this message; not enough usable tokens found
> bayes: not scoring message, returning undef
>

Have you tried to find out how many tokens are in your bayes DB? As the
user specified by bayes_sql_username (actually, it probably doesn't matter,
but you should to be sure) run the following:

# sa-learn --dump magic
0.000  0  3  0  non-token data: bayes db version
0.000  0 466417  0  non-token data: nspam
0.000  0 508868  0  non-token data: nham
0.000  0   10788203  0  non-token data: ntokens
0.000  0 1320901921  0  non-token data: oldest atime
0.000  0 1366385643  0  non-token data: newest atime
0.000  0  0  0  non-token data: last journal sync atime
0.000  0 1366348380  0  non-token data: last expiry atime
0.000  0   28651364  0  non-token data: last expire atime delta
0.000  0  0  0  non-token data: last expire reduction count

This should show you the number of spam (nspam) and ham (nham) messages
learned, as well as the number of tokens (ntokens) in the db.
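
If you want to read these values from a script, a small parser is enough. This is a sketch only: it assumes each record sits on one unwrapped line, that the third numeric column carries the value, and that the atime fields are Unix epoch seconds (which the figures above appear to be). The function names are made up.

```python
from datetime import datetime, timezone


def parse_dump_magic(text):
    """Map each "non-token data" label to its integer value."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        parts = fields.split()
        # Assumed layout: probability, spam column, value column, atime column.
        values[label.strip()] = int(parts[2])
    return values


def as_utc(epoch_seconds):
    """Interpret an atime value as a UTC timestamp."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# Sample lines copied from the dump above.
sample = """0.000  0  3  0  non-token data: bayes db version
0.000  0 466417  0  non-token data: nspam
0.000  0 508868  0  non-token data: nham
0.000  0   10788203  0  non-token data: ntokens
0.000  0 1320901921  0  non-token data: oldest atime
0.000  0 1366385643  0  non-token data: newest atime"""

stats = parse_dump_magic(sample)
print(stats["nspam"], stats["nham"], stats["ntokens"])
print(as_utc(stats["oldest atime"]).date(), as_utc(stats["newest atime"]).date())
```

Comparing "newest atime" against today's date is a quick way to see whether the database being dumped is the one that is actually being trained.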

> Is this normal? If so, what is the explanation for this behavior? I have

> marked dozens of nearly-identical messages with the subject "Garden hose
> expands up to three times its length" as SPAM (over the course of
> several weeks) as SPAM, and yet SA reports "not enough usable tokens
> found".
>

If they are identical, I don't believe it will create new tokens, per se.


> Is SA referring to the number of tokens in the message? Or the Bayes DB?
>

I believe it would be talking about the database, not the message.

Regards,
Alex


Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new

2013-04-19 Thread Ben Johnson


On 4/18/2013 12:18 PM, Ben Johnson wrote:
> 
> My concern now is that I am on 3.3.1, with little control over upgrades.
> I have read all three bug reports in their entirety and Bug 6624 seems
> to be a very legitimate concern. To quote Mark in the bug description:
> 
>> The effect of the bug with SpamAssassin is that tokens are only able
>> to be inserted once, but their counts cannot increase, leading to
>> terrible bayes results if the bug is not noticed. Also the conversion
>> form db fails, as reported by Dave.
>>
>> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
>> provide a workaround for the MySQL server bug, and improved debug logging.
> 
> How can I discern whether or not this bug does, in fact, affect me? Are
> my Bayes results being crippled as a result of this bug?
> 
>> It's possible that there's a good reason the default script still uses
>> myISAM. If so, the documentation for this fix should at least be easier
>> to find.
>>
> 
> In any event, I'm a little concerned because while the majority of
> messages are now tagged with BAYES_* hits, I am now seeing this debug
> output on a significant percentage of messages ("cannot use bayes on
> this message; not enough usable tokens found"):
> 
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
> 
> --
> Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
> self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
> bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
> Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
> Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
> store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
> Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
> Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
> Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
> Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
> = 2342
> Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
> Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
> message; not enough usable tokens found
> Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
> Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
> (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
> poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
> tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
> (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
> tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
> (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
> check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
> (4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
> --
> 
> I have done some searching-around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.
> 
> Thank you,
> 
> -Ben
> 

Might anyone be in a position to offer an authoritative response to
these questions?

I continue to see messages that are very similar to dozens of messages
that have been marked as SPAM slipping through with *no Bayes scoring*
(this is *after* fixing the SQL syntax error issue):

bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Is this normal? If so, what is the explanation for this behavior? I have
marked dozens of nearly-identical messages with the subject "Garden hose
expands up to three times its length" as SPAM (over the course of
several weeks) as SPAM, and yet SA reports "not enough usable tokens found".

Is SA referring to the number of tokens in the message? Or the Bayes DB?

Thanks,

-Ben


Re: local score ignored

2013-04-19 Thread John Hardin

On Fri, 19 Apr 2013, Joe Acquisto-j4 wrote:


>> What output does the command "sa-learn --dump magic" produce?
>
> 0.000  0   1872  0  non-token data: nspam
> 0.000  0   9184  0  non-token data: nham


Generally you want the ratio of trained messages to reflect the ratio of 
mail you're seeing. Most people get a lot more spam than ham, so it looks 
like you need a lot more spam trained in.


I try to maintain at least a 2:1 spam:ham ratio.
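
As a rough illustration of that arithmetic, using the nspam and nham figures quoted above (the function names are invented for this sketch, and the 2:1 target is just my rule of thumb, not a SpamAssassin setting):

```python
import math


def spam_ham_ratio(nspam, nham):
    """Return the trained spam:ham ratio as a single number."""
    if nham == 0:
        raise ValueError("no ham trained yet, ratio is undefined")
    return nspam / nham


def spam_needed_for_ratio(nspam, nham, target=2.0):
    """How many more spam messages are needed to reach target:1."""
    needed = math.ceil(target * nham) - nspam
    return max(0, needed)


# Figures from the "sa-learn --dump magic" output quoted above.
nspam, nham = 1872, 9184
print(round(spam_ham_ratio(nspam, nham), 3))
print(spam_needed_for_ratio(nspam, nham))
```

The second figure shows how far off the database is; if the real mail flow is mostly spam, it may be quicker to find out who is feeding so much ham than to train that much extra spam.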

--
 John Hardin KA7OHZhttp://www.impsec.org/~jhardin/
 jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Ten-millimeter explosive-tip caseless, standard light armor
  piercing rounds. Why?
---
 Today: the 238th anniversary of The Shot Heard 'Round The World


Re: local score ignored

2013-04-19 Thread John Hardin

On Fri, 19 Apr 2013, Joe Acquisto-j4 wrote:

>> On 18.04.13 21:45, Joe Acquisto-j4 wrote:
>>> All I can do is feed it.
>>
>> that is what you should do. You need to train on both spam and ham, since
>> the BAYES filter must know how they differ...
>
> That has always given me pause, as I get very little ham.  Got one this AM,
> which I will feed, but that's the first in at least a month.
>
> I gather that aged info is not useful?


Ham changes character over time much less than spam. Train with whatever 
you have to start, then train with misclassified messages.


--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Ten-millimeter explosive-tip caseless, standard light armor
  piercing rounds. Why?
---
 Today: the 238th anniversary of The Shot Heard 'Round The World


Re: local score ignored

2013-04-19 Thread Matus UHLAR - fantomas

Niamh Holding  04/19/13 7:11 AM >>>

You only get one ham email a month?


On 19.04.13 09:22, Joe Acquisto-j4 wrote:

That's all *I* seem to get.   Other users may differ, but I have given them
instructions on how to forward stuff for training.



This is a rather small system compared to what many of you deal with.


Do you use a shared Bayes database? Note that this may not be ideal for many
users, since different users can have different opinions on what is spam
and what is ham.

Also, are you sure SA is using the same BAYES database you are feeding?
It's quite possible that the database you have trained is not the one being
used (and this would explain your problem).
The question is how SA is called and how people train the database...
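
One rough way to check this is to capture "sa-learn --dump magic" once as the account that trains and once as the account that scans, then compare the counters. The sketch below is Python and only compares two captured text dumps; the function names and the second (empty) dump are made up for illustration:

```python
def magic_counts(dump_text):
    """Extract nspam, nham and ntokens from `sa-learn --dump magic` output."""
    wanted = ("nspam", "nham", "ntokens")
    counts = {}
    for line in dump_text.splitlines():
        head, sep, label = line.partition("non-token data:")
        fields = head.split()
        label = label.strip()
        # In the output quoted in this thread the value is the third column.
        if sep and len(fields) == 4 and label in wanted:
            counts[label] = int(fields[2])
    return counts

def counts_match(dump_a, dump_b):
    """True when both dumps report the same message and token counts."""
    a, b = magic_counts(dump_a), magic_counts(dump_b)
    return bool(a) and a == b

# Counters as posted in this thread (dump taken by the training account).
trained = """\
0.000  0   1872  0  non-token data: nspam
0.000  0   9184  0  non-token data: nham
0.000  0 140303  0  non-token data: ntokens
"""
# Hypothetical dump from the scanning account, pointing at an empty database.
scanner = """\
0.000  0  0  0  non-token data: nspam
0.000  0  0  0  non-token data: nham
0.000  0  0  0  non-token data: ntokens
"""
print("same counters:", counts_match(trained, scanner))
```

Differing counters strongly suggest two different databases. Matching counters are only a hint, not proof, that both accounts use the same one.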

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Save the whales. Collect the whole set.


Re: local score ignored

2013-04-19 Thread Joe Acquisto-j4
That's all *I* seem to get.   Other users may differ, but I have given them
instructions on how to forward stuff for training.

This is a rather small system compared to what many of you deal with.

joe a.

>>> Niamh Holding  04/19/13 7:11 AM >>>

Hello Joe,

Friday, April 19, 2013, 12:02:32 PM, you wrote:

JAj> That has always given me pause, as I get very little ham.  Got one
JAj> this AM, which I will feed, but that's the first in at least a month.

You only get one ham email a month?

-- 
Best regards,
 Niamh                          mailto:ni...@fullbore.co.uk



Re: local score ignored

2013-04-19 Thread Matus UHLAR - fantomas

On 4/19/2013 at 6:29 AM, Matus UHLAR - fantomas  wrote:

that is what you should do. You need to train on both spam and ham, since
the BAYES filter must know how they differ...


On 19.04.13 07:02, Joe Acquisto-j4 wrote:

That has always given me pause, as I get very little ham.  Got one this AM,
which I will feed, but that's the first in at least a month.

I gather that aged info is not useful?


I think that could be useful, mostly when it's aged HAM, but even aged spam
is better than no spam...
Training missed spam is more important, but even training caught spam helps.
In your case (just a few ham) I'd train _all_ ham and all spam that does not
hit BAYES_99.

I looked at my spam history - only ~10% of my spam does not hit BAYES_99,
and the last spam hitting BAYES_50 was about a year and 500 spams ago.
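
That selection rule (all ham, plus any spam that did not hit BAYES_99) could be sketched in Python as below. It assumes the rule names appear in the tests= field of the X-Spam-Status header, followed by autolearn= or version=; the helper names and the sample header values are invented for illustration:

```python
import re

def rules_hit(x_spam_status):
    """Return the set of rule names listed in the tests= field."""
    # re.S lets the match run across folded (multi-line) header values.
    match = re.search(
        r"tests=(.*?)(?:\s+autolearn=|\s+version=|$)", x_spam_status, re.S
    )
    if not match:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}

def should_train(is_spam, x_spam_status):
    """Train all ham, and only spam that Bayes did not already rate BAYES_99."""
    if not is_spam:
        return True
    return "BAYES_99" not in rules_hit(x_spam_status)

# Made-up header values, for illustration only.
missed = "No, score=1.2 required=5.0 tests=BAYES_50,HTML_MESSAGE autolearn=no version=3.3.1"
caught = "Yes, score=9.0 required=5.0 tests=BAYES_99,\n\tHTML_MESSAGE autolearn=no version=3.3.1"

print(should_train(True, missed))   # spam without BAYES_99: worth training
print(should_train(True, caught))   # spam already at BAYES_99: skip
```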

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Due to unexpected conditions Windows 2000 will be released
in first quarter of year 1901


Re: local score ignored

2013-04-19 Thread Joe Acquisto-j4
>>> On 4/19/2013 at 6:35 AM, Matus UHLAR - fantomas  wrote:
> On 4/19/2013 at 12:06 AM, John Hardin  wrote:
>>> BAYES_50 is the bayes classifier's way of saying "insufficient data" or "I
>>> don't know".
>>>
>>> Do you really want to assign 3 points for "I don't know"?
> 
> On 19.04.13 06:09, Joe Acquisto-j4 wrote:
>>In this case, from the samples I've seen.   Absolutely, yes.
> 
> as I said, the problem is that your BAYES database does not have enough
> spam/ham samples. You have to feed it, not increase the score for BAYES_50.
> By your logic you could give a high score to any other rule that hits, e.g.
> HTML_MESSAGE.
>

Well, I *could* do a lot of things.  And have.  (See these scars?)

> 
>>For me, this last few days, I have seen lots of missed spam that has
>> virtually nothing else to trigger on.
> 
> Do you have network checks enabled? Plugins allowed? Packages installed?
> Blacklists, URIBL?
> Razor, Pyzor and DCC all need plugins and installed clients.

I have to check.   I set this up a while ago, cron'd up feeding BAYES and such,
and sat back.

> Do you have your trusted_networks and internal_networks properly set?
 
Probably.

>>Been so irritated by this I considered giving it a 5.0.   But, even for me,
>> that's over the top.
> 
> If you receive much spam with BAYES_50, there's something wrong with your
> BAYES database; even disabling it could behave better (but training would do
> much better).

I have suspected such, but . . .

> What output does the command "sa-learn --dump magic" produce?
> 
> -- 


0.000          0          3          0  non-token data: bayes db version
0.000          0       1872          0  non-token data: nspam
0.000          0       9184          0  non-token data: nham
0.000          0     140303          0  non-token data: ntokens
0.000          0 1364766063          0  non-token data: oldest atime
0.000          0 1366368683          0  non-token data: newest atime
0.000          0 1366367890          0  non-token data: last journal sync atime
0.000          0 1366146116          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0      26360          0  non-token data: last expire reduction count
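
The atime lines are easier to judge as dates. A small Python sketch, assuming these values are Unix epoch seconds and the delta is a plain number of seconds (the variable and function names are illustrative):

```python
from datetime import datetime, timedelta, timezone

# Values copied from the dump above.
atimes = {
    "oldest atime": 1364766063,
    "newest atime": 1366368683,
    "last journal sync atime": 1366367890,
    "last expiry atime": 1366146116,
}
expire_delta_seconds = 1382400

def to_utc(epoch_seconds):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

for label, value in atimes.items():
    print(f"{label:24} {to_utc(value):%Y-%m-%d %H:%M:%S} UTC")
print("last expire atime delta:", timedelta(seconds=expire_delta_seconds))
```

Under that assumption the oldest token access falls on 2013-03-31, the newest on 2013-04-19, and the expire delta is exactly 16 days.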

joe a.




Re: local score ignored

2013-04-19 Thread Niamh Holding

Hello Joe,

Friday, April 19, 2013, 12:02:32 PM, you wrote:

JAj> That has always given me pause, as I get very little ham.  Got one
JAj> this AM, which I will feed, but that's the first in at least a month.

You only get one ham email a month?

-- 
Best regards,
 Niamh                          mailto:ni...@fullbore.co.uk



Re: local score ignored

2013-04-19 Thread Joe Acquisto-j4
>>> On 4/19/2013 at 6:29 AM, Matus UHLAR - fantomas  wrote:
> On 4/18/2013 at 7:21 AM, Matus UHLAR - fantomas  wrote:
>>> Train your bayes database, if you get many spams with this small score.
> 
> On 18.04.13 21:45, Joe Acquisto-j4 wrote:
>>All I can do is feed it.
> 
> that is what you should do. You need to train on both spam and ham, since
> the BAYES filter must know how they differ...
> 
>

That has always given me pause, as I get very little ham.  Got one this AM,
which I will feed, but that's the first in at least a month.

I gather that aged info is not useful?

joe a.





Re: local score ignored

2013-04-19 Thread Matus UHLAR - fantomas

On 4/19/2013 at 12:06 AM, John Hardin  wrote:

BAYES_50 is the bayes classifier's way of saying "insufficient data" or "I
don't know".

Do you really want to assign 3 points for "I don't know"?


On 19.04.13 06:09, Joe Acquisto-j4 wrote:

In this case, from the samples I've seen.   Absolutely, yes.


as I said, the problem is that your BAYES database does not have enough
spam/ham samples. You have to feed it, not increase the score for BAYES_50.
By your logic you could give a high score to any other rule that hits, e.g.
HTML_MESSAGE.
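
To see why BAYES_50 amounts to "I don't know", it helps to look at how a Bayes probability maps onto the BAYES_xx rules. The Python sketch below assumes the stock bucket ranges (worth checking against the 23_bayes.cf shipped with your version); the boundary handling is simplified and the function name is illustrative:

```python
# Assumed stock ranges: (lower bound, upper bound, rule name).
BAYES_BUCKETS = [
    (0.00, 0.01, "BAYES_00"),
    (0.01, 0.05, "BAYES_05"),
    (0.05, 0.20, "BAYES_20"),
    (0.20, 0.40, "BAYES_40"),
    (0.40, 0.60, "BAYES_50"),
    (0.60, 0.80, "BAYES_60"),
    (0.80, 0.95, "BAYES_80"),
    (0.95, 0.99, "BAYES_95"),
    (0.99, 1.00, "BAYES_99"),
]

def bayes_rule(probability):
    """Map a Bayes probability in [0, 1] to the bucket rule it falls in."""
    for low, high, name in BAYES_BUCKETS:
        # Simplification: lower bound exclusive (except 0), upper inclusive.
        if (low == 0.0 or probability > low) and probability <= high:
            return name
    return None

# The score discussed in this thread.
print(bayes_rule(0.4968))
```

A score of 0.4968 lands in the middle bucket, so raising the BAYES_50 score adds points to every message the classifier cannot decide on, ham included.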


For me, this last few days, I have seen lots of missed spam that has
virtually nothing else to trigger on.


Do you have network checks enabled? Plugins allowed? Packages installed?
Blacklists, URIBL?
Razor, Pyzor and DCC all need plugins and installed clients.
Do you have your trusted_networks and internal_networks properly set?


Been so irritated by this I considered giving it a 5.0.   But, even for me,
that's over the top.


If you receive much spam with BAYES_50, there's something wrong with your
BAYES database; even disabling it could behave better (but training would do
much better).

What output does the command "sa-learn --dump magic" produce?

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
(R)etry, (A)bort, (C)ancer


Re: local score ignored

2013-04-19 Thread Matus UHLAR - fantomas

On 4/18/2013 at 7:21 AM, Matus UHLAR - fantomas  wrote:

Train your bayes database, if you get many spams with this small score.


On 18.04.13 21:45, Joe Acquisto-j4 wrote:

All I can do is feed it.


that is what you should do. You need to train on both spam and ham, since
the BAYES filter must know how they differ...


DO NOT play with BAYES_50 score.


?  What can it hurt?


you can get many false positives.
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Christian Science Programming: "Let God Debug It!".


Re: local score ignored

2013-04-19 Thread Joe Acquisto-j4
>>> On 4/19/2013 at 12:06 AM, John Hardin  wrote:
> On Thu, 18 Apr 2013, Joe Acquisto-j4 wrote:
> 
> On 4/18/2013 at 7:21 AM, Matus UHLAR - fantomas  wrote:
>>> On 18.04.13 06:45, Joe Acquisto-j4 wrote:
>>>> I was concerned about this:
>>>>
>>>> [score: 0.4968]
>>>
>>> This meant that BAYES has computed a 49.68% probability that the mail is spam
>>> and the rest (50.32%) that it is HAM.
>>
>> ok
>>
>>> DO NOT play with BAYES_50 score.
>>
>> ?  What can it hurt?
> 
> BAYES_50 is the bayes classifier's way of saying "insufficient data" or "I 
> don't know".
> 
> Do you really want to assign 3 points for "I don't know"?
> 
> -- 
>   John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>   jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
>   key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> ---
>Ten-millimeter explosive-tip caseless, standard light armor
>piercing rounds. Why?
> ---
>   Tomorrow: the 238th anniversary of The Shot Heard 'Round The World

In this case, from the samples I've seen.   Absolutely, yes. 

For me, this last few days, I have seen lots of missed spam that has virtually 
nothing else to trigger on.  

Been so irritated by this I considered giving it a 5.0.   But, even for me, 
that's over the top.

joe a