Training spam as ham and forwarding

2009-08-26 Thread MySQL Student
Hi SA users,

I have a few messages found in the quarantine that I need to train as
ham because they were marked as spam incorrectly. To do this, I added
the following to the top of the file so it becomes a normal email:

 From DUMMY-LINE Thu Jan  1 00:00:00 1970

Is this correct? (without the leading spaces)
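
For reference, the step described above can be sketched in Python. This is only an illustration, assuming the quarantined file holds exactly one raw message; the helper name ensure_mbox_from_line is made up for the example and is not part of SpamAssassin or pine.

```python
# Separator line quoted in the post; mbox readers expect a line
# starting with "From " before each message.
MBOX_FROM_LINE = "From DUMMY-LINE Thu Jan  1 00:00:00 1970\n"


def ensure_mbox_from_line(raw: str) -> str:
    """Return the message text with an mbox "From " line at the top.

    Hypothetical helper: the text is returned unchanged if it already
    starts with a "From " separator, so it is safe to call twice.
    """
    if raw.startswith("From "):
        return raw
    return MBOX_FROM_LINE + raw


if __name__ == "__main__":
    sample = "Subject: test\n\nbody\n"
    print(ensure_mbox_from_line(sample), end="")
```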

Pine can now open and index it, whereas before it didn't recognize it
as a normal email. I'd also like to forward it to the intended
recipient as an attachment, but the recipient sees it as plain text
rather than as a normal email. How can I accomplish this?
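
One possible approach, sketched with Python's standard email package: wrap the original in a new message as a message/rfc822 part, which most mail clients display as an attached email rather than as plain text. The function name and the addresses are placeholders invented for the example, and actually sending the result (for instance with smtplib) is left out.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def forward_as_attachment(raw_original: str, sender: str, recipient: str) -> EmailMessage:
    """Build a new message carrying the original as a message/rfc822 part.

    Hypothetical helper; sender and recipient are supplied by the caller.
    """
    original = message_from_string(raw_original, policy=policy.default)

    outer = EmailMessage()
    outer["From"] = sender
    outer["To"] = recipient
    outer["Subject"] = "Fwd: " + str(original.get("Subject", "(no subject)"))
    outer.set_content("The original message is attached.")
    # Passing a message object makes the attachment a message/rfc822 part.
    outer.add_attachment(original)
    return outer


if __name__ == "__main__":
    raw = "From: a@example.com\nTo: b@example.com\nSubject: Invoice\n\nHello\n"
    print(forward_as_attachment(raw, "admin@example.com", "b@example.com"))
```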

Are there mail tools, such as procmail or formail, that were designed
to automate this?

Does anyone ask their users to submit ham for Bayes training, or is
autolearning typically the only (or only really effective) way to do this?

Also, on another note, how can I have all email destined for a
particular user sent to them, including spam? This is what all_spam_to
is for, correct?

Thanks,
Alex


Re: corpus ham/spam balance

2009-08-26 Thread LuKreme

On 26-Aug-2009, at 10:53, Kris Deugau wrote:
> If you're running a sitewide AWL on any kind of scale beyond a few
> tens of domains, and a couple hundred accounts, you should probably
> look at putting it in SQL - it's a *lot* easier to maintain there.


Is there a good writeup on doing this?




RE: corpus ham/spam balance

2009-08-26 Thread Karsten Bräckelmann
On Wed, 2009-08-26 at 13:47 -0600, Savoy, Jim wrote:
> > Karsten wrote:
> > None.  As I mentioned earlier today on this list, auto-learning takes
> > neither Bayes nor AWL into account.
> 
> Ok thanks Karsten. I guess the change to -3.0 for ham is the only cause
> of my corpus coming into balance. Good to know.

Yup, definitely.

Also, I do agree with the post by RW. By lowering the auto-learn ham
threshold you managed to get the ratio more sane. However, if you keep
it there, you won't really learn any ham, only spam.

Raising the threshold back to the default would likely be a good idea,
lowering it only occasionally to get the effect you just observed:
getting the ratio back to somewhat balanced.


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: sa: lottery message scored hammy by bayes:salearn --dump magin

2009-08-26 Thread Karsten Bräckelmann
On Wed, 2009-08-26 at 17:28 -0400, Dennis German wrote:
> Thanks for the support.

Voluntary support.  Please do keep the thread on-list, otherwise I'll
stop volunteering. I'm not the only one who can help you.

> I do know enough not to paste spam in a posting.

And yet you did paste the payload of a scam. Hint, the SOUGHT (FRAUD)
rule-set, which I feed myself, is designed to catch this sort of bare
text.


> I've been trying to get our local support(midphase) to
> handle some of the problems they have created by moving us to
> another server. Recall my question late last night, if you got it.
> 
> Anyway,
> I haven't been manually doing training.
> 
> Should I be doing training?
> 
> Also did you notice error "could not find site rules directory"

Yes, I did.  Quite odd. That is part of the reason I pointed out your
setup and wondered aloud which user is scanning and which one you dumped
the sa-learn stats for. Since you have not been training manually, these
should be the same.

As I have pointed out a couple of times before, it is rather crucial to
manually train on *low* scoring spam, such as this one. Such messages
are not auto-learned.


> My user_prefs is always available at
> http://www.real-world-systems.com/mail/user_prefs.cgi
> 
> 
> Karsten Bräckelmann wrote:
> > On Tue, 2009-08-25 at 21:21 -0400, Dennis German wrote:
> > > sa-learn --dump magic
> > > config: could not find site rules directory
> > > 0.000  0  3  0  non-token data: bayes db version
> > > 0.000  0 262297  0  non-token data: nspam
> > > 0.000  0  24621  0  non-token data: nham
> > > 0.000  0 142776  0  non-token data: ntokens
> > 
> > Recalling some fuzzy bits about your system and setup, I wonder if that
> > is the Bayes DB of the *scanning* user -- or a different one you have
> > been manually training.




RE: AWL q?

2009-08-26 Thread Gary Smith
> I don't let that junk get past envelope stage:
> 
> postmap -q "weekendhotdeals.info" mysql:/usr/local/etc/postfix/mysql-from_senders_rhsbl.cf
> 554 RHSBL_DOMAIN
> 

I assume you are running some type of background process that generates the
list of senders based upon some criteria.  Can you share more?

I also use mysql lookups for postfix, though I'm in the process of converting
some of the larger ones to memcache (with a preloader) so I can hit
memcached first and fall back to the database only if necessary.  I'm also
looking for better ways to deal with spam.



Re: AWL q?

2009-08-26 Thread Len Conrad
-- Original Message --
From: Gary Smith 
Date:  Wed, 26 Aug 2009 12:29:24 -0700

>I've been finding a lot of singletons in the AWL db for domains that are all
>spam.  Is there a way to put an entire domain into AWL or set it up to give an
>average score for that domain?
>
>Obviously I can put this directly into the config file but I'm looking for a
>less intrusive way to do this.  What might be useful is an awl_domain table
>that manages the average for the domain/ip as well as just the single email.
>
>Anyway, is there a way to do this currently?
>
>Example of the database (I think I have like 500 for these guys now from this
>week).
>
>+----------+----------------------------------+-------+-------+----------+
>| username | email                            | ip    | count | totscore |
>+----------+----------------------------------+-------+-------+----------+
>| filter   | ajdiohxo...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
>| filter   | ajuxorpc...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
>| filter   | aqxkopmj...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
>| filter   | atjwoxps...@weekendhotdeals.info | 76.73 |     1 |   11.918 |
>| filter   | bckxiypg...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
>| filter   | beqrikuo...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
>| filter   | bkqrasni...@weekendhotdeals.info | 76.73 |     2 |   13.038 |
>| filter   | blyhovks...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
>| filter   | bsfmogqa...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
>| filter   | bsgjuulc...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
>+----------+----------------------------------+-------+-------+----------+

I don't let that junk get past envelope stage:

postmap -q "weekendhotdeals.info" mysql:/usr/local/etc/postfix/mysql-from_senders_rhsbl.cf
554 RHSBL_DOMAIN

Len






AWL q?

2009-08-26 Thread Gary Smith
I've been finding a lot of singletons in the AWL db for domains that are all
spam.  Is there a way to put an entire domain into AWL or set it up to give an
average score for that domain?

Obviously I can put this directly into the config file but I'm looking for a
less intrusive way to do this.  What might be useful is an awl_domain table
that manages the average for the domain/ip as well as just the single email.

Anyway, is there a way to do this currently?

Example of the database (I think I have like 500 for these guys now from this 
week).

+----------+----------------------------------+-------+-------+----------+
| username | email                            | ip    | count | totscore |
+----------+----------------------------------+-------+-------+----------+
| filter   | ajdiohxo...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | ajuxorpc...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | aqxkopmj...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
| filter   | atjwoxps...@weekendhotdeals.info | 76.73 |     1 |   11.918 |
| filter   | bckxiypg...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | beqrikuo...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
| filter   | bkqrasni...@weekendhotdeals.info | 76.73 |     2 |   13.038 |
| filter   | blyhovks...@weekendhotdeals.info | 76.73 |     1 |    6.519 |
| filter   | bsfmogqa...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
| filter   | bsgjuulc...@weekendhotdeals.info | 76.73 |     2 |   10.872 |
+----------+----------------------------------+-------+-------+----------+
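
To make the per-domain average idea concrete, here is a rough Python sketch that rolls AWL-style rows up by sender domain. It is not an existing SpamAssassin feature; the function name and the in-memory row format (email, count, totscore) are assumptions for illustration, and the addresses are the truncated ones from the sample, kept as-is. The mean is taken as total score divided by total count.

```python
from collections import defaultdict


def domain_averages(rows):
    """Aggregate (email, count, totscore) rows into a mean score per domain.

    Hypothetical helper: sums counts and total scores for each domain,
    then divides, which weights every message equally.
    """
    totals = defaultdict(lambda: [0, 0.0])  # domain -> [count, totscore]
    for email, count, totscore in rows:
        domain = email.rsplit("@", 1)[-1].lower()
        totals[domain][0] += count
        totals[domain][1] += totscore
    return {d: total / count for d, (count, total) in totals.items() if count}


if __name__ == "__main__":
    sample = [
        ("ajdiohxo...@weekendhotdeals.info", 1, 6.519),
        ("aqxkopmj...@weekendhotdeals.info", 2, 10.872),
        ("atjwoxps...@weekendhotdeals.info", 1, 11.918),
    ]
    for domain, mean in domain_averages(sample).items():
        print(domain, round(mean, 3))
```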


RE: corpus ham/spam balance

2009-08-26 Thread Karsten Bräckelmann
On Wed, 2009-08-26 at 11:04 -0600, Savoy, Jim wrote:
> I'm not sure how much effect adding AWL to my config helped with my
> corpus coming into balance [...]

None.  As I mentioned earlier today on this list, auto-learning takes
neither Bayes nor AWL into account.

  60_awl.cf:  tflags AWL  userconf noautolearn





Re: corpus ham/spam balance

2009-08-26 Thread RW
On Wed, 26 Aug 2009 11:04:42 -0600
"Savoy, Jim"  wrote:

> I'm not sure how much effect adding AWL to my config helped with my
> corpus coming into balance (perhaps it was only the change to my ham
> threshold that made the difference),

The AWL score isn't counted for autolearning, and neither are the
Bayes score or whitelisting rules. So by taking the threshold down to
-3.0, you won't be learning much ham at all unless you have a lot of good
negative-scoring custom rules.

IMO trying to find a particular ham threshold that brings the ratio
into balance is not a good idea, because you can become far too selective
in what you learn and end up learning the least useful candidates.
It is probably better to periodically push the threshold down to -100
until the numbers balance.


RE: corpus ham/spam balance

2009-08-26 Thread Savoy, Jim
> kdg wrote:
> If you're running a sitewide AWL on any kind of scale beyond a few tens
> of domains, and a couple hundred accounts, you should probably look at
> putting it in SQL - it's a *lot* easier to maintain there.

It is one domain, with 20,000 accounts. I will see about using SQL.
Thanks.

> Well, by deleting the file, you purge all of the history various senders
> have acquired on your system - it may not do any harm, but it may also
> cause a few FPs for senders whose first message after the deletion
> happens to be spammier than usual.

I'm not sure how much effect adding AWL to my config helped with my
corpus coming into balance (perhaps it was only the change to my ham
threshold that made the difference), but my thinking was that I probably
wouldn't get many/any FPs if the corpus is trained well, thus allowing
me to just blow that file away every few months and start over.

 - jim -



Re: corpus ham/spam balance

2009-08-26 Thread Kris Deugau

Savoy, Jim wrote:
> I see that my auto-whitelist file is now quite large. In just 4
> months, it has grown from 0 to 335 megs. Is there a way to pare that
> file down?


If you're running a sitewide AWL on any kind of scale beyond a few tens 
of domains, and a couple hundred accounts, you should probably look at 
putting it in SQL - it's a *lot* easier to maintain there.


Beyond that, I've never seen a real solution better than the 
trim_whitelist script I adapted from the check_whitelist script way back 
with SA2.5something.  I still use it on a couple of low-volume legacy 
domain servers to keep their BDB-based AWL from running away on me.


Unfortunately I've also seen a couple of reports on this list that it 
tends to hog memory and will probably bog down your mail system if you 
run it on a large AWL file.  :/  I don't recall if there was any 
workaround posted.


Call it from cron regularly (I find daily works well).  Note you'll need 
up to double the disk space occupied by the current file as it 
completely rewrites the whole thing in order to actually reduce the disk 
usage - simply deleting entries won't shrink the file.


http://www.deepnet.cx/~kdeugau/spamtools/

> Or should I maybe just stop spamd, delete the file, and restart
> spamd? (will that do me any harm?).


Well, by deleting the file, you purge all of the history various senders 
have acquired on your system - it may not do any harm, but it may also 
cause a few FPs for senders whose first message after the deletion 
happens to be spammier than usual.


-kgd


corpus ham/spam balance

2009-08-26 Thread Savoy, Jim
Hi all,

   Back in the spring, someone mentioned that it is good to have your
ham/spam ratio close to even. I have a site-wide set-up and while it
seemed to be working perfectly, I did notice that when I did an
"sa-learn --dump magic", my ham-to-spam ratio was almost 7:1 (2,200,000
ham to only 330,000 spam). I made two small changes to my set-up: 1) I
changed the threshold for ham from -1.0 to -3.0, and 2) I turned on
auto-whitelisting. It seems to have done the trick. On April 17 my
ham/spam ratio was 6.68:1 and today (Aug 26) it sits at 1.95:1!
(2,900,000 ham to 1,480,000 spam).

   I see that my auto-whitelist file is now quite large. In just 4
months, it has grown from 0 to 335 megs. Is there a way to pare that
file down? Or should I maybe just stop spamd, delete the file, and
restart spamd? (Will that do me any harm?) Thanks.

- jim -


Re: lottery message scored hammy by bayes

2009-08-26 Thread Benny Pedersen

On Wed 26 Aug 2009 15:23:01 CEST, Karsten Bräckelmann wrote:

>> X-Spam-testscores: BAYES_00=-2.599,HTML_MESSAGE=0.001,MISSING_HEADERS=5.7,
>> SUBJ_ALL_CAPS=3.1,UPPERCASE_75_100=1.528
>
> That MISSING_HEADERS score is custom, and *way* off-base IMHO.

The question is rather which header is missing.

--
xpoint



Re: lottery message scored hammy by bayes

2009-08-26 Thread Karsten Bräckelmann
On Tue, 2009-08-25 at 20:59 -0400, Dennis German wrote:
> email with this content:

Do *NOT* paste spam samples to the list. Use a pastebin or upload them
to your own server and provide a link instead.


> X-Spam-testscores: BAYES_00=-2.599,HTML_MESSAGE=0.001,MISSING_HEADERS=5.7,
> SUBJ_ALL_CAPS=3.1,UPPERCASE_75_100=1.528

That MISSING_HEADERS score is custom, and *way* off-base IMHO.





Re: sa: lottery message scored hammy by bayes:salearn --dump magin

2009-08-26 Thread Karsten Bräckelmann
On Tue, 2009-08-25 at 21:21 -0400, Dennis German wrote:
> sa-learn --dump magic
> config: could not find site rules directory
> 0.000  0  3  0  non-token data: bayes db version
> 0.000  0 262297  0  non-token data: nspam
> 0.000  0  24621  0  non-token data: nham
> 0.000  0 142776  0  non-token data: ntokens

Recalling some fuzzy bits about your system and setup, I wonder if that
is the Bayes DB of the *scanning* user -- or a different one you have
been manually training.
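
As a side note, the ham/spam balance can be read straight off this output. The following Python sketch parses the "non-token data" lines and computes the spam-to-ham ratio; the function name is invented for the example, and it assumes the column layout shown in the paste above (the count in the third column).

```python
def parse_dump_magic(output: str) -> dict:
    """Extract the "non-token data" counters from sa-learn --dump magic text.

    Hypothetical helper; assumes the value sits in the third column,
    as in the output quoted in this thread.
    """
    counts = {}
    for line in output.splitlines():
        fields = line.split()
        # Data lines look like: 0.000 0 262297 0 non-token data: nspam
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            counts[" ".join(fields[6:])] = int(fields[2])
    return counts


if __name__ == "__main__":
    sample = """config: could not find site rules directory
0.000  0  3  0  non-token data: bayes db version
0.000  0 262297  0  non-token data: nspam
0.000  0  24621  0  non-token data: nham
0.000  0 142776  0  non-token data: ntokens"""
    c = parse_dump_magic(sample)
    print("spam:ham ratio = %.2f:1" % (c["nspam"] / c["nham"]))
```

For the numbers quoted here that works out to a little over ten spam for every ham learned, the opposite imbalance from the one discussed in the corpus balance thread.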





Auto-Learn Thresholds (was: lottery message scored hammy by bayes)

2009-08-26 Thread Karsten Bräckelmann
On Tue, 2009-08-25 at 22:13 -0400, Alex wrote:
> > If you're using autolearning, what are your learning thresholds?
> 
> What do you recommend for thresholds? I'm considering using
> autolearning, but very concerned about corrupting the database. I
> think I would use something like +15 for spam.

I generally recommend the defaults, unless you *do* know you need
something else. That's why they are defaults.

That's <= 0.1 for ham and >= 12.0 for spam. Keep in mind these scores
are calculated using a non-Bayes score set, so they generally differ
from the overall score of the message. Also, this does not take various
specific rules' scores into account, like Bayes and AWL. Plus some more
esoteric constraints.

See the docs. [1]


> There are FNs on occasion in the 2.x range with low bayes numbers (or
> BAYES_50) that I wouldn't want to be tagged as ham. Should that be a
> concern?

No.  Bayes auto-learning is *not* self-feeding.

Any overall score of about 2 (with Bayes) is *very* unlikely to cross
either threshold when using the respective non-Bayes score-set.

Moreover, your concern is with Bayes probability <= 50%, and thus a
negative score for the BAYES hit. This hit is not considered for
auto-learning, though, so as a first rule of thumb subtract that score
again, which yields a slightly higher score. That is still nowhere near
either threshold.
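
The rule of thumb above can be sketched in a few lines of Python. This is a simplification and not SpamAssassin's actual code: it only drops the hits flagged noautolearn and compares the remainder against the default thresholds, while the real plugin also switches to the non-Bayes score set and applies the further constraints described in the docs. The function name is invented, and the hit scores are the ones from the lottery message headers quoted in the other thread.

```python
HAM_THRESHOLD = 0.1    # default ham threshold mentioned above
SPAM_THRESHOLD = 12.0  # default spam threshold mentioned above


def autolearn_decision(hits, noautolearn):
    """Return (decision, score) for a dict of rule name -> score.

    Simplified sketch: rules listed in noautolearn (Bayes, AWL,
    whitelisting) are left out of the sum before comparing.
    """
    score = sum(s for name, s in hits.items() if name not in noautolearn)
    if score >= SPAM_THRESHOLD:
        return "spam", score
    if score <= HAM_THRESHOLD:
        return "ham", score
    return "no", score


if __name__ == "__main__":
    hits = {
        "BAYES_00": -2.599,
        "HTML_MESSAGE": 0.001,
        "MISSING_HEADERS": 5.7,
        "SUBJ_ALL_CAPS": 3.1,
        "UPPERCASE_75_100": 1.528,
    }
    print(autolearn_decision(hits, noautolearn={"BAYES_00", "AWL"}))
```

With the Bayes hit removed the message lands between the two thresholds, so under this sketch it would be learned neither as ham nor as spam.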


> Even mail that has been whitelisted could also contain spam, so would
> a ham threshold of like -100 work, or present the same problem?

60_whitelist.cf:  tflags USER_IN_WHITELIST  userconf nice noautolearn

Again, as per the docs [1], whitelisting will not be considered for the
decision whether to auto-learn or not.

  guenther


[1] http://spamassassin.apache.org/full/3.2.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html
