The Inbox problem (was: sa-learn problems and comprehension question)

2010-11-10 Thread Karsten Bräckelmann
On Wed, 2010-11-10 at 18:04 +0100, Karsten Bräckelmann wrote:
> On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:

> > > Using the Inbox rather than a dedicated ham folder therefore is NOT a
> > > good idea.
> > 
> > The problem is that I can't persuade about 120 users to store all their
> > ham under one defined folder. They want to sort their mail into several
> > folders they created on their own.
> 
> You should be fine with some initial training of hand-sorted ham in a
> dedicated folder. Then let auto-learn kick in.

Other than the already established issues of delete and expunge, and
re-learning the same messages over and over again, there is another
problem with the "learn from Inbox" approach.

You would, effectively, implement a rather poor auto-learning mechanism.
The auto-learning available in SA does have quite some constraints, to
prevent bad training. Automatic, non-supervised training of the Inbox
does not have *any* such precaution, and instead blindly trusts the
initial classification.

You cannot tell your users to collect hand-classified ham? Do you really
believe you can tell your users, not *ever* to just delete the
occasional spam in their Inbox, rather than moving it for training? So
the user is in a hurry, he's late for the meeting, the project deadline
is close, the day's been a disaster anyway and the headache... No, don't
want that junk. Delete. Off there goes the spam into the great bin-
bucket, after it has been learned as ham.

And then there's the colleague in the next cubicle, on vacation for
three weeks. Meanwhile, all the FNs (false negatives) piling up in his
Inbox are being trained as ham, even though all that would have been
needed to classify them properly is a bit of Bayes training to cross
the threshold. As spam, though, not ham -- bootstrapping as a
disservice to everyone else in the office.


No, automatically training the Inbox is not a safe approach.

I recommend reading the sections Getting Started and Effective Training
in the sa-learn man page.


-- 
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: sa-learn problems and comprehension question

2010-11-10 Thread Martin Gregorie
On Wed, 2010-11-10 at 19:01 +0100, Karsten Bräckelmann wrote:
> On Wed, 2010-11-10 at 17:48 +, Martin Gregorie wrote:
> > On Wed, 2010-11-10 at 18:16 +0100, Karsten Bräckelmann wrote:
> > > > > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > > > > 
> > > > > which is a bit slower but avoids the command line overflow by running
> > > > > sa-learn on every matching file.
> > > 
> > > A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
> > > spawning a full Perl process for every single mail...
> > 
> > Indeed, but when I wrote that I had no idea that the directory had
> > permanent content. I was under the impression that its content would be
> > deleted or archived after it had been learned and the next run would be
> > dealing with new messages.
> 
> Oh, I didn't mean to bitch at you personally. :)
> 
> I just thought the issues should be pointed out and clearly documented,
> rather than keeping it in the archives without a warning.
> 
Good point, and one well worth making clear.

Martin




Re: sa-learn problems and comprehension question

2010-11-10 Thread Karsten Bräckelmann
On Wed, 2010-11-10 at 17:48 +, Martin Gregorie wrote:
> On Wed, 2010-11-10 at 18:16 +0100, Karsten Bräckelmann wrote:
> > > > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > > > 
> > > > which is a bit slower but avoids the command line overflow by running
> > > > sa-learn on every matching file.
> > 
> > A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
> > spawning a full Perl process for every single mail...
> 
> Indeed, but when I wrote that I had no idea that the directory had
> permanent content. I was under the impression that its content would be
> deleted or archived after it had been learned and the next run would be
> dealing with new messages.

Oh, I didn't mean to bitch at you personally. :)

I just thought the issues should be pointed out and clearly documented,
rather than keeping it in the archives without a warning.





Re: sa-learn problems and comprehension question

2010-11-10 Thread Martin Gregorie
On Wed, 2010-11-10 at 18:16 +0100, Karsten Bräckelmann wrote:
> > > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > > 
> > > which is a bit slower but avoids the command line overflow by running
> > > sa-learn on every matching file.
> 
> A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
> spawning a full Perl process for every single mail...
> 
Indeed, but when I wrote that I had no idea that the directory had
permanent content. I was under the impression that its content would be
deleted or archived after it had been learned and the next run would be
dealing with new messages.


Martin





Re: sa-learn problems and comprehension question

2010-11-10 Thread Karsten Bräckelmann
On Wed, 2010-11-10 at 08:56 +0100, Marcin Mirosław wrote:
> On 2010-11-10 07:37, Karl Meyer wrote:
> > But the 15 new messages hadn't been learned yet.
> > 
> > I had 10 messages in my inbox and ran sa-learn on that folder. Then I got 15
> > different new messages and ran sa-learn again. But it said that it
> > learned from 0 messages.

How many messages did sa-learn claim to have examined?

> Do you run SA from the SMTP server? Probably yes: Bayes auto-learned
> those 15 emails earlier, when SA was called from the MTA.

To verify this assumption, and test your sa-learn command, try --forget
and then --ham again.
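
A minimal sketch of that check, assuming one or more message files are
passed in. The function name relearn_as_ham and the SA_LEARN override
variable are hypothetical; --forget and --ham are the sa-learn options
named above.

```shell
# Hypothetical helper: forget the given messages, then learn them again as ham.
# SA_LEARN is an assumed override so the command can be swapped out for testing.
relearn_as_ham() {
    "${SA_LEARN:-sa-learn}" --forget "$@" &&
    "${SA_LEARN:-sa-learn}" --ham "$@"
}
```

If the second run now reports that it learned from the message, the
earlier "learned from 0" most likely meant the message had already been
seen, for example through auto-learning.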





Re: sa-learn problems and comprehension question

2010-11-10 Thread Karsten Bräckelmann
> > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > 
> > which is a bit slower but avoids the command line overflow by running
> > sa-learn on every matching file.

A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
spawning a full Perl process for every single mail...

> You can use xargs to avoid calling sa-learn too often. xargs passes only as
> many parameters as fit on one command line.
> 
> find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -print |xargs sa-learn

Definitely! Without it, the box will be burning in hell, doomed to
constantly spawn Perl processes until the end of time.

Using the find result to populate a list in a temporary file, and passing
that list to sa-learn -f, is even better: a single Perl process per
learning iteration.


Also, since this is not Maildir and the above path looks suspiciously
like it contains directories (mail sub-folders), the find -maxdepth
option will be required to prevent learning *all* of the user's mail.
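
The temporary-file approach could be sketched roughly as follows. The
names learn_folder and SA_LEARN are hypothetical; -f is the sa-learn
option mentioned above, and -maxdepth is a GNU/BSD find extension rather
than POSIX.

```shell
# Hypothetical helper: collect the message files directly inside one folder
# into a temporary list, then hand that list to a single sa-learn run.
learn_folder() {
    kind=$1      # --ham or --spam
    folder=$2
    list=$(mktemp) || return 1
    # -maxdepth 1 keeps find out of mail sub-folders; -type f skips directories.
    find "$folder" -maxdepth 1 -type f -name '[0-9]*.' -print > "$list"
    rc=0
    if [ -s "$list" ]; then
        # SA_LEARN is an assumed override for testing; defaults to sa-learn.
        "${SA_LEARN:-sa-learn}" "$kind" -f "$list"
        rc=$?
    fi
    rm -f "$list"
    return "$rc"
}
```

Something like learn_folder --ham /var/spool/imap/user/kmeyer would then
start one Perl process, however many messages the folder holds.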





Re: sa-learn problems and comprehension question

2010-11-10 Thread Karsten Bräckelmann
On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:
> > > --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
> > ^^^
> > This is dangerous. With lots of mail in the (Maildir?) folder, shell
> > expansion *quickly* will exceed the command line length limit.
> >
> > The trailing dot also looks bad.
> 
> This is a good argument. I'll think about that. The folder is a Cyrus IMAP
> folder. I used the [0-9]*. expression because each Cyrus folder contains
> messages with numbered filenames with a trailing dot, plus four cyrus.* files
> (cache, index,...).

Not Maildir, and passing the dir itself doesn't seem like an option
either in the Cyrus case. Man pages to the rescue.

The Description section of the sa-learn man page holds this.

 "Note that csh-style globbing in the mail folder names is supported; in
  other words, listing a folder name as "*" will scan every folder that
  matches.  See "Mail::SpamAssassin::ArchiveIterator" for more details."

So, something similar to the above is possible. However, you will need
to escape or quote the globbing, to prevent the su-spawned shell from
expanding it (as the above does), but pass the glob to sa-learn.
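
A small sketch of the quoting difference, using plain sh and printf
instead of su and sa-learn so it can be tried safely. The helper names
are hypothetical.

```shell
# Both helpers run an inner shell in directory $1 and print, one per line,
# the arguments a command would receive for the pattern [0-9]*.
args_unquoted() { ( cd "$1" && sh -c "printf '%s\n' [0-9]*." ); }
args_quoted()   { ( cd "$1" && sh -c "printf '%s\n' '[0-9]*.'" ); }

# Applied to the command from this thread, the quoted form would look like
# this (untested sketch, --dbpath left out for brevity):
#   su -c "/usr/bin/sa-learn --ham --showdots '/var/spool/imap/user/kmeyer/[0-9]*.'" amavis
```

In a folder holding three numbered message files, args_unquoted prints
three file names, while args_quoted prints the literal pattern, which is
what sa-learn needs to receive in order to do the globbing itself.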


> > A word of caution. There is no move command with IMAP. Instead, it is
> > copy and delete. Or rather mark-for-deletion, since there is no delete
> > command either. That's expunge.
> 
> That's right. But is this a problem?
> First, I learn ham from the inbox folder, then spam from the junk folder. If
> a mail is moved to the junk folder in the meantime and no expunge was done,
> then it is not re-learned as ham, but newly learned as spam from the junk
> folder.

That is a lot of unnecessary work, constantly re-learning messages.

Also, there's a race condition. So your user knows spam will be learned
periodically. And that month's worth of spam backlog in that folder is
ugly. Time to clean it up, and expunge. The Inbox though is precious.
Lots of important stuff and cute kitten attachments. Unless the Inbox is
expunged, too, "deleted" spam from the Inbox will be re-trained as ham
next time. The copy in the spam folder, which could have corrected that
"false training by design", is no more.

(Caveat: I don't know how Cyrus handles deletion or flagging mail as
such. It is not Maildir.)


> > Using the Inbox rather than a dedicated ham folder therefore is NOT a
> > good idea.
> 
> The problem is that I can't persuade about 120 users to store all their ham
> under one defined folder. They want to sort their mail into several folders
> they created on their own.

You should be fine with some initial training of hand-sorted ham in a
dedicated folder. Then let auto-learn kick in.

Some script magic, using 'find' to collect the last n hours' worth of ham
and spam and feeding it to sa-learn with the -f option, might be an
option, too. Once again, please do read the... man page.
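
A rough sketch of such script magic, under a few assumptions: the file
modification time is taken to approximate the arrival time, the folder
paths are supplied by the caller, and recent_messages, learn_recent and
SA_LEARN are hypothetical names. -maxdepth and -mmin are GNU/BSD find
extensions.

```shell
# Hypothetical helper: list message files in one folder that were modified
# within the last N hours.
recent_messages() {
    folder=$1
    hours=$2
    find "$folder" -maxdepth 1 -type f -name '[0-9]*.' \
        -mmin "-$((hours * 60))" -print
}

# Hypothetical wrapper: learn the last N hours of ham and spam,
# with at most one sa-learn run per type.
learn_recent() {
    ham_folder=$1
    spam_folder=$2
    hours=$3
    list=$(mktemp) || return 1
    recent_messages "$ham_folder" "$hours" > "$list"
    if [ -s "$list" ]; then
        "${SA_LEARN:-sa-learn}" --ham -f "$list"
    fi
    recent_messages "$spam_folder" "$hours" > "$list"
    if [ -s "$list" ]; then
        "${SA_LEARN:-sa-learn}" --spam -f "$list"
    fi
    rm -f "$list"
}
```

This only limits how much gets re-learned on each run. It still trusts
whatever sits in the ham folder, so the caveats about unsupervised
training continue to apply.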





Re: sa-learn problems and comprehension question

2010-11-10 Thread Martin Gregorie
On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:
> >This is dangerous. With lots of mail in the (Maildir?) folder, shell
> >expansion *quickly* will exceed the command line length limit.
> >
> > The trailing dot also looks bad.
> 
> This is a good argument. I'll think about that. The folder is a Cyrus IMAP
> folder. I used the [0-9]*. expression because each Cyrus folder contains
> messages with numbered filenames with a trailing dot, plus four cyrus.* files
> (cache, index,...).
> 
Use find:  

find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;

which is a bit slower but avoids the command line overflow by running
sa-learn on every matching file.


Martin




Re: sa-learn problems and comprehension question

2010-11-10 Thread Matus UHLAR - fantomas
> On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:
> > >This is dangerous. With lots of mail in the (Maildir?) folder, shell
> > >expansion *quickly* will exceed the command line length limit.
> > >
> > > The trailing dot also looks bad.
> > 
> > This is a good argument. I'll think about that. The folder is a Cyrus IMAP
> > folder. I used the [0-9]*. expression because each Cyrus folder contains
> > messages with numbered filenames with a trailing dot, plus four cyrus.* files
> > (cache, index,...).

On 10.11.10 11:12, Martin Gregorie wrote:
> Use find:  
> 
> find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> 
> which is a bit slower but avoids the command line overflow by running
> sa-learn on every matching file.

You can use xargs to avoid calling sa-learn too often. xargs passes only as
many parameters as fit on one command line.

find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -print |xargs sa-learn

(well, add --spam or --ham to sa-learn of course)
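
A slightly more defensive variant, as a sketch only. The function name
learn_with_xargs and the SA_LEARN override are hypothetical; -maxdepth,
-print0 and xargs -0 are GNU/BSD extensions, not POSIX.

```shell
# Hypothetical helper: learn every message file directly inside one folder,
# letting xargs batch the file names into as few sa-learn runs as will fit.
learn_with_xargs() {
    kind=$1      # --ham or --spam
    folder=$2
    # NUL-separated names survive spaces and quotes in paths; -maxdepth 1
    # keeps find out of sub-folders.
    find "$folder" -maxdepth 1 -type f -name '[0-9]*.' -print0 |
        xargs -0 "${SA_LEARN:-sa-learn}" "$kind"
}
```

Note that GNU xargs runs the command once even when find matches
nothing, unless -r is given.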
-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I wonder how much deeper the ocean would be without sponges. 


Re: sa-learn problems and comprehension question

2010-11-09 Thread Marcin Mirosław
On 2010-11-10 07:37, Karl Meyer wrote:
> But the 15 new messages hadn't been learned yet.
> 
> I had 10 messages in my inbox and ran sa-learn on that folder. Then I got 15
> different new messages and ran sa-learn again. But it said that it
> learned from 0 messages.

Do you run SA from the SMTP server? Probably yes: Bayes auto-learned
those 15 emails earlier, when SA was called from the MTA.
Regards


Re: sa-learn problems and comprehension question

2010-11-09 Thread Karl Meyer

>> The --dbpath option is bad. Despite its name, it is not a "path", but a
>> prefix. The sa-learn man page states it is in bayes_path form, which is
>> documented in the general SA Conf documentation.

> "This is the directory and filename for Bayes databases". Since your
> given path looks similar to the default, I assume you actually meant to
> keep that SA data in the .spamassassin/ dir. For that, just drop the
> trailing slash. Also, please re-read carefully that part of the docs.

You are right. But it doesn't matter. If I use my --dbpath as above, two
files, bayes_seen and bayes_toks, are created below this folder. If I use
--dbpath /bayes/test, then I get two files, test_seen and test_toks, in
that folder. I'll correct this of course, but it works as before and my
problem is still there.
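
To make the two cases concrete, here is a small sketch that only mirrors
the file names reported above. bayes_files is a hypothetical helper, and
treating a trailing slash as "this is a directory" is a simplifying
assumption, not a description of how sa-learn decides.

```shell
# Hypothetical helper: print the database file names that a given --dbpath
# value led to in the two cases reported in this thread.
bayes_files() {
    case $1 in
        */) prefix="${1}bayes" ;;   # directory form: bayes_* files inside it
        *)  prefix=$1 ;;            # prefix form: <prefix>_toks, <prefix>_seen
    esac
    printf '%s_toks\n%s_seen\n' "$prefix" "$prefix"
}
```

With /bayes/test this prints /bayes/test_toks and /bayes/test_seen; with
/var/amavis/.spamassassin/bayes/ it prints the bayes_toks and bayes_seen
paths inside that directory.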



>> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
> ^^^
>This is dangerous. With lots of mail in the (Maildir?) folder, shell
>expansion *quickly* will exceed the command line length limit.
>
> The trailing dot also looks bad.

This is a good argument. I'll think about that. The folder is a Cyrus IMAP
folder. I used the [0-9]*. expression because each Cyrus folder contains
messages with numbered filenames with a trailing dot, plus four cyrus.* files
(cache, index,...).



> A word of caution. There is no move command with IMAP. Instead, it is
> copy and delete. Or rather mark-for-deletion, since there is no delete
> command either. That's expunge.

That's right. But is this a problem?
First, I learn ham from the inbox folder, then spam from the junk folder. If
a mail is moved to the junk folder in the meantime and no expunge was done,
then it is not re-learned as ham, but newly learned as spam from the junk
folder.



> Using the Inbox rather than a dedicated ham folder therefore is NOT a
> good idea.

The problem is that I can't persuade about 120 users to store all their ham
under one defined folder. They want to sort their mail into several folders
they created on their own.


-- 
View this message in context: 
http://old.nabble.com/sa-learn-problems-and-comprehension-question-tp30172306p30178053.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Re: sa-learn problems and comprehension question

2010-11-09 Thread Karl Meyer


Marcin Mirosław wrote:
> 
>>> and got a message that it learned from n messages. Also, two files
>>> appeared in the dbpath folder. After I got 15 new mails in my inbox, I
>>> executed the same command again. But this time it didn't learn.
> 
>> sa-learn remembers the msgid of each message it has learned, so it will
>> never learn the same email twice. Until you change the msgid ;)
> 


But the 15 new messages hadn't been learned yet.

I had 10 messages in my inbox and ran sa-learn on that folder. Then I got 15
different new messages and ran sa-learn again. But it said that it
learned from 0 messages.

-- 
View this message in context: 
http://old.nabble.com/sa-learn-problems-and-comprehension-question-tp30172306p30177999.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.



Re: sa-learn problems and comprehension question

2010-11-09 Thread Marcin Mirosław
On 2010-11-09 17:24, Bowie Bailey wrote:
> If you learn a message as ham, it will not learn the same message as ham
> a second time (same with spam).  However, you can change your mind and
> learn the message as spam.  Bayes will "forget" what it learned the
> first time and re-learn the message.

Agreed, I wasn't precise.


Re: sa-learn problems and comprehension question

2010-11-09 Thread Karsten Bräckelmann
On Tue, 2010-11-09 at 08:14 -0800, Karl Meyer wrote:
> # su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham

The --dbpath option is bad. Despite its name, it is not a "path", but a
prefix. The sa-learn man page states it is in bayes_path form, which is
documented in the general SA Conf documentation.

"This is the directory and filename for Bayes databases". Since your
given path looks similar to the default, I assume you actually meant to
keep that SA data in the .spamassassin/ dir. For that, just drop the
trailing slash. Also, please re-read carefully that part of the docs.

  --dbpath /var/amavis/.spamassassin/bayes

> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
 ^^^
This is dangerous. With lots of mail in the (Maildir?) folder, shell
expansion *quickly* will exceed the command line length limit.

I believe passing the Maildir directory containing the messages should
do. Someone who actually uses sa-learn with Maildir please correct
me. :)

The trailing dot also looks bad.


> And one more question: I read that I have to learn spam AND ham to make
> Bayes work. I can use the users' Inboxes as the source for ham and the junk
> folder for spam. But: first a spam mail arrives in the inbox (where it gets
> learned as ham). Then the user moves it to the junk folder, where it should
> get learned as spam. Is this a possible configuration? Or does this confuse
> SA?

That is OK. As has been answered already, SA will not learn a given
message twice as the same type. It will, however, revert the previous
learning and re-learn the message if you later correct the type and
learn it as something different.

Please do read the sa-learn man page.

> I can't have one folder for every user in which to store ham. E.g. I
> already have 30 folders that I sort my mail into. I can't have it all in one
> folder. The inbox is the only folder that every mail is sure to arrive in and
> that could be used by sa-learn.

A word of caution. There is no move command with IMAP. Instead, it is
copy and delete. Or rather mark-for-deletion, since there is no delete
command either. That's expunge.

In practice that means, that a mail that supposedly has been "moved" to
e.g. a junk folder, *still* is in its source folder, marked for
deletion, usually not visible to the user -- until that folder has been
expunged. Training Bayes from that folder will also learn from these
invisible, "deleted" messages.

Using the Inbox rather than a dedicated ham folder therefore is NOT a
good idea.





Re: sa-learn problems and comprehension question

2010-11-09 Thread Bowie Bailey
On 11/9/2010 11:16 AM, Marcin Mirosław wrote:
> On 09.11.2010 17:14, Karl Meyer wrote:
>> Hi,
>>
>> I want to configure Bayes learning and still have some problems and
>> questions after reading several tutorials:
>>
>>
>> I executed sa-learn for my inbox
>> # su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham
>> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
>>
>> and got a message that it learned from n messages. Also, two files
>> appeared in the dbpath folder. After I got 15 new mails in my inbox, I
>> executed the same command again. But this time it didn't learn.
> sa-learn remembers the msgid of each message it has learned, so it will
> never learn the same email twice. Until you change the msgid ;)
> Regards

If you learn a message as ham, it will not learn the same message as ham
a second time (same with spam).  However, you can change your mind and
learn the message as spam.  Bayes will "forget" what it learned the
first time and re-learn the message.

-- 
Bowie


Re: sa-learn problems and comprehension question

2010-11-09 Thread Marcin Mirosław
On 09.11.2010 17:14, Karl Meyer wrote:
> 
> Hi,
> 
> I want to configure Bayes learning and still have some problems and
> questions after reading several tutorials:
> 
> 
> I executed sa-learn for my inbox
> # su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham
> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
> 
> and got a message that it learned from n messages. Also, two files
> appeared in the dbpath folder. After I got 15 new mails in my inbox, I
> executed the same command again. But this time it didn't learn.

sa-learn remembers the msgid of each message it has learned, so it will
never learn the same email twice. Until you change the msgid ;)
Regards


sa-learn problems and comprehension question

2010-11-09 Thread Karl Meyer

Hi,

I want to configure Bayes learning and still have some problems and
questions after reading several tutorials:


I executed sa-learn for my inbox
# su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham
--showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis

and got a message that it learned from n messages. Also, two files
appeared in the dbpath folder. After I got 15 new mails in my inbox, I
executed the same command again. But this time it didn't learn. I also
executed the command on several other inboxes (which contain mail), and it
didn't learn from any of them.

I also removed the two files from the dbpath folder and re-executed the
command from above. I expected that it would learn at least as much as
the first time, but it said 'learned from 0' and the files were
recreated. Why doesn't it learn anything more? Using -D didn't show
anything unusual.


And one more question: I read that I have to learn spam AND ham to make
Bayes work. I can use the users' Inboxes as the source for ham and the junk
folder for spam. But: first a spam mail arrives in the inbox (where it gets
learned as ham). Then the user moves it to the junk folder, where it should
get learned as spam. Is this a possible configuration? Or does this confuse
SA? I can't have one folder for every user in which to store ham. E.g. I
already have 30 folders that I sort my mail into. I can't have it all in one
folder. The inbox is the only folder that every mail is sure to arrive in and
that could be used by sa-learn.


Regards
-- 
View this message in context: 
http://old.nabble.com/sa-learn-problems-and-comprehension-question-tp30172306p30172306.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.