Re: sa-learn problems and comprehension question
On Wed, 2010-11-10 at 19:01 +0100, Karsten Bräckelmann wrote:
> On Wed, 2010-11-10 at 17:48 +, Martin Gregorie wrote:
> > On Wed, 2010-11-10 at 18:16 +0100, Karsten Bräckelmann wrote:
> > > > > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > > > >
> > > > > which is a bit slower but avoids the command line overflow by running
> > > > > sa-learn on every matching file.
> > >
> > > A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
> > > spawning a full Perl process for every single mail...
> >
> > Indeed, but when I wrote that I had no idea that the directory had
> > permanent content. I was under the impression that its content would be
> > deleted or archived after it had been learned and the next run would be
> > dealing with new messages.
>
> Oh, I didn't mean to bitch at you personally. :)
>
> I just thought the issues should be pointed out and clearly documented,
> rather than keeping it in the archives without a warning.
>
Good point, and one well worth making clear.

Martin
Re: sa-learn problems and comprehension question
On Wed, 2010-11-10 at 17:48 +, Martin Gregorie wrote:
> On Wed, 2010-11-10 at 18:16 +0100, Karsten Bräckelmann wrote:
> > > > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > > >
> > > > which is a bit slower but avoids the command line overflow by running
> > > > sa-learn on every matching file.
> >
> > A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
> > spawning a full Perl process for every single mail...
>
> Indeed, but when I wrote that I had no idea that the directory had
> permanent content. I was under the impression that its content would be
> deleted or archived after it had been learned and the next run would be
> dealing with new messages.

Oh, I didn't mean to bitch at you personally. :)

I just thought the issues should be pointed out and clearly documented,
rather than keeping it in the archives without a warning.
Re: sa-learn problems and comprehension question
On Wed, 2010-11-10 at 18:16 +0100, Karsten Bräckelmann wrote:
> > > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> > >
> > > which is a bit slower but avoids the command line overflow by running
> > > sa-learn on every matching file.
>
> A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
> spawning a full Perl process for every single mail...
>
Indeed, but when I wrote that I had no idea that the directory had
permanent content. I was under the impression that its content would be
deleted or archived after it had been learned and the next run would be
dealing with new messages.

Martin
Re: sa-learn problems and comprehension question
On Wed, 2010-11-10 at 08:56 +0100, Marcin Mirosław wrote:
> On 2010-11-10 07:37, Karl Meyer wrote:
> > But the 15 new messages weren't learned yet.
> >
> > I had 10 messages in my inbox and ran sa-learn on that folder. Then I got 15
> > different new messages and re-ran sa-learn. But it said that it
> > learned from 0 messages.

How many messages did sa-learn claim to have examined?

> Do you run SA from your SMTP server? If so, Bayes probably auto-learned
> those 15 emails earlier, when SA was called from the MTA.

To verify this assumption, and test your sa-learn command, try --forget and
then --ham again.
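
The --forget / --ham test suggested above could look roughly like this. It is only a sketch: `SA_LEARN` and `relearn_ham` are names made up for the example, and the message file in the usage line is hypothetical. Any further options (such as the --dbpath used elsewhere in this thread) would have to be passed exactly as in the original command.

```shell
#!/bin/sh
# Sketch only: SA_LEARN and relearn_ham are illustrative names,
# not part of SpamAssassin.
SA_LEARN=${SA_LEARN:-sa-learn}

# relearn_ham ARGS...
# First make Bayes forget the given message(s), then learn them as ham again.
# ARGS are handed to sa-learn unchanged (options and message files).
relearn_ham() {
    "$SA_LEARN" --forget "$@" && "$SA_LEARN" --ham "$@"
}
```

Usage, with a hypothetical message file: `relearn_ham '/var/spool/imap/user/kmeyer/42.'`. If the second step now reports that it learned from the message, the earlier "0 messages" was most likely because Bayes had already seen it.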
Re: sa-learn problems and comprehension question
> > find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
> >
> > which is a bit slower but avoids the command line overflow by running
> > sa-learn on every matching file.

A "bit" slower. Periodically re-learning the entire Inbox of 100+ users,
spawning a full Perl process for every single mail...

> you can use xargs then, to avoid calling sa-learn too often. xargs passes
> only as many parameters as fit.
>
> find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -print | xargs sa-learn

Definitely! Without it, the box will be burning in hell, doomed to
constantly spawn Perl processes until the end of time.

Using the find result to populate a list in a temporary file, and using that
with sa-learn -f, is even better. A single Perl process per learning
iteration.

Also, since this is not Maildir and the above path looks suspiciously like it
contains directories (mail sub-folders), the find -maxdepth option will be
required to prevent learning *all* the user's mail.
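
A minimal sketch of the "list in a temporary file plus sa-learn -f" approach described above, assuming a find that supports -maxdepth (GNU and BSD find do). `SA_LEARN` and `learn_folder` are names invented for this example; --dbpath, su and other site-specific details are left out.

```shell
#!/bin/sh
# Sketch only: SA_LEARN and learn_folder are illustrative names.
SA_LEARN=${SA_LEARN:-sa-learn}

# learn_folder KIND DIR   (KIND is "ham" or "spam")
# Learns the message files directly inside DIR with one sa-learn process.
learn_folder() {
    kind=$1
    dir=$2
    list=$(mktemp) || return 1
    # -maxdepth 1 keeps find out of mail sub-folders; -type f skips directories.
    # The pattern matches names that start with a digit and end in a dot,
    # which leaves out the cyrus.* files.
    find "$dir" -maxdepth 1 -type f -name '[0-9]*.' > "$list"
    rc=0
    # Only start sa-learn when the list is not empty.
    if [ -s "$list" ]; then
        "$SA_LEARN" "--$kind" -f "$list" || rc=$?
    fi
    rm -f "$list"
    return "$rc"
}
```

Usage: `learn_folder ham /var/spool/imap/user/kmeyer`. The list file holds one path per line, so file names containing newlines would break it; numbered Cyrus message files should not have that problem.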
Re: sa-learn problems and comprehension question
On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:
> > > --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
> >                                          ^^^
> > This is dangerous. With lots of mail in the (Maildir?) folder, shell
> > expansion *quickly* will exceed the command line length limit.
> >
> > The trailing dot also looks bad.
>
> This is a good argument. I'll think about that. The folder is a Cyrus IMAP
> folder. I used the [0-9]*. expression because each Cyrus folder contains
> messages with numbered filenames with a trailing dot, and four cyrus.* files
> (cache, index,...).

Not Maildir, and passing the dir itself doesn't seem like an option either
in the Cyrus case.

Man pages to the rescue. The Description section of the sa-learn man page
holds this.

  "Note that csh-style globbing in the mail folder names is supported; in
  other words, listing a folder name as "*" will scan every folder that
  matches. See "Mail::SpamAssassin::ArchiveIterator" for more details."

So, something similar to the above is possible. However, you will need to
escape or quote the globbing, to prevent the su-spawned shell from
expanding it (as the above does), but pass the glob to sa-learn.

> > A word of caution. There is no move command with IMAP. Instead, it is
> > copy and delete. Or rather mark-for-deletion, since there is no delete
> > command either. That's expunge.
>
> That's right. But is this a problem?
> First, I learn ham from the inbox folder, then spam from the junk folder. If
> a mail is moved to the junk folder meanwhile and no expunge was done, then
> it isn't re-learned as ham, and it is newly learned as spam from the junk
> folder.

That is a lot of unnecessary work, constantly re-learning messages. Also,
there's a race condition.

So your user knows spam will be learned periodically. And that month's worth
of spam backlog in that folder is ugly. Time to clean it up, and expunge.
The Inbox though is precious. Lots of important stuff and cute kitten
attachments.

Unless the Inbox is expunged, too, "deleted" spam from the Inbox will be
re-trained next time. The copy in the spam folder is no more, to correct
that "false training by design".

(Caveat: I don't know how Cyrus handles deletion or flagging mail as such.
It is not Maildir.)

> > Using the Inbox rather than a dedicated ham folder therefore is NOT a
> > good idea.
>
> The problem is that I can't persuade about 120 users to store all their ham
> below a defined folder. They want to sort their mails into several folders
> they created on their own.

You should be fine with some initial training of hand-sorted ham in a
dedicated folder. Then let auto-learn kick in.

Some script magic, using 'find' to collect the last n hours' worth of ham
and spam, might be an option, too. Used with the sa-learn -f option.

Once again, please do read the... man page.
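
The "script magic" mentioned above might be sketched as follows. This assumes a find with -maxdepth and -mmin (GNU and BSD find have both) and assumes that a message file's modification time is close to its arrival time, which has not been verified for Cyrus here. `recent_messages` is a name invented for this example.

```shell
#!/bin/sh
# Sketch only: recent_messages is an illustrative name.

# recent_messages DIR HOURS
# Prints the message files directly inside DIR that were modified
# within the last HOURS hours, one path per line.
recent_messages() {
    dir=$1
    hours=$2
    # -mmin takes minutes; a leading minus means "less than that long ago".
    find "$dir" -maxdepth 1 -type f -name '[0-9]*.' -mmin "-$((hours * 60))"
}
```

The output can be redirected to a file and handed to sa-learn, for example `recent_messages /var/spool/imap/user/kmeyer 24 > ham.list` followed by `sa-learn --ham -f ham.list` (the list file name is arbitrary).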
Re: sa-learn problems and comprehension question
On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:
> > This is dangerous. With lots of mail in the (Maildir?) folder, shell
> > expansion *quickly* will exceed the command line length limit.
> >
> > The trailing dot also looks bad.
>
> This is a good argument. I'll think about that. The folder is a Cyrus IMAP
> folder. I used the [0-9]*. expression because each Cyrus folder contains
> messages with numbered filenames with a trailing dot, and four cyrus.* files
> (cache, index,...).
>
Use find:

find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;

which is a bit slower but avoids the command line overflow by running
sa-learn on every matching file.

Martin
Re: sa-learn problems and comprehension question
> On Tue, 2010-11-09 at 22:57 -0800, Karl Meyer wrote:
> > > This is dangerous. With lots of mail in the (Maildir?) folder, shell
> > > expansion *quickly* will exceed the command line length limit.
> > >
> > > The trailing dot also looks bad.
> >
> > This is a good argument. I'll think about that. The folder is a Cyrus IMAP
> > folder. I used the [0-9]*. expression because each Cyrus folder contains
> > messages with numbered filenames with a trailing dot, and four cyrus.* files
> > (cache, index,...).

On 10.11.10 11:12, Martin Gregorie wrote:
> Use find:
>
> find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -exec sa-learn {} \;
>
> which is a bit slower but avoids the command line overflow by running
> sa-learn on every matching file.

you can use xargs then, to avoid calling sa-learn too often. xargs passes
only as many parameters as fit.

find /var/spool/imap/user/kmeyer/ -name '[0-9]*.' -print | xargs sa-learn

(well, add --spam or --ham to sa-learn of course)

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
I wonder how much deeper the ocean would be without sponges.
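
A slightly more defensive variant of the xargs pipeline above, as a sketch. It uses -print0 with xargs -0 (supported by GNU and BSD tools, not by every Unix) so that unusual file names cannot be split, and it adds -maxdepth 1 and -type f, which are not in the original command. `SA_LEARN` and `learn_with_xargs` are names invented for this example.

```shell
#!/bin/sh
# Sketch only: SA_LEARN and learn_with_xargs are illustrative names.
SA_LEARN=${SA_LEARN:-sa-learn}

# learn_with_xargs KIND DIR   (KIND is "ham" or "spam")
# xargs starts as few sa-learn processes as the command line limit allows.
learn_with_xargs() {
    kind=$1
    dir=$2
    find "$dir" -maxdepth 1 -type f -name '[0-9]*.' -print0 |
        xargs -0 "$SA_LEARN" "--$kind"
}
```

Usage: `learn_with_xargs spam /var/spool/imap/user/kmeyer/Junk` (the folder name is hypothetical). Depending on the xargs implementation, the command may still be run once when find matches nothing.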
Re: sa-learn problems and comprehension question
On 2010-11-10 07:37, Karl Meyer wrote:
> But the 15 new messages weren't learned yet.
>
> I had 10 messages in my inbox and ran sa-learn on that folder. Then I got 15
> different new messages and re-ran sa-learn. But it said that it
> learned from 0 messages.

Do you run SA from your SMTP server? If so, Bayes probably auto-learned
those 15 emails earlier, when SA was called from the MTA.

Regards
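
One way to check the auto-learn theory is to compare the Bayes message counters before and after mail arrives. The sketch below relies on `sa-learn --dump magic` and assumes its usual layout, where each line ends in the counter name (nham, nspam) and the value sits in the third column; that layout is an assumption and should be checked against the installed version. `SA_LEARN` and `bayes_count` are names invented for this example.

```shell
#!/bin/sh
# Sketch only: SA_LEARN and bayes_count are illustrative names.
SA_LEARN=${SA_LEARN:-sa-learn}

# bayes_count NAME [SA_LEARN_OPTIONS...]
# Prints the value of one counter (for example nham or nspam).
# Assumption: the value is in column 3 and NAME is the last field on the line.
bayes_count() {
    name=$1
    shift
    "$SA_LEARN" "$@" --dump magic |
        awk -v name="$name" '$NF == name { print $3 }'
}
```

Usage: `bayes_count nham --dbpath /var/amavis/.spamassassin/bayes`, run as the user that owns the database. If nham grows while no manual sa-learn run took place, something else (such as auto-learn) is feeding Bayes.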
Re: sa-learn problems and comprehension question
> The --dbpath option is bad. Despite its name, it is not a "path", but a
> prefix. The sa-learn man page states it is in bayes_path form, which is
> documented in the general SA Conf documentation.
> "This is the directory and filename for Bayes databases". Since your
> given path looks similar to the default, I assume you actually meant to
> keep that SA data in the .spamassassin/ dir. For that, just drop the
> trailing slash. Also, please re-read carefully that part of the docs.

You are right. But it doesn't matter. If I used my --dbpath like above, two
files bayes_seen and bayes_toks were created below this folder. If I use
--dbpath /bayes/test, then I get two files test_seen and test_toks in that
folder. I'll correct this of course, but it works like before and my
problem is still there.

> > --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
>                                          ^^^
> This is dangerous. With lots of mail in the (Maildir?) folder, shell
> expansion *quickly* will exceed the command line length limit.
>
> The trailing dot also looks bad.

This is a good argument. I'll think about that. The folder is a Cyrus IMAP
folder. I used the [0-9]*. expression because each Cyrus folder contains
messages with numbered filenames with a trailing dot, and four cyrus.* files
(cache, index,...).

> A word of caution. There is no move command with IMAP. Instead, it is
> copy and delete. Or rather mark-for-deletion, since there is no delete
> command either. That's expunge.

That's right. But is this a problem?
First, I learn ham from the inbox folder, then spam from the junk folder. If
a mail is moved to the junk folder meanwhile and no expunge was done, then
it isn't re-learned as ham, and it is newly learned as spam from the junk
folder.

> Using the Inbox rather than a dedicated ham folder therefore is NOT a
> good idea.

The problem is that I can't persuade about 120 users to store all their ham
below a defined folder. They want to sort their mails into several folders
they created on their own.
Re: sa-learn problems and comprehension question
Marcin Mirosław wrote:
>> and got a message that it learned from n messages. Also, two files
>> appeared in the dbpath folder. After I got 15 new mails in my inbox, I
>> executed the same command again. But this time it didn't learn.
>
> sa-learn remembers the msgid of each message it has learned; it will never
> learn the same email twice. Until you change the msgid ;)

But the 15 new messages weren't learned yet.

I had 10 messages in my inbox and ran sa-learn on that folder. Then I got 15
different new messages and re-ran sa-learn. But it said that it
learned from 0 messages.
Re: sa-learn problems and comprehension question
On 2010-11-09 17:24, Bowie Bailey wrote:
> If you learn a message as ham, it will not learn the same message as ham
> a second time (same with spam). However, you can change your mind and
> learn the message as spam. Bayes will "forget" what it learned the
> first time and re-learn the message.

Agreed, I wasn't precise.
Re: sa-learn problems and comprehension question
On Tue, 2010-11-09 at 08:14 -0800, Karl Meyer wrote:
> # su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham

The --dbpath option is bad. Despite its name, it is not a "path", but a
prefix. The sa-learn man page states it is in bayes_path form, which is
documented in the general SA Conf documentation.

  "This is the directory and filename for Bayes databases".

Since your given path looks similar to the default, I assume you actually
meant to keep that SA data in the .spamassassin/ dir. For that, just drop
the trailing slash. Also, please re-read carefully that part of the docs.

  --dbpath /var/amavis/.spamassassin/bayes

> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
                                         ^^^
This is dangerous. With lots of mail in the (Maildir?) folder, shell
expansion *quickly* will exceed the command line length limit.

I believe passing the Maildir directory containing the messages should do.
Someone who actually uses sa-learn with Maildir please correct me. :)

The trailing dot also looks bad.

> And one more question: I read that I have to learn spam AND ham to make
> Bayes work. I can use the Inboxes of users as source for ham and the junk
> folder for spam. But: first a spam mail comes to the inbox (where it gets
> learned as ham). Then the user moves it to the junk folder, where it should
> get learned as spam. Is this a possible configuration? Or does this confuse
> SA?

That is OK. As has been answered already, SA will not learn a given message
twice as the same type. It will, however, re-learn and revert the previous
learning, if you later correct the type and learn as something different.

Please do read the sa-learn man page.

> I can't have one folder for every user where to store ham. E.g. I
> already have 30 folders where I sort my mails in. I can't have it all in one
> folder. The inbox is the only folder where every mail surely gets in and
> could be used by sa-learn.

A word of caution. There is no move command with IMAP. Instead, it is copy
and delete. Or rather mark-for-deletion, since there is no delete command
either. That's expunge.

In practice that means that a mail that supposedly has been "moved" to e.g.
a junk folder *still* is in its source folder, marked for deletion, usually
not visible to the user -- until that folder has been expunged. Training
Bayes from that folder will also learn from these invisible, "deleted"
messages.

Using the Inbox rather than a dedicated ham folder therefore is NOT a good
idea.
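
The shell-expansion danger described above can be made visible with a small sketch. `count_args` and `demo_glob` are helpers invented for this illustration; they only show what a command receives and do not call sa-learn.

```shell
#!/bin/sh
# Sketch only: count_args and demo_glob are illustrative names.

# Prints how many arguments it received.
count_args() { echo "$#"; }

# demo_glob DIR
# Shows what a command sees for an unquoted and for a quoted glob.
demo_glob() {
    (
        cd "$1" || exit 1
        # Unquoted: the shell expands the pattern, one argument per matching file.
        count_args [0-9]*.
        # Quoted: the pattern is passed through as a single literal argument.
        count_args '[0-9]*.'
    )
}
```

In a folder with thousands of messages the first number is in the thousands, which is where the command line length limit comes in. The second is always 1, and it is then up to the program receiving the pattern to expand it; the sa-learn man page passage quoted elsewhere in this thread says globbing in folder names is supported.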
Re: sa-learn problems and comprehension question
On 11/9/2010 11:16 AM, Marcin Mirosław wrote:
> On 09.11.2010 17:14, Karl Meyer wrote:
>> Hi,
>>
>> I want to configure Bayes learning and am still having some problems and
>> questions after reading several tutorials:
>>
>> I executed sa-learn for my inbox
>> # su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham
>> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
>>
>> and got a message that it learned from n messages. Also, two files
>> appeared in the dbpath folder. After I got 15 new mails in my inbox, I
>> executed the same command again. But this time it didn't learn.
> sa-learn remembers the msgid of each message it has learned; it will never
> learn the same email twice. Until you change the msgid ;)
> Regards

If you learn a message as ham, it will not learn the same message as ham
a second time (same with spam). However, you can change your mind and
learn the message as spam. Bayes will "forget" what it learned the
first time and re-learn the message.

-- 
Bowie
Re: sa-learn problems and comprehension question
On 09.11.2010 17:14, Karl Meyer wrote:
>
> Hi,
>
> I want to configure Bayes learning and am still having some problems and
> questions after reading several tutorials:
>
> I executed sa-learn for my inbox
> # su -c "/usr/bin/sa-learn --dbpath /var/amavis/.spamassassin/bayes/ --ham
> --showdots /var/spool/imap/user/kmeyer/[0-9]*." amavis
>
> and got a message that it learned from n messages. Also, two files
> appeared in the dbpath folder. After I got 15 new mails in my inbox, I
> executed the same command again. But this time it didn't learn.

sa-learn remembers the msgid of each message it has learned; it will never
learn the same email twice. Until you change the msgid ;)

Regards