Re: sa-learn process overhwelming the server
On 6-May-2009, at 06:50, Karsten Bräckelmann wrote:
>> To determine if a mail already has been learned, SA needs to have a look at the mail.

On 06.05.09 17:34, LuKreme wrote:
> Mightn't it be helpful if it could keep a cache of filenames?

Useless, they can change.

I hope that SA reads only the mail header, parses the Message-Id, and, if it is there and already known, ignores the message: it closes the file, or skips ahead to the next message if --mbox was specified. That way, learning a whole maildir wouldn't require parsing all of the e-mail.

--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
M$ Win's are shit, do not use it !
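The header-only check hoped for above can be illustrated with a small shell sketch. This is not how SpamAssassin itself works (elsewhere in the thread it is noted that SA relies on its own message ID), and the seen-list file, `msgid_of`, and `unseen_files` are hypothetical names introduced only for illustration. It assumes one message per file and an unfolded Message-Id header.

```shell
#!/bin/sh
# Print the Message-Id of one message file, looking at the header block only.
# awk stops at the first empty line (end of headers), so the body is not parsed.
msgid_of() {
    awk '
        /^$/ { exit }                          # blank line ends the headers
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")           # drop the field name
            print
            exit
        }
    ' "$1"
}

# List the files under a directory whose Message-Id is not in the seen-list.
# The seen-list is a hypothetical plain-text file, one Message-Id per line.
# Files without a Message-Id are always listed.
unseen_files() {
    dir=$1
    seen=$2
    find "$dir" -type f | while IFS= read -r f; do
        id=$(msgid_of "$f")
        if [ -z "$id" ] || ! grep -qxF -- "$id" "$seen" 2>/dev/null; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling `unseen_files "$HOME/Maildir/.SPAM/cur" seen.txt` would print only the messages whose Message-Id is not yet recorded; every file is still opened once, but only its headers are scanned.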
Re: sa-learn process overhwelming the server
Yes, the learn client does not try to keep track of what it has or has not done before; that is handled by the server (the Bayes engine). I believe there is no reasonable way for the client to achieve this anyway. It cannot reliably modify your maildir in a way that guarantees it will find its markers again, since your email software might discard whatever the learn client put there to remember what it processed. And a maildir is not the only kind of input it recognizes, so trying to implement as many ways of 'remembering' as there are ways of giving it things to process could be quite unmanageable.

So I would say the better solution is to adjust the way you run your learning, so that it discards, or moves elsewhere, the items it has processed. Perhaps put the things you want learned into a place that no other process touches, run the learn on that, then move that stuff to another place if you want to keep it around for a while.

LuKreme krem...@kreme.com 05/06/09 4:23 PM
> On 6-May-2009, at 07:13, RW wrote:
>> On Wed, 6 May 2009 01:43:08 -0600 lbut...@covisp.net wrote:
>>> The trouble appears to me to be that sa-learn has no concept of whether or not it has learned a message or not.
>>
>> It does, they are stored in the bayes_seen file if you are using db storage.
>
> It does in that it doesn't relearn them, but it doesn't in terms of processing them. Lemme explain.
>
> If I have a maildir with 109 messages, 9 of which are new, running sa-learn might take X minutes. If I have a maildir with 9 messages, all of which are new, sa-learn takes much less than X minutes. If I have a maildir with 2893 messages, 9 of which are new, it will take much much more than X minutes.
sa-learn process overhwelming the server
I have one user on my system who receives a LOT of spam. This is intentional, as that user is set to never discard email once it is received. I scan the spam and let it auto-expire out of the IMAP folder after 7 days.

The trouble is, in those 7 days the folder usually grows to between 1500 and 3000 messages, and my sa-learn run on that folder simply grinds the machine down to a crawl. I've set the sa-learn to process this folder in the middle of the night, but it is still problematic and interferes with other late-night processes. It has even, on occasion, necessitated a reboot when I could not get the system to kill the process. I've taken to trying to scan it daily and manually delete the spam, but that's not always possible.

The trouble appears to me to be that sa-learn has no concept of whether or not it has learned a message or not. Since all IMAP messages are stored with unique names, is there some easy way to create a cache of the messages it has checked and have it ignore those messages?

I suppose I could do something like:

  find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec /usr/local/bin/sa-learn --spam {} \;

??

--
These budget numbers are not just estimates, these are the actual results for the fiscal year that ended February the 30th. - GWB
Re: sa-learn process overhwelming the server
On Wed, 6 May 2009 01:43:08 -0600 ɹןʇnqן lbut...@covisp.net wrote:
> The trouble appears to me to be that sa-learn has no concept of whether or not it has learned a message or not.

It does, they are stored in the bayes_seen file if you are using db storage.
Re: sa-learn process overhwelming the server
Hi,

> processes. It has even, on occasion, necessitated a reboot when i could not get the system to kill the process. I've taken to trying to scan it daily and manually delete the spam, but that's not always possible.

This hint might be totally wrong, but the last time I saw such behavior it was linked to the process /usr/libexec/gam_server (a file alteration monitor, used by fail2ban for example) that was (uselessly) triggered by sa-learn. I just configured gamin so it would ignore the user data partition, and the heavy loads disappeared.

HTH
JG
Re: sa-learn process overhwelming the server
On Wed, 2009-05-06 at 01:43 -0600, ɹןʇnqן wrote:
> The trouble appears to me to be that sa-learn has no concept of whether or not it has learned a message or not. Since all IMAP messages are stored with unique names, is there some easy way to create a cache of the messages it has checked and have it ignore those messages?

SA does know about mail it already learned. However, for various reasons it is *not* based on the file name. An obvious reason would be support for mbox format. :) And auto-learning before the mail has been passed on to the MDA. Then there's the problem that even with Maildir format, file names (think flags) are not guaranteed to remain static...

To determine if a mail already has been learned, SA needs to have a look at the mail.

> I suppose I could do something like:
>   find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec /usr/local/bin/sa-learn --spam {} \;

Something like that, yes. It definitely makes sense to 'find' the messages delivered since the last sa-learn run. That line however is *way* too inefficient, spawning an sa-learn Perl process for each message.

Instead, call sa-learn a single time only. Probably by snipering the last day's spam into a temp folder first, and simply pointing sa-learn at that dir.

--
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1: (c=*++x); c128 (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
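The temp-folder suggestion above might look roughly like the following sketch. It copies rather than moves, so the IMAP folder is left untouched; the function name `learn_recent_spam` and the overridable `SA_LEARN` variable are assumptions made for illustration, and the default path is simply the one quoted in the thread.

```shell
#!/bin/sh
# SA_LEARN may be overridden; the default is the path used in the thread.
SA_LEARN=${SA_LEARN:-/usr/local/bin/sa-learn}

# Copy the last day's spam into a scratch directory, then learn it in one go.
learn_recent_spam() {
    src=$1                                 # e.g. "$HOME/Maildir/.SPAM"
    stage=$(mktemp -d) || return 1         # private scratch directory
    # One cheap cp per message, instead of one Perl process per message.
    find "$src/cur" "$src/new" -type f -ctime -1 -exec cp {} "$stage"/ \;
    # A single sa-learn invocation for the whole batch.
    "$SA_LEARN" --spam "$stage"
    status=$?
    rm -rf "$stage"                        # the copies are no longer needed
    return $status
}
```

A nightly cron job could then call `learn_recent_spam "$HOME/Maildir/.SPAM"`. Messages older than a day are never handed to sa-learn, so the cost should track the number of recent messages rather than the size of the folder.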
Re: sa-learn process overhwelming the server
On 6-May-2009, at 07:13, RW wrote:
> On Wed, 6 May 2009 01:43:08 -0600 ɹןʇnqן lbut...@covisp.net wrote:
>> The trouble appears to me to be that sa-learn has no concept of whether or not it has learned a message or not.
>
> It does, they are stored in the bayes_seen file if you are using db storage.

It does in that it doesn't relearn them, but it doesn't in terms of processing them. Lemme explain.

If I have a maildir with 109 messages, 9 of which are new, running sa-learn might take X minutes. If I have a maildir with 9 messages, all of which are new, sa-learn takes much less than X minutes. If I have a maildir with 2893 messages, 9 of which are new, it will take much much more than X minutes.
Re: sa-learn process overhwelming the server
On Wed, 2009-05-06 at 15:23 -0500, LuKreme wrote:
>> It does, they are stored in the bayes_seen file if you are using db storage.
>
> It does in that it doesn't relearn them, but it doesn't in terms of processing them. Lemme explain.

I already explained in this very thread why there is *no* way for SA to identify already seen messages based on a file name. In short, can you say auto-learn?

guenther
Re: sa-learn process overhwelming the server
On 6-May-2009, at 06:50, Karsten Bräckelmann wrote:
> On Wed, 2009-05-06 at 01:43 -0600, ɹןʇnqן wrote:
>> The trouble appears to me to be that sa-learn has no concept of whether or not it has learned a message or not. Since all IMAP messages are stored with unique names, is there some easy way to create a cache of the messages it has checked and have it ignore those messages?
>
> SA does know about mail it already learned. However, for various reasons it is *not* based on the file name. An obvious reason would be support for mbox format.

Yes, but that is a different flag, --mbox.

> :) And auto-learning before the mail has been passed on to the MDA. Then there's the problem that even with Maildir format, file names (think flags) are not guaranteed to remain static...

The last few characters will change, but I don't think the rest of the name changes. That is, the mail file named 1241641613.40384_0.mail.covisp.net:2, is always going to be named that, with maybe one or two additional characters after the ',', no matter where I move it. The first part is the epoch time followed by 5 random characters.

> To determine if a mail already has been learned, SA needs to have a look at the mail.

Mightn't it be helpful if it could keep a cache of filenames?

>> I suppose I could do something like:
>>   find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec /usr/local/bin/sa-learn --spam {} \;
>
> Something like that, yes. It definitely makes sense to 'find' the messages delivered since the last sa-learn run. That line however is *way* too inefficient, spawning an sa-learn Perl process for each message.

Ah, yes. Good point.

  find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec mv {} /tmp/$USER/.junk \;

--
Space Directive 723: Terraformers are expressly forbidden from recreating Swindon.
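The claim about which part of a Maildir file name stays fixed can be made concrete with a short sketch. It assumes the common Maildir convention of a unique name followed by ":2," and a flag list; `maildir_base` is a hypothetical helper, and, as the replies in this thread point out, the name is ultimately up to the delivery agent and says nothing about mbox input or auto-learned mail.

```shell
#!/bin/sh
# Strip the directory and the ":2,<flags>" suffix from a Maildir file name,
# leaving the part that is expected to stay the same when flags change.
maildir_base() {
    name=${1##*/}                      # drop any leading directory
    printf '%s\n' "${name%%:2,*}"      # drop ":2," and the flags, if present
}
```

For example, `maildir_base cur/1241641613.40384_0.mail.covisp.net:2,S` and the same name with a different flag list both reduce to `1241641613.40384_0.mail.covisp.net`, which is the key a filename cache would have to use.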
Re: sa-learn process overhwelming the server
Awesome. So I mentioned it twice in this thread, once each before your follow-ups, and you keep on ignoring and arguing. Which part of auto-learning and before local delivery is unclear to you?

On Wed, 2009-05-06 at 17:34 -0600, LuKreme wrote:
> On 6-May-2009, at 06:50, Karsten Bräckelmann wrote:
>> SA does know about mail it already learned. However, for various reasons it is *not* based on the file name. An obvious reason would be support for mbox format.
>
> Yes, but that is a different flag --mbox.
>
>> :) And auto-learning before the mail has been passed on to the MDA. Then there's the problem that even with Maildir format, file names (think flags) are not guaranteed to remain static...
>
> The last few characters will change, but I don't think the rest of the name changes. That is, the mail file named 1241641613.40384_0.mail.covisp.net:2, is always going to be named that, with maybe one or two additional characters after the ',', no matter where I move it. the first part is the epoch time followed by 5 random characters

The actual name frankly is an implementation detail of your LDA.

>> To determine if a mail already has been learned, SA needs to have a look at the mail.
>
> Mightn't it be helpful if it could keep a cache of filenames?

  cat cur/* > mbox  # your point is?
Re: sa-learn process overhwelming the server
On 6-May-2009, at 19:46, Karsten Bräckelmann wrote:
> Awesome. So I mentioned it twice in this thread, once each before your follow-ups, and you keep on ignoring and arguing. Which part of auto-learning and before local delivery is unclear to you?

This has nothing to do with either autolearning OR local delivery. These are messages that 1) have been delivered and 2) have not been autolearned. I was simply looking for a better way than moving (well, copying) messages out somewhere, then processing those copies via sa-learn, and then deleting those copies. It all seems a bit much, and not much better than doing it manually.

And I wasn't arguing, I was trying for clarification. Sorry if you found it argumentative, that was not my intent.

--
Lithium will no longer be available on credit
Re: sa-learn process overhwelming the server
On Wed, 2009-05-06 at 19:55 -0600, LuKreme wrote:
> On 6-May-2009, at 19:46, Karsten Bräckelmann wrote:
>> Awesome. So I mentioned it twice in this thread, once each before your follow-ups, and you keep on ignoring and arguing. Which part of auto-learning and before local delivery is unclear to you?
>
> This has nothing to do with either autolearning OR local delivery.

Yes, it does: SA has already learned auto-learned messages, yet it has to look at the actual message to realize that. At the very least, once.

Related side note: If you're curious, there are reasons in bugzilla why SA relies on its own Message ID thingy.

> These are messages that 1) have been delivered and 2) have not been autolearned. I was simply looking for a better way than moving (well, copying) message out somewhere and then processing those copies via sa-learn and then deleting those copies. It all seems a bit much, and not much better than doing it manually.

  find . -ctime 1 -print0 | xargs -0 sa-learn

Brief sketch. No copying and deleting, yet limited to recently delivered mail since the last sa-learn run, and limiting the costly invocation of a Perl process to a minimum.

> And I wasn't arguing, I was trying for clarification. Sorry if you found it argumentative, that was not my intent.

OK. :) That one was aiming at the quoted parts anyway, not at your hint (granted, snipped by me) to help improve the find snipering.

Time to hit the sack...
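One way to flesh out the brief sketch above is to replace the fixed one-day window with a marker file, so that "since the last sa-learn run" is literal. The function name `learn_since_last_run`, the marker file, and the overridable `SA_LEARN` variable are assumptions for illustration. `-print0`, `xargs -0`, and `xargs -r` are GNU extensions that FreeBSD's tools also accept.

```shell
#!/bin/sh
SA_LEARN=${SA_LEARN:-sa-learn}

# Learn only the messages that are newer than a marker file from the last run.
learn_since_last_run() {
    dir=$1                        # folder holding the spam, e.g. a Maildir "cur"
    stamp=$2                      # hypothetical marker file kept between runs
    next=$stamp.next
    touch "$next" || return 1     # record the start of this run before scanning
    if [ -e "$stamp" ]; then
        find "$dir" -type f -newer "$stamp" -print0
    else
        find "$dir" -type f -print0          # first run: learn everything
    fi | xargs -0 -r "$SA_LEARN" --spam || return 1
    mv "$next" "$stamp"           # advance the marker only after success
}
```

xargs packs as many file names as fit into each sa-learn invocation, so the Perl start-up cost is paid once per batch rather than once per message. Note that `-newer` compares modification times, so a message moved into the folder long after it was delivered could be missed; that trade-off is an assumption of this sketch, not something the thread settles.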