Re: sa-learn process overwhelming the server

2009-05-07 Thread Matus UHLAR - fantomas
 On 6-May-2009, at 06:50, Karsten Bräckelmann wrote:
 To determine if a mail already has been learned, SA needs to have a look
 at the mail.

On 06.05.09 17:34, LuKreme wrote:
 Mightn't it be helpful if it could keep a cache of filenames?

Useless, they can change. I hope that SA reads only the mail header, parses
the Message-Id and, if it's there and known, ignores the message: closes the
file, or skips ahead to the next message if --mbox was specified. This way
learning a whole maildir wouldn't require parsing all the e-mail.
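
The idea above (read only the header, pull out the Message-Id, skip the mail if that ID is already known) can be sketched in shell. This only illustrates the suggestion; it is not how SpamAssassin tracks learned mail, since the thread itself notes that SA relies on its own message ID. The names `header_msgid` and `seen_before`, and the one-ID-per-line list file, are hypothetical.

```shell
# Print the Message-ID of one mail file, reading no further than the
# blank line that ends the header. Folded (multi-line) header fields
# are not handled in this sketch.
header_msgid() {
    awk '
        /^$/ { exit }                         # end of header: stop reading
        tolower($0) ~ /^message-id:/ {
            sub(/^[^:]*:[ \t]*/, "")          # drop the field name
            print
            exit
        }
    ' "$1"
}

# Succeed if the ID ($1) is already listed, one per line, in the file $2.
seen_before() {
    [ -f "$2" ] && grep -Fxq -- "$1" "$2"
}
```

A caller would run `header_msgid` on each file and pass on only those for which `seen_before` fails. Even so, every file still has to be opened once, which is the cost the thread is arguing about.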

-- 
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/


Re: sa-learn process overwhelming the server

2009-05-07 Thread Kevin Parris
Yes, the learn client does not try to keep up with what it has done, or not 
done, before - that is handled by the server (the Bayes engine).

I believe there is no reasonable way for the client to achieve this, anyway - 
it cannot reliably modify your maildir in such a way that it can be assured of 
finding things again, since your email software might discard whatever the 
learn client put there for the memory of what it processed.  And a maildir is 
not the only kind of input it recognizes, so it could be quite unmanageable to 
try to implement as many ways of 'remembering' as there are ways of giving it 
things to process.

So I would say the better solution is to adjust the way you run your learning, 
so that it will discard, or move elsewhere, the items it has processed.  
Perhaps put things you want learned into a place that no other process touches, 
run the learn on that, then move that stuff to another place if you want to 
keep it around for a while.
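
A minimal sketch of that workflow, assuming a queue directory that nothing else writes to: learn from it in one run, then move the processed files aside. The function name `learn_then_archive` and both directory arguments are hypothetical; the learner defaults to `sa-learn --spam` and can be replaced by passing a different command.

```shell
# Usage: learn_then_archive QUEUE_DIR DONE_DIR [LEARNER COMMAND...]
# Runs the learner once on the whole queue, then moves the processed
# files to DONE_DIR. Assumes no other process touches QUEUE_DIR.
learn_then_archive() (
    queue=$1
    done_dir=$2
    shift 2
    [ "$#" -gt 0 ] || set -- sa-learn --spam

    # Nothing to do if the queue holds no regular files.
    [ -n "$(find "$queue" -type f | head -n 1)" ] || exit 0

    # One learner invocation for the whole directory.
    "$@" "$queue" || exit 1

    # Move what was just processed out of the way. Files with the same
    # name already in DONE_DIR would be overwritten.
    mkdir -p "$done_dir"
    find "$queue" -type f -exec mv {} "$done_dir"/ \;
)
```

Because the queue is emptied after each run, the next run only ever sees new mail, so no record of earlier runs is needed.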

 LuKreme krem...@kreme.com 05/06/09 4:23 PM 
On 6-May-2009, at 07:13, RW wrote:
 On Wed, 6 May 2009 01:43:08 -0600
 lbut...@covisp.net wrote:

 The trouble appears to me to be that sa-learn has no concept of
 whether or not it has learned a message or not.

 It does, they are stored in the bayes_seen file if you are using
 db storage.

It does in that it doesn't relearn them, but it doesn't in terms of
processing them. Lemme explain.

If I have a maildir with 109 messages, 9 of which are new, running
sa-learn might take X minutes.  If I have a maildir with 9 messages, all
of which are new, sa-learn takes much less than X minutes.  If I have
a maildir with 2893 messages, 9 of which are new, it will take much
much more than X minutes.




sa-learn process overwhelming the server

2009-05-06 Thread ɹןʇnqן
I have one user on my system who receives a LOT of spam. This is  
intentional as that user is set to never discard email once it is  
received.  I scan the spam and let it auto-expire out of the IMAP  
folder after 7 days.  The trouble is, in those 7 days, the folder  
usually grows to between 1500 and 3000 messages, and my sa-learn on  
that folder simply grinds the machine down to a crawl. I've set
sa-learn to process this folder in the middle of the night, but it is
still problematic and interferes with other late-night processes. It
has even, on occasion, necessitated a reboot when I could not get the
system to kill the process. I've taken to trying to scan it daily and  
manually delete the spam, but that's not always possible.


The trouble appears to me to be that sa-learn has no concept of  
whether or not it has learned a message or not.  Since all IMAP  
messages are stored with unique names, is there some easy way to  
create a cache of the messages it has checked and have it ignore those  
messages?


I suppose I could do something like:

find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec /usr/local/bin/sa-learn --spam {} \;


??

--
These budget numbers are not just estimates, these are the actual
results for the fiscal year that ended February the 30th.
- GWB



Re: sa-learn process overwhelming the server

2009-05-06 Thread RW
On Wed, 6 May 2009 01:43:08 -0600
ɹןʇnqן lbut...@covisp.net wrote:


 The trouble appears to me to be that sa-learn has no concept of  
 whether or not it has learned a message or not.  

It does, they are stored in the bayes_seen file if you are using
db storage.


Re: sa-learn process overwhelming the server

2009-05-06 Thread John GALLET

Hi,

processes. It has even, on occasion, necessitated a reboot when i could not 
get the system to kill the process. I've taken to trying to scan it daily and 
manually delete the spam, but that's not always possible.


This hint might be totally wrong, but last time I saw such a behavior it 
was linked to the process /usr/libexec/gam_server (a file alteration 
monitor, used by fail2ban for example) that was (uselessly) triggered by 
sa-learn. I just configured gamin so it would ignore the user data 
partition and the heavy loads disappeared.


HTH
JG


Re: sa-learn process overwhelming the server

2009-05-06 Thread Karsten Bräckelmann
On Wed, 2009-05-06 at 01:43 -0600, ɹןʇnqן wrote:
 The trouble appears to me to be that sa-learn has no concept of  
 whether or not it has learned a message or not.  Since all IMAP  
 messages are stored with unique names, is there some easy way to  
 create a cache of the messages it has checked and have it ignore those  
 messages?

SA does know about mail it already learned. However, for various reasons
it is *not* based on the file name. An obvious reason would be support
for mbox format. :)  And auto-learning before the mail has been passed
on to the MDA. Then there's the problem that even with Maildir format,
file names (think flags) are not guaranteed to remain static...

To determine if a mail already has been learned, SA needs to have a look
at the mail.


 I suppose I could do something like:
 
 find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec /usr/local/bin/sa-learn --spam {} \;

Something like that, yes. It definitely makes sense to 'find' the
messages delivered since the last sa-learn run. That line however is
*way* too inefficient, spawning an sa-learn Perl process for each
message.

Instead, call sa-learn a single time only. Probably by snipering the
last day's spam into a temp folder first, and simply pointing sa-learn
at that dir.
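
One way to get a single invocation without a temp folder, sketched under assumptions: end the `-exec` clause with `+` instead of `\;`, so `find` hands the learner a batch of file names rather than one per process. The function name `learn_recent` is hypothetical; the learner defaults to `sa-learn --spam` and can be overridden by passing another command.

```shell
# Usage: learn_recent MAILDIR_FOLDER [LEARNER COMMAND...]
# Passes every message changed within the last day to the learner.
learn_recent() (
    folder=$1
    shift
    [ "$#" -gt 0 ] || set -- sa-learn --spam

    # "{} +" appends as many file names as fit on one command line, so
    # the learner normally starts once instead of once per message.
    find "$folder/cur" "$folder/new" -type f -ctime -1 -exec "$@" {} +
)
```

With a very large number of matches, `find` may still split the list over a few invocations to stay within the system's argument-length limit, which is still far fewer than one per message.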


-- 
char *t=\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4;
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;il;i++){ i%8? c=1:
(c=*++x); c128  (s+=h); if (!(h=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}



Re: sa-learn process overwhelming the server

2009-05-06 Thread LuKreme

On 6-May-2009, at 07:13, RW wrote:

 On Wed, 6 May 2009 01:43:08 -0600
 ɹןʇnqן lbut...@covisp.net wrote:

  The trouble appears to me to be that sa-learn has no concept of
  whether or not it has learned a message or not.

 It does, they are stored in the bayes_seen file if you are using
 db storage.


It does in that it doesn't relearn them, but it doesn't in terms of
processing them. Lemme explain.

If I have a maildir with 109 messages, 9 of which are new, running
sa-learn might take X minutes.  If I have a maildir with 9 messages, all
of which are new, sa-learn takes much less than X minutes.  If I have
a maildir with 2893 messages, 9 of which are new, it will take much
much more than X minutes.






Re: sa-learn process overwhelming the server

2009-05-06 Thread Karsten Bräckelmann
On Wed, 2009-05-06 at 15:23 -0500, LuKreme wrote:

  It does, they are stored in the bayes_seen file if you are using
  db storage.
 
 It does in that it doesn't relearn them, but it doesn't in terms of
 processing them. Lemme explain.

I already explained in this very thread why there is *no* way for SA to
identify already seen messages based on a file name. In short, can you
say auto-learn?

  guenther





Re: sa-learn process overwhelming the server

2009-05-06 Thread LuKreme

On 6-May-2009, at 06:50, Karsten Bräckelmann wrote:

 On Wed, 2009-05-06 at 01:43 -0600, ɹןʇnqן wrote:
  The trouble appears to me to be that sa-learn has no concept of
  whether or not it has learned a message or not.  Since all IMAP
  messages are stored with unique names, is there some easy way to
  create a cache of the messages it has checked and have it ignore those
  messages?

 SA does know about mail it already learned. However, for various reasons
 it is *not* based on the file name. An obvious reason would be support
 for mbox format.


Yes, but that is a different flag --mbox.


 :)  And auto-learning before the mail has been passed
 on to the MDA. Then there's the problem that even with Maildir format,
 file names (think flags) are not guaranteed to remain static...


The last few characters will change, but I don't think the rest of the
name changes. That is, the mail file named
1241641613.40384_0.mail.covisp.net:2, is always going to be named
that, with maybe one or two additional characters after the comma, no
matter where I move it. The first part is the epoch time followed by
5 random characters.
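
To make the filename-cache idea concrete: in Maildir naming, whatever follows `:2,` holds the flags, so a cache key would have to drop that part. A minimal sketch, with `maildir_key` as a hypothetical helper. It says nothing about whether a given delivery agent keeps the rest of the name stable, which is the objection raised elsewhere in the thread.

```shell
# Print a candidate cache key for a Maildir file: strip the directory,
# then everything from the ":2," flag separator onward.
maildir_key() (
    name=${1##*/}
    printf '%s\n' "${name%%:2,*}"
)
```

For the file name quoted above, this prints `1241641613.40384_0.mail.covisp.net` whether or not flags such as `S` follow the comma.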


 To determine if a mail already has been learned, SA needs to have a look
 at the mail.


Mightn't it be helpful if it could keep a cache of filenames?


  I suppose I could do something like:

  find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec /usr/local/bin/sa-learn --spam {} \;

 Something like that, yes. It definitely makes sense to 'find' the
 messages delivered since the last sa-learn run. That line however is
 *way* too inefficient, spawning an sa-learn Perl process for each
 message.


Ah, yes. Good point.

find $HOME/Maildir/.SPAM/{cur,new} -type f -ctime -1 -exec mv {} /tmp/$USER/.junk \;



--
Space Directive 723: Terraformers are expressly forbidden
from recreating Swindon.



Re: sa-learn process overwhelming the server

2009-05-06 Thread Karsten Bräckelmann
Awesome. So I mentioned it twice in this thread, once each before your
follow-ups, and you keep on ignoring and arguing. Which part of
auto-learning and before local delivery is unclear to you?


On Wed, 2009-05-06 at 17:34 -0600, LuKreme wrote:
 On 6-May-2009, at 06:50, Karsten Bräckelmann wrote:

  SA does know about mail it already learned. However, for various reasons
  it is *not* based on the file name. An obvious reason would be support
  for mbox format.
 
 Yes, but that is a different flag --mbox.
 
  :)  And auto-learning before the mail has been passed
  on to the MDA. Then there's the problem that even with Maildir format,
  file names (think flags) are not guaranteed to remain static...
 
 The last few characters will change, but I don't think the rest of the  
 name changes. That is, the mail file named  
 1241641613.40384_0.mail.covisp.net:2, is always going to be named  
 that, with maybe one or two additional characters after the ,, no  
 matter where I move it.  the first part is the epoch time followed by  
 5 random characters

The actual name frankly is an implementation detail of your LDA.


  To determine if a mail already has been learned, SA needs to have a look
  at the mail.
 
 Mightn't it be helpful if it could keep a cache of filenames?

cat cur/* > mbox   # your point is?





Re: sa-learn process overwhelming the server

2009-05-06 Thread LuKreme

On 6-May-2009, at 19:46, Karsten Bräckelmann wrote:

 Awesome. So I mentioned it twice in this thread, once each before your
 follow-ups, and you keep on ignoring and arguing. Which part of
 auto-learning and before local delivery is unclear to you?



This has nothing to do with either autolearning OR local delivery.
These are messages that 1) have been delivered and 2) have not been
autolearned. I was simply looking for a better way than moving (well,
copying) messages out somewhere and then processing those copies via
sa-learn and then deleting those copies. It all seems a bit much, and not
much better than doing it manually.


And I wasn't arguing, I was trying for clarification. Sorry if you  
found it argumentative, that was not my intent.


--
Lithium will no longer be available on credit



Re: sa-learn process overwhelming the server

2009-05-06 Thread Karsten Bräckelmann
On Wed, 2009-05-06 at 19:55 -0600, LuKreme wrote:
 On 6-May-2009, at 19:46, Karsten Bräckelmann wrote:
  Awesome. So I mentioned it twice in this thread, once each before your
  follow-ups, and you keep on ignoring and arguing. Which part of
  auto-learning and before local delivery is unclear to you?
 
 This has nothing to do with either autolearning OR local delivery.

Yes, it does. SA has already learned the auto-learned messages, yet it has
to look at the actual message to realize that. At the very least, once.

Related side note: If you're curious, there are reasons in bugzilla why
SA relies on its own Message ID thingy.

 These are messages that 1) have been delivered and 2) have not been
 autolearned. I was simply looking for a better way than moving (well,
 copying) messages out somewhere and then processing those copies via
 sa-learn and then deleting those copies. It all seems a bit much, and not
 much better than doing it manually.

find . -type f -ctime -1 -print0 | xargs -0 sa-learn --spam

Brief sketch. No copying and deleting, yet limited to recently delivered
mail since the last sa-learn run, and limiting the costly invocation of
a Perl process to a minimum.
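
A hedged sketch of one way to track "since the last sa-learn run": keep a timestamp file and let `find -newer` select only the mail modified after it. The function name `learn_since_last_run` and the stamp-file argument are hypothetical; the learner defaults to `sa-learn --spam`. Note that `-newer` compares modification times, so mail moved into the folder with an old modification time would be missed; a change-time test like the `-ctime` used in this thread may fit that case better.

```shell
# Usage: learn_since_last_run MAILDIR_FOLDER STAMP_FILE [LEARNER COMMAND...]
learn_since_last_run() (
    folder=$1
    stamp=$2
    shift 2
    [ "$#" -gt 0 ] || set -- sa-learn --spam

    # Record the new timestamp before scanning, so mail that arrives
    # during this run is picked up next time rather than skipped.
    touch "$stamp.next" || exit 1

    if [ -e "$stamp" ]; then
        # Later runs: only files modified after the previous run started.
        find "$folder/cur" "$folder/new" -type f -newer "$stamp" -exec "$@" {} +
    else
        # First run: no stamp yet, so every message is a candidate.
        find "$folder/cur" "$folder/new" -type f -exec "$@" {} +
    fi || exit 1

    # Only advance the stamp once the learner has succeeded.
    mv "$stamp.next" "$stamp"
)
```

Run from cron with the same stamp file each night, the learner then sees only that day's arrivals, however large the folder has grown.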

 And I wasn't arguing, I was trying for clarification. Sorry if you  
 found it argumentative, that was not my intent.

OK. :)  That one was aiming at the quoted parts anyway, not at your hint
(granted, snipped by me) to help improve the find snipering.


Time to hit the sack...
