Re: Google Summer of Code 2007 ...

2007-02-21 Thread Per Jessen
C. Bensend wrote:

 Perhaps this is trivial, or not desired by anyone else but myself,
 but I'd _love_ to be able to strip SpamAssassin tags via spamc and
 spamd, instead of having to fire up the full-blown spamassassin
 for each message.  :)

formail ?


/Per Jessen, Zürich



Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

DAve writes:
Justin Mason wrote:
 Theo Van Dinter writes:
 I'm assuming that there will be a Google Summer of Code 2007 going on, and
 that the ASF will be involved again.  So it's a good time to start thinking
 about things we'd like to put up as possible projects.

 We still have a number of items from last year that we could use again.
 Anything else that we'd like people to code up?
 
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?
 
 --j.

Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
loads SA directly. One of the features of MailScanner is called MCP or 
Message Content Protection. MCP uses, or attempts to use, SA to do 
specific targeted message content checking. Many people, we included, 
would like to be able to use this but it seems there is always some 
gotcha to having SA loaded in MailScanner twice. Problems with the 
directory paths, rules in memory, etc.

The ability to run SA with two totally different configurations in the 
same application would be very handy. Different rules for outbound mail vs 
inbound mail, and MCP (as in this user wants zero mail with the word breast 
regardless of the rest of the message content) are just two examples.

Contacting Julian on the MailScanner list would give far better examples 
and details than I could.

cc'd Julian.

This could definitely be done -- I didn't realise there was demand for
it ;)

This should probably be opened as a bug on the bugzilla, btw.
is it already there?

--j.


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Matthew Wilson writes:
- Full, tested, supportable multithreaded support

In my experience, perl threading is just not available in a reliable,
fast implementation -- this is not viable I'm afraid :(

- Full, tested, supportable support for an asynchronous I/O model (a la
qpsmtpd-async)

A pretty big task, unfortunately :(  It'd be nice, but could take
a lot of work.

- Pluggable to the point where all configuration and settings can be pulled
from anywhere (databases, files, in-memory cache) at runtime, so SA could
stay resident and have its configuration be changed in-process.

Is this not already possible with config_text?

--j.


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Mark Martinec writes:
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?

Here's another one, to seize the opportunity when internal changes
are being contemplated:

Split the process into two parts:

- parsing and munging of mail  rules, resulting in a set of
  findings (e.g. a list of rules being hit, perhaps somehow
  generalized). This section can be done once per message,
  regardless of the number of recipients to the message
  (assuming all users use the same rules);

- based on the above, score the findings, possibly
  applying per-recipient scoring to each rule being hit;
  This (rather inexpensive) step can be applied for each
  recipient individually, without having to re-process
  an entire message in multiple-recipient mail.

...and adjust the API to Mail::SpamAssassin accordingly, so that
MTA-based content filtering (e.g. amavisd-new) could take advantage
of it, while still allowing full per-recipient customization of
individual rules scores (including disabling some by a score of 0).

Benefits depend on the site, but our stats show 1.46 recipients
per message on average. The above change (when calling SA
at the MTA level) would bring a 46% increase in throughput for free,
while still providing individualized rule scoring.

Mark, could you open a feature-request bug for this?
I think it'd be worthwhile, and would definitely be a good
SOC project (at least).

--j.
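
A rough sketch of the two-phase split Mark describes above, in Python
rather than Perl and not based on the real Mail::SpamAssassin API. The rule
names, the Recipient class and all function names are hypothetical. Phase 1
runs once per message and only collects rule hits; phase 2 applies each
recipient's own scores to that shared list. The last helper reproduces the
throughput arithmetic, assuming phase 2 is negligible next to phase 1.

```python
import re
from dataclasses import dataclass

# Hypothetical shared rule set: rule name -> compiled pattern.
RULES = {
    "MENTIONS_PILLS": re.compile(r"cheap pills", re.I),
    "ALL_CAPS_SUBJECT": re.compile(r"^Subject: [A-Z !]+$", re.M),
}


def find_hits(raw_message):
    """Phase 1 (expensive, once per message): names of the rules that hit."""
    return [name for name, rx in RULES.items() if rx.search(raw_message)]


@dataclass
class Recipient:
    address: str
    scores: dict            # per-recipient score per rule; 0 disables a rule
    threshold: float = 5.0


def score_for(recipient, hits):
    """Phase 2 (cheap, once per recipient): sum this recipient's scores."""
    total = sum(recipient.scores.get(name, 0.0) for name in hits)
    return total, total >= recipient.threshold


def throughput_gain(avg_recipients):
    """Relative gain when phase 1 runs once per message instead of once
    per recipient, assuming phase 2 costs nothing by comparison."""
    return avg_recipients - 1.0
```

With 1.46 recipients per message, throughput_gain(1.46) gives roughly
0.46, i.e. the 46% quoted above.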


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Mark Martinec writes:
On Saturday February 17 2007 03:01, Quinn Comendant wrote:
 How about an extensive statistics reporting tool, ..., that
 can show how well a current spamassassin installation is performing
 and where it needs improvements.

Well, not exactly in your words, but in the same spirit,
this time belonging to SA itself:

Instrument SA with a couple of performance measuring probes,
providing some easier way to spot where bottlenecks lie.
Just something simple enough to tell, look, currently waiting
for Razor server response (or some RBL) is taking 80% of
elapsed time. Or, Bayes db is very sluggish, it is taking
5 seconds to provide a result.

A timing breakdown by subtasks is not that much work to provide,
but provides great insight into troubleshooting and performance
improvements.

Here is an example of a timing breakdown as currently provided
in the log (at log level 2) by amavisd-new, without getting into
specific details, except to say the numbers are elapsed time
for each subtask in milliseconds (and in percents, just for the
section, and then a cumulative percent of all sections so far):

TIMING [total 1840 ms] - SMTP pre-DATA-flush: 4 (0%)0, SMTP DATA: 95 (5%)5,
check_init: 1 (0%)5, sql-enter: 69 (4%)9, mime_decode: 16 (1%)10,
get-file-type2: 26 (1%)11, parts_decode: 1 (0%)12, check_header: 3 (0%)12,
AV-scan-1: 14 (1%)12, AV-scan-2: 20 (1%)14, spam-wb-list: 5 (0%)14,
SA call: 1517 (82%)96, update_cache: 3 (0%)97, decide_mail_destiny: 6 (0%)97,
write-header: 15 (1%)98, save-to-local-mailbox: 1 (0%)98,
prepare-dsn: 3 (0%)98, main_log_entry: 12 (1%)99, sql-update: 20 (1%)100,
update_snmp: 2 (0%)100, SMTP pre-response: 1 (0%)100,
SMTP response: 1 (0%)100, unlink-2-files: 1 (0%)100, rundown: 0 (0%)100

It tells at a glance that message checking and I/O for this particular
message took 1840 ms in total, that receiving the message over SMTP
took 5% of this, that the virus scanners were very quick (14 and 20 ms),
that the SA call took 1517 ms, which is 82% of all elapsed time, and
that all sections up to and including SA (cumulative) took 96% of the
total elapsed time.

Now, something like this relatively simple timing breakdown, but
drilled down into an SA call, would tell the administrator where it is
worth spending effort, or why all of a sudden SA takes 10 seconds
instead of the usual 2.

again, another good idea, you're on a roll!  I can see that being very
handy -- and a great concise format for that info.  Could you open
*another* feature req for this one? ;)

--j.
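
To make the format above concrete, here is a minimal Python sketch that
renders a list of (section, milliseconds) pairs in the same
"name: ms (section%)cumulative%" layout. It is not amavisd-new's or
SpamAssassin's code; format_timing and slowest are hypothetical names, and
rounding to the nearest whole percent is an assumption that happens to
reproduce the figures in the sample line.

```python
def format_timing(sections):
    """Render [(name, elapsed_ms), ...] as a one-line timing breakdown.

    Each entry shows the elapsed milliseconds, that section's share of
    the total in percent, and the cumulative percent so far.
    """
    total = sum(ms for _, ms in sections)
    parts = []
    elapsed = 0
    for name, ms in sections:
        elapsed += ms
        pct = round(100 * ms / total) if total else 0
        cum = round(100 * elapsed / total) if total else 0
        parts.append(f"{name}: {ms} ({pct}%){cum}")
    return f"TIMING [total {total} ms] - " + ", ".join(parts)


def slowest(sections):
    """Return the (name, elapsed_ms) pair that took the longest."""
    return max(sections, key=lambda item: item[1])
```

Fed per-subtask timings from inside an SA call (DNS lookups, Razor, Bayes,
and so on), the same two helpers would point straight at the bottleneck.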


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Doc Schneider writes:
Justin Mason wrote:
 Theo Van Dinter writes:
 I'm assuming that there will be a Google Summer of Code 2007 going on, and
 that the ASF will be involved again.  So it's a good time to start thinking
 about things we'd like to put up as possible projects.

 We still have a number of items from last year that we could use again.
 Anything else that we'd like people to code up?
 
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?
 
 --j.

<Selfish> Yeah, an updated web interface for adding black and white lists 
and per-user options for MySQL/PostgreSQL that is a part of the core 
utilities. </Selfish>

Re this one -- what about Maia Mailguard?  I haven't tried it myself,
but it really sounds like they have that kind of thing really nicely
sewn up?

--j.


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Raul Dias writes:
On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
 Theo Van Dinter writes:
  I'm assuming that there will be a Google Summer of Code 2007 going on, and
  that the ASF will be involved again.  So it's a good time to start thinking
  about things we'd like to put up as possible projects.
  
  We still have a number of items from last year that we could use again.
  Anything else that we'd like people to code up?
 
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?
 
 --j.

Not a direct improvement, but...

- Add more hooks for plugins to allow a broader pluginization of SA,
  e.g. letting plugins load before rules are parsed.

Is this not already possible?  If not, it should be, agreed.

For each case where you have a problem along these lines, feel free
to open a bug at the bugzilla and we can get it fixed.  Adding a plugin
hook point is generally an easy task since it has very low overhead
in most cases.

- Better documentation of the internal structures used, to keep plugin
  authors from breaking things.

Again, shout if you run into problems -- we need to be prodded into
action ;)

- Turning some structures into classes, to make reuse by plugins easier.

Might be appropriate in some cases -- note that perl however (in common
with other dynamic langs like Ruby, python etc.) tends not to use
strictly-defined classes for many cases where C++ or Java would use a
class, instead using a more simple, loosely-defined name-value hash or
similar.  It's a slightly different approach to code design.
So it may not always be appropriate. ;)   Of course, these loosely-defined
hashes often need to be documented...

- The pluginization of SA.  From Bayes to header, body, rawbody, score 
  rules.  The entire process of doing so would open doors for more 
  external plugin usage and control.

While this might bring a slightly slower startup, in the long run the
benefits can be great.

Yeah, agreed.  This one can be tricky though as there are a lot more
details involved here ;)

I think there's ongoing work to pluginize the whole concept of rule
types -- eg. the entire body rule subsystem becomes just a plugin. It's
not ready yet though, and I'm not sure how fast it's progressing...

--j.


Re: Google Summer of Code 2007 ...

2007-02-21 Thread C. Bensend

 Perhaps this is trivial, or not desired by anyone else but myself,
 but I'd _love_ to be able to strip SpamAssassin tags via spamc and
 spamd, instead of having to fire up the full-blown spamassassin
 for each message.  :)

 formail ?

That would work in most cases, yes.  Unfortunately, not in mine.
Thanks for the pointer, though.  :)

Benny


-- 
During the armageddon, only two things will survive - cockroaches
and Cher.-- What's her bra size online game



Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Julian Field writes:
 Justin Mason wrote:
  DAve writes:
  Justin Mason wrote:
  Theo Van Dinter writes:
  I'm assuming that there will be a Google Summer of Code 2007 going on, 
  and
  that the ASF will be involved again.  So it's a good time to start 
  thinking
  about things we'd like to put up as possible projects.
 
  We still have a number of items from last year that we could use again.
  Anything else that we'd like people to code up?
  
  Also, any suggestions from outside the dev team?  Anyone got good ideas
  for new SpamAssassin features that would be good to pay someone to work on
  for 3 months?
 
  --j.

  Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
  loads SA directly. One of the features of MailScanner is called MCP or 
  Message Content Protection. MCP uses, or attempts to use, SA to do 
  specific targeted message content checking. Many people, we included, 
  would like to be able to use this but it seems there is always some 
  gotcha to having SA loaded in MailScanner twice. Problems with the 
  directory paths, rules in memory, etc.
 
  The ability to run SA with two totally different configurations in the 
  same application would be very handy. Different rules for outbound mail vs 
  inbound mail, and MCP (as in this user wants zero mail with the word breast 
  regardless of the rest of the message content) are just two examples.
 
  Contacting Julian on the MailScanner list would give far better examples 
  and details than I could.
 
  cc'd Julian.
 
  This could definitely be done -- I didn't realise there was demand for
  it ;)

 The basic idea is this:
 On normal incoming mail, the usual SpamAssassin setup is used. But on 
 outbound mail, for example, the company doesn't want any of the normal 
 SpamAssassin rules. But instead, they want to look for particular rude 
 keywords, company project names, and any other phrases the user wants, 
 but *not* the standard SpamAssassin rules.
 
 I don't use spamc/spamd or the spamassassin script, I call the function 
 library directly.
 
 So I need to be able to call SpamAssassin with 2 completely different 
 sets of configuration settings.
 
 What would also be nice is a way of different messages using different 
 SpamAssassin configuration settings, this would add a lot of flexibility 
 to what I can do with MailScanner's use of SpamAssassin.
 
 I effectively need to be able to create 2 or more SpamAssassin 
 configurations and have them work entirely independently.
 
 I hope that explains it well enough.

It does, thanks.  I presume having two independent Mail::SpamAssassin
objects (assuming they were really independent, which they currently
are not) would be ok?

  This should probably be opened as a bug on the bugzilla, btw.

 How much detail do you want?

A cut and paste of the above would be fine.  The idea is that (a)
it's on the bz, where it's more easily tracked and (b) if you open
it, you will be cc'd on updates and comments.

--j.
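
As an illustration of what "two independent objects" would mean here, a
small Python sketch follows. It does not use the Mail::SpamAssassin API;
the Scanner class, its methods and the rule names are hypothetical. The
point is only that each object owns its own rules and threshold, with no
shared module-level state, so an inbound and an outbound configuration can
live side by side in one process.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Scanner:
    """One self-contained configuration: its own rules and threshold."""
    name: str
    threshold: float
    rules: dict = field(default_factory=dict)   # rule name -> (regex, score)

    def add_rule(self, rule_name, pattern, score):
        self.rules[rule_name] = (re.compile(pattern, re.IGNORECASE), score)

    def check(self, text):
        """Return (rule names hit, total score, flagged?) for one message."""
        hits = [n for n, (rx, _) in self.rules.items() if rx.search(text)]
        score = sum(self.rules[n][1] for n in hits)
        return hits, score, score >= self.threshold


# Inbound mail: ordinary spam-style rules.
inbound = Scanner("inbound", threshold=5.0)
inbound.add_rule("CHEAP_PILLS", r"cheap pills", 5.0)

# Outbound mail: only site-specific keywords, none of the spam rules.
outbound = Scanner("outbound", threshold=1.0)
outbound.add_rule("PROJECT_NAME", r"project falcon", 1.0)
```

A message mentioning "Project Falcon" is flagged by the outbound scanner
and ignored by the inbound one, because neither sees the other's rules.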


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Raul Dias
On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
 Theo Van Dinter writes:
  I'm assuming that there will be a Google Summer of Code 2007 going on, and
  that the ASF will be involved again.  So it's a good time to start thinking
  about things we'd like to put up as possible projects.
  
  We still have a number of items from last year that we could use again.
  Anything else that we'd like people to code up?

Another thing that might be worth adding to GSC2007:

the internal encoding/charset used by SA.

I haven't found anything like that, but that doesn't mean SA does not do
this already.  If it does, sorry :)

Mail messages can come in many encodings, such as ISO-8859-*, utf-8,
utf-16, windows-*, and so on.

Also, perl (unless use utf8 is set) will default to the system encoding,
as given by e.g. LC_CTYPE.

Rule writers need a way to tell SA which encoding their rules are in.

This is not a real issue for English rules, but it is for other languages
such as Portuguese, French, Russian, Chinese, Japanese and so on.

The real problem is that a string with special characters in one
encoding is not the same in another encoding.

So, what is needed is:
1 - a way to tell SA the encoding/charset used in some rules
2 - SA converts the rules to a universal encoding internally
(e.g. utf-8/16).
3 - Temporarily reconvert to the message encoding/charset to match properly.

I really don't know if SA does something like this internally, but I
think it does not.
Doing this will require a considerable amount of work (so, GSC2007).

Without this kind of support, I expect it will become easier for
spammers to play with charsets to avoid specific rules.

-Raul Dias
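
The mismatch Raul describes is easy to show in a few lines of Python (not
SpamAssassin code; rule_matches and the sample word are made up for the
example). The same word has different bytes in ISO-8859-1 and UTF-8, so a
byte-level comparison fails, while decoding both the rule and the message
to Unicode first makes the match work regardless of charset.

```python
import re

# "promocao" with c-cedilla and a-tilde, written with escapes so the
# example does not depend on the encoding of this file.
WORD = "promo\u00e7\u00e3o"

rule_latin1 = WORD.encode("iso-8859-1")                    # rule file bytes
body_utf8 = ("Grande " + WORD + " hoje").encode("utf-8")   # message bytes


def rule_matches(rule_bytes, rule_charset, body_bytes, body_charset):
    """Decode rule and body to Unicode, then match on characters."""
    rule = rule_bytes.decode(rule_charset)
    body = body_bytes.decode(body_charset)
    return re.search(re.escape(rule), body) is not None


# Byte-level comparison: the ISO-8859-1 rule never appears in UTF-8 mail.
raw_hit = rule_latin1 in body_utf8

# Character-level comparison after normalizing both sides to Unicode.
normalized_hit = rule_matches(rule_latin1, "iso-8859-1", body_utf8, "utf-8")
```

This follows steps 1 and 2 of the list above; it normalizes the message
as well instead of converting the rule back to the message charset as in
step 3.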



Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

actually I think this is already implemented in 3.2.0 -- see
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.

--j.



Re: Google Summer of Code 2007 ...

2007-02-21 Thread Julian Field



Justin Mason wrote:

DAve writes:
  

Justin Mason wrote:


Theo Van Dinter writes:
  

I'm assuming that there will be a Google Summer of Code 2007 going on, and
that the ASF will be involved again.  So it's a good time to start thinking
about things we'd like to put up as possible projects.

We still have a number of items from last year that we could use again.
Anything else that we'd like people to code up?


Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.
  
Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
loads SA directly. One of the features of MailScanner is called MCP or 
Message Content Protection. MCP uses, or attempts to use, SA to do 
specific targeted message content checking. Many people, we included, 
would like to be able to use this but it seems there is always some 
gotcha to having SA loaded in MailScanner twice. Problems with the 
directory paths, rules in memory, etc.


The ability to run SA with two totally different configurations in the 
same application would be very handy. Different rules for outbound mail vs 
inbound mail, and MCP (as in this user wants zero mail with the word breast 
regardless of the rest of the message content) are just two examples.


Contacting Julian on the MailScanner list would give far better examples 
and details than I could.



cc'd Julian.

This could definitely be done -- I didn't realise there was demand for
it ;)
  

The basic idea is this:
On normal incoming mail, the usual SpamAssassin setup is used. But on 
outbound mail, for example, the company doesn't want any of the normal 
SpamAssassin rules. But instead, they want to look for particular rude 
keywords, company project names, and any other phrases the user wants, 
but *not* the standard SpamAssassin rules.


I don't use spamc/spamd or the spamassassin script, I call the function 
library directly.


So I need to be able to call SpamAssassin with 2 completely different 
sets of configuration settings.


What would also be nice is a way of different messages using different 
SpamAssassin configuration settings, this would add a lot of flexibility 
to what I can do with MailScanner's use of SpamAssassin.


I effectively need to be able to create 2 or more SpamAssassin 
configurations and have them work entirely independently.


I hope that explains it well enough.

This should probably be opened as a bug on the bugzilla, btw.
  

How much detail do you want?

is it already there?
  

I've never put it there.

--j.
  


Jules

--
Julian Field MEng CITP
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store

Need help customising MailScanner?
Contact me!
Need help fixing or optimising your systems?
Contact me!
Need help getting you started solving new requirements from your boss?
Contact me!

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654





Re: Google Summer of Code 2007 ...

2007-02-21 Thread DAve

Justin Mason wrote:

DAve writes:

Justin Mason wrote:

Theo Van Dinter writes:

I'm assuming that there will be a Google Summer of Code 2007 going on, and
that the ASF will be involved again.  So it's a good time to start thinking
about things we'd like to put up as possible projects.

We still have a number of items from last year that we could use again.
Anything else that we'd like people to code up?

Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.
Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
loads SA directly. One of the features of MailScanner is called MCP or 
Message Content Protection. MCP uses, or attempts to use, SA to do 
specific targeted message content checking. Many people, we included, 
would like to be able to use this but it seems there is always some 
gotcha to having SA loaded in MailScanner twice. Problems with the 
directory paths, rules in memory, etc.


The ability to run SA with two totally different configurations in the 
same application would be very handy. Different rules for outbound mail vs 
inbound mail, and MCP (as in this user wants zero mail with the word breast 
regardless of the rest of the message content) are just two examples.


Contacting Julian on the MailScanner list would give far better examples 
and details than I could.


cc'd Julian.

This could definitely be done -- I didn't realise there was demand for
it ;)

This should probably be opened as a bug on the bugzilla, btw.
is it already there?

--j.


No, I didn't really think it a bug, nor a feature request. I kinda 
viewed it as a future path. Sounds like I am late to the game this 
morning anyway and Jules will enter it.


DAve

--
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?

Maybe they forgot who made that choice possible.


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Raul Dias
On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
 actually I think this is already implemented in 3.2.0 -- see
 http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.

Nice.  This patch solves the message part problem.

With this, rules can be written in Unicode too.
A final change would be to let rules be written in other charsets.

Rule files are read separately.  An easy implementation would be to add a
file_charset option.  This option would declare the charset used by the
rule file, such as iso-8859-15, and the rules would be converted internally
to Unicode too if and only if (IMO) the normalize_charset option is set to 1.

-Raul Dias




Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Raul Dias writes:
 On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
  actually I think this is already implemented in 3.2.0 -- see
  http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
 
 Nice.  This patch solves the message part problem.
 
 With this, rules can be written in Unicode too.
 A final change would be to let rules be written in other charsets.
 
 Rule files are read separately.  An easy implementation would be to add a
 file_charset option.  This option would declare the charset used by the
 rule file, such as iso-8859-15, and the rules would be converted internally
 to Unicode too if and only if (IMO) the normalize_charset option is set to 1.

I think I prefer the current model, where rules are UTF-8, I'm
afraid ;)

--j.



RE: Google Summer of Code 2007 ...

2007-02-21 Thread R Lists06
May I ask...

Why is this thread named as such?

Does Google help fund SA efforts in one or multiple ways?

If so, may I ask how or directions to already posted docs on it?

 - rh

--
Robert - Abba Communications
   Computer  Internet Services
 (509) 624-7159 - www.abbacomm.net





Re: Google Summer of Code 2007 ...

2007-02-21 Thread Daryl C. W. O'Shea

R Lists06 wrote:

May I ask...

Why is this thread named as such?

Does Google help fund SA efforts in one or multiple ways?

If so, may I ask how or directions to already posted docs on it?


If you, uh, Google for Google Summer of Code I'm sure you'll find all 
you want to know.


Daryl


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Raul Dias
On Wed, 2007-02-21 at 17:27 +0100, Justin Mason wrote:
 Raul Dias writes:
  On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
   actually I think this is already implemented in 3.2.0 -- see
   http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
  
  Nice.  This patch solves the message part problem.
  
  With this, rules can be written in Unicode too.
  A final change would be to let rules be written in other charsets.
  
  Rule files are read separately.  An easy implementation would be to add a
  file_charset option.  This option would declare the charset used by the
  rule file, such as iso-8859-15, and the rules would be converted internally
  to Unicode too if and only if (IMO) the normalize_charset option is set to 1.
 
 I think I prefer the current model, where rules are UTF-8, I'm
 afraid ;)

Just to get this straight.

All rules are considered UTF-8 (no difference for ascii ones)?
Is this on 3.2 only or 3.1 too?

I have assumed that it would be iso-8859-1 so far.

-Raul Dias



RE: Google Summer of Code 2007 ...

2007-02-21 Thread David B Funk
On Wed, 21 Feb 2007, R Lists06 wrote:

 May I ask...

 Why is this thread named as such?

 Does Google help fund SA efforts in one or multiple ways?

 If so, may I ask how or directions to already posted docs on it?

  - rh

 --
 Robert - Abba Communications

Yes, if you Google for Google Summer of Code+spamassassin
you'll get a bunch of relevant hits. ;)

For example, check out:
http://wiki.apache.org/spamassassin/SummerOfCode2006

-- 
Dave Funk                                  University of Iowa
dbfunk (at) engineering.uiowa.edu          College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Raul Dias writes:
 On Wed, 2007-02-21 at 17:27 +0100, Justin Mason wrote:
  Raul Dias writes:
   On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
actually I think this is already implemented in 3.2.0 -- see
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
   
   Nice.  This patch solves the message part problem.
   
   With this, rules can be written in Unicode too.
   A final change would be to let rules be written in other charsets.
   
   Rule files are read separately.  An easy implementation would be to add a
   file_charset option.  This option would declare the charset used by the
   rule file, such as iso-8859-15, and the rules would be converted internally
   to Unicode too if and only if (IMO) the normalize_charset option is set to 1.
  
  I think I prefer the current model, where rules are UTF-8, I'm
  afraid ;)
 
 Just to get this straight.
 
 All rules are considered UTF-8 (no difference for ascii ones)?
 Is this on 3.2 only or 3.1 too?
 
 I have assumed that it would be iso-8859-1 so far.

With the normalize_charset code active, it's UTF-8. (iirc)

--j.


RE: Google Summer of Code 2007 ...

2007-02-21 Thread R Lists06
 
 Yes, if you Google for Google Summer of Code+spamassassin
 you'll get a bunch of relevant hits. ;)
 
 For example, check out:
 http://wiki.apache.org/spamassassin/SummerOfCode2006
 

Thank you

I was hoping for meaningful and relevant info from someone of authority and
in the know from the SA group.

I know how to search and I know how to discern and even guess.

Yet, as of late, my experiences with Google and searching are poor.

Sure I can find stuff... yet finding anything helpful or relevant in the sea
of garbage that gets spewed back is another story.

 - rh

--
Robert - Abba Communications
   Computer & Internet Services
 (509) 624-7159 - www.abbacomm.net



Re: Google Summer of Code 2007 ...

2007-02-18 Thread Matthias Leisi


Justin Mason wrote:

 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?

If I look at the tools and scripts I built around SA (and which are far
from perfect), I would like to see:

* Show which whitelist_from(_*) rule hit on messages

* Frequency reporting (à la nightly/mass checks) - currently I grep and
filter this data out of the mail log file, which is error-prone

* Virtualize all *.cf and *.pre parsing to allow full DB-based operations

* Auto-maintenance for AWL (ie pruning old entries)

* Make the internal status of spamd (counters, timings, ..) available eg
through a CLI tool, through SNMP or through a database.

* A solid, stable, full-featured web interface for configuration,
operation and monitoring (yes, DB-based again ;) ).

-- Matthias


Re: Google Summer of Code 2007 ...

2007-02-18 Thread Justin Mason

Graham Murray writes:
 Theo Van Dinter [EMAIL PROTECTED] writes:
  Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.
 
 Not quite. Those show how many times *others* have seen it, not how
 many times *I* have seen it. Also, these have hysteresis so if you are
 unfortunate enough to be at the start of the spam run and receive multiple
 mails all with the same body then Razor, DCC and Pyzor might not
 help. Though if this were implemented then there would have to be a
 whitelist for mailing lists to which multiple users have subscribed.

I know that a few big organisations use a private DCC server for this
purpose, with good results; doing it with a DCC server works well
if you have multiple scanner machines.

--j.


Re: Google Summer of Code 2007 ...

2007-02-18 Thread Tim B.

Justin Mason wrote:

Graham Murray writes:
  

Theo Van Dinter [EMAIL PROTECTED] writes:


Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.
  

Not quite. Those show how many times *others* have seen it, not how
many times *I* have seen it. Also, these have hysteresis so if you are
unfortunate enough to be at the start of the spam run and receive multiple
mails all with the same body then Razor, DCC and Pyzor might not
help. Though if this were implemented then there would have to be a
whitelist for mailing lists to which multiple users have subscribed.



I know that a few big organisations use a private DCC server for this
purpose, with good results; doing it with a DCC server works well
if you have multiple scanner machines.

--j.

  
Hmm... I didn't realize DCC could be used this way... I will investigate.


Re: Google Summer of Code 2007 ...

2007-02-17 Thread Justin Mason

Raul Dias writes:
 On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote:
  On Saturday February 17 2007 01:49, Matthew Wilson wrote:
   I was/am primarily concerned with RAM usage for high-concurrency
   situations.
  
  Ok. Still, in my experience about 30 (maybe 50) SA processes can
  fully utilize today's CPU & I/O, and it's probably no big deal
  to provide about 2 GB of memory to cater for such system.
  Also, and unfortunately, multithreading in Perl is rather
  cumbersome and not significantly less expensive than fully
  individual processes.
 
 Experimenting with sa-blacklist.cf some time ago, 45 processes
 brought my system to its knees with 3.5GB (out of memory).
 
 I agree about the thread model.
 
 But sticking to an async I/O model is a valid point.  If implemented
 correctly it will save a lot of memory and even improve performance a
 little.
 
 Having separate processes saves having to check for leftover garbage
 after filtering a message, which would mean the code has to be
 rechecked.
 
 However, for uniprocessor systems, having multiple processes running is
 actually more expensive than a single async I/O one.  For multiprocessor
 systems, just keep one process per CPU, or fewer.
 
 In the past I have played a lot with perl-loop (any loopers around?)
 which was the only way to go.  It is too low level for most people, but
 perhaps POE is the way to go today (which can use perl-loop as its
 base).

I'm dubious about the benefits for SpamAssassin...

An async model works very well for network-bound and I/O-bound servers;
however, SpamAssassin is mainly CPU-bound, since the network and I/O parts
are already mostly run async during the scan operation.

Also, the multiple spamd processes share quite a lot of RAM with each
other -- there's a bug in how linux reports shared memory which makes it
appear much worse than it is. read the FAQ for more details.

Still, if someone tries it and can demo increased efficiency...
go for it ;)

--j.


Re: Google Summer of Code 2007 ...

2007-02-17 Thread Raul Dias
On Sat, 2007-02-17 at 11:21 +, Justin Mason wrote:
 Raul Dias writes:
  On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote:
   On Saturday February 17 2007 01:49, Matthew Wilson wrote:
I was/am primarily concerned with RAM usage for high-concurrency
situations.
   
   Ok. Still, in my experience about 30 (maybe 50) SA processes can
   fully utilize today's CPU & I/O, and it's probably no big deal
   to provide about 2 GB of memory to cater for such system.
   Also, and unfortunately, multithreading in Perl is rather
   cumbersome and not significantly less expensive than fully
   individual processes.
  
  Experimenting with sa-blacklist.cf some time ago, 45 processes
  brought my system to its knees with 3.5GB (out of memory).
  
  I agree about the thread model.
  
  But sticking to an async I/O model is a valid point.  If implemented
  correctly it will save a lot of memory and even improve performance a
  little.
  
  Having separate processes saves having to check for leftover garbage
  after filtering a message, which would mean the code has to be
  rechecked.
  
  However, for uniprocessor systems, having multiple processes running is
  actually more expensive than a single async I/O one.  For multiprocessor
  systems, just keep one process per CPU, or fewer.
  
  In the past I have played a lot with perl-loop (any loopers around?)
  which was the only way to go.  It is too low level for most people, but
  perhaps POE is the way to go today (which can use perl-loop as its
  base).
 
 I'm dubious about the benefits for SpamAssassin...
 
 An async model works very well for network-bound and I/O-bound servers;
 however, SpamAssassin is mainly CPU-bound, since the network and I/O parts
 are already mostly run async during the scan operation.
 
 Also, the multiple spamd processes share quite a lot of RAM with each
 other -- there's a bug in how linux reports shared memory which makes it
 appear much worse than it is. read the FAQ for more details.

yep, but ...


01:01:37 kernel: Out of Memory: Killed process 10024 (spamd).
01:01:51 kernel: Out of Memory: Killed process 10044 (spamd).
01:02:05 kernel: Out of Memory: Killed process 10612 (spamd).
01:02:19 kernel: Out of Memory: Killed process 10038 (spamd).
01:02:32 kernel: Out of Memory: Killed process 10602 (spamd).
01:02:45 kernel: Out of Memory: Killed process 10398 (spamd).
01:03:04 kernel: Out of Memory: Killed process 10020 (spamd).
01:03:29 kernel: Out of Memory: Killed process 10015 (spamd).
01:03:42 kernel: Out of Memory: Killed process 10237 (spamd).
01:04:00 kernel: Out of Memory: Killed process 11037 (spamd).
01:04:18 kernel: Out of Memory: Killed process 10478 (spamd).
01:04:34 kernel: Out of Memory: Killed process 11065 (spamd).
01:04:40 kernel: Out of Memory: Killed process 10405 (spamd).
...and it goes...

If I remember correctly, spamd was using somewhere between 2 and 5% of
memory as reported by top (45 processes max).

If it were really shared, it would not have collapsed.

My bet is that the model used on Linux is copy-on-write.  So after a
fork, when the child spamd changes a value, the kernel makes the child its
own copy of that memory (please correct me if I am wrong).  To make it worse,
a Perl script is (AFAIK) data and not code, which makes it harder to share
(especially with evals around).

Even if sharing does happen it is not enough.

OTOH, with an async I/O model, the total memory used would be:
 - the perl interpreter and libraries (this is truly shared in a fork
   model).
 - the compiled perl code and perl libraries.
 - one copy of the parsed rules, compiled regular expressions and other
   data not tied to a message or scanner.
 - one M::SA::PerMsgStatus object for each simultaneously scanned message
   (this is the place to put a limit).
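
The arithmetic behind this disagreement can be sketched in a few lines of Python. This is a rough model only: it reads the 2 to 5% figure as per-process (which is how top reports memory), and the 80% shared fraction in the usage note is an invented number for illustration, not a measurement.

```python
def total_share(processes: int, per_process: float,
                shared_fraction: float = 0.0) -> float:
    """Fraction of RAM needed by `processes` children.

    per_process is the per-process memory share as top shows it;
    shared_fraction is the part of that figure made up of pages still
    shared after fork (counted once for all children). Both the model
    and the parameter names are assumptions for illustration.
    """
    shared = per_process * shared_fraction
    private = per_process * (1.0 - shared_fraction)
    return shared + processes * private


# Nothing shared: 45 children at 2% and at 5% of RAM each.
low = total_share(45, 0.02)    # roughly 0.9 of RAM
high = total_share(45, 0.05)   # roughly 2.25 times RAM

# Hypothetical case: 80% of each child's pages stay shared.
mostly_shared = total_share(45, 0.05, shared_fraction=0.8)
```

With no sharing at all, 45 children at 5% each would need more than twice the installed memory, which matches the OOM kills above. With most pages shared the same reported figures would fit comfortably, so the real question is how much copy-on-write sharing survives once each child starts modifying its data.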

 Still, if someone tries it and can demo increased efficiency...
 go for it ;)

This might require some internal changes to SA. Every Sync call would
have to be changed to Async (NON BLOCKING). This might include SQL
calls, DNS calls, exec'ing external apps and even file I/O.
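
As a sketch of what "every sync call becomes async" means in practice, here is a small Python asyncio example (not Perl, and not SpamAssassin code). The lookups are simulated with sleeps; fake_lookup, scan_message and the delays are all hypothetical. It shows one process keeping several messages in flight while their network waits overlap.

```python
import asyncio


async def fake_lookup(name: str, delay: float) -> str:
    """Stand-in for a non-blocking DNS, SQL or Razor call (simulated)."""
    await asyncio.sleep(delay)  # yields to the event loop instead of blocking
    return f"{name}: done"


async def scan_message(msg_id: int) -> list:
    """Start all lookups for one message at once and wait for all of them."""
    results = await asyncio.gather(
        fake_lookup("dns", 0.02),
        fake_lookup("sql", 0.01),
        fake_lookup("razor", 0.03),
    )
    return [f"msg {msg_id} {r}" for r in results]


async def main() -> list:
    # Three messages scanned concurrently inside a single process.
    return await asyncio.gather(*(scan_message(i) for i in range(3)))


all_results = asyncio.run(main())
```

Only the waiting overlaps here. Any CPU-heavy step, such as running the regular expressions, would still block the single event loop, which is the concern raised earlier in the thread about SpamAssassin being mainly CPU-bound.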

-Raul Dias


 --j.



Re: Google Summer of Code 2007 ...

2007-02-17 Thread Matthew Wilson
Raul Dias writes:
**snip
 If I remember correctly, spamd was using somewhere between 2 and 5% of
 memory as reported by top (45 processes max).

 If it were really shared, it would not have collapsed.

 My bet is that the model used on Linux is copy-on-write.  So after a
 fork, when the child spamd changes a value, the kernel makes the child its
 own copy of that memory (please correct me if I am wrong).  To make it worse,
 a Perl script is (AFAIK) data and not code, which makes it harder to share
 (especially with evals around).

 Even if sharing does happen it is not enough.

 OTOH, with an async I/O model, the total memory used would be:
  - the perl interpreter and libraries (this is truly shared in a fork
    model).
  - the compiled perl code and perl libraries.
  - one copy of the parsed rules, compiled regular expressions and other
    data not tied to a message or scanner.

Yeah.  It's the lists and rules and regexes that do it for me.

  - one M::SA::PerMsgStatus object for each simultaneously scanned message
    (this is the place to put a limit).

 Still, if someone tries it and can demo increased efficiency...
 go for it ;)

 This might require some internal changes to SA. Every Sync call would
 have to be changed to Async (NON BLOCKING). This might include SQL
 calls, DNS calls, exec'ing external apps and even file I/O.


An async version of Net::DNS is
http://search.cpan.org/~msergeant/ParaDNS-1.1/



Re: Google Summer of Code 2007 ...

2007-02-17 Thread Chris St. Pierre

On Fri, 16 Feb 2007, Quinn Comendant wrote:


How about an extensive statistics reporting tool, possibly
web-based, that can show how well a current spamassassin
installation is performing and where it needs improvements. It could
provide trends in different classes of spam and how each is
marked. Also show info on whether expensive (as in cpu time) rules
and plugins are actually doing any good.


I don't know that this belongs in SA itself.  It'd be a nice add-on,
but SA already does logging that should be quite sufficient to write
something like this.

Not to mention, the best measure of the success of a spam filtering
plan is user satisfaction.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University
--
Never send mail to [EMAIL PROTECTED]



Re: Google Summer of Code 2007 ...

2007-02-17 Thread Tim B.

Justin Mason wrote:

Theo Van Dinter writes:
  

I'm assuming that there will be a Google Summer of Code 2007 going on, and
that the ASF will be involved again.  So it's a good time to start thinking
about things we'd like to put up as possible projects.

We still have a number of items from last year that we could use again.
Anything else that we'd like people to code up?



Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.

  

How about a "How many times have I seen this message body" plugin...

So each time SA sees the same or a similar enough message body, it
increases the score.



Re: Google Summer of Code 2007 ...

2007-02-17 Thread Theo Van Dinter
On Sat, Feb 17, 2007 at 06:56:28PM -0500, Tim B. wrote:
 How about a "How many times have I seen this message body" plugin...
 
 So each time SA sees the same or a similar enough message body, it
 increases the score.

Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.

-- 
Randomly Selected Tagline:
I love deadlines.  I like the whooshing sound they make as they fly by.
   - Douglas Adams




Re: Google Summer of Code 2007 ...

2007-02-17 Thread Mark Martinec
On Saturday February 17 2007 03:01, Quinn Comendant wrote:
 How about an extensive statistics reporting tool, ..., that
 can show how well a current spamassassin installation is performing
 and where it needs improvements.

Well, not exactly in your words, but in the same spirit,
this time belonging in SA itself:

Instrument SA with a couple of performance-measuring probes,
providing an easier way to spot where the bottlenecks lie.
Just something simple enough to tell you: "look, currently waiting
for a Razor server response (or some RBL) is taking 80% of
elapsed time", or "the Bayes db is very sluggish, it is taking
5 seconds to provide a result".

A timing breakdown by subtasks is not that much work to provide,
but provides great insight into troubleshooting and performance
improvements.

Here is an example of a timing breakdown as currently provided
in the log (at log level 2) by amavisd-new, without getting into
specific details, except to say the numbers are elapsed time
for each subtask in milliseconds (and in percents, just for the
section, and then a cumulative percent of all sections so far):

TIMING [total 1840 ms] - SMTP pre-DATA-flush: 4 (0%)0, SMTP DATA: 95 (5%)5, 
check_init: 1 (0%)5, sql-enter: 69 (4%)9, mime_decode: 16 (1%)10,
get-file-type2: 26 (1%)11, parts_decode: 1 (0%)12, check_header: 3 (0%)12, 
AV-scan-1: 14 (1%)12, AV-scan-2: 20 (1%)14, spam-wb-list: 5 (0%)14,
SA call: 1517 (82%)96, update_cache: 3 (0%)97, decide_mail_destiny: 6 (0%)97,
^
write-header: 15 (1%)98, save-to-local-mailbox: 1 (0%)98,
prepare-dsn: 3 (0%)98, main_log_entry: 12 (1%)99, sql-update: 20 (1%)100,
update_snmp: 2 (0%)100, SMTP pre-response: 1 (0%)100, SMTP response: 1 (0%)100,
unlink-2-files: 1 (0%)100, rundown: 0 (0%)100

It tells at a glance that message checking and I/O for this particular
message took 1840 ms in total, that receiving the message over SMTP
took 5% of this, for example, that the virus scanners were very quick
(14 and 20 ms), and that the SA call took 1517 ms, which is 82% of all
elapsed time; all sections up to and including SA took (cumulatively) 96%
of total elapsed time.

Now, imagine something like this relatively simple timing breakdown, but
drilled down into an SA call, telling the administrator where it is
worth spending effort, or why all of a sudden SA takes 10 seconds
instead of the usual 2.

  Mark
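
A timing probe of this kind is small. The following Python sketch (the proposal itself would be Perl inside SA) records elapsed milliseconds per named section and prints them in the same "name: ms (pct%)cumulative" layout as the log line above. SectionTimer and the section names are hypothetical, and the sample numbers are made up.

```python
import time
from contextlib import contextmanager


class SectionTimer:
    """Collects (name, elapsed_ms) pairs and formats a breakdown line."""

    def __init__(self):
        self.sections = []

    @contextmanager
    def section(self, name):
        """Time the enclosed block and record it under `name`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.sections.append((name, elapsed_ms))

    def report(self):
        """Return 'TIMING [total N ms] - name: ms (pct%)cum, ...'."""
        total = sum(ms for _, ms in self.sections)
        parts, running = [], 0.0
        for name, ms in self.sections:
            running += ms
            pct = 100.0 * ms / total if total else 0.0
            cum = 100.0 * running / total if total else 0.0
            parts.append(f"{name}: {ms:.0f} ({pct:.0f}%){cum:.0f}")
        return f"TIMING [total {total:.0f} ms] - " + ", ".join(parts)


# Fixed, invented figures so the formatting is easy to check by eye.
timer = SectionTimer()
timer.sections = [("bayes", 500.0), ("razor", 1200.0), ("rules", 300.0)]
line = timer.report()

# Real use wraps each subtask: `with live.section("razor"): ...`
live = SectionTimer()
with live.section("noop"):
    pass
```

With the invented figures the line reads "TIMING [total 2000 ms] - bayes: 500 (25%)25, razor: 1200 (60%)85, rules: 300 (15%)100", so a slow Razor server would stand out at a glance.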


Re: Google Summer of Code 2007 ...

2007-02-17 Thread Graham Murray
Theo Van Dinter [EMAIL PROTECTED] writes:
 Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.

Not quite. Those show how many times *others* have seen it, not how
many times *I* have seen it. Also, these have hysteresis so if you are
unfortunate enough to be at the start of the spam run and receive multiple
mails all with the same body then Razor, DCC and Pyzor might not
help. Though if this were implemented then there would have to be a
whitelist for mailing lists to which multiple users have subscribed.
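
A minimal sketch of such a local "how many times have *I* seen it" counter, in Python, with the mailing-list whitelist included. All names here are hypothetical, and the exact-match hash is a simplification: catching bodies that are only "similar enough" would need a fuzzy checksum of the kind DCC or ixhash use.

```python
import hashlib
import re
from collections import Counter


class LocalBodyCounter:
    """Counts how often a normalized message body was seen at this site."""

    def __init__(self, whitelist=()):
        self.counts = Counter()
        # List addresses whose traffic is expected to repeat per subscriber.
        self.whitelist = {addr.lower() for addr in whitelist}

    @staticmethod
    def fingerprint(body):
        """Collapse whitespace and case, then hash (exact match only)."""
        normalized = re.sub(r"\s+", " ", body).strip().lower()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def seen(self, body, list_sender=None):
        """Record one sighting and return the local count so far.

        Whitelisted list traffic is not counted and always returns 0.
        """
        if list_sender and list_sender.lower() in self.whitelist:
            return 0
        key = self.fingerprint(body)
        self.counts[key] += 1
        return self.counts[key]


counter = LocalBodyCounter(whitelist=["users@spamassassin.apache.org"])
first = counter.seen("Buy  now!\n")
second = counter.seen("buy now!")
listed = counter.seen("weekly digest",
                      list_sender="users@spamassassin.apache.org")
```

A rule could then add score once the count passes some threshold. On multiple scanner machines the counts would have to live in a shared store, which is what a private DCC server provides.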


Re: Google Summer of Code 2007 ...

2007-02-17 Thread hamann . w


 Not quite. Those show how many times *others* have seen it, not how
 many times *I* have seen it. Also, these have hysteresis so if you are
 unfortunate enough to be at the start of the spam run and receive multiple
 mails all with the same body then Razor, DCC and Pyzor might not
 help. Though if this were implemented then there would have to be a
 whitelist for mailing lists to which multiple users have subscribed.
 

Hi,

ixhash, which also works that way, definitely started its life as an inhouse 
mail counter.
You could probably use ixhash or razor along with your own server rather than 
the public one

Wolfgang




Re: Google Summer of Code 2007 ...

2007-02-16 Thread C. Bensend

 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?

Perhaps this is trivial, or not desired by anyone else but myself,
but I'd _love_ to be able to strip SpamAssassin tags via spamc and
spamd, instead of having to fire up the full-blown spamassassin
for each message.  :)

Benny


-- 
Very funny, Scotty. Now beam down my clothes.  -- James. T. Kirk




Re: Google Summer of Code 2007 ...

2007-02-16 Thread Doc Schneider

Justin Mason wrote:

Theo Van Dinter writes:

I'm assuming that there will be a Google Summer of Code 2007 going on, and
that the ASF will be involved again.  So it's a good time to start thinking
about things we'd like to put up as possible projects.

We still have a number of items from last year that we could use again.
Anything else that we'd like people to code up?


Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.


<Selfish> Yeah, an updated web interface for adding black and white lists
and per-user options for MySQL/Postgres that is a part of the core
utilities. </Selfish>


--

 -Doc

 SA/SARE -- Ninja
   9:52am  up 21:19, 17 users,  load average: 0.11, 0.36, 0.52

 SARE HQ  http://www.rulesemporium.com/


Re: Google Summer of Code 2007 ...

2007-02-16 Thread DAve

Justin Mason wrote:

Theo Van Dinter writes:

I'm assuming that there will be a Google Summer of Code 2007 going on, and
that the ASF will be involved again.  So it's a good time to start thinking
about things we'd like to put up as possible projects.

We still have a number of items from last year that we could use again.
Anything else that we'd like people to code up?


Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.


Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
loads SA directly. One of the features of MailScanner is called MCP or 
Message Content Protection. MCP uses, or attempts to use, SA to do 
specific targeted message content checking. Many people, we included, 
would like to be able to use this but it seems there is always some 
gotcha to having SA loaded in MailScanner twice. Problems with the 
directory paths, rules in memory, etc.


The ability to run SA with two totally different configurations in the 
same application would be very handy. Different rules for outbound mail vs 
inbound mail, MCP (as in this user wants zero mail with the word breast 
regardless of the rest of the message content) are just two examples.


Contacting Julian on the MailScanner list would give far better examples 
and details than I could.


Just a thought,

DAve


--
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?

Maybe they forgot who made that choice possible.


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?

I believe this was once mentioned on Justin's blog (but can't find
a ref now), the following sounds promising as an additional classifier
to existing bayes (especially since the author comes from the same
organization as myself :)

http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec

  ijsSPAM2PPM-D compression model
Andrej Bratko (Josef Stefan Institute)

  Observations:
  The most startling observation is that character-based compression models
  perform outstandingly well for spam filtering. Commonly used open-source
  filters perform well, but not nearly so well or nearly so poorly as
  reported elsewhere.

Mark
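
To illustrate the general idea of compression-based classification, here is a toy Python sketch. It is not the PPM-D model from the article: it uses zlib (DEFLATE) as a stand-in, and the corpora and function names are invented. A message is assigned to whichever class's training text it compresses best against, measured as the number of extra bytes needed when the message is appended to that corpus.

```python
import zlib


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """Compressed-size growth when `message` is appended to `corpus`."""
    with_msg = len(zlib.compress(corpus + message, 9))
    without = len(zlib.compress(corpus, 9))
    return with_msg - without


def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Pick the class whose corpus 'explains' the message more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Tiny invented training corpora, for illustration only.
spam_corpus = b"cheap pills online, buy cheap pills now, limited offer\n" * 3
ham_corpus = b"the patch for the plugin hooks is attached, please review\n" * 3

verdict_a = classify(b"buy cheap pills now, limited offer\n",
                     spam_corpus, ham_corpus)
verdict_b = classify(b"the patch for the plugin hooks is attached\n",
                     spam_corpus, ham_corpus)
```

Because the model works on raw characters rather than tokens, it needs no tokenizer and is harder to defeat with word obfuscation, which is part of why the approach attracted attention. A real implementation would use an adaptive model such as PPM and far larger corpora.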


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Justin Mason

Mark Martinec writes:
  Also, any suggestions from outside the dev team?  Anyone got good ideas
  for new SpamAssassin features that would be good to pay someone to work on
  for 3 months?
 
 I believe this was once mentioned on Justin's blog (but can't find
 a ref now), the following sounds promising as an additional classifier
 to existing bayes (especially since the author comes from the same
 organization as myself :)
 
 http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec
 
   ijsSPAM2PPM-D compression model
 Andrej Bratko (Josef Stefan Institute)
 
   Observations:
   The most startling observation is that character-based compression models
   perform outstandingly well for spam filtering. Commonly used open-source
   filters perform well, but not nearly so well or nearly so poorly as
   reported elsewhere.

Yes, definitely!  A related algorithm is OSBF, as implemented here:
http://osbf-lua.luaforge.net/ This had the best performance for
hand-trained probabilistic classifiers in the TREC Spam Track 2006
evaluation -- that's good ;)

Also, a related project would be to complete the pluginization of our
Bayes engine and APIs, so that other probabilistic classifiers can be
plugged in in place of, or in addition to, Bayes in SpamAssassin.

--j.
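
One possible shape for such a pluggable classifier API, sketched in Python rather than Perl. Every class and method name here is hypothetical and none of it is SpamAssassin's actual plugin interface. The point is that the core only talks to an abstract interface, so a Bayes backend, OSBF or a compression model could be registered side by side, each with its own weight.

```python
from abc import ABC, abstractmethod


class Classifier(ABC):
    """Interface a probabilistic classifier plugin would implement."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def learn(self, text, is_spam):
        """Train on one message."""

    @abstractmethod
    def spam_probability(self, text):
        """Return a value between 0.0 (ham) and 1.0 (spam)."""


class WordCountClassifier(Classifier):
    """Toy stand-in for a real backend; not a Bayes implementation."""

    def __init__(self, name):
        super().__init__(name)
        self.spam_words, self.ham_words = set(), set()

    def learn(self, text, is_spam):
        target = self.spam_words if is_spam else self.ham_words
        target.update(text.lower().split())

    def spam_probability(self, text):
        words = set(text.lower().split())
        spam_hits = len(words & self.spam_words)
        ham_hits = len(words & self.ham_words)
        if spam_hits + ham_hits == 0:
            return 0.5  # no evidence either way
        return spam_hits / (spam_hits + ham_hits)


class ClassifierRegistry:
    """Holds registered plugins and combines them by weighted average."""

    def __init__(self):
        self._plugins = {}

    def register(self, plugin, weight=1.0):
        self._plugins[plugin.name] = (plugin, weight)

    def combined_probability(self, text):
        total = sum(w for _, w in self._plugins.values())
        if total == 0:
            return 0.5
        return sum(p.spam_probability(text) * w
                   for p, w in self._plugins.values()) / total


main = WordCountClassifier("main")
main.learn("cheap pills", is_spam=True)
main.learn("patch review", is_spam=False)
trial = WordCountClassifier("trial")  # untrained newcomer, small weight

registry = ClassifierRegistry()
registry.register(main, weight=1.0)
registry.register(trial, weight=0.1)
combined = registry.combined_probability("cheap pills today")
```

The small weight on the second plugin shows how a new classifier could run alongside the established one with little influence on the verdict while its behaviour is observed.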


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Dan


On Feb 16, 2007, at 7:35, Justin Mason wrote:
We still have a number of items from last year that we could use  
again.

Anything else that we'd like people to code up?


Also, any suggestions from outside the dev team?  Anyone got good  
ideas
for new SpamAssassin features that would be good to pay someone to  
work on

for 3 months?


I don't know how to code myself but have a new method for scoring  
messages that could be included natively in SA.  It would work in  
place of weight based scoring.  Does this sound like it qualifies?


Dan


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec
Justin Mason writes:
 Also, a related project would be to complete the pluginization of our
 Bayes engine and APIs, so that other probabilistic classifiers can be
 plugged in in place of, or in addition to, Bayes in SpamAssassin.

Right. I felt a need for something like this when I was switching
Bayes from MySQL to PostgreSQL 8.2 and my first urge was to run both
in parallel, giving the new one small scores and see how it behaves.

(but I wasn't desperate enough to implement it, just switched
and kept an eye on it for a while. Btw, the 8.2 behaves much
better (faster) than earlier versions, seems like some new
optimizations were geared precisely to suit SA queries)

  Mark


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Chris St. Pierre

On Fri, 16 Feb 2007, Mark Martinec wrote:


I believe this was once mentioned on Justin's blog (but can't find
a ref now), the following sounds promising as an additional classifier
to existing bayes (especially since the author comes from the same
organization as myself :)

http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec

 ijsSPAM2PPM-D compression model
   Andrej Bratko (Josef Stefan Institute)

 Observations:
 The most startling observation is that character-based compression models
 perform outstandingly well for spam filtering. Commonly used open-source
 filters perform well, but not nearly so well or nearly so poorly as
 reported elsewhere.


This looks very promising.  I found a description of the ijsSPAM2 tool
on the site:

http://www.virusbtn.com/spambulletin/archive/2006/03/sb200603-compression

Remarkable stuff.  That would be a helluva nice plugin to have.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

Never send mail to [EMAIL PROTECTED]


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Duncan Findlay
On Fri, Feb 16, 2007 at 09:31:13AM -0800, Dan wrote:

 On Feb 16, 2007, at 7:35, Justin Mason wrote:
 We still have a number of items from last year that we could use again.
 Anything else that we'd like people to code up?

 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?

 I don't know how to code myself but have a new method for scoring messages 
 that could be included natively in SA.  It would work in 
 place of weight based scoring.  Does this sound like it qualifies?

Perhaps pluginize the scoring mechanisms so we can have plugins that
implement different ham/spam decision rules?

-- 
Duncan Findlay




Re: Google Summer of Code 2007 ...

2007-02-16 Thread Bart Schaefer

On 2/16/07, Justin Mason [EMAIL PROTECTED] wrote:


Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?


http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3785


Re: Google Summer of Code 2007 ...

2007-02-16 Thread John D. Hardin
On Fri, 16 Feb 2007, Justin Mason wrote:

 Also, a related project would be to complete the pluginization of
 our Bayes engine and APIs, so that other probabilistic
 classifiers can be plugged in in place of, or in addition to,
 Bayes in SpamAssassin.

+1

--
 John Hardin KA7OHZ    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174    pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your 
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason.
---
 6 days until George Washington's 275th Birthday



Re: Google Summer of Code 2007 ...

2007-02-16 Thread John Rudd

John D. Hardin wrote:

On Fri, 16 Feb 2007, Justin Mason wrote:


Also, a related project would be to complete the pluginization of
our Bayes engine and APIs, so that other probabilistic
classifiers can be plugged in in place of, or in addition to,
Bayes in SpamAssassin.


+1



If that's a notation for "me too", then:

++

I'm all for all implications on this subject so far:

1) the new PPM-D compression based classifier technique

2) pluginization of all of the probability based classifiers, so that 
sites can choose between the SA bayes implementation, other bayes 
implementations, PPM-D, or other learning processes (individually or in 
combinations)




RE: Google Summer of Code 2007 ...

2007-02-16 Thread Matthew Wilson
- Full, tested, supportable multithreaded support

- Full, tested, supportable support for an asynchronous I/O model (a la
qpsmtpd-async)

- Pluggable to the point where all configuration and settings can be pulled
from anywhere (databases, files, in-memory cache) at runtime, so SA could
stay resident and have its configuration be changed in-process.



Re: Google Summer of Code 2007 ...

2007-02-16 Thread Raul Dias
On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
 Theo Van Dinter writes:
  I'm assuming that there will be a Google Summer of Code 2007 going on, and
  that the ASF will be involved again.  So it's a good time to start thinking
  about things we'd like to put up as possible projects.
  
  We still have a number of items from last year that we could use again.
  Anything else that we'd like people to code up?
 
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?
 
 --j.

Not a direct improvement, but...

- Add more hooks for plugins, to allow a broader pluginization of SA,
  e.g. letting plugins load before rules are parsed.

- Better documentation of the internal structures used, to help plugin
  authors avoid breaking stuff.

- Turn some structures into classes to facilitate reuse by plugins.

- The pluginization of SA, from Bayes to header, body, rawbody and score
  rules.  The entire process of doing so would open doors for more
  external plugin usage and control.

While this might bring a slightly slower startup, in the long run the
benefits can be great.

-Raul Dias



Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec
 Also, any suggestions from outside the dev team?  Anyone got good ideas
 for new SpamAssassin features that would be good to pay someone to work on
 for 3 months?

Here's another one, to seize the opportunity when internal changes
are being contemplated:

Split the process into two parts:

- parsing and munging of mail & rules, resulting in a set of
  findings (e.g. a list of rules being hit, perhaps somehow
  generalized). This section can be done once per message,
  regardless of the number of recipients to the message
  (assuming all users use the same rules);

- based on the above, score the findings, possibly
  applying per-recipient scoring to each rule being hit;
  This (rather inexpensive) step can be applied for each
  recipient individually, without having to re-process
  an entire message in multiple-recipient mail.

...and adjust the API to Mail::SpamAssassin accordingly, so that
MTA-based content filtering (e.g. amavisd-new) could take advantage
of it, while still allowing full per-recipient customization of
individual rules scores (including disabling some by a score of 0).

Benefits depend on the site, but our stats show 1.46 recipients
per message on average. The above change (when calling SA
at MTA level) would bring a 46% increase in throughput for free,
while still providing individualized rules scoring.

  Mark
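
The two-phase split can be sketched as follows, in Python for brevity. The function names, rule names and scores are all hypothetical and this is not the Mail::SpamAssassin API. The expensive rule matching runs once per message, and the cheap per-recipient scoring reuses the resulting list of hits.

```python
import re

# Invented rules and default scores, for illustration only.
RULES = {
    "LOCAL_PILLS": r"cheap pills",
    "LOCAL_URGENT": r"act now",
}
DEFAULT_SCORES = {"LOCAL_PILLS": 2.5, "LOCAL_URGENT": 1.0}


def scan(message, rules):
    """Expensive phase: run every rule once, return names of rules that hit."""
    return [name for name, pattern in rules.items()
            if re.search(pattern, message)]


def score(hits, default_scores, user_scores=None):
    """Cheap phase: sum scores over the hits for one recipient.

    A per-user score of 0 effectively disables that rule for the user.
    """
    user_scores = user_scores or {}
    return sum(user_scores.get(name, default_scores[name]) for name in hits)


message = "Get cheap pills today, act now!"
user_prefs = {"bob@example.com": {"LOCAL_URGENT": 0}}

hits = scan(message, RULES)  # done once, whatever the recipient count
per_recipient = {
    rcpt: score(hits, DEFAULT_SCORES, user_prefs.get(rcpt))
    for rcpt in ["alice@example.com", "bob@example.com"]
}
```

If scoring is negligible next to scanning, a site averaging 1.46 recipients per message drops from 1.46 scans per message to 1, so the same scanning capacity handles 1.46 times as many messages, which is where the 46% figure comes from.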


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec
Matthew Wilson wrote:
 - Full, tested, supportable multithreaded support
 - Full, tested, supportable support for an asynchronous I/O model
   (a la qpsmtpd-async)

I think effort could be better spent elsewhere.

Spam checking lends itself ideally to running parallel individual
processes, with little if any interaction between them.
For an individual user a reduction in processing latency from
three seconds to one doesn't mean a thing. For an entire mail
filtering system all that matters is its throughput (messages per
hour). Multithreading brings no performance benefits in this area.

  Mark


RE: Google Summer of Code 2007 ...

2007-02-16 Thread Matthew Wilson
 -Original Message-
 From: Mark Martinec [mailto:[EMAIL PROTECTED]
 Sent: Friday, February 16, 2007 6:09 PM
 To: users@spamassassin.apache.org
 Subject: Re: Google Summer of Code 2007 ...
 
 Matthew Wilson wrote:
  - Full, tested, supportable multithreaded support
  - Full, tested, supportable support for an asynchronous I/O model
(a la qpsmtpd-async)
 
 I think effort could be better spent elsewhere.
 
 Spam checking lends itself ideally to running parallel individual
 processes, with little if any interaction between them.
 For an individual user a reduction in processing latency from
 three seconds to one doesn't mean a thing. For an entire mail
 filtering system all that matters is its throughput (messages per
 hour). Multithreading brings no performance benefits in this area.
 
   Mark

I was/am primarily concerned with RAM usage for high-concurrency situations.



Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec
On Saturday February 17 2007 01:49, Matthew Wilson wrote:
 I was/am primarily concerned with RAM usage for high-concurrency
 situations.

Ok. Still, in my experience about 30 (maybe 50) SA processes can
fully utilize today's CPU & I/O, and it's probably no big deal
to provide about 2 GB of memory to cater for such a system.
Also, and unfortunately, multithreading in Perl is rather
cumbersome and not significantly less expensive than fully
individual processes.

  Mark
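
For a feel of what those numbers imply per process, here is a back-of-the-envelope sketch. It assumes the 2 GB is split evenly and that nothing is shared between processes, which is pessimistic when children are forked from a common parent and share pages copy-on-write. The helper name is made up for this example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def per_process_budget_mib(total_bytes: int, processes: int) -> float:
    """Even split of a memory budget across worker processes, in MiB."""
    return total_bytes / processes / MIB


# The 30 and 50 process counts and the 2 GB budget come from the message above.
for count in (30, 50):
    print(count, "processes:", round(per_process_budget_mib(2 * GIB, count), 1), "MiB each")
```

A rule set that pushes each process well past that budget would exhaust memory before the CPU is saturated.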


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Justin Mason

Mark Martinec writes:
 On Saturday February 17 2007 01:49, Matthew Wilson wrote:
  I was/am primarily concerned with RAM usage for high-concurrency
  situations.
 
 Ok. Still, in my experience about 30 (maybe 50) SA processes can
 fully utilize today's CPU & I/O, and it's probably no big deal
 to provide about 2 GB of memory to cater for such a system.
 Also, and unfortunately, multithreading in Perl is rather
 cumbersome and not significantly less expensive than fully
 individual processes.

yep -- that's pretty much what I've found, too.  The earlier, non-ithreads
version is pretty much nonfunctional :(

--j.


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Quinn Comendant
On Fri, 16 Feb 2007 15:35:39 +, Justin Mason wrote:
 We still have a number of items from last year that we could use again.
 Anything else that we'd like people to code up?

How about an extensive statistics reporting tool, possibly web-based, that can
show how well a current spamassassin installation is performing and where it
needs improvement. It could show trends in different classes of spam and
how each is marked. It could also show whether expensive (as in cpu time) rules
and plugins are actually doing any good.
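
To make the idea concrete, a minimal sketch of the kind of tally such a tool might start from. The record layout (`is_spam`, `rules`) and the rule names are hypothetical placeholders, not an existing SpamAssassin log format, and the sample data is invented purely to exercise the function.

```python
from collections import Counter


def rule_report(records):
    """Count, per rule, how many spam and ham messages it fired on."""
    spam_hits, ham_hits = Counter(), Counter()
    for rec in records:
        target = spam_hits if rec["is_spam"] else ham_hits
        target.update(set(rec["rules"]))  # count each rule once per message
    report = {}
    for rule in sorted(set(spam_hits) | set(ham_hits)):
        spam, ham = spam_hits[rule], ham_hits[rule]
        report[rule] = {"spam": spam, "ham": ham, "spam_ratio": spam / (spam + ham)}
    return report


# Invented sample records; RULE_A and RULE_B are placeholder names.
sample = [
    {"is_spam": True, "rules": ["RULE_A", "RULE_B"]},
    {"is_spam": True, "rules": ["RULE_A"]},
    {"is_spam": False, "rules": ["RULE_B"]},
]
for rule, stats in rule_report(sample).items():
    print(rule, stats)
```

A rule that fires about as often on ham as on spam, like the placeholder RULE_B here, is a candidate for review, and more so if it is also costly in cpu time.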

And/or a fix for the qmail+simscan per-user preferences spamc -u issue where,
if an email is addressed to multiple users or an alias, spamc isn't passed the
correct user.

Quinn

-
Strangecode :: Internet Consultancy
http://www.strangecode.com/


Re: Google Summer of Code 2007 ...

2007-02-16 Thread Raul Dias
On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote:
 On Saturday February 17 2007 01:49, Matthew Wilson wrote:
  I was/am primarily concerned with RAM usage for high-concurrency
  situations.
 
 Ok. Still, in my experience about 30 (maybe 50) SA processes can
 fully utilize today's CPU & I/O, and it's probably no big deal
 to provide about 2 GB of memory to cater for such a system.
 Also, and unfortunately, multithreading in Perl is rather
 cumbersome and not significantly less expensive than fully
 individual processes.

Some time ago, experimenting with sa-blacklist.cf and 45 processes
brought my system, with 3.5GB, to its knees (out of memory).

I agree about the thread model.

But sticking to an async I/O model is a valid point.  If implemented
correctly it will save a lot of memory and even improve performance a
little.

Having separate processes does save the need to check for garbage left
behind after filtering a message; an async model would need that check,
which means the code would have to be rechecked.

However, on uniprocessor systems, running multiple processes is
actually more expensive than an async I/O model.  On multiprocessor
systems, just keep one process per CPU, or fewer.
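
As an illustration of the async I/O idea only (in Python's asyncio, not perl-loop or POE, and not SpamAssassin code): one process can keep many network-bound checks in flight at once, so the waits overlap instead of each occupying its own process. The `check_message` coroutine and its 50 ms wait are stand-ins for something like a DNS lookup; CPU-bound rule matching would not overlap this way.

```python
import asyncio
import time


async def check_message(msg_id: int, network_wait: float = 0.05) -> int:
    """Stand-in for a network-bound test; sleeps instead of doing real I/O."""
    await asyncio.sleep(network_wait)
    return msg_id


async def scan_all(count: int) -> list:
    # All checks are started together and wait concurrently in one process.
    return await asyncio.gather(*(check_message(i) for i in range(count)))


def run_demo(count: int = 20):
    start = time.perf_counter()
    results = asyncio.run(scan_all(count))
    return results, time.perf_counter() - start


results, elapsed = run_demo()
print(len(results), "checks finished in", round(elapsed, 3), "seconds")
```

Run one after another, the 20 simulated waits would add up to about a second; overlapped in a single event loop they take roughly as long as one wait.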

In the past I have played a lot with perl-loop (any loopers around?)
which was the only way to go.  It is too low level for most people, but
perhaps POE is the way to go today (which can use perl-loop as its
base).

-Raul Dias



Re: Google Summer of Code 2007 ...

2007-02-16 Thread Quinn Comendant
On Fri, 16 Feb 2007 18:01:37 -0800, Quinn Comendant wrote:
 And/or a fix for the qmail+simscan per-user preferences spamc -u
 issue where, if an email is addressed to multiple users or an alias,
 spamc isn't passed the correct user.

Sorry to reply to myself, but I want to retract that last suggestion: it's not 
really spamassassin's job to parse recipient lists and resolve aliases.

Q
