
Google Summer of Code 2007 - Students Wanted

2007-03-16 Thread Michael Parker

Howdy,

The time of year for Google Summer of Code has already arrived and once
again the Apache Software Foundation is taking part.

We are currently looking for students who wish to work on SpamAssassin
related projects over the summer.

You have until *March 24th* to sign up and submit an application.  Work
on the project will take place from May 28th through August 20th.

You can find a list of possible projects here (just search for
spamassassin):

http://wiki.apache.org/general/SummerOfCode2007

That is by no means an exhaustive list so if you have other ideas or
know of something from here:

http://wiki.apache.org/spamassassin/WeLoveVolunteers

that you would like to work on, feel free to add it to the list and
submit an application.

Last year we were able to take on several projects; it's a nice way to
earn 4500 USD over the summer.

Thanks
Michael Parker

Re: [2] Google Summer of Code 2007 ...

2007-02-25 Thread Andrej Bratko

Chris St. Pierre wrote:
> 
> 
> Mark Martinec wrote:
>> 
>> 
>> ... the following sounds promising as an additional classifier
>> to existing bayes (especially since the author comes from the same
>> organization as myself :)
>> 
>> http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec
>> 
>> ijsSPAM2: PPM-D compression model
>> Andrej Bratko (Josef Stefan Institute)
>> 
>> Observations:
>> The most startling observation is that character-based compression models
>> perform outstandingly well for spam filtering. Commonly used open-source
>> filters perform well, but not nearly so well or nearly so poorly as
>> reported elsewhere.
>> 
>> 
> 
> This looks very promising.  I found a description of the ijsSPAM2 tool
> on the site:
> 
> http://www.virusbtn.com/spambulletin/archive/2006/03/sb200603-compression
> 
> Remarkable stuff.  That would be a helluva nice plugin to have.
> 
> 

I've recently released a C++ library that includes an implementation of the
PPM-D algorithm, geared towards classification (or mail filtering). This is
essentially the same algorithm that appeared at TREC 2005 as `ijsSPAM2'.

It's available at:
http://ai.ijs.si/andrej/psmslib.html

There's also a Python wrapper:
http://ai.ijs.si/andrej/psmpylib.html

The C++ library and Python extension module are free for personal and
research use, but unfortunately, I cannot disclose the source code at this
time, or release the libraries under an Apache-compatible license. Anyway,
you might want to try it out before coding your own implementation.
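
For readers who want a feel for compression-based classification before trying the library, here is a minimal sketch. It is not PPM-D and does not use the psmslib API (whose interface is not shown in this thread); it swaps in zlib from the Python standard library as a stand-in compressor. The corpora and function names are made up for illustration. The idea is to label a message with the class whose training text lets the message compress into the fewest extra bytes.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """How many compressed bytes the message adds when appended to the corpus."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: bytes, spam_corpus: bytes, ham_corpus: bytes) -> str:
    """Pick the class whose corpus 'explains' the message more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy training data (hypothetical); a real filter would use large mail corpora.
SPAM = b"buy cheap pills now " * 20
HAM = b"meeting agenda attached for review " * 20

print(classify(b"buy cheap pills now ", SPAM, HAM))
print(classify(b"meeting agenda attached for review ", SPAM, HAM))
```

A character-level model such as PPM-D predicts each byte from its preceding context, which is far more refined than zlib's dictionary matching; the sketch only shows the decision rule.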


RE: Google Summer of Code 2007 ...

2007-02-21 Thread R Lists06

> 
> Yes, if you Google for "Google Summer of Code"+spamassassin
> you'll get a bunch of relevant hits. ;)
> 
> For example, check out:
> http://wiki.apache.org/spamassassin/SummerOfCode2006
> 

Thank you

I was hoping for meaningful and relevant info from someone of authority and
in the know from the SA group.

I know how to search and I know how to discern and even guess.

Yet, as of late, my experiences with Google and searching are poor.

Sure I can find stuff... yet finding anything helpful or relevant in the sea
of garbage that gets spewed back is another story.

 - rh

--
Robert - Abba Communications
   Computer & Internet Services
 (509) 624-7159 - www.abbacomm.net

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


Raul Dias writes:
> On Wed, 2007-02-21 at 17:27 +0100, Justin Mason wrote:
> > Raul Dias writes:
> > > On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > > > actually I think this is already implemented in 3.2.0 -- see
> > > > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
> > > 
> > > Nice.  This patch solves the message part problem.
> > > 
> > > With this, rules can be written in Unicode too.
> > > A final change would be to let rules be written in other charsets.
> > > 
> > > Rule files are read separately.  An easy implementation would be to add a
> > > file_charset option.  This option would declare the charset used by the
> > > rule file (e.g. iso-8859-15), and the rules would be converted internally
> > > to Unicode if and only if (IMO) the normalize_charset option is set to 1.
> > 
> > I think I prefer the current model, where rules are UTF-8, I'm
> > afraid ;)
> 
> Just to get this straight.
> 
> All rules are considered UTF-8 (no difference for ascii ones)?
> Is this on 3.2 only or 3.1 too?
> 
> I have assumed that it would be iso-8859-1 so far.

With the "normalize_charset" code active, it's UTF-8. (iirc)

--j.

RE: Google Summer of Code 2007 ...

2007-02-21 Thread David B Funk

On Wed, 21 Feb 2007, R Lists06 wrote:

> May I ask...
>
> Why is this thread named as such?
>
> Does Google help fund SA efforts in one or multiple ways?
>
> If so, may I ask how or directions to already posted docs on it?
>
>  - rh
>
> --
> Robert - Abba Communications

Yes, if you Google for "Google Summer of Code"+spamassassin
you'll get a bunch of relevant hits. ;)

For example, check out:
http://wiki.apache.org/spamassassin/SummerOfCode2006

-- 
Dave Funk  University of Iowa
College of Engineering
319/335-5751   FAX: 319/384-0549   1256 Seamans Center
Sys_admin/Postmaster/cell_admin    Iowa City, IA 52242-1527
#include 
Better is not better, 'standard' is better. B{

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Raul Dias

On Wed, 2007-02-21 at 17:27 +0100, Justin Mason wrote:
> Raul Dias writes:
> > On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > > actually I think this is already implemented in 3.2.0 -- see
> > > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
> > 
> > Nice.  This patch solves the message part problem.
> > 
> > With this, rules can be written in Unicode too.
> > A final change would be to let rules be written in other charsets.
> > 
> > Rule files are read separately.  An easy implementation would be to add a
> > file_charset option.  This option would declare the charset used by the
> > rule file (e.g. iso-8859-15), and the rules would be converted internally
> > to Unicode if and only if (IMO) the normalize_charset option is set to 1.
> 
> I think I prefer the current model, where rules are UTF-8, I'm
> afraid ;)

Just to get this straight.

All rules are considered UTF-8 (no difference for ascii ones)?
Is this on 3.2 only or 3.1 too?

I have assumed that it would be iso-8859-1 so far.

-Raul Dias

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Daryl C. W. O'Shea


R Lists06 wrote:

> May I ask...
>
> Why is this thread named as such?
>
> Does Google help fund SA efforts in one or multiple ways?
>
> If so, may I ask how or directions to already posted docs on it?


If you, uh, Google for "Google Summer of Code" I'm sure you'll find all 
you want to know.


Daryl

RE: Google Summer of Code 2007 ...

2007-02-21 Thread R Lists06

May I ask...

Why is this thread named as such?

Does Google help fund SA efforts in one or multiple ways?

If so, may I ask how or directions to already posted docs on it?

 - rh

--
Robert - Abba Communications
   Computer & Internet Services
 (509) 624-7159 - www.abbacomm.net

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


Raul Dias writes:
> On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > actually I think this is already implemented in 3.2.0 -- see
> > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
> 
> Nice.  This patch solves the message part problem.
> 
> With this, rules can be written in Unicode too.
> A final change would be to let rules be written in other charsets.
> 
> Rule files are read separately.  An easy implementation would be to add a
> file_charset option.  This option would declare the charset used by the
> rule file (e.g. iso-8859-15), and the rules would be converted internally
> to Unicode if and only if (IMO) the normalize_charset option is set to 1.

I think I prefer the current model, where rules are UTF-8, I'm
afraid ;)

--j.


Re: Google Summer of Code 2007 ...

2007-02-21 Thread DAve


Justin Mason wrote:
> DAve writes:
>> Justin Mason wrote:
>>> Theo Van Dinter writes:
>>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>>> that the ASF will be involved again.  So it's a good time to start thinking
>>>> about things we'd like to put up as possible projects.
>>>>
>>>> We still have a number of items from last year that we could use again.
>>>> Anything else that we'd like people to code up?
>>>
>>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>>> for new SpamAssassin features that would be good to pay someone to work on
>>> for 3 months?
>>>
>>> --j.
>>
>> Maybe, several of us use MailScanner. MailScanner does not use spamc, it
>> loads SA directly. One of the features of MailScanner is called MCP or
>> Message Content Protection. MCP uses, or attempts to use, SA to do
>> specific targeted message content checking. Many people, we included,
>> would like to be able to use this but it seems there is always some
>> gotcha to having SA loaded in MailScanner twice. Problems with the
>> directory paths, rules in memory, etc.
>>
>> The ability to run SA with two totally different configurations in the
>> same application would be very handy. Different rules for outbound mail vs
>> inbound mail, MCP (as in this user wants zero mail with the word "breast"
>> regardless of the rest of the message content) are just two examples.
>>
>> Contacting Julian on the MailScanner list would give far better examples
>> and details than I could.
>
> cc'd Julian.
>
> This could definitely be done -- I didn't realise there was demand for
> it ;)
>
> This should probably be opened as a bug on the bugzilla, btw.
> is it already there?
>
> --j.


No, I didn't really think it a bug, nor a feature request. I kinda 
viewed it as a future path. Sounds like I am late to the game this 
morning anyway and Jules will enter it.


DAve

--
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?

Maybe they forgot who made that choice possible.

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Raul Dias

On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> actually I think this is already implemented in 3.2.0 -- see
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.

Nice.  This patch solves the message part problem.

With this, rules can be written in Unicode too.
A final change would be to let rules be written in other charsets.

Rule files are read separately.  An easy implementation would be to add a
file_charset option.  This option would declare the charset used by the
rule file (e.g. iso-8859-15), and the rules would be converted internally
to Unicode if and only if (IMO) the normalize_charset option is set to 1.

-Raul Dias
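
The file_charset proposal above can be sketched roughly as follows. This is an illustration in Python, not SpamAssassin code: `load_rule_text` is a hypothetical helper, `file_charset` is the option proposed in this message (it does not exist in SA as far as this thread says), and the rule name `PT_PROMO` is invented. The point is that two rule files holding the same rule in different charsets become identical once decoded to Unicode, and stay different if no normalization happens.

```python
def load_rule_text(raw: bytes, file_charset: str = "utf-8",
                   normalize_charset: bool = True) -> str:
    """Decode the bytes of a rule file.

    With normalize_charset on, honour the declared (hypothetical)
    file_charset and return proper Unicode text. With it off, fall back to a
    byte-for-byte view: latin-1 maps every byte to exactly one code point, so
    no information is lost but non-ASCII characters are not interpreted.
    """
    if normalize_charset:
        return raw.decode(file_charset)
    return raw.decode("latin-1")


# The same rule saved by two authors using different editors (assumed data).
rule_latin9 = "body PT_PROMO /promoção/i".encode("iso-8859-15")
rule_utf8 = "body PT_PROMO /promoção/i".encode("utf-8")

normalized_a = load_rule_text(rule_latin9, file_charset="iso-8859-15")
normalized_b = load_rule_text(rule_utf8)
print(normalized_a == normalized_b)

raw_a = load_rule_text(rule_latin9, normalize_charset=False)
raw_b = load_rule_text(rule_utf8, normalize_charset=False)
print(raw_a == raw_b)
```

Gating the conversion on normalize_charset, as suggested, keeps the old byte-oriented behaviour available for sites that have not enabled normalization.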


Re: Google Summer of Code 2007 ...

2007-02-21 Thread Julian Field




Justin Mason wrote:
> DAve writes:
>> Justin Mason wrote:
>>> Theo Van Dinter writes:
>>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>>> that the ASF will be involved again.  So it's a good time to start thinking
>>>> about things we'd like to put up as possible projects.
>>>>
>>>> We still have a number of items from last year that we could use again.
>>>> Anything else that we'd like people to code up?
>>>
>>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>>> for new SpamAssassin features that would be good to pay someone to work on
>>> for 3 months?
>>>
>>> --j.
>>
>> Maybe, several of us use MailScanner. MailScanner does not use spamc, it
>> loads SA directly. One of the features of MailScanner is called MCP or
>> Message Content Protection. MCP uses, or attempts to use, SA to do
>> specific targeted message content checking. Many people, we included,
>> would like to be able to use this but it seems there is always some
>> gotcha to having SA loaded in MailScanner twice. Problems with the
>> directory paths, rules in memory, etc.
>>
>> The ability to run SA with two totally different configurations in the
>> same application would be very handy. Different rules for outbound mail vs
>> inbound mail, MCP (as in this user wants zero mail with the word "breast"
>> regardless of the rest of the message content) are just two examples.
>>
>> Contacting Julian on the MailScanner list would give far better examples
>> and details than I could.
>
> cc'd Julian.
>
> This could definitely be done -- I didn't realise there was demand for
> it ;)

The basic idea is this:
On normal incoming mail, the usual SpamAssassin setup is used. But on
outbound mail, for example, the company doesn't want any of the normal
SpamAssassin rules. Instead, they want to look for particular "rude"
keywords, company project names, and any other phrases the user wants,
but *not* the standard SpamAssassin rules.

I don't use spamc/spamd or the spamassassin script, I call the function
library directly.

So I need to be able to call SpamAssassin with 2 completely different
sets of configuration settings.

What would also be nice is a way for different messages to use different
SpamAssassin configuration settings; this would add a lot of flexibility
to what I can do with MailScanner's use of SpamAssassin.

I effectively need to be able to create 2 or more SpamAssassin
configurations and have them work entirely independently.

I hope that explains it well enough.

> This should probably be opened as a bug on the bugzilla, btw.

How much detail do you want?

> is it already there?

I've never put it there.

> --j.


Jules

--
Julian Field MEng CITP
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store

Need help customising MailScanner?
Contact me!
Need help fixing or optimising your systems?
Contact me!
Need help getting you started solving new requirements from your boss?
Contact me!

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654




Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


actually I think this is already implemented in 3.2.0 -- see
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.

--j.

Raul Dias writes:
> On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
> > Theo Van Dinter writes:
> > > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > > that the ASF will be involved again.  So it's a good time to start 
> > > thinking
> > > about things we'd like to put up as possible projects.
> > > 
> > > We still have a number of items from last year that we could use again.
> > > Anything else that we'd like people to code up?
> 
> Another thing that might be worth adding to GSC2007.
> 
> Internal Encoding/Charset used by SA.
> 
> I haven't found anything like that, but that doesn't mean SA does not do
> this already.  In that case, sorry :)
> 
> Mail messages can have multiple encodings like ISO-8859-*, utf-8,
> utf-16, windows-*, and so on.
> 
> Also, perl (unless "use utf8" is set) will default to the system encoding
> like LC_CTYPE.
> 
> Rule writers need a way to tell SA which encoding their rules are in.
> 
> This is not a real issue for English rules, but it is for other languages
> like Portuguese, French, Russian, Chinese, Japanese and so on.
> 
> The real problem is that a string with special characters in one
> encoding is not the same in another encoding.
> 
> So, what is needed is:
> 1 - a way to tell SA the encoding/charset used in some rules
> 2 - SA converts the rules to a universal encoding internally
> (e.g. utf-8/16).
> 3 - Temporarily reconvert to the message encoding/charset to match properly.
> 
> I really don't know if SA does something like this internally, but I
> think it does not.
> Doing this will require a considerable amount of work (so, gsc2007).
> 
> Without this kind of support, I think it will be easier in the future for
> spammers to play with charsets to avoid specific rules.
> 
> -Raul Dias

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Raul Dias

On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again.  So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> > 
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?

Another thing that might be worth adding to GSC2007.

Internal Encoding/Charset used by SA.

I haven't found anything like that, but that doesn't mean SA does not do
this already.  In that case, sorry :)

Mail messages can have multiple encodings like ISO-8859-*, utf-8,
utf-16, windows-*, and so on.

Also, perl (unless "use utf8" is set) will default to the system encoding
like LC_CTYPE.

Rule writers need a way to tell SA which encoding their rules are in.

This is not a real issue for English rules, but it is for other languages
like Portuguese, French, Russian, Chinese, Japanese and so on.

The real problem is that a string with special characters in one
encoding is not the same in another encoding.

So, what is needed is:
1 - a way to tell SA the encoding/charset used in some rules
2 - SA converts the rules to a universal encoding internally
(e.g. utf-8/16).
3 - Temporarily reconvert to the message encoding/charset to match properly.

I really don't know if SA does something like this internally, but I
think it does not.
Doing this will require a considerable amount of work (so, gsc2007).

Without this kind of support, I think it will be easier in the future for
spammers to play with charsets to avoid specific rules.

-Raul Dias
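
The mismatch described above is easy to reproduce. The following Python sketch (illustrative only; the sample word and message are made up, and this is not how SpamAssassin is implemented) shows a rule written in UTF-8 failing to match a message body sent in ISO-8859-1 at the byte level, then matching once both sides are decoded to Unicode. Note that it normalizes the message rather than reconverting the rule to the message charset as in step 3; either direction works as long as both sides end up in the same representation.

```python
import re


def to_unicode(data: bytes, charset: str) -> str:
    """Decode bytes in a known charset into Unicode text."""
    return data.decode(charset)


word = "ação"
latin1_bytes = word.encode("iso-8859-1")  # one byte per accented letter
utf8_bytes = word.encode("utf-8")         # two bytes per accented letter

# Same text, different byte strings.
print(latin1_bytes == utf8_bytes)

# A message body that arrived as ISO-8859-1.
body_bytes = "Nova ação promocional".encode("iso-8859-1")

# Byte-level matching with the UTF-8 form of the rule misses the word.
byte_match = utf8_bytes in body_bytes
print(byte_match)

# Steps 1 and 2: the rule's charset is known, so convert it to Unicode.
rule = re.compile(re.escape(to_unicode(utf8_bytes, "utf-8")))
# Decode the message with its own declared charset, then match as text.
body = to_unicode(body_bytes, "iso-8859-1")
text_match = rule.search(body) is not None
print(text_match)
```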

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


Julian Field writes:
> Justin Mason wrote:
> > DAve writes:
> >> Justin Mason wrote:
> >>> Theo Van Dinter writes:
> >>>> I'm assuming that there will be a Google Summer of Code 2007 going on, 
> >>>> and
> >>>> that the ASF will be involved again.  So it's a good time to start 
> >>>> thinking
> >>>> about things we'd like to put up as possible projects.
> >>>>
> >>>> We still have a number of items from last year that we could use again.
> >>>> Anything else that we'd like people to code up?
> >>>> 
> >>> Also, any suggestions from outside the dev team?  Anyone got good ideas
> >>> for new SpamAssassin features that would be good to pay someone to work on
> >>> for 3 months?
> >>>
> >>> --j.
> >>>   
> >> Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
> >> loads SA directly. One of the features of MailScanner is called MCP or 
> >> Message Content Protection. MCP uses, or attempts to use, SA to do 
> >> specific targeted message content checking. Many people, we included, 
> >> would like to be able to use this but it seems there is always some 
> >> gotcha to having SA loaded in MailScanner twice. Problems with the 
> >> directory paths, rules in memory, etc.
> >>
> >> The ability to run SA with two totally different configurations in the 
> >> same application would be very handy. Different rules for outbound mail vs 
> >> inbound mail, MCP(as in this user wants zero mail with the word "breast" 
> >> regardless of the rest of the message content) are just two examples.
> >>
> >> Contacting Julian on the MailScanner list would give far better examples 
> >> and details than I could.
> >
> > cc'd Julian.
> >
> > This could definitely be done -- I didn't realise there was demand for
> > it ;)
> >   
> The basic idea is this:
> On normal incoming mail, the usual SpamAssassin setup is used. But on 
> outbound mail, for example, the company doesn't want any of the normal 
> SpamAssassin rules. But instead, they want to look for particular "rude" 
> keywords, company project names, and any other phrases the user wants, 
> but *not* the standard SpamAssassin rules.
> 
> I don't use spamc/spamd or the spamassassin script, I call the function 
> library directly.
> 
> So I need to be able to call SpamAssassin with 2 completely different 
> sets of configuration settings.
> 
> What would also be nice is a way of different messages using different 
> SpamAssassin configuration settings, this would add a lot of flexibility 
> to what I can do with MailScanner's use of SpamAssassin.
> 
> I effectively need to be able to create 2 or more SpamAssassin 
> configurations and have them work entirely independently.
> 
> I hope that explains it well enough.

It does, thanks.  I presume having two independent Mail::SpamAssassin
objects (assuming they were really independent, which they are not
currently) would be ok?

> > This should probably be opened as a bug on the bugzilla, btw.
> >   
> How much detail do you want?

A cut and paste of the above would be fine.  The idea is that (a)
it's on the bz, where it's more easily tracked and (b) if you open
it, you will be cc'd on updates and comments.

--j.
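
As a rough picture of what "two independent objects" means in practice, here is a small Python sketch. It is not the Mail::SpamAssassin API; the `Scanner` class, rule names and patterns are all hypothetical. Each instance owns its rules and keeps no module-level state, so an inbound scanner and an outbound content-policy scanner can live in the same process without seeing each other's configuration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Scanner:
    """Hypothetical scanner: all configuration lives on the instance."""
    name: str
    rules: dict = field(default_factory=dict)  # rule name -> (regex, score)

    def add_rule(self, rule_name: str, pattern: str, score: float) -> None:
        self.rules[rule_name] = (re.compile(pattern, re.IGNORECASE), score)

    def check(self, text: str):
        """Return (total score, list of rule names that hit)."""
        hits = [n for n, (rx, _) in self.rules.items() if rx.search(text)]
        return sum(self.rules[n][1] for n in hits), hits


# Usual anti-spam rules for inbound mail.
inbound = Scanner("inbound")
inbound.add_rule("CHEAP_PILLS", r"cheap pills", 3.0)

# Content-policy rules for outbound mail, with none of the spam rules.
outbound = Scanner("outbound")
outbound.add_rule("PROJECT_NAME", r"project bluebird", 5.0)

msg = "Status update on Project Bluebird"
print(inbound.check(msg))
print(outbound.check(msg))
```

The gotchas reported in the thread (directory paths, rules held in memory) are the kind of shared state this layout avoids; whether the real library can be made this independent is exactly the open question here.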

Re: Google Summer of Code 2007 ...

2007-02-21 Thread C. Bensend


>> Perhaps this is trivial, or not desired by anyone else but myself,
>> but I'd _love_ to be able to strip SpamAssassin tags via spamc and
>> spamd, instead of having to fire up the full-blown spamassassin
>> for each message.  :)
>
> formail ?

That would work in most cases, yes.  Unfortunately, not in mine.
Thanks for the pointer, though.  :)

Benny


-- 
"During the armageddon, only two things will survive - cockroaches
and Cher."-- "What's her bra size" online game

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


Mark Martinec writes:
>On Saturday February 17 2007 03:01, Quinn Comendant wrote:
>> How about an extensive statistics reporting tool, ..., that
>> can show how well a current spamassassin installation is performing
>> and where it needs improvements.
>
>Well, not exactly by your words, but in the same spirit,
>this time belonging to SA itself:
>
>Instrument SA with a couple of performance measuring probes,
>providing some easier way to spot where bottlenecks lie.
>Just something simple enough to tell, look, currently waiting
>for Razor server response (or some RBL) is taking 80% of
>elapsed time. Or, Bayes db is very sluggish, it is taking
>5 seconds to provide a result.
>
>A timing breakdown by subtasks is not that much work to provide,
>but provides great insight into troubleshooting and performance
>improvements.
>
>Here is an example of a timing breakdown as currently provided
>in the log (at log level 2) by amavisd-new, without getting into
>specific details, except to say the numbers are elapsed time
>for each subtask in milliseconds (and in percents, just for the
>section, and then a cumulative percent of all sections so far):
>
>TIMING [total 1840 ms] - SMTP pre-DATA-flush: 4 (0%)0, SMTP DATA: 95 (5%)5, 
>check_init: 1 (0%)5, sql-enter: 69 (4%)9, mime_decode: 16 (1%)10,
>get-file-type2: 26 (1%)11, parts_decode: 1 (0%)12, check_header: 3 (0%)12, 
>AV-scan-1: 14 (1%)12, AV-scan-2: 20 (1%)14, spam-wb-list: 5 (0%)14,
>SA call: 1517 (82%)96, update_cache: 3 (0%)97, decide_mail_destiny: 6 (0%)97,
>^
>write-header: 15 (1%)98, save-to-local-mailbox: 1 (0%)98,
>prepare-dsn: 3 (0%)98, main_log_entry: 12 (1%)99, sql-update: 20 (1%)100,
>update_snmp: 2 (0%)100, SMTP pre-response: 1 (0%)100,
>SMTP response: 1 (0%)100, unlink-2-files: 1 (0%)100, rundown: 0 (0%)100
>
>It tells at a glance that message checking and I/O for this particular
>message took 1840 ms in total, that receiving a message over SMTP
>for example took 5% of this, virus scanners were very quick (14 and 20 ms),
>and SA call took 1517 ms, which is 82% of all elapsed time;
>all sections up to and including SA (cumulative) took 96% of total elapsed time.
>
>Now, something like this relatively simple timing breakdown, but
>drilled down into a SA call, telling the administrator where it is
>worth spending his effort, or why all of a sudden SA takes 10 seconds
>instead of the usual 2.

again, another good idea, you're on a roll!  I can see that being very
handy -- and a great concise format for that info.  Could you open
*another* feature req for this one? ;)

--j.
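
A timing breakdown in the format quoted above (elapsed ms, percent of total, cumulative percent) takes little code. The Python sketch below is an illustration, not amavisd-new or SpamAssassin code; the `Timing` class and the section names are invented. Each section is timed with a context manager, and `format_timing` renders the collected numbers.

```python
import time
from contextlib import contextmanager


def format_timing(sections):
    """Render [(name, ms), ...] as 'name: ms (pct%)cumulative_pct, ...'."""
    total = sum(ms for _, ms in sections)
    parts, running = [], 0.0
    for name, ms in sections:
        running += ms
        pct = 100.0 * ms / total if total else 0.0
        cum = 100.0 * running / total if total else 0.0
        parts.append(f"{name}: {ms:.0f} ({pct:.0f}%){cum:.0f}")
    return f"TIMING [total {total:.0f} ms] - " + ", ".join(parts)


class Timing:
    """Collects elapsed wall-clock time per named section."""

    def __init__(self):
        self.sections = []  # list of (name, elapsed_ms) in execution order

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.sections.append((name, elapsed_ms))

    def report(self):
        return format_timing(self.sections)


# Fixed numbers, to show the output format.
print(format_timing([("parse", 100), ("rbl_lookup", 800), ("bayes", 100)]))

# Live measurement around real work.
timing = Timing()
with timing.section("parse"):
    sum(range(10000))
print(timing.report())
```

With the fixed numbers, the first line printed is `TIMING [total 1000 ms] - parse: 100 (10%)10, rbl_lookup: 800 (80%)90, bayes: 100 (10%)100`, which makes the slow network lookup stand out immediately.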

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


Mark Martinec writes:
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>
>Here's another one, to seize the opportunity when internal changes
>are being contemplated:
>
>Split the process into two parts:
>
>- parsing and munging of mail & rules, resulting in a set of
>  findings (e.g. a list of rules being hit, perhaps somehow
>  generalized). This section can be done once per message,
>  regardless of the number of recipients to the message
>  (assuming all users use the same rules);
>
>- based on the above, score the findings, possibly
>  applying per-recipient scoring to each rule being hit;
>  This (rather inexpensive) step can be applied for each
>  recipient individually, without having to re-process
>  an entire message in multiple-recipient mail.
>
>...and adjust the API to Mail::SpamAssassin accordingly, so that
>MTA-based content filtering (e.g. amavisd-new) could take advantage
>of it, while still allowing full per-recipient customization of
>individual rules scores (including disabling some by a score of 0).
>
>Benefits depend on the site, but our stats show 1.46 recipients
>per message on average. The above change (when calling SA
>at MTA level) would bring a 46% increase in throughput for free,
>while still providing individualized rule scoring.

Mark, could you open a feature-request bug for this?
I think it'd be worthwhile, and would definitely be a good
SOC project (at least).

--j.
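
The two-phase split proposed above can be pictured with the following Python sketch. It is not the Mail::SpamAssassin API; the rules, scores and user preferences are invented for illustration. The expensive phase (`find_hits`) runs once per message, and the cheap phase (`score_for`) runs once per recipient with that recipient's score overrides, including disabling a rule with a score of 0.

```python
import re

# Hypothetical site-wide rules and default scores.
RULES = {
    "CHEAP_PILLS": re.compile(r"cheap pills", re.I),
    "ALL_CAPS_SUBJECT": re.compile(r"^Subject: [A-Z !]+$", re.M),
}
DEFAULT_SCORES = {"CHEAP_PILLS": 3.0, "ALL_CAPS_SUBJECT": 1.5}

# Per-recipient overrides; bob disables one rule by scoring it 0.
USER_SCORES = {"alice": {}, "bob": {"ALL_CAPS_SUBJECT": 0.0}}


def find_hits(message: str) -> list:
    """Expensive phase: evaluate every rule once per message."""
    return [name for name, rx in RULES.items() if rx.search(message)]


def score_for(hits, user_scores, default_scores) -> float:
    """Cheap phase: sum scores, preferring the recipient's overrides."""
    return sum(user_scores.get(h, default_scores[h]) for h in hits)


message = "Subject: BUY NOW!\n\nGet cheap pills today"
hits = find_hits(message)  # computed once, shared by all recipients
scores = {user: score_for(hits, prefs, DEFAULT_SCORES)
          for user, prefs in USER_SCORES.items()}
print(hits, scores)
```

On the arithmetic: if rule evaluation dominates and scoring is nearly free, scanning once per message instead of once per recipient divides the heavy work by the average recipient count, 1.46 in the stats quoted, which is where the 46% figure comes from. That assumes all recipients share the same rule set, as the proposal itself notes.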

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


"Matthew Wilson" writes:
>- Full, tested, supportable multithreaded support

In my experience, perl threading is just not available in a reliable,
fast implementation -- this is not viable I'm afraid :(

>- Full, tested, supportable support for an asynchronous I/O model (a la
>qpsmtpd-async)

A pretty big task, unfortunately :(  It'd be nice, but could take
a lot of work.

>- Pluggable to the point where all configuration and settings can be pulled
>from anywhere (databases, files, in-memory cache) at runtime, so SA could
>stay resident and have its configuration be changed in-process.

Is this not already possible with "config_text"?

--j.

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason

Raul Dias writes:
>On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
>> Theo Van Dinter writes:
>> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
>> > that the ASF will be involved again.  So it's a good time to start thinking
>> > about things we'd like to put up as possible projects.
>> > 
>> > We still have a number of items from last year that we could use again.
>> > Anything else that we'd like people to code up?
>> 
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>> 
>> --j.
>
>Not a direct improvement, but...
>
>- Add more hooks for plugins to allow a broader pluginization of SA.
>  e.g. letting plugins load before parsing rules.

Is this not already possible?  If not, it should be, agreed.

For each case where you have a problem along these lines, feel free
to open a bug at the bugzilla and we can get it fixed.  Adding a plugin
hook point is generally an easy task since it has very low overhead
in most cases.
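
To illustrate why a hook point is cheap to add, here is a generic Python sketch of the pattern. It does not reflect SpamAssassin's actual plugin interface; the `HookRegistry` class and the hook name `before_parse_rules` are hypothetical. When nothing is registered for a hook, calling it is just an empty loop.

```python
from collections import defaultdict


class HookRegistry:
    """Maps hook names to the callbacks plugins have registered."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, hook: str, callback) -> None:
        self._callbacks[hook].append(callback)

    def callbacks(self, hook: str):
        # .get avoids creating an entry for hooks nobody registered.
        return list(self._callbacks.get(hook, []))


def strip_comments(lines):
    """Example plugin callback: drop comment lines before rules are parsed."""
    return [line for line in lines if not line.lstrip().startswith("#")]


def load_rules(lines, hooks: HookRegistry):
    """Core code: give each plugin a chance to transform the rule text."""
    for transform in hooks.callbacks("before_parse_rules"):
        lines = transform(lines)
    return lines


hooks = HookRegistry()
hooks.register("before_parse_rules", strip_comments)
print(load_rules(["# local tweaks", "body X_RULE /example/"], hooks))
```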

>- Better documentation of the internal structures used, to help plugin
>  authors avoid breaking stuff.

Again, shout if you run into problems -- we need to be prodded into
action ;)

>- "Class"inization of some structures to facilitate plugins reuse.

Might be appropriate in some cases -- note, however, that Perl (in common
with other dynamic languages like Ruby, Python etc.) tends not to use
strictly-defined classes for many cases where C++ or Java would use a
class, instead using a simpler, loosely-defined name->value hash or
similar.  It's a slightly different approach to code design.
So it may not always be appropriate. ;)   Of course, these loosely-defined
hashes often need to be documented...

>- The pluginization of SA.  From Bayes to header, body, rawbody, score 
>  rules.  The entire process of doing so would open doors for more 
>  external plugin usage and control.
>
>While this might bring a slightly slower startup.  In the long run, the
>bennefits can be great.

Yeah, agreed.  This one can be tricky though as there are a lot more
details involved here ;)

I think there's ongoing work to pluginize the whole concept of rule
types -- eg. the entire body rule subsystem becomes just a plugin. It's
not ready yet though, and I'm not sure how fast it's progressing...

--j.

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


DAve writes:
>Justin Mason wrote:
>> Theo Van Dinter writes:
>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>> that the ASF will be involved again.  So it's a good time to start thinking
>>> about things we'd like to put up as possible projects.
>>>
>>> We still have a number of items from last year that we could use again.
>>> Anything else that we'd like people to code up?
>> 
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>> 
>> --j.
>
>Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
>loads SA directly. One of the features of MailScanner is called MCP or 
>Message Content Protection. MCP uses, or attempts to use, SA to do 
>specific targeted message content checking. Many people, we included, 
>would like to be able to use this but it seems there is always some 
>gotcha to having SA loaded in MailScanner twice. Problems with the 
>directory paths, rules in memory, etc.
>
>The ability to run SA with two totally different configurations in the 
>same application would very handy. Different rules for outbound mail vs 
>inbound mail, MCP(as in this user wants zero mail with the word "breast" 
>regardless of the rest of the message content) are just two examples.
>
>Contacting Julian on the MailScanner list would give far better examples 
>and details than I could.

cc'd Julian.

This could definitely be done -- I didn't realise there was demand for
it ;)

This should probably be opened as a bug on the bugzilla, btw.
Is it already there?

--j.

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Justin Mason


Doc Schneider writes:
>Justin Mason wrote:
>> Theo Van Dinter writes:
>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>> that the ASF will be involved again.  So it's a good time to start thinking
>>> about things we'd like to put up as possible projects.
>>>
>>> We still have a number of items from last year that we could use again.
>>> Anything else that we'd like people to code up?
>> 
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>> 
>> --j.
>
> Yeah an updated web interface for adding black and white list 
>and per user options for MySQL/PostGres that is a part of the core 
>utilities. 

Re this one -- what about Maia Mailguard?  I haven't tried it myself,
but it really sounds like they have that kind of thing really nicely
sewn up?

--j.

Re: Google Summer of Code 2007 ...

2007-02-21 Thread Per Jessen

C. Bensend wrote:

> Perhaps this is trivial, or not desired by anyone else but myself,
> but I'd _love_ to be able to strip SpamAssassin tags via spamc and
> spamd, instead of having to fire up the full-blown spamassassin
> for each message.  :)

formail ?


/Per Jessen, Zürich

Re: Google Summer of Code 2007 ...

2007-02-18 Thread Tim B.


Justin Mason wrote:
> Graham Murray writes:
> > Theo Van Dinter <[EMAIL PROTECTED]> writes:
> > > Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.
> >
> > Not quite. Those show how many times *others* have seen it, not how
> > many times *I* have seen it. Also, these have hysteresis so if you are
> > unfortunately to be at the start of the spam run and receive multiple
> > mails all with the same body then Razor, DCC and Pyzor might not
> > help. Though if this were implemented then there would have to a
> > whitelist for mailing lists to which multiple users have subscribed.
>
> I know that a few big organisations use a private DCC server for this
> purpose, with good results; doing it with a DCC server works well
> if you have multiple scanner machines.
>
> --j.

Humm... I didn't realize DCC could be used this way... I will
investigate

Re: Google Summer of Code 2007 ...

2007-02-18 Thread Justin Mason


Graham Murray writes:
> Theo Van Dinter <[EMAIL PROTECTED]> writes:
> > Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.
> 
> Not quite. Those show how many times *others* have seen it, not how
> many times *I* have seen it. Also, these have hysteresis so if you are
> unfortunately to be at the start of the spam run and receive multiple
> mails all with the same body then Razor, DCC and Pyzor might not
> help. Though if this were implemented then there would have to a
> whitelist for mailing lists to which multiple users have subscribed.

I know that a few big organisations use a private DCC server for this
purpose, with good results; doing it with a DCC server works well
if you have multiple scanner machines.

--j.

Re: Google Summer of Code 2007 ...

2007-02-18 Thread Matthias Leisi

Justin Mason wrote:

> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?

If I look at the tools and scripts I built around SA (and which are far
from perfect), I would like to see:

* Show which whitelist_from(_*) rule hit on messages

* Frequency reporting (à la nightly/mass checks) - currently I grep and
filter this data out of the mail log file, which is error-prone

* "Virtualize" all *.cf and *.pre parsing to allow full DB-based operations

* Auto-maintenance for AWL (ie pruning "old" entries)

* Make the internal status of spamd (counters, timings, ..) available eg
through a CLI tool, through SNMP or through a database.

* A solid, stable, full-featured web interface for configuration,
operation and monitoring (yes, DB-based again ;) ).

-- Matthias

Re: Google Summer of Code 2007 ...

2007-02-17 Thread hamann . w



>> Not quite. Those show how many times *others* have seen it, not how
>> many times *I* have seen it. Also, these have hysteresis so if you are
>> unfortunately to be at the start of the spam run and receive multiple
>> mails all with the same body then Razor, DCC and Pyzor might not
>> help. Though if this were implemented then there would have to a
>> whitelist for mailing lists to which multiple users have subscribed.
>> 

Hi,

ixhash, which also works that way, definitely started its life as an
in-house mail counter.
You could probably use ixhash or Razor with your own server rather than
the public one.

Wolfgang

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Graham Murray

Theo Van Dinter <[EMAIL PROTECTED]> writes:
> Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.

Not quite. Those show how many times *others* have seen it, not how
many times *I* have seen it. Also, these have hysteresis, so if you are
unfortunate enough to be at the start of the spam run and receive
multiple mails all with the same body, then Razor, DCC and Pyzor might
not help. Though if this were implemented, there would have to be a
whitelist for mailing lists to which multiple users have subscribed.

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Mark Martinec

On Saturday February 17 2007 03:01, Quinn Comendant wrote:
> How about an extensive statistics reporting tool, ..., that
> can show how well a current spamassassin installation is performing
> and where it needs improvements.

Well, not exactly in your words, but in the same spirit,
and this time belonging in SA itself:

Instrument SA with a couple of performance measuring probes,
providing some easier way to spot where bottlenecks lie.
Just something simple enough to tell, look, currently waiting
for Razor server response (or some RBL) is taking 80% of
elapsed time. Or, Bayes db is very sluggish, it is taking
5 seconds to provide a result.

A timing breakdown by subtasks is not that much work to provide,
but provides great insight into troubleshooting and performance
improvements.

Here is an example of a timing breakdown as currently provided
in the log (at log level 2) by amavisd-new. Without getting into
specific details: the numbers are the elapsed time for each subtask
in milliseconds, followed in parentheses by that section's percentage
of the total, and then the cumulative percentage of all sections so far:

TIMING [total 1840 ms] - SMTP pre-DATA-flush: 4 (0%)0, SMTP DATA: 95 (5%)5, 
check_init: 1 (0%)5, sql-enter: 69 (4%)9, mime_decode: 16 (1%)10,
get-file-type2: 26 (1%)11, parts_decode: 1 (0%)12, check_header: 3 (0%)12, 
AV-scan-1: 14 (1%)12, AV-scan-2: 20 (1%)14, spam-wb-list: 5 (0%)14,
SA call: 1517 (82%)96, update_cache: 3 (0%)97, decide_mail_destiny: 6 (0%)97,
^
write-header: 15 (1%)98, save-to-local-mailbox: 1 (0%)98,
prepare-dsn: 3 (0%)98, main_log_entry: 12 (1%)99, sql-update: 20 (1%)100,
update_snmp: 2 (0%)100, SMTP pre-response: 1 (0%)100,
SMTP response: 1 (0%)100, unlink-2-files: 1 (0%)100, rundown: 0 (0%)100

It tells at a glance that message checking and I/O for this particular
message took 1840 ms in total, that receiving the message over SMTP,
for example, took 5% of this, that the virus scanners were very quick
(14 and 20 ms), and that the SA call took 1517 ms, which is 82% of all
elapsed time; all sections up to and including SA took a cumulative 96%
of total elapsed time.

Now, I'd like something like this relatively simple timing breakdown,
but drilled down into an SA call, telling the administrator where it is
worth spending effort, or why all of a sudden SA takes 10 seconds
instead of the usual 2.
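
As a rough illustration of the kind of breakdown described above, here is
a minimal Python sketch (not SpamAssassin or amavisd-new code). The class
name, the section names and the exact output layout are assumptions made
for the example; it only mimics the "ms (section %)cumulative %" format of
the log line shown earlier.

```python
import time
from contextlib import contextmanager


class TimingBreakdown:
    """Collects elapsed time per named section, in the order sections finish."""

    def __init__(self):
        self.sections = []  # list of (name, elapsed_ms)

    @contextmanager
    def section(self, name):
        # Time whatever runs inside the "with" block.
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.sections.append((name, elapsed_ms))

    def add(self, name, elapsed_ms):
        """Record an already-measured section."""
        self.sections.append((name, elapsed_ms))

    def report(self):
        # Per section: elapsed ms, share of total, cumulative share so far.
        total = sum(ms for _, ms in self.sections)
        parts = []
        cumulative = 0.0
        for name, ms in self.sections:
            cumulative += ms
            pct = 100.0 * ms / total if total else 0.0
            cum_pct = 100.0 * cumulative / total if total else 0.0
            parts.append(f"{name}: {ms:.0f} ({pct:.0f}%){cum_pct:.0f}")
        return f"TIMING [total {total:.0f} ms] - " + ", ".join(parts)
```

Wrapping each subtask in `with timing.section("bayes"): ...` (the section
name is hypothetical) and logging `timing.report()` once per message would
be enough to see which step dominates.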

  Mark

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Theo Van Dinter

On Sat, Feb 17, 2007 at 06:56:28PM -0500, Tim B. wrote:
> How about a "How many times have I seen this message body" plugin...
> 
> So each time SA see's the same or similar enough message body, it 
> increases the score. 

Doesn't SA have at least 3 of those already?  Razor, DCC, and Pyzor.

-- 
Randomly Selected Tagline:
"I love deadlines.  I like the whooshing sound they make as they fly by."
   - Douglas Adams


Re: Google Summer of Code 2007 ...

2007-02-17 Thread Tim B.


Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again.  So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> >
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?
>
> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?
>
> --j.

How about a "How many times have I seen this message body" plugin...

So each time SA sees the same or a similar enough message body, it
increases the score.
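
A minimal Python sketch of that idea, under stated assumptions: it is not
SpamAssassin plugin code, the class and parameter names are made up for
illustration, and it only catches bodies that are identical after
whitespace and case normalization (true "similar enough" matching would
need a fuzzy hash). The sender whitelist reflects the mailing-list concern
raised elsewhere in this thread.

```python
import hashlib
import re
from collections import Counter


def body_fingerprint(body):
    """Hash a normalized body so trivial whitespace/case changes still match."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class LocalBodyCounter:
    """Counts how often *this* site has seen a body and scores repeats."""

    def __init__(self, points_per_repeat=0.5, max_points=3.0, whitelist=()):
        self.seen = Counter()
        self.points_per_repeat = points_per_repeat
        self.max_points = max_points
        self.whitelist = set(whitelist)  # e.g. mailing-list senders

    def score(self, sender, body):
        # Whitelisted senders are neither counted nor scored.
        if sender in self.whitelist:
            return 0.0
        fingerprint = body_fingerprint(body)
        repeats = self.seen[fingerprint]  # times seen before this message
        self.seen[fingerprint] += 1
        return min(self.max_points, repeats * self.points_per_repeat)
```

The first copy of a body scores 0, the second 0.5, and so on up to the
cap; the step size and cap are arbitrary illustration values.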

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Chris St. Pierre


On Fri, 16 Feb 2007, Quinn Comendant wrote:

> How about an extensive statistics reporting tool, possible
> web-based, that can show how well a current spamassassin
> installation is performing and where it needs improvements. It could
> provide trends in different classes of spam and how each is
> marked. Also show info on whether expensive (as in cpu time) rules
> and plugins are actually doing any good.


I don't know that this belongs in SA itself.  It'd be a nice add-on,
but SA already does logging that should be quite sufficient to write
something like this.

Not to mention, the best measure of the success of a spam filtering
plan is user satisfaction.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University
--
Never send mail to [EMAIL PROTECTED]

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Matthew Wilson

Raul Dias writes:
**snip
> If I remember correctly spamd was using something between 2 to 5% of
> memory reported by top (45 process max).
>
> If it was really shared, it would have not collapsed.
>
> My bet is that the model used on Linux is copy on write.  So after a
> fork, when the child spamd changes a value, the kernel makes its own
> copy of the memory. (please correct me if I am wrong).  To make it worse
> perl script (AFAIK) is data and not code which makes harder to reuse
> (espcially with evals around).
>
> Even if sharing does happen it is not enough.
>
> OTOH, with an I/O model, the total memory used would be:
>  - the perl interpreter and libraries (this is trully shared on a fork
> model).
>  - the compiled perl code and perl libraries.
>  - one copy of the parsed rules and compiled regular expressions and non
>message/scanner related data.

Yeah.  It's the lists and rules and regexes that do it for me.

>  - one M::SA::PerMsgStatus object for each simultaneous scanned message
>(this is a place to put a limit on).
>
>> Still, if someone tries it and can demo increased efficiency...
>> go for it ;)
>
> This might require some internal changes to SA. Every Sync call would
> have to be changed to Async (NON BLOCKING). This might include SQL
> calls, DNS calls, exec ing external apps and even file I/O.


An async version of Net::DNS is
http://search.cpan.org/~msergeant/ParaDNS-1.1/

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Raul Dias

On Sat, 2007-02-17 at 11:21 +, Justin Mason wrote:
> Raul Dias writes:
> > On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote:
> > > On Saturday February 17 2007 01:49, Matthew Wilson wrote:
> > > > I was/am primarily concerned with RAM usage for high-concurrency
> > > > situations.
> > > 
> > > Ok. Still, in my experience about 30 (maybe 50) SA processes can
> > > fully utilize today's CPU & I/O, and it's probably no big deal
> > > to provide about 2 GB of memory to cater for such system.
> > > Also, and unfortunately, multithreading in Perl is rather
> > > cumbersome and not significantly less expensive than fully
> > > individual processes.
> > 
> > After experiencing with the sa-blacklist.cf some time ago with 45
> > process brought my system to its knees with 3.5GB (out of memory).  
> > 
> > I agree about the thread model.
> > 
> > But sticking to a async I/O model is a valid point.  If implemented
> > correctly it will save a lot of memory and even improve performance a
> > little.
> > 
> > Having separeted process saves the need to have to check for garbage
> > after filtering a message, which will cause the code to have to be
> > recheck.  
> > 
> > However, for uniprocessor systems, having multiple process running is
> > actually more expansive than a async I/O one.  For multiple process
> > system, just keep one process for cpu or less.
> > 
> > In the past I have played a lot with perl-loop (any loopers around?)
> > which was the only way to go.  It is too low level for most people, but
> > perhaps POE is the way to go today (which can use perl-loop as its
> > base).
> 
> I'm dubious about the benefits for SpamAssassin...
> 
> An async model works very well for network-bound and I/O-bound servers;
> however, SpamAssassin is mainly CPU-bound, since the network and I/O parts
> are already mostly run async during the scan operation.
> 
> Also, the multiple spamd processes share quite a lot of RAM with each
> other -- there's a bug in how linux reports "shared" memory which makes it
> appear much worse than it is. read the FAQ for more details.

yep, but ...

01:01:37 kernel: Out of Memory: Killed process 10024 (spamd).
01:01:51 kernel: Out of Memory: Killed process 10044 (spamd).
01:02:05 kernel: Out of Memory: Killed process 10612 (spamd).
01:02:19 kernel: Out of Memory: Killed process 10038 (spamd).
01:02:32 kernel: Out of Memory: Killed process 10602 (spamd).
01:02:45 kernel: Out of Memory: Killed process 10398 (spamd).
01:03:04 kernel: Out of Memory: Killed process 10020 (spamd).
01:03:29 kernel: Out of Memory: Killed process 10015 (spamd).
01:03:42 kernel: Out of Memory: Killed process 10237 (spamd).
01:04:00 kernel: Out of Memory: Killed process 11037 (spamd).
01:04:18 kernel: Out of Memory: Killed process 10478 (spamd).
01:04:34 kernel: Out of Memory: Killed process 11065 (spamd).
01:04:40 kernel: Out of Memory: Killed process 10405 (spamd).
...and it goes...

If I remember correctly, spamd was using somewhere between 2 and 5% of
memory as reported by top (45 processes max).

If the memory were really shared, the system would not have collapsed.

My bet is that the model used on Linux is copy-on-write.  So after a
fork, when the child spamd changes a value, the kernel gives it its own
copy of the memory (please correct me if I am wrong).  To make it worse,
a Perl script (AFAIK) is data and not code, which makes it harder to
reuse (especially with evals around).

Even if sharing does happen, it is not enough.

OTOH, with an async I/O model, the total memory used would be:
 - the Perl interpreter and libraries (this is truly shared in a fork
   model).
 - the compiled Perl code and Perl libraries.
 - one copy of the parsed rules, compiled regular expressions and
   non-message/scanner-related data.
 - one M::SA::PerMsgStatus object for each simultaneously scanned message
   (this is a place to put a limit on).

> Still, if someone tries it and can demo increased efficiency...
> go for it ;)

This might require some internal changes to SA. Every synchronous call
would have to be changed to an asynchronous (non-blocking) one. This
might include SQL calls, DNS calls, exec'ing external apps and even
file I/O.
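
To make the shape of such a model concrete, here is a small Python asyncio
sketch. It is only an analogy, not SA code: the lookups are simulated with
sleeps, and all function names, delays and limits are assumptions. One
process keeps several messages in flight, each waiting on its network
calls without blocking the others, with a cap on simultaneous scans.

```python
import asyncio
import time


async def fake_lookup(name, delay):
    """Stand-in for a non-blocking DNS/SQL/Razor call; it only sleeps."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def scan_message(msg_id, limiter):
    # The semaphore caps how many messages are scanned at once,
    # i.e. how many per-message state objects exist at a time.
    async with limiter:
        results = await asyncio.gather(
            fake_lookup(f"msg{msg_id} dnsbl", 0.05),
            fake_lookup(f"msg{msg_id} razor", 0.05),
        )
        return msg_id, results


async def main(n_messages=10, max_in_flight=5):
    limiter = asyncio.Semaphore(max_in_flight)
    start = time.perf_counter()
    scanned = await asyncio.gather(
        *(scan_message(i, limiter) for i in range(n_messages))
    )
    return scanned, time.perf_counter() - start


if __name__ == "__main__":
    scanned, elapsed = asyncio.run(main())
    print(f"scanned {len(scanned)} messages in {elapsed:.2f}s")
```

Note that this only helps while the process is waiting; any CPU-heavy step
(rule matching, for instance) would still block the single event loop,
which is the concern about CPU-bound work raised earlier in the thread.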

-Raul Dias

> --j.

Re: Google Summer of Code 2007 ...

2007-02-17 Thread Justin Mason


Raul Dias writes:
> On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote:
> > On Saturday February 17 2007 01:49, Matthew Wilson wrote:
> > > I was/am primarily concerned with RAM usage for high-concurrency
> > > situations.
> > 
> > Ok. Still, in my experience about 30 (maybe 50) SA processes can
> > fully utilize today's CPU & I/O, and it's probably no big deal
> > to provide about 2 GB of memory to cater for such system.
> > Also, and unfortunately, multithreading in Perl is rather
> > cumbersome and not significantly less expensive than fully
> > individual processes.
> 
> After experiencing with the sa-blacklist.cf some time ago with 45
> process brought my system to its knees with 3.5GB (out of memory).  
> 
> I agree about the thread model.
> 
> But sticking to a async I/O model is a valid point.  If implemented
> correctly it will save a lot of memory and even improve performance a
> little.
> 
> Having separeted process saves the need to have to check for garbage
> after filtering a message, which will cause the code to have to be
> recheck.  
> 
> However, for uniprocessor systems, having multiple process running is
> actually more expansive than a async I/O one.  For multiple process
> system, just keep one process for cpu or less.
> 
> In the past I have played a lot with perl-loop (any loopers around?)
> which was the only way to go.  It is too low level for most people, but
> perhaps POE is the way to go today (which can use perl-loop as its
> base).

I'm dubious about the benefits for SpamAssassin...

An async model works very well for network-bound and I/O-bound servers;
however, SpamAssassin is mainly CPU-bound, since the network and I/O parts
are already mostly run async during the scan operation.

Also, the multiple spamd processes share quite a lot of RAM with each
other -- there's a bug in how Linux reports "shared" memory which makes it
appear much worse than it is. Read the FAQ for more details.

Still, if someone tries it and can demo increased efficiency...
go for it ;)

--j.

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Quinn Comendant

On Fri, 16 Feb 2007 18:01:37 -0800, Quinn Comendant wrote:
> And/or a fix for the qmail+simscan per-user preferences spamc -u 
> issue where if an email is addressed to multiple users or an alias 
> spamc isn't passed the correct user.

Sorry to reply to myself, but I want to retract that last suggestion: it's not 
really spamassassin's job to parse recipient lists and resolve aliases.

Q

-
Strangecode :: Internet Consultancy
http://www.strangecode.com/
+1 530 624 4410

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Raul Dias

On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote:
> On Saturday February 17 2007 01:49, Matthew Wilson wrote:
> > I was/am primarily concerned with RAM usage for high-concurrency
> > situations.
> 
> Ok. Still, in my experience about 30 (maybe 50) SA processes can
> fully utilize today's CPU & I/O, and it's probably no big deal
> to provide about 2 GB of memory to cater for such system.
> Also, and unfortunately, multithreading in Perl is rather
> cumbersome and not significantly less expensive than fully
> individual processes.

Some time ago, experimenting with sa-blacklist.cf and 45 processes
brought my system with 3.5GB to its knees (out of memory).

I agree about the thread model.

But sticking to an async I/O model is a valid point.  If implemented
correctly it will save a lot of memory and even improve performance a
little.

Having separate processes saves having to check for garbage left over
after filtering a message, which would mean the code has to be
rechecked.

However, on uniprocessor systems, having multiple processes running is
actually more expensive than an async I/O model.  On multiprocessor
systems, just keep one process per CPU or fewer.

In the past I have played a lot with perl-loop (any loopers around?)
which was the only way to go.  It is too low level for most people, but
perhaps POE is the way to go today (which can use perl-loop as its
base).

-Raul Dias

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Quinn Comendant

On Fri, 16 Feb 2007 15:35:39 +, Justin Mason wrote:
>> We still have a number of items from last year that we could use again.
>> Anything else that we'd like people to code up?

How about an extensive statistics reporting tool, possibly web-based, that can
show how well a current SpamAssassin installation is performing and where it
needs improvement. It could show trends in different classes of spam and
how each is marked. It could also show whether expensive (as in CPU time) rules
and plugins are actually doing any good.

And/or a fix for the qmail+simscan per-user preferences spamc -u issue, where
spamc isn't passed the correct user if an email is addressed to multiple users
or an alias.

Quinn

-
Strangecode :: Internet Consultancy
http://www.strangecode.com/

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Justin Mason


Mark Martinec writes:
> On Saturday February 17 2007 01:49, Matthew Wilson wrote:
> > I was/am primarily concerned with RAM usage for high-concurrency
> > situations.
> 
> Ok. Still, in my experience about 30 (maybe 50) SA processes can
> fully utilize today's CPU & I/O, and it's probably no big deal
> to provide about 2 GB of memory to cater for such system.
> Also, and unfortunately, multithreading in Perl is rather
> cumbersome and not significantly less expensive than fully
> individual processes.

yep -- that's pretty much what I've found, too.  The earlier, non-ithreads
version is pretty much nonfunctional :(

--j.

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec

On Saturday February 17 2007 01:49, Matthew Wilson wrote:
> I was/am primarily concerned with RAM usage for high-concurrency
> situations.

Ok. Still, in my experience about 30 (maybe 50) SA processes can
fully utilize today's CPU & I/O, and it's probably no big deal
to provide about 2 GB of memory to cater for such a system.
Also, and unfortunately, multithreading in Perl is rather
cumbersome and not significantly less expensive than fully
individual processes.

  Mark

RE: Google Summer of Code 2007 ...

2007-02-16 Thread Matthew Wilson

> -----Original Message-----
> From: Mark Martinec [mailto:[EMAIL PROTECTED]
> Sent: Friday, February 16, 2007 6:09 PM
> To: users@spamassassin.apache.org
> Subject: Re: Google Summer of Code 2007 ...
> 
> Matthew Wilson wrote:
> > - Full, tested, supportable multithreaded support
> > - Full, tested, supportable support for an asynchronous I/O model
> >   (a la qpsmtpd-async)
> 
> I think effort could be better spent elsewhere.
> 
> Spam checking lands itself ideally to running parallel individual
> processes, with little if any interaction between them.
> For an individual user a reduction in processing latency from
> three to one seconds doesn't mean a thing. For an entire mail
> filtering system all that matters is its througput (messages per
> hour). Multithreading brings no performance benefits in this area.
> 
>   Mark

I was/am primarily concerned with RAM usage for high-concurrency situations.

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec

Matthew Wilson wrote:
> - Full, tested, supportable multithreaded support
> - Full, tested, supportable support for an asynchronous I/O model
>   (a la qpsmtpd-async)

I think effort could be better spent elsewhere.

Spam checking lends itself ideally to running parallel individual
processes, with little if any interaction between them.
For an individual user a reduction in processing latency from
three seconds to one doesn't mean a thing. For an entire mail
filtering system all that matters is its throughput (messages per
hour). Multithreading brings no performance benefits in this area.

  Mark

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec

> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?

Here's another one, to seize the opportunity when internal changes
are being contemplated:

Split the process into two parts:

- parsing and munging of mail & rules, resulting in a set of
  findings (e.g. a list of rules being hit, perhaps somehow
  generalized). This section can be done once per message,
  regardless of the number of recipients to the message
  (assuming all users use the same rules);

- based on the above, score the findings, possibly
  applying per-recipient scoring to each rule being hit;
  This (rather inexpensive) step can be applied for each
  recipient individually, without having to re-process
  an entire message in multiple-recipient mail.

...and adjust the API to Mail::SpamAssassin accordingly, so that
MTA-based content filtering (e.g. amavisd-new) could take advantage
of it, while still allowing full per-recipient customization of
individual rules scores (including disabling some by a score of 0).

Benefits depend on the site, but our stats show 1.46 recipients
per message on average. The above change (when calling SA
at MTA level) would bring a 46% increase in throughput for free,
while still providing individualized rule scoring.
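
The two-step split and the 46% figure can be sketched in Python as below.
This is an illustration only: the rule names, scores and function names
are hypothetical, not real SpamAssassin rules or API. The arithmetic
assumes that a full scan is currently run once per recipient and that the
per-recipient scoring step costs next to nothing.

```python
def scan_message(message, rules):
    """Expensive step: run every rule once, return names of rules that hit."""
    return [name for name, test in rules.items() if test(message)]


def score_for_recipient(hits, default_scores, user_scores=None):
    """Cheap step: sum scores for the hits, with per-recipient overrides."""
    user_scores = user_scores or {}
    return sum(user_scores.get(name, default_scores[name]) for name in hits)


# Hypothetical rules and scores, for illustration only.
rules = {
    "HAS_VIAGRA": lambda m: "viagra" in m.lower(),
    "ALL_CAPS_SUBJECT": lambda m: m.splitlines()[0].isupper(),
}
default_scores = {"HAS_VIAGRA": 3.0, "ALL_CAPS_SUBJECT": 1.5}
# bob disables one rule by giving it a score of 0.
prefs = {"alice": {}, "bob": {"ALL_CAPS_SUBJECT": 0.0}}

message = "BUY NOW\nCheap viagra here"
hits = scan_message(message, rules)  # done once per message
scores = {
    user: score_for_recipient(hits, default_scores, overrides)
    for user, overrides in prefs.items()
}

# Throughput estimate: 1.46 scans per message become 1 scan per message.
recipients_per_message = 1.46
throughput_gain = recipients_per_message / 1.0 - 1.0  # about 0.46, i.e. 46%
```

With these made-up values, alice ends up with 4.5 points and bob with 3.0,
from a single pass over the message.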

  Mark

RE: Google Summer of Code 2007 ...

2007-02-16 Thread Matthew Wilson

- Full, tested, supportable multithreaded support

- Full, tested, supportable support for an asynchronous I/O model (a la
qpsmtpd-async)

- Pluggable to the point where all configuration and settings can be pulled
from anywhere (databases, files, in-memory cache) at runtime, so SA could
stay resident and have its configuration be changed in-process.

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Raul Dias

On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again.  So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> > 
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?
> 
> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?
> 
> --j.

Not a direct improvement, but...

- Add more hooks for plugins, to allow a broader pluginization of SA,
  e.g. letting plugins load before rules are parsed.

- Better documentation of the internal structures used, to keep plugin
  authors from breaking things.

- "Class"inization of some structures to make them easier for plugins
  to reuse.

- The pluginization of SA.  From Bayes to header, body, rawbody, score
  rules.  The entire process of doing so would open doors for more
  external plugin usage and control.

While this might make startup slightly slower, in the long run the
benefits can be great.

-Raul Dias

Re: Google Summer of Code 2007 ...

2007-02-16 Thread John Rudd


John D. Hardin wrote:
> On Fri, 16 Feb 2007, Justin Mason wrote:
>
> > Also, a related project would be to complete the pluginization of
> > our "Bayes" engine and APIs, so that other probabilistic
> > classifiers can be plugged in in place of, or in addition to,
> > Bayes in SpamAssassin.
>
> +1

If that's a notation for "me too", then:

++

I'm all for all implications on this subject so far:

1) the new PPM-D compression based classifier technique

2) pluginization of all of the probability based classifiers, so that 
sites can choose between the SA bayes implementation, other bayes 
implementations, PPM-D, or other learning processes (individually or in 
combinations)
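
One way to picture "individually or in combinations" is a small registry
that holds several classifiers and blends their outputs. The Python sketch
below is hypothetical and not a SpamAssassin API: the class name, the
weighted-average combination and the 0.5 fallback are all assumptions
chosen to keep the example short.

```python
class ClassifierRegistry:
    """Holds pluggable classifiers and combines their spam probabilities."""

    def __init__(self):
        self._classifiers = {}  # name -> (callable, weight)

    def register(self, name, classify, weight=1.0):
        # classify(text) must return a spam probability between 0 and 1.
        self._classifiers[name] = (classify, weight)

    def unregister(self, name):
        self._classifiers.pop(name, None)

    def spam_probability(self, text):
        # With no classifiers loaded, stay neutral.
        if not self._classifiers:
            return 0.5
        total_weight = sum(w for _, w in self._classifiers.values())
        weighted = sum(w * fn(text) for fn, w in self._classifiers.values())
        return weighted / total_weight
```

A site could register its existing Bayes classifier with full weight and a
new one (PPM-D, say) with a small weight, watch how it behaves, and then
shift the weights or unregister one of them.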

Re: Google Summer of Code 2007 ...

2007-02-16 Thread John D. Hardin

On Fri, 16 Feb 2007, Justin Mason wrote:

> Also, a related project would be to complete the pluginization of
> our "Bayes" engine and APIs, so that other probabilistic
> classifiers can be plugged in in place of, or in addition to,
> Bayes in SpamAssassin.

+1

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 [EMAIL PROTECTED]    FALaholic #11174     pgpk -a [EMAIL PROTECTED]
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
---
  Windows Genuine Advantage (WGA) means that now you use your 
  computer at the sufferance of Microsoft Corporation. They can
  kill it remotely without your consent at any time for any reason.
---
 6 days until George Washington's 275th Birthday

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Bart Schaefer


On 2/16/07, Justin Mason <[EMAIL PROTECTED]> wrote:

> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?


http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3785

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Duncan Findlay

On Fri, Feb 16, 2007 at 09:31:13AM -0800, Dan wrote:

> On Feb 16, 2007, at 7:35, Justin Mason wrote:
> >>We still have a number of items from last year that we could use again.
> >>Anything else that we'd like people to code up?

> >Also, any suggestions from outside the dev team?  Anyone got good ideas
> >for new SpamAssassin features that would be good to pay someone to work on
> >for 3 months?

> I don't know how to code myself but have a new method for scoring messages 
> that could be included natively in SA.  It would work in 
> place of weight based scoring.  Does this sound like it qualifies?

Perhaps pluginize the scoring mechanisms so we can have plugins that
implement different ham/spam decision rules?

-- 
Duncan Findlay



Re: Google Summer of Code 2007 ...

2007-02-16 Thread Chris St. Pierre


On Fri, 16 Feb 2007, Mark Martinec wrote:

> I believe this was once mentioned on Justin's blog (but can't find
> a ref now), the following sounds promising as an additional classifier
> to existing bayes (especially since the author comes from the same
> organization as myself :)
>
> http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec
>
>  ijsSPAM2  PPM-D compression model
>    Andrej Bratko (Josef Stefan Institute)
>
>  Observations:
>  The most startling observation is that character-based compression models
>  perform outstandingly well for spam filtering. Commonly used open-source
>  filters perform well, but not nearly so well or nearly so poorly as
>  reported elsewhere.


This looks very promising.  I found a description of the ijsSPAM2 tool
on the site:

http://www.virusbtn.com/spambulletin/archive/2006/03/sb200603-compression

Remarkable stuff.  That would be a helluva nice plugin to have.

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

Never send mail to [EMAIL PROTECTED]

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec

Justin Mason writes:
> Also, a related project would be to complete the pluginization of our
> "Bayes" engine and APIs, so that other probabilistic classifiers can be
> plugged in in place of, or in addition to, Bayes in SpamAssassin.

Right. I felt a need for something like this when I was switching
Bayes from MySQL to PostgreSQL 8.2 and my first urge was to run both
in parallel, giving the new one small scores and see how it behaves.

(but I wasn't desperate enough to implement it, just switched
and kept an eye on it for a while. Btw, the 8.2 behaves much
better (faster) than earlier versions, seems like some new
optimizations were geared precisely to suit SA queries)

  Mark

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Dan



On Feb 16, 2007, at 7:35, Justin Mason wrote:
>> We still have a number of items from last year that we could use again.
>> Anything else that we'd like people to code up?
>
> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?


I don't know how to code myself but have a new method for scoring  
messages that could be included natively in SA.  It would work in  
place of weight based scoring.  Does this sound like it qualifies?


Dan

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Justin Mason

Mark Martinec writes:
> > Also, any suggestions from outside the dev team?  Anyone got good ideas
> > for new SpamAssassin features that would be good to pay someone to work on
> > for 3 months?
> 
> I believe this was once mentioned on Justin's blog (but can't find
> a ref now), the following sounds promising as an additional classifier
> to existing bayes (especially since the author comes from the same
> organization as myself :)
> 
> http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec
> 
>   ijsSPAM2  PPM-D compression model
>     Andrej Bratko (Josef Stefan Institute)
> 
>   Observations:
>   The most startling observation is that character-based compression models
>   perform outstandingly well for spam filtering. Commonly used open-source
>   filters perform well, but not nearly so well or nearly so poorly as
>   reported elsewhere.

Yes, definitely!  A related algorithm is OSBF, as implemented here:
http://osbf-lua.luaforge.net/ This had the best performance for
hand-trained probabilistic classifiers in the TREC Spam Track 2006
evaluation -- that's good ;)

Also, a related project would be to complete the pluginization of our
"Bayes" engine and APIs, so that other probabilistic classifiers can be
plugged in in place of, or in addition to, Bayes in SpamAssassin.
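
A rough sketch of the shape such a pluggable classifier API could take,
in Python for illustration only (SpamAssassin is Perl, and every name here
is hypothetical rather than part of its plugin API). Each classifier exposes
one method, and a registry combines the registered classifiers with
per-classifier weights, so a new classifier can run alongside Bayes with a
small weight while it is being evaluated. WordListClassifier is a toy
stand-in, not a real probabilistic model.

```python
from typing import List, Protocol, Set, Tuple


class Classifier(Protocol):
    """Hypothetical interface that every pluggable classifier satisfies."""

    def spam_probability(self, text: str) -> float:
        ...


class WordListClassifier:
    """Toy stand-in for a real classifier: fraction of words that are listed."""

    def __init__(self, spam_words: Set[str]) -> None:
        self.spam_words = spam_words

    def spam_probability(self, text: str) -> float:
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in self.spam_words for w in words) / len(words)


class ClassifierRegistry:
    """Holds (classifier, weight) pairs and combines their outputs."""

    def __init__(self) -> None:
        self._entries: List[Tuple[Classifier, float]] = []

    def register(self, classifier: Classifier, weight: float) -> None:
        self._entries.append((classifier, weight))

    def score(self, text: str) -> float:
        # Each classifier contributes its probability scaled by its weight.
        return sum(w * c.spam_probability(text) for c, w in self._entries)
```

Registering an established classifier with weight 4.0 and a new one with
weight 0.5 lets both see every message while the newcomer has little
influence on the final score.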

--j.

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Mark Martinec

> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?

I believe this was once mentioned on Justin's blog (but can't find
a ref now), the following sounds promising as an additional classifier
to existing bayes (especially since the author comes from the same
organization as myself :)

http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec

  ijsSPAM2  PPM-D compression model
    Andrej Bratko (Josef Stefan Institute)

  Observations:
  The most startling observation is that character-based compression models
  perform outstandingly well for spam filtering. Commonly used open-source
  filters perform well, but not nearly so well or nearly so poorly as
  reported elsewhere.
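
To illustrate the general idea behind compression-based filtering, here is
a minimal Python sketch. It is not the ijsSPAM2 implementation and does not
use PPM-D: it substitutes zlib (DEFLATE) from the Python standard library as
the compressor, and the class and method names are made up for this example.
A message is assigned to whichever class's training text it compresses best
against, measured as the number of extra compressed bytes the message adds.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))


class CompressionClassifier:
    """Toy classifier: one growing training corpus per class."""

    def __init__(self) -> None:
        self.corpus = {"spam": b"", "ham": b""}

    def learn(self, label: str, text: str) -> None:
        # Training just appends the message to that class's corpus.
        self.corpus[label] += text.encode("utf-8") + b"\n"

    def extra_bytes(self, label: str, text: str) -> int:
        # How much larger the compressed corpus gets when the message is added.
        base = self.corpus[label]
        return compressed_size(base + text.encode("utf-8")) - compressed_size(base)

    def classify(self, text: str) -> str:
        # The class whose corpus "explains" the message most cheaply wins.
        return min(self.corpus, key=lambda label: self.extra_bytes(label, text))
```

Because zlib only looks back over a 32 KB window, this sketch is only
meaningful for small corpora; an adaptive model such as PPM-D does not have
that limitation.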

Mark

Re: Google Summer of Code 2007 ...

2007-02-16 Thread DAve


Justin Mason wrote:
> Theo Van Dinter writes:
>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>> that the ASF will be involved again.  So it's a good time to start thinking
>> about things we'd like to put up as possible projects.
>>
>> We still have a number of items from last year that we could use again.
>> Anything else that we'd like people to code up?
>
> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?
>
> --j.


Maybe, several of us use MailScanner. MailScanner does not use spamc, it 
loads SA directly. One of the features of MailScanner is called MCP or 
Message Content Protection. MCP uses, or attempts to use, SA to do 
specific targeted message content checking. Many people, we included, 
would like to be able to use this but it seems there is always some 
gotcha to having SA loaded in MailScanner twice. Problems with the 
directory paths, rules in memory, etc.


The ability to run SA with two totally different configurations in the 
same application would be very handy. Different rules for outbound mail vs 
inbound mail, and MCP (as in, this user wants zero mail with the word "breast" 
regardless of the rest of the message content) are just two examples.


Contacting Julian on the MailScanner list would give far better examples 
and details than I could.


Just a thought,

DAve


--
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?

Maybe they forgot who made that choice possible.

Re: Google Summer of Code 2007 ...

2007-02-16 Thread Doc Schneider


Justin Mason wrote:
> Theo Van Dinter writes:
>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>> that the ASF will be involved again.  So it's a good time to start thinking
>> about things we'd like to put up as possible projects.
>>
>> We still have a number of items from last year that we could use again.
>> Anything else that we'd like people to code up?
>
> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?
>
> --j.


Yeah, an updated web interface for adding blacklist and whitelist entries 
and per-user options for MySQL/PostgreSQL, as a part of the core 
utilities.


--

 -Doc

 SA/SARE -- Ninja
   9:52am  up 21:19, 17 users,  load average: 0.11, 0.36, 0.52

 SARE HQ  http://www.rulesemporium.com/

Re: Google Summer of Code 2007 ...

2007-02-16 Thread C. Bensend


> Also, any suggestions from outside the dev team?  Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?

Perhaps this is trivial, or not desired by anyone else but myself,
but I'd _love_ to be able to strip SpamAssassin tags via spamc and
spamd, instead of having to fire up the full-blown spamassassin
for each message.  :)

Benny


-- 
"Very funny, Scotty. Now beam down my clothes."  -- James. T. Kirk

Google Summer of Code 2007 ...

2007-02-16 Thread Justin Mason

Theo Van Dinter writes:
> I'm assuming that there will be a Google Summer of Code 2007 going on, and
> that the ASF will be involved again.  So it's a good time to start thinking
> about things we'd like to put up as possible projects.
> 
> We still have a number of items from last year that we could use again.
> Anything else that we'd like people to code up?

Also, any suggestions from outside the dev team?  Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.
