Google Summer of Code 2007 - Students Wanted
Howdy,

The time of year for Google Summer of Code has arrived, and once again the Apache Software Foundation is taking part. We are currently looking for students who wish to work on SpamAssassin-related projects over the summer.

You have until *March 24th* to sign up and submit an application. Work on the project will take place from May 28th through August 20th.

You can find a list of possible projects here (just search for spamassassin):

http://wiki.apache.org/general/SummerOfCode2007

That is by no means an exhaustive list, so if you have other ideas, or know of something from here:

http://wiki.apache.org/spamassassin/WeLoveVolunteers

that you would like to work on, feel free to add it to the list and submit an application.

Last year we were able to take on several projects; it's a nice way to earn 4500 USD over the summer.

Thanks
Michael Parker
Re: [2] Google Summer of Code 2007 ...
Chris St. Pierre wrote:
>
> Mark Martinec wrote:
>>
>> ... the following sounds promising as an additional classifier
>> to existing bayes (especially since the author comes from the same
>> organization as myself :)
>>
>> http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec
>>
>> ijsSPAM2   PPM-D compression model
>> Andrej Bratko (Josef Stefan Institute)
>>
>> Observations:
>> The most startling observation is that character-based compression models
>> perform outstandingly well for spam filtering. Commonly used open-source
>> filters perform well, but not nearly so well or nearly so poorly as
>> reported elsewhere.
>>
>
> This looks very promising. I found a description of the ijsSPAM2 tool
> on the site:
>
> http://www.virusbtn.com/spambulletin/archive/2006/03/sb200603-compression
>
> Remarkable stuff. That would be a helluva nice plugin to have.
>

I've recently released a C++ library that includes an implementation of the PPM-D algorithm, geared towards classification (or mail filtering). This is essentially the same algorithm that appeared at TREC 2005 as `ijsSPAM2'. It's available at:

http://ai.ijs.si/andrej/psmslib.html

There's also a Python wrapper:

http://ai.ijs.si/andrej/psmpylib.html

The C++ library and Python extension module are free for personal and for research use, but unfortunately, I cannot disclose the source code at this time, or release the libraries under an Apache-compatible license. Anyway, you might want to try it out before coding your own implementation.
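For readers unfamiliar with compression-based classification, here is a minimal sketch of the general idea in Python. It is not PPM-D and not the ijsSPAM2 code: it swaps in zlib (DEFLATE) from the standard library as a stand-in compressor, and the function names and tiny corpora are made up for illustration. The idea it shows is that a message is assigned to whichever class's training text lets the compressor encode it in fewer extra bytes.

```python
import zlib


def compressed_size(text: str) -> int:
    """Number of bytes zlib needs to encode the text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """Extra bytes needed to encode the message once the corpus has been seen."""
    return compressed_size(corpus + "\n" + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the class whose corpus compresses the message better."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy training text, invented for this example only.
SPAM_CORPUS = "\n".join([
    "Buy cheap pills online now, no prescription needed.",
    "You have won a free lottery prize, click here to claim it.",
    "Lowest mortgage rates guaranteed, apply today.",
])
HAM_CORPUS = "\n".join([
    "The patch for bug 4636 is attached, please review it.",
    "Meeting moved to Thursday afternoon, same conference room.",
    "I committed the charset fix to trunk last night.",
])

print(classify("You have won a free lottery prize, click here to claim it.",
               SPAM_CORPUS, HAM_CORPUS))
```

A real character-level model such as PPM-D adapts its statistics per class and is trained on far more data; zlib is used here only because it ships with Python.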
RE: Google Summer of Code 2007 ...
> Yes, if you Google for "Google Summer of Code"+spamassassin
> you'll get a bunch of relevant hits. ;)
>
> For example, check out:
> http://wiki.apache.org/spamassassin/SummerOfCode2006
>

Thank you

I was hoping for meaningful and relevant info from someone of authority and in the know from the SA group.

I know how to search and I know how to discern and even guess.

Yet, as of late, my experiences with Google and searching are poor.

Sure I can find stuff... yet finding anything helpful or relevant in the sea of garbage that gets spewed back is another story.

 - rh

--
Robert - Abba Communications
   Computer & Internet Services
 (509) 624-7159 - www.abbacomm.net
Re: Google Summer of Code 2007 ...
Raul Dias writes:
> On Wed, 2007-02-21 at 17:27 +0100, Justin Mason wrote:
> > Raul Dias writes:
> > > On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > > > actually I think this is already implemented in 3.2.0 -- see
> > > > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
> > >
> > > Nice. This patch solves the message part problem.
> > >
> > > With this, rules can be written in Unicode too.
> > > A final change would be to let rules be written in other charsets.
> > >
> > > Rule files are read separately. An easy implementation would be to add a
> > > file_charset option. This option would declare the charset used by the
> > > rule file, like iso-8859-15, and the file would be converted internally
> > > to unicode too if and only if (IMO) the normalize_charset option is set to 1.
> >
> > I think I prefer the current model, where rules are UTF-8, I'm
> > afraid ;)
>
> Just to get this straight.
>
> All rules are considered UTF-8 (no difference for ascii ones)?
> Is this on 3.2 only or 3.1 too?
>
> I have assumed that it would be iso-8859-1 so far.

With the "normalize_charset" code active, it's UTF-8.  (iirc)

--j.
RE: Google Summer of Code 2007 ...
On Wed, 21 Feb 2007, R Lists06 wrote:

> May I ask...
>
> Why is this thread named as such?
>
> Does Google help fund SA efforts in one or multiple ways?
>
> If so, may I ask how or directions to already posted docs on it?
>
>  - rh
>
> --
> Robert - Abba Communications

Yes, if you Google for "Google Summer of Code"+spamassassin
you'll get a bunch of relevant hits. ;)

For example, check out:
http://wiki.apache.org/spamassassin/SummerOfCode2006

--
Dave Funk                                  University of Iowa
                                           College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include
Better is not better, 'standard' is better. B{
Re: Google Summer of Code 2007 ...
On Wed, 2007-02-21 at 17:27 +0100, Justin Mason wrote:
> Raul Dias writes:
> > On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > > actually I think this is already implemented in 3.2.0 -- see
> > > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
> >
> > Nice. This patch solves the message part problem.
> >
> > With this, rules can be written in Unicode too.
> > A final change would be to let rules be written in other charsets.
> >
> > Rule files are read separately. An easy implementation would be to add a
> > file_charset option. This option would declare the charset used by the
> > rule file, like iso-8859-15, and the file would be converted internally
> > to unicode too if and only if (IMO) the normalize_charset option is set to 1.
>
> I think I prefer the current model, where rules are UTF-8, I'm
> afraid ;)

Just to get this straight.

All rules are considered UTF-8 (no difference for ascii ones)?
Is this on 3.2 only or 3.1 too?

I have assumed that it would be iso-8859-1 so far.

-Raul Dias
Re: Google Summer of Code 2007 ...
R Lists06 wrote:
> May I ask...
>
> Why is this thread named as such?
>
> Does Google help fund SA efforts in one or multiple ways?
>
> If so, may I ask how or directions to already posted docs on it?

If you, uh, Google for "Google Summer of Code" I'm sure you'll find all you want to know.

Daryl
RE: Google Summer of Code 2007 ...
May I ask...

Why is this thread named as such?

Does Google help fund SA efforts in one or multiple ways?

If so, may I ask how or directions to already posted docs on it?

 - rh

--
Robert - Abba Communications
   Computer & Internet Services
 (509) 624-7159 - www.abbacomm.net
Re: Google Summer of Code 2007 ...
Raul Dias writes:
> On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> > actually I think this is already implemented in 3.2.0 -- see
> > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.
>
> Nice. This patch solves the message part problem.
>
> With this, rules can be written in Unicode too.
> A final change would be to let rules be written in other charsets.
>
> Rule files are read separately. An easy implementation would be to add a
> file_charset option. This option would declare the charset used by the
> rule file, like iso-8859-15, and the file would be converted internally
> to unicode too if and only if (IMO) the normalize_charset option is set to 1.

I think I prefer the current model, where rules are UTF-8, I'm
afraid ;)

--j.
Re: Google Summer of Code 2007 ...
Justin Mason wrote:
> DAve writes:
>> Justin Mason wrote:
>>> Theo Van Dinter writes:
>>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>>> that the ASF will be involved again. So it's a good time to start thinking
>>>> about things we'd like to put up as possible projects.
>>>>
>>>> We still have a number of items from last year that we could use again.
>>>> Anything else that we'd like people to code up?
>>>
>>> Also, any suggestions from outside the dev team? Anyone got good ideas
>>> for new SpamAssassin features that would be good to pay someone to work on
>>> for 3 months?
>>>
>>> --j.
>>
>> Maybe, several of us use MailScanner. MailScanner does not use spamc, it
>> loads SA directly. One of the features of MailScanner is called MCP or
>> Message Content Protection. MCP uses, or attempts to use, SA to do
>> specific targeted message content checking. Many people, we included,
>> would like to be able to use this but it seems there is always some
>> gotcha to having SA loaded in MailScanner twice. Problems with the
>> directory paths, rules in memory, etc.
>>
>> The ability to run SA with two totally different configurations in the
>> same application would be very handy. Different rules for outbound mail vs
>> inbound mail, MCP (as in this user wants zero mail with the word "breast"
>> regardless of the rest of the message content) are just two examples.
>>
>> Contacting Julian on the MailScanner list would give far better examples
>> and details than I could.
>
> cc'd Julian.
>
> This could definitely be done -- I didn't realise there was demand for
> it ;)
>
> This should probably be opened as a bug on the bugzilla, btw.
> is it already there?
>
> --j.

No, I didn't really think it a bug, nor a feature request. I kinda viewed it as a future path. Sounds like I am late to the game this morning anyway and Jules will enter it.

DAve

--
Three years now I've asked Google why they don't have a
logo change for Memorial Day. Why do they choose to do logos
for other non-international holidays, but nothing for
Veterans?

Maybe they forgot who made that choice possible.
Re: Google Summer of Code 2007 ...
On Wed, 2007-02-21 at 15:29 +0100, Justin Mason wrote:
> actually I think this is already implemented in 3.2.0 -- see
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.

Nice. This patch solves the message part problem.

With this, rules can be written in Unicode too.
A final change would be to let rules be written in other charsets.

Rule files are read separately. An easy implementation would be to add a file_charset option. This option would declare the charset used by the rule file, like iso-8859-15, and the file would be converted internally to unicode too if and only if (IMO) the normalize_charset option is set to 1.

-Raul Dias
Re: Google Summer of Code 2007 ...
Justin Mason wrote:
> DAve writes:
>> Justin Mason wrote:
>>> Theo Van Dinter writes:
>>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>>> that the ASF will be involved again. So it's a good time to start thinking
>>>> about things we'd like to put up as possible projects.
>>>>
>>>> We still have a number of items from last year that we could use again.
>>>> Anything else that we'd like people to code up?
>>>
>>> Also, any suggestions from outside the dev team? Anyone got good ideas
>>> for new SpamAssassin features that would be good to pay someone to work on
>>> for 3 months?
>>>
>>> --j.
>>
>> Maybe, several of us use MailScanner. MailScanner does not use spamc, it
>> loads SA directly. One of the features of MailScanner is called MCP or
>> Message Content Protection. MCP uses, or attempts to use, SA to do
>> specific targeted message content checking. Many people, we included,
>> would like to be able to use this but it seems there is always some
>> gotcha to having SA loaded in MailScanner twice. Problems with the
>> directory paths, rules in memory, etc.
>>
>> The ability to run SA with two totally different configurations in the
>> same application would be very handy. Different rules for outbound mail vs
>> inbound mail, MCP (as in this user wants zero mail with the word "breast"
>> regardless of the rest of the message content) are just two examples.
>>
>> Contacting Julian on the MailScanner list would give far better examples
>> and details than I could.
>
> cc'd Julian.
>
> This could definitely be done -- I didn't realise there was demand for
> it ;)

The basic idea is this:
On normal incoming mail, the usual SpamAssassin setup is used. But on outbound mail, for example, the company doesn't want any of the normal SpamAssassin rules. Instead, they want to look for particular "rude" keywords, company project names, and any other phrases the user wants, but *not* the standard SpamAssassin rules.

I don't use spamc/spamd or the spamassassin script, I call the function library directly.

So I need to be able to call SpamAssassin with 2 completely different sets of configuration settings.

What would also be nice is a way for different messages to use different SpamAssassin configuration settings; this would add a lot of flexibility to what I can do with MailScanner's use of SpamAssassin.

I effectively need to be able to create 2 or more SpamAssassin configurations and have them work entirely independently.

I hope that explains it well enough.

> This should probably be opened as a bug on the bugzilla, btw.

How much detail do you want?

> is it already there?

I've never put it there.

> --j.

Jules

--
Julian Field MEng CITP
www.MailScanner.info
Buy the MailScanner book at www.MailScanner.info/store

Need help customising MailScanner?
Contact me!
Need help fixing or optimising your systems?
Contact me!
Need help getting you started solving new requirements from your boss?
Contact me!

PGP footprint: EE81 D763 3DB0 0BFD E1DC 7222 11F6 5947 1415 B654
Re: Google Summer of Code 2007 ...
actually I think this is already implemented in 3.2.0 -- see
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4636 for details.

--j.

Raul Dias writes:
> On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
> > Theo Van Dinter writes:
> > > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > > that the ASF will be involved again. So it's a good time to start thinking
> > > about things we'd like to put up as possible projects.
> > >
> > > We still have a number of items from last year that we could use again.
> > > Anything else that we'd like people to code up?
>
> Another thing that might be worth adding to GSC2007.
>
> Internal Encoding/Charset used by SA.
>
> I haven't found anything like that, but that doesn't mean SA does not do
> this already. In this case sorry :)
>
> Mail messages can have multiple encodings like ISO-8859-*, utf-8,
> utf-16, windows-*, and so on.
>
> Also, perl (unless "use utf8" is set) will default to the system encoding,
> like LC_CTYPE.
>
> Rule writers need a way to tell SA which encoding their rules are in.
>
> This is not a real issue for English rules, but it is for other languages,
> like Portuguese, French, Russian, Chinese, Japanese and so on.
>
> The real problem is that a string in one encoding with special
> characters is not the same in another encoding.
>
> So, what is needed is:
> 1 - a way to tell SA the encoding/charset used in some rules
> 2 - SA converts the rules to a universal encoding internally
>     (e.g. utf-8/16).
> 3 - Temporarily reconvert to the message encoding/charset to match properly.
>
> I really don't know if SA does something like this internally, but I
> think it does not.
> Doing this will require a considerable amount of work (so, GSC2007).
>
> Without this kind of support, I expect it will become easier in the future
> for spammers to play with charsets to avoid specific rules.
>
> -Raul Dias
Re: Google Summer of Code 2007 ...
On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again. So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> >
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?

Another thing that might be worth adding to GSC2007.

Internal Encoding/Charset used by SA.

I haven't found anything like that, but that doesn't mean SA does not do this already. In this case sorry :)

Mail messages can have multiple encodings like ISO-8859-*, utf-8, utf-16, windows-*, and so on.

Also, perl (unless "use utf8" is set) will default to the system encoding, like LC_CTYPE.

Rule writers need a way to tell SA which encoding their rules are in.

This is not a real issue for English rules, but it is for other languages, like Portuguese, French, Russian, Chinese, Japanese and so on.

The real problem is that a string in one encoding with special characters is not the same in another encoding.

So, what is needed is:
1 - a way to tell SA the encoding/charset used in some rules
2 - SA converts the rules to a universal encoding internally (e.g. utf-8/16).
3 - Temporarily reconvert to the message encoding/charset to match properly.

I really don't know if SA does something like this internally, but I think it does not.
Doing this will require a considerable amount of work (so, GSC2007).

Without this kind of support, I expect it will become easier in the future for spammers to play with charsets to avoid specific rules.

-Raul Dias
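To make the encoding problem concrete, here is a small Python sketch. It is only an illustration, not SpamAssassin code (SpamAssassin is written in Perl); `load_rule`, `body_matches` and the `file_charset` parameter are hypothetical names. It shows that the same word has different bytes in ISO-8859-1 and UTF-8, and that decoding both the rule and the message to Unicode makes them comparable. As a simplification it compares in Unicode rather than re-encoding the rule into the message charset as step 3 suggests.

```python
import re


def load_rule(raw_pattern: bytes, file_charset: str):
    """Decode a rule pattern from its declared charset into a Unicode regex."""
    return re.compile(raw_pattern.decode(file_charset), re.IGNORECASE)


def body_matches(rule, raw_body: bytes, message_charset: str) -> bool:
    """Decode the message body to Unicode, then apply the rule."""
    text = raw_body.decode(message_charset, errors="replace")
    return rule.search(text) is not None


# The Portuguese word "promocao" with c-cedilla and a-tilde, written with
# escapes so this file stays ASCII-only.
word = "promo\u00e7\u00e3o"
latin1_bytes = word.encode("iso-8859-1")
utf8_bytes = word.encode("utf-8")

# Same word, different byte sequences: a byte-level rule written in one
# charset cannot match a body sent in the other.
print(latin1_bytes == utf8_bytes)

# Rule file declared as ISO-8859-1, message body sent as UTF-8.
rule = load_rule(latin1_bytes, "iso-8859-1")
body = b"Grande " + utf8_bytes + b" hoje"
print(body_matches(rule, body, "utf-8"))
```

The first `print` shows the byte sequences are unequal; the second shows the rule matching once both sides are decoded.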
Re: Google Summer of Code 2007 ...
Julian Field writes:
> The basic idea is this:
> On normal incoming mail, the usual SpamAssassin setup is used. But on
> outbound mail, for example, the company doesn't want any of the normal
> SpamAssassin rules. Instead, they want to look for particular "rude"
> keywords, company project names, and any other phrases the user wants,
> but *not* the standard SpamAssassin rules.
>
> I don't use spamc/spamd or the spamassassin script, I call the function
> library directly.
>
> So I need to be able to call SpamAssassin with 2 completely different
> sets of configuration settings.
>
> What would also be nice is a way for different messages to use different
> SpamAssassin configuration settings; this would add a lot of flexibility
> to what I can do with MailScanner's use of SpamAssassin.
>
> I effectively need to be able to create 2 or more SpamAssassin
> configurations and have them work entirely independently.
>
> I hope that explains it well enough.

It does, thanks.  I presume having two independent Mail::SpamAssassin
objects (assuming they were really independent, which they are not
currently) would be ok?

> > This should probably be opened as a bug on the bugzilla, btw.
>
> How much detail do you want?

A cut and paste of the above would be fine.  The idea is that (a) it's
on the bz, where it's more easily tracked and (b) if you open it, you
will be cc'd on updates and comments.

--j.
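As an illustration of what "two independent objects" would mean in practice, here is a minimal Python sketch. It does not use or describe the Mail::SpamAssassin API; the `Scanner` class, its rule names and its thresholds are all hypothetical. The point is only that every rule and setting lives on the instance, with no module-level state shared between the inbound and outbound scanners.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Scanner:
    """A self-contained scanner: all rules and settings live on the instance."""
    name: str
    threshold: float
    rules: dict = field(default_factory=dict)  # rule name -> (regex, score)

    def add_rule(self, rule_name: str, pattern: str, score: float) -> None:
        self.rules[rule_name] = (re.compile(pattern, re.IGNORECASE), score)

    def check(self, text: str):
        """Return (flagged, list of rule names that hit)."""
        hits = [n for n, (rx, _) in self.rules.items() if rx.search(text)]
        score = sum(self.rules[n][1] for n in hits)
        return score >= self.threshold, hits


# Inbound mail: ordinary spam rules (toy example).
inbound = Scanner("inbound", threshold=5.0)
inbound.add_rule("CHEAP_PILLS", r"cheap pills", 5.0)

# Outbound mail: content-protection keywords only, none of the spam rules.
outbound = Scanner("outbound-mcp", threshold=1.0)
outbound.add_rule("PROJECT_NAME", r"project bluebird", 1.0)

message = "Status update on Project Bluebird attached."
print(inbound.check(message))
print(outbound.check(message))
```

The same message passes the inbound scanner and is flagged by the outbound one, because neither instance can see the other's rules.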
Re: Google Summer of Code 2007 ...
>> Perhaps this is trivial, or not desired by anyone else but myself, >> but I'd _love_ to be able to strip SpamAssassin tags via spamc and >> spamd, instead of having to fire up the full-blown spamassassin >> for each message. :) > > formail ? That would work in most cases, yes. Unfortunately, not in mine. Thanks for the pointer, though. :) Benny -- "During the armageddon, only two things will survive - cockroaches and Cher."-- "What's her bra size" online game
Re: Google Summer of Code 2007 ...
Mark Martinec writes:
>On Saturday February 17 2007 03:01, Quinn Comendant wrote:
>> How about an extensive statistics reporting tool, ..., that
>> can show how well a current spamassassin installation is performing
>> and where it needs improvements.
>
>Well, not exactly by your words, but in the same spirit,
>this time belonging to SA itself:
>
>Instrument SA with a couple of performance measuring probes,
>providing some easier way to spot where bottlenecks lie.
>Just something simple enough to tell, look, currently waiting
>for Razor server response (or some RBL) is taking 80% of
>elapsed time. Or, Bayes db is very sluggish, it is taking
>5 seconds to provide a result.
>
>A timing breakdown by subtasks is not that much work to provide,
>but provides great insight into troubleshooting and performance
>improvements.
>
>Here is an example of a timing breakdown as currently provided
>in the log (at log level 2) by amavisd-new, without getting into
>specific details, except to say the numbers are elapsed time
>for each subtask in milliseconds (and in percents, just for the
>section, and then a cumulative percent of all sections so far):
>
>TIMING [total 1840 ms] - SMTP pre-DATA-flush: 4 (0%)0, SMTP DATA: 95 (5%)5,
>check_init: 1 (0%)5, sql-enter: 69 (4%)9, mime_decode: 16 (1%)10,
>get-file-type2: 26 (1%)11, parts_decode: 1 (0%)12, check_header: 3 (0%)12,
>AV-scan-1: 14 (1%)12, AV-scan-2: 20 (1%)14, spam-wb-list: 5 (0%)14,
>SA call: 1517 (82%)96, update_cache: 3 (0%)97, decide_mail_destiny: 6 (0%)97,
>^
>write-header: 15 (1%)98, save-to-local-mailbox: 1 (0%)98,
>prepare-dsn: 3 (0%)98, main_log_entry: 12 (1%)99, sql-update: 20 (1%)100,
>update_snmp: 2 (0%)100, SMTP pre-response: 1 (0%)100, SMTP response: 1 (0%)
>100, unlink-2-files: 1 (0%)100, rundown: 0 (0%)100
>
>It tells at a glance that message checking and I/O for this particular
>message took 1840 ms in total, that receiving a message over SMTP
>for example took 5% of this, virus scanners were very quick (14 and 20 ms),
>and SA call took 1517 ms, which is (82%) of all elapsed time,
>all sections up to SA (cumulative) took 96% of total elapsed time.
>
>Now, something of this relatively simple timing breakdown, but
>drilled down into a SA call, telling the administrator where it is
>worth spending his effort, or why all of a sudden SA takes 10 seconds
>instead of the usual 2.

again, another good idea, you're on a roll!  I can see that being very
handy -- and a great concise format for that info.

Could you open *another* feature req for this one? ;)

--j.
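A per-section timer of the kind described above takes very little code. The following Python sketch is an assumption-laden illustration, not amavisd-new's or SpamAssassin's implementation: the `Timing` class and the section names are invented, and only the output layout (milliseconds, section percent, cumulative percent) is modelled on the log line quoted in the message. The clock is injectable so the example output is reproducible.

```python
import time


class Timing:
    """Collects elapsed time per named section, in milliseconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last = clock()
        self.sections = []  # list of (name, elapsed_ms)

    def section(self, name: str) -> None:
        """Close the section that has just finished and start the next one."""
        now = self._clock()
        self.sections.append((name, (now - self._last) * 1000.0))
        self._last = now

    def report(self) -> str:
        """Format as 'name: ms (section%)cumulative%' entries."""
        total = sum(ms for _, ms in self.sections)
        parts, running = [], 0.0
        for name, ms in self.sections:
            running += ms
            pct = 100.0 * ms / total if total else 0.0
            cum = 100.0 * running / total if total else 0.0
            parts.append(f"{name}: {ms:.0f} ({pct:.0f}%){cum:.0f}")
        return f"TIMING [total {total:.0f} ms] - " + ", ".join(parts)


# Fake clock (in seconds) so the numbers below are deterministic.
ticks = iter([0.0, 0.25, 0.5, 1.0])
timing = Timing(clock=lambda: next(ticks))
timing.section("parse")   # 250 ms
timing.section("bayes")   # 250 ms
timing.section("razor")   # 500 ms
print(timing.report())
# prints: TIMING [total 1000 ms] - parse: 250 (25%)25, bayes: 250 (25%)50, razor: 500 (50%)100
```

In real use the default `time.monotonic` clock would be kept and `section()` called after each subtask, for example after a network lookup returns.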
Re: Google Summer of Code 2007 ...
Mark Martinec writes:
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>
>Here's another one, to seize the opportunity when internal changes
>are being contemplated:
>
>Split the process into two parts:
>
>- parsing and munging of mail & rules, resulting in a set of
>  findings (e.g. a list of rules being hit, perhaps somehow
>  generalized). This section can be done once per message,
>  regardless of the number of recipients to the message
>  (assuming all users use the same rules);
>
>- based on the above, score the findings, possibly
>  applying per-recipient scoring to each rule being hit;
>  This (rather inexpensive) step can be applied for each
>  recipient individually, without having to re-process
>  an entire message in multiple-recipient mail.
>
>...and adjust the API to Mail::SpamAssassin accordingly, so that
>MTA-based content filtering (e.g. amavisd-new) could take advantage
>of it, while still allowing full per-recipient customization of
>individual rules scores (including disabling some by a score of 0).
>
>Benefits depend on a site, but our stats show 1.46 recipients
>per message on the average. The above change (when calling SA
>at MTA level) would bring a 46 % increase in throughput for free,
>while still providing individualized rules scoring.

Mark, could you open a feature-request bug for this?  I think it'd
be worthwhile, and would definitely be a good SOC project (at least).

--j.
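The two-phase split proposed above can be sketched in a few lines of Python. This is a toy model, not the Mail::SpamAssassin API: the rule names, scores, and the functions `find_hits` and `score_for` are hypothetical. The expensive rule matching runs once per message, and the cheap scoring runs once per recipient with that recipient's score overrides.

```python
import re

# Toy rule set shared by all recipients (hypothetical names and scores).
RULES = {
    "FREE_OFFER": re.compile(r"\bfree\b", re.IGNORECASE),
    "CLICK_HERE": re.compile(r"click here", re.IGNORECASE),
}
DEFAULT_SCORES = {"FREE_OFFER": 1.5, "CLICK_HERE": 2.0}


def find_hits(message: str) -> list:
    """Expensive step: run every rule, once per message."""
    return [name for name, rx in RULES.items() if rx.search(message)]


def score_for(hits, overrides=None) -> float:
    """Cheap step: sum scores for one recipient; a score of 0 disables a rule."""
    scores = {**DEFAULT_SCORES, **(overrides or {})}
    return sum(scores[name] for name in hits)


message = "Click here for your FREE gift"
hits = find_hits(message)  # done once, whatever the number of recipients

per_recipient = {
    "alice": score_for(hits),                       # default scores
    "bob": score_for(hits, {"FREE_OFFER": 0.0}),    # bob disabled one rule
}
print(hits, per_recipient)
```

On the 46 % figure: with an average of 1.46 recipients per message, scanning once per message instead of once per recipient divides the number of expensive scans by 1.46, which matches the quoted throughput gain if the per-recipient scoring step is treated as negligible.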
Re: Google Summer of Code 2007 ...
"Matthew Wilson" writes:
>- Full, tested, supportable multithreaded support

In my experience, perl threading is just not available in a reliable,
fast implementation -- this is not viable I'm afraid :(

>- Full, tested, supportable support for an asynchronous I/O model (a la
>qpsmtpd-async)

A pretty big task, unfortunately :(   It'd be nice, but could take a lot
of work.

>- Pluggable to the point where all configuration and settings can be pulled
>from anywhere (databases, files, in-memory cache) at runtime, so SA could
>stay resident and have its configuration be changed in-process.

Is this not already possible with "config_text"?

--j.
Re: Google Summer of Code 2007 ...
Raul Dias writes:
>On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote:
>> Theo Van Dinter writes:
>> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
>> > that the ASF will be involved again. So it's a good time to start thinking
>> > about things we'd like to put up as possible projects.
>> >
>> > We still have a number of items from last year that we could use again.
>> > Anything else that we'd like people to code up?
>>
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>>
>> --j.
>
>Not a direct improvement, but...
>
>- Add more hooks for plugins to allow a broader pluginization of SA.
>  e.g. letting plugins load before parsing rules.

Is this not already possible?  If not, it should be, agreed.

For each case where you have a problem along these lines, feel free
to open a bug at the bugzilla and we can get it fixed.  Adding a plugin
hook point is generally an easy task since it has very low overhead
in most cases.

>- Better documentation of internal structures used, so that plugin authors
>  avoid breaking stuff.

Again, shout if you run into problems -- we need to be prodded into
action ;)

>- "Class"inization of some structures to facilitate plugin reuse.

Might be appropriate in some cases -- note however that perl (in common
with other dynamic langs like Ruby, python etc.) tends not to use
strictly-defined classes for many cases where C++ or Java would use a
class, instead using a simpler, loosely-defined name->value hash or
similar.  It's a slightly different approach to code design.  So it may
not always be appropriate. ;)

Of course, these loosely-defined hashes often need to be documented...

>- The pluginization of SA.  From Bayes to header, body, rawbody, score
>  rules.  The entire process of doing so would open doors for more
>  external plugin usage and control.
>
>While this might bring a slightly slower startup, in the long run the
>benefits can be great.

Yeah, agreed.  This one can be tricky though as there are a lot more
details involved here ;)

I think there's ongoing work to pluginize the whole concept of rule types
-- eg. the entire body rule subsystem becomes just a plugin.  It's not
ready yet though, and I'm not sure how fast it's progressing...

--j.
Re: Google Summer of Code 2007 ...
DAve writes:
>Justin Mason wrote:
>> Theo Van Dinter writes:
>>> I'm assuming that there will be a Google Summer of Code 2007 going on, and
>>> that the ASF will be involved again. So it's a good time to start thinking
>>> about things we'd like to put up as possible projects.
>>>
>>> We still have a number of items from last year that we could use again.
>>> Anything else that we'd like people to code up?
>>
>> Also, any suggestions from outside the dev team?  Anyone got good ideas
>> for new SpamAssassin features that would be good to pay someone to work on
>> for 3 months?
>>
>> --j.
>
>Maybe, several of us use MailScanner. MailScanner does not use spamc, it
>loads SA directly. One of the features of MailScanner is called MCP or
>Message Content Protection. MCP uses, or attempts to use, SA to do
>specific targeted message content checking. Many people, we included,
>would like to be able to use this but it seems there is always some
>gotcha to having SA loaded in MailScanner twice. Problems with the
>directory paths, rules in memory, etc.
>
>The ability to run SA with two totally different configurations in the
>same application would be very handy. Different rules for outbound mail vs
>inbound mail, MCP (as in this user wants zero mail with the word "breast"
>regardless of the rest of the message content) are just two examples.
>
>Contacting Julian on the MailScanner list would give far better examples
>and details than I could.

cc'd Julian.

This could definitely be done -- I didn't realise there was demand for
it ;)

This should probably be opened as a bug on the bugzilla, btw.
is it already there?

--j.
Re: Google Summer of Code 2007 ...
Doc Schneider writes: >Justin Mason wrote: >> Theo Van Dinter writes: >>> I'm assuming that there will be a Google Summer of Code 2007 going on, and >>> that the ASF will be involved again. So it's a good time to start thinking >>> about things we'd like to put up as possible projects. >>> >>> We still have a number of items from last year that we could use again. >>> Anything else that we'd like people to code up? >> >> Also, any suggestions from outside the dev team? Anyone got good ideas >> for new SpamAssassin features that would be good to pay someone to work on >> for 3 months? >> >> --j. > > Yeah an updated web interface for adding black and white list >and per user options for MySQL/PostGres that is a part of the core >utilities. Re this one -- what about Maia Mailguard? I haven't tried it myself, but it really sounds like they have that kind of thing really nicely sewn up? --j.
Re: Google Summer of Code 2007 ...
C. Bensend wrote: > Perhaps this is trivial, or not desired by anyone else but myself, > but I'd _love_ to be able to strip SpamAssassin tags via spamc and > spamd, instead of having to fire up the full-blown spamassassin > for each message. :) formail ? /Per Jessen, Zürich
Re: Google Summer of Code 2007 ...
Justin Mason wrote: Graham Murray writes: Theo Van Dinter <[EMAIL PROTECTED]> writes: Doesn't SA have at least 3 of those already? Razor, DCC, and Pyzor. Not quite. Those show how many times *others* have seen it, not how many times *I* have seen it. Also, these have hysteresis so if you are unfortunate enough to be at the start of the spam run and receive multiple mails all with the same body then Razor, DCC and Pyzor might not help. Though if this were implemented then there would have to be a whitelist for mailing lists to which multiple users have subscribed. I know that a few big organisations use a private DCC server for this purpose, with good results; doing it with a DCC server works well if you have multiple scanner machines. --j. Hmm... I didn't realize DCC could be used this way... I will investigate
Re: Google Summer of Code 2007 ...
Graham Murray writes: > Theo Van Dinter <[EMAIL PROTECTED]> writes: > > Doesn't SA have at least 3 of those already? Razor, DCC, and Pyzor. > > Not quite. Those show how many times *others* have seen it, not how > many times *I* have seen it. Also, these have hysteresis so if you are > unfortunate enough to be at the start of the spam run and receive multiple > mails all with the same body then Razor, DCC and Pyzor might not > help. Though if this were implemented then there would have to be a > whitelist for mailing lists to which multiple users have subscribed. I know that a few big organisations use a private DCC server for this purpose, with good results; doing it with a DCC server works well if you have multiple scanner machines. --j.
Re: Google Summer of Code 2007 ...
Justin Mason wrote: > Also, any suggestions from outside the dev team? Anyone got good ideas > for new SpamAssassin features that would be good to pay someone to work on > for 3 months?

If I look at the tools and scripts I built around SA (and which are far from perfect), I would like to see:

* Show which whitelist_from(_*) rule hit on messages
* Frequency reporting (à la nightly/mass checks) - currently I grep and filter this data out of the mail log file, which is error-prone
* "Virtualize" all *.cf and *.pre parsing to allow full DB-based operations
* Auto-maintenance for AWL (i.e. pruning "old" entries)
* Make the internal status of spamd (counters, timings, ..) available, e.g. through a CLI tool, through SNMP or through a database.
* A solid, stable, full-featured web interface for configuration, operation and monitoring (yes, DB-based again ;) ).

-- Matthias
Re: Google Summer of Code 2007 ...
>> Not quite. Those show how many times *others* have seen it, not how >> many times *I* have seen it. Also, these have hysteresis so if you are >> unfortunate enough to be at the start of the spam run and receive multiple >> mails all with the same body then Razor, DCC and Pyzor might not >> help. Though if this were implemented then there would have to be a >> whitelist for mailing lists to which multiple users have subscribed. >> Hi, ixhash, which also works that way, definitely started its life as an in-house mail counter. You could probably use ixhash or razor along with your own server rather than the public one. Wolfgang
Re: Google Summer of Code 2007 ...
Theo Van Dinter <[EMAIL PROTECTED]> writes: > Doesn't SA have at least 3 of those already? Razor, DCC, and Pyzor. Not quite. Those show how many times *others* have seen it, not how many times *I* have seen it. Also, these have hysteresis so if you are unfortunate enough to be at the start of the spam run and receive multiple mails all with the same body then Razor, DCC and Pyzor might not help. Though if this were implemented then there would have to be a whitelist for mailing lists to which multiple users have subscribed.
Re: Google Summer of Code 2007 ...
On Saturday February 17 2007 03:01, Quinn Comendant wrote: > How about an extensive statistics reporting tool, ..., that > can show how well a current spamassassin installation is performing > and where it needs improvements.

Well, not exactly in your words, but in the same spirit, this time belonging to SA itself: Instrument SA with a couple of performance measuring probes, providing some easier way to spot where bottlenecks lie. Just something simple enough to tell, look, currently waiting for a Razor server response (or some RBL) is taking 80% of elapsed time. Or, the Bayes db is very sluggish, it is taking 5 seconds to provide a result. A timing breakdown by subtasks is not that much work to provide, but gives great insight for troubleshooting and performance improvements.

Here is an example of a timing breakdown as currently provided in the log (at log level 2) by amavisd-new, without getting into specific details, except to say the numbers are elapsed time for each subtask in milliseconds (and in percent, just for the section, and then a cumulative percent of all sections so far):

TIMING [total 1840 ms] - SMTP pre-DATA-flush: 4 (0%)0, SMTP DATA: 95 (5%)5, check_init: 1 (0%)5, sql-enter: 69 (4%)9, mime_decode: 16 (1%)10, get-file-type2: 26 (1%)11, parts_decode: 1 (0%)12, check_header: 3 (0%)12, AV-scan-1: 14 (1%)12, AV-scan-2: 20 (1%)14, spam-wb-list: 5 (0%)14, SA call: 1517 (82%)96, update_cache: 3 (0%)97, decide_mail_destiny: 6 (0%)97, write-header: 15 (1%)98, save-to-local-mailbox: 1 (0%)98, prepare-dsn: 3 (0%)98, main_log_entry: 12 (1%)99, sql-update: 20 (1%)100, update_snmp: 2 (0%)100, SMTP pre-response: 1 (0%)100, SMTP response: 1 (0%)100, unlink-2-files: 1 (0%)100, rundown: 0 (0%)100

It tells at a glance that message checking and I/O for this particular message took 1840 ms in total, that receiving a message over SMTP for example took 5% of this, virus scanners were very quick (14 and 20 ms), and the SA call took 1517 ms, which is 82% of all elapsed time; all sections up to and including SA (cumulative) took 96% of total elapsed time.

Now, something like this relatively simple timing breakdown, but drilled down into an SA call, telling the administrator where it is worth spending his effort, or why all of a sudden SA takes 10 seconds instead of the usual 2. Mark
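A rough illustration of the kind of probe described in this message, written in Python purely as a sketch: it records elapsed time per named section and prints a line that loosely imitates the amavisd-new log format quoted in the message (milliseconds, per-section percent, cumulative percent). The class name, the section names and the injectable clock are assumptions made for the example; this is not how amavisd-new or SpamAssassin implement timing.

```python
import time
from contextlib import contextmanager


class TimingBreakdown:
    """Collects elapsed time per named section, in the order they ran."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the report can be checked deterministically.
        self._clock = clock
        self.sections = []  # list of (name, elapsed_ms)

    @contextmanager
    def section(self, name):
        start = self._clock()
        try:
            yield
        finally:
            elapsed_ms = (self._clock() - start) * 1000.0
            self.sections.append((name, elapsed_ms))

    def report(self):
        total = sum(ms for _, ms in self.sections)
        running = 0.0
        parts = []
        for name, ms in self.sections:
            running += ms
            pct = 100.0 * ms / total if total else 0.0
            cum = 100.0 * running / total if total else 0.0
            # name: elapsed_ms (section percent)cumulative percent
            parts.append(f"{name}: {ms:.0f} ({pct:.0f}%){cum:.0f}")
        return f"TIMING [total {total:.0f} ms] - " + ", ".join(parts)
```

Wrapping each subtask (a DNS lookup, a Bayes query, and so on) in `with timing.section("..."):` and logging `timing.report()` once per message would be enough to show which subtask dominates the elapsed time.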
Re: Google Summer of Code 2007 ...
On Sat, Feb 17, 2007 at 06:56:28PM -0500, Tim B. wrote: > How about a "How many times have I seen this message body" plugin... > > So each time SA sees the same or similar enough message body, it > increases the score. Doesn't SA have at least 3 of those already? Razor, DCC, and Pyzor. -- Randomly Selected Tagline: "I love deadlines. I like the whooshing sound they make as they fly by." - Douglas Adams
Re: Google Summer of Code 2007 ...
Justin Mason wrote: Theo Van Dinter writes: I'm assuming that there will be a Google Summer of Code 2007 going on, and that the ASF will be involved again. So it's a good time to start thinking about things we'd like to put up as possible projects. We still have a number of items from last year that we could use again. Anything else that we'd like people to code up? Also, any suggestions from outside the dev team? Anyone got good ideas for new SpamAssassin features that would be good to pay someone to work on for 3 months? --j. How about a "How many times have I seen this message body" plugin... So each time SA sees the same or similar enough message body, it increases the score.
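To make the suggestion concrete, here is a minimal Python sketch of a local "seen this body before" counter. Everything in it is an assumption for illustration: the whitespace/case normalization, the SHA-1 fingerprint, the class name and the point values are hypothetical, and none of it is an existing SpamAssassin plugin or API. It only catches bodies that are identical after normalization; the "similar enough" case would need a fuzzy hash instead.

```python
import hashlib
import re
from collections import Counter


def body_fingerprint(body: str) -> str:
    """Hash the body after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class SeenBodyCounter:
    """Adds points for every earlier sighting of the same body (hypothetical values)."""

    def __init__(self, points_per_repeat=0.5, max_points=3.0):
        self.points_per_repeat = points_per_repeat
        self.max_points = max_points
        self.counts = Counter()

    def score(self, body: str) -> float:
        fp = body_fingerprint(body)
        repeats = self.counts[fp]  # sightings before this message
        self.counts[fp] += 1
        return min(repeats * self.points_per_repeat, self.max_points)
```

The first sighting of a body scores 0.0, the second 0.5, and so on up to the cap; a real deployment would also need persistent storage shared between scanner processes and some expiry of old fingerprints.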
Re: Google Summer of Code 2007 ...
On Fri, 16 Feb 2007, Quinn Comendant wrote: How about an extensive statistics reporting tool, possible web-based, that can show how well a current spamassassin installation is performing and where it needs improvements. It could provide trends in different classes of spam and how each is marked. Also show info on whether expensive (as in cpu time) rules and plugins are actually doing any good. I don't know that this belongs in SA itself. It'd be a nice add-on, but SA already does logging that should be quite sufficient to write something like this. Not to mention, the best measure of the success of a spam filtering plan is user satisfaction. Chris St. Pierre Unix Systems Administrator Nebraska Wesleyan University -- Never send mail to [EMAIL PROTECTED]
Re: Google Summer of Code 2007 ...
Raul Dias writes: **snip > If I remember correctly spamd was using something between 2 to 5% of > memory reported by top (45 process max). > > If it was really shared, it would have not collapsed. > > My bet is that the model used on Linux is copy on write. So after a > fork, when the child spamd changes a value, the kernel makes its own > copy of the memory. (please correct me if I am wrong). To make it worse > perl script (AFAIK) is data and not code which makes harder to reuse > (espcially with evals around). > > Even if sharing does happen it is not enough. > > OTOH, with an I/O model, the total memory used would be: > - the perl interpreter and libraries (this is trully shared on a fork > model). > - the compiled perl code and perl libraries. > - one copy of the parsed rules and compiled regular expressions and non >message/scanner related data. Yeah. It's the lists and rules and regexes that do it for me. > - one M::SA::PerMsgStatus object for each simultaneous scanned message >(this is a place to put a limit on). > >> Still, if someone tries it and can demo increased efficiency... >> go for it ;) > > This might require some internal changes to SA. Every Sync call would > have to be changed to Async (NON BLOCKING). This might include SQL > calls, DNS calls, exec ing external apps and even file I/O. An async version of Net::DNS is http://search.cpan.org/~msergeant/ParaDNS-1.1/
Re: Google Summer of Code 2007 ...
On Sat, 2007-02-17 at 11:21 +, Justin Mason wrote: > Raul Dias writes: > > On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote: > > > On Saturday February 17 2007 01:49, Matthew Wilson wrote: > > > > I was/am primarily concerned with RAM usage for high-concurrency > > > > situations. > > > > > > Ok. Still, in my experience about 30 (maybe 50) SA processes can > > > fully utilize today's CPU & I/O, and it's probably no big deal > > > to provide about 2 GB of memory to cater for such system. > > > Also, and unfortunately, multithreading in Perl is rather > > > cumbersome and not significantly less expensive than fully > > > individual processes. > > > > After experiencing with the sa-blacklist.cf some time ago with 45 > > process brought my system to its knees with 3.5GB (out of memory). > > > > I agree about the thread model. > > > > But sticking to a async I/O model is a valid point. If implemented > > correctly it will save a lot of memory and even improve performance a > > little. > > > > Having separeted process saves the need to have to check for garbage > > after filtering a message, which will cause the code to have to be > > recheck. > > > > However, for uniprocessor systems, having multiple process running is > > actually more expansive than a async I/O one. For multiple process > > system, just keep one process for cpu or less. > > > > In the past I have played a lot with perl-loop (any loopers around?) > > which was the only way to go. It is too low level for most people, but > > perhaps POE is the way to go today (which can use perl-loop as its > > base). > > I'm dubious about the benefits for SpamAssassin... > > An async model works very well for network-bound and I/O-bound servers; > however, SpamAssassin is mainly CPU-bound, since the network and I/O parts > are already mostly run async during the scan operation. 
> > Also, the multiple spamd processes share quite a lot of RAM with each > other -- there's a bug in how linux reports "shared" memory which makes it > appear much worse than it is. read the FAQ for more details.

yep, but ...

01:01:37 kernel: Out of Memory: Killed process 10024 (spamd).
01:01:51 kernel: Out of Memory: Killed process 10044 (spamd).
01:02:05 kernel: Out of Memory: Killed process 10612 (spamd).
01:02:19 kernel: Out of Memory: Killed process 10038 (spamd).
01:02:32 kernel: Out of Memory: Killed process 10602 (spamd).
01:02:45 kernel: Out of Memory: Killed process 10398 (spamd).
01:03:04 kernel: Out of Memory: Killed process 10020 (spamd).
01:03:29 kernel: Out of Memory: Killed process 10015 (spamd).
01:03:42 kernel: Out of Memory: Killed process 10237 (spamd).
01:04:00 kernel: Out of Memory: Killed process 11037 (spamd).
01:04:18 kernel: Out of Memory: Killed process 10478 (spamd).
01:04:34 kernel: Out of Memory: Killed process 11065 (spamd).
01:04:40 kernel: Out of Memory: Killed process 10405 (spamd).
...and it goes...

If I remember correctly, each spamd was using somewhere between 2 and 5% of memory as reported by top (45 processes max). If it was really shared, it would not have collapsed. My bet is that the model used on Linux is copy on write. So after a fork, when the child spamd changes a value, the kernel makes its own copy of the memory. (please correct me if I am wrong). To make it worse, a perl script (AFAIK) is data and not code, which makes it harder to reuse (especially with evals around). Even if sharing does happen it is not enough.

OTOH, with an async I/O model, the total memory used would be:
- the perl interpreter and libraries (this is truly shared on a fork model).
- the compiled perl code and perl libraries.
- one copy of the parsed rules and compiled regular expressions and non message/scanner related data.
- one M::SA::PerMsgStatus object for each simultaneously scanned message (this is a place to put a limit on).

> Still, if someone tries it and can demo increased efficiency... > go for it ;)

This might require some internal changes to SA. Every Sync call would have to be changed to Async (NON BLOCKING). This might include SQL calls, DNS calls, exec'ing external apps and even file I/O. -Raul Dias > --j.
Re: Google Summer of Code 2007 ...
Raul Dias writes: > On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote: > > On Saturday February 17 2007 01:49, Matthew Wilson wrote: > > > I was/am primarily concerned with RAM usage for high-concurrency > > > situations. > > > > Ok. Still, in my experience about 30 (maybe 50) SA processes can > > fully utilize today's CPU & I/O, and it's probably no big deal > > to provide about 2 GB of memory to cater for such system. > > Also, and unfortunately, multithreading in Perl is rather > > cumbersome and not significantly less expensive than fully > > individual processes. > > After experiencing with the sa-blacklist.cf some time ago with 45 > process brought my system to its knees with 3.5GB (out of memory). > > I agree about the thread model. > > But sticking to a async I/O model is a valid point. If implemented > correctly it will save a lot of memory and even improve performance a > little. > > Having separeted process saves the need to have to check for garbage > after filtering a message, which will cause the code to have to be > recheck. > > However, for uniprocessor systems, having multiple process running is > actually more expansive than a async I/O one. For multiple process > system, just keep one process for cpu or less. > > In the past I have played a lot with perl-loop (any loopers around?) > which was the only way to go. It is too low level for most people, but > perhaps POE is the way to go today (which can use perl-loop as its > base). I'm dubious about the benefits for SpamAssassin... An async model works very well for network-bound and I/O-bound servers; however, SpamAssassin is mainly CPU-bound, since the network and I/O parts are already mostly run async during the scan operation. Also, the multiple spamd processes share quite a lot of RAM with each other -- there's a bug in how linux reports "shared" memory which makes it appear much worse than it is. read the FAQ for more details. 
Still, if someone tries it and can demo increased efficiency... go for it ;) --j.
Re: Google Summer of Code 2007 ...
On Fri, 16 Feb 2007 18:01:37 -0800, Quinn Comendant wrote: > And/or a fix for the qmail+simscan per-user preferences spamc -u > issue where if an email is addressed to multiple users or an alias > spamc isn't passed the correct user. Sorry to reply to myself, but I want to retract that last suggestion: it's not really spamassassin's job to parse recipient lists and resolve aliases. Q - Strangecode :: Internet Consultancy http://www.strangecode.com/ +1 530 624 4410
Re: Google Summer of Code 2007 ...
On Sat, 2007-02-17 at 02:07 +0100, Mark Martinec wrote: > On Saturday February 17 2007 01:49, Matthew Wilson wrote: > > I was/am primarily concerned with RAM usage for high-concurrency > > situations. > > Ok. Still, in my experience about 30 (maybe 50) SA processes can > fully utilize today's CPU & I/O, and it's probably no big deal > to provide about 2 GB of memory to cater for such system. > Also, and unfortunately, multithreading in Perl is rather > cumbersome and not significantly less expensive than fully > individual processes. When I experimented with sa-blacklist.cf some time ago, 45 processes brought my system to its knees with 3.5GB (out of memory). I agree about the thread model. But sticking to an async I/O model is a valid point. If implemented correctly it will save a lot of memory and even improve performance a little. Having separate processes saves the need to check for garbage after filtering a message, which would otherwise require the code to be rechecked. However, for uniprocessor systems, having multiple processes running is actually more expensive than an async I/O one. For multiprocessor systems, just keep one process per CPU or fewer. In the past I have played a lot with perl-loop (any loopers around?) which was the only way to go. It is too low level for most people, but perhaps POE is the way to go today (which can use perl-loop as its base). -Raul Dias
Re: Google Summer of Code 2007 ...
On Fri, 16 Feb 2007 15:35:39 +, Justin Mason wrote: >> We still have a number of items from last year that we could use again. >> Anything else that we'd like people to code up? How about an extensive statistics reporting tool, possible web-based, that can show how well a current spamassassin installation is performing and where it needs improvements. It could provide trends in different classes of spam and how each is marked. Also show info on whether expensive (as in cpu time) rules and plugins are actually doing any good. And/or a fix for the qmail+simscan per-user preferences spamc -u issue where if an email is addressed to multiple users or an alias spamc isn't passed the correct user. Quinn - Strangecode :: Internet Consultancy http://www.strangecode.com/
Re: Google Summer of Code 2007 ...
Mark Martinec writes: > On Saturday February 17 2007 01:49, Matthew Wilson wrote: > > I was/am primarily concerned with RAM usage for high-concurrency > > situations. > > Ok. Still, in my experience about 30 (maybe 50) SA processes can > fully utilize today's CPU & I/O, and it's probably no big deal > to provide about 2 GB of memory to cater for such system. > Also, and unfortunately, multithreading in Perl is rather > cumbersome and not significantly less expensive than fully > individual processes. yep -- that's pretty much what I've found, too. The earlier, non-ithreads version is pretty much nonfunctional :( --j.
Re: Google Summer of Code 2007 ...
On Saturday February 17 2007 01:49, Matthew Wilson wrote: > I was/am primarily concerned with RAM usage for high-concurrency > situations. Ok. Still, in my experience about 30 (maybe 50) SA processes can fully utilize today's CPU & I/O, and it's probably no big deal to provide about 2 GB of memory to cater for such system. Also, and unfortunately, multithreading in Perl is rather cumbersome and not significantly less expensive than fully individual processes. Mark
RE: Google Summer of Code 2007 ...
> -Original Message- > From: Mark Martinec [mailto:[EMAIL PROTECTED] > Sent: Friday, February 16, 2007 6:09 PM > To: users@spamassassin.apache.org > Subject: Re: Google Summer of Code 2007 ... > > Matthew Wilson wrote: > > - Full, tested, supportable multithreaded support > > - Full, tested, supportable support for an asynchronous I/O model > > (a la qpsmtpd-async) > > I think effort could be better spent elsewhere. > > Spam checking lands itself ideally to running parallel individual > processes, with little if any interaction between them. > For an individual user a reduction in processing latency from > three to one seconds doesn't mean a thing. For an entire mail > filtering system all that matters is its througput (messages per > hour). Multithreading brings no performance benefits in this area. > > Mark I was/am primarily concerned with RAM usage for high-concurrency situations.
Re: Google Summer of Code 2007 ...
Matthew Wilson wrote: > - Full, tested, supportable multithreaded support > - Full, tested, supportable support for an asynchronous I/O model > (a la qpsmtpd-async) I think effort could be better spent elsewhere. Spam checking lends itself ideally to running parallel individual processes, with little if any interaction between them. For an individual user a reduction in processing latency from three seconds to one doesn't mean a thing. For an entire mail filtering system all that matters is its throughput (messages per hour). Multithreading brings no performance benefits in this area. Mark
Re: Google Summer of Code 2007 ...
> Also, any suggestions from outside the dev team? Anyone got good ideas > for new SpamAssassin features that would be good to pay someone to work on > for 3 months?

Here's another one, to seize the opportunity when internal changes are being contemplated: Split the process into two parts:

- parsing and munging of mail & rules, resulting in a set of findings (e.g. a list of rules being hit, perhaps somehow generalized). This section can be done once per message, regardless of the number of recipients of the message (assuming all users use the same rules);
- based on the above, score the findings, possibly applying per-recipient scoring to each rule being hit. This (rather inexpensive) step can be applied for each recipient individually, without having to re-process an entire message in multiple-recipient mail.

...and adjust the API to Mail::SpamAssassin accordingly, so that MTA-based content filtering (e.g. amavisd-new) could take advantage of it, while still allowing full per-recipient customization of individual rule scores (including disabling some by a score of 0).

Benefits depend on the site, but our stats show 1.46 recipients per message on average. The above change (when calling SA at MTA level) would bring a 46% increase in throughput for free, while still providing individualized rule scoring. Mark
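The two-phase split proposed in this message can be sketched as follows, in Python and purely as an illustration. The function names, rule names and score values are hypothetical and are not part of the Mail::SpamAssassin API: an expensive scan runs once per message and returns the rules that hit, and a cheap scoring step runs once per recipient with that recipient's score overrides.

```python
def scan_message(message, rules):
    """Expensive phase, run once per message: return the names of rules that hit."""
    return [name for name, test in rules.items() if test(message)]


def score_for_recipient(hits, default_scores, overrides=None):
    """Cheap phase, run once per recipient: sum scores, honouring overrides.

    An override of 0.0 effectively disables a rule for that recipient.
    """
    scores = dict(default_scores)
    scores.update(overrides or {})
    return sum(scores.get(name, 0.0) for name in hits)


# Hypothetical rules and default scores, for illustration only.
RULES = {
    "MENTIONS_PILLS": lambda m: "pills" in m.lower(),
    "SHOUTING": lambda m: m.isupper(),
}
DEFAULT_SCORES = {"MENTIONS_PILLS": 2.5, "SHOUTING": 1.0}

hits = scan_message("CHEAP PILLS", RULES)  # scanned once
score_default = score_for_recipient(hits, DEFAULT_SCORES)
score_custom = score_for_recipient(hits, DEFAULT_SCORES, {"SHOUTING": 0.0})
```

On the arithmetic behind the 46% figure: if every recipient triggers a full scan, a site averaging 1.46 recipients per message performs 1.46 scans per message; scanning once instead lets the same hardware handle 1.46 times as many messages, assuming the per-recipient scoring step costs next to nothing.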
RE: Google Summer of Code 2007 ...
- Full, tested, supportable multithreaded support - Full, tested, supportable support for an asynchronous I/O model (a la qpsmtpd-async) - Pluggable to the point where all configuration and settings can be pulled from anywhere (databases, files, in-memory cache) at runtime, so SA could stay resident and have its configuration be changed in-process.
Re: Google Summer of Code 2007 ...
On Fri, 2007-02-16 at 15:35 +, Justin Mason wrote: > Theo Van Dinter writes: > > I'm assuming that there will be a Google Summer of Code 2007 going on, and > > that the ASF will be involved again. So it's a good time to start thinking > > about things we'd like to put up as possible projects. > > > > We still have a number of items from last year that we could use again. > > Anything else that we'd like people to code up? > > Also, any suggestions from outside the dev team? Anyone got good ideas > for new SpamAssassin features that would be good to pay someone to work on > for 3 months? > > --j.

Not a direct improvement, but...

- Add more hooks for plugins to allow a broader pluginization of SA, e.g. letting plugins load before parsing rules.
- Better documentation of the internal structures used, so that plugin authors avoid breaking stuff.
- "Class"inization of some structures to facilitate plugin reuse.
- The pluginization of SA. From Bayes to header, body, rawbody, score rules. The entire process of doing so would open doors for more external plugin usage and control.

While this might bring a slightly slower startup, in the long run the benefits can be great. -Raul Dias
Re: Google Summer of Code 2007 ...
John D. Hardin wrote: On Fri, 16 Feb 2007, Justin Mason wrote: Also, a related project would be to complete the pluginization of our "Bayes" engine and APIs, so that other probabilistic classifiers can be plugged in in place of, or in addition to, Bayes in SpamAssassin. +1 If that's a notation for "me too", then: ++ I'm all for all implications on this subject so far: 1) the new PPM-D compression based classifier technique 2) pluginization of all of the probability based classifiers, so that sites can choose between the SA bayes implementation, other bayes implementations, PPM-D, or other learning processes (individually or in combinations)
Re: Google Summer of Code 2007 ...
On Fri, 16 Feb 2007, Justin Mason wrote: > Also, a related project would be to complete the pluginization of > our "Bayes" engine and APIs, so that other probabilistic > classifiers can be plugged in in place of, or in addition to, > Bayes in SpamAssassin. +1 -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ [EMAIL PROTECTED]FALaholic #11174 pgpk -a [EMAIL PROTECTED] key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Windows Genuine Advantage (WGA) means that now you use your computer at the sufferance of Microsoft Corporation. They can kill it remotely without your consent at any time for any reason. --- 6 days until George Washington's 275th Birthday
Re: Google Summer of Code 2007 ...
On 2/16/07, Justin Mason <[EMAIL PROTECTED]> wrote: Also, any suggestions from outside the dev team? Anyone got good ideas for new SpamAssassin features that would be good to pay someone to work on for 3 months? http://issues.apache.org/SpamAssassin/show_bug.cgi?id=3785
Re: Google Summer of Code 2007 ...
On Fri, Feb 16, 2007 at 09:31:13AM -0800, Dan wrote: > On Feb 16, 2007, at 7:35, Justin Mason wrote: > >>We still have a number of items from last year that we could use again. > >>Anything else that we'd like people to code up? > >Also, any suggestions from outside the dev team? Anyone got good ideas > >for new SpamAssassin features that would be good to pay someone to work on > >for 3 months? > I don't know how to code myself but have a new method for scoring messages > that could be included natively in SA. It would work in > place of weight based scoring. Does this sound like it qualifies? Perhaps pluginize the scoring mechanisms so we can have plugins that implement different ham/spam decision rules? -- Duncan Findlay
Re: Google Summer of Code 2007 ...
On Fri, 16 Feb 2007, Mark Martinec wrote: I believe this was once mentioned on a Justin's blog (but can't find a ref now), the following sounds promising as an additional classifier to existing bayes (especially since the author comes from the same organization as myself :) http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec ijsSPAM2PPM-D compression model Andrej Bratko (Josef Stefan Institute) Observations: The most startling observation is that character-based compression models perform outstandingly well for spam filtering. Commonly used open-source filters perform well, but not nearly so well or nearly so poorly as reported elsewhere. This looks very promising. I found a description of the ijsSPAM2 tool on the site: http://www.virusbtn.com/spambulletin/archive/2006/03/sb200603-compression Remarkable stuff. That would be a helluva nice plugin to have. Chris St. Pierre Unix Systems Administrator Nebraska Wesleyan University Never send mail to [EMAIL PROTECTED]
Re: Google Summer of Code 2007 ...
Justin Mason writes: > Also, a related project would be to complete the pluginization of our > "Bayes" engine and APIs, so that other probabilistic classifiers can be > plugged in in place of, or in addition to, Bayes in SpamAssassin. Right. I felt a need for something like this when I was switching Bayes from MySQL to PostgreSQL 8.2 and my first urge was to run both in parallel, giving the new one small scores to see how it behaves. (But I wasn't desperate enough to implement it, just switched and kept an eye on it for a while. Btw, 8.2 behaves much better (faster) than earlier versions; it seems like some new optimizations were geared precisely to suit SA queries.) Mark
Re: Google Summer of Code 2007 ...
On Feb 16, 2007, at 7:35, Justin Mason wrote: We still have a number of items from last year that we could use again. Anything else that we'd like people to code up? Also, any suggestions from outside the dev team? Anyone got good ideas for new SpamAssassin features that would be good to pay someone to work on for 3 months? I don't know how to code myself but have a new method for scoring messages that could be included natively in SA. It would work in place of weight based scoring. Does this sound like it qualifies? Dan
Re: Google Summer of Code 2007 ...
Mark Martinec writes: > > Also, any suggestions from outside the dev team? Anyone got good ideas > > for new SpamAssassin features that would be good to pay someone to work on > > for 3 months? > > I believe this was once mentioned on a Justin's blog (but can't find > a ref now), the following sounds promising as an additional classifier > to existing bayes (especially since the author comes from the same > organization as myself :) > > http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec > > ijsSPAM2PPM-D compression model > Andrej Bratko (Josef Stefan Institute) > > Observations: > The most startling observation is that character-based compression models > perform outstandingly well for spam filtering. Commonly used open-source > filters perform well, but not nearly so well or nearly so poorly as > reported elsewhere. Yes, definitely! A related algorithm is OSBF, as implemented here: http://osbf-lua.luaforge.net/ This had the best performance for hand-trained probabilistic classifiers in the TREC Spam Track 2006 evaluation -- that's good ;) Also, a related project would be to complete the pluginization of our "Bayes" engine and APIs, so that other probabilistic classifiers can be plugged in in place of, or in addition to, Bayes in SpamAssassin. --j.
Re: Google Summer of Code 2007 ...
> Also, any suggestions from outside the dev team? Anyone got good ideas > for new SpamAssassin features that would be good to pay someone to work on > for 3 months? I believe this was once mentioned on Justin's blog (but I can't find a ref now); the following sounds promising as an additional classifier to the existing Bayes (especially since the author comes from the same organization as myself :) http://www.virusbtn.com/spambulletin/archive/2006/01/sb200601-trec ijsSPAM2: PPM-D compression model, Andrej Bratko (Josef Stefan Institute) Observations: The most startling observation is that character-based compression models perform outstandingly well for spam filtering. Commonly used open-source filters perform well, but not nearly so well or nearly so poorly as reported elsewhere. Mark
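For readers unfamiliar with the idea of classifying by compression, the Python sketch below shows the general principle only. It is not ijsSPAM2 and does not implement PPM-D: zlib (DEFLATE) from the Python standard library is swapped in as a stand-in compressor, and the corpora, labels and function names are made up for the example. A message is assigned to the class whose training text lets it be compressed into the fewest additional bytes.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (stand-in for a PPM-D model)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """How many more bytes are needed once the message is appended to the corpus."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: str, corpora: dict) -> str:
    """Pick the label whose corpus 'explains' the message most cheaply."""
    return min(corpora, key=lambda label: extra_bytes(corpora[label], message))


# Tiny made-up training corpora, for illustration only.
CORPORA = {
    "spam": "buy cheap pills online now " * 20,
    "ham": "meeting agenda for the project review on monday " * 20,
}
```

With these toy corpora, `classify("cheap pills online", CORPORA)` should come out as "spam" and `classify("project review on monday", CORPORA)` as "ham", because each message repeats text the matching corpus already contains. A real filter would use an adaptive character-level model, as the TREC entry does, rather than recompressing whole corpora per message.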
Re: Google Summer of Code 2007 ...
Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again. So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> >
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?
>
> Also, any suggestions from outside the dev team? Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?
>
> --j.

Maybe. Several of us use MailScanner. MailScanner does not use spamc, it
loads SA directly. One of the features of MailScanner is called MCP, or
Message Content Protection. MCP uses, or attempts to use, SA to do specific
targeted message content checking.

Many people, we included, would like to be able to use this, but it seems
there is always some gotcha to having SA loaded in MailScanner twice:
problems with the directory paths, rules in memory, etc.

The ability to run SA with two totally different configurations in the same
application would be very handy. Different rules for outbound vs. inbound
mail, and MCP (as in: this user wants zero mail with the word "breast",
regardless of the rest of the message content), are just two examples.

Contacting Julian on the MailScanner list would give far better examples and
details than I could.

Just a thought,

DAve

--
Three years now I've asked Google why they don't have a logo change for
Memorial Day. Why do they choose to do logos for other non-international
holidays, but nothing for Veterans? Maybe they forgot who made that choice
possible.
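The request above boils down to keeping rules and settings per instance rather than process-wide. The Python sketch below illustrates only that design point. It says nothing about how SpamAssassin or MailScanner are actually structured, and the `Scanner` class, its rules and its thresholds are all invented for the example.

```python
import re


class Scanner:
    """Hypothetical scanner whose rules and threshold live on the instance."""

    def __init__(self, name: str, threshold: float) -> None:
        self.name = name
        self.threshold = threshold
        self._rules = []  # list of (compiled regex, score) pairs

    def add_rule(self, pattern: str, score: float) -> None:
        self._rules.append((re.compile(pattern, re.IGNORECASE), score))

    def score(self, text: str) -> float:
        """Sum the scores of every rule whose pattern occurs in text."""
        return sum(score for regex, score in self._rules if regex.search(text))

    def is_hit(self, text: str) -> bool:
        return self.score(text) >= self.threshold


# Two scanners with unrelated rule sets coexist in one process.
inbound = Scanner("inbound", threshold=5.0)
inbound.add_rule(r"cheap pills", 5.0)

mcp = Scanner("mcp", threshold=1.0)
mcp.add_rule(r"\bbreast\b", 1.0)
```

Because nothing is stored at module level, adding a rule to `mcp` cannot change what `inbound` reports, which is the separation the message asks for.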
Re: Google Summer of Code 2007 ...
Justin Mason wrote:
> Theo Van Dinter writes:
> > I'm assuming that there will be a Google Summer of Code 2007 going on, and
> > that the ASF will be involved again. So it's a good time to start thinking
> > about things we'd like to put up as possible projects.
> >
> > We still have a number of items from last year that we could use again.
> > Anything else that we'd like people to code up?
>
> Also, any suggestions from outside the dev team? Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?
>
> --j.

Yeah, an updated web interface for adding blacklist and whitelist entries
and per-user options for MySQL/PostgreSQL, shipped as part of the core
utilities.

--
-Doc
SA/SARE -- Ninja
9:52am up 21:19, 17 users, load average: 0.11, 0.36, 0.52
SARE HQ http://www.rulesemporium.com/
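As a rough idea of the storage layer such a web interface would sit on, here is a Python sketch. It uses SQLite from the standard library as a stand-in for MySQL/PostgreSQL. The table layout (username, preference, value) is an assumption for illustration, and the helper names are hypothetical. Check the SQL documentation shipped with SpamAssassin for the real schema.

```python
import sqlite3
from typing import List


def open_prefs(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the assumed per-user preference table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS userpref ("
        " prefid INTEGER PRIMARY KEY AUTOINCREMENT,"
        " username TEXT NOT NULL,"
        " preference TEXT NOT NULL,"
        " value TEXT NOT NULL)"
    )
    return conn


def add_pref(conn: sqlite3.Connection, username: str, preference: str, value: str) -> None:
    """Store one preference row, e.g. a whitelist or blacklist entry."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO userpref (username, preference, value) VALUES (?, ?, ?)",
            (username, preference, value),
        )


def get_prefs(conn: sqlite3.Connection, username: str, preference: str) -> List[str]:
    """Return all values of one preference for one user, oldest first."""
    rows = conn.execute(
        "SELECT value FROM userpref WHERE username = ? AND preference = ?"
        " ORDER BY prefid",
        (username, preference),
    )
    return [row[0] for row in rows]
```

A web front end would call `add_pref()` with a preference name such as `whitelist_from` or `blacklist_from` when a user submits a form. The parameterised queries keep user-supplied addresses out of the SQL text.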
Re: Google Summer of Code 2007 ...
> Also, any suggestions from outside the dev team? Anyone got good ideas
> for new SpamAssassin features that would be good to pay someone to work on
> for 3 months?

Perhaps this is trivial, or not desired by anyone else but myself, but I'd
_love_ to be able to strip SpamAssassin tags via spamc and spamd, instead
of having to fire up the full-blown spamassassin for each message. :)

Benny

--
"Very funny, Scotty. Now beam down my clothes." -- James T. Kirk
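For readers wondering what stripping tags involves at its simplest, here is a partial Python sketch. It is not how spamc, spamd or the spamassassin script do it. It only drops `X-Spam-*` headers and a subject prefix, and the default prefix `*****SPAM*****` is an assumption about the local configuration. Unwrapping a message that was attached to a spam report is not covered.

```python
from email import message_from_string


def strip_spam_markup(raw: str, subject_tag: str = "*****SPAM*****") -> str:
    """Remove X-Spam-* headers and a leading subject tag from a raw message."""
    msg = message_from_string(raw)

    # Deleting by name removes every occurrence of that header.
    for header in list(msg.keys()):
        if header.lower().startswith("x-spam-"):
            del msg[header]

    # Drop the tag only when it is the very start of the Subject.
    subject = msg.get("Subject")
    if subject is not None and subject.startswith(subject_tag):
        msg.replace_header("Subject", subject[len(subject_tag):].lstrip())

    return msg.as_string()
```

Doing this inside the daemon, as the message suggests, would avoid starting a separate Perl process for each message.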
Google Summer of Code 2007 ...
Theo Van Dinter writes:
> I'm assuming that there will be a Google Summer of Code 2007 going on, and
> that the ASF will be involved again. So it's a good time to start thinking
> about things we'd like to put up as possible projects.
>
> We still have a number of items from last year that we could use again.
> Anything else that we'd like people to code up?

Also, any suggestions from outside the dev team? Anyone got good ideas
for new SpamAssassin features that would be good to pay someone to work on
for 3 months?

--j.