fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread David F. Skoll
On Thu, 7 Mar 2013 08:57:28 +0100
Giampaolo Tomassoni giampa...@tomassoni.biz wrote:

 I don't see too many differences with running more SA
 processes with linuxes (in which a fork() is actually a vfork()).

I don't believe that's true.  Do you have evidence to back up that claim?
fork() and vfork() have very different semantics and vfork() would not
work at all for spamd.

Regards,

David.


Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Matus UHLAR - fantomas

On Thu, 7 Mar 2013 08:57:28 +0100
Giampaolo Tomassoni giampa...@tomassoni.biz wrote:

I don't see too many differences with running more SA
processes with linuxes (in which a fork() is actually a vfork()).


On 07.03.13 07:01, David F. Skoll wrote:

I don't believe that's true.  Do you have evidence to back up that claim?
fork() and vfork() have very different semantics and vfork() would not
work at all for spamd.


the implementation of fork() in linux makes it nearly the same as vfork().
--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
WinError #9: Out of error messages.


Re: Several rules not hitting on 3.4 that do hit on 3.3.2

2013-03-07 Thread Axb

On 03/07/2013 03:32 AM, Mark Martinec wrote:

Yes, I am using DecodeShortURLs
I have it on both the 3.3.2 and 3.4 systems

Both show:
0.0 HAS_SHORT_URL  Message contains one or more shortened URLs



So I guess the question is which one is running DecodeShortURLs  correctly
3.4 or 3.3.2


Missing the {hosts} part, which is now required in 3.4.0:

--- DecodeShortURLs.pm~ 2011-07-25 17:56:57.0 +0200
+++ DecodeShortURLs.pm  2013-03-07 03:27:24.0 +0100
@@ -474,5 +474,6 @@
       foreach (@{$info->{cleaned}}) {
-        my $dom = Mail::SpamAssassin::Util::uri_to_domain($_);
+        my($dom,$host) = Mail::SpamAssassin::Util::uri_to_domain($_);
 
         if ($dom && !$info->{domains}->{$dom}) {
+          $info->{hosts}->{$host} = $dom;
           $info->{domains}->{$dom} = 1;



Mark,

What version of the plugin are you patching?





Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread David F. Skoll
On Thu, 7 Mar 2013 13:47:55 +0100
Matus UHLAR - fantomas uh...@fantomas.sk wrote:

 the implementation of fork() in linux makes it nearly the same as
 vfork().

That is completely wrong.  Just because modern forks use copy-on-write
doesn't make them anything at all like vfork; the semantics are utterly
different.

Regards,

David.



Re: Several rules not hitting on 3.4 that do hit on 3.3.2

2013-03-07 Thread Mark Martinec
  Missing the {hosts} part, which is now required in 3.4.0:
  --- DecodeShortURLs.pm~ 2011-07-25 17:56:57.0 +0200
  +++ DecodeShortURLs.pm  2013-03-07 03:27:24.0 +0100
 
 What version of the plugin are you patching?

The last I could find in one of my old directories,
it claims $VERSION=0.6, was downloaded in 2011-07.
I couldn't find a current on-line version anywhere.

  Mark


Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Matus UHLAR - fantomas

On Thu, 7 Mar 2013 13:47:55 +0100
Matus UHLAR - fantomas uh...@fantomas.sk wrote:


the implementation of fork() in linux makes it nearly the same as
vfork().


On 07.03.13 07:53, David F. Skoll wrote:

That is completely wrong.  Just because modern forks use copy-on-write
doesn't make them anything at all like vfork; the semantics are utterly
different.


I'm not talking about the semantics but about the implementation.  Simply
said, vfork() was developed to avoid the process memory copying done by
fork().  On Linux, fork() does NOT copy process memory.


--
Matus UHLAR - fantomas, uh...@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
WinError #98652: Operation completed successfully.


RE: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Giampaolo Tomassoni
 On Thu, 7 Mar 2013 13:47:55 +0100
 Matus UHLAR - fantomas uh...@fantomas.sk wrote:
 
  the implementation of fork() in linux makes it nearly the same as
  vfork().
 
 That is completely wrong.  Just because modern forks use copy-on-write
 doesn't make them anything at all like vfork; the semantics are utterly
 different.

Uhu! You need to put things in their own context in order to get the
semantics.

Should I have said: a fork under Linux attains performance close to that
of a vfork?

I'm replying to a list, not writing a CS book, come on...

Giampaolo


 Regards,
 
 David.



Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread David F. Skoll
On Thu, 7 Mar 2013 14:18:12 +0100
Matus UHLAR - fantomas uh...@fantomas.sk wrote:

 I'm not talking about the semantics but about the implementation.
 Simply said, vfork() was developed to avoid process memory copying
 used at fork(). on linux, fork() does NOT copy process memory.

vfork() also suspends execution of the parent until the child calls
execve or _exit.  If the child happens to write into its memory, the parent
sees the changes... very different from fork().
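
To make the difference concrete, here is a minimal sketch in Python (not
from the thread; the function name is made up for illustration). It uses
only os.fork(), which calls fork() on Linux, and shows that a write made by
the child stays private to the child.

```python
import os


def child_write_is_private():
    """Fork, let the child modify a variable, and report what each side sees."""
    data = ["parent value"]
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: modify its own copy and report it back through the pipe.
        try:
            os.close(read_fd)
            data[0] = "child value"
            os.write(write_fd, data[0].encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path
    # Parent: collect the child's view, then look at its own copy.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen_by_child = pipe.read().decode()
    os.waitpid(pid, 0)
    return data[0], seen_by_child
```

The parent still sees "parent value" after the child has written "child
value". Under vfork() semantics the two would share memory until the child
called execve or _exit, so a pattern like this would not be safe.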

Now, as for the great benefits of copy-on-write: It is actually almost
useless with Perl programs.  Here's the reason: Perl uses
reference-counting to know when to free memory.  So even if you access
memory read-only by creating a new reference to the underlying object,
that effectively becomes a write operation and Linux needs to copy the
page.

I think if you measure what happens to Perl processes that fork a number
of children to handle requests, you'll see that there's very little memory
sharing after a short while.
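
The reference-counting effect is easy to observe in any refcounted
interpreter. The sketch below is an analogy in CPython, not a Perl
measurement: it only shows that taking a new reference changes a counter
stored with the object, which is a memory write from the kernel's point of
view. The function name is hypothetical.

```python
import sys


def refcount_delta():
    """Return how much an object's refcount changes when one reference is added."""
    payload = object()
    before = sys.getrefcount(payload)
    holders = [payload]  # a purely "read-only" use: just keep another reference
    after = sys.getrefcount(payload)
    del holders
    return after - before
```

How many pages this actually un-shares in a long-running daemon depends on
the workload, which is why measuring is more useful than arguing.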

Regards,

David.



Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Henrik K
On Thu, Mar 07, 2013 at 09:48:19AM -0500, David F. Skoll wrote:

 I think if you measure what happens to Perl processes that fork a number
 of children to handle requests, you'll see that there's very little memory
 sharing after a short while.

Please let's stop the techno-theorizing and provide actual results.
We already had this exact same discussion at least once *sigh*.

Start something like:

spamd -4 -p 1234 --min-children=50 --min-spare=50 --max-conn-per-child=1000 --round-robin -L

50 non-recycled children, fed 1000 requests (~20 each).

Memory measured with free (without buffers/cache etc):

begin 2588084
end 1296756

About 25MB non-shared memory used per child, which is pretty normal
since SA uses lots of internal per-message data.  On 32-bit systems the
usage could be half of that.

So in the case of SA, it's not anywhere near "very little memory shared
after a short while".
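
The per-child figure follows from the two free readings. A small sketch of
the arithmetic, assuming the readings are in kB (the default unit of free)
and that all of the drop is attributable to the 50 children; the function
name is made up for illustration:

```python
def per_child_kib(free_before_kib, free_after_kib, children):
    """Average non-shared memory per child, from two 'free' readings in kB."""
    return (free_before_kib - free_after_kib) / children


# Readings quoted above: begin 2588084, end 1296756, 50 children.
per_child = per_child_kib(2588084, 1296756, 50)
per_child_mib = per_child / 1024  # roughly 25 MB, matching the estimate above
```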



Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread David F. Skoll
On Thu, 7 Mar 2013 17:47:22 +0200
Henrik K h...@hege.li wrote:

 Memory measured with free (without buffers/cache etc):

 begin 2588084
 end 1296756

 About 25MB non-shared memory used per child,

Are you sure your measurements are correct?  I use MIMEDefang which also
has a preforked-children architecture and I see only about 4MB shared
per child with the vast majority of per-child memory non-shared.  This
is based on what top reports.

 So in the case of SA, it's not anywhere near very little memory
 shared after a short while.

My measurements completely disagree with yours, so one of us (or both?) is
wrong.

Regards,

David.



RE: Several rules not hitting on 3.4 that do hit on 3.3.2

2013-03-07 Thread Scott Ostrander
 -Original Message-
 Sent: Thursday, March 07, 2013 5:13 AM
 To: users@spamassassin.apache.org
 Subject: Re: Several rules not hitting on 3.4 that do hit on 3.3.2
 
   Missing the {hosts} part, which is now required in 3.4.0:
   --- DecodeShortURLs.pm~   2011-07-25 17:56:57.0 +0200
   +++ DecodeShortURLs.pm2013-03-07 03:27:24.0 +0100
 
  What version of the plugin are you patching?
 
 The last I could find in one of my old directories, it claims $VERSION=0.6, 
 was

I can confirm that making the same changes to version 5 of DecodeShortURLs.pm
works with SA 3.4.
I have not been able to find version 6.

Scott Ostrander


Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Henrik K
On Thu, Mar 07, 2013 at 11:37:33AM -0500, David F. Skoll wrote:
 On Thu, 7 Mar 2013 17:47:22 +0200
 Henrik K h...@hege.li wrote:
 
  Memory measured with free (without buffers/cache etc):
 
  begin 2588084
  end 1296756
 
  About 25MB non-shared memory used per child,
 
 Are you sure your measurements are correct?  I use MIMEDefang which also
 has a preforked-children architecture and I see only about 4MB shared
 per child with the vast majority of per-child memory non-shared.  This
 is based on what top reports.

You provide no data on how you ended up with the 4MB etc. And MD is not SA,
it might do all sorts of funky stuff.

How about actually trying the provided spamd line yourself instead of again
theorizing about how someone is measuring wrong etc?

Well, actually here is the one I used to get 50 children.. pasted the wrong one.
spamd -4 -p 1234 -m 50 --min-children=50 --min-spare=40 --max-conn-per-child=1000 --round-robin -L

Just feed a lot of random messages with spamc -p 1234.



Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread David F. Skoll
On Thu, 7 Mar 2013 18:56:45 +0200
Henrik K h...@hege.li wrote:

 You provide no data how you end up with the 4MB etc. And MD is not
 SA, it might do all sorts of funky stuff.

I wrote MD, so I'm pretty sure it's not doing any funky stuff.

 How about actually trying the provided spamd line yourself and not
 keep again theorizing how someone is measuring wrong etc?

I don't have any machine with spamd installed.

Regards,

David.


RE: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Giampaolo Tomassoni
 On Thu, Mar 07, 2013 at 11:37:33AM -0500, David F. Skoll wrote:
  On Thu, 7 Mar 2013 17:47:22 +0200
  Henrik K h...@hege.li wrote:
 
   Memory measured with free (without buffers/cache etc):
 
   begin 2588084
   end 1296756
 
   About 25MB non-shared memory used per child,
 
  Are you sure your measurements are correct?  I use MIMEDefang which
 also
  has a preforked-children architecture and I see only about 4MB shared
  per child with the vast majority of per-child memory non-shared.
 This
  is based on what top reports.
 
 You provide no data how you end up with the 4MB etc. And MD is not SA,
 it
 might do all sorts of funky stuff.
 
 How about actually trying the provided spamd line yourself and not keep
 again theorizing how someone is measuring wrong etc?
 
 Well actually here is the one I used to get 50 childs.. pasted wrong
 one.
 spamd -4 -p 1234 -m 50 --min-children=50 --min-spare=40 --max-conn-per-child=1000 --round-robin -L
 
 Just feed a lot of random messages with spamc -p 1234.

I just took a look at the /proc/pid/smaps files of my amavisd's 5 children,
summing together the counts of Private_{Clean|Dirty} pages.

I got this:

p1: 74,164 kB
p2: 70,772 kB
p3: 71,548 kB
p4: 74,064 kB
p5: 70,784 kB

This accounts for a total of 361,332 kB of unique memory (say 353 MB),
or ~ 71 MB on average.

Does this sound right?

Giampaolo



Re: fork is vfork?

2013-03-07 Thread Bernd Petrovitsch
On Thu, 2013-03-07 at 12:14 -0500, David F. Skoll wrote:
 On Thu, 7 Mar 2013 18:56:45 +0200
 Henrik K h...@hege.li wrote:
 
  You provide no data how you end up with the 4MB etc. And MD is not
  SA, it might do all sorts of funky stuff.
 
 I wrote MD, so I'm pretty sure it's not doing any funky stuff.

MD forks the worker process and the worker process initializes libperl
and loads the perl script.

To share more memory on a fork, it should initialize libperl and load
the perl script before forking off the worker processes.


BTW switching to vfork() won't speed up anything significantly IMHO, as
processes are (re)started quite seldom - it is not as if one starts a
process for each mail.

Bernd
-- 
Bernd Petrovitsch  Email : be...@petrovitsch.priv.at
 LUGA : http://www.luga.at



Re: fork is vfork?

2013-03-07 Thread David F. Skoll
On Thu, 07 Mar 2013 19:04:22 +0100
Bernd Petrovitsch be...@petrovitsch.priv.at wrote:

 MD forks the worker process and the worker process initializes libperl
 and loads the perl script.

Nope.

 To share more memory on a fork, it should initialize libperl and load
 the perl script before forking off the worker processes.

That's what it does.

Regards,

David.


Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Henrik K
On Thu, Mar 07, 2013 at 07:02:00PM +0100, Giampaolo Tomassoni wrote:

 I just got a snip into my amavisd's 5 children /proc/pid/smaps file,
 summing together the count of Private_{Clean|Dirty} pages.
 
 I got this:
 
   p1: 74,164 kb
   p2: 70,772 kb
   p3: 71,548 kb
   p4: 74,064 kb
   p5: 70,784 kb
 
 This accounts for a total of unique 361,332 kB (say 353 MB?). ~ 71MB in the
 average.
 
 Sounds this good?

Memory management is tricky though. It is hard to tell which values sum up
to the real thing.

Probably the best measure on Linux is the actual free value highlighted below.
Check it before starting amavisd/spamd/whatnot and check it again after
running for a while.  Also double-check it after killing all the processes.
I'm open to being proved wrong.

$ free
             total       used       free     shared    buffers     cached
Mem:       1047496     944236     103260          0       2904     284336
-/+ buffers/cache:     656996  ___390500___
Swap:       524272         28     257604
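
The highlighted value is derived from the Mem: row. A short sketch of that
derivation, using the numbers from the output above (values in kB; the
function names are made up for illustration):

```python
def free_without_cache_kib(free_kib, buffers_kib, cached_kib):
    """Memory available to applications: free plus reclaimable buffers/cache."""
    return free_kib + buffers_kib + cached_kib


def used_without_cache_kib(used_kib, buffers_kib, cached_kib):
    """Memory actually held by applications: used minus buffers/cache."""
    return used_kib - buffers_kib - cached_kib


# Mem: row above -> the "-/+ buffers/cache" row
available = free_without_cache_kib(103260, 2904, 284336)
in_use = used_without_cache_kib(944236, 2904, 284336)
```

Comparing this value before starting the daemon and after it has handled
traffic gives the system-wide cost, at the price of also picking up
whatever else changed on the machine in the meantime.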



Understanding spamhaus FP

2013-03-07 Thread Alex
Hi,

I received an email that was tagged with KHOP_SPAMHAUS_DROP, which
means it was listed in the Spamhaus Don't Route Or Peer List.
However, I've checked every IP and domain in the email, and none are
listed on any spamhaus list, even as of a minute ago. What is it in
this message that is being tagged?

http://pastebin.com/qPq9ah7P

When it was initially received, it was also listed in
RCVD_IN_HOSTKARMA_BL, but when I checked again just five minutes after
receiving it, it was no longer listed there. The MX for the
originating domain appears to be managed by register.com, although
that appears to have been stripped out of the header by the next hop
(broadviewnet.net)?

Any ideas greatly appreciated.
Thanks,
Alex


R: Re: fork is vfork? (was Re: With similar rules, rspamd is about ten times faster than SpamAssassin.)

2013-03-07 Thread Giampaolo Tomassoni
The Private_ entries in /proc/.../smaps are reported to be the right choice
here: they count only pages that are not shared with any other process,
i.e. the ones touched after fork and the newly allocated ones.

Also, smaps is a relatively new proc entry, meant exactly to cope with the
whole Linux memory stats mess.
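
A minimal sketch of that summation, assuming the usual smaps line format
("Private_Dirty:        12 kB"). The function names are made up for
illustration, and reading another user's /proc/<pid>/smaps normally
requires root.

```python
import re

# Matches only Private_Clean and Private_Dirty, not Private_Hugetlb or Shared_*.
PRIVATE_LINE = re.compile(r"^Private_(?:Clean|Dirty):\s+(\d+)\s+kB", re.MULTILINE)


def private_kib(smaps_text):
    """Sum the Private_Clean and Private_Dirty values (kB) over all mappings."""
    return sum(int(value) for value in PRIVATE_LINE.findall(smaps_text))


def private_kib_for_pid(pid):
    """Read /proc/<pid>/smaps and return the process's private memory in kB."""
    with open(f"/proc/{pid}/smaps") as handle:
        return private_kib(handle.read())
```

Calling private_kib_for_pid() for each child and adding up the results gives
the kind of per-process figures listed earlier in the thread.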

Giampaolo





ExtractText.pm not working with SA 3.4

2013-03-07 Thread Scott Ostrander
Does anybody have ExtractText working with SA 3.4?
http://whatever.truls.org/graphdefang/ExtractText.zip

I loved this third party plugin back in SA 3.2.5.
Every once in a while some attachment spam gets through.

unrtf on command line works giving expected output.
/usr/local/bin/unrtf -t ExtractText.tags -nopict  RTF.rtf

Debug output shows nothing extracted.

Mar  7 10:22:15.405 [18289] dbg: extracttext: set: magic=1
Mar  7 10:22:15.405 [18289] dbg: extracttext: external: antiword /usr/bin/antiword,-t,-w,0,-m,UTF-8.txt,-
Mar  7 10:22:15.406 [18289] dbg: extracttext: use: antiword name .*\.doc
Mar  7 10:22:15.406 [18289] dbg: extracttext: use: antiword name .*\.dot
Mar  7 10:22:15.406 [18289] dbg: extracttext: use: antiword type application/(?:vnd\.?)?ms-?word.*
Mar  7 10:22:15.406 [18289] dbg: extracttext: external: unrtf /usr/local/bin/unrtf,-t,ExtractText.tags,--nopict
Mar  7 10:22:15.406 [18289] dbg: extracttext: use: unrtf name .*\.doc
Mar  7 10:22:15.407 [18289] dbg: extracttext: use: unrtf name .*\.rtf
Mar  7 10:22:15.407 [18289] dbg: extracttext: use: unrtf type application/rtf
Mar  7 10:22:15.407 [18289] dbg: extracttext: use: unrtf type text/rtf
Mar  7 10:22:15.407 [18289] dbg: extracttext: external: odt2txt /usr/bin/odt2txt,--encoding=UTF-8,${file}
Mar  7 10:22:15.407 [18289] dbg: extracttext: use: odt2txt name .*\.odt
Mar  7 10:22:15.407 [18289] dbg: extracttext: use: odt2txt name .*\.ott
Mar  7 10:22:15.408 [18289] dbg: extracttext: use: odt2txt type application/.*?opendocument.*text
Mar  7 10:22:15.408 [18289] dbg: extracttext: use: odt2txt name .*\.sdw
Mar  7 10:22:15.408 [18289] dbg: extracttext: use: odt2txt name .*\.stw
Mar  7 10:22:15.408 [18289] dbg: extracttext: use: odt2txt type application/(?:x-)?soffice
Mar  7 10:22:15.408 [18289] dbg: extracttext: use: odt2txt type application/(?:x-)?starwriter
Mar  7 10:22:15.409 [18289] dbg: extracttext: external: pdftohtml /usr/bin/pdftohtml,-i,-xml,-stdout,-noframes,${file}
Mar  7 10:22:15.409 [18289] dbg: extracttext: external: pdftotext /usr/bin/pdftotext,-q,-nopgbrk,-enc,UTF-8,${file},-
Mar  7 10:22:15.409 [18289] dbg: extracttext: use: pdftotext name .*\.pdf
Mar  7 10:22:15.409 [18289] dbg: extracttext: use: pdftotext type application/pdf
Mar  7 10:22:18.048 [18289] dbg: extracttext: MIME database: /usr/share/mime
Mar  7 10:22:18.152 [18289] dbg: extracttext: Part: application/rtf RTF.rtf
Mar  7 10:22:18.152 [18289] dbg: extracttext: Match: name RTF.rtf =~ .*\.rtf
Mar  7 10:22:18.213 [18289] dbg: extracttext: External call: unrtf /usr/local/bin/unrtf,-t,ExtractText.tags,--nopict
Mar  7 10:22:18.214 [18289] info: extracttext: External extraction command: /usr/local/bin/unrtf,-t,ExtractText.tags,--nopict
Mar  7 10:22:18.214 [18289] info: extracttext: External extraction object: 17 application/rtf RTF.rtf
Mar  7 10:22:18.214 [18289] info: extracttext: External extraction error: unrtf 0 ?
Mar  7 10:22:18.259 [18289] dbg: extracttext: Not extracted
Mar  7 10:22:18.259 [18289] dbg: extracttext: X-ExtractText-Words: 0
Mar  7 10:22:18.259 [18289] dbg: extracttext: X-ExtractText-Chars: 0
Mar  7 10:22:18.389 [18289] dbg: bayes: header tokens for x-extracttext-chars = 0
Mar  7 10:22:18.389 [18289] dbg: bayes: header tokens for x-extracttext-words = 0

Thanks,
Scott Ostrander


Rspamd project

2013-03-07 Thread Vsevolod Stakhov

Hello,

I've decided to write to the SA users list about the status of the rspamd
project[1], since this is the second time rspamd has been mentioned on
this list. However, I was not subscribed to it, so I cannot reply
directly to the original author of the post.


The phrase quoted in the original post, "With similar rules, rspamd is
about ten times faster than SpamAssassin", was my mistake: it only
describes a comparison of SA and rspamd on a rather specific ruleset,
selected by filtering the overall SA ruleset against our specific mail
payload (this set included about 100 rules). I am sorry about this
phrase, which is not true in the general case, as rspamd does not
support all features of SA and does not have the same ruleset.


Nevertheless, while implementing rspamd I took into consideration the
main performance problems I had found in SA: too many regexp checks for
each action (for example, in the Received header parsing code), too many
repeated checks of the same text, and so on. Rspamd tries to fix these
problems by using specialized finite state machines, using tries for
pattern matching, having a rule planner that runs more probable checks
before less probable ones, and so on. Moreover, rspamd can use thread
pools for statistics and regexp checks, which allows it to scale easily
on multi-core machines. As a result, on the rules that we selected for
porting from SA to rspamd, rspamd was several times faster than SA. In
fact, we could not afford the check speed of SA with our volume of mail
and our number of servers, and rspamd solved the problem at that time.
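
As an illustration of the trie idea only (this is a generic sketch, not
rspamd's code; the class and method names are made up), a trie lets one
pass over the text test many literal patterns at once instead of running
one regexp per pattern:

```python
class PatternTrie:
    """Character trie holding a set of literal patterns."""

    _END = None  # key marking "a pattern ends at this node"

    def __init__(self, patterns):
        self.root = {}
        for pattern in patterns:
            node = self.root
            for char in pattern:
                node = node.setdefault(char, {})
            node[self._END] = pattern

    def find_all(self, text):
        """Return (offset, pattern) for every pattern occurrence in text."""
        hits = []
        for start in range(len(text)):
            node = self.root
            for index in range(start, len(text)):
                node = node.get(text[index])
                if node is None:
                    break  # no stored pattern continues with this character
                if self._END in node:
                    hits.append((start, node[self._END]))
        return hits
```

For example, PatternTrie(["viagra", "via", "free"]).find_all("free viagra")
reports "free" at offset 0 and both "via" and "viagra" at offset 5. A
production matcher would usually add failure links (Aho-Corasick) so the
text is never rescanned.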


Furthermore, I focused on maximum performance while writing the code for
other rspamd modules, for example DKIM, SPF and SURBL, trying to avoid
resource-greedy libraries (like opendkim or libspf2). The statistics
module is based on the Markovian Bayes algorithm with the OSB tokenizer
from crm114, which behaves more accurately in my tests than the unigram
Bayes that SA uses by default.
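
For readers unfamiliar with OSB (orthogonal sparse bigrams): instead of
single words, each feature pairs the current token with one of the few
tokens before it and records the gap between them. The sketch below shows
that idea only; it is not the crm114 or rspamd implementation, the function
name is hypothetical, and the window of 5 is an assumed default.

```python
def osb_features(tokens, window=5):
    """Pair each token with up to window-1 preceding tokens.

    Each feature is (earlier_token, skipped_count, current_token), where
    skipped_count is the number of tokens lying between the two.
    """
    features = []
    for i, current in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # ran off the start of the message
            features.append((tokens[j], distance - 1, current))
    return features
```

A unigram classifier would see "buy cheap pills" as three independent
words, while the pairs above keep some of the word-order context.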


In conclusion, I'd like to add a few words about the immature state of
the project. Unfortunately, I have developed it with only a single
client in mind. Therefore rspamd cannot be compared with SA in terms of
the number of features; however, it can be useful for those who do not
require every single feature of SA but want something oriented towards
performance and statistical checks. I'm very keen to attract more users
to the rspamd project, so if you have any questions or want to try
rspamd, please feel free to contact me.


Finally, sorry for this message, which is not directly connected with
the SA project.


[1]: https://bitbucket.org/vstakhov/rspamd/

--
Vsevolod Stakhov


Re: Rspamd project

2013-03-07 Thread Kevin A. McGrail

  
  
At the end of the day, we are all here to fight bastard spammers, and
thanks for taking the time to write this description. Perhaps there will
be some synergy to borrow code and ideas one way or the other, or to get
you working on our project as well. Licensing issues aside, thanks for
subscribing and for your work combating spam.

Regards,
KAM
  



-- 
Kevin A. McGrail
President

Peregrine Computer Consultants Corporation
3927 Old Lee Highway, Suite 102-C
Fairfax, VA 22030-2422

http://www.pccc.com/

703-359-9700 x50 / 800-823-8402 (Toll-Free)
703-359-8451 (fax)
kmcgr...@pccc.com


Re: Romance spam

2013-03-07 Thread Benny Pedersen

Kenneth Porter wrote on 2013-03-06 18:04:

--On Wednesday, March 06, 2013 9:27 AM -0500 Kevin A. McGrail
kmcgr...@pccc.com wrote:

I haven't seen any of this at all.  Do you have an example on pastebin
and I can look through my logs? Might be getting hammered by another
rule/rbl/etc.


Here's an example:

http://sewingwitch.com/ken/Stuff/spamExample.txt


Only Bayes is hitting? And it autolearns as ham?


R: Rspamd project

2013-03-07 Thread Giampaolo Tomassoni
I see there would be problems in naming your project RSA. Nevertheless, is
there any plan to have the current rspamd features in a library, in order to
allow third parties to develop their own message handling interface wrapping
it?

Giampaolo



Re: Understanding spamhaus FP

2013-03-07 Thread Matt Kettler
On 3/7/2013 1:51 PM, Alex wrote:
 Hi,

 I received an email that was tagged with KHOP_SPAMHAUS_DROP, which
 means it was listed in the Spamhaus Don't Route Or Peer List.
 However, I've checked every IP and domain in the email, and none are
 listed on any spamhaus list, even as of a minute ago. What is it in
 this message that is being tagged?

 http://pastebin.com/qPq9ah7P


First, I'll disclaim I'm a bit rusty here... It's been a year or two
since I've had time to contribute to SpamAssassin much. But perhaps I
can be of some help.

The SPAMHAUS_DROP list is only available from them as a text file or as
a BGP feed. It is not a live DNS query like their other lists.

http://www.spamhaus.org/drop/drop.txt

However, I agree none of the IPs seem to be in the drop list.
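
Checking an address against a downloaded copy of the list can be done
offline. The sketch below assumes the file has one CIDR range per line,
with anything after a ';' treated as a comment; that layout is an
assumption about drop.txt, the sample entries are documentation ranges
rather than real DROP data, and the function names are made up.

```python
import ipaddress


def parse_drop(text):
    """Parse DROP-style text into a list of ip_network objects."""
    networks = []
    for line in text.splitlines():
        entry = line.split(";", 1)[0].strip()  # drop trailing/whole-line comments
        if entry:
            networks.append(ipaddress.ip_network(entry))
    return networks


def listed_in(address, networks):
    """Return every network in the list that contains the given address."""
    ip = ipaddress.ip_address(address)
    return [net for net in networks if ip in net]


# Hypothetical sample in the assumed format (not real DROP entries).
SAMPLE = """; example list
192.0.2.0/24 ; SBL000001
198.51.100.0/24 ; SBL000002
"""
```

Running each relay IP from the Received headers through listed_in() against
a fresh download shows whether the current list still contains it, which
can then be compared with what the installed rule matches.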

It looks like the rule in question is published by khopesh.com, not the
SA core ruleset... I'm assuming you are using an update channel from
http://khopesh.com/wiki/Anti-spam.

Regardless, since the list is a text file, it looks like it is being
auto-converted to a SpamAssassin rule, which makes it semi-static.
Generally this is ok, as the DROP list doesn't change very fast.
However, it does change, and what's on your SpamAssassin box may not
reflect the current drop list. I'm not really up to speed on the khopesh
feed, so I'm not sure how often that rule gets regenerated. For that
matter, I'm also not sure how often you are fetching sa-updates from
them.

I *think* if you run the message through spamassassin -D it might show
you which text matched the rule when it hits.. which should give you
some answers...