Re: [Bug 4603] RFE: Apache::SpamD module, to run spamd from httpd

Justin Mason Thu, 27 Jul 2006 10:44:30 -0700

Radoslaw Zielinski writes:
> > http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4603
> > ------- Additional Comments From [EMAIL PROTECTED]  2006-07-26 18:17 -------
> 
> I dislike the idea of using Bugzilla as a replacement for a mailing list
> (bleh, why doesn't ASF use RT); let's move here, if you don't mind...


OK, as long as you find the thread on nabble and post a pointer to the
bug; it's a *lot* easier to track down a BZ discussion 6 months down the
line, than find a mailing list thread.

by the way -- bugzilla tip -- the MozEX extension is a god-send for
dealing with bugzilla's small text entry boxes, allowing you to use
a decent external editor instead.

> [...]
> > Using IPC::Open3 is a nightmare for portability, btw -- I'm pretty sure it
> > doesn't work on win32 at least -- but maybe there are other issues there 
> > anyway?
> 
> I avoided using shell... well, this can be easily changed.

Yep, perl's own 'open "...|"' shell escapes are actually more portable.
sa-update's code is worth looking at, for an example.
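
To make the pipe-open idea concrete, here is a minimal sketch. It is not copied from sa-update; the `run_and_capture` helper and the `echo` command are assumptions made up for illustration. Note that the command string is handed to the shell, so untrusted input must never be interpolated into it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from sa-update): run a command via perl's
# built-in pipe open and return its output lines.
sub run_and_capture {
    my ($command) = @_;

    # A trailing "|" tells open() to run the command and read its stdout.
    open(my $pipe, "$command |")
        or die "cannot run '$command': $!";
    my @output = <$pipe>;

    # close() on a pipe waits for the child and sets $? to its exit status.
    close($pipe)
        or warn "'$command' exited with status $?\n";
    return @output;
}

# "echo" is only a stand-in for a real external program.
print run_and_capture("echo hello");
```

Unlike IPC::Open3 this captures stdout only; stderr would need a shell redirect in the command string.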

> > how does it compare to current spamd, in speed terms?
> 
> 174%, crushes the hacky 0.0002s optimizations like cockroaches.

ha! I suspect these numbers are without any ruleset, though ;) Also, worth
noting that spamd does some time-consuming tasks that apache-spamd doesn't
(like log via syslog).

>   $ tail -n1 *.log
>   ==> prefork.log <==
>   parsed 2000 messages in 00:04:32 (272.930377 s),
>   7.3279 msgs/s (440 msgs/min, 26380 msgs/h)
> 
>   ==> spamd.log <==
>   parsed 2000 messages in 00:08:00 (480.140767 s),
>   4.1654 msgs/s (250 msgs/min, 14996 msgs/h)
> 
>   ==> worker.log <==
>   parsed 2000 messages in 00:04:35 (275.170448 s),
>   7.2682 msgs/s (436 msgs/min, 26166 msgs/h)
> 
> Apache-spamd / spamd run with -x -m 5, Bench-spamd.pl with -c 3 -m 2000.
> Hardware: Athlon 1.7xp, 700MB RAM.

so prefork.log and worker.log are both using apache-spamd, with
those MPMs?  That's a pretty excellent speedup.

> > Regarding logging.  What's the issue?  (I couldn't actually spot any
> > logging in that tarball.)
> 
> Apache redirects stderr to error_log, I don't know how to capture it
> (OTOH, I haven't been looking for it, but I don't think it's a good
> idea).  The ErrorLog directive doesn't support redirecting to syslog.
> 
> So, all the debug messages from SA and some startup errors detected
> at the config phase are logged.  This isn't:
> 
>   [5273] info: spamd: connection from localhost [127.0.0.1] at port 2347
>   [5273] info: spamd: checking message <[EMAIL PROTECTED]> for (unknown):500
>   [5273] info: spamd: clean message (0.0/5.0) for (unknown):500 in 0.2 seconds, 5978 bytes.
>   [5273] info: spamd: result: . 0 - scantime=0.2,size=5978,user=(unknown),uid=500,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=2347,mid=<[EMAIL PROTECTED]>,autolearn=disabled
> 
> I have not attained enlightenment about the correct way to do it yet.
> 
> That would require opening a file to write at some stage, passing the
> filehandle somehow (global var probably), locking...  If a syslog socket
> has been requested, I guess separate connections are needed...  Complex
> and error prone.

for what it's worth, I'd say:

  - forget about syslog; apache has its own logging model which doesn't
    involve that, so we don't have to either ;)

  - open ">>" filehandles have atomic writes for inter-process contention,
    if you use syswrite(), and the target of the fh is a file on a local
    filesystem [*]. So that's a good way to log data atomically:

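        # append mode; stop early if the log cannot be opened
        open(LOG, ">>", "logfile") or die "cannot open logfile: $!";
        [...]
        my $message = "info: foo bar baz\n";
        # one unbuffered write() call per complete line
        syswrite LOG, $message;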

    global var would be ok, although I can see people wanting to have
    different logfiles for different vhosts...

    ([*]: well, atomic enough, at least. see
    lib/Mail/SpamAssassin/BayesStore/DBM.pm for workarounds in the case
    when partial writes occur due to out-of-disk-space conditions --
    that's when it gets messy!)
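
A slightly fuller sketch of the append-and-syswrite approach above, for illustration only. The `log_line` helper, the `$LOG_PATH` variable and the temporary file are made up for this example and are not part of spamd or the apache-spamd tarball; the line format only loosely imitates the spamd log lines quoted earlier.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_APPEND O_CREAT);
use File::Temp qw(tempfile);

# Hypothetical log path; a real handler would take this from its config.
my (undef, $LOG_PATH) = tempfile();

# Open once per process in append mode. With O_APPEND the kernel seeks to
# end-of-file on every write, so processes sharing the file do not
# overwrite each other's lines.
sysopen(my $LOG, $LOG_PATH, O_WRONLY | O_APPEND | O_CREAT, 0644)
    or die "cannot open $LOG_PATH: $!";

# Hypothetical helper: build the whole line first, then pass it to the
# kernel in a single syswrite() call, bypassing perl's own buffering.
sub log_line {
    my ($level, $text) = @_;
    my $line = sprintf("[%d] %s: %s\n", $$, $level, $text);
    my $written = syswrite($LOG, $line);
    die "write to $LOG_PATH failed: $!" unless defined $written;
    # Partial writes (e.g. disk full) are only reported here, not repaired.
    warn "partial write to $LOG_PATH\n" if $written != length $line;
    return $written;
}

log_line('info', 'spamd: foo bar baz');
```

Per-vhost logfiles could presumably be handled by keeping one such handle per path in a hash instead of a single global.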

> Adding complexity is easy, keeping it simple and obvious makes a worthy
> challenge.

yep -- feel free to ask, of course!

> > Should it be integrated into the main distro, or kept as a separate module
> > with its own Makefile.PL, do you think?  (I think I'd prefer to integrate,
> > if possible.)
> 
> If it's not integrated... will be lost, in time.

ok, agreed.

> > And finally, I think it could do with more documentation and tests ;)  a
> > lot of that would probably make more sense after the
> > integration-into-distro question is resolved (e.g. "what README does it
> > go into").
> 
> I'd go for separate README.apache to keep things transparent.

Sure; like the spamc model.  But there has to be other integration into
documentation, the top-level README, INSTALL, etc. at least.


> Right now, this is written as a PerlProcessConnectionHandler (mod_perl
> handler for custom protocols).  I just figured out it *can* be done
> using the more popular HTTP handlers (PerlResponseHandler and friends)
> and I'm experimenting with it right now.
> 
> That would have two benefits I see right now (I doubt it'd change
> anything regarding performance).
> 
> The first is the possibility of using mod_log_config (the CustomLog
> directive).  If we agree to compress those four log lines per connection
> into one, it would be a clean and efficient way to get the access logging
> done.

Sure. Note however that the "result:" line --

    [5273] info: spamd: result: . 0 - scantime=0.2,size=5978,user=(unknown),uid=500,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=2347,mid=<[EMAIL PROTECTED]>,autolearn=disabled

has a pretty well-defined format, and log parsers that know how to read
it.  it'd be good to keep that.
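
To show why the fixed format matters to parsers, here is a rough sketch of one, written only from the sample line above and not taken from any existing log-analysis tool. The `parse_result_line` helper is hypothetical. It assumes the first field is `Y` for spam and `.` otherwise, that the field after the `-` is a possibly empty comma-separated list of rule names, that the key=value list starts at `scantime=`, and that no value contains a comma.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one "spamd: result:" log line into a hash ref,
# or return nothing if the line does not look like a result line.
sub parse_result_line {
    my ($line) = @_;
    my ($flag, $score, $tests, $kv) =
        $line =~ /spamd: result: (\S) (-?[\d.]+) - (.*?)\s*(scantime=.*)$/
        or return;

    my %info = (
        is_spam => ($flag eq 'Y' ? 1 : 0),   # assumption: 'Y' marks spam
        score   => $score,
        tests   => [ split /,/, $tests ],    # empty list if no rules hit
    );

    # The remaining fields are comma-separated key=value pairs.
    for my $pair (split /,/, $kv) {
        my ($key, $value) = split /=/, $pair, 2;
        $info{$key} = $value;
    }
    return \%info;
}

my $sample = '[5273] info: spamd: result: . 0 - scantime=0.2,size=5978,'
           . 'user=(unknown),uid=500,required_score=5.0,rhost=localhost,'
           . 'raddr=127.0.0.1,rport=2347,mid=<[EMAIL PROTECTED]>,'
           . 'autolearn=disabled';
my $r = parse_result_line($sample);
printf "spam=%d size=%s scantime=%s\n", $r->{is_spam}, $r->{size}, $r->{scantime};
```

Any change to the field order or separators would break a parser of this kind, which is the argument for keeping the line as it is.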

> Second one...  Well, here it is; try to keep an open mind. ;-)
> I'm reading http://catb.org/esr/writings/taoup/ right now; around the
> chapter about protocol design it bugged me: why isn't the spamd protocol
> based on HTTP?
> 
> Gain: forget the fancy libspamc, forget Mail::SpamAssassin::Client, be
> done with parts of spamd's network-related code ("sysread not ready"
> anyone?), reduce trash code in various spamc implementations (exim,
> whatever)...  Just use an HTTP library to do a simple POST (and make sure
> the library allows you to read the Spam header after a 2xx response).
> 
> So.  If I used the mod_perl HTTP handlers, that would get us very close
> to rolling out the SPAMD/2.0 protocol [1].  After some code refactoring,
> it'd be possible to use spamd as FastCGI (or regular CGI, if someone
> wishes) with any HTTP server.  Authentication?  Just get a mod_auth*
> module.  Compression?  mod_deflate.  Whatever?  mod_whatever.
> 
>   POST /?method=PROCESS HTTP/1.1
> 
> The more I think about it, the more I like the idea.

Wow.  That's scary. ;) I'll have to think about that one.

I'm not sure I see *sufficient* benefit, in terms of the other parts of
the code, though.  The two protocols are both very, very simple; I think
there'd be more code needing to be written to support HTTP (with a new
URL-based, CGI-style parameter-passing scheme), than the existing lines of
code for supporting SPAMD!
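
For comparison, a small sketch that builds the request head for each style as a plain string, with no networking. The helper names are made up. The `SPAMC/1.2` version string and the `Host` header are assumptions for illustration. The HTTP variant is purely hypothetical and follows only the `POST /?method=PROCESS` line quoted above.

```perl
use strict;
use warnings;

my $CRLF = "\r\n";

# Existing style: command line, Content-length header, blank line, message.
# Assumes $message is a byte string, so length() counts bytes.
sub spamc_request {
    my ($message) = @_;
    return join($CRLF,
        'PROCESS SPAMC/1.2',                      # version is an assumption
        'Content-length: ' . length($message),
        '', '') . $message;
}

# Hypothetical HTTP style, based only on the request line quoted above.
sub http_request {
    my ($message, $host) = @_;
    return join($CRLF,
        'POST /?method=PROCESS HTTP/1.1',
        "Host: $host",                            # required by HTTP/1.1
        'Content-Length: ' . length($message),
        '', '') . $message;
}

my $mail = "Subject: test${CRLF}${CRLF}hello${CRLF}";
print spamc_request($mail), "----\n", http_request($mail, 'localhost');
```

On the wire the two differ by little more than the request line, which is why the gain or loss would mostly be in the surrounding client and server code.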

--j.
