Package: spamassassin
Version: 3.3.1-1
Severity: normal

Dear SpamAssassin Maintainers,

on a machine running Spamd for a mail server with approx. 1500 active
users the spamd master process seems to become unresponsive once every
few weeks.

Restarting spamassassin seems to suffice to fix this problem. "kill -9"
was not necessary. Nevertheless it causes all mail delivery to stall due
to spamc no more being able to get a result from spamd. (spamc and spamd
run on differnt machines, both being Xen DomUs.)

An additional symptom of this issue seems the presence of spamd zombie
children looking like this:

<username>    6465  0.0  0.0      0     0 ?        ZN   20:16   0:00 [spamd] 

Those zombie spamd children seem to start approx. half a day before
spamd actually stops to work completely, at least in today's case:

That zombie process above was started at 20:16. Our monitoring system
noticed it as zombie from 20:22 to 08;14 on the following day. (Checking
happens every 5 minutes.)

The amount of procmail processes (which call spamc) running in parallel
reached the warning level of 20 at 07:33, and critical level of 100 at
08:13.

The last sucessfully delivered (test) mail reached its destination at
07:25. (Happens every 5 minutes, too.)

Around 08:15 we restarted spamassassin and everything was fine again.

Looking at the spamd logs revealed that spamd seem to have worked fine
until it got killhupped at 07:27:

# fgrep spamd mail.log
[…]
Mar 13 07:27:07 … spamd[27526]: spamd: connection from … […] at port 39595
Mar 13 07:27:07 … spamd[27526]: spamd: setuid to … succeeded
Mar 13 07:27:07 … spamd[27526]: spamd: processing message <…> for …:…
Mar 13 07:27:10 … spamd[27526]: spamd: clean message (-1.2/5.0) for …:… in 3.3 
seconds, 66598 bytes.
Mar 13 07:27:10 … spamd[27526]: spamd: result: . -1 - 
BAYES_00,HTML_MESSAGE,RCVD_IN_DNSWL_MED 
scantime=3.3,size=66598,user=…,uid=…,required_score=5.0,rhost=…,raddr=…,rport=39595,mid=<…>,bayes=0.000000,autolearn=ham
Mar 13 07:27:11 … spamd[6165]: prefork: child states: KII
Mar 13 07:27:14 … spamd[6165]: spamd: server hit by SIGHUP, restarting
Mar 13 07:27:14 … spamd[6165]: spamd: child [31179] killed successfully: 
interrupted, signal 2 (0002)
Mar 13 07:27:14 … spamd[6165]: spamd: child [27526] killed successfully: 
interrupted, signal 2 (0002)
Mar 13 07:28:17 … spamd[6165]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Mar 13 07:28:17 … spamd[6165]: prefork: select returned error on server 
filehandle: 
Mar 13 08:14:12 … spamd[6165]: spamd: server killed by SIGTERM, shutting down
Mar 13 08:14:12 … spamd[1972]: logger: removing stderr method
Mar 13 08:14:20 … spamd[1975]: spamd: server started on port 783/tcp (running 
version 3.3.1)
Mar 13 08:14:20 … spamd[1975]: spamd: server pid: 1975
Mar 13 08:14:20 … spamd[1975]: spamd: server successfully spawned child 
process, pid 1981
Mar 13 08:14:20 … spamd[1975]: spamd: server successfully spawned child 
process, pid 1982
Mar 13 08:14:20 … spamd[1975]: prefork: child states: BI
Mar 13 08:14:20 … spamd[1981]: spamd: connection from … […] at port 46441
Mar 13 08:14:20 … spamd[1982]: spamd: connection from … […] at port 46442
[…]

cron.daily runs at 06:25 localtime (syslog was rotated around 06:27), so
it sounds a little bit strange that /etc/cron.daily/spamassassin only
finished at 07:27, but I currently have no other explanation for the
above seen "kill -HUP" than the "invoke-rc.d spamassassin reload" from
/etc/cron.daily/spamassassin.

I suspect that the following two lines, the last ones before the SA
restart, are related to the non-responsiveness of spamd:

Mar 13 07:28:17 … spamd[6165]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Mar 13 07:28:17 … spamd[6165]: prefork: select returned error on server 
filehandle: 

(And there's indeed no further output from spamd in the log behind the
latter line.)

Possibly related to http://bugs.debian.org/611040 or
http://bugs.debian.org/346237 (which both have no information about
symptoms, just more or less the error messages)

I found this error message also at the following days in the log:

# zfgrep -2h 'prefork: select returned error on server filehandle:' mail.log* | 
fgrep spamd
Dec 22 08:25:06 … spamd[22411]: prefork: sysread(7) failed after 300 secs at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 654.
Dec 22 08:25:06 … spamd[22400]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Dec 22 08:25:06 … spamd[22400]: prefork: select returned error on server 
filehandle: 
Dec 22 08:25:06 … spamd[22400]: prefork: child states: II
Dec 22 08:25:06 … spamd[22400]: prefork: child states: II
Jan  9 11:41:16 … spamd[25239]: prefork: sysread(7) failed after 300 secs at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 654.
Jan  9 11:41:16 … spamd[6091]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Jan  9 11:41:16 … spamd[6091]: prefork: select returned error on server 
filehandle: 
Jan  9 11:41:16 … spamd[6091]: prefork: child states: II
Jan  9 11:41:16 … spamd[6091]: prefork: child states: II
Jan 19 14:25:31 … spamd[14526]: prefork: sysread(8) failed after 300 secs at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 654.
Jan 19 14:25:31 … spamd[31699]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Jan 19 14:25:31 … spamd[31699]: prefork: select returned error on server 
filehandle: 
Jan 19 14:25:31 … spamd[31699]: prefork: child states: II
Jan 19 14:25:31 … spamd[31699]: prefork: child states: II
Jan 27 06:38:09 … spamd[19644]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Jan 27 06:38:09 … spamd[19644]: prefork: select returned error on server 
filehandle: 
Jan 27 06:38:09 … spamd[26848]: spamd: connection from … […] at port 49023
Jan 27 06:38:09 … spamd[787]: spamd: connection from … […] at port 49083
Feb 15 06:46:03 … spamd[10898]: prefork: sysread(8) failed after 300 secs at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 654.
Feb 15 06:46:03 … spamd[28626]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Feb 15 06:46:03 … spamd[28626]: prefork: select returned error on server 
filehandle: 
Feb 15 06:46:03 … spamd[28626]: prefork: child states: II
Feb 15 06:46:03 … spamd[28626]: prefork: child states: II
Mar  2 01:37:27 … spamd[378]: prefork: sysread(10) failed after 300 secs at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 654.
Mar  2 01:37:27 … spamd[637]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Mar  2 01:37:27 … spamd[637]: prefork: select returned error on server 
filehandle: 
Mar  2 01:37:27 … spamd[377]: prefork: sysread(9) failed after 300 secs at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 654.
Mar  2 01:37:28 … spamd[637]: prefork: child states: II
Mar 13 07:28:17 … spamd[6165]: Use of uninitialized value $selerr in 
concatenation (.) or string at 
/usr/share/perl5/Mail/SpamAssassin/SpamdForkScaling.pm line 332.
Mar 13 07:28:17 … spamd[6165]: prefork: select returned error on server 
filehandle: 

While it seems that some of these spamd processes continued to work, the
dates seem to relate to cases where no mail went through for more than
45 minutes (and then the monitoring alarm went off):

Sat Mar 02 00:50:09 2013
Fri Feb 15 06:00:09 2013

(We don't have a monitoring history for those kind of issues before 29th
of January as we only started them after the analysis of the spamd
non-responsiveness incident on 27th of January.)

Nevertheless these alarms predate the according error messages by 45
minutes instead of the other way round as I would have expected, so I'm
not really sure what's cause and what's effect.

So far for the information about the beginning of spamd no more properly
responding to requests.

Now about the zombie processes that seem to be related as at least in
the cases on 29th of January and today zombie spamd children were around
while spamd wasn't working properly anymore and needed manual
intervention:

These are the syslog entries of the process which became a zombie later:

# fgrep '… spamd[6465]:' mail.log
Mar 12 20:16:40 … spamd[6465]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Mar 12 20:16:40 … spamd[6465]: check: (run_generic) exceeded time limit, 
skipping further tests
Mar 12 20:16:40 … spamd[6465]: spamd: clean message (-3.5/5.0) for …:… in 302.5 
seconds, 66181 bytes.
Mar 12 20:16:40 … spamd[6465]: spamd: result: . -3 - 
BAYES_00,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RP_MATCHES_RCVD,SPF_HELO_PASS,TIME_LIMIT_EXCEEDED
 
scantime=302.5,size=66181,user=…,uid=…,required_score=5.0,rhost=…,raddr=…,rport=53506,mid=<…>,bayes=0.000000,autolearn=unavailable
Mar 12 20:16:41 … spamd[6465]: spamd: accidental fork: 6465 != 5813 at 
/usr/sbin/spamd line 1682.

Sounds at least similar to
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4846

All occurrences of "spamd: accidental fork" in the past 3 months:

# zfgrep -h … mail.log* | egrep 'spamd.* fork'
Jan 26 12:41:04 … spamd[5384]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Jan 26 12:41:05 … spamd[5389]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Jan 26 12:41:05 … spamd[5384]: spamd: accidental fork: 5384 != 4883 at 
/usr/sbin/spamd line 1682.
Jan 26 12:41:06 … spamd[4883]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Jan 26 12:41:06 … spamd[5389]: spamd: accidental fork: 5389 != 4885 at 
/usr/sbin/spamd line 1682.
Jan 26 12:41:07 … spamd[4885]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Jan 26 12:41:07 … spamd[5394]: spamd: accidental fork: 5394 != 4887 at 
/usr/sbin/spamd line 1682.
Jan 26 12:41:08 … spamd[5406]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Jan 26 12:41:09 … spamd[5406]: spamd: accidental fork: 5406 != 4895 at 
/usr/sbin/spamd line 1682.
Jan 26 12:41:37 … spamd[5451]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283, <CONF> line 34. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454, <CONF> line 34.
Jan 26 12:41:38 … spamd[5451]: spamd: accidental fork: 5451 != 4865 at 
/usr/sbin/spamd line 1682, <CONF> line 34.
Jan 26 12:41:39 … spamd[4865]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283, <CONF> line 34. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454, <CONF> line 34.
Mar 12 20:16:40 … spamd[6465]: pyzor: check failed: util: cannot fork: child 
processing timeout at /usr/sbin/spamd line 1283. at 
/usr/share/perl5/Mail/SpamAssassin/Util.pm line 1454.
Mar 12 20:16:41 … spamd[6465]: spamd: accidental fork: 6465 != 5813 at 
/usr/sbin/spamd line 1682.
Mar 12 20:16:41 … spamd[6467]: spamd: accidental fork: 6467 != 5818 at 
/usr/sbin/spamd line 1682.
Mar 12 20:16:43 … spamd[6469]: spamd: accidental fork: 6469 != 5826 at 
/usr/sbin/spamd line 1682.

The dates are only dates on which we also saw zombies and after which we
has non-responsiveness issues a few hours later, too,. (I don't think
that's a coincidence. :)

Feel free to clone and retitle this bug report accordingly in case there
are several independent (and not yet reported) issues in here.

P.S.: I don't think this is related to http://bugs.debian.org/501964 as
we don't suspend and resume our DomUs.

-- System Information:
Debian Release: 6.0.7
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-xen-amd64 (SMP w/1 CPU core)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages spamassassin depends on:
pn  libarchive-tar-perl    <none>            (no description available)
ii  libdigest-sha1-perl    2.13-1            NIST SHA-1 message digest algorith
ii  libhtml-parser-perl    3.66-1            collection of modules that parse H
ii  libnet-dns-perl        0.66-2            Perform DNS queries from a Perl sc
ii  libnetaddr-ip-perl     4.028+dfsg-1      IP address manipulation module
ii  libsocket6-perl        0.23-1            Perl extensions for IPv6
ii  libsys-hostname-long-p 1.4-2             Figure out the long (fully-qualifi
ii  libwww-perl            5.836-1           Perl HTTP/WWW client/server librar
ii  perl                   5.10.1-17squeeze6 Larry Wall's Practical Extraction 
ii  perl-modules [libio-zl 5.10.1-17squeeze6 Core Perl modules

Versions of packages spamassassin recommends:
ii  gcc                    4:4.4.5-1         The GNU C compiler
ii  gnupg                  1.4.10-4+squeeze1 GNU privacy guard - a free PGP rep
ii  libc6-dev              2.11.3-4          Embedded GNU C Library: Developmen
ii  libio-socket-inet6-per 2.65-1.1          Object interface for AF_INET6 doma
ii  libmail-spf-perl       2.007-1           Perl implementation of Sender Poli
ii  make                   3.81-8            An utility for Directing compilati
ii  perl [libsys-syslog-pe 5.10.1-17squeeze6 Larry Wall's Practical Extraction 
ii  re2c                   0.13.5-1          tool for generating fast C-based r
ii  spamc                  3.3.1-1           Client for SpamAssassin spam filte

Versions of packages spamassassin suggests:
ii  libdbi-perl            1.612-1           Perl Database Interface (DBI)
ii  libio-socket-ssl-perl  1.33-1+squeeze1   Perl module implementing object or
ii  libmail-dkim-perl      0.38-1            cryptographically identify the sen
ii  libnet-ident-perl      1.23-1            Perl module to lookup the username
ii  perl [libcompress-zlib 5.10.1-17squeeze6 Larry Wall's Practical Extraction 
ii  pyzor                  1:0.5.0-2         spam-catcher using a collaborative
ii  razor                  1:2.85-3          spam-catcher using a collaborative

-- Configuration Files:
/etc/default/spamassassin changed:
ENABLED=1
OPTIONS="-m50 -i -A … --max-conn-per-child=200"
PIDFILE="/var/run/spamd.pid"
NICE="--nicelevel 2"
CRON=1

/etc/spamassassin/local.cf changed:
trusted_networks …
internal_networks …
report_safe 0
report_contact postmaster@…
add_header ham Report _REPORT_
pyzor_timeout 30
use_pyzor 1
pyzor_path /usr/bin/pyzor
bayes_sql_override_username     default
bayes_store_module              Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn                   DBI:mysql:…:…
bayes_sql_username              …
bayes_sql_password              …
score RCVD_IN_MAPS_RBL 0
score RCVD_IN_MAPS_DUL 0
score RCVD_IN_MAPS_RSS 0
score RCVD_IN_MAPS_NML 0
score HELO_LOCALHOST 0
score RCVD_NUMERIC_HELO 1.3
score MIME_QP_LONG_LINE 0.7
score FROM_LOCAL_NOVOWEL 2.3
score FROM_NO_LOWER 0
score RCVD_IN_SORBS_DUL 1.0
score RCVD_IN_SORBS_DUL 0.5
score HELO_DYNAMIC_IPADDR2 0.5
score FH_HELO_EQ_D_D_D_D 0.2
score URIBL_GREY 1.0
score URIBL_BLACK 2.0
score URIBL_PH_SURBL 0.2
score URIBL_AB_SURBL 0.2
score URIBL_WS_SURBL 0.2
score URIBL_JP_SURBL 0.2
score URIBL_OB_SURBL 0.2
score URIBL_SC_SURBL 0.2
score URIBL_PH_SURBL 0.2
score RCVD_IN_DNSWL_MED -1.0
score EXTRA_MPART_TYPE 0.8
score MISSING_SUBJECT   0.45
score MIME_BASE64_TEXT  0.9
score MIME_HTML_ONLY    0.75
score RCVD_IN_HEISE_RBL 1.8
score DATE_IN_PAST_03_06 0.1
score REPLYTO_WITHOUT_TO_CC 0.2
score FUZZY_OCR_WRONG_EXTENSION 0.1
score BAYES_00 0 0 -0.20 -0.25
score BAYES_05 0 0 -0.10 -0.20
score BAYES_20 0 0 -0.05 -0.1
score BAYES_40 0 0 -0.01 -0.05
[local rules I do not want to disclose publicly…]
[whitelist I do not want to disclose publicly…]
[blacklist I do not want to disclose publicly either…]

/etc/spamassassin/v310.pre changed:
loadplugin Mail::SpamAssassin::Plugin::Pyzor
loadplugin Mail::SpamAssassin::Plugin::Razor2
loadplugin Mail::SpamAssassin::Plugin::AutoLearnThreshold
loadplugin Mail::SpamAssassin::Plugin::WhiteListSubject
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader
loadplugin Mail::SpamAssassin::Plugin::ReplaceTags


-- no debconf information


--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Reply via email to