http://bugzilla.spamassassin.org/show_bug.cgi?id=3872





------- Additional Comments From [EMAIL PROTECTED]  2004-10-09 14:56 -------
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 01:38:56PM -0400, Theo Van Dinter wrote:
> I'd like to get the file if I can.  If there's actually lines and no
> nulls, it's not a truncate issue, so that's good.  I'd like to see what
> they actually look like though.

Hrm.  I got a copy of an errored journal file:

   1096 c
3562112 m
      4 n
    266 t

so yeah, the problem is the 3.5 million m lines.  They do, by and large, all
look duplicated for some reason.  So the lines indicate ~1.8 million mails
learned, but no corresponding ham/spam count updates or tokens, which is just
wrong.  I don't think you learned 1.8 million mails anyway.
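
For anyone who wants to run the same kind of tally on their own journal, here
is a rough sketch in Python.  It is not SA code, and it only assumes that the
first whitespace-delimited field of each journal line is the record type (c, m,
n, t).  The sample lines are made up for illustration; the rest of each line's
layout is a guess.

```python
from collections import Counter

def count_record_types(lines):
    """Tally journal lines by their first whitespace-delimited field."""
    counts = Counter()
    for line in lines:
        fields = line.split(None, 1)
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts

# Made-up sample lines, NOT taken from the real journal file.
sample_lines = [
    "c 1 0 1097330000 tokenA",
    "m s msgid-1@example",
    "m s msgid-1@example",
    "n 1 0",
    "t 1097330000 tokenA",
]

type_counts = count_record_types(sample_lines)
for kind, n in sorted(type_counts.items()):
    print(f"{n:7d} {kind}")
```

To use it on a real file, pass an open file handle instead of the sample list.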

There is a ton of duplication.  Individual msgid and repeat counts,
respectively:

106     33120
16      3312
1       736
4       184
2       72
10      8
3       4
2       2
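
A breakdown like the one above can be produced with something along these
lines.  Again a sketch only, in Python rather than anything from SA: it assumes
the message id is the last field of each m line, which may not match the real
journal layout, and the sample data is invented.

```python
from collections import Counter

def repeat_histogram(lines):
    """Return {repeat_count: number_of_msgids} for the 'm' lines."""
    per_msgid = Counter()
    for line in lines:
        fields = line.split()
        # Assumption: record type is the first field, msgid is the last one.
        if len(fields) >= 2 and fields[0] == "m":
            per_msgid[fields[-1]] += 1
    # Count how many msgids share each repeat count.
    return Counter(per_msgid.values())

# Invented sample: two msgids seen 3 times each, one seen once.
dup_sample = ["m s a@x"] * 3 + ["m s b@x"] * 3 + ["m s c@x", "n 1 0"]

dup_hist = repeat_histogram(dup_sample)
for repeats, msgids in sorted(dup_hist.items(), reverse=True):
    print(f"{msgids:<7d} {repeats}")
```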

So something's up.  I haven't seen this issue in normal spamd-type usage, so
I'm tempted to blame MailScanner...

However, looking at the code, I found something that seems odd, and
could very well cause the issue.  In fact, allow me to go: OMG!

In 2.6x and 3.0, the sync_journal function (which takes the journal data and
updates the databases) calls seen_put (and seen_delete) to take the
message id and store it in the database.  BUT!  seen_put and seen_delete
check to see if learn_to_journal is set, and if so, they defer the update
to the journal!

OMG OMG OMG!  I can even reproduce this in normal SA mode!

If you use "sa-learn --sync" to sync the journal, the problem doesn't exist,
for some reason.

If you let auto-sync occur during SA runs, the behavior happens due to the
reason above.  I got the behavior by setting:

bayes_learn_to_journal 1
bayes_journal_max_size 1

then shoving messages through causes the problem to occur, and it actually
keeps adding the same message over and over as well since "seen" is never
updated.
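
To make the loop concrete, here is a toy model of it in Python.  It is not the
SA code: ToyStore and its learn method are made up for illustration, and only
the names seen_put, sync_journal and learn_to_journal are borrowed from the
description above.  The point is just that a sync which goes through the
journal-aware seen_put never updates "seen".

```python
class ToyStore:
    """Toy model of the journal/seen interaction (hypothetical, not SA code)."""

    def __init__(self, learn_to_journal):
        self.learn_to_journal = learn_to_journal
        self.journal = []   # pending ("m", flag, msgid) records
        self.seen = {}      # msgid -> flag, stands in for the seen database

    def seen_put(self, msgid, flag):
        if self.learn_to_journal:
            # Deferred: the update goes to the journal, not the database.
            self.journal.append(("m", flag, msgid))
        else:
            self.seen[msgid] = flag

    def learn(self, msgid, flag):
        if msgid in self.seen:
            return False    # already learned, nothing to do
        self.seen_put(msgid, flag)
        return True

    def sync_journal(self):
        pending, self.journal = self.journal, []
        for kind, flag, msgid in pending:
            if kind == "m":
                # The bug: this lands back in the journal.
                self.seen_put(msgid, flag)

buggy = ToyStore(learn_to_journal=True)
for _ in range(3):
    buggy.learn("a@x", "s")   # same message every time
    buggy.sync_journal()      # auto-sync after each message

print(len(buggy.journal), buggy.seen)
```

In this model the journal gains one more copy of the same message on every
round, and the seen table stays empty.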

OMG!  So the easy solution is to either have a special "sync" seen_put and
seen_delete, or kluge the learn_to_journal setting around the calls.  I think
the first is the right solution.  Patch forthcoming, then please test it for
me. :)
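
Roughly what I mean by the first option, as a toy Python model rather than the
actual patch.  The class and the _seen_put_direct helper are hypothetical names
for illustration only; the real fix may look quite different.

```python
class FixedToyStore:
    """Toy model with a sync-only seen_put (hypothetical, not the SA patch)."""

    def __init__(self, learn_to_journal):
        self.learn_to_journal = learn_to_journal
        self.journal = []   # pending ("m", flag, msgid) records
        self.seen = {}      # msgid -> flag, stands in for the seen database

    def _seen_put_direct(self, msgid, flag):
        # Always writes to the database, ignoring learn_to_journal.
        self.seen[msgid] = flag

    def seen_put(self, msgid, flag):
        if self.learn_to_journal:
            self.journal.append(("m", flag, msgid))  # deferred as before
        else:
            self._seen_put_direct(msgid, flag)

    def learn(self, msgid, flag):
        if msgid in self.seen:
            return False    # already learned
        self.seen_put(msgid, flag)
        return True

    def sync_journal(self):
        pending, self.journal = self.journal, []
        for kind, flag, msgid in pending:
            if kind == "m":
                # Sync uses the direct variant, so nothing is re-journaled.
                self._seen_put_direct(msgid, flag)

fixed = FixedToyStore(learn_to_journal=True)
results = []
for _ in range(3):
    results.append(fixed.learn("a@x", "s"))
    fixed.sync_journal()

print(results, fixed.journal, fixed.seen)
```

Here the first learn is journaled and synced into the seen table, and the
later attempts are rejected as already learned.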

I have no idea how this hasn't been seen before.  This code has been there for
ages.  The "m" code was new to 2.6, so it's been over a year.  geez!





