[Bug 5185] Bayesian learning uses different message checksums during exiscan_acl and later sa_learn

bugzilla-daemon Sun, 26 Feb 2012 13:46:44 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5185


--- Comment #26 from Richard van der Hoff <bugzi...@rvanderhoff.org.uk> 
2012-02-26 21:46:18 UTC ---
A few thoughts on this from me:

(In reply to comment #10)
> By the way, comment 3 and comment 4 both suggest this will only affect 
> messages with no Received header. I'm pretty sure that's not the case.

On further inspection, that was wrong. We use the earliest Received header, so
the MTA's frobbing of the time on the last Received header only matters if
there were no other Received headers. I probably looked at a message with no
Received headers when I reported this, so I missed the CRLF vs LF issue. I
still think both issues need addressing, however.
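
To illustrate the CRLF vs LF point, here is a rough sketch in Python. It is
not SpamAssassin code: hashing the raw bytes with SHA-1 is just an assumption
for demonstration, and checksum() and normalise_newlines() are made-up helper
names. It only shows that the same message hashes differently depending on
line endings, and that normalising them first makes the two agree.

```python
import hashlib


def checksum(raw: bytes) -> str:
    # Hypothetical stand-in for a message checksum: SHA-1 over the raw bytes.
    return hashlib.sha1(raw).hexdigest()


def normalise_newlines(raw: bytes) -> bytes:
    # Convert CRLF line endings to bare LF before hashing.
    return raw.replace(b"\r\n", b"\n")


# The same message as seen with LF endings and with CRLF endings.
msg_lf = b"Subject: test\n\nhello\n"
msg_crlf = msg_lf.replace(b"\n", b"\r\n")

# Different bytes, so different checksums...
print(checksum(msg_lf) == checksum(msg_crlf))
# ...unless the line endings are normalised first.
print(checksum(normalise_newlines(msg_crlf)) == checksum(msg_lf))
```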

Anyway:

(In reply to comment #20)
> There needs to be some part of msgid that isn't under the control of spammers,
> otherwise it's trivial for them to prevent their spam ever being learned. They
> can generate as many spams with the same msgid as they like, and they can prime
> the database with an initial dummy high-scoring spam that has no usable tokens
> in common with the rest.

Given that the earliest Received header is most certainly under the control of
the spammers, I certainly don't think we've made anything worse in this regard,
and whilst what we have now might not be perfect, I think calls to put it back
as it was are overstating matters.

Perhaps I'm being dense, but I don't really see how the spammers can use this
to their advantage. Is preventing your spams being learnt really that useful?


(In reply to comment #25)
> I feel we need to aim for a solution that works for everyone as the goal before
> we add yet another configuration option.

Agreed. Flexibility is all well and good, but having millions of configuration
options makes it really hard for people to get a piece of software working as
it should.


(In reply to comment #23)
> I think if we can get a msg_id that is more unique to the message sans the
> transport path, it could IMPROVE bayes use.

Whilst that's true, I have another suggestion.  At the end of the day, we're
just trying to uniquely identify a particular message on our server, right?
Even if I get two copies of a spam, I can learn them as spam separately, I just
want to prevent re-learning each one on subsequent folder scans etc. So how
about trying to extract the local message-id from the most recent Received
header, rather than all this messing about with checksums etc?
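
Something along these lines, as a rough sketch in Python rather than a patch.
The local_id() name and the regex are my own assumptions, and it assumes the
local MTA writes an "id" clause into the Received header it adds, which not
every MTA or configuration does. It takes the topmost Received header (the one
added most recently, i.e. by our own server) and pulls out the token following
"id".

```python
import re
from email import message_from_string
from typing import Optional

# Assumed pattern: the word "id", whitespace, then a token ending at
# whitespace or ";". Real-world Received headers vary a lot.
_ID_RE = re.compile(r"\bid\s+([^\s;]+)")


def local_id(raw_message: str) -> Optional[str]:
    """Return the id from the most recent Received header, or None."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return None
    # Headers come back in file order, so the first is the most recent hop.
    match = _ID_RE.search(received[0])
    return match.group(1) if match else None


# Made-up example message; hosts, addresses and ids are placeholders.
SAMPLE = (
    "Received: from mail.example.net ([192.0.2.1])\n"
    "\tby mx.example.org with esmtp (Exim 4.77)\n"
    "\tid 1S1lXX-0001Ab-Cd\n"
    "\tfor user@example.org; Sun, 26 Feb 2012 21:46:18 +0000\n"
    "Received: from spammer.invalid by mail.example.net id FAKE123;\n"
    "\tSun, 26 Feb 2012 21:40:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

print(local_id(SAMPLE))
```

Only the topmost header is consulted, so an id forged into an earlier
Received header (FAKE123 above) is ignored. A message with no Received header
at all yields None and would still need some fallback.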
