On Fri, 13 May 2016 12:44:40 -0500 (CDT)
David B Funk wrote:

> What algorithm does Bayes use to detect that it has already 'seen' a
> given message?
>
> When I receive a bolus (say 40~60) of 'phish' messages from a
> compromised Hotmail/gmail/yahoo account which are mostly the same
> (body, many headers same, only recipients, Message-ID, Date, and a
> few Received headers are different) if I feed all of them to Bayes,
> it will learn only about 10% of them, the other 90% will be ignored
> as 'already seen'.
>
> So how does Bayes decide that it has 'already seen' a given message
> when it actually hasn't (it has already seen one that is -almost-
> identical).

It's a hash of part of the body and the date header.
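
To make that concrete, here is a rough Python sketch of the idea, not SpamAssassin's actual code. The names seen_id and make_msg, the choice of SHA-1, and the 1024-character body prefix are all assumptions for illustration; the real implementation may hash other fields as well.

```python
import hashlib
from email import message_from_string


def seen_id(raw_message: str, body_prefix_len: int = 1024) -> str:
    """Hypothetical 'already seen' key: hash of the Date header plus
    the first body_prefix_len characters of the body."""
    msg = message_from_string(raw_message)
    date = msg.get("Date", "None")          # placeholder when Date is missing
    body = raw_message.partition("\n\n")[2]  # everything after the headers
    material = date + "\0" + body[:body_prefix_len]
    return hashlib.sha1(material.encode("utf-8", "replace")).hexdigest()


def make_msg(msgid: str, to: str,
             date: str = "Fri, 13 May 2016 12:00:00 -0500") -> str:
    """Build a tiny sample message (illustrative data only)."""
    headers = [
        "Message-ID: " + msgid,
        "To: " + to,
        "Date: " + date,
        "Subject: Account notice",
    ]
    return "\n".join(headers) + "\n\nPlease verify your account.\n"


if __name__ == "__main__":
    a = make_msg("<1@example.com>", "alice@example.com")
    b = make_msg("<2@example.com>", "bob@example.com")
    # Message-ID and recipient differ, but Date and body are the same,
    # so under this scheme both messages map to the same key.
    print(seen_id(a) == seen_id(b))
```

With a key like this, the Message-ID and the recipients play no part, so two messages that share a Date header and the hashed part of the body count as the same message.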
