I was writing a message to ask for advice on bayes_ignore_header, since I was sure something was wrong, when I decided to have a look at the output of spamassassin -D bayes... and I was shocked by what I saw!
x-spam-relays-external lists all the hops of the message *including* the internal servers, and so x-spam-relays-internal is empty... I specifically asked for the antivirus and the other internal MTAs to be added to the internal list... and now I find the internal server names being used to calculate the Bayes score... I really think this is skewing the result.

I also noticed, to my surprise, that in the 40 tokens used to calculate the score:

* the address or domain of the sender is not used

* the address of the internal server is used 2 times

* tokens that are meaningless (to me), because they are too generic, are used several times (10026 is the port the sending server used, 192.168 is an internal IP range...)

dbg: bayes: token 'H*r:amavisd-new' => 0.00933830395446512
dbg: bayes: token 'H*r:port' => 0.0100739915629308
dbg: bayes: token 'H*r:10026' => 0.00656298715300288
dbg: bayes: token 'H*r:ESMTPSA' => 0.0291881040543893
dbg: bayes: token 'H*RU:ESMTPSA' => 0.0299783424700051
dbg: bayes: token 'Hx-spam-relays-external:ESMTPSA' => 0.0299783424700051
dbg: bayes: token 'H*r:192.168.1' => 0.0332916024497639
dbg: bayes: token 'H*R:U*noreply' => 0.0884273751672186
dbg: bayes: token 'H*r:localhost' => 0.095748955695973

* the address/domain of the receiver is present 6 times in various combinations... why is the receiver address so important?

dbg: bayes: token 'H*r:sk:<localpart>' => 0.00474205399064878
dbg: bayes: token 'HTo:U*<localpart>' => 0.00573965631120421
dbg: bayes: token '<localpart>@<domain>.it' => 0.0252948951857414
dbg: bayes: token 'U*<localpart>' => 0.0252948951857414
dbg: bayes: token 'sk:<localpart>' => 0.0252948951857414
dbg: bayes: token '<localpart><domain>' => 0.0252948951857414

* the 2 words of the subject are listed, although according to the sources Subject: is not tokenized

dbg: bayes: token 'INFORMAZIONI' => 0.0198930234212028
dbg: bayes: token 'importanti' => 0.0186572280369034

* the tokens with the highest scores are these (note the drop from 0.97 to 0.12)

dbg: bayes: token 'assicurarti' => 0.97797086079613
dbg: bayes: token 'caro' => 0.125457833816543

Can you please tell me if my Bayes engine is working as it should?
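
In case it helps anyone compare against their own setup: to go through the token lines above without reading them by eye, a small script along these lines could extract them and sort them by probability. This is only a rough sketch. The regular expression assumes the "dbg: bayes: token '...' => value" format exactly as it appears in the output above, and the function names parse_bayes_tokens and rank_by_probability are made up for illustration; they are not part of SpamAssassin.

```python
import re

# Assumed line format, copied from the debug output quoted above:
#   dbg: bayes: token 'H*r:10026' => 0.00656298715300288
TOKEN_RE = re.compile(
    r"dbg: bayes: token '(.*?)' => ([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)"
)


def parse_bayes_tokens(debug_text):
    """Return (token, probability) pairs in the order they appear."""
    return [(tok, float(prob)) for tok, prob in TOKEN_RE.findall(debug_text)]


def rank_by_probability(pairs):
    """Sort the pairs from most spammy (near 1) to most hammy (near 0)."""
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

Fed with the saved debug output as one string, rank_by_probability(parse_bayes_tokens(text)) should put 'assicurarti' at the top and the Received-header tokens such as 'H*r:10026' near the bottom, which makes it easier to see how many of the 40 tokens come from internal hops.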