On 23 Jan 2018, at 11:55 (-0500), Pedro David Marco wrote:

Shall SA accept URLs 5MB big for example?

Generally speaking, SA should not be seeing whole messages that big, much less single URLs. Beyond the slowness and resource demands of scanning large messages, SA's power of discernment fizzles out around 500KB. SA won't catch much very large spam, because that spam isn't very similar to the spam SA is designed for or usually trained on.

But to the original question, it is unfortunately true that entities which are generally recognized as legitimate sometimes use URLs in email that exceed 1KB, while URLs longer than 2KB are quite rare in ham or spam. Some data:

For a while I had test rules that hit URLs with long parts after the hostname. A 600-character threshold was useless, with a tiny correlation to ham. At 800 there was a stronger but still not useful correlation to ham. Over 1000 it was a minor menace: it hit 5 times as much ham as spam, most of that spam was already scored in double digits by SA, and very few of the hits were spam that had slipped past SA. I killed the rules as a failed experiment.
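For anyone who wants to repeat that kind of measurement on their own mail, here is a minimal Python sketch. It is not the SA rules I ran, just a standalone illustration. The regex, the function names, and the (label, body) input format are all assumptions made for the example, and the "part after the hostname" is approximated as path plus query plus fragment.

```python
import re
from urllib.parse import urlsplit

# Deliberately crude URL matcher (an assumption for this sketch, not SA's parser).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def tail_length(url):
    """Approximate length of everything after the hostname."""
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit rejects some malformed hosts (e.g. bad IPv6 brackets).
        return 0
    tail = parts.path
    if parts.query:
        tail += "?" + parts.query
    if parts.fragment:
        tail += "#" + parts.fragment
    return len(tail)


def hits_by_threshold(messages, thresholds=(600, 800, 1000)):
    """Count ham and spam messages whose longest URL tail exceeds each threshold.

    messages: iterable of (label, body) pairs, label being "ham" or "spam".
    """
    counts = {t: {"ham": 0, "spam": 0} for t in thresholds}
    for label, body in messages:
        longest = max((tail_length(u) for u in URL_RE.findall(body)), default=0)
        for t in thresholds:
            if longest > t:
                counts[t][label] += 1
    return counts
```

Feeding it a hand-classified corpus and comparing the ham and spam counts at each threshold gives the same sort of ratio described above.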

Today I did a very rough check of an unrepresentative corpus (hand-classified but containing only ham and SA escapees) of 95k messages (93k ham/2k spam) from the past 42 months. The longest URL is 2054 characters after the hostname, and it is in ridiculously pathological spam whose text/plain part is mostly HTML-encoded versions of UTF-16(?) entities. The next longest is 1852 characters and it's in ham. I see no way to make the length of URLs a useful spam test.

However, there is a bright side to that. While such a limit will not catch much, it is *probably* perfectly safe to set a prudent limit on URLs (say, 5KB?) and not need to worry much about FPs.
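To illustrate what such a limit could look like, here is a rough Python sketch that flags any URL over a byte limit so the caller can skip or truncate it. This is not an existing SA option or API. The names and the regex are hypothetical, and the 5KB default is just the guess above, not a tested value.

```python
import re

# Crude URL matcher, assumed for this sketch only.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# The "say, 5KB?" figure from the text; an untested guess.
MAX_URL_BYTES = 5 * 1024


def oversized_urls(body, limit=MAX_URL_BYTES):
    """Return the URLs in body whose UTF-8 encoded size exceeds limit."""
    return [
        u for u in URL_RE.findall(body)
        if len(u.encode("utf-8", "replace")) > limit
    ]
```

Given the corpus numbers above, where the longest URL seen was a bit over 2000 characters, a check like this should almost never fire on real mail, which is the point.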

--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Currently Seeking Steady Work: https://linkedin.com/in/billcole
