Re: Maxium URL acceptable length

Bill Cole Tue, 23 Jan 2018 14:27:06 -0800

On 23 Jan 2018, at 11:55 (-0500), Pedro David Marco wrote:

Shall SA accept URLs 5MB big for example?

Generally speaking, SA should not be seeing whole messages that big, much less single URLs. Beyond the slowness and the resource demands of scanning large messages, the discernment power of SA fizzles out around 500KB. You won't catch much of the spam that is very large with SA, because it isn't very similar to the spam SA is designed for or usually trained on.

But to the original question, it is unfortunately true that entities which are generally recognized as legitimate sometimes use URLs in email that exceed 1KB, while URLs longer than 2KB are quite rare in ham or spam. Some data:

For a while I had test rules that hit URLs with long parts after the hostname and found that a 600 character threshold was useless, with a tiny correlation to ham. At 800 there was a stronger but still not useful correlation to ham. Over 1000 it was a minor menace, hitting 5 times as much ham as spam with most of the spam already scored in double digits by SA and very few that had slipped past SA. I killed the rules as a failed experiment.
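For anyone who wants to repeat that kind of measurement outside SA, here is a minimal Python sketch of "length of the part after the hostname". It is not the test rules themselves, which are not shown in this message. The names `tail_length`, `thresholds_exceeded` and `THRESHOLDS` are made up for illustration, and counting path, query and fragment together is an assumption about what "after the hostname" means.

```python
from urllib.parse import urlsplit

# Thresholds from the experiment described above, in characters after the hostname.
THRESHOLDS = (600, 800, 1000)


def tail_length(url: str) -> int:
    """Count the characters following the hostname: path, query and fragment."""
    parts = urlsplit(url)
    length = len(parts.path)
    if parts.query:
        length += 1 + len(parts.query)      # +1 for the "?" separator
    if parts.fragment:
        length += 1 + len(parts.fragment)   # +1 for the "#" separator
    return length


def thresholds_exceeded(url: str) -> list[int]:
    """Return every threshold that the URL's tail is strictly longer than."""
    n = tail_length(url)
    return [t for t in THRESHOLDS if n > t]


if __name__ == "__main__":
    # Hypothetical tracking-style URL with a long query string.
    sample = "https://example.com/track?" + "x" * 850
    print(tail_length(sample), thresholds_exceeded(sample))
```

Running this over hand-classified ham and spam and tallying hits per threshold would give the same kind of ham-to-spam ratio as quoted above, though the numbers depend entirely on the corpus.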

Today I did a very rough check of an unrepresentative corpus (hand-classified but containing only ham and SA escapees) of 95k messages (93k ham/2k spam) from the past 42 months. Longest URL is 2054 characters after the hostname and that one is in ridiculously pathological spam whose text/plain part is mostly HTML-encoded versions of UTF-16(?) entities. The next longest is 1852 characters and it's in ham. I see no way to make the length of URLs a useful spam test.

However, there is a bright side of that. While it will not catch much, it is *probably* perfectly safe to set a prudent limit on URLs (say, 5KB?) and not need to worry much about FPs.
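As a rough illustration of such a limit (a standalone Python sketch, not how SA handles URLs internally), the check amounts to a size cap applied before any further URL processing. The 5KB value is the tentative figure suggested above, and `MAX_URL_BYTES`, `within_limit` and `split_by_limit` are hypothetical names.

```python
# Tentative cap from the suggestion above ("say, 5KB?"); an assumption, not a tuned value.
MAX_URL_BYTES = 5 * 1024


def within_limit(url: str, limit: int = MAX_URL_BYTES) -> bool:
    """True if the URL's UTF-8 encoded size does not exceed the limit."""
    return len(url.encode("utf-8", errors="replace")) <= limit


def split_by_limit(urls, limit: int = MAX_URL_BYTES):
    """Partition URLs into those worth processing and those skipped as oversized."""
    kept, skipped = [], []
    for url in urls:
        (kept if within_limit(url, limit) else skipped).append(url)
    return kept, skipped
```

With this cap, the longest URL in the corpus described above (2054 characters after the hostname) would still be processed, while a multi-megabyte URL like the one in the original question would be skipped.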


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Currently Seeking Steady Work: https://linkedin.com/in/billcole
