Hi,

On Tue, Jun 10, 2014 at 3:25 PM, Karsten Bräckelmann <guent...@rudersport.de>
wrote:
>
> On Tue, 2014-06-10 at 13:53 -0400, Alex wrote:
> > I'm not very familiar with how to manage language encoding, and hoped
> > someone could help. Some time ago I wrote a rule that looks for
> > subjects that consist of a single word that's more than N characters.
> > It works, but I'm learning that it's performed before the content of
> > the subject is converted into something human-readable.
>
> This is not true. Header rules are matched against the decoded string by
> default. To prevent decoding of quoted-printable or base-64 encoded
> headers, the :raw modifier needs to be appended to the header name.

I've also realized I made the mistaken assumption that, even if the rule was
operating on the encoded string, the decoded string might still have been
longer than 20 chars.
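
(For my own notes, a minimal sketch of the distinction as I now understand
it; the rule names here are just throwaway examples, untested:)

  # default: the pattern is applied to the decoded, human-readable Subject
  header TEST_DECODED Subject =~ /.+/

  # :raw: the pattern is applied to the header as transmitted,
  # i.e. the =?utf-8?B?...?= encoded form
  header TEST_RAW     Subject:raw =~ /=\?utf-8\?B\?/i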

> > Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=
>
> That's a base-64 encoded UTF-8 string, decoded for header rules. To see
> for yourself, just echo your test header into
>
>   spamassassin -D -L --cf="header TEST Subject =~ /.+/"
>
> and the debug output will show you what it matched.
>
>   dbg: rules: ran header rule TEST ======> got hit: "《环球旅讯》原创:在线旅游"

Great info, thanks.
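
(For the archives, this is roughly how I ran that test, piping a one-header
message into a local, network-free scan; the debug output goes to stderr,
hence the redirect:)

  echo 'Subject: =?utf-8?B?44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4?=' | \
    spamassassin -D -L --cf="header TEST Subject =~ /.+/" 2>&1 | grep 'got hit'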

It's here that I'm starting to lose you:

> > How can I write a header rule that operates on the decoded utf
> > content?
> >
> > header          __SUB_NOSPACE   Subject =~ /^.\S+$/
> > header          __SUB_VERYLONG  Subject =~ /^.{20,200}\S+$/
> > meta            LOC_SUBNOSPACE  (__SUB_VERYLONG && __SUB_NOSPACE)
>
> Again, header rules by default operate on the decoded string.
>
> I assume your actual problem is with the SUB_VERYLONG rule hitting.
> Since the above test rule shows the complete decoded Subject, we can
> tell it's 13 chars long, clearly below the "verylong" threshold of 20
> chars.

Visually counting the individual characters, including the colon, does indeed
give 13. However, there appear to be spaces, which should have prevented the
rule from matching (\S), no? Also, wc tells me the string is 41 chars long.

> That is not caused by the encoding, though, but because the regex
> operates on bytes rather than characters.

Isn't each character exactly one byte? Or are you referring to the fact that
it can take multiple bytes to encode a single character?

> Let's see what a 20 bytes chunk of that UTF-8 string looks like. A
> modified rule will match the first 20 bytes only:
>
>   header TEST Subject =~ /^.{20}/
>
> The result shows the string is longer than 20 bytes, and the match even
> ends right within a single UTF-8 encoded char.
>
>   got hit: "《环球旅讯》<E5><8E>"

Yes, on my xterm that just shows up as an unintelligible question-mark
character.
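
(Out of curiosity I also decoded the base64 directly and counted; this is
just a throwaway one-liner using core Perl modules:)

  perl -MMIME::Base64 -e '
    my $s = decode_base64("44CK546v55CD5peF6K6v44CL5Y6f5Yib77ya5Zyo57q/5peF5ri4");
    print length($s), " bytes\n";        # 39 raw UTF-8 octets
    utf8::decode($s);                    # interpret those octets as UTF-8
    print length($s), " characters\n";   # 13 decoded characters
  '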

> To make the regex matching aware of UTF-8 encoding, and match chars
> instead of (raw) bytes, we will need the normalize_charset option.
>
>   header TEST Subject =~ /^.{10}/
>   normalize_charset 1

Why wouldn't it make sense for this to be the default? What is the utility
in trying to match on an encoded string?

I think I'm also confused by your reference above that header rules are
matched against the decoded string. What then would be the purpose of
normalize_charset here? Does normalize here mean to decode it?
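
(In practice, am I right that it would then just be a matter of adding this
to local.cf, so that my existing rules count decoded characters rather than
bytes? Untested on my side so far:)

  normalize_charset 1

  header  __SUB_NOSPACE   Subject =~ /^.\S+$/
  header  __SUB_VERYLONG  Subject =~ /^.{20,200}\S+$/
  meta    LOC_SUBNOSPACE  (__SUB_VERYLONG && __SUB_NOSPACE)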

> Along with yet another modification of the test rule, now matching the
> first 10 chars only.
>
>   got hit: "《环球旅讯》原创:在"
>
> The effect is clear. That 10 (chars) long match with normalize_charset
> enabled is even longer than the above 20 (byte) match.

Okay, I think I understand. So I don't want to avoid scanning encoded
headers, but it's also very unlikely that a spam message would hit on a
20-byte run of CJK characters, so I don't really know what I should do
with this.

I've also investigated a bit further, and it appears the rule hits quite a
bit of ham (really_long_spreadsheet.xls, for example), so maybe I need to
combine it in a meta with something else (rough idea below), or just
abandon it.
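
(One rough idea, completely untested and with placeholder rule names: only
let the "very long single word" combination fire when the subject is plain
ASCII and doesn't look like a forwarded filename:)

  header  __SUB_NONASCII    Subject =~ /[^\x00-\x7f]/
  header  __SUB_LOOKS_FILE  Subject =~ /\.(?:xlsx?|docx?|pdf|zip)$/i
  meta    LOC_SUBNOSPACE    (__SUB_VERYLONG && __SUB_NOSPACE && !__SUB_NONASCII && !__SUB_LOOKS_FILE)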

Thanks,
Alex
