Re: How to filter ( mime_header_checks ) special unicode "Format characters".

Wietse Venema Sun, 08 Dec 2013 14:19:50 -0800

T.B.:
> > According to RFC 2183/2184, content-disposition names containing
> > non-ASCII content must be encoded as ASCII strings.
> >
> > This means you may need to handle content-disposition names that
> > violate RFC 2183/2184, besides correctly-encoded forms for UTF-8,
> > UTF-16, and so on. I am not sure that regular expressions are the
> > tool for this job.


[create MIME message with Mozilla Thunderbird]

> --------------070209080009040106000400
> Content-Type: application/octet-stream;
>   name="=?UTF-8?B?4oCuY29kLnlyYW1tdXNldml0dWPigK1uMWPigK0uZXhl?="
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment;
>   filename*0*=UTF-8''%E2%80%AE%63%6F%64%2E%79%72%61%6D%6D%75%73%65%76%69%74;
>   filename*1*=%75%63%E2%80%AD%6E%31%63%E2%80%AD%2E%65%78%65

Here, you are looking at two layers of encoding. First, Unicode
0x202E and 0x202D are encoded by the rules of UTF-8, and then any
% or non-ASCII bytes in the result are encoded as %xx.

For example, Unicode 0x202E (binary: 0010 0000 | 0010 1110) becomes
UTF-8 0xE2 0x80 0xAE (binary: 1110 0010 | 1000 0000 | 1010 1110),
and that becomes %E2%80%AE as you can see above.
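
The two layers can be undone in the opposite order to see what the
filename really contains. The following is only a minimal Python sketch:
the two segments are copied from the Content-Disposition header above
with the UTF-8'' prefix stripped by hand, and the variable names are made
up for illustration.

```python
from urllib.parse import unquote_to_bytes

# The filename*0* and filename*1* values from the header above,
# with the "UTF-8''" charset/language prefix removed by hand.
segments = [
    "%E2%80%AE%63%6F%64%2E%79%72%61%6D%6D%75%73%65%76%69%74",
    "%75%63%E2%80%AD%6E%31%63%E2%80%AD%2E%65%78%65",
]

# Undo the outer layer: %xx sequences back to raw bytes.
raw = unquote_to_bytes("".join(segments))

# Undo the inner layer: UTF-8 bytes back to Unicode code points.
name = raw.decode("utf-8")

# Show the filename with the invisible characters made visible.
print(name.encode("unicode_escape").decode("ascii"))
# \u202ecod.yrammusevituc\u202dn1c\u202d.exe

# List only the non-ASCII code points.
print([f"U+{ord(c):04X}" for c in name if ord(c) > 0x7F])
# ['U+202E', 'U+202D', 'U+202D']
```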

I don't know why Thunderbird uses %xx encoding for ASCII characters
(%63%6F%64 is the same as "cod") but it is not wrong.

> The question you already stated is:
> (1.) Is there any possibility to recognize the special characters with 
> regular expressions?

It is possible.  However, to stop malicious mail you would also
need to recognize encodings other than UTF-8 and %xx, including
(non)encodings that Windows software recognizes even if they violate
internet RFCs.  I can't tell you what all possible email software
recognizes.  That needs to be determined empirically.
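
To illustrate both the possibility and the limitation, here is a minimal
Python sketch of a pattern that recognizes only the %xx form of 0x202D
and 0x202E. The pattern and the function name are assumptions made for
this example, not a tested filter rule. It matches the Content-Disposition
header from the sample message, but not the base64 form of the same name
in the Content-Type header.

```python
import re

# Matches U+202D / U+202E only in %xx-encoded UTF-8 form
# (%E2%80%AD or %E2%80%AE); hex digits may be upper or lower case.
PCT_OVERRIDE = re.compile(r"%E2%80%A[DE]", re.IGNORECASE)

def has_pct_encoded_override(header_line: str) -> bool:
    """Return True if the header contains a %xx-encoded U+202D or U+202E."""
    return PCT_OVERRIDE.search(header_line) is not None

disposition = (
    "Content-Disposition: attachment; "
    "filename*0*=UTF-8''%E2%80%AE%63%6F%64%2E%79%72%61%6D%6D%75%73%65%76%69%74; "
    "filename*1*=%75%63%E2%80%AD%6E%31%63%E2%80%AD%2E%65%78%65"
)
content_type = (
    "Content-Type: application/octet-stream; "
    'name="=?UTF-8?B?4oCuY29kLnlyYW1tdXNldml0dWPigK1uMWPigK0uZXhl?="'
)

print(has_pct_encoded_override(disposition))   # True
print(has_pct_encoded_override(content_type))  # False: same name, base64 form
```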

> (3.) Additional question: Is there a known solution to filter such files 
> within attached zipped archives?

Yes, use a content filter.
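
What such a filter might do with an archive, once it has extracted the
attachment from the message, can be sketched in Python as follows. This
is an assumption-laden illustration, not a complete content filter: the
function name and the set of characters are made up for the example, and
only 0x202D and 0x202E are checked. Python's zipfile module decodes
member names as UTF-8 only when the archive flags them as such (otherwise
as cp437), so names stored in other ways would need separate handling.

```python
import io
import zipfile

# The two format characters discussed in this thread.
SUSPECT_CHARS = {"\u202d", "\u202e"}

def suspicious_zip_members(data: bytes) -> list:
    """Return member names in a zip archive that contain U+202D or U+202E."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [n for n in archive.namelist() if SUSPECT_CHARS & set(n)]

# Build a small archive in memory to try the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("\u202ecod.exe", b"x")
    archive.writestr("ok.txt", b"y")

for member in suspicious_zip_members(buf.getvalue()):
    print(member.encode("unicode_escape").decode("ascii"))
```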

        Wietse
