T.B.:
> > According to RFC 2183/2184, content-disposition names containing
> > non-ASCII content must be encoded as ASCII strings.
> >
> > This means you may need to handle content-disposition names that
> > violate RFC 2183/2184, besides correctly-encoded forms for UTF-8,
> > UTF-16, and so on. I am not sure that regular expressions are the
> > tool for this job.
[create MIME message with Mozilla Thunderbird]
> --------------070209080009040106000400
> Content-Type: application/octet-stream;
> name="=?UTF-8?B?4oCuY29kLnlyYW1tdXNldml0dWPigK1uMWPigK0uZXhl?="
> Content-Transfer-Encoding: base64
> Content-Disposition: attachment;
> filename*0*=UTF-8''%E2%80%AE%63%6F%64%2E%79%72%61%6D%6D%75%73%65%76%69%74;
> filename*1*=%75%63%E2%80%AD%6E%31%63%E2%80%AD%2E%65%78%65
Here, you are looking at two layers of encoding. First, the Unicode
code points 0x202E (right-to-left override) and 0x202D (left-to-right
override) are encoded by the rules of UTF-8, and then any % or
non-ASCII bytes in the result are encoded as %xx.
For example, Unicode 0x202E (binary: 0010 0000 | 0010 1110) becomes
UTF-8 0xE2 0x80 0xAE (binary: 1110 0010 | 1000 0000 | 1010 1110),
and that becomes %E2%80%AE as you can see above.
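
The two layers can be reproduced with a few lines of Python. This is
only a sketch to illustrate the arithmetic above; the variable names
are mine, and it uses nothing beyond the standard urllib.parse module.

```python
from urllib.parse import quote, unquote

rlo = "\u202e"  # Unicode 0x202E, right-to-left override

# Layer 1: encode the code point by the rules of UTF-8.
utf8_bytes = rlo.encode("utf-8")

# Layer 2: write every byte as %xx (safe="" leaves no byte unescaped).
percent_form = quote(utf8_bytes, safe="")

print(utf8_bytes.hex(" "))   # the three UTF-8 bytes in hex
print(percent_form)          # the %xx form seen in the header

# ASCII characters may be %xx-encoded too; they decode to themselves.
print(unquote("%63%6F%64"))
```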
I don't know why Thunderbird uses %xx encoding for ASCII characters
(%63%6F%64 is the same as "cod") but it is not wrong.
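
To see what the filename*0* and filename*1* parameters above decode
to, one can join the segments, split off the charset and the (empty)
language tag, and undo the %xx layer. This is a minimal sketch that
assumes the segments are already extracted and in order; it is not a
general MIME parameter parser.

```python
from urllib.parse import unquote

# The two continuation segments, copied from the message above.
segments = [
    "UTF-8''%E2%80%AE%63%6F%64%2E%79%72%61%6D%6D%75%73%65%76%69%74",
    "%75%63%E2%80%AD%6E%31%63%E2%80%AD%2E%65%78%65",
]

# The first segment starts with charset'language'; the rest is data.
joined = "".join(segments)
charset, _language, encoded = joined.split("'", 2)

# Undo the %xx layer and the UTF-8 layer in one step.
filename = unquote(encoded, encoding=charset)

# ascii() shows the otherwise invisible override characters as \u202e etc.
print(ascii(filename))
print(filename.endswith(".exe"))
```

The printed form makes the real extension visible even though a
bidi-aware display would show the name differently.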
> The question you already stated is:
> (1.) Is there any possibility to recognize the special characters with
> regular expressions?
It is possible. However, to stop malicious mail you would also need
to recognize encodings other than UTF-8 and %xx, including
(non)encodings that Windows software recognizes even if they violate
internet RFCs. I can't tell you what all possible email software
recognizes. That needs to be determined empirically.
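
As an illustration of both points, here is a sketch of a regular
expression for the %xx-encoded UTF-8 form of the bidirectional
override and isolate characters (0x202A-0x202E and 0x2066-0x2069).
The pattern and the function name are my own choices, not a complete
or tested filter. Note that it finds the filename*0* form but not the
base64 form of the same name in the Content-Type header.

```python
import re

# %xx-encoded UTF-8 for 0x202A..0x202E (E2 80 AA..AE) and
# 0x2066..0x2069 (E2 81 A6..A9). Hex digits may be in either case.
BIDI_PERCENT = re.compile(r"%E2%80%A[A-E]|%E2%81%A[6-9]", re.IGNORECASE)

def has_percent_encoded_bidi(header_value: str) -> bool:
    """Return True if the raw header text contains a %xx-encoded bidi control."""
    return BIDI_PERCENT.search(header_value) is not None

rfc2231_form = (
    "filename*0*=UTF-8''%E2%80%AE%63%6F%64%2E%79%72%61%6D%6D%75%73%65%76%69%74;"
)
base64_form = 'name="=?UTF-8?B?4oCuY29kLnlyYW1tdXNldml0dWPigK1uMWPigK0uZXhl?="'

print(has_percent_encoded_bidi(rfc2231_form))  # the %xx form is matched
print(has_percent_encoded_bidi(base64_form))   # the base64 form is missed
```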
> (3.) Additional question: Is there a known solution to filter such files
> within attached zipped archives?
Yes, use a content filter.
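
For what it is worth, the check that such a filter might apply to
archive member names could look like the sketch below. It assumes the
attachment has already been decoded to bytes, uses only Python's
standard zipfile module, and the function name is hypothetical. It
only sees names that the archive marks as UTF-8; Python decodes
unmarked names as cp437, so those would need separate handling.

```python
import io
import zipfile

# Bidirectional embedding, override and isolate controls.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def suspicious_zip_members(data: bytes) -> list[str]:
    """Return member names in a zip archive that contain bidi controls."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return [
            name
            for name in archive.namelist()
            if any(ch in BIDI_CONTROLS for ch in name)
        ]
```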
Wietse