Hi,

On 07/07/16 04:41, Mark Sapiro wrote:
> That should be
>
> ^Subject:.*[list of all Chinese characters here]
>
> except that if your list's preferred language is English and you haven't
> changed Mailman's character set for English from ASCII to UTF-8, the
> text you are matching against won't contain any Chinese characters
> because the decoded headers are converted to the character set of the
> list's preferred language and all the Chinese characters will be
> converted to '?'.
>
> You might try something like
>
> ^Subject:.*\?{4,}
>
> This will match any subject that contains 4 or more non-ascii characters
> in a row. Unfortunately, it will also match
>
> Subject: WTF is happening here????
>
> but you could try some number other than 4 but greater than 1
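
To illustrate the conversion described in the quoted text, here is a minimal sketch. It is written in Python 3 syntax (Mailman 2.1 itself runs on Python 2), and the sample subjects and the helper to_list_charset are hypothetical. It only mimics encoding a decoded header to an ASCII list charset with the 'replace' error handler; it is not Mailman's actual code.

```python
import re

# Hypothetical subject lines, used only for illustration.
chinese_subject = u"Subject: \u6709\u9650\u516c\u53f8 offer"
english_subject = u"Subject: WTF is happening here????"


def to_list_charset(header, cset="us-ascii"):
    """Encode a decoded header to the list charset, replacing
    every character the charset cannot represent with '?'."""
    return header.encode(cset, "replace").decode(cset)


# The pattern suggested in the quoted message.
pattern = re.compile(r"^Subject:.*\?{4,}", re.MULTILINE)

converted = to_list_charset(chinese_subject)
print(converted)                        # Subject: ???? offer
print(bool(pattern.search(converted)))  # True: four replaced characters

# The false positive: a plain ASCII subject ending in four '?'.
print(bool(pattern.search(to_list_charset(english_subject))))  # True
```

Both subjects match, which is the ambiguity the quoted message warns about.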

How about using 'backslashreplace' instead of 'replace' when encoding to the
list's preferred language in Mailman/Handlers/SpamDetect.py?

Then a suitable pattern in this case seems to be

^Subject.*(\\u[0-9a-f]{4}){4}

It also matches strings like 'What does the string
"\\u6709\\u9650\\u516c\\u53f8" mean?', though.

=== modified file 'Mailman/Handlers/SpamDetect.py'
--- Mailman/Handlers/SpamDetect.py	2016-01-18 23:56:58 +0000
+++ Mailman/Handlers/SpamDetect.py	2016-07-09 00:47:33 +0000
@@ -86,7 +86,7 @@
                 # unicode it as iso-8859-1 which may result in a garbled
                 # mess, but we have to do something.
                 uvalue += unicode(frag, 'iso-8859-1', 'replace')
-        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'replace'))
+        headers += '%s: %s\n' % (h, uvalue.encode(cset, 'backslashreplace'))
     return headers

-- 
Yasuhito FUTATSUKI <futat...@poem.co.jp>
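
As a rough check of the 'backslashreplace' idea, the following sketch shows what the encoded header and the proposed pattern look like. It uses Python 3 syntax, and the sample subjects are hypothetical. It reproduces only the encode step changed in the patch, not the rest of SpamDetect.py.

```python
import re

# Hypothetical subject lines, used only for illustration.
chinese_subject = u"Subject: \u6709\u9650\u516c\u53f8 offer"
english_subject = u"Subject: WTF is happening here????"
# An ASCII subject that literally contains backslash-u escape text.
literal_subject = u'Subject: What does the string "\\u6709\\u9650\\u516c\\u53f8" mean?'


def encode_header(header, cset="us-ascii"):
    """Encode with 'backslashreplace': unencodable characters become
    escapes such as \\u6709 instead of '?'."""
    return header.encode(cset, "backslashreplace").decode(cset)


# The proposed pattern: four consecutive \uXXXX escapes in the subject.
pattern = re.compile(r"^Subject.*(\\u[0-9a-f]{4}){4}", re.MULTILINE)

escaped = encode_header(chinese_subject)
print(escaped)  # Subject: \u6709\u9650\u516c\u53f8 offer

print(bool(pattern.search(escaped)))                         # True
print(bool(pattern.search(encode_header(english_subject))))  # False
print(bool(pattern.search(encode_header(literal_subject))))  # True
```

The '????' subject no longer matches, while the literal escape text still does, as noted above. One thing to keep in mind: 'backslashreplace' writes characters below U+0100 as \xNN and characters above U+FFFF as \UNNNNNNNN, so this pattern only catches runs of characters in the range U+0100 to U+FFFF.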