On 7/6/16 10:51 AM, Greg Lindsay wrote:
>
> I assume the text box that is asking to input a "Spam Filter Regexp"
> will attempt to match all text in the header.

The regexps match against a block of text which consists of all of the
message and sub-part headers, RFC 2047 decoded and separated by newline
characters. The block is matched in multiline mode, so '^' matches at the
beginning of the string or immediately following a newline.

> Since all headers
> include the text "Subject:" and that is the area of the header that I
> want to filter, this is why "^Subject:" is specified.

Correct.

> If I eliminate
> the literal asterisk and just change this to an asterisk, i.e.:
> "^Subject:*" that should take care of the space, right?

Regexps are not globs. Asterisk doesn't mean 0 or more of anything. It
is a repetition operator which means 0 or more of the preceding item.
"^Subject:*" will match the beginning of the string or a newline,
followed by 'Subject', followed by 0 or more ':'. You would want
"^Subject:.*" to match 'Subject:' followed by 0 or more of any character.
See <https://docs.python.org/2/library/re.html#regular-expression-syntax>.

> Sometimes the
> mails come in with mixed Chinese and English characters, so if an
> English character is first in the subject and my filter specifies
> that it must be a space followed by a Chinese character, then the
> filter would fail to catch this...I think what is needed is this:
>
> ^Subject:*[list of all Chinese characters here]

That should be

^Subject:.*[list of all Chinese characters here]

except that if your list's preferred language is English and you haven't
changed Mailman's character set for English from ASCII to UTF-8, the
text you are matching against won't contain any Chinese characters. The
decoded headers are converted to the character set of the list's
preferred language, so all the Chinese characters will be converted to
'?'.

You might try something like

^Subject:.*\?{4,}

This will match any subject that contains 4 or more non-ASCII characters
in a row. Unfortunately, it will also match

Subject: WTF is happening here????

but you could try some number other than 4 but greater than 1.

> I don't understand the use of an equals sign in the regexp. Isn't
> this implied?

I was referring to an RFC 2047 encoded word which you were apparently
trying to match with

^Subject:\?utf-8\?B\?[56]

except the literal RFC 2047 encoding would not be '?utf-8?B?...'. It would
be '=?utf-8?B?...'. I.e. the '=' is part of the string you would be trying
to match. See <https://www.rfc-editor.org/rfc/rfc2047.txt>.

However, you can't match RFC 2047 encodings with header_filter_rules
because the headers you are matching against have already been RFC 2047
decoded.

-- 
Mark Sapiro <m...@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

------------------------------------------------------
Mailman-Users mailing list Mailman-Users@python.org
https://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: https://mail.python.org/mailman/options/mailman-users/archive%40jab.org
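
The regexp behaviour described in this reply can be checked with a short
Python sketch. It is not Mailman's code: the sample headers and the
first_match helper are hypothetical, the only flag used is re.MULTILINE
(taken from the description above), and the encode(..., "replace") call
is just a stand-in for whatever conversion turns non-ASCII characters
into '?' when the list's character set is ASCII.

```python
import re

# Hypothetical subject with four Chinese characters followed by English text.
# Escapes are used so the source file itself stays pure ASCII.
subject = "Subject: \u4e2d\u6587\u90ae\u4ef6 offer"

# Illustrative stand-in for the conversion to an ASCII list character set:
# every character that cannot be represented becomes a single '?'.
ascii_subject = subject.encode("ascii", "replace").decode("ascii")

# Hypothetical block of decoded headers, one header per line.
headers = "\n".join([
    "From: someone@example.com",
    ascii_subject,
    "To: list@example.com",
])


def first_match(pattern, text=headers):
    """Return the text matched by pattern in multiline mode, or None."""
    m = re.search(pattern, text, re.MULTILINE)
    return m.group(0) if m else None


# '*' repeats only the preceding ':', so nothing after the colon is consumed.
glob_style = first_match(r"^Subject:*")

# '.*' consumes the rest of the Subject line ('.' does not match a newline).
regex_style = first_match(r"^Subject:.*")

# A run of four or more literal '?' anywhere in the Subject line.
run_of_marks = first_match(r"^Subject:.*\?{4,}")

# The same rule also fires on an all-English subject ending in '????'.
false_positive = first_match(
    r"^Subject:.*\?{4,}", "Subject: WTF is happening here????"
)

for name, value in [
    ("ascii_subject", ascii_subject),
    ("glob_style", glob_style),
    ("regex_style", regex_style),
    ("run_of_marks", run_of_marks),
    ("false_positive", false_positive),
]:
    print(name, "=", repr(value))
```

Changing the {4,} bound in the last two patterns is a quick way to see how
the choice of threshold trades missed spam against false positives on
ordinary subjects that end in several question marks.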