On Sat, Dec 02, 2006 at 01:17:28AM -0500, Chris Hanson wrote:
> The reason it doesn't match is that the "R in a circle" character is
> encoded in the log file as using the ISO 8859-1 code 0xae, but this
> isn't a valid first byte of a UTF-8 code.  Consequently, the "."
> pattern doesn't match it.  In fact, I don't think there's _any_ way to
> match this byte sequence in a UTF-8 locale.

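The encoding point above is easy to check outside grep.  What follows is
only a rough sketch in Python, not grep's or libc's actual code path: the
log line, the pattern and the helper name decodes_as_utf8 are all made up
for illustration, and matching on raw bytes merely stands in for what a
byte-oriented locale would do.

```python
import re

# Hypothetical log line: "Foo", the ISO 8859-1 registered sign (0xae), "Bar".
line = b"Foo\xaeBar"


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xae is 10101110 in binary, i.e. a continuation byte, so it can never
# start a UTF-8 sequence and the line as a whole is not valid UTF-8.
print(decodes_as_utf8(line))      # False

# Matching on raw bytes lets "." consume the stray byte.
byte_match = re.search(rb"Foo.Bar", line)
print(byte_match is not None)     # True
```

So the data is simply not UTF-8, and a matcher that insists on decoding
first has nothing for "." to match, while a byte-wise matcher does not
care.
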
I guess [eg]libc's regex functions are a bit strict about their input.
However, grep also comes with its own DFA-based functions, which are
more lax about encoding errors; they are normally skipped for multibyte
encodings, but can be forced with GREP_USE_DFA=1.

> Unfortunately I'm not sure what to do about this, because it's not
> obvious how the log-file messages relate to the locale.  This message

They don't, at least not reliably.  There's stuff in there, like ssh
usernames, that comes directly from nefarious people who don't give a
rat's ass about your particular selection of encoding.

> One thing that works in this case is to set "LC_ALL=C" prior to
> calling grep.  But if the log files sometimes contain UTF-8 coding,
> this will mess that up

I doubt this would be a problem.  Pretty much everything that is matched
explicitly in any rule (hostname, IP address, process ID) is in ASCII.
Any chunk of arbitrary data should be matched with something like .* or
[^[:space:]]+, which will work whether it was decoded or not.

Now, it's true that POSIX restricts the "C" locale to 7-bit characters,
but both grep and [eg]libc appear to deal with binary characters just
fine.

One unfortunate side effect is that any error messages from grep will
therefore be in English, but that's probably a lesser evil.
(LC_MESSAGES cannot be left as is, since mixing different encodings is
not supported.)

-- 
Never trust an operating system you don't have sources for. ;-)
-- Unknown source

_______________________________________________
Logcheck-devel mailing list
Logcheck-devel@lists.alioth.debian.org
http://lists.alioth.debian.org/mailman/listinfo/logcheck-devel