Eli Zaretskii wrote:
>> From: Jim Meyering <[email protected]>
>> Cc: [email protected], [email protected]
>> Date: Mon, 03 Oct 2011 18:41:25 +0200
>>
>> > This version of wctob solves the problem.
>>
>> Good. Thanks for confirming that.
>> Then I suggest that users of dfa.c like gawk arrange to use that.
>> grep and any users that (by use of gnulib) can be assured of a working
>> wctob do not need to change dfa.c to work around that bug.
>>
>> However, while current wctob configure-time tests in gnulib
>> do detect some wctob problems, I don't see a test for this one.
>> Hence, if you can confirm that this also causes a problem with grep,
>> I'll work with you to add a configure-time test in gnulib
>> so that gnulib-using projects also replace that system's wctob.
>
> It will take time for me to look in grep, because I'd need to build my
> own binary from sources.
>
> For Gawk, the configure-time test is not going to solve the problem on
> Windows because the Windows port of Gawk does not use the configure
> script, it is built using a separately maintained Makefile. So for
> Gawk, I can simply put the replacement wctob on a Windows-specific
> file (which exists anyway, for other functions that need wrappers or
> replacements).
FYI, this is what I'm going to push.
The only piece lacking is the [...] note in NEWS where I normally
document in which version the bug was introduced. Since I have been
unable to reproduce it, I haven't bothered to try to deduce when it
was introduced.

From 7d20c09e3e7cf3af9060f395e884fca285ce3598 Mon Sep 17 00:00:00 2001
From: Eli Zaretskii <[email protected]>
Date: Sun, 2 Oct 2011 21:33:53 +0200
Subject: [PATCH] dfa: don't mishandle high-bit bytes in a regexp with signed-char

This appears to arise only on systems for which "char" is signed.
* src/dfa.c (FETCH_WC, FETCH): Produce an unsigned value, rather than
a sign-extended one. Fixes a bug on MS-Windows with compiling patterns
that include characters with the 8-th bit set.
(to_uchar): Define. From coreutils.
Reported by David Millis <[email protected]>.
See http://thread.gmane.org/gmane.comp.gnu.grep.bugs/3893
* NEWS (Bug fixes): Mention it.
---
 NEWS      |    5 +++++
 src/dfa.c |    9 +++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/NEWS b/NEWS
index 8578e82..2b06af4 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,11 @@ GNU grep NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  grep no longer mishandles high-bit-set pattern bytes on systems
+  where "char" is a signed type. [bug appears to affect only MS-Windows]
+
   grep now rejects a command like "grep -r pattern . > out",
   in which the output file is also one of the inputs,
   because it can result in an "infinite" disk-filling loop.
diff --git a/src/dfa.c b/src/dfa.c
index 8611435..dc87915 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -86,6 +86,11 @@
 /* Sets of unsigned characters are stored as bit vectors in arrays of ints. */
 typedef int charclass[CHARCLASS_INTS];
 
+/* Convert a possibly-signed character to an unsigned character. This is
+   a bit safer than casting to unsigned char, since it catches some type
+   errors that the cast doesn't. */
+static inline unsigned char to_uchar (char ch) { return ch; }
+
 /* Sometimes characters can only be matched depending on the surrounding
    context. Such context decisions depend on what the previous character
    was, and the value of the current (lookahead) character. Context
@@ -686,7 +691,7 @@ static unsigned char const *buf_end; /* reference to end in dfaexec(). */
         {                                       \
           cur_mb_len = 1;                       \
           --lexleft;                            \
-          (wc) = (c) = (unsigned char) *lexptr++; \
+          (wc) = (c) = to_uchar (*lexptr++);    \
         }                                       \
       else                                      \
         {                                       \
@@ -715,7 +720,7 @@ static unsigned char const *buf_end; /* reference to end in dfaexec(). */
         else                                    \
           return lasttok = END;                 \
       }                                         \
-    (c) = (unsigned char) *lexptr++;            \
+    (c) = to_uchar (*lexptr++);                 \
     --lexleft;                                  \
   } while(0)
 
--
1.7.7.rc0.362.g5a14
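
For anyone who wants to see the sign-extension effect in isolation, here is
a minimal sketch. It is not code from dfa.c. Only to_uchar is taken from the
patch; fetch_sign_extended, fetch_unsigned and check_fetch are hypothetical
names made up for this illustration, and "signed char" is used explicitly to
simulate a platform where plain "char" is signed. The values in the comments
assume the usual 8-bit, two's-complement char.

```c
#include <assert.h>

/* Same helper as in the patch: the implicit conversion of the return
   value yields a number in the range 0..255.  */
static inline unsigned char to_uchar (char ch) { return ch; }

/* Hypothetical: fetch a byte with no conversion, as happens when plain
   "char" is signed.  A byte with the high bit set becomes negative.  */
static int
fetch_sign_extended (const signed char *p)
{
  return *p;
}

/* Hypothetical: fetch a byte through to_uchar, so the result is never
   negative, whatever the signedness of plain "char".  */
static int
fetch_unsigned (const char *p)
{
  return to_uchar (*p);
}

/* Self-check on the single byte 0xE9.  */
static void
check_fetch (void)
{
  const char pattern[] = "\xe9";

  /* Sign-extended: 0xE9 is read as 233 - 256 = -23.  */
  assert (fetch_sign_extended ((const signed char *) pattern) == -23);

  /* Via to_uchar: the same byte is read as 233.  */
  assert (fetch_unsigned (pattern) == 233);
}
```

Calling check_fetch () from a main function should pass silently. A negative
value of this kind is presumably what goes wrong when it is later used as a
character code or as an index into a table sized for 0..255.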
