On 11/9/20 9:34 AM, JIang Yuancheng wrote:
grep -E “.*{10,}{10,}{10,}{10,}{10,}” can exhaust stack space then stack
overflow comes out. (Tested on latest version 3.6)
This is a longstanding issue with the regex matcher. I installed the
attached patch to document the issue better. Fortunately, the problem is
mostly limited to contrived examples.
>From f0d97db2a2104c5fd558178713054f3f267623b2 Mon Sep 17 00:00:00 2001
From: Paul Eggert <[email protected]>
Date: Fri, 27 Aug 2021 18:20:58 -0700
Subject: [PATCH] doc: document interval expression limitations
* doc/grep.texi (Basic vs Extended, Performance):
Document limitations of interval expressions (Bug#44538).
---
doc/grep.texi | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/doc/grep.texi b/doc/grep.texi
index b92ecb7..e5b9fd8 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1526,7 +1526,7 @@ before an interval expression's closing @samp{@}}, and an unmatched
@code{\)} is invalid.
Portable scripts should avoid the following constructs, as
-POSIX says they produce undefined results:
+POSIX says they produce unspecified results:
@itemize @bullet
@item
@@ -1541,6 +1541,8 @@ Empty alternatives (as in, e.g, @samp{a|}).
Repetition operators that immediately follow empty expressions,
unescaped @samp{$}, or other repetition operators.
@item
+Interval expressions containing repetition counts greater than 255.
+@item
A backslash escaping an ordinary character (e.g., @samp{\S}),
unless it is a back-reference.
@item
@@ -1965,6 +1967,17 @@ bracket expressions like @samp{[a-z]} and @samp{[[=a=]b]}, can be
surprisingly inefficient due to difficulties in fast portable access to
concepts like multi-character collating elements.
+@cindex interval expressions
+Interval expressions may be implemented internally via repetition.
+For example, @samp{^(a|bc)@{2,4@}$} might be implemented as
+@samp{^(a|bc)(a|bc)((a|bc)(a|bc)?)?$}. A large repetition count may
+exhaust memory or greatly slow matching. Even small counts can cause
+problems if cascaded; for example, @samp{grep -E
+".*@{10,@}@{10,@}@{10,@}@{10,@}@{10,@}"} is likely to overflow a
+stack. Fortunately, regular expressions like these are typically
+artificial, and cascaded repetitions do not conform to POSIX so cannot
+be used in portable programs anyway.
+
@cindex back-references
A back-reference such as @samp{\1} can hurt performance significantly
in some cases, since back-references cannot in general be implemented
--
2.30.2