Re: GNU and BSD sed differences

2005-12-12 Thread Paul Eggert
Werner LEMBERG <[EMAIL PROTECTED]> writes:

> I suggest to add that `\?', `\+', and `\|' should not be used in sed
> expressions

Thanks for suggesting that.  The problem is a bit more general, so I
installed the following:

2005-12-12  Paul Eggert  <[EMAIL PROTECTED]>

* doc/autoconf.texi (Limitations of Usual Tools):
Mention which characters can be escaped with \ in portable regular
expressions used in grep, sed, expr.  Mention the leading ^ problem
with expr.  Clean up some confusing wording.  Mention which
grep options are portable.

--- autoconf.texi   2 Dec 2005 19:19:23 -0000   1.935
+++ autoconf.texi   12 Dec 2005 18:46:51 -0000  1.936
@@ -11891,6 +11891,10 @@ replacement @code{grep -E}.  Also, some 
 not work on long input lines.  To work around these problems, invoke
 @code{AC_PROG_EGREP} and then use @code{$EGREP}.
 
+Portable extended regular expressions should use @samp{\} only to escape
+characters in the string @samp{$()[EMAIL PROTECTED]|}.  For example, @[EMAIL 
PROTECTED]
+is not portable, even though it typically matches @[EMAIL PROTECTED]
+
 The empty alternative is not portable, use @samp{?} instead.  For
 instance with Digital Unix v5.0:
 
@@ -11945,8 +11949,15 @@ Avoid this portability problem by avoidi
 @item @command{expr} (@samp{:})
 @c 
 @prindex @command{expr}
-Don't use @samp{\?}, @samp{\+} and @samp{\|} in patterns, as they are
-not supported on Solaris.
+Portable @command{expr} regular expressions should use @samp{\} to
+escape only characters in the string @samp{$()[EMAIL PROTECTED]@}}.
+For example, alternation, @samp{\|}, is common but Posix does not
+require its support, so it should be avoided in portable scripts.
+Similarly, @samp{\+} and @samp{\?} should be avoided.
+
+Portable @command{expr} regular expressions should not begin with
+@samp{^}.  Patterns are automatically anchored so leading @samp{^} is
+not needed anyway.
 
 The Posix standard is ambiguous as to whether
 @samp{expr 'a' : '\(b\)'} outputs @samp{0} or the empty string.
@@ -12045,6 +12056,12 @@ while @acronym{GNU} @command{find} repor
 @item @command{grep}
 @c -
 @prindex @command{grep}
+Portable scripts can rely on the @command{grep} options @option{-c},
+@option{-l}, @option{-n}, and @option{-v}, but should avoid other
+options.  For example, don't use @option{-w}, as Posix does not require
+it and Irix 6.5.16m's @command{grep} does not support it.
+
+Some of the options required by Posix are not portable in practice.
 Don't use @samp{grep -q} to suppress output, because many @command{grep}
 implementations (e.g., Solaris) do not support @option{-q}.
 Don't use @samp{grep -s} to suppress output either, because Posix
@@ -12070,12 +12087,17 @@ grep 'foo
 bar' in.txt
 @end example
 
-Alternation, @samp{\|}, is common but Posix does not require its
+Traditional @command{grep} implementations (e.g., Solaris) do not
+support the @option{-E} or @samp{-F} options.  To work around these
+problems, invoke @code{AC_PROG_EGREP} and then use @code{$EGREP}, and
+similarly for @code{AC_PROG_FGREP} and @code{$FGREP}.
+
+Portable @command{grep} regular expressions should use @samp{\} only to
+escape characters in the string @samp{$()[EMAIL PROTECTED]@}}.  For example,
+alternation, @samp{\|}, is common but Posix does not require its
 support in basic regular expressions, so it should be avoided in
 portable scripts.  Solaris @command{grep} does not support it.
-
-Don't rely on @option{-w}, as Irix 6.5.16m's @command{grep} does not
-support it.
+Similarly, @samp{\+} and @samp{\?} should be avoided.
 
 
 @item @command{join}
@@ -12264,8 +12286,8 @@ Patterns should not include the separato
 of a character class.  In conformance with Posix, the Cray
 @command{sed} will reject @samp{s/[^/]*$//}: use @samp{s,[^/]*$,,}.
 
-Avoid empty patterns within parentheses (i.e., @samp{\(\)}).  Posix is
-silent on whether they are allowed, and Unicos 9 @command{sed} rejects
+Avoid empty patterns within parentheses (i.e., @samp{\(\)}).  Posix does
+not require support for empty patterns, and Unicos 9 @command{sed} rejects
 them.
 
 Unicos 9 @command{sed} loops endlessly on patterns like @samp{.*\n.*}.
@@ -12273,21 +12295,25 @@ Unicos 9 @command{sed} loops endlessly o
 Sed scripts should not use branch labels longer than 8 characters and
 should not contain comments.
 
-Don't include extra @samp{;}, as some @command{sed}, such as Net@acronym{BSD}
-1.4.2's, try to interpret the second as a command:
+Avoid redundant @samp{;}, as some @command{sed} implementations, such as
+Net@acronym{BSD} 1.4.2's, incorrectly try to interpret the second
+@samp{;} as a command:
 
 @example
 $ @kbd{echo a | sed 's/x/x/;;s/x/x/'}
 sed: 1: "s/x/x/;;s/x/x/": invalid command code ;
 @end example
 
-Input should have reasonably long lines, since some @command{sed} have
-an input buffer limited to 4000 bytes.
+Input should not have unreasonably

GNU and BSD sed differences

2005-12-12 Thread Werner LEMBERG

I suggest to add that `\?', `\+', and `\|' should not be used in sed
expressions because other sed implementations don't interpret those
entities specially.  Although they are marked as GNU extensions in
sed.info, that is easy to miss.
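
As a rough illustration of the point (a sketch, not text from the
Autoconf manual), each of these GNU extensions can usually be rewritten
with Posix basic regular expressions only.  The sample inputs and the
variable names are made up for the example, and the interval form
assumes a sed that implements Posix \{m,n\} intervals:

```shell
# GNU-only:  sed 's/ab\+c/X/'
# Portable:  write the atom once, then the same atom followed by '*'.
plus=$(echo 'abbbc' | sed 's/abb*c/X/')

# GNU-only:  sed 's/ab\?c/X/g'
# Portable:  a Posix interval expression meaning "zero or one".
opt=$(echo 'ac abc' | sed 's/ab\{0,1\}c/X/g')

# GNU-only:  sed 's/cat\|dog/pet/g'
# Portable:  one s command per alternative.
alt=$(echo 'cat and dog' | sed -e 's/cat/pet/g' -e 's/dog/pet/g')

echo "$plus"
echo "$opt"
echo "$alt"
```

Splitting an alternation into separate s commands only works when the
replacement text cannot itself be matched by a later command, so that
rewrite needs a quick check case by case.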


Werner