Re: awk: FS matching 0 or more characters

Stephane Chazelas Mon, 03 Feb 2020 23:09:00 -0800

2020-02-03 15:10:29 -0800, Don Cragun:
[...]
>       "The search for a matching sequence starts at the beginning
>       of a string and stops when the first sequence matching the
>       expression is found, where ``first'' is defined to mean
> *     ``begins earliest in the string''. If the pattern permits
> *     a variable number of matching characters and thus there is
> *     more than one such sequence starting at that point, the
> *     longest such sequence is matched. For example, the BRE
>       "bb*" matches the second to fourth characters of the
>       string "abbbc", and the ERE "(wee|week)(knights|night)"
>       matches all ten characters of the string "weeknights".
> 
> *     "Consistent with the whole match being the longest of the
> *     leftmost matches, each subpattern, from left to right,
> *     shall match the longest possible string. For this purpose,
> *     a null string shall be considered to be longer than no
> *     match at all. For example, matching the BRE "\(.*\).*"
> *     against "abcdef", the subexpression "(\1)" is "abcdef",
> *     and matching the BRE "\(a*\)*" against "bc", the
> *     subexpression "(\1)" is the null string.
> 
>       "When a multi-character collating element in a bracket
>       expression (see Section 9.3.5, on page 184) is involved,
>       the longest sequence shall be measured in characters
>       consumed from the string to be matched; that is, the
>       collating element counts not as one element, but as the
>       number of characters it matches."
> 
> Noting the part of this definition that is on lines shown above
> with a leading asterisk, I believe the standard is clear and
> that Busybox awk does not conform.
[...]


Not sure how you reached that conclusion. On the contrary, it
seems to me that that text alone would mean that busybox awk is
the only compliant implementation.

When it comes to sed or grep, all implementations agree with
busybox awk.

$ echo bbb | gsed 's/a*/<&>/g'
<>b<>b<>b<>
$ echo bbb | busybox sed 's/a*/<&>/g'
<>b<>b<>b<>
$ echo bbb | solaris-sed 's/a*/<&>/g'
<>b<>b<>b<>
$ echo bbb | solaris-xpg4-sed 's/a*/<&>/g'
<>b<>b<>b<>
$ echo aaa | grep 'b*'
aaa

The special behaviour of the original awk, mawk and gawk is,
AFAICT, an undocumented deviation, and seems to apply only to
FS processing (and split()).
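
A quick way to see which camp a given awk falls into is to count
the pieces split() produces for the same input and regexp as in
the sed examples above. This is only a sketch: count_fields is a
hypothetical helper name, and the count for 'a*' is deliberately
not shown, since it is the implementation-dependent part.

```shell
# Hypothetical helper: count_fields STRING ERE
# Prints how many pieces split() produces, using whichever awk
# comes first in $PATH.
count_fields() {
  awk -v s="$1" -v re="$2" 'BEGIN { print split(s, parts, re) }'
}

count_fields bbb 'a*'    # regexp can match the empty string: result varies by awk
count_fields b,b,b ','   # ordinary separator, for comparison: prints 3
```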

sub(), gsub(), match() and /.../ will all happily match an empty
string.

$ echo bbb | /usr/xpg4/bin/awk '{gsub(/a*/, "<&>"); print}'
<>b<>b<>b<>
$ echo bbb | gawk '{gsub(/a*/, "<&>"); print}'
<>b<>b<>b<>
$ echo bbb | mawk '{gsub(/a*/, "<&>"); print}'
<>b<>b<>b<>
$ echo bbb | gawk '/a*/'
bbb

To account for those implementations, POSIX should say that when
the split regexp matches an empty string, it's undefined whether
that empty string is taken as a field separator or ignored. In
either case, matching has to resume at the next character rather
than at the end of the matched text; otherwise it would loop
indefinitely (as ast-open's grep -o 'a*' does).
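
To illustrate the "resume at the next character" point, here is a
rough sketch of such a scanning loop written in awk itself. It is
not how any of the implementations above are written; mark_matches
and its variable names are hypothetical, and the rule of skipping
an empty match right after a non-empty one is an assumption made
to mimic the sed output shown earlier.

```shell
# Hypothetical helper: mark_matches STRING ERE
# Wraps each match of ERE in <...>, roughly like sed 's/ERE/<&>/g'.
mark_matches() {
  awk -v s="$1" -v re="$2" 'BEGIN {
    out = ""
    done_here = 0                    # 1 if an empty match here must not be reported
    while (s != "") {
      match(s, re)
      if (RLENGTH < 0) break         # no match in the remaining text
      p = (RSTART < 1) ? 1 : RSTART  # defensive clamp on the match position
      out = out substr(s, 1, p - 1)  # text before the match is copied as is
      if (RLENGTH > 0) {
        out = out "<" substr(s, p, RLENGTH) ">"
        s = substr(s, p + RLENGTH)   # resume after the matched text
        done_here = 1
      } else {
        # Empty match: report it, then copy one character and step past it.
        # Without this forced step the loop would never terminate.
        if (!(done_here && p == 1)) out = out "<>"
        done_here = (p > length(s))  # true if that empty match was at the end
        out = out substr(s, p, 1)
        s = substr(s, p + 1)
      }
    }
    # An empty match may still be possible at the very end of the input.
    if (s == "" && !done_here && "" ~ re) out = out "<>"
    print out s
  }'
}

mark_matches bbb 'a*'    # prints <>b<>b<>b<>
mark_matches aaa 'a*'    # prints <aaa>
```

Resuming at the end of the matched text instead (that is, not
moving at all after an empty match) is what makes a naive loop
spin forever.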

-- 
Stephane
