
Austin Group Bug Tracker via austin-group-l at The Open Group Tue, 05 Apr 2022 13:23:44 -0700


A NOTE has been added to this issue. 
====================================================================== 
https://austingroupbugs.net/view.php?id=1551 
====================================================================== 
Reported By:                calestyo
Assigned To:                
====================================================================== 
Project:                    Issue 8 drafts
Issue ID:                   1551
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer 
Organization:                
User Reference:              
Section:                    Utilities, sed 
Page Number:                3132, ff. (in the draft) 
Line Number:                see below 
Final Accepted Text:         
====================================================================== 
Date Submitted:             2022-01-14 05:39 UTC
Last Modified:              2022-04-05 00:59 UTC
====================================================================== 
Summary:                    sed: ambiguities in the how BREs/EREs are
parsed/interpreted between delimiters (especially when these are special
characters)
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0001550 clarifications/ambiguities in the descr...
======================================================================


---------------------------------------------------------------------- 
 (0005780) calestyo (reporter) - 2022-04-05 00:59
 https://austingroupbugs.net/view.php?id=1551#c5780 
---------------------------------------------------------------------- 
(My sincere apologies for this having become so long.)


Geoff, I've looked at your proposal at
https://austingroupbugs.net/view.php?id=1550#c5761 and with respect to this
ticket I'd say the following:


I) with respect to my original point (1) here (i.e. how the string is
parsed, left-to-right in one pass vs. two passes):

People might argue that it follows implicitly from the fact that the
delimiter character c can appear in the string either as just c (and thus
be a delimiter) or as '\c' (and thus not be the delimiter). And since that
could also be preceded by any number of further '\', which would then be
part of the RE... they could argue that it is "clear" that the string
needs to be parsed in one pass.

However, in the draft (not Geoff's current proposal), page 3134, line
106087 merely says:
"If the character designated by c appears following a <backslash>, then it
shall be considered to be that literal character, which shall not terminate
the RE."

It doesn't use the terms "escape character/sequence"... it just says "c
following <backslash>".
Conversely, that means that c NOT following <backslash> is a delimiter,
right?!

Nothing directly forbids first looking for such a c NOT following
<backslash>, breaking up the string there and parsing the parts, or is
there anything that does?

Looking at e.g. 's(\\((X(' should show how this is ambiguous because the
parsing is not defined:
a) going left to right in one pass:
s(
          RE: \\      (i.e. the literal '\')
 (
 replacement: <empty>
 (
       flags: X(

b) two stage parsing, with looking for '( not preceded by <backslash>'
s(
          RE: \\(
 (
 replacement: X
 (
       flags: <empty>
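
To make the difference between (a) and (b) concrete, here is a minimal
Python sketch of the two splitting strategies. The function names
parse_one_pass and parse_two_stage are made up for illustration, and
neither is claimed to be what any real sed does. Both only split an
s-command into RE, replacement and flags, and both ignore bracket
expressions entirely (an assumption to keep the sketch short).

```python
def parse_one_pass(cmd):
    """(a) Scan left to right; a backslash always consumes the next character."""
    delim, body = cmd[1], cmd[2:]
    fields, cur, i = [], "", 0
    while i < len(body) and len(fields) < 2:
        if body[i] == "\\" and i + 1 < len(body):
            cur += body[i:i + 2]      # keep the escape pair together
            i += 2
        elif body[i] == delim:
            fields.append(cur)        # unescaped delimiter ends the field
            cur = ""
            i += 1
        else:
            cur += body[i]
            i += 1
    if len(fields) < 2:
        raise ValueError("unterminated s-command")
    # Everything after the second delimiter is the flags field.
    return fields[0], fields[1], body[i:]


def parse_two_stage(cmd):
    """(b) First cut at every delimiter whose preceding character is not '\\'."""
    delim, body = cmd[1], cmd[2:]
    cuts = [i for i, ch in enumerate(body)
            if ch == delim and (i == 0 or body[i - 1] != "\\")]
    if len(cuts) < 2:
        raise ValueError("unterminated s-command")
    a, b = cuts[0], cuts[1]
    return body[:a], body[a + 1:b], body[b + 1:]
```

For 's(\\((X(' the first function yields the RE '\\', an empty replacement
and the flags 'X(', while the second yields the RE '\\(', the replacement
'X' and empty flags, i.e. exactly the two readings listed above.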


With a lot of thinking around some edges, it was already vaguely implied by
the draft via:
- Page 3134, line 106085 "Both BREs and EREs shall also support the
following additions", which is probably intended to mean, that the
following bullet items (which included \n and \c) are all considered part
of the RE-language ... and thus have to be parsed with them in one step.


Geoff's current proposal makes this a little bit clearer by the sentences:
- "The BRE and ERE syntax shall additionally support escaping occurrences
of the delimiter within the RE with an unescaped <backslash> (except inside
a bracket expression)."
- "Within the RE and the replacement, the delimiter shall not terminate the
RE or replacement if it is preceded by an unescaped <backslash> (that is
not inside a bracket expression in the RE, where the delimiter does not
terminate the RE anyway - see [xref to Regular Expressions in sed])"


=> Still, as I propose in
https://austingroupbugs.net/view.php?id=1556#c5778 point (c), I'd make this
more clear by directly saying that sed's additions '\n' (for newlines) and
'\c' (for the escaped delimiter) are, with respect to sed, considered part
of the RE or replacement language, respectively... and that the whole
command string (context address or s-command, respectively) is parsed in
one go from left to right.




II) with respect to my original point (2a) here (i.e. what is an escaped
delimiter \c with c being special to the RE - special or literal?):

Geoff's current proposal solves this question with the sentences:
- "If the character designated by c is not special in a BRE or ERE
according to [xref to XBD 9.3] or [xref to XBD 9.4], respectively, the
escape sequence <backslash>c shall be treated as that literal character;
otherwise, it is unspecified whether the escape sequence <backslash>c is
treated as the literal character or the special character."

- "If the delimiter character is not special in a BRE or ERE according to
[xref to XBD 9.3] or [xref to XBD 9.4], respectively, the escape sequence
<backslash>delimiter shall be treated as that literal character in the RE;
otherwise, it is unspecified whether the escape sequence
<backslash>delimiter is treated as the literal character or the special
character. Likewise, if the delimiter character is not <ampersand> ('&'),
the escape sequence <backslash>delimiter shall be treated as that literal
character in the replacement; if it is <ampersand>, it is unspecified
whether the escape sequence <backslash>delimiter is treated as the literal
character or the special character (see below)."
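
As a reading aid, the two quoted sentences can be modelled roughly as
follows in Python. The character sets are my simplified assumption of what
XBD 9.3 and 9.4 list as special on their own (context-dependent cases such
as a leading '*' in a BRE are ignored), and the function names are
hypothetical. Characters like '(' in a BRE, which only become special when
escaped, come out as "literal" here merely because they are not in the
set; whether that is the intended reading is exactly the open question of
point (2c).

```python
# Assumption: simplified sets of characters that are special on their own.
BRE_SPECIAL = set(".[\\*^$")
ERE_SPECIAL = set(".[\\()*+?{|^$")


def escaped_delimiter_in_re(delim, ere=False):
    """Status of <backslash>delimiter inside the RE under the quoted wording."""
    special = ERE_SPECIAL if ere else BRE_SPECIAL
    return "unspecified" if delim in special else "literal"


def escaped_delimiter_in_replacement(delim):
    """Status of <backslash>delimiter inside the replacement."""
    return "unspecified" if delim == "&" else "literal"
```

Note that the answer depends on which RE flavour is in effect: with '+' as
the delimiter the sketch returns "literal" for a BRE but "unspecified" for
an ERE.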


Well, that "solves" (2a) (and the first half of my note 5607, which dealt
with the same problem for the replacement) by making it implementation
dependent...
So my usual question in that case:

Apart from GNU's vs. busybox' sed ... is it known whether any current (=
not older than 5 years and still maintained) sed implementations differ in
that behaviour?

BusyBox sed may simply change its behaviour (if persuaded ;-) )... I think
they usually try to follow GNU... and so the difference might simply be
some implementation coincidence.

In note #5757, Geoff mentioned HP-UX. Current versions? Are HP people here?
Is this really documented behaviour? Would they be willing to change?

What "we" standardise here will stay there for many decades or forever (and
cause much headache because of portability).

Therefore I think it would be worth trying to find out whether the
behaviour of any still relevant implementations could be unified,
and to have only *one* way standardised.

I mean it's better to have this explicitly defined as implementation
dependent than nothing... but it would be even better if all
implementations did just the same.




II) with respect to my original point (2b) here (i.e. should the standard
tell whether or not one can get the other meaning (e.g. special if it's
taken as literal)?):

Geoff's current proposal already describes this for one direction in the
APPLICATION USAGE:
"Applications that use a special RE character as a delimiter (for example,
'.' or '*') and need to use the delimiter as a literal character in the RE
should put it inside a bracket expression, as implementations differ
regarding whether escaping it with a <backslash> removes its special
meaning."


=> Perhaps adding something like:
"If an implementation considers such an escaped delimiter as the literal
character (as opposed to the special character), it is not possible to give
it its special meaning, except by using another delimiter."




III) with respect to my original point (2c) here (i.e. what about
characters, that get their special meaning with respect to the RE *or*
replacement only when escaped by a preceding backslash, e.g. for BREs '('
or '{'... and in implementation extensions e.g. '+' or 's'):

Geoff's current proposal doesn't mention that case, AFAIU. It uses the
wording "special character", but AFAIU, '(' is not considered a "special
character", right?

I gave some different kinds of example for this in:
- the original post of this ticket, point (2c)   <-- example for the RE
part
- note 5607 of this ticket, point (2c)           <-- example for the
replacement part
- the original post of this ticket, point (3)    <-- busybox example for
actually the same thing as in (2c)


=> I think, what Geoff's current proposal already "solves" for special
characters (by making it implementation defined) *still needs* to be solved
for characters that only become special when escaped by a preceding '\',
too.

AND

I think the specification must make it a MUST that implementations treat
it as the literal character.
E.g. for BREs, implementations MUST consider:
 s(\((X(
equivalent to:
 s/(/X/

Giving them the freedom to choose the special meaning, by making this
implementation defined, would have IMO the following problems:
- POSIX defines the escape sequence '\x' (for all characters x except some
like the specials or with sed '\n' and '\c' with c being the delimiter) to
be undefined.
- thus implementations started using this for their own purposes, e.g.
GNU's sed has '\s', '\S', '\W', '\>', '\+' and more.

If POSIX now allows e.g. '\W' in 'sW\WWxW' to be *either* special *or*
literal (and not just the latter), then people basically lose the ability
to use most possible delimiter characters in a portable way - because an
implementation could have extended the meaning of any '\x' (x as "defined"
above).

So even if one would follow just the POSIX rules... which effectively say:
- you may use 'W' as delimiter
- 'W' is not a special character
- 'W' is not a character like '(' that becomes special if escaped by '\'
- '\W' is undefined, except for the use as escaped delimiter

... and thereby rightfully assume, that '\W', when the delimiter is 'W',
would become the literal 'W'... they could actually get a special 'W'.

And that for every character that *any* implementation might have given an
extended meaning.
This is basically what I tried to describe in note #5611 above.

If the standard allowed the implementation to choose whether \c is the
literal c or the special c for characters c other than a very limited set
(namely the respective POSIX-defined special characters - excluding(!) any
implementation-defined characters that only become special when escaped),
then I think no delimiters could be safely used anymore in a portable way
(even '/'), because when it's used in the RE it's implementation dependent
whether it becomes special or not.
The only way around that would be to put any delimiter c in the RE into
its own bracket expression... but I guess that would get quite ugly and
many people would probably not know that this might be needed.


btw: For the replacement part, I think Geoff's current proposal already
does this (rather implicitly though):
"Likewise, if the delimiter character is not <ampersand> ('&'), the escape
sequence <backslash>delimiter shall be treated as that literal character in
the replacement; if it is <ampersand>, it is unspecified whether the escape
sequence <backslash>delimiter is treated as the literal character or the
special character (see below)."

That gives the freedom to choose whether \c is special or literal, ONLY for
the POSIX-defined special character & ... *any* other characters (including
1-9 and any additions an implementation might have made) are required to be
literal.


Geoff's current proposal does not explicitly allow an implementation to
choose the behaviour (literal vs. special) for delimiters whose character c
would only become special if escaped - but it doesn't explicitly forbid it
either (and that's what's IMO missing).

I just saw (at the end of my review) that Geoff's proposal already
points out this whole problem, in the added paragraph:
"Some historical sed implementations..."

Still I think we need to more explicitly rule this out outside of the
"RATIONALE" section.



=> So, I'd propose:

* Implementations MUST ALWAYS consider '\c' with a delimiter 'c' as the
literal character 'c', unless 'c' is a special character for BREs or EREs,
respectively.
*If* the final accepted resolution leaves it implementation defined for
special characters, then one could possibly amend that simply by saying
"For any 'c' that is not special (including those which would only become
special when escaped), '\c' MUST be considered as the literal 'c'."

=> e.g. with a sentence like  "If a character 'c' is used as delimiter that
is not a special character for BREs respectively EREs (as defined by
POSIX), the escape sequence '\c' must be considered to be the literal
character c, regardless of any special behaviour extending POSIX, that an
implementation would give to '\c' if another delimiter was used."


* I'm however not sure what should be done about those characters which
POSIX itself already mentions as characters that become special when
escaped, e.g. for BREs: '(' or '{' ... and with
https://austingroupbugs.net/view.php?id=1546 also '\?', '\+' and '\|'
(which *may* be special).

For them one could allow implementation dependent behaviour, because
they're defined by POSIX already, and so people know what they must exclude
to stay portable. But they cannot know for any '\w', '\s' or '\~'.

I'd probably recommend against it, and allow implementation defined
behaviour only for true special characters, which don't need to be escaped
to become special.

[I should note again, that at least BusyBox' sed would already break this,
see my original post here, point (2c).
Since it's conceptually the same as point (3), which has already been
fixed, it may be already fixed in current BusyBox versions, too.]


* I'm also unsure for the cases '\c' where c is any digit from 1-9:
- this may be in replacements (BRE and ERE)
- the RE part, too (only for BRE)  <--- see my note #5610 above for
details/examples

I think for BOTH, we need to make it a MUST that implementations treat
e.g. the escape sequence '\1' as the literal '1' (when '1' is the
delimiter).
For the replacement this is already the case with Geoff's current proposal
(which only allows the implementation to choose with '\&').


=> I'd also propose to add something like the following sentence also to
the standard (maybe APPLICATION USAGE?).
"If a digit from 1-9 is used as delimiter, it cannot be used as
back-reference in the replacement or a BRE's RE part."

=> I'd further suggest to add (probably also to APPLICATION USAGE) a list
for BRE and ERE, respectively, which lists all those characters that are
*not* safely usable as delimiter (because implementations may choose
literal vs. special) *if* the character is also part of the RE (and the
same for the replacement, where it's only '&').
If my above proposal is accepted, and depending on what's done for '\?'
'\+' and '\|', these would simply be the lists of all (truly) special
characters for BREs and EREs, respectively.
Such lists would help people to more easily understand what they can use
portably, without having to work it out from the rules and their complex
meaning.




IV) with respect to my original point (2d) here (i.e. what about
's.[.].X.')

As said before, I think it's already a bit clearer with Geoff's current
proposal that '\c', with c being the delimiter, would be considered part of
the RE language.

But 's.[.].X.' is different... as the 2nd '.' is not escaped.

However, with:
- what I propose in https://austingroupbugs.net/view.php?id=1556#c5778
point (c)
and
- the following sentence added with Geoff's current proposal:
'The delimiter character that precedes and follows the RE shall not
terminate the RE when it appears within a bracket expression. For example,
the context address "/[/]/" is equivalent to "/\//".'

it becomes IMO fully clear. So that point is solved (I also like the
paragraph you add in the APPLICATION USAGE, which describes how to use
that).

=> I would however use a simpler example:
"\%[%]%" is equivalent to "/[%]/"
or alternatively:
"/[/]/" is equivalent to "\%[/]%"

Well, whether it's simpler might be a matter of personal taste... but it
doesn't drop the bracket expression, which I think is better for showing
what's going on.
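
The rule behind these examples (a delimiter inside a bracket expression
does not end the RE) can be sketched in Python like this. It is only an
illustration under my own assumptions: the helper names skip_bracket and
context_address_re are invented, the address is assumed to start with '/'
or with '\' followed by the delimiter, and a delimiter of '[' is not
handled.

```python
def skip_bracket(s, i):
    """Return the index just past the bracket expression starting at s[i] == '['."""
    j = i + 1
    if j < len(s) and s[j] == "^":
        j += 1
    if j < len(s) and s[j] == "]":      # a leading ']' is an ordinary member
        j += 1
    while j < len(s):
        if s[j] == "[" and j + 1 < len(s) and s[j + 1] in ".=:":
            # [.coll.], [=equiv=] or [:class:] runs to its own closing pair
            close = s.find(s[j + 1] + "]", j + 2)
            if close < 0:
                raise ValueError("unterminated class or collating element")
            j = close + 2
        elif s[j] == "]":
            return j + 1
        else:
            j += 1
    raise ValueError("unterminated bracket expression")


def context_address_re(addr):
    """Return the RE text of a context address such as '/re/' or '\\%re%'."""
    if addr.startswith("\\"):
        delim, i = addr[1], 2
    elif addr.startswith("/"):
        delim, i = "/", 1
    else:
        raise ValueError("not a context address")
    start = i
    while i < len(addr):
        ch = addr[i]
        if ch == "\\" and i + 1 < len(addr):
            i += 2                       # escape pair, never a terminator
        elif ch == "[":
            i = skip_bracket(addr, i)    # delimiter inside [...] is ignored
        elif ch == delim:
            return addr[start:i]
        else:
            i += 1
    raise ValueError("unterminated RE")
```

With this, "\%[%]%" and "/[%]/" both give the RE text '[%]', and "/[/]/"
and "\%[/]%" both give '[/]', which is the sense in which the pairs above
are equivalent.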




V) With respect to the proposal at
https://austingroupbugs.net/view.php?id=1550#c5761

a) As already said in the other tickets, I'd put down the sentences
starting with "The BRE and ERE syntax shall additionally support escaping"
to the "Regular Expressions in sed" section again.

Sorry for giving that bad idea earlier, that it should be in "Addresses in
sed".

And perhaps overlapping parts of these sentences can be unified with the
ones added to the s-command. (Won't work for the replacement part,
though).


b) I wouldn't write:
    "is not special in a BRE or ERE"   <--- this exists in two locations
but rather
    "is not special in __the__ BRE __respectively__ ERE"
or something better.

The "or" could be interpreted e.g. the following way: we're in a BRE,
someone uses + as delimiter, while that isn't special in BRE it is in
ERE... so the "or" kicks in,... at least in my English understanding,
"respectively" would make it a tiny bit clearer that this (BRE vs. ERE)
depends on the respective case.
And yes, I've seen the ", respectively, " but I'd rather interpret that to
relate to BRE <-> [xref to XBD 9.3] and ERE <-> [xref to XBD 9.4].


c) Throughout your additions, you use e.g. "with an unescaped <backslash>
(except inside a bracket expression)" or similar.
If we make it more clear (as I proposed above):
- that '\c' and '\n' are considered part of the RE language
and with
- the changes made through some other issue, that clearly define "escape
sequence/character" for the RE language

... I think we could go back and just call that "escape 'c'" or "escape
sequence 'c'", though I would personally prefer to retain the parentheses
with a hint like "(there can't be escape characters/sequences inside
bracket expressions)"


d) Cosmetics:
In some places the wording "escape sequence <backslash>c" is used... but in
others e.g. "escape sequence '\n'".


e) Instead of:
"The delimiter character that precedes and follows the RE shall not
terminate the RE when it appears within a bracket expression. For example,
the context address "/[/]/" is equivalent to "/\//"."

I'd write:
"The delimiter character that precedes and follows the RE shall not
terminate the RE when it appears within a bracket expression __but be that
literal character for the bracket expression__. For example, the context
address "/[/]/" is equivalent to "/\//"."

It's nitpicking, but AFAIU, the delimiter character (unlike the escaped
delimiter character) is strictly speaking *not* part of the RE language.
So it's in principle still not 100% clear what happens with such character.
Sure, it doesn't terminate the RE... but it could be... ignored?


f) "Within the RE and the replacement, the delimiter shall not terminate
the RE or replacement if it is preceded by an unescaped <backslash> (that
is not inside a bracket expression in the RE, where the delimiter does not
terminate the RE anyway - see [xref to Regular Expressions in sed])."

In case this would be "unified" with the corresponding parts for the
context address in "Regular Expressions in sed"... the part for the
replacement would obviously need to stay.


g) "if it is <ampersand>, it is unspecified whether the escape sequence
<backslash>delimiter is treated as the literal character or the special
character (see below)."

=> one might just write '\&' here, since in that case "delimiter" is always
'&'.


h) "Applications that use a special RE character as a delimiter (for
example, '.' or '*') and need to use the delimiter as a literal character
in the RE should put it inside a bracket expression, as implementations
differ regarding whether escaping it with a <backslash> removes its special
meaning."

=> If my proposal (III) above is accepted, then I'd also repeat here
specifically e.g. "special RE character (which does not include those which
only become special when escaped) as a delimiter".

=> And perhaps something like "should put it inside a bracket expression
__with no other characters__" to make clear that one cannot re-use an
existing bracket expression: e.g. 'sX\X[0-9]XfooX' can NOT be written as
'sX[X0-9]XfooX' but only as 'sX[X][0-9]XfooX'.
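
The rewrite in that last example could be done mechanically, roughly as
in the following Python sketch. The function name bracket_literal_delims
is hypothetical, it works on the RE text only (already cut out of the
command), and its bracket handling is deliberately simplified: only an
optional leading '^' and a leading literal ']' are recognised, not
[:class:], [=equiv=] or [.coll.] constructs. Since it is unclear to me
whether '[^]' is well-defined, the sketch simply refuses '^' as delimiter.

```python
def bracket_literal_delims(re_text, delim):
    """Replace each escaped delimiter outside brackets by '[delim]'."""
    if delim == "^":
        raise ValueError("'[^]' may not be a well-defined bracket expression")
    out, i = [], 0
    while i < len(re_text):
        ch = re_text[i]
        if ch == "\\" and i + 1 < len(re_text):
            nxt = re_text[i + 1]
            # Escaped delimiter gets its own bracket expression; any other
            # escape pair is copied through unchanged.
            out.append("[" + nxt + "]" if nxt == delim else ch + nxt)
            i += 2
        elif ch == "[":
            # Copy an existing bracket expression unchanged (simplified scan).
            j = i + 1
            if j < len(re_text) and re_text[j] == "^":
                j += 1
            if j < len(re_text) and re_text[j] == "]":
                j += 1
            end = re_text.find("]", j)
            if end < 0:
                raise ValueError("unterminated bracket expression")
            out.append(re_text[i:end + 1])
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Called with the RE text '\X[0-9]' and the delimiter 'X', this gives
'[X][0-9]', i.e. the new bracket expression is placed next to the
existing one and is not merged into it.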

Question:
Are the following bracket expressions well-defined and portable:
- [^]
- [\]
?
At least '[^]' would fall under the above sentence ("Applications that use
a special RE character...")... '[\]' not really as '\' cannot be a
delimiter.

I tried to find this in 9.3.5 RE Bracket Expression,... and I guess '[\]'
is clearly well-defined and portable... but I cannot really follow this for
'[^]'... it seems not to be mentioned and I guess I'll report it in a
separate ticket.

=> But anyway,... the above sentence would need to exclude [^] then...

Or is there a way to safely escape this? I guess not, cause it's special
and thus implementations would be allowed to choose whether to treat '\^'
literal or special (when ^ is also the delimiter)... probably even
depending on the position of that escape sequence.


i) "Some historical sed implementations did not support escaping '(', ')',
'{', and '}' when used as a BRE"

Not sure, but this introduction with historical implementations gives kinda
the feeling that this problem would only exist because of historical
implementations and because of '(', ')', '{', and '}'.
However, AFAIU, we need to *generally* rule that out, and not just because
of historical implementations.

And with "that" I mean, implementations must not be allowed to choose
whether they give '\c' literal or special meaning, if 'c' is the delimiter,
and if 'c' alone wouldn't be special, but 'c' preceded by an escaping '\'
would be.




VI) not really related to this issue, but it would make things even more
complex if I add it in a separate ticket:

The description of the y-command contains on page 3138, line 106249:
"If the number of characters in string1 and string2 are not equal, or if
any of the characters in string1 appear more than once, the results are
undefined."

That is strictly speaking wrong, namely in the case when string1 and/or
string2 contain a '\'-escaped 'n' (for newline) or '\'-escaped delimiters,
and the numbers of such occurrences in the two strings differ.

=> Perhaps simply write "If the number of characters (after resolving any
escape sequences)..." or so? 
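
A small Python sketch of what "after resolving any escape sequences"
would mean for the count. The function name resolve_y_string is made up,
and the set of escapes it accepts ('\n', '\\' and the escaped delimiter)
is my reading of the y-command description; any other escape is rejected
here, and a lone trailing backslash is passed through as-is.

```python
def resolve_y_string(raw, delim):
    """Resolve the escape sequences of a y-command string1 or string2."""
    out, i = [], 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            if nxt == "n":
                out.append("\n")          # '\n' counts as one <newline>
            elif nxt == "\\" or nxt == delim:
                out.append(nxt)           # one literal backslash or delimiter
            else:
                raise ValueError("escape with undefined meaning: \\" + nxt)
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)
```

For 'y/a\nb/xyz/' the raw string1 'a\nb' has four characters, but after
resolving it has three, the same as string2 'xyz'. Taken literally, the
current wording would make that command undefined.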

Issue History 
Date Modified    Username       Field                    Change               
====================================================================== 
2022-01-14 05:39 calestyo       New Issue                                    
2022-01-14 05:39 calestyo       Name                      => Christoph Anton Mitterer
2022-01-14 05:39 calestyo       Section                   => Utilities, sed  
2022-01-14 05:39 calestyo       Page Number               => 3132, ff. (in the draft)
2022-01-14 05:39 calestyo       Line Number               => see below       
2022-01-14 06:34 Don Cragun     Relationship added       related to 0001550  
2022-01-14 06:48 Don Cragun     Project                  1003.1(2016/18)/Issue7+TC2 => Issue 8 drafts
2022-01-14 06:51 Don Cragun     Note Added: 0005602                          
2022-01-14 06:51 Don Cragun     version                   => Draft 2.1       
2022-01-14 15:52 calestyo       Note Added: 0005607                          
2022-01-14 20:36 calestyo       Note Added: 0005610                          
2022-01-14 21:40 calestyo       Note Added: 0005611                          
2022-01-14 21:48 calestyo       Note Added: 0005612                          
2022-01-14 22:07 calestyo       Note Edited: 0005612                         
2022-01-14 22:08 calestyo       Note Edited: 0005612                         
2022-01-14 22:09 calestyo       Note Edited: 0005612                         
2022-01-14 22:15 calestyo       File Added: summary-of-literal-behaviour-gnu-vs-busybox.txt
2022-01-14 22:17 calestyo       Note Added: 0005613                          
2022-01-18 21:18 calestyo       Note Added: 0005627                          
2022-01-24 14:52 calestyo       Note Added: 0005634                          
2022-01-24 15:48 calestyo       Note Edited: 0005634                         
2022-02-01 20:12 calestyo       Note Added: 0005648                          
2022-03-18 14:59 geoffclare     Note Added: 0005757                          
2022-04-02 09:15 kre            Note Added: 0005774                          
2022-04-05 00:59 calestyo       Note Added: 0005780                          
======================================================================

