Re: [Issue 8 drafts 0001550]: clarifications/ambiguities in the description of context addresses and their delimiters for sed

Geoff Clare via austin-group-l at The Open Group Tue, 05 Apr 2022 07:55:12 -0700

Replying to a whole bunch of bugnotes here, including two from 1551.
Together they are very long, so I've only quoted the minimum necessary
to give context to my replies.


> ---------------------------------------------------------------------- 
>  (0005771) calestyo (reporter) - 2022-04-02 01:53
>  https://austingroupbugs.net/view.php?id=1550#c5771 
> ---------------------------------------------------------------------- 

> So maybe, in "Addresses in sed" we should better *only* describe the \cREc
> form of these,... and link to "Regular Expressions in sed" for how
> delimiters are escaped?

I think it works better the way I have it (which you said you could live
with).  The s command and y command need to describe delimiter escaping in
the replacement as well, so I think it makes sense to keep all delimiter
escaping together for those.

> (Ic), (Id) as well as my original (2b) would be fixed, if we'd write
> something like:
> "When the delimiter character c is <slash>, a context address \/RE/ can
> also be written as /RE/." (or something similar but better).
> That would make it clear that \/RE/ is allowed and identical to /RE/ and at
> the same time define /RE/.

I'd suggest just changing the "example" part of my proposal.
I.e. instead of:

    For example, the context address "\xabc\xdefx" is equivalent to
    "/abcxdef/".

it could say:

    The construction "\cREc" does not need to be used when the delimiter
    is a <slash>; for example, the context address "\xabc\xdefx" is
    equivalent to "/abcxdef/".
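As a rough illustration of that equivalence (not part of the proposed
text), here is a minimal Python sketch. The function name
context_address_re is hypothetical, and the sketch assumes the delimiter
is not a special RE character and ignores bracket expressions entirely.

```python
def context_address_re(address: str) -> str:
    # Accepts either /RE/ or \cREc and returns the RE text with every
    # <backslash>delimiter pair reduced to the delimiter itself.
    # Simplifying assumptions: no bracket expressions, and the delimiter
    # is not special in an RE.
    if address.startswith("\\"):
        delim, body = address[1], address[2:]
    elif address.startswith("/"):
        delim, body = "/", address[1:]
    else:
        raise ValueError("not a context address")

    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            # \delimiter stands for the delimiter; other escapes pass through.
            out.append(nxt if nxt == delim else ch + nxt)
            i += 2
        elif ch == delim:
            return "".join(out)
        else:
            out.append(ch)
            i += 1
    raise ValueError("unterminated context address")


# Both spellings from the example yield the same RE text.
print(context_address_re(r"\xabc\xdefx"))  # abcxdef
print(context_address_re("/abcxdef/"))     # abcxdef
```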

> Oh, and if you should change your proposed text,... could you please always
> make a new post

I expect that too much will change to make it reasonable to edit in
place anyway.

> ---------------------------------------------------------------------- 
>  (0005775) kre (reporter) - 2022-04-02 09:37
>  https://austingroupbugs.net/view.php?id=1550#c5775 
> ---------------------------------------------------------------------- 

> Then, in a subsequent sentence, or perhaps even paragraph, say something
> like
>     Note: even if escaped, the characters <backslash> and <newline> cannot
>     be used as delimiter characters. <backslash> does not work, [...]
>     <newline> does not work either, as if not escaped, it [...]

(see below)

> ---------------------------------------------------------------------- 
>  (0005777) calestyo (reporter) - 2022-04-02 19:47
>  https://austingroupbugs.net/view.php?id=1550#c5777 
> ---------------------------------------------------------------------- 
> That indented paragraph of yours (in Note 0005775) should (if at all) only
> go to the Rationale, IMO. At least the part which describes *why*
> <backslash> and <newline> cannot be used.

I'll put something similar in rationale.

> ---------------------------------------------------------------------- 
>  (0005774) kre (reporter) - 2022-04-02 09:15
>  https://austingroupbugs.net/view.php?id=1551#c5774 
> ---------------------------------------------------------------------- 

> About being an editorial change, I agree, but I think it would be better
> if
> it were changed to be "escape character" rather than "unescaped
> <backslash>".

Okay, I'll see what I can do.  It may make sense to use the new
definition of "escape sequence" from bug 1546.

It won't be possible in the y command, as that doesn't use an RE (so
would need its own definition of "escape character").

> ---------------------------------------------------------------------- 
>  (0005780) calestyo (reporter) - 2022-04-05 00:59
>  https://austingroupbugs.net/view.php?id=1551#c5780 
> ---------------------------------------------------------------------- 
[This one has been edited since it was sent to the mailing list, so
the quotes below are copied from Mantis instead of the email.]

> Still, as I propose in https://austingroupbugs.net/view.php?id=1556#c5778
> point (c) I'd make this more clear by directly saying, that sed's
> additions '\n' (for newlines) and '\c' (for escaped delimiter) are -
> with respect to sed, considered part of the RE respectively replacement
> language... and that the whole command string (context address
> respectively s-command) is parsed in one go from left to right.

We can't specify parsing in one go - that's an internal implementation
detail.  In fact, any sed implementation that wants to use the standard
regcomp() and regexec() functions to do the RE matching will need to do
a separate pass to produce the RE to give to regcomp().

What matters is that the delimiter can only be escaped with an
_unescaped_ backslash, and that it doesn't end the RE when it is in a
bracket expression. I believe my proposal makes both of those things
clear.
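To make those two rules concrete, here is a hedged Python sketch of the
kind of scan an implementation might perform to find where the RE ends.
It is only one possible approach, not a description of any real sed. The
names find_re_end and _skip_bracket are hypothetical, and the sketch
assumes <backslash> has no special meaning inside a bracket expression.

```python
def _skip_bracket(script: str, i: int) -> int:
    # script[i] is '['; return the index just past the matching ']'.
    n = len(script)
    j = i + 1
    if j < n and script[j] == "^":
        j += 1
    if j < n and script[j] == "]":
        j += 1  # a leading ']' is an ordinary list member
    while j < n:
        if script[j] == "[" and j + 1 < n and script[j + 1] in ".=:":
            # [. .], [= =] and [: :] run to their own closing pair.
            close = script.find(script[j + 1] + "]", j + 2)
            if close == -1:
                break
            j = close + 2
        elif script[j] == "]":
            return j + 1
        else:
            j += 1
    raise ValueError("unterminated bracket expression")


def find_re_end(script: str, start: int, delim: str) -> int:
    # Return the index of the delimiter that terminates the RE which
    # begins at script[start].
    i = start
    while i < len(script):
        ch = script[i]
        if ch == "\\":
            i += 2  # whatever follows an unescaped backslash cannot terminate
        elif ch == delim:
            return i
        elif ch == "[":
            i = _skip_bracket(script, i)
        else:
            i += 1
    raise ValueError("unterminated RE")


for script, start in [("/[/]/", 1), ("s.[.].X.", 2), (r"/a\\/b/", 1)]:
    end = find_re_end(script, start, script[start - 1])
    print(script[start:end])
# prints:
# [/]
# [.]
# a\\
```

In the last case the first backslash escapes the second, so the slash
that follows is unescaped and ends the RE.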

> Apart from GNU's vs. busybox' sed ... is it known whether any current
> (= not older than 5 years and still maintained) sed implementations
> differ in that behaviour?
> 
> BusyBox sed may simply change its behaviour (if persuaded ;-) )...
> I think they usually try to follow GNU... and so the difference might
> be simply some implementation coincidence.
> 
> In note #5757, Geoff mentioned HP-UX.

In that note I also mentioned Solaris and macOS, and I stand by my
statement that "Since there is no clear winner, POSIX should (explicitly)
allow both behaviours."  Even if Busybox were to change sides, I expect
AIX is very likely to be the same as Solaris and HP-UX.  It's probably
an ancient SysV v. BSD thing.

I am quite sure that any attempt to require one side or the other to
change would not achieve consensus, so please drop this.  It really is
hardly any limitation on applications if they need to avoid using
special RE characters as delimiters in order to be portable - especially
since the portability issue only occurs if you try to escape the
delimiter.

> I think the specification must make it a MUST that implementations
> use that as the literal character. E.g. for BREs, applications MUST
> consider:
>  s(\((X(
> equivalent to:
>  s/(/X/

My proposal in note 5761 does that, courtesy of:

    If the delimiter character is not special in a BRE or ERE
    according to [xref to XBD 9.3] or [xref to XBD 9.4], respectively,
    the escape sequence <backslash>delimiter shall be treated as that
    literal character in the RE

I included the "according to" bit to make sure there's absolutely
no doubt about which characters it applies to.  If a character is not
listed as special in the relevant section, the text applies.

> Geoff's current proposal does not explicitly allow an implementation
> to choose the behaviour (literal vs. special) for delimiters whose
> character c would only become special if escaped - but it doesn't
> explicitly forbid it either (and that's what's IMO missing).
>
> I just saw (at the end of my review) that Geoff's proposal already
> indicates this whole problem, in the added paragraph:
> "Some historical sed implementations..."
> 
> Still I think we need to more explicitly rule this out outside of
> the "RATIONALE" section.

The proposed normative text clearly forbids it, and the rationale
points out that it forbids it.  I see no reason to do anything more
here.

> * I'm also unsure for the cases '\c' where c is any digit from 1-9:
> - this may be in replacements (BRE and ERE)
> - the RE part, too (only for BRE) <--- see my note #5610 above for
> details/examples
> 
> I think for BOTH, we need to make it a MUST, that applications treat
> e.g. the escape sequence '\1' as the literal one.
> For the replacement this is already the case with Geoff's current
> proposal (which only allows the implementation to choose with '\&').

It's already the case for the RE as well, as per the above discussion
of \( etc.

> But 's.[.].X.' is different... as the 2nd '.' is not escaped.

That's why I included this bullet item in my proposal:

    The delimiter character that precedes and follows the RE shall not
    terminate the RE when it appears within a bracket expression.

> b) I wouldn't write:
>     "is not special in a BRE or ERE" <--- this exists in two locations
> but rather
>     "is not special in __the__ BRE __respectively__ ERE"
> or something better.
> 
> The "or" could be interpreted e.g. the following way: we're in a BRE,
> someone uses + as delimiter, while that isn't special in BRE it is
> in ERE... so the "or" kicks in,... at least in my English understanding,
> "respectively" would make it a tiny bit clearer that this (BRE vs. ERE)
> depends on the respective case.
> And yes, I've seen the ", respectively, " but I'd rather interpret
> that to relate to BRE <-> [xref to XBD 9.3] and ERE <-> [xref to XBD 9.4].

I think the best way to address this would be to include mention of
whether -E is specified:

    ... in a BRE (if the <b>-E</b> option is not specified) or ERE (if
    <b>-E</b> is specified) ...
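The intent can be sketched as a small Python predicate. This is only an
illustration: the function name is hypothetical, and the two character
sets are my reading of the special-character lists in XBD 9.3 and
XBD 9.4, so treat them as an assumption rather than as normative.

```python
# Assumed from XBD 9.3 (BRE) and XBD 9.4 (ERE): characters that are
# special when they appear unescaped outside a bracket expression.
BRE_SPECIAL = set(".[\\*^$")
ERE_SPECIAL = set(".[\\()*+?{|^$")


def escaped_delimiter_must_be_literal(delim: str, extended: bool) -> bool:
    # True when <backslash>delimiter is required to mean the literal
    # delimiter character; `extended` models whether -E was specified.
    special = ERE_SPECIAL if extended else BRE_SPECIAL
    return delim not in special


print(escaped_delimiter_must_be_literal("(", extended=False))  # True
print(escaped_delimiter_must_be_literal("+", extended=False))  # True
print(escaped_delimiter_must_be_literal("+", extended=True))   # False
```

So with '+' as the delimiter, the requirement applies without -E but not
with it, which is the case-by-case reading the wording is meant to
convey.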

> I think we could go back and just call that "escape 'c'" or "escape
> sequence 'c'", though I would personally prefer to retain the
> parentheses with a hint like "(there can't be escape
> characters/sequences inside bracket expressions)"

See my reply above to kre's note 5774.  Rather than mention bracket
expressions in parentheses, I'd prefer to reference XBD 9.1 (as updated
by bug 1546) for the details of how/where escaping works.

> d) Cosmetics:
> In some places the wording "escape sequence <backslash>c" is used...
> but in others e.g. "escape sequence '\n'".

That's intentional, and is because the "c" in "<backslash>c" is italic,
to indicate that it stands for any character. 

> e) Instead of:
> "The delimiter character that precedes and follows the RE shall not
> terminate the RE when it appears within a bracket expression. For
> example, the context address "/[/]/" is equivalent to "/\//"."
> 
> "The delimiter character that precedes and follows the RE shall not
> terminate the RE when it appears within a bracket expression __but
> be that literal character for the bracket expression__. For example,
> the context address "/[/]/" is equivalent to "/\//"."

It might be worth altering this somehow, but "literal" is wrong
(specifically if the delimiter is '^' or '-', or things like ':' in
[[:alpha:]]).

> g) "if it is <ampersand>, it is unspecified whether the escape
> sequence <backslash>delimiter is treated as the literal character or
> the special character (see below).
>
> => one might just write '\&' here, since in that case "delimiter" is
> always '&'.

I considered doing that when I wrote the text, but decided I preferred
to have the text match the three other uses of <backslash>delimiter in
that paragraph.

> => And perhaps something like "should put it inside a bracket
> expression __with no other characters__" to make clear, that one
> cannot re-use one e.g. 'sX\X[0-9]XfooX' can NOT be written as
> 'sX[X0-9]XfooX' but only as 'sX[X][0-9]XfooX'.

Incorrect. Under the bracket expression bullet item quoted above,
sX[X0-9]XfooX is required to "work".

> => But anyway,... the above sentence would need to exclude [^] then...

Red herring. See my comment in bug 1575.

> i) "Some historical sed implementations did not support escaping
> '(', ')', '{', and '}' when used as a BRE"
> 
> Not sure, but this introduction with historical implementations
> gives kinda the feeling that this problem would only exist because
> of historical implementations and because of '(', ')', '{', and '}'.
> However, AFAIU, we need to *generally* rule that out, and not just
> because of historical implementations.
> 
> And with "that" I mean, implementations must not be allowed to
> choose whether they give '\c' literal or special meaning, if 'c' is
> the delimiter, and if 'c' alone wouldn't be special, but 'c'
> preceded by an escaping '\' would be.

The proposed normative text does exactly that.  The rationale is just
there to point out that the normative text disallows some historical
behaviour.

> VI) not really related to this issue, but it would make things even
> more complex if I add it in a separate ticket:
> 
> The description of the y-command contains on page 3138, line 106249:
> "If the number of characters in string1 and string2 are not equal,
> or if any of the characters in string1 appear more than once, the
> results are undefined."
> 
> That is strictly speaking wrong, namely in the case when string1
> and/or string2 contains '\'-escaped 'n' (for newline) or a '\'-escaped
> delimiters, and the number of occurrences in both strings don't even out.
> 
> => Perhaps simply write "If the number of characters (after
> resolving any escape sequences)..." or so?

That part of the y command description is not being touched by the
changes to fix 1550 and 1551, so it would be best to raise a separate
bug for this.
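For whoever raises that bug, the counting issue can be illustrated with
a short Python sketch. The function name resolve_y_string is
hypothetical, and the escape rules in it are my reading of the current
y command description ('\n' is a <newline>, '\\' is a <backslash>,
<backslash>delimiter is the delimiter, anything else is undefined).

```python
def resolve_y_string(raw: str, delim: str) -> str:
    # Resolve the escape sequences in one string of a y command so that
    # characters can be counted after resolution.
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("trailing backslash")
        nxt = raw[i + 1]
        if nxt == "n":
            out.append("\n")   # \n is a single <newline>
        elif nxt == "\\":
            out.append("\\")   # \\ is a single <backslash>
        elif nxt == delim:
            out.append(delim)  # escaped delimiter is the delimiter itself
        else:
            raise ValueError("undefined escape sequence")
        i += 2
    return "".join(out)


# In y/a\nb/xyz/ the raw string1 has four characters but three after
# resolving the escape sequence, matching the three in string2.
string1 = resolve_y_string(r"a\nb", "/")
string2 = resolve_y_string("xyz", "/")
print(len(r"a\nb"), len(string1), len(string2))  # 4 3 3
```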

-- 
Geoff Clare <g.cl...@opengroup.org>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
