A NOTE has been added to this issue.
======================================================================
https://austingroupbugs.net/view.php?id=1551
======================================================================
Reported By:                calestyo
Assigned To:
======================================================================
Project:                    Issue 8 drafts
Issue ID:                   1551
Category:                   Shell and Utilities
Type:                       Clarification Requested
Severity:                   Objection
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    Utilities, sed
Page Number:                3132, ff. (in the draft)
Line Number:                see below
Final Accepted Text:
======================================================================
Date Submitted:             2022-01-14 05:39 UTC
Last Modified:              2022-04-05 00:59 UTC
======================================================================
Summary:                    sed: ambiguities in the how BREs/EREs are parsed/interpreted between delimiters (especially when these are special characters)
======================================================================
Relationships       ID      Summary
----------------------------------------------------------------------
related to          0001550 clarifications/ambiguities in the descr...
======================================================================
----------------------------------------------------------------------
 (0005780) calestyo (reporter) - 2022-04-05 00:59
 https://austingroupbugs.net/view.php?id=1551#c5780
----------------------------------------------------------------------
(My sincere apologies for this having become so long.)

Geoff, I've looked at your proposal at https://austingroupbugs.net/view.php?id=1550#c5761 and with respect to this ticket I'd say the following:

I) with respect to my original point (1) here (i.e. how the string is parsed, left-to-right in one pass vs. two passes):

People might argue that it kind of implicitly follows from the fact that the delimiter character c can be in the string as just c (and thus be a delimiter) or as '\c' (and thus not be the delimiter). And since that could also be preceded by any further '\', which would then be part of the RE... they could argue that it is "clear" that the string needs to be parsed in one pass.

However, in the draft (not Geoff's current proposal), page 3134, line 106087 merely says:
"If the character designated by c appears following a <backslash>, then it shall be considered to be that literal character, which shall not terminate the RE."

It doesn't use the terms "escape character/sequence"... it just says "c following <backslash>". Conversely, that means that c NOT following <backslash> is a delimiter, right?!
Nothing directly forbids first looking for such c NOT following <backslash>, breaking up the string there and parsing the parts, or is there anything that does?

Looking at e.g. 's(\\((X(' should show how this is ambiguous because the parsing is not defined:
a) going left to right in one pass:
   s(
   RE: \\ (i.e. the literal '\')
   (
   replacement: <empty>
   (
   flags: X(
b) two-stage parsing, looking for '( not preceded by <backslash>':
   s(
   RE: \\(
   (
   replacement: X
   (
   flags: <empty>

With a lot of thinking around some edges, it was already vaguely implied by the draft via:
- Page 3134, line 106085 "Both BREs and EREs shall also support the following additions", which is probably intended to mean that the following bullet items (which included \n and \c) are all considered part of the RE language ... and thus have to be parsed with it in one step.

Geoff's current proposal makes this a little bit clearer by the sentences:
- "The BRE and ERE syntax shall additionally support escaping occurrences of the delimiter within the RE with an unescaped <backslash> (except inside a bracket expression)."
- "Within the RE and the replacement, the delimiter shall not terminate the RE or replacement if it is preceded by an unescaped <backslash> (that is not inside a bracket expression in the RE, where the delimiter does not terminate the RE anyway - see [xref to Regular Expressions in sed])"

=> Still, as I propose in https://austingroupbugs.net/view.php?id=1556#c5778 point (c), I'd make this clearer by directly saying that sed's additions '\n' (for newlines) and '\c' (for the escaped delimiter) are, with respect to sed, considered part of the RE or replacement language, respectively... and that the whole command string (context address or s-command, respectively) is parsed in one go from left to right.

II) with respect to my original point (2a) here (i.e. what is an escaped delimiter \c with c being special to the RE - special or literal?):

Geoff's current proposal solves this question with the sentences:
- "If the character designated by c is not special in a BRE or ERE according to [xref to XBD 9.3] or [xref to XBD 9.4], respectively, the escape sequence <backslash>c shall be treated as that literal character; otherwise, it is unspecified whether the escape sequence <backslash>c is treated as the literal character or the special character."
- "If the delimiter character is not special in a BRE or ERE according to [xref to XBD 9.3] or [xref to XBD 9.4], respectively, the escape sequence <backslash>delimiter shall be treated as that literal character in the RE; otherwise, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character. Likewise, if the delimiter character is not <ampersand> ('&'), the escape sequence <backslash>delimiter shall be treated as that literal character in the replacement; if it is <ampersand>, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character (see below)."

Well, that "solves" (2a) (and the first half of my note 5607, which dealt with the same problem for the replacement) by making it implementation dependent...

So my usual question in that case:
Apart from GNU's vs. BusyBox' sed ... is it known whether any current (= not older than 5 years and still maintained) sed implementations differ in that behaviour?
BusyBox sed may simply change its behaviour (if persuaded ;-) )... I think they usually try to follow GNU... and so the difference might simply be some implementation coincidence.
In note #5757, Geoff mentioned HP-UX. Current versions? Are HP people here? Is this really documented behaviour? Would they be willing to change?
What "we" standardise here will stay there for many decades or forever (and cause much headache because of portability). Therefore I think it would be worth trying to find out whether the behaviour of any still relevant implementations could be unified, and to have only *one* way standardised.
I mean, it's better to have this explicitly defined as implementation dependent than nothing... but it would be even better if all implementations did just the same.

II) with respect to my original point (2b) here (i.e. should the standard tell whether or not one can get the other meaning (e.g. special if it's taken literally)?):

Geoff's current proposal already describes this for one direction in the APPLICATION USAGE:
"Applications that use a special RE character as a delimiter (for example, '.' or '*') and need to use the delimiter as a literal character in the RE should put it inside a bracket expression, as implementations differ regarding whether escaping it with a <backslash> removes its special meaning."

=> Perhaps adding something like:
"If an implementation considers such an escaped delimiter as the literal character (as opposed to the special character), it is not possible to give it its special meaning, except by using another delimiter."

III) with respect to my original point (2c) here (i.e. what about characters that get their special meaning with respect to the RE *or* replacement only when escaped by a preceding backslash, e.g. for BREs '(' or '{'... and in implementation extensions e.g. '+' or 's'):

Geoff's current proposal doesn't mention that case, AFAIU. It uses the wording "special character", but AFAIU, '(' is not considered a "special character", right?
I gave some different kinds of examples for this in:
- the original post of this ticket, point (2c) <-- example for the RE part
- note 5607 of this ticket, point (2c) <-- example for the replacement part
- the original post of this ticket, point (3) <-- BusyBox example for actually the same thing as in (2c)

=> I think that what Geoff's current proposal already "solves" for special characters (by making it implementation defined) *still needs* to be solved for such characters that only become special when escaped by a preceding '\', too.
AND I think the specification must make it a MUST that implementations use that as the literal character. E.g. for BREs, implementations MUST consider:
s(\((X(
equivalent to:
s/(/X/

Giving them the freedom to choose the special meaning, by making this implementation defined, would IMO have the following problems:
- POSIX defines the escape sequence '\x' (for all characters x except some like the specials or, with sed, '\n' and '\c' with c being the delimiter) to be undefined.
- thus implementations started using this for their own purposes, e.g. GNU's sed has '\s', '\S', '\W', '\>', '\+' and more.

If POSIX now allows e.g. '\W' in 'sW\WWxW' to be *either* special *or* literal (and not just the latter), then people basically lose the ability to use most possible delimiter characters in a portable way - because an implementation could have extended the meaning of any '\x' (x as "defined" above).

So even if one would follow just the POSIX rules... which effectively say:
- you may use 'W' as delimiter
- 'W' is not a special character
- 'W' is not a character like '(' that becomes special if escaped by '\'
- '\W' is undefined, except for the use as escaped delimiter
... and thereby rightfully assume that '\W', when the delimiter is 'W', would become the literal 'W'... they could actually get a special 'W'.
And that for every character to which *any* implementation might have given an extended meaning.
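As an illustration only, the distinction drawn above can be written down as a small Python sketch. It encodes one possible reading: '\c' is "unspecified" only when the delimiter c is itself a special character outside bracket expressions, and "literal" for everything else, including characters such as '(' in a BRE or 'W' that only gain a meaning when escaped. The function names and the simplified character sets (a rough reading of XBD 9.3.3 and 9.4.3) are assumptions of this sketch, not wording from the draft or from Geoff's proposal.

```python
# Characters treated as special outside bracket expressions.
# ASSUMPTION: simplified sets; context-dependent cases (e.g. '*' at the
# start of a BRE) are ignored, and backslash is omitted because it cannot
# be used as a delimiter anyway.
BRE_SPECIAL = set(".[*^$")
ERE_SPECIAL = set(".[()*+?{|^$")


def escaped_delimiter_in_re(delim: str, ere: bool = False) -> str:
    """Classify '\\<delim>' inside the RE as 'literal' or 'unspecified'."""
    if len(delim) != 1 or delim in "\\\n":
        raise ValueError("delimiter must be one character other than backslash or newline")
    special = ERE_SPECIAL if ere else BRE_SPECIAL
    return "unspecified" if delim in special else "literal"


def escaped_delimiter_in_replacement(delim: str) -> str:
    """Classify '\\<delim>' inside the replacement: only '&' is left open."""
    if len(delim) != 1 or delim in "\\\n":
        raise ValueError("delimiter must be one character other than backslash or newline")
    return "unspecified" if delim == "&" else "literal"


if __name__ == "__main__":
    for d in "/W(.+1&":
        print(repr(d),
              "BRE:", escaped_delimiter_in_re(d),
              "ERE:", escaped_delimiter_in_re(d, ere=True),
              "replacement:", escaped_delimiter_in_replacement(d))
```

Under this reading 'W' and, in a BRE, '(' are always safe as delimiters, whatever extension an implementation attaches to '\W' or '\(' when another delimiter is in use.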
This problem with '\W' is basically what I tried to describe in note #5611 above.

If the standard allowed the implementation to choose whether \c is the literal c or the special c for characters c other than a very limited set (namely the respective POSIX-defined special characters - excluding(!) any implementation-defined characters that only become special when escaped), then I think no delimiters could be safely used anymore in a portable way (even '/'), because when the delimiter is used in the RE it's implementation dependent whether it becomes special or not.
The only way around that would be to put any delimiter c in the RE into its own bracket expression... but I guess that would get quite ugly, and many people would probably not know that this might be needed.

btw: For the replacement part, I think Geoff's current proposal already does this (rather implicitly though):
"Likewise, if the delimiter character is not <ampersand> ('&'), the escape sequence <backslash>delimiter shall be treated as that literal character in the replacement; if it is <ampersand>, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character (see below)."
That gives the freedom to choose whether \c is special or literal ONLY for the POSIX-defined special character & ... *any* other characters (including 1-9 and any additions an implementation might have made) are required to be literal.

Geoff's current proposal does not explicitly allow an implementation to choose the behaviour (literal vs. special) for delimiters whose character c would only become special if escaped - but it doesn't explicitly forbid it either (and that's what's IMO missing).

I just saw (at the end of my review) that Geoff's proposal already indicates this whole problem, in the added paragraph: "Some historical sed implementations..."
Still, I think we need to rule this out more explicitly, outside of the "RATIONALE" section.
=> So, I'd propose:
* Implementations MUST ALWAYS consider '\c' with a delimiter 'c' as the literal character 'c', unless 'c' is a special character for BREs or EREs, respectively.
  *If* the final accepted resolution leaves it implementation defined for special characters, then one could possibly amend that simply by saying "For any 'c' that are not special (including such which would become only special when escaped), '\c' MUST be considered as the literal 'c'."
  => e.g. with a sentence like "If a character 'c' is used as delimiter that is not a special character for BREs respectively EREs (as defined by POSIX), the escape sequence '\c' must be considered to be the literal character c, regardless of any special behaviour extending POSIX, that an implementation would give to '\c' if another delimiter was used."

* I'm however not sure what should be done about those characters which POSIX itself already mentions as characters that become special when escaped, e.g. for BREs: '(' or '{' ... and with https://austingroupbugs.net/view.php?id=1546 also '\?', '\+' and '\|', which *may* be special.
  For them one could allow implementation-dependent behaviour, because they're defined by POSIX already, and so people know what they must exclude to stay portable. But they cannot know that for any '\w', '\s' or '\~'.
  I'd probably recommend against it, and allow implementation-defined behaviour only for true special characters, which don't need to be escaped to become special.
  [I should note again that at least BusyBox' sed would already break this, see my original post here, point (2c). Since it's conceptually the same as point (3), which has already been fixed, it may already be fixed in current BusyBox versions, too.]

* I'm also unsure about the cases '\c' where c is any digit from 1-9:
  - this may be in replacements (BRE and ERE)
  - the RE part, too (only for BRE) <--- see my note #5610 above for details/examples
  I think for BOTH we need to make it a MUST that implementations treat e.g. the escape sequence '\1' as the literal '1'. For the replacement this is already the case with Geoff's current proposal (which only allows the implementation to choose with '\&').
  => I'd also propose to add something like the following sentence to the standard (maybe APPLICATION USAGE?):
  "If a digit from 1-9 is used as delimiter, it cannot be used as back-reference in the replacement or a BRE's RE part."

=> I'd further suggest adding (probably also to APPLICATION USAGE) a list for BRE and ERE, respectively, of all those characters that are *not* safely usable as delimiter (because implementations may choose literal vs. special) *if* the character is also part of the RE (and the same for the replacement, where it's only '&').
If my above proposal is accepted, and depending on what's done for '\?', '\+' and '\|', these would simply be the lists of all (truly) special characters for BREs and EREs, respectively.
Such lists would help people to understand more easily what they can use portably, without having to puzzle it out from the rules and their complex meaning.

IV) with respect to my original point (2d) here (i.e. what about 's.[.].X.'):

As said before, I think it's already a bit clearer with Geoff's current proposal that '\c' with c being the delimiter would be considered part of the RE language. But 's.[.].X.' is different... as the 2nd '.' is not escaped.
However, with:
- what I propose in https://austingroupbugs.net/view.php?id=1556#c5778 point (c), and
- the following sentence added with Geoff's current proposal:
  'The delimiter character that precedes and follows the RE shall not terminate the RE when it appears within a bracket expression. For example, the context address "/[/]/" is equivalent to "/\//".'
it becomes IMO fully clear.
So that point is solved (I also like the paragraph you add in the APPLICATION USAGE, which describes how to use that).
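The one-pass reading from (I), combined with the bracket-expression rule quoted in (IV), can be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the helper names (skip_bracket, split_one_pass, split_two_stage) are hypothetical, the input is the text after the opening delimiter, and details such as '\n' handling, flag validation and multi-line commands are ignored. The two-stage splitter is included purely for contrast.

```python
def skip_bracket(s: str, i: int) -> int:
    """s[i] is '['; return the index just past the matching ']' (or len(s))."""
    j = i + 1
    if j < len(s) and s[j] == "^":
        j += 1
    if j < len(s) and s[j] == "]":          # a leading ']' is a literal
        j += 1
    while j < len(s):
        if s[j] == "[" and j + 1 < len(s) and s[j + 1] in ".:=":
            end = s.find(s[j + 1] + "]", j + 2)   # skip [.x.], [:x:], [=x=]
            if end == -1:
                return len(s)
            j = end + 2
        elif s[j] == "]":
            return j + 1
        else:
            j += 1
    return len(s)


def split_one_pass(body: str, delim: str) -> list:
    """Left-to-right scan: a backslash consumes the next character, and a
    bracket expression in the RE (first field) is skipped as a whole."""
    fields, cur, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            cur.append(body[i:i + 2])
            i += 2
        elif ch == delim and len(fields) < 2:
            fields.append("".join(cur))
            cur = []
            i += 1
        elif ch == "[" and not fields:
            j = skip_bracket(body, i)
            cur.append(body[i:j])
            i = j
        else:
            cur.append(ch)
            i += 1
    fields.append("".join(cur))
    return fields


def split_two_stage(body: str, delim: str) -> list:
    """Contrasting reading: cut at each delimiter not directly preceded by
    a backslash, without interpreting anything else first."""
    fields, start = [], 0
    for i, ch in enumerate(body):
        if ch == delim and (i == 0 or body[i - 1] != "\\") and len(fields) < 2:
            fields.append(body[start:i])
            start = i + 1
    fields.append(body[start:])
    return fields


if __name__ == "__main__":
    print(split_one_pass(r"\\((X(", "("))    # RE, replacement, flags as in (a)
    print(split_two_stage(r"\\((X(", "("))   # RE, replacement, flags as in (b)
    print(split_one_pass("[.].X.", "."))     # body of 's.[.].X.'
```

For 's(\\((X(' the two functions return the splits listed under (a) and (b) in point (I), and for 's.[.].X.' the one-pass scan keeps '[.]' intact as the RE.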
=> I would however use a simpler example:
"\%[%]%" is equivalent to "/[%]/"
or alternatively:
"/[/]/" is equivalent to "\%[/]%"
Well, whether it's simpler might be a matter of personal taste... but it doesn't drop the bracket expression, which I think is better for showing what's going on.

V) With respect to the proposal at https://austingroupbugs.net/view.php?id=1550#c5761

a) As already said in the other tickets, I'd move the sentences starting with "The BRE and ERE syntax shall additionally support escaping" back down to the "Regular Expressions in sed" section. Sorry for giving that bad idea earlier, that it should be in "Addresses in sed".
And perhaps overlapping parts of these sentences can be unified with the ones added to the s-command. (Won't work for the replacement part, though.)

b) I wouldn't write:
"is not special in a BRE or ERE" <--- this exists in two locations
but rather
"is not special in __the__ BRE __respectively__ ERE"
or something better.
The "or" could be interpreted e.g. the following way: we're in a BRE, someone uses + as delimiter; while that isn't special in a BRE, it is in an ERE... so the "or" kicks in. At least in my understanding of English, "respectively" would make it a tiny bit clearer that this (BRE vs. ERE) depends on the respective case.
And yes, I've seen the ", respectively, " but I'd rather interpret that to relate to BRE <-> [xref to XBD 9.3] and ERE <-> [xref to XBD 9.4].

c) Throughout your additions, you use e.g. "with an unescaped <backslash> (except inside a bracket expression)" or similar.
If we make it clearer (as I proposed above):
- that '\c' and '\n' are considered part of the RE language, and with
- the changes made through some other issue, which clearly define "escape sequence/character" for the RE language
... I think we could go back and just call that "escape 'c'" or "escape sequence 'c'", though I would personally prefer to retain the parentheses with a hint like "(there can't be escape characters/sequences inside bracket expressions)".

d) Cosmetics: In some places the wording "escape sequence <backslash>c" is used... but in others e.g. "escape sequence '\n'".

e) Instead of:
"The delimiter character that precedes and follows the RE shall not terminate the RE when it appears within a bracket expression. For example, the context address "/[/]/" is equivalent to "/\//"."
I'd write:
"The delimiter character that precedes and follows the RE shall not terminate the RE when it appears within a bracket expression __but be that literal character for the bracket expression__. For example, the context address "/[/]/" is equivalent to "/\//"."
It's nitpicking, but AFAIU, the delimiter character (unlike the escaped delimiter character) is strictly speaking *not* part of the RE language. So it's in principle still not 100% clear what happens with such a character. Sure, it doesn't terminate the RE... but it could be... ignored?

f) "Within the RE and the replacement, the delimiter shall not terminate the RE or replacement if it is preceded by an unescaped <backslash> (that is not inside a bracket expression in the RE, where the delimiter does not terminate the RE anyway - see [xref to Regular Expressions in sed])."
In case this would be "unified" with the corresponding parts for the context address in "Regular Expressions in sed"... the part for the replacement would obviously need to stay.

g) "if it is <ampersand>, it is unspecified whether the escape sequence <backslash>delimiter is treated as the literal character or the special character (see below)."
=> one might just write '\&' here, since in that case "delimiter" is always '&'.

h) "Applications that use a special RE character as a delimiter (for example, '.' or '*') and need to use the delimiter as a literal character in the RE should put it inside a bracket expression, as implementations differ regarding whether escaping it with a <backslash> removes its special meaning."
=> If my proposal (III) above is accepted, then I'd also repeat here specifically e.g. "special RE character (which does not include such which become only special when escaped) as a delimiter".
=> And perhaps something like "should put it inside a bracket expression __with no other characters__" to make clear that one cannot re-use an existing bracket expression: e.g. 'sX\X[0-9]XfooX' can NOT be written as 'sX[X0-9]XfooX' but only as 'sX[X][0-9]XfooX'.

Question: Are the following bracket expressions well-defined and portable:
- [^]
- [\]
?
At least '[^]' would fall under the above sentence ("Applications that use a special RE character...")... '[\]' not really, as '\' cannot be a delimiter.
I tried to find this in 9.3.5 RE Bracket Expression... and I guess '[\]' is clearly well-defined and portable... but I cannot really follow this for '[^]'... it seems not to be mentioned and I guess I'll report it in a separate ticket.
=> But anyway... the above sentence would need to exclude [^] then... Or is there a way to safely escape this? I guess not, because it's special and thus implementations would be allowed to choose whether to treat '\^' as literal or special (when ^ is also the delimiter)... probably even depending on the position of that escape sequence.

i) "Some historical sed implementations did not support escaping '(', ')', '{', and '}' when used as a BRE"
Not sure, but this introduction with historical implementations kind of gives the feeling that this problem would only exist because of historical implementations and because of '(', ')', '{', and '}'.
However, AFAIU, we need to *generally* rule that out, and not just because of historical implementations.
And with "that" I mean: implementations must not be allowed to choose whether they give '\c' literal or special meaning, if 'c' is the delimiter, and if 'c' alone wouldn't be special, but 'c' preceded by an escaping '\' would be.

VI) not really related to this issue, but it would make things even more complex if I added it in a separate ticket:

The description of the y-command contains on page 3138, line 106249:
"If the number of characters in string1 and string2 are not equal, or if any of the characters in string1 appear more than once, the results are undefined."
That is strictly speaking wrong, namely in the case when string1 and/or string2 contain a '\'-escaped 'n' (for newline) or '\'-escaped delimiters, and the numbers of such occurrences in the two strings don't even out.
=> Perhaps simply write "If the number of characters (after resolving any escape sequences)..." or so?

Issue History
Date Modified    Username       Field                    Change
======================================================================
2022-01-14 05:39 calestyo       New Issue
2022-01-14 05:39 calestyo       Name                      => Christoph Anton Mitterer
2022-01-14 05:39 calestyo       Section                   => Utilities, sed
2022-01-14 05:39 calestyo       Page Number               => 3132, ff. (in the draft)
2022-01-14 05:39 calestyo       Line Number               => see below
2022-01-14 06:34 Don Cragun     Relationship added       related to 0001550
2022-01-14 06:48 Don Cragun     Project                  1003.1(2016/18)/Issue7+TC2 => Issue 8 drafts
2022-01-14 06:51 Don Cragun     Note Added: 0005602
2022-01-14 06:51 Don Cragun     version                   => Draft 2.1
2022-01-14 15:52 calestyo       Note Added: 0005607
2022-01-14 20:36 calestyo       Note Added: 0005610
2022-01-14 21:40 calestyo       Note Added: 0005611
2022-01-14 21:48 calestyo       Note Added: 0005612
2022-01-14 22:07 calestyo       Note Edited: 0005612
2022-01-14 22:08 calestyo       Note Edited: 0005612
2022-01-14 22:09 calestyo       Note Edited: 0005612
2022-01-14 22:15 calestyo       File Added: summary-of-literal-behaviour-gnu-vs-busybox.txt
2022-01-14 22:17 calestyo       Note Added: 0005613
2022-01-18 21:18 calestyo       Note Added: 0005627
2022-01-24 14:52 calestyo       Note Added: 0005634
2022-01-24 15:48 calestyo       Note Edited: 0005634
2022-02-01 20:12 calestyo       Note Added: 0005648
2022-03-18 14:59 geoffclare     Note Added: 0005757
2022-04-02 09:15 kre            Note Added: 0005774
2022-04-05 00:59 calestyo       Note Added: 0005780
======================================================================