This is a historical overview of (a) actual csplit implementations and (b) their spec in relevant standards documents.

There's more prose and tortured quotes here:
  https://srhtcdn.githack.com/~nabijaczleweli/voreutils/blob/man/man1/csplit.1.html#HISTORY

All referenced documents are trivially available archivally on bitsavers and systems from my good friend Vetus.

Part A:
* csplit appears in PWB/UNIX as a bfs(1) front-end, taking
    csplit file regex regex regex
  bfs is a read-only generalisation of ed, so naturally $ matches the new-line.
* UNIX System III makes csplit a normal C program, with most present-day features. Per manual:
    "Regular expressions may not contain embedded new-lines."
  $ matches the end of the line.
* UNIX System V Release 2 strips the new-line before matching, which (obviously) makes $ behave as-if it matched the new-line (embedding a new-line still works, though; <regexp.h> moment).
* UNIX System V Release 4 uses <regexpr.h> instead, which disables matching embedded new-lines (to drive the point home, it also says that the regexes are like in ed).

Or, in tabular form, if an input line ends in a q:
            | /q$/ | /q
            |      | /
  PWB/UNIX  | Yes  |
  SysIII    | No   | Yes
  SysVr2    | Yes  | Yes
  SysVr4    | Yes  | No
  POSIX now | No   | Yes

Part B:
* SVID2 includes the SysVr2 manual verbatim, although adding a
    "Regular expressions as in ED(BU_CMD) are accepted."
  note
* XPG2 makes that
    "Simple regular expression syntax is accepted."
  and still they can't have newlines in; no reference to a specific section implies a "who cares", and the relevant volume doesn't appear to be preserved
* POSIX 1003.2a (D8) says
    The regular expression rexp shall follow the rules for basic regular expressions described in 2.8.3.
  and POSIX 1003.2 (D11.2), 2.8.3.5 BRE Expression Anchoring pts. 2, 3
    The dollar-sign shall anchor the expression (or optionally subexpression) to the end of the string being matched; the dollar-sign can be said to match the "end-of-string" following the last character.
  and
    A BRE anchored by both ^ and $ shall match only an entire string. For example, the BRE ^abcdef$ matches strings consisting only of abcdef.
* XPG4/SUSv1 "Align[s] with the ISO/IEC 9945-2: 1993 standard" and uses the same wording

Or, in tabular form, if an input line ends in a q:
           | /q$/          | /q  | Correct?
           |               | /   |
  SVID2    | Yes           | No  | Yes
  XPG2     | Yes (implied) | No  | Sure
  P1003.2a |               |     |
  & XPG4   |               |     |
  & SUSv1  | No            | Yes | No

There are multiple extant implementations that comply with SysVr4 (because that's the only sane spec) – those are, at least, the GNU system and the illumos gate.

The damage is done, however, and there are also multiple systems that conform strictly to the P1003.2a spec – NetBSD derivatives ("all modern BSD"), at least.

Since regcomp() already takes REG_NEWLINE, it's my recommendation that the specification be altered only slightly – by specifying that rexp is to be compiled with the REG_NEWLINE flag (or some pithy alternative like "The regular expression rexp shall follow the rules for basic regular expressions described in XBD Basic Regular Expressions, except that the <dollar-sign>, if used as an anchor, shall also match the <newline>, as described in the REG_NEWLINE flag to the regcomp() function in the System Interfaces volume of POSIX.1-2017."). The precedent for the ref is XBD 12.2 USG.

The precedent for updating csplit is that this already happened, when SUSv2 said in its FUTURE DIRECTIONS:
  The IEEE PASC 1003.2 Interpretations Committee has forwarded concerns about parts of this interface definition to the IEEE PASC Shell and Utilities Working Group which is identifying the corrections. A future revision of this specification will align with IEEE Std. 1003.2b when finalised.

And SUSv3 includes P1003.2b's wording, which drastically altered how rexp-style operands are both parsed and executed (beforehand, /q/ {2} was to make two empty files), noting (D12):
  Rationale: These csplit changes are required to match historical practice and are the result of interpretation request PASC 1003.2-92 #59 submitted for IEEE Std 1003.2-1992.

This is a one-line change to strictly-conforming implementations, re-aligns the standard with historical practice, and makes systems deriving from and compatible with it conformant.

Best,
наб