This is a historical overview of (a) actual csplit implementations and
(b) their spec in relevant standards documents. There's more prose and
tortured quotes here:
  
https://srhtcdn.githack.com/~nabijaczleweli/voreutils/blob/man/man1/csplit.1.html#HISTORY
All referenced documents are trivially available archivally on bitsavers
and systems from my good friend Vetus.


Part A:
  * csplit appears in PWB/UNIX as a bfs(1) front-end, taking
      csplit file regex regex regex
    bfs is a read-only generalisation of ed,
    so naturally $ matches the new-line.
  * UNIX System III makes csplit a normal C program,
    with most present-day features.
    Per manual: "Regular expressions may not contain embedded new-lines."
    $ matches the end of the line.
  * UNIX System V Release 2 strips the new-line before matching, which
    (obviously) makes $ behave as if it matched the new-line
    (embedding a new-line still works, though; <regexp.h> moment).
  * UNIX System V Release 4 uses <regexpr.h> instead, which disables
    matching embedded new-lines (to drive the point home,
    it also says that the regexes are like in ed).

Or, in tabular form, if an input line ends in a q:
            | /q$/ | /q
            |      | /
  PWB/UNIX  | Yes  |
  SysIII    | No   | Yes
  SysVr2    | Yes  | Yes
  SysVr4    | Yes  | No
  POSIX now | No   | Yes


Part B:
  * SVID2 includes the SysVr2 manual verbatim, but adds a
    "Regular expressions as in ED(BU_CMD) are accepted." note
  * XPG2 makes that "Simple regular expression syntax is accepted.",
    and the regexes still can't contain new-lines; the lack of a reference
    to a specific section implies a "who cares", and the relevant volume
    doesn't appear to be preserved
  * POSIX 1003.2a (D8) says
      The regular expression rexp shall follow the rules for basic regular
      expressions described in 2.8.3.
    and POSIX 1003.2 (D11.2), 2.8.3.5 BRE Expression Anchoring pts. 2, 3
      The dollar-sign shall anchor the expression (or optionally
      subexpression) to the end of the string being matched;
      the dollar-sign can be said to match the ‘‘end-of-string’’
      following the last character.
    and
      A BRE anchored by both ˆ and $ shall match only an entire string.
      For example, the BRE ˆabcdef$ matches strings consisting only of
      abcdef.
  * XPG4/SUSv1 "Align[s] with the ISO/IEC 9945-2: 1993 standard" and
    uses the same wording

Or, in tabular form, if an input line ends in a q:
           | /q$/          | /q  | Correct?
           |               | /   |
  SVID2    | Yes           | No  | Yes
  XPG2     | Yes (implied) | No  | Sure
  P1003.2a |               |     |
  & XPG4   |               |     |
  & SUSv1  | No            | Yes | No


There are multiple extant implementations that comply with SysVr4
(because that's the only sane spec) – those are, at least,
the GNU system and the illumos gate.

The damage is done, however, and there are also multiple systems that
conform strictly to the P1003.2a spec –
NetBSD derivatives ("all modern BSD"), at least.

Since regcomp() already takes REG_NEWLINE, it's my recommendation
that the specification be altered only slightly – by specifying that
rexp is to be compiled with the REG_NEWLINE flag
(or some pithy alternative like "The regular expression rexp shall
follow the rules for basic regular expressions described in
XBD Basic Regular Expressions, except that the <dollar-sign>,
if used as an anchor, shall also match the <newline>, as described in
the REG_NEWLINE flag to the regcomp() function in the System Interfaces
volume of POSIX.1-2017."). The precedent for the ref is XBD 12.2
(Utility Syntax Guidelines).
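
To make the difference concrete, here is a minimal sketch of what the
flag changes for a line that ends in a q. The helper name bre_matches()
is made up for illustration and isn't taken from any csplit
implementation; the only assumption is a POSIX regcomp()/regexec().

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not from any real csplit.
 * Compiles pattern as a BRE (regcomp()'s default) with extra_cflags
 * OR-ed in, and searches line (which still has its trailing new-line).
 * Returns 1 on a match, 0 on no match, -1 if the pattern won't compile. */
int bre_matches(const char *pattern, const char *line, int extra_cflags)
{
	regex_t re;
	int rc;

	if (regcomp(&re, pattern, REG_NOSUB | extra_cflags) != 0)
		return -1;

	rc = regexec(&re, line, 0, NULL, 0);
	regfree(&re);
	return rc == 0;
}
```

With the line "a q\n", bre_matches("q$", line, 0) should give 0,
since $ only anchors to the end of the string, past the new-line,
while bre_matches("q$", line, REG_NEWLINE) should give 1,
since $ then also matches just before the new-line.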

The precedent for updating csplit is that this already happened,
when SUSv2 said in its FUTURE DIRECTIONS:
  The IEEE PASC 1003.2 Interpretations Committee has forwarded concerns
  about parts of this interface definition to the IEEE PASC Shell and
  Utilities Working Group which is identifying the corrections.
  A future revision of this specification will align with
  IEEE Std. 1003.2b when finalised.

And SUSv3 includes P1003.2b's wording, which drastically altered
how rexp-style operands are both parsed and executed
(beforehand, /q/ {2} was to make two empty files), noting (D12):
  Rationale: These csplit changes are required to match historical
  practice and are the result of interpretation request
  PASC 1003.2-92 #59 submitted for IEEE Std 1003.2-1992.

This is a one-line change to strictly-conforming implementations,
re-aligns the standard with historical practice,
and makes systems deriving from, or compatible with, that practice conformant.

Best,
наб
