Hey. I was playing around with dash, to see whether I'd be able to
implement draft 3's -d option for read. I didn't understand the
standard completely and found several parts that could be improved
(which I've marked with =>).
I)
My understanding was that `read` reads logical lines, where a logical
line is any string (of bytes, except NUL bytes) terminated by some
single-byte line delimiter, which is either what's given with -d
(which could again be <newline>) or <newline> if -d is not given.

line 111817, ff.:
> with the exception of either <newline> or the logical line delimiter
> specified with the -d delim option (if it is used and delim is not
> <newline>); it is unspecified which.

The whole sentence is a bit convoluted and hard to understand, and
only becomes clear when one reads what - AFAIU - it refers to, namely:

line 111910, ff.:
> Implementations differ in their handling of <backslash> for line
> continuation when -d delim is specified (and delim is not <newline>);
> some treat <backslash>delim (or <backslash><NUL> if delim is the null
> string) as a line continuation, whereas others still treat
> <backslash><newline> as a line continuation.

Using bash 5.2.21(1) as an example, I'd interpret the above as follows:

  $ read -d a var
  123\<newline>
  > xyz<newline>
  a$ printf '%s' "$var" | hd
  00000000  31 32 33 78 79 7a    |123xyz|
  00000006

As expected:
- the escaped <newline> causes a line continuation and is stripped
- the unescaped a ends the input and is stripped
- the unescaped <newline> does not cause a line continuation (no >
  prompt, still expected) and is stripped, because there's one field
  ("123xyz") and one variable, and line 111834 (which would append the
  field delimiters, which the unescaped <newline> is here, with the
  default IFS) doesn't apply, as there are no more fields than
  variables.
  $ read -d a var
  123\a456\<newline>
  > 789
  a$ printf '%s' "$var" | hd
  00000000  31 32 33 61 34 35 36 37 38 39    |123a456789|
  0000000a

As expected:
- bash is apparently of the kind that considers only \<newline> a line
  continuation, and thus the escaped a is not even a line separator
  but merely the literal a

So in short:
- There is always exactly one line continuation sequence, either
  \<delim> or \<newline>, and shells are not allowed to consider BOTH
  a line continuation.
- When an implementation chooses \<newline> as the line continuation,
  then \<delim> is just the literal <delim> (and not the line
  delimiter), except for the case where <delim> is <newline>.

=> I don't quite get what "(and delim is not <newline>)" in the spec
means. When <delim> IS <newline>, wouldn't \<newline> still be
considered a line continuation? bash seems to think so:

  $ read -d $'\n' var
  abc\<newline>
  > def
  $ printf '%s' "$var" | hd
  00000000  61 62 63 64 65 66    |abcdef|
  00000006

So I think that parenthetical does more evil than good, because in
principle it could also be read as: if -d is used and <delim> IS
<newline>, the whole sentence does not apply at all, and then it is
not even specified that the behaviour is one of the two.

=> Instead of "it is unspecified which", shouldn't one rather write
that it's "implementation-dependent" which of the two behaviours it
is?

II)
line 111823, ff.:
> If standard input is a terminal device and the invoking shell is
> interactive, read shall prompt for a continuation line when it reads
> an input line ending with a <backslash> <newline>, unless the
> -r option is specified.

Above we had all the "an implementation uses EITHER \<newline> OR
\<delim> for line continuation"; here it's suddenly just \<newline>.

=> Was that section merely forgotten, or is it really intentional that
for interactive shells connected to a terminal it's always \<newline>?
If the latter, then that's IMO a bit ambiguous and should be worded
more clearly.
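Purely for illustration, here's a hypothetical probe (assuming bash is
available; bash is only the example implementation, any shell with a
read -d would do) that detects which of the two continuation flavours
a shell implements, using the same "123\a456" idea as the transcript
above:

```shell
# Hypothetical probe: which line-continuation flavour does read -d use?
# Input is "123\a456a" (no -r, delimiter 'a'):
#  - a shell that continues on \<delim> strips the \a and reads on,
#    ending at the final 'a', so var becomes "123456";
#  - a shell that continues only on \<newline> treats \a as a literal
#    'a', so var becomes "123a456".
flavour=$(printf '123\\a456a' | bash -c '
    read -d a var
    case $var in
        123a456) echo "continuation: backslash-newline" ;;
        123456)  echo "continuation: backslash-delim" ;;
        *)       echo "unknown" ;;
    esac
')
echo "$flavour"
```

With bash 5.2 this reports the \<newline> flavour, matching the hd
output above.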
III)
line 111826, ff.:
> The terminating logical line delimiter (if any) shall be removed from
> the input

I assume the "if any" here just indicates that read MUST read even a
final non-terminated line, e.g.:

  printf 'fooabar' |
  while read -d a LINE; do
      printf '%s' "$LINE" | hd
  done

is expected to give:

  00000000  66 6f 6f    |foo|
  00000003

and:

  00000000  62 61 72    |bar|
  00000003

and the final (non-a-terminated) line "bar" is not somehow discarded
as "invalid". Right?

IV)
line 111834, ff.:
> • The delimiter(s) that follow the field corresponding to the last
>   var
> • The remaining fields and their delimiters, with trailing IFS white
>   space ignored

=> Wouldn't it be better to use the wording "FIELD delimiter(s)"? Sure,
the LINE delimiters are gone by that point, but it would still read
more cleanly IMO. Also, the FIELD delimiters are always the same as
the IFS characters, right? So it may be better to harmonise the
wording in both sentences.

=> Further, the description nowhere directly explains what the
escaping with <backslash> is actually used for, other than
continuation lines. Yes, it follows indirectly from lines 111821 and
111822, but wouldn't it make sense to mention that <backslash> also
allows one to preserve the literal meaning of any IFS characters, and
prevents a <backslash>-escaped IFS character from causing a field
split? It's not just shell implementors and POSIX gurus who read
this ;-)

=> Also, it should perhaps be explicitly pointed out that \ can NOT
portably be used to escape the line delimiter, as it may be either a
line continuation OR the literal character.

V)
line 111837, ff.:
> An error in setting any variable (such as if a var has previously
> been marked readonly) shall be considered an error of read
> processing, and shall result in a return value greater than one.

=> I think it would be beneficial to specify more clearly what happens
in that case, or at least that it's unspecified.
Consider e.g.:

  readonly bar
  read foo bar baz

Okay, read gives a non-zero exit status, sure... but are foo and bar
guaranteed to be set? Or only foo? Or are both guaranteed to be not
set, or only baz? Or is it unspecified for all of them?

VII)
line 111845, ff.:
> If end-of-file is detected before a terminating logical line
> delimiter is encountered, the variables specified by the var operands
> shall be set as described above and the exit status shall be 1.

together with line 111895, ff.:
> The following exit values shall be returned:
>  0  Successful completion.
>  1  End-of-file was detected.
> >1  An error occurred.

AFAIU this does not break strict compatibility with the previous
standardisation of read, which used:
> >0  End-of-file was detected or an error occurred.

But it now means that in the future we can never signal any other
non-error states, because everything >1 must now be an error. I'm just
wondering whether it was considered what's better:
- strict compatibility
- allowing future non-error states to be signalled, by using
   1  An error occurred.
   2  End-of-file was detected.
  >2  Unspecified.

I mean, people who followed the standard would have checked for = 0 to
see whether it's a success (and would already have needed to handle a
non-final EOL specially, if at all). And people who wanted to check
for an error would have done so by checking for > 0. So if we defined
only 1 as the "generic" error status, the only thing we'd "lose" in
terms of compatibility is that other future "success" statuses using
>2 would be considered an error by legacy code. But I'd rather assume
this to be a small problem in practice.
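(For concreteness, the "shall be set as described above and the exit
status shall be 1" part of the quoted text can already be observed
with plain POSIX read, no -d needed; a small sketch:)

```shell
# EOF before a terminating <newline>: the variable is still set to the
# bytes read so far, and read's exit status is 1.
out=$(printf 'last' | {
    IFS= read -r line
    rc=$?
    printf 'rc=%s line=%s' "$rc" "$line"
})
echo "$out"   # -> rc=1 line=last
```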
So IMO one should use the opportunity and consider what's more likely:
- that one wants to differentiate between separate error conditions
  (then the current draft wording would of course be better)
- or that one wants to differentiate between more success-like
  conditions (then the alternative from above might be worth
  considering)

or maybe even something like:

     0  success
     1  EOF without EOL
  2-99  reserved
  >100  Error

Even that would still only cost that possible future success-like
statuses are wrongly considered an error.

VIII)
line 111850, ff.:
> If delim consists of one single-byte character

I would have assumed that because `read` reads from stdin, and because
it sets variables (which in an earlier discussion here were pointed
out to be required to hold any bytes other than NUL - regardless of
the locale!), `read` is also required to cope with any bytes other
than NUL. But it turned out that the previous version of the
specification requires the input to be a text file, and the new
specification only lifts that for -d ''.

=> But still, why is delim defined to be a character? And wouldn't
that definition strictly mean that, e.g., in a UTF-8 locale one
couldn't use 0xC0 as the delimiter, because it's not a valid
character?

IX)
line 111858, ff.:
> If the -d delim option is not specified, or if it is specified and
> delim consists of one single-byte character, the standard input shall
> contain zero or more characters and shall not contain any null bytes.
>
> If the -d delim option is specified and delim is the null string, the
> standard input shall contain zero or more bytes (which need not form
> valid characters).

=> Since -d is new anyway, wouldn't it make sense to allow arbitrary
bytes (other than NUL) as soon as -d is given, or at least - to keep
things simpler - as soon as -d is given and delim is not <newline>?
Regardless of the locale. Or does that conflict with existing
implementations?
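(To make the "need not form valid characters" case concrete, a sketch,
again with bash merely as the example implementation of draft 3's
-d '':)

```shell
# With -d '' the input need not form valid characters: the byte 0xC0,
# invalid on its own in UTF-8, passes through unharmed. (read returns
# 1 here, since EOF is reached before a NUL delimiter, but the
# variable is still set.)
out=$(printf 'foo\300bar' | bash -c 'IFS= read -r -d "" data; printf %s "$data"')
expected=$(printf 'foo\300bar')
[ "$out" = "$expected" ] && echo "all 7 bytes preserved"
```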
X)
line 111868, ff.:
> IFS
>   Determine the internal field separators used to delimit fields; see
>   Section 2.5.3 (on page 2466).

=> Maybe, just as a convenience, indicate that these only delimit
unless escaped by a <backslash>?

Also, does anyone think that further clarification is needed for the
case that one of the IFS characters is the line delimiter? I
personally would say no, because it's already mentioned further up
(line 111826, ff.) that the delimiter is removed before the field
splitting (though strictly speaking, the text doesn't explicitly say
that this happens in that order).

Thanks,
Chris.

PS and OT:
A while ago I asked here on the list how to do portable command
substitution without stripping off trailing newlines (mostly, actually,
for the purpose of getting unusual pathnames). The result back then
was the sentinel trick, using "." (and its special properties) as the
sentinel. The problem was that stripping it off used the pattern
matching notation (which is in turn only defined for characters)
within parameter expansion, so it was necessary to set LC_ALL=C to
make sure that works with any bytes. That in turn, however, made it
IMO impossible to write a fully portable and self-contained function
for that, i.e. something like:

  csubst_with_nl() {
      command="$1"
      result="$( eval " $command"; rc="$?"; printf .; exit "$rc" )"
      rc="$?"
      LC_ALL=C result="${result%.}"
      return "$rc"
  }

or even something better where the variable name is itself a
parameter... because I'd need to set at least rc and LC_ALL, and thus
modify the caller's execution environment (even if I back up and
restore them). local is not portable, and some implementations
actually allow one to "break out" of localised variables via unsetting
tricks (IIRC bash does).

Not thinking carefully enough, I first thought that this would now be
portably possible via something like:

  csubst_with_nl() {
      eval " $1" | IFS='' LC_ALL=C read -r -d '' "$2"
  }

[Not sure if the LC_ALL=C is needed (I guess not, because of IFS=''
and -d '').]
But of course it's still not, because the above:
- loses the command's exit status
- loses the variable named by $2, as the whole thing is run in a pipe
  and thus a subshell

So.... can we get process substitution in Issue 9? ;-P
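(For the archives, a bash-only sketch of the shape that wink refers
to; process substitution keeps read in the current shell so the
variable survives, though the command's exit status is still lost in
this simple form, so it's an illustration rather than a full
replacement:)

```shell
# Bash-only sketch: process substitution lets read run in the current
# shell, so the variable named by $2 survives, and -d '' plus an
# appended NUL preserves trailing newlines. The command's own exit
# status is still lost here. Expect 14 bytes below: "two\ntrailing\n\n"
# with both trailing newlines intact.
n=$(bash -c '
    csubst_with_nl() {
        IFS= read -r -d "" "$2" < <(eval " $1"; printf "\0")
    }
    csubst_with_nl "printf \"two\\ntrailing\\n\\n\"" result
    printf %s "$result" | wc -c
')
echo "$n"
```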