Hey. I was playing around with dash, to see whether I'd be able to
implement draft 3's -d option for read. I didn't understand the
standard completely and found several parts that could be improved
(which I've marked with =>).
I)
My understanding was that `read` reads logical lines, where a logical
line is any string (of bytes, except NUL bytes) terminated by some
single-byte line delimiter, which is either what's given with -d
(which could again be <newline>) or <newline> if -d is not given.

line 111817, ff.:
> with the exception of either <newline> or the logical line delimiter
> specified with the -d delim option (if it is used and delim is not
> <newline>); it is unspecified which.

The whole sentence is a bit convoluted and hard to understand, and
only becomes clear when one reads what - AFAIU - it refers to, namely:

line 111910, ff.:
> Implementations differ in their handling of <backslash> for line
> continuation when -d delim is specified (and delim is not <newline>);
> some treat <backslash>delim (or <backslash><NUL> if delim is the null
> string) as a line continuation, whereas others still treat
> <backslash><newline> as a line continuation.

Using bash 5.2.21(1) as an example, I'd interpret the above as follows:

  $ read -d a var
  123\<newline>
  > xyz<newline>
  a$ printf '%s' "$var" | hd
  00000000  31 32 33 78 79 7a    |123xyz|
  00000006

As expected:
- the escaped <newline> causes a line continuation and is stripped
- the unescaped a ends the input and is stripped
- the unescaped <newline> does not cause a line continuation (no >
  prompt, still expected) and is stripped, because there's one field
  ("123xyz") and one variable, and line 111834 (which would append the
  field delimiters, which the unescaped <newline> is here, with the
  default IFS) doesn't apply, as there are no more fields than
  variables.
  $ read -d a var
  123\a456\<newline>
  > 789
  a$ printf '%s' "$var" | hd
  00000000  31 32 33 61 34 35 36 37 38 39    |123a456789|
  0000000a

As expected:
- bash is apparently of the kind that considers only \<newline> a line
  continuation, and thus the escaped a is not even a line separator
  but merely the literal a

So in short:
- There is always exactly one line continuation sequence, either
  \<delim> or \<newline>, and shells are not allowed to consider BOTH
  a line continuation.
- When an implementation chooses \<newline> as the line continuation,
  then \<delim> is just the literal <delim> (and not the line
  delimiter), except for the case where <delim> is <newline>.

=> I don't quite get what "(and delim is not <newline>)" in the spec
means. When <delim> IS <newline>, wouldn't \<newline> still be
considered a line continuation? bash seems to think so:

  $ read -d $'\n' var
  abc\<newline>
  > def
  $ printf '%s' "$var" | hd
  00000000  61 62 63 64 65 66    |abcdef|
  00000006

So I think that parenthetical does more evil than good, because in
principle it could also be read as: if -d is used and <delim> IS
<newline>, the whole sentence does not apply at all, and then it is
not even specified that the behaviour is one of the two.

=> Instead of "it is unspecified which", shouldn't one rather write
that it's "implementation-dependent" which of the two behaviours it
is?

II)
line 111823, ff.:
> If standard input is a terminal device and the invoking shell is
> interactive, read shall prompt for a continuation line when it reads
> an input line ending with a <backslash> <newline>, unless the
> -r option is specified.

Above we had all the "an implementation uses EITHER \<newline> OR
\<delim> for line continuation"; here it's suddenly just \<newline>.

=> Was that section merely forgotten, or is it really intentional that
for interactive shells connected to a terminal it's always \<newline>?
If the latter, then that's IMO a bit ambiguous and should be worded
more clearly.
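Purely for illustration, here's a hypothetical probe (assuming bash is
available; bash is only the example implementation, any shell with a
read -d would do) that detects which of the two continuation flavours
a shell implements, using the same "123\a456" idea as the transcript
above:

```shell
# Hypothetical probe: which line-continuation flavour does read -d use?
# Input is "123\a456a" (no -r, delimiter 'a'):
#  - a shell that continues on \<delim> strips the \a and reads on,
#    ending at the final 'a', so var becomes "123456";
#  - a shell that continues only on \<newline> treats \a as a literal
#    'a', so var becomes "123a456".
flavour=$(printf '123\\a456a' | bash -c '
    read -d a var
    case $var in
        123a456) echo "continuation: backslash-newline" ;;
        123456)  echo "continuation: backslash-delim" ;;
        *)       echo "unknown" ;;
    esac
')
echo "$flavour"
```

With bash 5.2 this reports the \<newline> flavour, matching the hd
output above.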
III)
line 111826, ff.:
> The terminating logical line delimiter (if any) shall be removed from
> the input

I assume the "if any" here just indicates that read MUST read even a
final non-terminated line, e.g.:

  printf 'fooabar' |
  while read -d a LINE; do
      printf '%s' "$LINE" | hd
  done

is expected to give:

  00000000  66 6f 6f    |foo|
  00000003

and:

  00000000  62 61 72    |bar|
  00000003

and the final (non-a-terminated) line "bar" is not somehow discarded
as "invalid". Right?

IV)
line 111834, ff.:
> • The delimiter(s) that follow the field corresponding to the last
>   var
> • The remaining fields and their delimiters, with trailing IFS white
>   space ignored

=> Wouldn't it be better to use the wording "FIELD delimiter(s)"? Sure,
the LINE delimiters are gone by that point, but it would still read
more cleanly IMO. Also, the FIELD delimiters are always the same as
the IFS characters, right? So it may be better to harmonise the
wording in both sentences.

=> Further, the description nowhere directly explains what the
escaping with <backslash> is actually used for, other than
continuation lines. Yes, it follows indirectly from lines 111821 and
111822, but wouldn't it make sense to mention that <backslash> also
allows one to preserve the literal meaning of any IFS characters, and
prevents a <backslash>-escaped IFS character from causing a field
split? It's not just shell implementors and POSIX gurus who read
this ;-)

=> Also, it should perhaps be explicitly pointed out that \ can NOT
portably be used to escape the line delimiter, as it may be either a
line continuation OR the literal character.

V)
line 111837, ff.:
> An error in setting any variable (such as if a var has previously
> been marked readonly) shall be considered an error of read
> processing, and shall result in a return value greater than one.

=> I think it would be beneficial to specify more clearly what happens
in that case, or at least that it's unspecified.
Consider e.g.:

  readonly bar
  read foo bar baz

Okay, read gives a non-zero exit status, sure... but are foo and bar
guaranteed to be set? Or only foo? Or are both guaranteed to be not
set, or only baz? Or is it unspecified for all of them?

VII)
line 111845, ff.:
> If end-of-file is detected before a terminating logical line
> delimiter is encountered, the variables specified by the var operands
> shall be set as described above and the exit status shall be 1.

together with line 111895, ff.:
> The following exit values shall be returned:
>  0  Successful completion.
>  1  End-of-file was detected.
> >1  An error occurred.

AFAIU this does not break strict compatibility with the previous
standardisation of read, which used:
> >0  End-of-file was detected or an error occurred.

But it now means that in the future we can never signal any other
non-error states, because everything >1 must now be an error. I'm just
wondering whether it was considered what's better:
- strict compatibility
- allowing future non-error states to be signalled, by using
   1  An error occurred.
   2  End-of-file was detected.
  >2  Unspecified.

I mean, people who followed the standard would have checked for = 0 to
see whether it's a success (and would already have needed to handle a
non-final EOL specially, if at all). And people who wanted to check
for an error would have done so by checking for > 0. So if we defined
only 1 as the "generic" error status, the only thing we'd "lose" in
terms of compatibility is that other future "success" statuses using
>2 would be considered an error by legacy code. But I'd rather assume
this to be a small problem in practice.
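(For concreteness, the "shall be set as described above and the exit
status shall be 1" part of the quoted text can already be observed
with plain POSIX read, no -d needed; a small sketch:)

```shell
# EOF before a terminating <newline>: the variable is still set to the
# bytes read so far, and read's exit status is 1.
out=$(printf 'last' | {
    IFS= read -r line
    rc=$?
    printf 'rc=%s line=%s' "$rc" "$line"
})
echo "$out"   # -> rc=1 line=last
```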
So IMO one should use the opportunity and consider what's more likely:
- that one wants to differentiate between separate error conditions
  (then the current draft wording would of course be better)
- or that one wants to differentiate between more success-like
  conditions (then the alternative from above might be worth
  considering)

or maybe even something like:

     0  success
     1  EOF without EOL
  2-99  reserved
  >100  Error

Even that would still only cost that possible future success-like
statuses are wrongly considered an error.

VIII)
line 111850, ff.:
> If delim consists of one single-byte character

I would have assumed that because `read` reads from stdin, and because
it sets variables (which in an earlier discussion here were pointed
out to be required to hold any bytes other than NUL - regardless of
the locale!), `read` is also required to cope with any bytes other
than NUL. But it turned out that the previous version of the
specification requires the input to be a text file, and the new
specification only lifts that for -d ''.

=> But still, why is delim defined to be a character? And wouldn't
that definition strictly mean that, e.g., in a UTF-8 locale one
couldn't use 0xC0 as the delimiter, because it's not a valid
character?

IX)
line 111858, ff.:
> If the -d delim option is not specified, or if it is specified and
> delim consists of one single-byte character, the standard input shall
> contain zero or more characters and shall not contain any null bytes.
>
> If the -d delim option is specified and delim is the null string, the
> standard input shall contain zero or more bytes (which need not form
> valid characters).

=> Since -d is new anyway, wouldn't it make sense to allow arbitrary
bytes (other than NUL) as soon as -d is given, or at least - to keep
things simpler - as soon as -d is given and delim is not <newline>?
Regardless of the locale. Or does that conflict with existing
implementations?
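(To make the "need not form valid characters" case concrete, a sketch,
again with bash merely as the example implementation of draft 3's
-d '':)

```shell
# With -d '' the input need not form valid characters: the byte 0xC0,
# invalid on its own in UTF-8, passes through unharmed. (read returns
# 1 here, since EOF is reached before a NUL delimiter, but the
# variable is still set.)
out=$(printf 'foo\300bar' | bash -c 'IFS= read -r -d "" data; printf %s "$data"')
expected=$(printf 'foo\300bar')
[ "$out" = "$expected" ] && echo "all 7 bytes preserved"
```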
X)
line 111868, ff.:
> IFS
>   Determine the internal field separators used to delimit fields; see
>   Section 2.5.3 (on page 2466).

=> Maybe, just as a convenience, indicate that these only delimit
unless escaped by a <backslash>?

Also, does anyone think that further clarification is needed for the
case that one of the IFS characters is the line delimiter? I
personally would say no, because it's already mentioned further up
(line 111826, ff.) that the delimiter is removed before the field
splitting (though strictly speaking, the text doesn't explicitly say
that this happens in that order).

Thanks,
Chris.

PS and OT:
A while ago I asked here on the list how to do portable command
substitution without stripping off trailing newlines (mostly, actually,
for the purpose of getting unusual pathnames). The result back then
was the sentinel trick, using "." (and its special properties) as the
sentinel. The problem was that stripping it off used the pattern
matching notation (which is in turn only defined for characters)
within parameter expansion, so it was necessary to set LC_ALL=C to
make sure that works with any bytes. That in turn, however, made it
IMO impossible to write a fully portable and self-contained function
for that, i.e. something like:

  csubst_with_nl() {
      command="$1"
      result="$( eval " $command"; rc="$?"; printf .; exit "$rc" )"
      rc="$?"
      LC_ALL=C result="${result%.}"
      return "$rc"
  }

or even something better where the variable name is itself a
parameter... because I'd need to set at least rc and LC_ALL, and thus
modify the caller's execution environment (even if I back up and
restore them). local is not portable, and some implementations
actually allow one to "break out" of localised variables via unsetting
tricks (IIRC bash does).

Not thinking carefully enough, I first thought that this would now be
portably possible via something like:

  csubst_with_nl() {
      eval " $1" | IFS='' LC_ALL=C read -r -d '' "$2"
  }

[Not sure if the LC_ALL=C is needed (I guess not, because of IFS=''
and -d '').]
But of course it's still not, because the above:
- loses the command's exit status
- loses the variable named by $2, as the whole thing is run in a pipe
  and thus a subshell

So.... can we get process substitution in Issue 9? ;-P
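(For the archives, a bash-only sketch of the shape that wink refers
to; process substitution keeps read in the current shell so the
variable survives, though the command's exit status is still lost in
this simple form, so it's an illustration rather than a full
replacement:)

```shell
# Bash-only sketch: process substitution lets read run in the current
# shell, so the variable named by $2 survives, and -d '' plus an
# appended NUL preserves trailing newlines. The command's own exit
# status is still lost here. Expect 14 bytes below: "two\ntrailing\n\n"
# with both trailing newlines intact.
n=$(bash -c '
    csubst_with_nl() {
        IFS= read -r -d "" "$2" < <(eval " $1"; printf "\0")
    }
    csubst_with_nl "printf \"two\\ntrailing\\n\\n\"" result
    printf %s "$result" | wc -c
')
echo "$n"
```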