Re: [1003.1(2008)/Issue 7 0000249]: Add standard support for $'...' in shell

Steffen Nurpmeso via austin-group-l at The Open Group Fri, 05 Feb 2021 12:55:27 -0800

Hello Robert.

Robert Elz wrote in
 <14199.1612519...@jinx.noi.kre.to>:
 |    Date:        Thu, 04 Feb 2021 21:59:52 +0100
 |    From:        Steffen Nurpmeso <stef...@sdaoden.eu>
 |    Message-ID:  <20210204205952.fw6wv%stef...@sdaoden.eu>
 |
 || Ok, of course, but let me disagree with the latter.  Bizarre rules
 || and Bourne/Korn shell etc ... just look at ${aXb} and quoting
 || rules within.
 |
 |Two things .. first, I agree, the quoting rules that exist now are
 |bizarre, and weird, and just a royal pain to deal with (both for
 |users and implementors) - which is one reason I'm loath to add yet
 |another difference.
 |
 |And second, I meant bizarre in a different way, it was probably
 |the wrong word (there are reasons, many of them, why I write code, and
 |not novels, nor, or at least very rarely, even academic papers),
 |what I meant was that inside the shell, we have to deal with single
 |quoted strings (which are very easy, as they're very simple, and which
 |includes both ' and \ quoting), and double quoted strings, which are
 |messy and cause problems, but which we have generally managed to conquer.
 |Adding a third, somewhat in between form, where most of the text is
 |literal, but where $ expansions (but I am assuming not ` expansions)
 |happen, when doing so adds no new functionality, just perhaps a slightly
 |simpler syntax for the user, just seems like the wrong thing to do.


If, and only if, it became standardized, it could replace the
other quoting mechanisms: not in the shell itself, but from the
user's point of view.

The good thing about $'' is that nothing happens, just like in
a single-quoted string, unless you see a reverse solidus.  No
fancy rules unless you get triggered to do so.
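
To illustrate the "nothing happens unless you see a reverse solidus"
rule, here is a minimal Python sketch of decoding the text between
$' and '.  The function name decode_dollar_quote and the small escape
table are assumptions made for this illustration only; they are not
taken from any shell or from the MUA, and the table covers only a
subset of the escapes under discussion.

```python
# Illustrative subset of simple escapes; a real shell supports more
# (octal, \x, \u, \c...), which this sketch deliberately leaves out.
SIMPLE_ESCAPES = {
    "a": "\a", "b": "\b", "e": "\x1b", "f": "\f", "n": "\n",
    "r": "\r", "t": "\t", "v": "\v", "\\": "\\", "'": "'", '"': '"',
}


def decode_dollar_quote(body: str) -> str:
    """Decode the body of a $'...' string (hypothetical helper).

    Every character is literal, exactly as in a single-quoted string,
    unless it is a reverse solidus followed by a known escape letter.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 == len(body):
            # No reverse solidus (or a trailing one): copy verbatim.
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
        else:
            # Unknown escape: kept verbatim here; this is an assumption,
            # the result is unspecified and real shells differ.
            out.append(ch + nxt)
        i += 2
    return "".join(out)
```

With this sketch, $ and ` stay untouched because no reverse solidus
precedes them, which is the property argued for above.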

I have not implemented it yet, but i already document \`{} as
a future extension that will then allow command evaluation.
Note this is Plan9 rc syntax (`{command}), which should make nesting
easier to detect, just like $() does.  I do not expect that to be
implemented by a POSIX shell.  It is a MUA in the end :)
That MUA documents

   '\$NAME'
           Non-standard extension: expand the given variable name,
           as above.  Brace enclosing the name is supported.

 |That, and while you can do whatever you like in your MUA, we have to
 |deal with the rest of sh syntax ... eg: what happens to a ' that occurs
 |inside a \$ expansion in your scheme (that is, as part of its text, not its
 |result)?  Does that terminate the $' string, and perhaps lead to an
 |invalid $ expansion, or do things nest?   Does that include inside ${var:=foo}
 |(etc) type expansions where currently (if inside quotes) quoting in the foo
 |word doesn't work (except some \ quoting) - if so, then we have a whole new
 |expansion syntax to deal with, and if not, then what do we make of a ' that
 |occurs there?  Or what of a \' there?    Do $' expressions nest?

Well .. if i recall correctly quoting inside of ${xYz} has been
clarified not too long ago -- i would expect the entire $''
context to be yielded and resumed once the ${xYz} construct has
been handled.  I *think* that is what has to happen with them
inside of "", so it should be just the same.  Except that it is
triggered by \$.., not by $.. as it would be in double-quoted strings.
I think that would be the most natural take.

 |First in the simple cases, like
 | $'whatever \$( cmd $'arg' ) and more'
 |where I assume that answer would be yes, and similarly in
 | $'xxx \${var%$'\n'} yyy'
 |but also as a simple insertion
 | $'abc \$'\t' def'
 |where doing so makes no sense at all, and so the answer is probably
 |"not allowed", but that is then the one $ "expansion" which isn't
 |allowed inside $' strings, which is yet another special case.
 |
 |Also, if a command substitution were embedded using \$( ) inside a $'
 |string, what conversions (if any) are performed upon the stdout of the
 |command before being embedded in the string, are \ escapes there expected
 |to work?  (Same question for a variable expansion).
 |
 |Similarly, what does $'\${var-"two words"}' generate, and
 |$'\${var-\"two words\"}'  (assuming var is unset naturally).  Or using '
 |instead of " in both of those?

For all of these, to me: yield the $'' context, and resume it once
the construct has been handled.

 |And last (for now anyway), after "set -- A B C" what's the effect of
 |$'pfx\${@}sfx' ?

This is interesting.  I would say it is identical to ${*} here.

 |At least once we either drop \u, or properly define how it is supposed
 |to work (if anyone actually has an idea what that is), $' is entirely the
 |same as ' once the internal expansions are done (as part of lexical analysis)
 |so is trivial to add, makes it easier to encode some strings (just easier,
 |nothing that cannot already be done) and is trivial to implement.  Adding
 |\$ to that would (I think, I haven't tried to actually do it) complicate
 |everything.   Of course, since $' is properly specified, and unknown
 |escapes produce implementation defined (or unspecified) results, there's
 |nothing to stop shells from adding \$ if they like (it would probably help
 |them if there was a fully specified spec of how it is intended to work,
 |including all the corner cases) and if it becomes popular, perhaps it
 |could appear in some later standard.  I just don't see that happening myself.
 |
 |kre
 |
 |ps: unrelated to \$ in $' but while I am here, since I mentioned it above,
 |in the NetBSD sh, \u (which accepts any number of hex digits up to 4, or
 |up to 8 for \U, not just exactly 4 or 8, but that's a frill) the

My MUA also does this, \OCTAL is flexible, as is \x, so i think \U
and \u should, too.

 |interpretation
 |is that the UTF-8 encoding of the code point specified is embedded in the
 |string.  No more, no less.   In particular it is *not* the shell's job to
 |validate the UTF sequences so that they make sense, or can rationally be
 |interpreted as anything at all (that's on the application).   They're just
 |bit patterns.   Similarly, since the author of the script cannot be assumed
 |to know what locale the user running it will have set, converting the \u
 |sequence to some other locale (while it is still being processed inside the
 |shell) cannot be correct either.
 |
 |If I ever work out (beyond message encoding, and perhaps some pattern
 |matching expressions) what the shell is supposed to be doing with locales
 |(which as best I can tell, is really not a lot) and I implement that, it
 |would be by encoding everything internally as UTF-8 sequences (not wchar_t),
 |and then converting to locale specified encodings as strings are output
 |(or from them for input).   Since there doesn't seem to be a lot of

Yeah, i guess the best you can do is to keep an internal Unicode
representation and convert back and forth on input and output
only.  Just look at what NetBSD strvis(3), or whatever the name was,
does: terrible, converting back and forth between the unreliable,
unusable wchar_t and the char*-based locale encoding.  Expensive,
and potentially even lossy.

 |reason any more for anyone not to use non UTF-8 encodings, that would really
 |mean doing a whole lot of nothing most of the time (hopefully, always).

My MUA just turns it into UTF-8 (via a utf32_to_utf8 function that
uses the Unicode replacement character for erroneous codepoints)
in a Unicode locale, otherwise it does not expand it at all
(leaves the construct in the text) unless iconv(3) is available,
in which case we pass the construct through it.  It is documented
like this, too.  (And we do not care about the restrictions on
codepoint ranges for \U and \u, from ISO C i think it was.)  I think
this behaviour is mostly compatible with bash(1), even.
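
As a rough illustration of the conversion step described above, here
is a minimal Python sketch of a code point to UTF-8 encoder that
substitutes the Unicode replacement character for erroneous code
points.  It borrows the name utf32_to_utf8 from the text, but the body
is an assumption written for this example, not the MUA's actual
code, and it does not model the non-Unicode-locale or iconv(3) paths.

```python
REPLACEMENT_CHARACTER = 0xFFFD  # U+FFFD


def utf32_to_utf8(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch).

    Surrogates (U+D800..U+DFFF) and values outside U+0000..U+10FFFF
    are not valid scalar values and are replaced by U+FFFD.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        cp = REPLACEMENT_CHARACTER
    if cp < 0x80:  # one byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, utf32_to_utf8(0x20AC) gives the three bytes of the euro
sign, while utf32_to_utf8(0xD800) gives the encoding of U+FFFD.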

You have to be careful a bit with Unicode.  There are guarantees
that must be fulfilled, see for example [1].  Since the shell is
producing UTF-8 it should ensure that no invalid UTF-8 sequences
are exposed to consumers.  There are also Unicode Conformance
Requirements, for example C10 (part of "Character Encoding Forms",
excerpt):

  When a process interprets a code unit sequence which purports to
  be in a Unicode character encoding form, it shall treat
  ill-formed code unit sequences as an error condition and shall
  not interpret such sequences as characters.

and, furthermore:

  Silently ignoring ill-formed sequences is strongly discouraged
  because joining text from before and after the ill-formed
  sequence can cause the resulting text to take a new
  meaning. This result would be especially dangerous in the
  context of textual formats that carry embedded program code,
  such as JavaScript.
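
The danger the excerpt describes can be shown with a small Python
sketch that uses the standard bytes.decode error handlers.  The three
helper names are made up for this illustration; the byte 0xFF, which
can never occur in well-formed UTF-8, stands in for an ill-formed
sequence.

```python
def decode_ignoring(data: bytes) -> str:
    """Silently drop ill-formed sequences (the discouraged behaviour)."""
    return data.decode("utf-8", errors="ignore")


def decode_replacing(data: bytes) -> str:
    """Substitute U+FFFD for each ill-formed sequence."""
    return data.decode("utf-8", errors="replace")


def decode_strict(data: bytes) -> str:
    """Treat ill-formed sequences as an error (raises UnicodeDecodeError)."""
    return data.decode("utf-8", errors="strict")


# The text before and after the bad byte is harmless on its own ...
sample = b"<scr\xffipt>"

# ... but dropping the bad byte joins it into a new meaning.
joined = decode_ignoring(sample)      # "<script>"

# Replacing keeps the two halves apart.
marked = decode_replacing(sample)     # "<scr\ufffdipt>"
```

Only the ignoring variant turns the input into "<script>"; the
replacing and strict variants both keep the ill-formed sequence from
disappearing unnoticed.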

  [1] https://unicode.org/faq/utf_bom.html

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
