Hello.

Robert Elz wrote in
 <[email protected]>:
 |    Date:        Thu, 30 Jul 2020 15:53:53 +0200
 |    From:        Steffen Nurpmeso <[email protected]>
 |    Message-ID:  <20200730135353.qwslp%[email protected]>
 |
 || The problem being that what is in the wild does not work out for
 || many languages.
 |
 |I admit to not knowing a lot of the internationalisation issues,
 |or of unicode, but I don't understand this at all.
 |
 |The quoting mechanisms in the shell provide a means to create
 |specific bit patterns to assign to variables, pass as parameters
 |to programs, etc.   I don't see that the mechanism by which they're
 |encoded in the sh language should matter all that much, the same
 |thing could be read from a file instead ( var=$(cat file) ) in which
 |case the shell spec has no control over the bit patterns at all.
 |
 |Of course the quoting mechanisms make a difference to the ease of
 |use for the sh programmer, but that's an entirely different issue.

No, no.  Ah, i had to reread the bug report now.  But i am being
misunderstood.

 || The in-use shell quote pattern consisting of small, isolated parts
 || which depend on which kind of escaping and expanding is necessary
 || just does not work out for many languages.
 |
 |Can you give an example of something which cannot be done (assuming
 |$'' as currently intended to be specified)?   Note: not an example of
 |someone using the mechanisms to do the wrong thing - there are zillions
 |of ways to write bad code, but an example of something which cannot be
 |done correctly as specified.   Then we'll see if that really matters.
 |
 ||   ? echo Don"'"t you worry$'\x21' The sun shines on us. $'\u263A'
 ||
 || The latter is what i mean.  There are many languages on this world
 || where these \u expansions do not work out that way, but where the
 || "entire sentence must be interpreted as a unity" in order to get
 || the iconv(3) conversion to nl_langinfo(CODESET) correctly, aka
 || the way it is _desired_.
 |
 |Surely this depends upon how the shell works - if the shell is attempting
 |to convert just the \u escape into some other codeset, I can see your point,
 |but it doesn't need to work like that - it can work internally in 10646
 |code points (whether encoded in 16 or 32 bit values, or as UTF-8), and
 |only convert to the desired charset when actually used (that is, when
 |about to run "echo" at which point the entire string is available.

Yes it could, and that would solve the issue.  The issue is only
that the \u escape specifies Unicode codepoints, which are then
converted to the locale character set via iconv(3), and that this
conversion may yield different results depending on the context it
gets to process.  As a primitive example from a western language, i know

  u$'\u{DIAERESIS}'

cannot be converted to LATIN1, even though $'u\u{DIAERESIS}' could.
In theory.  In practice only if iconv(3) would apply a normalization
step for Unicode input, aka take care of "combining" marks.  The
result then would be U+00FC (LATIN SMALL LETTER U WITH DIAERESIS).
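A minimal sketch of that normalization step, in Python rather than
sh or iconv(3).  It assumes that \u{DIAERESIS} above stands for
U+0308 (COMBINING DIAERESIS), and it uses Python's "latin-1" codec as
a stand-in for the iconv(3) conversion; neither is claimed to be what
any shell does.

```python
import unicodedata

# Assumption: \u{DIAERESIS} means U+0308 COMBINING DIAERESIS.
decomposed = "u\u0308"

try:
    decomposed.encode("latin-1")
    direct_ok = True
except UnicodeEncodeError:
    # U+0308 on its own has no LATIN1 counterpart.
    direct_ok = False

# NFC composes the pair into U+00FC, which LATIN1 can represent.
composed = unicodedata.normalize("NFC", decomposed)
latin1_bytes = composed.encode("latin-1")

print(direct_ok, hex(ord(composed)), latin1_bytes)
```

Without the normalization step the conversion fails; with it, the
two codepoints collapse into a single LATIN1 byte.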

This is a primitive example.  There are languages which have
complex rules, and where multiple Unicode codepoints, aka multiple
adjacent \u sequences, form a "grapheme".  This is because Unicode
does not provide a codepoint for each and every character of all
languages which are supported, but it uses combining marks and
other categories of codepoints, which glue together to form the
actual character.

For example i know an Australian who lives in Southeast Asia (that
happens more often than one would think), now in Malaysia but also
Vietnam and Thailand, whatever, and he said

  In Thai vowels can be in front, behind, below, above, in front and
  behind, in front and above, in front and behind and above. And
  also have tonal markers above. So can be triple stacked.

Such things are very often represented via combined codepoints in
Unicode.  When he said that he was at odds with an ncurses-based
Unix terminal application, by the way.
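To make the stacking concrete, here is a small Python sketch.  The
syllable is an illustrative choice and not taken from the
conversation quoted above: a Thai consonant followed by a vowel sign
and a tone mark, both of which sit above the consonant.

```python
import unicodedata

# Illustrative syllable: THO THAHAN + SARA II + MAI EK.
syllable = "\u0e17\u0e35\u0e48"

# General category of each codepoint: Lo = letter, Mn = nonspacing mark.
categories = [unicodedata.category(ch) for ch in syllable]

# No precomposed codepoint exists, so NFC leaves the sequence alone.
unchanged = unicodedata.normalize("NFC", syllable) == syllable

print(len(syllable), categories, unchanged)
```

One displayed unit, three codepoints, and no normalization form that
reduces it to one.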

 |In any case, if the user has specified a specific unicode code point,
 |shouldn't that always be what is generated, regardless of whether it
 |makes sense or not?
 |
 || And for that it would be tremendous if $'' would be defined so
 || that it can be used as the sole quoting mechanism,
 |
 |No thanks.   Partly because $'' is already implemented (widely)
 |and used (perhaps slightly less yet) - so that ship has sailed.
 |
 |I believe I've seen $" ... " used that way somewhere though (don't
 |recall where) and I believe it is a mistake.

That $"" is used by bash for translation aka gettext(3) purposes
i think.

 |As soon as you have multiple different types of expansions that
 |can occur, there are problems with which one gets priority, which
 |is performed first.   So, assuming there is a $"..." which works
 |as you desire, what happens with
 |
 | $"${VAR+foo\x7Dbar}"
 |
 |Do we get foo}bar or foobar} ?   (assuming VAR was set of course).

Well, for one i do not understand your problem now.  We have seen
very, very tricky shell expansion problems being discussed in this
group in the last years, and being solved (i think).
So _if_ that \$ escape would really be added to $'', then it would
just expand the same construct that it would expand in "", i would
say.  It then integrates into the normal content of the quote, and
the final content of the quote would, in case there was any \u,
but maybe just always, be passed through iconv(3).

You can play tricks today with \x or \octal etc expansions, you
can specify a UTF-8 string with such sequences in a 7-bit ASCII
file, i do not think this changes.
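As a sketch of that trick, the following Python snippet takes a pure
7-bit ASCII source text containing \x escapes and expands it to the
bytes of a UTF-8 string.  The regular expression only handles
two-digit \xHH escapes and is a hypothetical stand-in for what $''
does, not a model of any particular shell.

```python
import re

# What would sit in the 7-bit ASCII script file.
source = r"\xE2\x98\xBA"
is_ascii = source.isascii()

# Expand each \xHH escape into the single byte it names.
raw = re.sub(
    rb"\\x([0-9A-Fa-f]{2})",
    lambda m: bytes([int(m.group(1), 16)]),
    source.encode("ascii"),
)

# The three bytes together are the UTF-8 encoding of U+263A.
text = raw.decode("utf-8")
print(is_ascii, raw, hex(ord(text)))
```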

I mean the stuff is tricky, but it is the problem of the one who
produces the sequence.

I have reread the bug 249 tracker entry.  In fact all i can do is
reiterate what i said back then, starting with #2893.
Therefore, the desired standard wording

  <tt>\uXXXX</tt> yields the character named by ISO/IEC 10646 or,
  for \u0 to \u1f, by ISO/IEC 6429 where XXXX is one to four
  hexadecimal digits (with leading zeros supplied for missing
  digits) whose four-digit short universal character name is XXXX
  (and whose eight-digit short universal character name is
  0000XXXX).

  <tt>\UXXXXXXXX</tt> yields the character named by ISO/IEC 10646
  or, for \u0 to \u1f, by ISO/IEC 6429 where XXXXXXXX is one to
  eight hexadecimal digits (with leading zeros supplied for
  missing digits) whose eight-digit short universal character name
  is XXXXXXXX.

is no good.  You cannot say "yields the character", because these
are ISO 10646 codepoints, and multiple adjacent such codepoints
may be needed to form an actual character.  As such, a conversion
to the locale charset may be possible only if at least the entire
sequence of adjacent \u and \U escapes is treated as a unit.
The rest is a matter of definition: whether $'u\u{DIAERESIS}'
is enough, or whether it must be $'we do not know charset here, but
 \u{SMALL LETTER U}\u{DIAERESIS} and this transparent too' to
work out, which is possibly what you desire.
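The difference between converting each escape in isolation and
converting the expanded text as a unit can be sketched as follows.
This is Python, not sh; the helper names expand, convert_piecewise
and convert_as_unit are hypothetical, only four-digit-style \u
escapes are handled, NFC normalization is assumed as the
"combining" step, and Python codecs stand in for iconv(3).

```python
import re
import unicodedata

# Capturing group, so that split() keeps the escapes as separate pieces.
ESCAPE = re.compile(r"(\\u[0-9A-Fa-f]{1,4})")

def expand(piece):
    """Turn one \\uXXXX escape into its codepoint; pass other text through."""
    if ESCAPE.fullmatch(piece):
        return chr(int(piece[2:], 16))
    return piece

def convert_piecewise(text, charset):
    """Convert every literal run and every escape on its own."""
    out = b""
    for piece in ESCAPE.split(text):
        out += unicodedata.normalize("NFC", expand(piece)).encode(charset)
    return out

def convert_as_unit(text, charset):
    """Expand all escapes first, then normalize and convert the whole text."""
    expanded = "".join(expand(piece) for piece in ESCAPE.split(text))
    return unicodedata.normalize("NFC", expanded).encode(charset)

print(convert_as_unit(r"u\u0308", "latin-1"))
```

With the input u\u0308, convert_as_unit yields the single LATIN1
byte for U+00FC, whereas convert_piecewise raises UnicodeEncodeError
because the combining mark has no LATIN1 mapping when seen alone.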

I think it would have been better to say that $'' treats the input
as UTF-8, or LATIN1 or even US-ASCII, but that is not the way it
is defined.
Or am i mistaken?

 |Whichever way you pick, there will be arguments for doing it
 |the other way, in some other case.   This stuff simply becomes
 |a mess.   Please, don't go there.   If we wanted to add C type
 |encodings along with the others, we'd need to do it in a way that
 |is consistent with the other expansions, perhaps using something
 |like $[x7D] or $[u263A] or $[n] (but no, this is not a serious
 |suggestion).
 |
 |And I cannot fathom how this in any way overcomes your earlier
 |objection, quoted strings in sh are not units, they're simply
 |pieces of some longer word (or can be) - your Don"'"t example
 |above (and the worry$'\x21') are both examples of that.

Yes.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
