Re: Character sets PDD ready for review

Allison Randal Tue, 01 Apr 2008 16:53:37 -0700

Gianni Ceccarelli wrote:

(Here follows various comments and opinions on PDD28 draft, written
while reading it)


As has been pointed out, the expression «A grapheme is our concept» is
not really clear. I think «The term "grapheme" in this document
defines a concept local to Parrot» or some such.

I'm not sure that UTF-16 can be called a "fixed-width" encoding (what
with surrogate pairs and all that...)

UTF-16 isn't fixed-width, but I don't see anywhere that the PDD says it
is. Maybe this comment was from an earlier version of the PDD?
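
(For readers who want to see the surrogate-pair point concretely, here is
a minimal sketch in Python. It is only an illustration, not anything from
the PDD or from Parrot; the helper name utf16_code_units is made up for
this example.)

```python
def utf16_code_units(text):
    """Return the number of 16-bit code units in the UTF-16 encoding of text.

    Hypothetical helper for illustration; "utf-16-le" is used so that no
    byte-order mark is added to the output.
    """
    return len(text.encode("utf-16-le")) // 2


bmp_char = "\u0438"         # CYRILLIC SMALL LETTER I, inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_code_units(bmp_char))             # 1
print(utf16_code_units(astral_char))          # 2
print(astral_char.encode("utf-16-be").hex())  # d834dd1e (a surrogate pair)
```

One code point takes one code unit, the other takes two, so the width of
a character in UTF-16 depends on the character.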

«we don’t standardize on Unicode internally»: the intent is clear, but
the expression feels ambiguous to me. Do you mean "we don't fixate on
a UTF-*", "we don't use Unicode-specified semantics and tables", or
what? (I think the text is simply referring to encodings for internal
representations)

This means that we don't convert every piece of string data that enters
Parrot to a Unicode string. We keep the character set, encoding, and
normalization of the string as it enters Parrot. So, you have to assume
that any string you're passed could be in any format, and use the
standard string APIs to interact with them. I added to the text,
hopefully clearer now.
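
(A rough sketch of what "keep the string as it entered" implies for code
that receives strings. This is Python, not Parrot's string API; the
TaggedString class and the same_text function are hypothetical names
invented for the illustration, and comparing through NFC is an assumption
of the sketch, not something the PDD prescribes.)

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedString:
    """Hypothetical string that keeps its original bytes and encoding."""
    data: bytes
    encoding: str

    def text(self):
        # Decode on demand; the stored bytes are never rewritten.
        return self.data.decode(self.encoding)


def same_text(a, b):
    """Compare two strings through a common form, never by raw bytes."""
    nfc_a = unicodedata.normalize("NFC", a.text())
    nfc_b = unicodedata.normalize("NFC", b.text())
    return nfc_a == nfc_b


# "café" as Latin-1 bytes, and as UTF-8 bytes with a combining accent (NFD).
latin1 = TaggedString(b"caf\xe9", "latin-1")
utf8_nfd = TaggedString("cafe\u0301".encode("utf-8"), "utf-8")

print(latin1.data == utf8_nfd.data)  # False: different bytes
print(same_text(latin1, utf8_nfd))   # True: same text
```

The raw bytes differ in both encoding and normalization, which is why a
consumer has to go through an API instead of inspecting the bytes.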

«Parrot_Rune»: whoever came up with this short-form for "grapheme" can
collect a beer from me at the next YAPC::Europe. Brilliant!


Runes are gone, but you can still use the name unofficially. :)

«out-of-band» usually does not mean "using special values in the same
stream as normal values"... again, the intent is clear enough, but the
terminology is misleading.

«"0x00000438 0x000000030F"» is not a byte-stream, it's an int-stream.

«need to take the overload of peeking» s/overload/overhead/ ?


Removed in the edit.

Stupid serialization of Parrot_Rune arrays is not portable between
Parrot runs, right? That is, Parrot_Rune(-1) can refer to different
graphemes from one run to the next. Better bang it into the heads of
everyone from the earliest possible moment...


Yes, that's one reason the global grapheme table didn't work well.

I've always defined an "encoding" as a function from streams of
characters to strings of bytes (and back, for "decoding"). Why not
include a similar definition at the beginning of the "IMPLEMENTATION"
section?

Added a definition of encoding, but in the nominal sense common to
Unicode discussions, not the verbal sense of functions to encode and
decode.

«encoding_get_codepoint» may return something which is not, strictly
speaking, what Unicode calls a "codepoint". Ok, calling it "runepoint"
might be seen as a pun, but confusion is (sadly) the norm when dealing
with text nowadays, and overloading such a badly-understood term may
not help clear the issue...


Gone in the edit.

Warnings to add to the checklist:

- arithmetical comparison of string data elements is a red flag
- string sorting is ill-defined generally, but it's well-defined
  inside a locale (that is, it's dependent on the language of the
  user, which may or may not have any relation with the language of
  the data, which in turn may or may not have any relation with the
  script of a character)
- tr/// or similar simple-minded table-based transformations are a red
  flag
- the Parrot_Rune value-space is not connected (that is, given that $a
  and $b are valid Parrot_Rune values, there may be a value $c ($a <
  $c < $b) that is not a valid Parrot_Rune), so don't use Parrot_Rune
  in for-loops
- string element count ("length") and string display width are quite
  unrelated (Han characters are wider than Latin characters almost
  always, for example)

The checklist is gone in the edit (rolled into the text), as this is a
specification document, not a usage guide. But, a general tutorial on
working with Unicode would be a good addition, down the road.
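
(Two of the checklist warnings above, element count versus display width
and arithmetical comparison of string elements, can be sketched in a few
lines of Python. The display_width function is a hypothetical and
deliberately rough approximation based on the East Asian Width property;
real terminals and fonts can behave differently.)

```python
import unicodedata


def display_width(text):
    """Rough column count for text (hypothetical helper).

    Wide and fullwidth code points count as 2 columns, combining marks
    as 0, and everything else as 1.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue
        wide = unicodedata.east_asian_width(ch) in ("W", "F")
        width += 2 if wide else 1
    return width


latin = "abc"
han = "\u6f22\u5b57\u8a9e"   # three Han characters
combined = "\u0438\u030f"    # the PDD's example: U+0438 followed by U+030F

print(len(latin), display_width(latin))        # 3 3
print(len(han), display_width(han))            # 3 6
print(len(combined), display_width(combined))  # 2 1

# Comparing by code point value sorts every capital before any lowercase
# letter, which is rarely what a user of any locale expects.
print(sorted(["apple", "Zebra"]))              # ['Zebra', 'apple']
```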

Hope this helps, and is not too jumbled (I tend to brain-dump)


Many thanks!
Allison
