Re: Character sets PDD ready for review

Mark J. Reed Fri, 14 Mar 2008 19:01:26 -0700

As a reference point: in AppleScript 2.0 (not that I know if anyone wants
to port that to Parrot), "characters" are defined as Unicode "grapheme
clusters", i.e. a base character plus its diacriticals... Is that
similar to the concept of a Parrot_Rune?
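
To make "base character plus its diacriticals" concrete, here is a rough
Python sketch. It is only an approximation: the helper name approx_clusters
is made up for illustration, and real grapheme cluster segmentation (as
defined by Unicode) has more rules than "attach combining marks to the
preceding character".

```python
import unicodedata

def approx_clusters(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper, and a simplification: full grapheme cluster rules
    also cover Hangul syllables, emoji sequences, and other cases.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

s = "e\u0301a"                    # 'e' + COMBINING ACUTE ACCENT + 'a'
print(len(s))                     # 3 code points
print(len(approx_clusters(s)))    # 2 user-visible "characters"
```

So a "character" in the AppleScript sense may span several code points.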


On 3/14/08, Leopold Toetsch <[EMAIL PROTECTED]> wrote:
> On Saturday, 8 March 2008 13:59, Simon Cozens wrote:
> > Hi folks,
> >     I think I've finished doing what I can with
> > docs/pdds/draft/pdd28_character_sets.pod for the time being.
> >     Please have a look at it, and let me know if there's anything wrong,
> > anything unclear, anything missing or anything objectionable about it.
> > Character set and encoding support is an absolute nightmare to get
> > right, but I feel the stuff in this PDD gives us a good basis to work
> > from.
> >     If there are no major problems with it, I'll pass it on to Allison
> > for editing.
>
> 1) The Parrot internal character type
>
> «Strings in Parrot's native string format will probably be an array of
> "Parrot_Rune"s.»
>
> or iso-8859-1 or UCS-2.
>
> Why:
>
> iso-8859-1 is a 1-byte charset/encoding whose 256 characters match the
> Unicode codepoints U+0000 - U+00FF. CPAN's BIO::folks and a lot more will
> like to have the speed and memory improvements of a 1-byte encoding.
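
The byte-to-codepoint correspondence described above can be checked with a
few lines of Python. This is just an illustration using Python's own
"iso-8859-1" codec, not anything from Parrot; the sample string is
arbitrary.

```python
# Every byte value 0..255 decodes to the code point with the same number.
data = bytes(range(256))
text = data.decode("iso-8859-1")
assert [ord(c) for c in text] == list(range(256))
assert text.encode("iso-8859-1") == data   # and it round-trips

# Memory comparison for an arbitrary sample string.
sample = "na\u00efve"                       # 5 characters
latin1 = sample.encode("iso-8859-1")
utf32 = sample.encode("utf-32-le")          # 4 bytes per code point
print(len(latin1), len(utf32))              # 5 20
```
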
>
> UCS-2 is a fixed-width 16-bit charset, which covers the "Basic
> Multilingual Plane" [1] of Unicode. It is sufficient to represent a very
> high percentage of the codepoints in use. When Wikipedia [2] states ...
>
> <cite>
> UCS-2 (2-byte Universal Character Set) is an obsolete character encoding
> which is a predecessor to UTF-16.
> </cite>
>
> ..., it's already mixing the concepts of charset and encoding. Anyway, for
> efficiency reasons, I'd like to see this as an alternative.
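
A small Python sketch of why a fixed 16-bit unit only works inside the
Basic Multilingual Plane. Python has no separate UCS-2 codec, so this uses
"utf-16-le" as a stand-in (the two agree for BMP-only text); fits_ucs2 is a
made-up helper name.

```python
def fits_ucs2(text):
    """True if every code point fits in one 16-bit unit (U+0000..U+FFFF)."""
    return all(ord(c) <= 0xFFFF for c in text)

bmp = "\u017e\u4e2d"       # two BMP characters
astral = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, outside the BMP

print(fits_ucs2(bmp), len(bmp.encode("utf-16-le")))        # True 4
print(fits_ucs2(astral), len(astral.encode("utf-16-le")))  # False 4
```

The single character in the second string needs two 16-bit units (a
surrogate pair) in UTF-16, so the fixed-width property is lost as soon as
text leaves the BMP.
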
>
> 2) the concept of Parrot_Rune or
>
> <cite>
> Unicode codepoint where values >= 0x80000000 are understood to be entries
> into the global "Parrot_grapheme_table" array.
> </cite>
>
> seems to imply that we are going to:
>
> a) rewrite / improve the ICU library we use now
> b) invent a new "standard"
> c) do a lot of future hiring to keep in sync with the Unicode folks ;-)
>
> Basically I have some concerns "who will implement and maintain it".
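
For readers trying to picture the scheme quoted from the PDD, here is a toy
Python model. Everything in it is an assumption for illustration: the names
GRAPHEME_FLAG, intern_cluster and expand are invented, and using
"value minus 0x80000000" as the table index is only one plausible reading
of the quoted sentence, not what the PDD specifies.

```python
GRAPHEME_FLAG = 0x80000000

grapheme_table = []   # stand-in for the PDD's global Parrot_grapheme_table
_index = {}           # cluster -> position in grapheme_table

def intern_cluster(cluster):
    """Return a rune for a tuple of code points (hypothetical scheme)."""
    if len(cluster) == 1:
        return cluster[0]                  # plain code point, stored as-is
    if cluster not in _index:
        _index[cluster] = len(grapheme_table)
        grapheme_table.append(cluster)
    return GRAPHEME_FLAG + _index[cluster]

def expand(rune):
    """Return the tuple of code points a rune stands for."""
    if rune >= GRAPHEME_FLAG:
        return grapheme_table[rune - GRAPHEME_FLAG]
    return (rune,)

rune = intern_cluster((0x006A, 0x030C))    # 'j' + COMBINING CARON
print(hex(rune), expand(rune))
```

Even in this toy form the table is global mutable state that has to be
kept consistent, which is roughly where the maintenance concern comes from.
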
>
> I wrote the one and only (AFAIK) test showing the ugliness of decomposed
> Unicode [4] codepoints and I'd be glad if there were a better solution.
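
The composed/decomposed length difference is easy to reproduce outside
Parrot. The sketch below uses Python's unicodedata module on the same
string that appears in the test output in [4]; it is not the Parrot test
itself, only an illustration of the effect.

```python
import unicodedata

composed = "___\u01f0___"     # U+01F0 LATIN SMALL LETTER J WITH CARON
decomposed = unicodedata.normalize("NFD", composed)    # 'j' + U+030C
recomposed = unicodedata.normalize("NFC", decomposed)

print(len(composed), len(decomposed), len(recomposed))  # 7 8 7
print(composed == decomposed)                           # False
print(composed == recomposed)                           # True
```

The two forms look identical when displayed but differ in length and do
not compare equal code point by code point.
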
>
> OTOH I don't know the impact of not having it. Eastern European or other
> possibly affected folks should speak up now.
>
> > Simon
>
> leo's 2¢
>
> [1] http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> [2] http://en.wikipedia.org/wiki/UTF-16
> [3] [EMAIL PROTECTED]:~/svn/parrot/leo> find t -name '*.t' | xargs grep -w compose
> t/op/string_cs.t:    compose S1, S1
> t/pmc/object-mro.t:# ... now some tests which fail to compose the class
> [4] [EMAIL PROTECTED]:~/svn/parrot/leo> ./parrot t/op/string_cs_46.pasm
> ___ǰ___
> 7 8 8 7
>


-- 
Mark J. Reed <[EMAIL PROTECTED]>
