As a reference point: in AppleScript 2.0 (not that I know whether anyone wants to port that to Parrot), "characters" are defined as Unicode "grapheme clusters", e.g. a base character together with its combining diacritics. Is that similar to the concept of a Parrot_Rune?
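To make the question concrete, here is a rough Python sketch of how I read the rune idea. Only the 0x80000000 threshold and the notion of a global grapheme table come from the PDD text quoted below; the names GRAPHEME_FLAG, grapheme_table, to_runes and from_runes are made up for illustration, and the clustering rule (a character plus any following characters with a non-zero combining class) is a simplification of real grapheme cluster segmentation, not what Parrot or AppleScript actually does.

```python
import unicodedata

# Assumption: the threshold is taken from the PDD draft; everything else
# in this sketch (names, table layout) is hypothetical.
GRAPHEME_FLAG = 0x80000000

grapheme_table = []   # each entry: tuple of codepoints forming one cluster
_table_index = {}     # reverse lookup so equal clusters share one entry


def _intern(cluster):
    """Return a rune for a multi-codepoint cluster, adding it to the table if new."""
    if cluster not in _table_index:
        _table_index[cluster] = len(grapheme_table)
        grapheme_table.append(cluster)
    return GRAPHEME_FLAG | _table_index[cluster]


def to_runes(text):
    """Map text to one integer per base character plus its combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1].append(ord(ch))   # attach mark to the previous base
        else:
            clusters.append([ord(ch)])     # start a new cluster
    # Single codepoints stay as they are; longer clusters become table entries.
    return [c[0] if len(c) == 1 else _intern(tuple(c)) for c in clusters]


def from_runes(runes):
    """Expand runes back into the original codepoint sequence."""
    out = []
    for r in runes:
        if r >= GRAPHEME_FLAG:
            out.extend(chr(cp) for cp in grapheme_table[r - GRAPHEME_FLAG])
        else:
            out.append(chr(r))
    return "".join(out)


composed = "___\u01f0___"                            # j with caron, one codepoint
decomposed = unicodedata.normalize("NFD", composed)  # j followed by U+030C

print(len(composed), len(decomposed))                      # 7 8
print(len(to_runes(composed)), len(to_runes(decomposed)))  # 7 7
```

With something like this, the length of the string is the same whether the input arrives composed or decomposed, which is the behaviour AppleScript's "characters" give you. Whether that is what Parrot_Rune is meant to do is exactly my question.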

On 3/14/08, Leopold Toetsch <[EMAIL PROTECTED]> wrote:
> On Saturday, 8 March 2008 13:59, Simon Cozens wrote:
> > Hi folks,
> > I think I've finished doing what I can with
> > docs/pdds/draft/pdd28_character_sets.pod for the time being.
> > Please have a look at it, and let me know if there's anything wrong,
> > anything unclear, anything missing or anything objectionable about it.
> > Character set and encoding support is an absolute nightmare to get
> > right, but I feel the stuff in this PDD gives us a good basis to work from.
> > If there are no major problems with it, I'll pass it on to Allison for
> > editing.
>
> 1) The Parrot internal character type
>
> «Strings in Parrot's native string format will probably be an array of
> "Parrot_Rune"s.»
>
> or iso-8859-1 or UCS-2.
>
> Why:
>
> iso-8859-1 is a 1-byte charset/encoding whose 256 characters match the
> Unicode codepoints U+0000 - U+00FF. CPAN's BIO::folks and a lot more will
> like to have the speed and memory improvements of a 1-byte encoding.
>
> UCS-2 is a fixed-width 16-bit charset, which includes the "Basic
> Multilingual Plane" [1] of Unicode. It is sufficient to represent some very
> high percentage of used codepoints. When Wikipedia [2] states ...
>
> <cite>
> UCS-2 (2-byte Universal Character Set) is an obsolete character encoding
> which is a predecessor to UTF-16.
> </cite>
>
> ..., it's already mixing the concepts of charset and encoding. Anyway, for
> efficiency reasons, I'd like to see this as an alternative.
>
> 2) The concept of Parrot_Rune or
>
> <cite>
> Unicode codepoint where values >= 0x80000000 are
> understood to be entries into the global "Parrot_grapheme_table"
> array.
> </cite>
>
> seems to imply that we are going to:
>
> a) rewrite / improve the ICU library we use now
> b) invent a new "standard"
> c) do a lot of future hiring work to keep in sync with the Unicode folks
> ;-)
>
> Basically, I have some concerns about who will implement and maintain it.
>
> I wrote the one and only (AFAIK) test showing the ugliness of decomposed
> Unicode [4] codepoints, and I'd be glad if there were a better solution.
>
> OTOH I don't know the impact of not having it. East European or other
> possibly affected folks should speak up now.
>
> > Simon
>
> leo's 2¢
>
> [1] http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
> [2] http://en.wikipedia.org/wiki/UTF-16
> [3] [EMAIL PROTECTED]:~/svn/parrot/leo> find t -name '*.t' | xargs grep -w compose
> t/op/string_cs.t: compose S1, S1
> t/pmc/object-mro.t:# ... now some tests which fail to compose the class
> [4] [EMAIL PROTECTED]:~/svn/parrot/leo> ./parrot t/op/string_cs_46.pasm
> ___ǰ___
> 7 8 8 7

-- 
Mark J. Reed <[EMAIL PROTECTED]>