On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.s...@gmail.com> wrote:
On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
* I don't even know how to make a grapheme that is more than one
code-unit, let alone more than one code-point :) Every time I try, I
get 'invalid utf sequence'.
I feel significantly ignorant on this issue, and I'm slowly getting
enough knowledge to join the discussion, but being a dumb American who
only speaks English, I have a hard time grasping how this shit all
works.
1. See my text at
https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction
I can't read that document; it's a black background with super-dark-grey
text.
2.
writeln ("A\u0308\u0330");
<A + umlaut above (U+0308) + tilde below (U+0330)>
If it does not display properly, either set your terminal to a UTF
encoding (e.g. UTF-8) or use a more Unicode-aware font (e.g. the DejaVu
series).
OK, I'll have to remember this so I can use it to test my string type ;)
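For the record, here is a minimal sketch of such a test. It counts the same literal at the three levels being discussed (code units, code points, graphemes). It assumes a Phobos whose std.uni provides byGrapheme; the Counts struct and countLevels helper are made-up names for illustration, not anything from an existing string type.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

/// Hypothetical helper type: sizes of one string at three levels.
struct Counts
{
    size_t codeUnits, codePoints, graphemes;
}

/// Counts UTF-8 code units, decoded code points and grapheme clusters.
Counts countLevels(string s)
{
    return Counts(
        s.length,                 // UTF-8 code units (bytes)
        s.walkLength,             // code points: narrow strings decode to dchar
        s.byGrapheme.walkLength); // user-perceived characters
}

void main()
{
    // 'A' is 1 byte; U+0308 and U+0330 are 2 bytes each in UTF-8.
    auto c = countLevels("A\u0308\u0330");
    writefln("code units: %s, code points: %s, graphemes: %s",
             c.codeUnits, c.codePoints, c.graphemes);
    assert(c == Counts(5, 3, 1));
}
```

So one "character" on screen is 3 dchars and 5 chars here, which is exactly the case a string type has to decide how to handle.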
The point is not to play with Unicode's flexibility like that, but rather
that composite characters are perfectly normal in most of the world's
languages. On this point English is actually a rare exception (setting
aside letters imported from foreign languages, like French 'à'), to the
point of being, I guess, the only Western language without any diacritics.
Is it common to have multiple modifiers on a single character? The
problem I see with using decomposed canonical form for strings is that we
would have to return a dchar[] for each 'element', which severely
complicates code that, for instance, only expects to handle English.
I was hoping to lazily transform a string into its composed canonical
form, allowing the (hopefully rare) exception when a composed character
does not exist. My thinking was that this at least gives a useful string
representation for 90% of usages, leaving the remaining 10% of usages to
find a more complex representation (like your Text type). If we only get
like 20% or 30% there by making dchar the element type, then we haven't
made it useful enough.
Either way, we need a string type that can be compared canonically for
things like searches or opEquals.
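To make that concrete, here is a rough sketch of what I mean by canonical comparison. It assumes a Phobos whose std.uni provides normalize and the NFC form; canonEqual is a hypothetical helper name, and it normalizes eagerly rather than lazily, so it only illustrates the semantics and is not a proposed implementation.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : normalize, NFC;

/// Hypothetical helper: equality under canonical equivalence,
/// implemented by composing both sides (NFC) before comparing.
bool canonEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    // Decomposed 'A' + U+0308 and precomposed U+00C4 differ as raw strings...
    assert("A\u0308" != "\u00C4");
    // ...but are canonically equivalent.
    assert(canonEqual("A\u0308", "\u00C4"));

    // Marks of different combining classes may appear in either order.
    assert(canonEqual("A\u0308\u0330", "A\u0330\u0308"));

    // Code points left after composing the example from earlier in the
    // thread. As far as I know no single precomposed code point covers
    // it, so this is the "exception" case I mentioned.
    writeln(normalize!NFC("A\u0308\u0330").walkLength);
}
```

An opEquals or a search on the string type would then compare the composed forms rather than the raw code units.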
-Steve