> *Character Encoding*
> Bennie wrote:
>
>> Now you say I want the 132nd letter in a string; this is not meaningful,
>> and it is incorrect in some Asian languages, since they use multiple
>> Unicode chars per character in the encoding, as discussed, i.e. word[5]
>> may be the 2nd character...
>>
>
> This is mistaken. For starters, Bennie's comment confuses code points with
> characters and then confuses some encoding issues. Once that is cleaned up,
> we still need to consider normalizations and modifiers.
>
> In a UNICODE string, it is *perfectly* sensible to request either the
> 132nd *code point* or the 132nd *code unit*. Obtaining the 132nd code
> unit is an O(1) operation, but the result may not correspond to a code
> point or even to the *beginning* of a code point. Obtaining the 132nd
> code point may or may not be an O(1) operation depending on the encoding
> selected (UTF-8, UTF-16, UTF-32) and the character planes that are in use.
> In most cases, the extended character planes can be ignored. Under that
> assumption, UTF-16 encoding can be implemented in UCS-2 code units and O(1)
> indexing is achieved. Yes, that incurs memory overhead.
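>
> A minimal Java sketch of the code-unit/code-point distinction (Java's
> String, like the CLR's, stores UTF-16 code units; the example is mine, not
> from the thread):
>
>     public class UnitsVsPoints {
>         public static void main(String[] args) {
>             String bmp = "Stra\u00DFe";   // "Straße": all code points in the BMP
>             // O(1): indexing a code unit.
>             char unit = bmp.charAt(4);    // 'ß' -- here one unit is one code point
>             // Under the "no extended planes" assumption, unit index == point index:
>             System.out.println(bmp.length() == bmp.codePointCount(0, bmp.length()));
>             // In general, reaching the Nth code point means scanning from the start:
>             int idx = bmp.offsetByCodePoints(0, 4);   // O(n) in the worst case
>             System.out.println(unit + " " + idx);     // ß 4
>         }
>     }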
>
But you can't rely on that assumption. The moment you use the extended code
planes, the entire code becomes useless, and you end up with code like

    if (uses this code plane)
        do this;
    if (uses this other code plane)
        do this;
    // etc.

Even worse, the entire code handling and indexing is different, so you
can't just override methods. Then you may as well have worked on byte[] or
ASCII; at least the code will be faster, tighter, and handle all conditions.
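
For illustration, a small Java example of exactly that breakage (Java's
UTF-16 String standing in for UCS-2 code units plus surrogates; the example
is mine, not from the thread):

    public class PlaneBreakage {
        public static void main(String[] args) {
            // 'a', then U+1D11E (MUSICAL SYMBOL G CLEF, plane 1), then 'b'
            String s = "a\uD834\uDD1Eb";
            System.out.println(s.length());                      // 4 code units...
            System.out.println(s.codePointCount(0, s.length())); // ...but 3 code points
            // Naive UCS-2 indexing hands back half a character:
            System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
            // ...so every call site needs the "which plane?" branch described above:
            System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1d11e
        }
    }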
>
> Having a code point does not mean you have a character or even the
> beginning of a character. There are so-called "combining sequences" in
> UNICODE that involve multiple code points. For example, there are multiple
> valid ways of representing a capital C with a cedilla. See
> <http://unicode.org/reports/tr15/#Canon_Compat_Equivalence> in the
> applicable standard.
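>
> Concretely (a minimal Java sketch using the JDK's Normalizer; the CLR's
> String.Normalize is analogous):
>
>     import java.text.Normalizer;
>
>     public class Cedilla {
>         public static void main(String[] args) {
>             String precomposed = "\u00C7";   // LATIN CAPITAL LETTER C WITH CEDILLA
>             String combining   = "C\u0327";  // 'C' + COMBINING CEDILLA
>             // Different code point sequences...
>             System.out.println(precomposed.equals(combining));   // false
>             // ...but canonically equivalent once normalized:
>             System.out.println(
>                 Normalizer.normalize(precomposed, Normalizer.Form.NFC).equals(
>                 Normalizer.normalize(combining, Normalizer.Form.NFC)));  // true
>         }
>     }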
>
I am aware of all this (and the 0x10 escape char)... though my terminology,
as always, is loose despite my efforts. I deliberately used the terms char
and word to indicate an English-style "get the first char of the word",
which I thought you were asking about, since getting the 132nd char seemed
meaningless to me.
> What this mainly serves to reveal is that the proper handling of
> international character data is a nightmarishly complex business, and the
> entire *concept* of a fixed-length character needs to be discarded in
> order to understand how international text really works. Once you realize
> that, your whole point of view on indexing encodings changes, because
> getting an O(1) indexing operation at the encoding layer doesn't really
> help you that much.
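>
> To make that concrete, here is a sketch (the helper method is mine;
> BreakIterator is the JDK's grapheme-boundary API) of what fetching the Nth
> user-perceived character actually costs:
>
>     import java.text.BreakIterator;
>
>     public class NthGrapheme {
>         // O(n) no matter how the underlying string is encoded: grapheme
>         // boundaries can only be found by walking from the start.
>         static String grapheme(String s, int n) {
>             BreakIterator bi = BreakIterator.getCharacterInstance();
>             bi.setText(s);
>             int start = bi.first();
>             for (int i = 0; i < n; i++) start = bi.next();
>             return s.substring(start, bi.next());
>         }
>
>         public static void main(String[] args) {
>             String s = "C\u0327at";               // "Çat" with a combining cedilla
>             System.out.println(grapheme(s, 0));   // "Ç": two code points, one character
>         }
>     }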
>
Agreed, this is my fundamental point: Unicode code points cannot describe
all Chinese (or, according to David, Japanese) characters, and hence code
that assumes they can is flawed. E.g., at present the Chinese encoding of a
Chinese character 0x78 0x1A is not the code point 0x781A; by law it must be
encoded in GB, so it ends up being 0x0078 and 0x001A. That does not even
touch the duplicates you mention, like the two Turkish I's, etc.
> In fact - and I say this having done it myself - if you're still looking
> for an O(1) way to extract a "character" from a UNICODE string, it's fair
> to say that you probably don't understand what a character is.
>
I thought you wanted me to show that the index is not O(1), so I gave a
solution which was close, with caveats. I think I stated clearly that byte
indexing and searching were common. I would have O(1) for format, for the
offset of {N :XX}, but that is just a search and then cached as a byte
index, since those strings get a lot of repeats.
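
A hypothetical Java sketch of that caching scheme (the class and names are
mine, not from any real implementation): the first lookup is a linear scan
of the UTF-8 bytes; repeats hit the cache in O(1).

    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    public class CachedByteIndex {
        private final byte[] utf8;
        private final Map<Integer, Integer> cache = new HashMap<>();

        CachedByteIndex(String s) { this.utf8 = s.getBytes(StandardCharsets.UTF_8); }

        int byteIndexOfCodePoint(int n) {
            return cache.computeIfAbsent(n, k -> {
                int i = 0;
                for (int seen = 0; seen < k; seen++) {
                    // skip one UTF-8 sequence: continuation bytes are 10xxxxxx
                    i++;
                    while (i < utf8.length && (utf8[i] & 0xC0) == 0x80) i++;
                }
                return i;
            });
        }
    }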
>
> I'm only a little bit ahead of some of you. I've dug into UNICODE far
> enough to recognize that I don't know what a character is, but not far
> enough to feel like I know how to really drive UNICODE. On the bright side,
> I seem to have a lot of company. :-)
>
IMHO UCS-2 Unicode is a disaster; we would have been FAR better off staying
with ASCII and the encodings, but organising the encodings that existed
before. This is basically UTF-8 but without the assumptions. Unfortunately,
when it was introduced it came with lots of promises to represent every
char (now that was Western bias!), and Java and Windows adopted it. The CLR
inherited UCS-2 but put UTF-16 on top, as that can be done pretty
efficiently.
>
> CLR includes a 'char' datatype largely for legacy compatibility reasons,
> and because you need *something* to hold the result of indexing a string.
> But if you dig into the CLR internationalization APIs (or the Java
> equivalents), you'll find that most of them are framed in terms of
> operations on strings rather than operations on "characters". That's
> because there ain't no such thing as a fixed-length character.
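>
> For example (a minimal Java sketch; System.Globalization on the CLR has
> close analogues):
>
>     import java.text.Collator;
>     import java.util.Locale;
>
>     public class StringsNotChars {
>         public static void main(String[] args) {
>             // Locale-sensitive comparison is an operation on strings,
>             // not on any fixed-length "character":
>             Collator fr = Collator.getInstance(Locale.FRENCH);
>             fr.setStrength(Collator.PRIMARY);    // ignore accents and case
>             System.out.println(fr.compare("cote", "c\u00F4t\u00E9"));  // 0: equal
>         }
>     }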
>
The main use of char is, as you say, as an indexer, and to get a char[] to
do mutable work. Mono's internationalization APIs work on the internal
unsafe char[], but they do string searches; it's just an optimization so
they avoid run-length checks etc.
My main preference for UTF-8 (or even ASCII and the old encodings!) is
that, since there is no good solution, let's get one that's fast and memory
efficient, to make a solid case for the language. Blowing 40-50% of heap
space on 0x00 bytes just leaves a bitter taste in my mouth. And if you
think benchmarks are not important, look at what Linux benchmarks for
Linux-style apps did to microkernels and the not-so-worthy contenders
(Minix, Hurd, and Mach).
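
The heap arithmetic is easy to check (a Java illustration; Java's char[]
layout is the same two-bytes-per-unit story as the CLR's):

    import java.nio.charset.StandardCharsets;

    public class HeapCost {
        public static void main(String[] args) {
            String ascii = "hello world";   // typical identifier/source-code text
            int utf8  = ascii.getBytes(StandardCharsets.UTF_8).length;    // 11
            int utf16 = ascii.getBytes(StandardCharsets.UTF_16LE).length; // 22
            // For ASCII-heavy text, half the UTF-16 bytes are 0x00:
            System.out.println(utf8 + " vs " + utf16);
        }
    }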
Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev