Re: string vs. char [was Re: Java and Unicode]

addison Mon, 20 Nov 2000 09:43:52 -0800
Hi Jani,

I dunno. I oversimplified in that statement about exposing vs. hiding.

ICU "hides" the facts about the Unicode implementation in macros,
specifically next-character and previous-character macros and various
other fillips. If you look very closely at the function (method)
prototypes you can see that, in fact, a "character" is a 32-bit entity and
a string is made (conditionally) of 16-bit entities. But, as you suggest,
ICU makes it easy to work with (and is set up so that a sufficiently
motivated coder could change the internal encoding).
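
To make the 32-bit "character" over 16-bit storage idea concrete, here is
a rough sketch of what a next/previous pair can look like. These are NOT
ICU's macros: the function names (next_code_point, previous_code_point,
count_code_points) are made up for illustration, and the sketch assumes
UTF-16 held in a std::u16string, with unpaired surrogates passed through
unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; not ICU's actual macros.

// Reads the code point that starts at s[i] and advances i past it.
// Precondition: i < s.size().
inline char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    char16_t lead = s[i++];
    // A lead surrogate followed by a trail surrogate encodes one code point.
    if (lead >= 0xD800 && lead <= 0xDBFF && i < s.size()) {
        char16_t trail = s[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                           + (char32_t(trail) - 0xDC00);
        }
    }
    return lead;  // BMP character, or an unpaired surrogate returned as-is
}

// Reads the code point that ends just before s[i] and moves i back to
// its first code unit. Precondition: 0 < i <= s.size().
inline char32_t previous_code_point(const std::u16string& s, std::size_t& i) {
    char16_t trail = s[--i];
    if (trail >= 0xDC00 && trail <= 0xDFFF && i > 0) {
        char16_t lead = s[i - 1];
        if (lead >= 0xD800 && lead <= 0xDBFF) {
            --i;
            return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                           + (char32_t(trail) - 0xDC00);
        }
    }
    return trail;
}

// Counts code points (not code units) by walking the whole string.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t i = 0, n = 0;
    while (i < s.size()) {
        next_code_point(s, i);
        ++n;
    }
    return n;
}
```

The caller only ever sees char32_t values and an opaque position, which is
roughly the sense in which the storage width is "hidden" even though the
string itself is still an array of 16-bit units.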

<rant>
If you ask 100 programmers what an index into a string means, they'll give
you the wrong answer 99 times... because there is little or no I18n
training in the course of becoming a programmer. The members of this list
are continually ground down by the sheer inertia of ignorance (I just gave
up answering one about email... I must have written a response to that
message a bunch of times, but don't have the time or stamina this morning
to go find and rework one of them).
</rant>

In any case this has been a fun and instructive interlude. As I said in
my initial email, I tend to be a CONSUMER of Unicode APIs rather than a
creator. I haven't written a Unicode support package in quite some time
(and the last one was a UTF-8 hack in C++). It's good to be familiar with
the details, but I find that, as a programmer, one typically doesn't fully
comprehend the design decisions until one faces them oneself. As it is, I
ended up changing my design and sample code over the weekend to follow the
suggestions of several on this list who've Been There.

As a side note: one of the problems I faced on this project was the need
to keep the Unicode and locale libraries extremely small (this is an
embedded OS). I would happily have borrowed ICU to actually *be* the
library... but it's too large. I've had to design a tiny (and therefore
quite limited) support library. It's been an interesting experience.

Best Regards,

Addison

===========================================================
Addison P. Phillips                    Principal Consultant
Inter-Locale LLC                http://www.inter-locale.com
Los Gatos, CA, USA          mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)              +1 408.904.4762 (fax)
===========================================================
Globalization Engineering & Consulting Services

On Mon, 20 Nov 2000, Jani Kajala wrote:

> 
> > >The question, I guess, boils down to: put it in the interface, or hide it
> > >in the internals. ICU exposes it. My spec, up to this point, hides it,
> 
> (I'm aware that the original question was about C interfaces, so you
> might consider this a bit off topic, but I just wanted to comment on
> the exposed encoding.)
> 
> I think that exposing the encoding in interfaces doesn't do any good. It
> violates object-oriented design principles and it is not even intuitive.
> 
> I'd bet that if we take 100 programmers and ask them 'What is this index
> in the context of this string?', in every case we'll get the answer that
> it's of course the nth character position. Nobody who isn't well aware of
> character encodings will ever think of code units. Thus, it is not
> intuitive to use indices to point at code units, especially as Unicode
> has been so well marketed as a '16-bit character set'.
> 
> Besides, you can always use (C++ style) iterators instead of plain
> indices without any loss in performance or in syntactic convenience. By
> 'iterator' here I mean a simple encapsulated pointer which behaves just
> like any C++ Standard Template Library random access iterator but takes
> the encoding into account. Example:
> 
> for ( String::Iterator i = s.begin() ; i != s.end() ; ++i )
>     // ith character in s = *i
>     // i+nth character in s = i[n]
> 
> The solution works with any encoding as long as String::Iterator is
> defined properly.
> 
> The conclusion that using indices won't make a difference in performance
> also makes sense if you consider the basic underlying task: if you need
> random access to a string, you need to check for characters spanning
> multiple code units. So the task has the same O(n) complexity; using
> indices won't help a bit. If the user needs access to an arbitrary
> character, he needs to iterate anyway. It is just a matter of how you
> want to encapsulate the task.
> 
> 
> Regards,
> Jani Kajala
> http://www.helsinki.fi/~kajala/
> 
>
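
For reference, the encoding-aware iterator Jani describes could be
sketched along the following lines. This is only an illustration under
stated assumptions: the class name CodePointIterator and the helpers
cp_begin/cp_end are hypothetical, the storage is assumed to be UTF-16 in
a std::u16string, and the i[n] operator here walks forward n code points,
so it is linear in n rather than constant time.

```cpp
#include <cstddef>
#include <string>

// Hypothetical iterator over the code points of a UTF-16 string.
class CodePointIterator {
public:
    CodePointIterator(const std::u16string& s, std::size_t pos)
        : s_(&s), pos_(pos) {}

    // Decodes the code point at the current position.
    char32_t operator*() const {
        char16_t lead = (*s_)[pos_];
        if (is_pair(pos_)) {
            char16_t trail = (*s_)[pos_ + 1];
            return 0x10000 + ((char32_t(lead) - 0xD800) << 10)
                           + (char32_t(trail) - 0xDC00);
        }
        return lead;  // BMP character or unpaired surrogate
    }

    // Advances by one code point: two code units for a surrogate pair.
    CodePointIterator& operator++() {
        pos_ += is_pair(pos_) ? 2 : 1;
        return *this;
    }

    // i[n]: steps over n code points first, so the cost grows with n.
    char32_t operator[](std::size_t n) const {
        CodePointIterator it = *this;
        while (n-- > 0) ++it;
        return *it;
    }

    bool operator==(const CodePointIterator& o) const { return pos_ == o.pos_; }
    bool operator!=(const CodePointIterator& o) const { return pos_ != o.pos_; }

private:
    // True if a well-formed surrogate pair starts at code unit p.
    bool is_pair(std::size_t p) const {
        return p + 1 < s_->size()
            && (*s_)[p] >= 0xD800 && (*s_)[p] <= 0xDBFF
            && (*s_)[p + 1] >= 0xDC00 && (*s_)[p + 1] <= 0xDFFF;
    }

    const std::u16string* s_;
    std::size_t pos_;  // index in code units, never exposed to the caller
};

inline CodePointIterator cp_begin(const std::u16string& s) {
    return CodePointIterator(s, 0);
}
inline CodePointIterator cp_end(const std::u16string& s) {
    return CodePointIterator(s, s.size());
}
```

A loop such as "for (CodePointIterator i = cp_begin(s); i != cp_end(s);
++i)" then visits characters rather than code units, and the code-unit
index stays an internal detail of the iterator.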