On Mon, Aug 18, 2008 at 4:55 PM, Ricky Sharp <[EMAIL PROTECTED]> wrote:
>
> On Aug 18, 2008, at 3:40 PM, mm w wrote:
>
>> to avoid the splitting problem
>>
>> (c < 128) ? "%c" : "\\u%04x", c);
>
> I'm not sure what this solves.
>
> Per Michael's e-mail below, this is indeed a difficult problem.  UTF-8 is
> just a particular scheme to store Unicode strings.  Operating on individual
> bytes in such streams will most likely not make any sense.
>
> What I would do is pick some normalized form and operate on that data.  For
> a recent feature at my day job, we normalized all input CSV files to
> UTF-16BE.  We were able to handle all of our customer data so far.  The
> final solution still isn't 100% Unicode-savvy (e.g. it does crap-out with
> surrogate pairs), but we have unit tests to expose/document such
> limitations. And, customer data doesn't yet have such things.

Note that depending on what kind of results you want, even if all of
your data is within the BMP, normalizing to UTF-16 *still* won't save
you.

As a really basic example, consider a simple, obvious character like
é. (That's an e with an acute accent on it, if you're having Unicode
trouble in your e-mail client.) It can be represented as a single
precomposed code point (U+00E9), but also as two separate Unicode code
points: a plain old ASCII e (U+0065) followed by a combining acute
accent (U+0301). If you happen to split the string right before the
combining mark, so that the e goes into the first half and the accent
goes into the second half, you get a really unintuitive result: what
appears to the user to be a single character is suddenly blown in two.
Worse, if you insert a string at that point, you could end up applying
the acute accent to some *other* letter instead.
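
To make the failure concrete, here is a rough sketch in plain C that
treats a string as an array of UTF-16 code units, the way NSString
exposes it. The names is_combining_mark, split_orphans_mark and
cafe_decomposed are made up for illustration, and the check only knows
about the Combining Diacritical Marks block (U+0300 to U+036F), which
is an assumption that ignores many other combining ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: only U+0300..U+036F (Combining Diacritical Marks) counts
   as a combining mark here. Real Unicode has many more such ranges. */
static int is_combining_mark(uint16_t unit)
{
    return unit >= 0x0300 && unit <= 0x036F;
}

/* Returns 1 if cutting `text` just before index `split` would put a
   combining mark at the start of the second half, separating it from
   its base character. Returns 0 otherwise. */
static int split_orphans_mark(const uint16_t *text, size_t len, size_t split)
{
    if (split == 0 || split >= len)
        return 0;   /* nothing is cut apart at either end of the string */
    return is_combining_mark(text[split]);
}

/* 'c' 'a' 'f' 'e' followed by U+0301 COMBINING ACUTE ACCENT.
   Five code units, but the user sees four characters: "café". */
static const uint16_t cafe_decomposed[] = { 'c', 'a', 'f', 'e', 0x0301 };
```

Calling split_orphans_mark(cafe_decomposed, 5, 4) reports a bad split,
because index 4 is the accent itself, while a split at index 3 (before
the e) is fine.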

And if you think this is bad, you should see how Unicode deals with Korean.

If you're using NSString, you can find good places to split using the
-rangeOfComposedCharacterSequenceAtIndex: method. I believe that it
will also deal with surrogate pairs, not only "normal" composed
character sequences.
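
For anyone who can't use NSString, the general idea of "back up to a
safe boundary" can be sketched in plain C. This is emphatically not
what -rangeOfComposedCharacterSequenceAtIndex: does internally; it is a
simplified stand-in that only handles surrogate pairs and the U+0300 to
U+036F combining block (both limits are assumptions of the sketch), and
it knows nothing about Hangul, emoji sequences, and so on. All function
and array names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

static int unit_is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int unit_is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* Assumption: only U+0300..U+036F is treated as combining. */
static int unit_is_combining(uint16_t u)      { return u >= 0x0300 && u <= 0x036F; }

/* Given a desired split index into a UTF-16 buffer, move it backwards
   until it no longer falls before a combining mark or between the two
   halves of a surrogate pair. Indexes past the end clamp to `len`. */
static size_t safe_split_index(const uint16_t *text, size_t len, size_t split)
{
    if (split >= len)
        return len;

    /* Step back over any combining marks so they stay with their base. */
    while (split > 0 && unit_is_combining(text[split]))
        split--;

    /* If we landed on a low surrogate, keep it with its high surrogate. */
    if (split > 0 && unit_is_low_surrogate(text[split]) &&
        unit_is_high_surrogate(text[split - 1]))
        split--;

    return split;
}

/* "cafe" + U+0301 + "s": the accent at index 4 belongs to the e at 3. */
static const uint16_t sample_accent[] = { 'c', 'a', 'f', 'e', 0x0301, 's' };

/* 'a', then U+1D11E (outside the BMP) as the pair D834 DD1E, then 'b'. */
static const uint16_t sample_clef[] = { 'a', 0xD834, 0xDD1E, 'b' };
```

With these samples, asking for a split at index 4 of sample_accent
moves back to 3, and asking for a split at index 2 of sample_clef moves
back to 1, so neither the accent nor the low surrogate is orphaned.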

Ultimately if you're doing any manipulation of Unicode, some large
amount of knowledge about Unicode needs to be in the system somewhere.
If your code is running on a Mac then you can use the knowledge that
NSString has about Unicode to help out, sometimes. But alas, due to
how Unicode is designed, there's simply no way to safely manipulate
strings beyond very basic operations like concatenation unless you
either make the code know a lot about Unicode or place overly strong
constraints on the system, such as only splitting on line breaks or
carriage returns (or commas).

Yeah, the situation kind of sucks, but it's what we're stuck with.
Thankfully Foundation and CoreFoundation do a lot to hide the messy,
ugly details from us.

Mike
_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
