On Mon, Aug 18, 2008 at 4:55 PM, Ricky Sharp <[EMAIL PROTECTED]> wrote:
>
> On Aug 18, 2008, at 3:40 PM, mm w wrote:
>
>> to avoid the splitting problem
>>
>> (c < 128) ? "%c" : "\\u%04x", c);
>
> I'm not sure what this solves.
>
> Per Michael's e-mail below, this is indeed a difficult problem. UTF-8 is
> just a particular scheme to store Unicode strings. Operating on individual
> bytes in such streams will most likely not make any sense.
>
> What I would do is pick some normalized form and operate on that data. For
> a recent feature at my day job, we normalized all input CSV files to
> UTF-16BE. We were able to handle all of our customer data so far. The
> final solution still isn't 100% Unicode-savvy (e.g. it does crap-out with
> surrogate pairs), but we have unit tests to expose/document such
> limitations. And, customer data doesn't yet have such things.

Note that depending on what kind of results you want, even if all of
your data is within the BMP, this *still* won't save you.

As a really basic example, consider a simple, obvious character like é.
(That's an e with an acute accent on it, if you're having Unicode trouble
in your e-mail client.) That can be represented as two separate Unicode
code points: a plain old ASCII e followed by a combining accent mark. If
you should happen to split the string at the accent mark, such that the e
goes into the first half and the combining accent mark goes into the
second half, you get a really unintuitive result. What appears to the
user to be a single character suddenly gets blown in two. Worse, if you
happen to insert a string in the middle, you could end up applying that
acute accent to some *other* letter instead. And if you think this is
bad, you should see how Unicode deals with Korean.

If you're using NSString, you can find good places to split using the
-rangeOfComposedCharacterSequenceAtIndex: method. I believe that it will
also deal with surrogate pairs, not only "normal" composed character
sequences.

Ultimately, if you're doing any manipulation of Unicode, some large
amount of knowledge about Unicode needs to be in the system somewhere. If
your code is running on a Mac, then you can use the knowledge that
NSString has about Unicode to help out, sometimes. But alas, due to how
Unicode is designed, there's simply no way to safely manipulate strings
beyond very basic operations like concatenation unless you either make
the code know a lot about Unicode or place overly strong constraints on
the system, such as only splitting on line breaks or carriage returns (or
commas).

Yeah, the situation kind of sucks, but it's what we're stuck with.
Thankfully Foundation and CoreFoundation do a lot to hide the messy, ugly
details from us.

Mike
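
For anyone who wants to see the é problem outside of Cocoa, here is a
minimal sketch in Python rather than Objective-C. It is not how NSString
does it: the helper name safe_split_index is made up for illustration,
and it only backs away from combining marks. It does not handle Korean
jamo or other composed sequences, which need the full segmentation rules
that -rangeOfComposedCharacterSequenceAtIndex: is meant to cover.

```python
import unicodedata


def safe_split_index(s: str, index: int) -> int:
    """Hypothetical helper: move index left until it no longer sits
    directly before a combining mark. Only an approximation of a
    composed-character boundary."""
    index = max(0, min(index, len(s)))
    while 0 < index < len(s) and unicodedata.combining(s[index]) != 0:
        index -= 1
    return index


# "cafés" with é stored as two code points: 'e' + COMBINING ACUTE ACCENT
decomposed = "cafe\u0301s"

# Naive split at index 4 separates the e from its accent.
naive_left, naive_right = decomposed[:4], decomposed[4:]

# Naive insertion at index 4 moves the accent onto the inserted letter.
naive_insert = decomposed[:4] + "X" + decomposed[4:]

# Backing up to a safer index keeps the e and its accent together.
cut = safe_split_index(decomposed, 4)
left, right = decomposed[:cut], decomposed[cut:]
```

Here naive_right begins with a bare combining accent and naive_insert
ends up with the accent attached to the X, which is exactly the failure
described above. With the adjusted index, the e and its accent both land
in the second half.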