On May 5, 2014, at 2:06 PM, Jens Alfke <j...@mooseyard.com> wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding 
> character offset in the NSString it came from?
> 
> I’m writing an Objective-C wrapper around a C text-tokenizer API that takes a 
> UTF-8 string as input, and as part of its output returns byte ranges of words 
> that it found. So my API takes an NSString, converts it to UTF-8, passes that 
> to the C API, and then gets these byte offsets that it needs to convert into 
> character offsets in the NSString.
> 
> I’ve looked through both the NSString and CFString APIs and didn’t see 
> anything relevant to this. I know UTF-8 isn’t rocket science and I could 
> pretty easily write my own function to scan through it counting characters, 
> but I suspect I’d run into the differences between Unicode characters and the 
> UTF-16 code points that NSString actually considers “characters”. I’d much 
> rather let CF do this for me in an internally-consistent way.

I ran into this same problem once, and I don't think there's any way to do it 
other than scanning through the string yourself. The good news is that the 
documentation for CFStringGetLength specifically says that the length returned 
is in terms of UTF-16 code pairs (that is, 16-bit UTF-16 code units), so I 
don't think they can change that implementation detail without breaking the 
contract.
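
For what it's worth, here is a minimal sketch of that scan in plain C. It 
assumes the buffer is valid UTF-8 and that byte_offset does not run past the 
end of it. The function name is made up for illustration; it is not a CF or 
Foundation API. The idea is that every UTF-8 sequence becomes one UTF-16 unit, 
except 4-byte sequences, which become a surrogate pair (two units).

```c
#include <stddef.h>

/* Hypothetical helper, not part of CoreFoundation or Foundation.
 * Returns the number of UTF-16 code units that encode the first
 * `byte_offset` bytes of `utf8`.
 * Assumes: valid UTF-8, and byte_offset <= length of the buffer. */
size_t utf16_offset_for_utf8_offset(const char *utf8, size_t byte_offset)
{
    const unsigned char *p = (const unsigned char *)utf8;
    size_t units = 0;

    for (size_t i = 0; i < byte_offset; i++) {
        unsigned char b = p[i];

        if ((b & 0xC0) == 0x80)
            continue;       /* continuation byte: already counted */

        units++;            /* lead byte or ASCII: one UTF-16 unit */

        if (b >= 0xF0)
            units++;        /* 4-byte sequence: surrogate pair */
    }
    return units;
}
```

To turn a byte range into an NSRange, you would call this once for the start 
offset and once for the end offset and subtract. If the tokenizer returns many 
ranges in ascending order, it is probably cheaper to keep a running position 
and count instead of rescanning from the start each time.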

Charles


_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

