On May 5, 2014, at 12:06, Jens Alfke <j...@mooseyard.com> wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding 
> character offset in the NSString it came from?

I’ve been thinking about this since your original question, and I think the 
problem is subtler than it first appears:

1. You cannot *in general* map a UTF-8 byte offset to an NSString (UTF-16) 
“character” offset. The two representations may use different numbers of code 
units (1-4 for UTF-8, 1-2 for UTF-16) per code point. There’s no real answer to 
the question of what UTF-16 offset corresponds to the 3rd code unit of a code 
point that UTF-8 encodes in 4 bytes.

2. So, you’re restricted at least to byte offsets of UTF-8 code units that are 
the *start* of a code point. However, there’s a potential problem with this, 
because you’re not in control of the structure of the NSString. It’s possible, 
for example, that the UTF-8 byte offset points to the second (or later) code 
point of a base+combining mark sequence, but an equivalent NSString has a 
single code point consisting of one or two code units (a precomposed 
character). In that case the offset has no counterpart in the NSString. Even if 
both versions of the string have the same number of code points 
(“characters”), the code points may appear in a different order.

3. It’s *possible* that you can create an NSString that has the same code 
points in the same order as the UTF-8 string, but I don’t see any API contract 
that clearly guarantees it. The documentation for -[NSString 
initWithCharacters:length:] says that the return value is “An initialized 
NSString object containing length characters taken from characters.” That 
*might* be a sufficient guarantee, but code-point equivalence possibly isn’t 
preserved across some NSString manipulation methods, so you’d have to be 
careful.

4. Otherwise, I think it’s harder still. The next-higher Unicode boundary is 
the “grapheme cluster”. You can divide an NSString into grapheme clusters 
(either through direct iteration using ‘rangeOfComposedCharacterSequence…’, or 
through enumeration using ‘enumerateSubstrings…’), but to match the UTF-8 and 
NSString representations cluster by cluster you’d need to break the UTF-8 
string into grapheme clusters using the same algorithm as NSString, and the 
precise algorithm isn’t documented.

(The documentation at:

        https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

refers to this:

        http://unicode.org/reports/tr29/

which I find pretty overwhelming.)

5. Even if #3 works, you may still have some trouble with grapheme clusters. 
For example, if a UTF-8 byte offset points to a code point in the middle of a 
cluster, you may not get consistent NSString behavior from substrings that 
start at that code point.
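
To make points 1 and 2 concrete, here is a minimal sketch in Python rather 
than Cocoa (an illustration of the encoding arithmetic only, not of any 
NSString API). The function name utf16_offset_for_utf8_offset is hypothetical. 
It assumes the input is valid UTF-8 and counts UTF-16 code units explicitly, 
since Python strings are indexed by code point rather than by UTF-16 code 
unit:

```python
import unicodedata
from typing import Optional


def utf16_offset_for_utf8_offset(data: bytes, byte_offset: int) -> Optional[int]:
    """Map a UTF-8 byte offset to a UTF-16 code-unit offset.

    Returns None when byte_offset falls inside a multi-byte sequence,
    because no UTF-16 offset corresponds to that position.
    (Hypothetical helper; assumes `data` is valid UTF-8.)
    """
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte offset out of range")
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    if byte_offset < len(data) and (data[byte_offset] & 0xC0) == 0x80:
        return None
    prefix = data[:byte_offset].decode("utf-8")
    # Code points above U+FFFF need a surrogate pair (2 code units) in UTF-16.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in prefix)


# "a" (1 byte), U+1F600 (4 bytes, 2 UTF-16 units), U+00E9 (2 bytes, 1 unit)
sample = "a\U0001F600\u00e9".encode("utf-8")
print([utf16_offset_for_utf8_offset(sample, i) for i in range(len(sample) + 1)])
# [0, 1, None, None, None, 3, None, 4]

# Point 2: the same text in decomposed and precomposed form.
decomposed = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")  # 65 CC 81
composed = "\u00e9".encode("utf-8")                                  # C3 A9
print(utf16_offset_for_utf8_offset(decomposed, 1))  # 1 (start of U+0301)
print(utf16_offset_for_utf8_offset(composed, 1))    # None (mid-sequence)
```

Byte offset 1 is a valid code point boundary in the decomposed bytes, but the 
precomposed form of the same text has no position that corresponds to it. 
That is the mismatch described in point 2.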

FWIW, my opinion is that if your library clients are specifying UTF-8 sequences 
at the API, and expect byte offsets into those sequences to be meaningful, you 
might well be forced to maintain the original UTF-8 sequence in the library’s 
internal data model — or, perhaps, an array of the original code points — and 
do all of your internal processing in terms of code points. Conversion to 
NSString would happen only in the journey from data model to UI text field.
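
A rough sketch of that kind of data model, again in Python purely for 
illustration: the class Utf8Text and its method names are hypothetical, and it 
assumes clients always pass valid UTF-8. The original bytes are kept 
untouched, byte offsets are validated against a table of code point start 
positions, and decoding to a display string happens only on request:

```python
from bisect import bisect_left


class Utf8Text:
    """Hypothetical model object that keeps the client's UTF-8 bytes as-is."""

    def __init__(self, data: bytes):
        data.decode("utf-8")  # raises UnicodeDecodeError on invalid input
        self.data = data
        # Byte offset at which each code point starts (non-continuation bytes).
        self.starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

    def code_point_index(self, byte_offset: int) -> int:
        """Translate a client byte offset into a code point index."""
        if byte_offset == len(self.data):
            return len(self.starts)  # offset just past the last code point
        i = bisect_left(self.starts, byte_offset)
        if i == len(self.starts) or self.starts[i] != byte_offset:
            raise ValueError("byte offset is not at a code point boundary")
        return i

    def text_from(self, byte_offset: int) -> str:
        """Decode for display only, starting at a validated byte offset."""
        self.code_point_index(byte_offset)  # reject mid-sequence offsets
        return self.data[byte_offset:].decode("utf-8")


text = Utf8Text("e\u0301x".encode("utf-8"))  # bytes: 65 CC 81 78
print(text.code_point_index(1))  # 1 (the combining mark U+0301)
print(text.code_point_index(3))  # 2 (the "x")
print(text.text_from(3))         # x
```

Because the bytes are never round-tripped through another string type, the 
client’s byte offsets stay meaningful no matter how the UI layer later 
normalizes or segments the text.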

_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
