Re: How to convert a UTF-8 byte offset into an NSString character offset?

Quincey Morris Tue, 06 May 2014 11:17:26 -0700

On May 5, 2014, at 12:06 , Jens Alfke <j...@mooseyard.com> wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding 
> character offset in the NSString it came from?

I’ve been thinking about this since your original question, and I think this
is a subtler problem than it first appears:

1. You cannot *in general* map a UTF-8 byte offset to an NSString (UTF-16)
“character” offset. The two representations may have different numbers of code
units (1-4 for UTF-8, 1-2 for UTF-16) per code point. There’s no real answer to
the question of what UTF-16 offset corresponds to the 3rd code unit of a code
point that is encoded as 4 bytes in UTF-8.
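
To make point 1 concrete, here is a minimal sketch in Python (not Cocoa) of the
mapping for the offsets that *can* be mapped. The function name
utf8_offset_to_utf16_offset is made up for illustration, and the sketch assumes
the UTF-16 string has exactly the same code points in the same order as the
UTF-8 data, which is the assumption questioned in the points that follow.

```python
def utf8_offset_to_utf16_offset(data: bytes, byte_offset: int) -> int:
    """Map a byte offset in UTF-8 `data` to a UTF-16 code unit offset.

    Hypothetical helper. Raises ValueError if the offset is out of range
    or falls inside a multi-byte sequence.
    """
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte offset out of range")
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    if byte_offset < len(data) and (data[byte_offset] & 0xC0) == 0x80:
        raise ValueError("byte offset is not at the start of a code point")
    prefix = data[:byte_offset].decode("utf-8")
    # UTF-16LE writes no BOM, so every code unit is exactly two bytes.
    return len(prefix.encode("utf-16-le")) // 2


text = "a\u00e9\U0001F600z"   # 'a', U+00E9, one code point outside the BMP, 'z'
data = text.encode("utf-8")   # 1 + 2 + 4 + 1 = 8 bytes

z_offset = utf8_offset_to_utf16_offset(data, 7)   # byte offset of 'z'
```

Byte offsets 4, 5 and 6 land inside the 4-byte sequence, so the function
refuses them instead of guessing.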

2. So, you’re restricted at least to byte offsets of UTF-8 code units that are
the *start* of a code point. However, there’s a potential problem with this,
because you’re not in control of the structure of the NSString. It’s possible,
for example, that the UTF-8 byte offset points to the second (or later) code
point of a base+combining mark sequence, but an equivalent NSString has a
single code point consisting of one or two code units (a “composed character”).
Even if both versions of the string have the same number of code points
(“characters”), the code points may appear in a different order.
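
Both effects in point 2 are easy to reproduce with Unicode normalization. The
sketch below uses Python's unicodedata module purely as an illustration; it
makes no claim about when, or whether, NSString normalizes a string.

```python
import unicodedata

# Composition changes the number of code points.
decomposed = "e\u0301"                               # 'e' + COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

utf8 = decomposed.encode("utf-8")                    # 3 bytes: 'e' (1) + U+0301 (2)
# Byte offset 1 is a valid code point start in the UTF-8 data,
# but the composed string has no code point that corresponds to it.
mark_at_offset_1 = utf8[1:].decode("utf-8")

# Canonical reordering keeps the count but changes the order.
original = "a\u0301\u0323"                           # acute above, then dot below
reordered = unicodedata.normalize("NFD", original)   # dot below now comes first
```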

3. It’s *possible* that you can create an NSString that has the same code points
in the same order as the UTF-8 string, but I don’t see any API contract that
clearly guarantees it. The documentation for -[NSString
initWithCharacters:length:] says that the return value is “An initialized
NSString object containing length characters taken from characters.” That
*might* be a sufficient guarantee, but code-point equivalence possibly isn’t
guaranteed across some NSString manipulation methods, so you’d have to be
careful.

4. Otherwise, I think it’s yet more difficult. The next-higher Unicode boundary
is “grapheme clusters”. You can divide an NSString into grapheme clusters
(either through direct iteration using ‘rangeOfComposedCharacterSequence…’, or
through enumeration using ‘enumerateSubstrings…’), but to match the UTF-8 and
NSString representations cluster by cluster you’d need to break the UTF-8
string into grapheme clusters using the same algorithm as NSString, and it’s
not documented what the precise algorithm is.

(The documentation at:

https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

refers to this:

http://unicode.org/reports/tr29/

which I find pretty overwhelming.)

5. Even if #3 works, you may still have some trouble with grapheme clusters.
For example, if a UTF-8 byte offset points to a code point in the middle of a
cluster, you may have trouble getting consistent NSString behavior with
substrings that start from that code point.

FWIW, my opinion is that if your library clients are specifying UTF-8 sequences
at the API, and expect byte offsets into those sequences to be meaningful, you
might well be forced to maintain the original UTF-8 sequence in the library’s
internal data model — or, perhaps, an array of the original code points — and
do all of your internal processing in terms of code points. Conversion to
NSString would happen only in the journey from data model to UI text field.
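
As a rough sketch of that last suggestion, again in Python rather than Cocoa:
the class below keeps the original UTF-8 bytes plus an index of where each code
point starts, and converts to a native string only for display. The name
Utf8Text and its methods are hypothetical, invented here for illustration.

```python
from bisect import bisect_right


class Utf8Text:
    """Hypothetical data-model type that keeps the original UTF-8 bytes."""

    def __init__(self, data: bytes):
        data.decode("utf-8")  # validate; raises UnicodeDecodeError if malformed
        self.data = data
        # Byte offsets at which a code point starts (every non-continuation byte).
        self.starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]

    def code_point_index(self, byte_offset: int) -> int:
        """Return the index of the code point that contains byte_offset."""
        if not 0 <= byte_offset < len(self.data):
            raise IndexError("byte offset out of range")
        return bisect_right(self.starts, byte_offset) - 1

    def display_string(self) -> str:
        """Convert to a native string only at the UI boundary."""
        return self.data.decode("utf-8")


model = Utf8Text("a\u00e9\U0001F600z".encode("utf-8"))
index_of_z = model.code_point_index(7)   # byte 7 starts the fourth code point
```

Client byte offsets stay meaningful because the bytes are never rewritten; an
offset into the middle of a multi-byte sequence resolves to the code point that
contains it.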
