Re: How to convert a UTF-8 byte offset into an NSString character offset?
On 06 May 2014, at 20:12, Quincey Morris wrote:
> FWIW, my opinion is that if your library clients are specifying UTF-8
> sequences at the API, and expect byte offsets into those sequences to be
> meaningful, you might well be forced to maintain the original UTF-8 sequence
> in the library’s internal data model (or, perhaps, an array of the original
> code points) and do all of your internal processing in terms of code points.
> Conversion to NSString would happen only in the journey from data model to UI
> text field.

This is pretty much what I do in one of my projects. I hand-wrote the code for converting between offsets, mostly based on tutorials from the net. It relies on two things: my offsets are always at the start or end of a code point, and my conversion code always generates the same sequence of code points, just encoded as UTF-8 or UTF-16 as needed, whether to display text in the UI or to convert selection offsets back to UTF-8 offsets.

The code hasn't shipped (or been tested much beyond myself using the app for a while), but if you're interested, contact me off-list and I'll send you a copy.

Cheers,
-- Uli Kusterer
"The Witnesses of TeachText are everywhere..."
http://www.zathras.de

___
Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)
Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com
Help/Unsubscribe/Update your Subscription: https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com
This email sent to arch...@mail-archive.com
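Editor's sketch: the message above mentions converting selection offsets (UTF-16) back to UTF-8 offsets. The code below is not the poster's code, which was never published on the list. It is a minimal illustration in plain C, and the function name `utf16_offset_to_utf8` is hypothetical. It assumes the caller has the UTF-16 code units in a buffer and rejects offsets that would split a surrogate pair.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of any Apple API.
 * Maps an offset in UTF-16 code units to the matching UTF-8 byte offset.
 * Returns false if the offset is out of range, splits a surrogate pair,
 * or the text before it contains an unpaired surrogate. */
bool utf16_offset_to_utf8(const uint16_t *utf16, size_t len,
                          size_t unit_offset, size_t *out_bytes)
{
    if (unit_offset > len) return false;

    size_t bytes = 0;
    size_t i = 0;
    while (i < unit_offset) {
        uint16_t u = utf16[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: needs a low surrogate right after it. */
            if (i + 1 >= len) return false;
            uint16_t lo = utf16[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return false;
            /* The requested offset must not fall between the two halves. */
            if (i + 2 > unit_offset) return false;
            bytes += 4;              /* supplementary code point: 4 UTF-8 bytes */
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;            /* low surrogate without a high one */
        } else {
            bytes += (u < 0x80) ? 1 : (u < 0x800) ? 2 : 3;
            i += 1;
        }
    }
    *out_bytes = bytes;
    return true;
}
```

For the code units of "a", "é", U+1F600 (a surrogate pair) and "z", offset 4 (just before "z") maps to byte offset 7, while offset 3 is rejected because it lands inside the pair.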
Re: How to convert a UTF-8 byte offset into an NSString character offset?
On May 5, 2014, at 2:06 PM, Jens Alfke wrote:
> How can I map a byte offset in a UTF-8 string back to the corresponding
> character offset in the NSString it came from?

I don't think there's a great way.

You can do the reverse: map a character (really a UTF-16 code unit) offset to a UTF-8 offset using CFStringGetBytes(). You'd pass in a range from 0 to the index you want to map, and NULL for the buffer. It will fill in *usedBufLen with the length in bytes that the conversion would require.

You could build the reverse map by doing that repeatedly for each character index, but that would be expensive. You'd also have to tolerate failure in case a given character index can't be converted (if it references half of a surrogate pair, for example).

So, I suspect that your best bet will be to do the conversion to UTF-8 yourself and build the index map as you go.

Regards,
Ken
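Editor's sketch: one way to "build the index map as you go" is shown below in plain C. It differs slightly from the suggestion above in that it walks UTF-8 bytes that already exist rather than performing the UTF-16 to UTF-8 conversion itself, but the resulting table is the same. The function name `utf8_build_utf16_map` is hypothetical, and the code only checks the structure of each sequence (lead byte plus continuation bytes), not overlong forms or encoded surrogates.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not a CoreFoundation function.
 * Returns a malloc'd array of len + 1 entries where map[i] is the UTF-16
 * code-unit offset for UTF-8 byte offset i. Offsets that fall inside a
 * multi-byte sequence are marked SIZE_MAX. Returns NULL on allocation
 * failure or structurally malformed input. The caller frees the result. */
size_t *utf8_build_utf16_map(const uint8_t *utf8, size_t len)
{
    size_t *map = malloc((len + 1) * sizeof *map);
    if (map == NULL) return NULL;

    size_t units = 0;   /* UTF-16 code units seen so far */
    size_t i = 0;
    while (i < len) {
        uint8_t b = utf8[i];
        /* Sequence length from the lead byte; 0 means "not a lead byte". */
        size_t n = (b < 0x80)           ? 1
                 : ((b & 0xE0) == 0xC0) ? 2
                 : ((b & 0xF0) == 0xE0) ? 3
                 : ((b & 0xF8) == 0xF0) ? 4
                 : 0;
        if (n == 0 || i + n > len) { free(map); return NULL; }

        map[i] = units;
        for (size_t k = 1; k < n; k++) {
            if ((utf8[i + k] & 0xC0) != 0x80) { free(map); return NULL; }
            map[i + k] = SIZE_MAX;   /* middle of a code point */
        }
        /* 4-byte sequences become a surrogate pair in UTF-16. */
        units += (n == 4) ? 2 : 1;
        i += n;
    }
    map[len] = units;   /* end-of-string offset, useful for range ends */
    return map;
}
```

The map costs one `size_t` per byte, which trades memory for constant-time lookups; that only pays off when many offsets are converted for the same string.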
Re: How to convert a UTF-8 byte offset into an NSString character offset?
On May 6, 2014, at 11:12 AM, Quincey Morris wrote:
> I’ve been thinking about this since your original question, and it seems to
> me that this is a subtler problem than it seems:

No offense, but I think you’re overanalyzing it. Remember I said that the UTF-8 data was generated from an NSString via -dataUsingEncoding:, so it’s going to have the same structure as the UTF-16 in the string.

> 3. It’s *possible* that you can create a NSString that has the same code
> points in the same order as the UTF-8 string

I’m not creating an NSString. The client already has an NSString and is asking for a character range in that string.

--Jens
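Editor's sketch: if the UTF-8 bytes and the NSString really do hold the same code points in the same order, as argued above, a byte range can be turned into a UTF-16 range by counting lead bytes. The C code below is an illustration under that assumption, not code from the thread. The types `Utf8Cursor` and `Utf16Range` and both function names are hypothetical. It also assumes valid UTF-8, range boundaries that fall on code point starts, and ranges supplied in ascending order, so the whole string is scanned only once.

```c
#include <stddef.h>

/* Hypothetical types for illustration only. */
typedef struct { size_t byte_pos; size_t utf16_pos; } Utf8Cursor;
typedef struct { size_t location; size_t length; } Utf16Range; /* NSRange-like */

/* Move the cursor forward to byte offset `target`, counting the UTF-16
 * code units that the skipped bytes encode. */
static void cursor_advance(Utf8Cursor *c, const unsigned char *utf8,
                           size_t target)
{
    while (c->byte_pos < target) {
        unsigned char b = utf8[c->byte_pos++];
        if ((b & 0xC0) != 0x80) {
            /* Lead byte: 4-byte sequences (lead >= 0xF0) need two units. */
            c->utf16_pos += (b >= 0xF0) ? 2 : 1;
        }
    }
}

/* Convert one byte range to a UTF-16 range. Ranges must be passed in
 * ascending order, because the cursor never moves backwards. */
Utf16Range utf8_range_to_utf16(Utf8Cursor *c, const unsigned char *utf8,
                               size_t byte_start, size_t byte_len)
{
    Utf16Range r;
    cursor_advance(c, utf8, byte_start);
    r.location = c->utf16_pos;
    cursor_advance(c, utf8, byte_start + byte_len);
    r.length = c->utf16_pos - r.location;
    return r;
}
```

A caller would start with a zeroed `Utf8Cursor` and feed it each token's byte range in turn; the resulting location and length have the same shape as the fields of an NSRange.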
Re: How to convert a UTF-8 byte offset into an NSString character offset?
On May 5, 2014, at 12:06, Jens Alfke wrote:
> How can I map a byte offset in a UTF-8 string back to the corresponding
> character offset in the NSString it came from?

I’ve been thinking about this since your original question, and it seems to me that this is a subtler problem than it first appears:

1. You cannot *in general* map a UTF-8 byte offset to a NSString (UTF-16) “character” offset. The two representations may have different numbers of code units (1-4 for UTF-8, 1-2 for UTF-16) per code point. There’s no real answer to the question of what UTF-16 offset corresponds to the 3rd code unit of a 4-byte UTF-8 code point.

2. So, you’re restricted at least to byte offsets of UTF-8 code units that are the *start* of a code point. However, there’s a potential problem with this, because you’re not in control of the structure of the NSString. It’s possible, for example, that the UTF-8 byte offset points to the second (or later) code point of a base+combining mark sequence, but an equivalent NSString has a single code point consisting of one or two code units (a “composed character”). Even if both versions of the string have the same number of code points (“characters”), they may have different orders.

3. It’s *possible* that you can create a NSString that has the same code points in the same order as the UTF-8 string, but I don’t see any API contract that clearly guarantees it. The documentation for -[NSString initWithCharacters:length:] says that the return value is “An initialized NSString object containing length characters taken from characters.” That *might* be a sufficient guarantee, but code-point equivalence possibly isn’t guaranteed across some NSString manipulation methods, so you’d have to be careful.

4. Otherwise, I think it’s yet more difficult. The next-higher Unicode boundary is “grapheme clusters”. You can divide a NSString into grapheme clusters (either through direct iteration using ‘rangeOfComposedCharacterSequence…’, or through enumeration using ‘enumerateSubstrings…’), but to match the UTF-8 and NSString representations cluster by cluster you’d need to break the UTF-8 string into grapheme clusters using the same algorithm as NSString, and it’s not documented what the precise algorithm is. (The documentation at:

	https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

refers to this:

	http://unicode.org/reports/tr29/

which I find pretty overwhelming.)

5. Even if #3 works, you may still have some troubles with grapheme clusters. For example, if a UTF-8 byte offset is actually a code point in the middle of a cluster, you may have trouble getting consistent NSString behavior with substrings that start from that code point.

FWIW, my opinion is that if your library clients are specifying UTF-8 sequences at the API, and expect byte offsets into those sequences to be meaningful, you might well be forced to maintain the original UTF-8 sequence in the library’s internal data model (or, perhaps, an array of the original code points) and do all of your internal processing in terms of code points. Conversion to NSString would happen only in the journey from data model to UI text field.
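Editor's sketch: points 1 and 2 above can be made concrete with a few lines of C. The helpers below are hypothetical names, not library functions. They show how many code units a code point needs in each encoding, and how to test whether a byte offset is the start of a code point. They say nothing about grapheme clusters or normalization (points 2 to 5), which need far more machinery.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 code units (bytes) needed for a Unicode code point. */
size_t utf8_units(uint32_t cp)
{
    return (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
}

/* Number of UTF-16 code units needed for a Unicode code point. */
size_t utf16_units(uint32_t cp)
{
    return (cp < 0x10000) ? 1 : 2;
}

/* True if `offset` is a code point boundary in valid UTF-8 text: either
 * the end of the buffer or a byte that is not a continuation byte
 * (continuation bytes have the bit pattern 10xxxxxx). */
bool utf8_is_code_point_start(const uint8_t *utf8, size_t len, size_t offset)
{
    if (offset > len) return false;
    if (offset == len) return true;
    return (utf8[offset] & 0xC0) != 0x80;
}
```

For U+1F600, `utf8_units` gives 4 and `utf16_units` gives 2, so the third byte of its UTF-8 form has no UTF-16 counterpart, which is the case point 1 describes.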
Re: How to convert a UTF-8 byte offset into an NSString character offset?
> No, it would probably be to highlight that range of the string in a text
> view, which does require knowing the character range.

Maybe you could take each of the byte ranges returned, create a string from those bytes of the UTF-8 stream, and search for it in the original string; the search result gives you the range in that string.

Another option could be to rebuild the string from the UTF-8 string, noting the ranges (in the new NSString) as you go.

On Tue, May 6, 2014 at 9:13 AM, Jens Alfke wrote:
> On May 5, 2014, at 10:19 PM, Stephen J. Butler wrote:
>
> > What's your next step after doing the UTF8 to UTF16 range conversion? If
> > it's just going to be -[NSString substringWithRange:] then I'd strongly
> > suggest just doing -[NSString initWithBytes:length:encoding:] on the UTF8
> > string.
>
> No, it would probably be to highlight that range of the string in a text
> view, which does require knowing the character range.
>
> (I can’t say for sure because I’m implementing a library, not an app, so
> 3rd party apps will be the ultimate users of this information. But since
> this functionality is about text search, it’s very likely they’d be
> displaying the hit range to the user.)
>
> --Jens

--
Mark Munz
unmarked software
http://www.unmarked.com/
Re: How to convert a UTF-8 byte offset into an NSString character offset?
On May 5, 2014, at 10:19 PM, Stephen J. Butler wrote:
> What's your next step after doing the UTF8 to UTF16 range conversion? If it's
> just going to be -[NSString substringWithRange:] then I'd strongly suggest
> just doing -[NSString initWithBytes:length:encoding:] on the UTF8 string.

No, it would probably be to highlight that range of the string in a text view, which does require knowing the character range.

(I can’t say for sure because I’m implementing a library, not an app, so 3rd party apps will be the ultimate users of this information. But since this functionality is about text search, it’s very likely they’d be displaying the hit range to the user.)

--Jens
Re: How to convert a UTF-8 byte offset into an NSString character offset?
What's your next step after doing the UTF8 to UTF16 range conversion? If it's just going to be -[NSString substringWithRange:] then I'd strongly suggest just doing -[NSString initWithBytes:length:encoding:] on the UTF8 string.

At least profile it and see what the penalty is. You've already paid the UTF16 to UTF8 conversion price once. It's not clear that going from UTF8 to UTF16 again will be a big penalty compared with the range conversion, but profiling would show.

On Mon, May 5, 2014 at 2:06 PM, Jens Alfke wrote:
> How can I map a byte offset in a UTF-8 string back to the corresponding
> character offset in the NSString it came from?
>
> I’m writing an Objective-C wrapper around a C text-tokenizer API that
> takes a UTF-8 string as input, and as part of its output returns byte
> ranges of words that it found. So my API takes an NSString, converts it to
> UTF-8, passes that to the C API, and then gets these byte offsets that it
> needs to convert into character offsets in the NSString.
>
> I’ve looked through both the NSString and CFString APIs and didn’t see
> anything relevant to this. I know UTF-8 isn’t rocket science and I could
> pretty easily write my own function to scan through it counting characters,
> but I suspect I’d run into the differences between Unicode characters and
> the UTF-16 code units that NSString actually considers “characters”. I’d
> much rather let CF do this for me in an internally-consistent way.
>
> --Jens
Re: How to convert a UTF-8 byte offset into an NSString character offset?
On May 5, 2014, at 2:06 PM, Jens Alfke wrote:
> How can I map a byte offset in a UTF-8 string back to the corresponding
> character offset in the NSString it came from?
>
> I’m writing an Objective-C wrapper around a C text-tokenizer API that takes a
> UTF-8 string as input, and as part of its output returns byte ranges of words
> that it found. So my API takes an NSString, converts it to UTF-8, passes that
> to the C API, and then gets these byte offsets that it needs to convert into
> character offsets in the NSString.
>
> I’ve looked through both the NSString and CFString APIs and didn’t see
> anything relevant to this. I know UTF-8 isn’t rocket science and I could
> pretty easily write my own function to scan through it counting characters,
> but I suspect I’d run into the differences between Unicode characters and the
> UTF-16 code units that NSString actually considers “characters”. I’d much
> rather let CF do this for me in an internally-consistent way.

I ran into this same problem once, and I don't think there's any way to do it other than scanning through the string. The good news is that the documentation for CFStringGetLength does specifically say that the length returned is in terms of UTF-16 code pairs, so I don't think they can change that implementation detail without breaking the contract.

Charles
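Editor's sketch: "scanning through the string" can be quite short in C if the UTF-8 is known to be valid (for example because it was produced from the NSString in the first place). The function name `utf8_offset_to_utf16` below is hypothetical. It counts lead bytes before the requested offset, counting 4-byte sequences twice because they become surrogate pairs in UTF-16, and refuses offsets that point into the middle of a sequence.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper; assumes `utf8` holds valid UTF-8.
 * Converts a byte offset into an offset in UTF-16 code units.
 * Returns false if the offset is past the end of the buffer or points at
 * a continuation byte (the middle of a code point). */
bool utf8_offset_to_utf16(const char *utf8, size_t len,
                          size_t byte_offset, size_t *out_units)
{
    if (byte_offset > len) return false;
    if (byte_offset < len &&
        ((unsigned char)utf8[byte_offset] & 0xC0) == 0x80) {
        return false;   /* not at the start of a code point */
    }

    size_t units = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        unsigned char b = (unsigned char)utf8[i];
        if ((b & 0xC0) == 0x80) continue;   /* continuation byte: skip */
        units += (b >= 0xF0) ? 2 : 1;       /* 4-byte lead: surrogate pair */
    }
    *out_units = units;
    return true;
}
```

Each call rescans from the start, so converting many offsets this way is quadratic in the worst case; a running cursor or a precomputed table avoids that.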
How to convert a UTF-8 byte offset into an NSString character offset?
How can I map a byte offset in a UTF-8 string back to the corresponding character offset in the NSString it came from?

I’m writing an Objective-C wrapper around a C text-tokenizer API that takes a UTF-8 string as input, and as part of its output returns byte ranges of words that it found. So my API takes an NSString, converts it to UTF-8, passes that to the C API, and then gets these byte offsets that it needs to convert into character offsets in the NSString.

I’ve looked through both the NSString and CFString APIs and didn’t see anything relevant to this. I know UTF-8 isn’t rocket science and I could pretty easily write my own function to scan through it counting characters, but I suspect I’d run into the differences between Unicode characters and the UTF-16 code units that NSString actually considers “characters”. I’d much rather let CF do this for me in an internally-consistent way.

--Jens