Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-07 Thread Uli Kusterer
On 06 May 2014, at 20:12, Quincey Morris wrote:
> FWIW, my opinion is that if your library clients are specifying UTF-8 
> sequences at the API, and expect byte offsets into those sequences to be 
> meaningful, you might well be forced to maintain the original UTF-8 sequence 
> in the library’s internal data model — or, perhaps, an array of the original 
> code points — and do all of your internal processing in terms of code points. 
> Conversion to NSString would happen only in the journey from data model to UI 
> text field.

This is pretty much what I do in one of my projects. I hand-wrote code for 
converting between offsets, mostly based on tutorials from the net. It relies 
on my offsets always being at the start or end of a code point, and on my 
conversion code always generating the same sequence, just encoded as UTF-8 or 
UTF-16 as needed to display in the UI or to convert selection offsets back to 
UTF-8 offsets. The code hasn't shipped, and hasn't been tested much beyond my 
own use of the app for a while, but if you're interested, contact me off-list 
and I'll send you a copy.
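
For readers without access to that code, here is a minimal sketch in plain C of
the UI-to-UTF-8 direction: mapping a UTF-16 code unit offset (a selection
offset, say) to the matching UTF-8 byte offset. It is not the code described
above. The function name is hypothetical, and the sketch assumes the offset
lies on a code point boundary and counts an unpaired surrogate as three bytes,
the encoded size of U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Foundation/CoreFoundation API.
 * Returns the number of UTF-8 bytes needed to encode the first
 * `unit_offset` UTF-16 code units of `utf16`.
 * Assumption: unit_offset is on a code point boundary. An offset that
 * splits a surrogate pair is rounded up to the end of the pair. */
size_t utf8_offset_for_utf16_offset(const uint16_t *utf16, size_t length,
                                    size_t unit_offset)
{
    assert(utf16 != NULL && unit_offset <= length);
    size_t bytes = 0;
    size_t i = 0;
    while (i < unit_offset) {
        uint16_t u = utf16[i];
        if (u < 0x80) {
            bytes += 1;                 /* ASCII */
            i += 1;
        } else if (u < 0x800) {
            bytes += 2;                 /* U+0080..U+07FF */
            i += 1;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < length &&
                   utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            bytes += 4;                 /* surrogate pair: one 4-byte sequence */
            i += 2;
        } else {
            bytes += 3;                 /* rest of the BMP, or a lone surrogate */
            i += 1;
        }
    }
    return bytes;
}
```

For the code units 0x0061, 0x00E9, 0xD83D, 0xDE00, 0x007A (the letters a and é,
U+1F600, then z), unit offsets 0, 1, 2, 4 and 5 map to byte offsets 0, 1, 3, 7
and 8.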

Cheers,
-- Uli Kusterer
"The Witnesses of TeachText are everywhere..."
http://www.zathras.de


___

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-06 Thread Ken Thomases
On May 5, 2014, at 2:06 PM, Jens Alfke wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding 
> character offset in the NSString it came from?

I don't think there's a great way.

You can do the reverse, map a character (really a UTF-16 code unit) offset to a 
UTF-8 offset using CFStringGetBytes().  You'd pass in a range from 0 to the 
index you want to map and NULL for the buffer.  It will fill in *usedBufLen 
with the length in bytes that would be required by the conversion.

You could build the reverse map by doing that repeatedly for each character 
index, but that would be expensive.  You'd also have to tolerate failure in 
case a given character index can't be converted (if it references half of a 
surrogate pair, for example).

So, I suspect that your best bet will be to do the conversion to UTF-8 yourself 
and build the index map as you go.
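
A minimal sketch of that last suggestion in plain C: encode the UTF-16 code
units to UTF-8 by hand and record, for every byte written, the UTF-16 index of
the code point it belongs to. The function name, the map layout and the
replacement of unpaired surrogates with U+FFFD are assumptions of this sketch,
not anything CF provides. The UTF-16 input could come from -[NSString
getCharacters:range:].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. Encodes `length` UTF-16 code units into `out` as
 * UTF-8 and fills `map` so that map[b] is the UTF-16 index of the code
 * point that produced byte b. map[n] (n = bytes written) is set to
 * `length`, so end offsets can be looked up too.
 * `out` needs room for 3 * length bytes, `map` for 3 * length + 1 entries.
 * Returns the number of bytes written. */
size_t utf8_encode_with_map(const uint16_t *utf16, size_t length,
                            uint8_t *out, size_t *map)
{
    assert(utf16 != NULL && out != NULL && map != NULL);
    size_t n = 0;
    size_t i = 0;
    while (i < length) {
        size_t start = i;               /* UTF-16 index of this code point */
        uint32_t cp = utf16[i++];
        if (cp >= 0xD800 && cp <= 0xDBFF && i < length &&
            utf16[i] >= 0xDC00 && utf16[i] <= 0xDFFF) {
            /* Combine a valid surrogate pair into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;                /* assumption: replace lone surrogates */
        }
        size_t first = n;
        if (cp < 0x80) {
            out[n++] = (uint8_t)cp;
        } else if (cp < 0x800) {
            out[n++] = (uint8_t)(0xC0 | (cp >> 6));
            out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[n++] = (uint8_t)(0xE0 | (cp >> 12));
            out[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        } else {
            out[n++] = (uint8_t)(0xF0 | (cp >> 18));
            out[n++] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
            out[n++] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
            out[n++] = (uint8_t)(0x80 | (cp & 0x3F));
        }
        for (size_t b = first; b < n; b++)
            map[b] = start;
    }
    map[n] = length;                    /* one-past-the-end entry */
    return n;
}
```

With the map in hand, a tokenizer byte range [start, end) corresponds to the
UTF-16 range starting at map[start] with length map[end] - map[start]. The map
costs one size_t per byte, so for large texts scanning on demand may be the
better trade.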

Regards,
Ken



Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-06 Thread Jens Alfke

On May 6, 2014, at 11:12 AM, Quincey Morris wrote:

> I’ve been thinking about this since your original question, and it seems to 
> me that this is a subtler problem than it first appears:

No offense, but I think you’re overanalyzing it. Remember I said that the UTF-8 
data was generated from an NSString via -dataUsingEncoding:, so it’s going to 
have the same structure as the UTF-16 in the string.

> 3. It’s *possible* that you can create a NSString that has the same code 
> points in the same order as the UTF-8 string

I’m not creating an NSString. The client already has an NSString and is asking 
for a character range in that string.

—Jens

Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-06 Thread Quincey Morris
On May 5, 2014, at 12:06, Jens Alfke wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding 
> character offset in the NSString it came from?

I’ve been thinking about this since your original question, and it seems to me 
that this is a subtler problem than it first appears:

1. You cannot *in general* map a UTF-8 byte offset to a NSString (UTF-16) 
“character” offset. The two representations may have different numbers of code 
units (1-4 for UTF-8, 1-2 for UTF-16) per code point. There’s no real answer to 
the question of what UTF-16 offset corresponds to the 3rd code unit of a 4-byte 
UTF-8 code point.

2. So, you’re restricted at least to byte offsets of UTF-8 code units that are 
the *start* of a code point. However, there’s a potential problem with this, 
because you’re not in control of the structure of the NSString. It’s possible, 
for example, that the UTF-8 byte offset points to the second (or later) code 
point of a base+combining mark sequence, but an equivalent NSString has a 
single code point consisting of one or two code units (a “composed character”). 
Even if both versions of the string have the same number of code points 
(“characters”), they may have different orders.

3. It’s *possible* that you can create a NSString that has the same code points 
in the same order as the UTF-8 string, but I don’t see any API contract that 
clearly guarantees it. The documentation for -[NSString 
initWithCharacters:length:] says that the return value is “An initialized 
NSString object containing length characters taken from characters.” That 
*might* be a sufficient guarantee, but code-point equivalence possibly isn’t 
guaranteed across some NSString manipulation methods, so you’d have to be 
careful.

4. Otherwise, I think it’s yet more difficult. The next-higher Unicode boundary 
is “grapheme clusters”. You can divide a NSString into grapheme clusters 
(either through direct iteration using ‘rangeOfComposedCharacterSequence…’, or 
through enumeration using ‘enumerateSubstrings…’), but to match the UTF-8 and 
NSString representations cluster by cluster you’d need to break the UTF-8 
string into grapheme clusters using the same algorithm as NSString, and it’s 
not documented what the precise algorithm is.

(The documentation at:


https://developer.apple.com/library/mac/documentation/Cocoa/Conceptual/Strings/Articles/stringsClusters.html

refers to this:

http://unicode.org/reports/tr29/

which I find pretty overwhelming.)

5. Even if #3 works, you may still have some troubles with grapheme clusters. 
For example, if a UTF-8 byte offset actually points to a code point in the 
middle of a cluster, you may have trouble getting consistent NSString behavior 
with substrings that start from that code point.
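
Points 1 and 2 can be made concrete with a short C sketch. The function names
are hypothetical. The first checks whether a byte offset is the start of a code
point (anything other than a 10xxxxxx continuation byte), and the other two
give the number of code units one code point occupies in each encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if `offset` is a code point boundary in
 * well-formed UTF-8, i.e. it is the end of the buffer or the byte there
 * is not a continuation byte (bit pattern 10xxxxxx). */
bool utf8_is_code_point_start(const uint8_t *utf8, size_t length,
                              size_t offset)
{
    assert(utf8 != NULL && offset <= length);
    if (offset == length)
        return true;
    return (utf8[offset] & 0xC0) != 0x80;
}

/* Number of UTF-8 code units (bytes) for one code point: 1 to 4. */
size_t utf8_units_for_code_point(uint32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

/* Number of UTF-16 code units for one code point: 1, or 2 for a
 * surrogate pair (code points above U+FFFF). */
size_t utf16_units_for_code_point(uint32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}
```

As an example of the composition issue in point 2: the precomposed U+00E9 is 2
bytes in UTF-8 and 1 UTF-16 code unit, while the decomposed pair U+0065 U+0301
is 3 bytes and 2 code units, so offsets computed against one form do not apply
to the other.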

FWIW, my opinion is that if your library clients are specifying UTF-8 sequences 
at the API, and expect byte offsets into those sequences to be meaningful, you 
might well be forced to maintain the original UTF-8 sequence in the library’s 
internal data model — or, perhaps, an array of the original code points — and 
do all of your internal processing in terms of code points. Conversion to 
NSString would happen only in the journey from data model to UI text field.


Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-06 Thread Mark Munz
> No, it would probably be to highlight that range of the string in a text
> view, which does require knowing the character range.

Maybe you could take each of the returned ranges, create a string from that
part of the UTF-8 byte stream, and search for it in the original string; the
result gives you the range for that string.
Another option could be to rebuild the string from the UTF-8 string, noting
the ranges (in the new NSString) as you go.
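
The second option does not strictly need a new NSString: walking the UTF-8
bytes once and counting the UTF-16 code units they would produce gives the same
ranges. A minimal sketch in plain C, assuming well-formed UTF-8 and byte ranges
that are sorted, non-overlapping and aligned to code point boundaries; range_t
and both function names are hypothetical (range_t just mirrors NSRange).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for NSRange. */
typedef struct {
    size_t location;
    size_t length;
} range_t;

/* Advances *byte up to `target`, adding the UTF-16 code units produced by
 * every code point passed over. Returns the updated unit count. */
static size_t advance(const uint8_t *utf8, size_t *byte, size_t target,
                      size_t units)
{
    while (*byte < target) {
        uint8_t b = utf8[(*byte)++];
        if ((b & 0xC0) != 0x80)             /* lead byte or ASCII */
            units += (b >= 0xF0) ? 2 : 1;   /* 4-byte sequence: 2 units */
    }
    return units;
}

/* Converts `count` UTF-8 byte ranges to UTF-16 code unit ranges in place,
 * in a single pass over the bytes. */
void utf8_ranges_to_utf16(const uint8_t *utf8, size_t utf8_length,
                          range_t *ranges, size_t count)
{
    assert(utf8 != NULL && (ranges != NULL || count == 0));
    size_t byte = 0;
    size_t units = 0;
    for (size_t r = 0; r < count; r++) {
        size_t start = ranges[r].location;
        size_t end = start + ranges[r].length;
        assert(byte <= start && end <= utf8_length);  /* sorted, in bounds */
        units = advance(utf8, &byte, start, units);
        size_t unit_start = units;
        units = advance(utf8, &byte, end, units);
        ranges[r].location = unit_start;
        ranges[r].length = units - unit_start;
    }
}
```

Unlike searching the original string for each substring, this cannot match an
earlier occurrence of a repeated word by mistake, and the cost is one pass over
the buffer regardless of how many ranges there are.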



On Tue, May 6, 2014 at 9:13 AM, Jens Alfke wrote:

>
> On May 5, 2014, at 10:19 PM, Stephen J. Butler wrote:
>
> > What's your next step after doing the UTF8 to UTF16 range conversion? If
> > it's just going to be -[NSString substringWithRange:] then I'd strongly
> > suggest just doing -[NSString initWithBytes:length:encoding:] on the UTF8
> > string.
>
> No, it would probably be to highlight that range of the string in a text
> view, which does require knowing the character range.
>
> (I can’t say for sure because I’m implementing a library, not an app, so
> 3rd party apps will be the ultimate users of this information. But since
> this functionality is about text search, it’s very likely they’d be
> displaying the hit range to the user.)
>
> —Jens



-- 
Mark Munz
unmarked software
http://www.unmarked.com/

Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-06 Thread Jens Alfke

On May 5, 2014, at 10:19 PM, Stephen J. Butler wrote:

> What's your next step after doing the UTF8 to UTF16 range conversion? If it's 
> just going to be -[NSString substringWithRange:] then I'd strongly suggest 
> just doing -[NSString initWithBytes:length:encoding:] on the UTF8 string. 

No, it would probably be to highlight that range of the string in a text view, 
which does require knowing the character range. 

(I can’t say for sure because I’m implementing a library, not an app, so 3rd 
party apps will be the ultimate users of this information. But since this 
functionality is about text search, it’s very likely they’d be displaying the 
hit range to the user.)

—Jens


Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-05 Thread Stephen J. Butler
What's your next step after doing the UTF8 to UTF16 range conversion? If
it's just going to be -[NSString substringWithRange:] then I'd strongly
suggest just doing -[NSString initWithBytes:length:encoding:] on the UTF8
string. At least profile it and see what the penalty is. You've already
paid the UTF16 to UTF8 conversion price once. It's not clear that going
UTF8 to UTF16 again will be a big penalty vs. the range conversion.

But profiling would show.


On Mon, May 5, 2014 at 2:06 PM, Jens Alfke wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding
> character offset in the NSString it came from?
>
> I’m writing an Objective-C wrapper around a C text-tokenizer API that
> takes a UTF-8 string as input, and as part of its output returns byte
> ranges of words that it found. So my API takes an NSString, converts it to
> UTF-8, passes that to the C API, and then gets these byte offsets that it
> needs to convert into character offsets in the NSString.
>
> I’ve looked through both the NSString and CFString APIs and didn’t see
> anything relevant to this. I know UTF-8 isn’t rocket science and I could
> pretty easily write my own function to scan through it counting characters,
> but I suspect I’d run into the differences between Unicode characters and
> the UTF-16 code units that NSString actually considers “characters”. I’d
> much rather let CF do this for me in an internally-consistent way.
>
> —Jens

Re: How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-05 Thread Charles Srstka
On May 5, 2014, at 2:06 PM, Jens Alfke wrote:

> How can I map a byte offset in a UTF-8 string back to the corresponding 
> character offset in the NSString it came from?
> 
> I’m writing an Objective-C wrapper around a C text-tokenizer API that takes a 
> UTF-8 string as input, and as part of its output returns byte ranges of words 
> that it found. So my API takes an NSString, converts it to UTF-8, passes that 
> to the C API, and then gets these byte offsets that it needs to convert into 
> character offsets in the NSString.
> 
> I’ve looked through both the NSString and CFString APIs and didn’t see 
> anything relevant to this. I know UTF-8 isn’t rocket science and I could 
> pretty easily write my own function to scan through it counting characters, 
> but I suspect I’d run into the differences between Unicode characters and the 
> UTF-16 code units that NSString actually considers “characters”. I’d much 
> rather let CF do this for me in an internally-consistent way.

I ran into this same problem once, and I don't think there's any way to do it 
other than scanning through the string. The good news is that the documentation 
for CFStringGetLength does specifically say that the length returned is in 
terms of UTF-16 code pairs (that is, 16-bit code units), so I don't think they 
can change that implementation detail without breaking the contract.
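
A minimal sketch of such a scan in plain C, assuming the bytes are well-formed
UTF-8 (as they would be when produced from an NSString) and that the offset
falls on a code point boundary. The function name is hypothetical. Every lead
byte contributes one UTF-16 code unit, except the lead byte of a 4-byte
sequence, which contributes two (a surrogate pair); continuation bytes
contribute nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CF/Foundation API.
 * Returns the UTF-16 code unit offset that corresponds to `byte_offset`
 * in the UTF-8 buffer `utf8` of `length` bytes.
 * Assumptions: well-formed UTF-8; byte_offset is a code point boundary. */
size_t utf16_offset_for_utf8_offset(const uint8_t *utf8, size_t length,
                                    size_t byte_offset)
{
    assert(utf8 != NULL && byte_offset <= length);
    /* Reject offsets that point into the middle of a code point. */
    assert(byte_offset == length || (utf8[byte_offset] & 0xC0) != 0x80);

    size_t units = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        uint8_t b = utf8[i];
        if ((b & 0xC0) == 0x80)
            continue;                   /* continuation byte: no new unit */
        units += (b >= 0xF0) ? 2 : 1;   /* 4-byte sequence: surrogate pair */
    }
    return units;
}
```

For the bytes 61 C3 A9 F0 9F 98 80 7A (the letters a and é, U+1F600, then z),
byte offsets 0, 1, 3, 7 and 8 map to UTF-16 offsets 0, 1, 2, 4 and 5. Each call
scans from the start of the buffer, so converting many offsets this way is
quadratic; processing the offsets in ascending order in one pass avoids that.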

Charles



How to convert a UTF-8 byte offset into an NSString character offset?

2014-05-05 Thread Jens Alfke
How can I map a byte offset in a UTF-8 string back to the corresponding 
character offset in the NSString it came from?

I’m writing an Objective-C wrapper around a C text-tokenizer API that takes a 
UTF-8 string as input, and as part of its output returns byte ranges of words 
that it found. So my API takes an NSString, converts it to UTF-8, passes that 
to the C API, and then gets these byte offsets that it needs to convert into 
character offsets in the NSString.

I’ve looked through both the NSString and CFString APIs and didn’t see anything 
relevant to this. I know UTF-8 isn’t rocket science and I could pretty easily 
write my own function to scan through it counting characters, but I suspect I’d 
run into the differences between Unicode characters and the UTF-16 code units 
that NSString actually considers “characters”. I’d much rather let CF do this 
for me in an internally-consistent way.

—Jens