Re: How to count composed characters in NSString?

2008-09-30 Thread David Niemeijer

Hi Peter,

On Sep 30, 2008, at 7:58 AM, Peter Edberg wrote:
> CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
> rangeOfComposedCharacterSequenceAtIndex:] are the modern
> replacements for UCFindTextBreak with kUCTextBreakClusterMask and
> indeed they now are closer to the original intent of
> kUCTextBreakClusterMask than the current implementation of
> kUCTextBreakClusterMask is (since UCFindTextBreak was converted to
> follow Unicode/ICU default text segmentation rules).
>
> The modern functions treat all of the following as a cluster:
> - A surrogate pair (of course, since it is a single character);
> - A base character followed by a sequence of combining marks
>   (whether or not this is something that would be composed under NFC);
> - A Hangul syllable expressed as a sequence of conjoining jamo;
> - An Indic consonant cluster such as consonant + virama + consonant
>   + vowel matra. It is this latter cluster that is no longer treated
>   as a single entity by UCFindTextBreak with kUCTextBreakClusterMask.


Ok, understood. This looks good. Based on the discussion I have
updated my bug report 6253075. I think a "convenience" method that
returns the cluster count would be very useful: it would probably be
faster than manually rolling a counter method using repeated calls
to rangeOfComposedCharacterSequenceAtIndex, and its simple
availability would reduce some of the confusion that I sense on this
list as to what the most appropriate way is to count "characters".
There would be "length" to count the number of UTF-16 units and a
"numberOfCharacters" to count the clusters that are closest to the
human conception of characters.


Thanks,

david.
___

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to [EMAIL PROTECTED]


Re: How to count composed characters in NSString?

2008-09-29 Thread Peter Edberg


On Sep 29, 2008, at 9:27 PM, David Niemeijer wrote:


Hi Douglas and Peter,

On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:

On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:

I need to be able to display the number of characters to the user  
in a way that makes sense to them. If they see 3 I should report  
3. I also need it to cut-off certain input to the number of "real"  
characters and should not generate results that only make sense  
for a language like English where each 16 bits equals a single  
character.


What you are describing is the notion that Unicode sometimes refers  
to as a "user-perceived character", which in general can be  
somewhat ambiguous, since different users may have different  
perceptions, and since there are writing systems in which character  
boundaries are not at all similar to those in English.  To handle  
this sort of issue programmatically, Unicode defines what are known  
as "grapheme clusters", but there is not a single notion of  
grapheme cluster; there are several such notions, depending on  
precisely what it is you want.


These issues are covered in detail in Unicode Standard Annex #29,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries.  Grapheme clusters are
similar to but not quite identical to composed character
sequences.  For some purposes composed character sequences may be
sufficient; NSString gives prominence to the notion of composed
character sequence, because that is the most important concept for
arbitrary text processing, but if you are really interested in
user-perceived characters you may wish to use something else.


Thanks for your clarification. It is indeed the "grapheme clusters"  
that I am after. I need to be able to do things such as capitalize  
the first letter of a string and in doing statistical text analysis  
determine the number of "characters" of a text string. This  
description from the URL you pointed at fits my use quite well:  
"Grapheme cluster boundaries are important for collation, regular  
expressions, UI interactions (such as mouse selection, arrow key  
movement, backspacing), segmentation for vertical text,  
identification of boundaries for first-letter styling, and counting  
“character” positions within text." Using glyphs in this case is not  
appropriate as in text analysis the text itself is not displayed,  
nor is using [aString length] because it just reports the number of  
UTF-16 code units. I realize there is no perfect approach, but I am  
just trying to do something that brings me closest to what a user  
would expect.


Peter confirmed earlier that  
CFStringGetRangeOfComposedCharactersAtIndex would be the way to go  
for me. But, if I read Douglas' comment then I am beginning to  
wonder whether this is the equivalent of UCFindTextBreak's   
kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past  
I used to use UCFindTextBreak with kUCTextBreakClusterMask, but  
unlike NSString, UCFindTextBreak is not available on one of the  
platforms I need to support, so what would be the right way to get  
at the cluster breaks using the NSString API? (Please contact me off  
list if you need further clarification.)


Cheers,

david.



David,
CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
rangeOfComposedCharacterSequenceAtIndex:] are the modern replacements
for UCFindTextBreak with kUCTextBreakClusterMask and indeed they now
are closer to the original intent of kUCTextBreakClusterMask than the
current implementation of kUCTextBreakClusterMask is (since
UCFindTextBreak was converted to follow Unicode/ICU default text
segmentation rules).


The modern functions treat all of the following as a cluster:
- A surrogate pair (of course, since it is a single character);
- A base character followed by a sequence of combining marks (whether  
or not this is something that would be composed under NFC);

- A Hangul syllable expressed as a sequence of conjoining jamo;
- An Indic consonant cluster such as consonant + virama + consonant +  
vowel matra. It is this latter cluster that is no longer treated as a  
single entity by  UCFindTextBreak with kUCTextBreakClusterMask.
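
To make the four cases in the list above concrete, here is a small sketch. It is written in Python rather than Objective-C only so that it runs anywhere, and it calls no Cocoa or CoreFoundation API. It prints how many UTF-16 code units (the quantity -length reports) each kind of sequence occupies. The helper name utf16_units and the sample strings are assumptions chosen for illustration; they do not come from the messages in this thread.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, i.e. the quantity -[NSString length] counts."""
    return len(text.encode("utf-16-le")) // 2


# Hypothetical sample strings, one per case in the list above.
samples = {
    "surrogate pair": "\U0001D11E",                          # MUSICAL SYMBOL G CLEF
    "base + combining marks": "e\u0301\u0323",               # e, acute, dot below
    "Hangul conjoining jamo": "\u1112\u1161\u11AB",          # three jamo, one syllable
    "Indic consonant cluster": "\u0915\u094D\u0937\u093F",   # ka, virama, ssa, vowel sign i
}

for label, text in samples.items():
    print(f"{label}: {utf16_units(text)} UTF-16 code units")
```

Each of these would be reported as a single range by the composed-character functions described above, yet each spans two to four code units, which is why -length alone cannot give a user-visible count.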


-Peter





Re: How to count composed characters in NSString?

2008-09-29 Thread David Niemeijer

Hi Douglas and Peter,

On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:

On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:

I need to be able to display the number of characters to the user  
in a way that makes sense to them. If they see 3 I should report 3.  
I also need it to cut-off certain input to the number of "real"  
characters and should not generate results that only make sense for  
a language like English where each 16 bits equals a single character.


What you are describing is the notion that Unicode sometimes refers  
to as a "user-perceived character", which in general can be somewhat  
ambiguous, since different users may have different perceptions, and  
since there are writing systems in which character boundaries are  
not at all similar to those in English.  To handle this sort of  
issue programmatically, Unicode defines what are known as "grapheme  
clusters", but there is not a single notion of grapheme cluster;  
there are several such notions, depending on precisely what it is  
you want.


These issues are covered in detail in Unicode Standard Annex #29,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries.  Grapheme clusters are  
similar to but not quite identical to composed character sequences.   
For some purposes composed character sequences may be sufficient;  
NSString gives prominence to the notion of composed character  
sequence, because that is the most important concept for arbitrary  
text processing, but if you are really interested in user-perceived  
characters you may wish to use something else.


Thanks for your clarification. It is indeed the "grapheme clusters"  
that I am after. I need to be able to do things such as capitalize the  
first letter of a string and in doing statistical text analysis  
determine the number of "characters" of a text string. This  
description from the URL you pointed at fits my use quite well:  
"Grapheme cluster boundaries are important for collation, regular  
expressions, UI interactions (such as mouse selection, arrow key  
movement, backspacing), segmentation for vertical text, identification  
of boundaries for first-letter styling, and counting “character”  
positions within text." Using glyphs in this case is not appropriate  
as in text analysis the text itself is not displayed, nor is using  
[aString length] because it just reports the number of UTF-16 code  
units. I realize there is no perfect approach, but I am just trying to  
do something that brings me closest to what a user would expect.


Peter confirmed earlier that  
CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for  
me. But, if I read Douglas' comment then I am beginning to wonder  
whether this is the equivalent of UCFindTextBreak's   
kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past I  
used to use UCFindTextBreak with kUCTextBreakClusterMask, but unlike  
NSString, UCFindTextBreak is not available on one of the platforms I  
need to support, so what would be the right way to get at the cluster  
breaks using the NSString API? (Please contact me off list if you need  
further clarification.)


Cheers,

david.


Re: How to count composed characters in NSString?

2008-09-29 Thread Michael Ash
On Mon, Sep 29, 2008 at 12:52 AM, Michael Gardner <[EMAIL PROTECTED]> wrote:
> But composed character sequences aren't the problem; surrogate pairs are.
> Composed character sequences can be taken care of by using either
> -precomposedStringWithCanonicalMapping or
> -precomposedStringWithCompatibilityMapping. In my opinion, -length should
> take surrogate pairs into account, which is what the docs seem to imply.

The NSString API is inherently either UCS-2 or UTF-16. As UCS-2
doesn't cover all of Unicode, it ends up being UTF-16.

The API defines NSString as an ordered collection of 16-bit unichars.
The length is necessarily the number of 16-bit unichars in the string,
nothing else would really make sense. Short of creating a new API that
works on pure Unicode code points, the only thing to do is to document
the fact that -length gives you the number of UTF-16 code units, not
the number of Unicode characters.
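
A short sketch of the distinction drawn above between UTF-16 code units and Unicode code points. It is in Python (standard library only) rather than Objective-C, purely so it can be run anywhere; the function name utf16_code_units and the sample string are assumptions made for this illustration.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


# 'a' followed by U+1D11E, a code point outside the Basic Multilingual Plane.
sample = "a\U0001D11E"
units = utf16_code_units(sample)

print(len(units))                      # code units: what a UTF-16 API's length counts
print(len(sample))                     # code points: Python strings index by code point
print([f"{u:04X}" for u in units])     # the non-BMP code point shows up as a surrogate pair
```

The string holds two code points but three code units, because the second code point is stored as a high and a low surrogate.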

(As an aside, changing the API to work with Unicode code points is
something I don't think is really worthwhile. Aside from having to
support the old API which would no doubt be a great deal of hassle,
Unicode code points are pretty useless on their own anyway. You always
end up having to convert and deal with precomposed characters and all
the rest of the Unicode mess regardless. Adding surrogate pairs to all
of that really doesn't increase the burden any further.)

Mike


Re: How to count composed characters in NSString?

2008-09-29 Thread Douglas Davidson


On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:

I need to be able to display the number of characters to the user in  
a way that makes sense to them. If they see 3 I should report 3. I  
also need it to cut-off certain input to the number of "real"  
characters and should not generate results that only make sense for  
a language like English where each 16 bits equals a single character.


What you are describing is the notion that Unicode sometimes refers to  
as a "user-perceived character", which in general can be somewhat  
ambiguous, since different users may have different perceptions, and  
since there are writing systems in which character boundaries are not  
at all similar to those in English.  To handle this sort of issue  
programmatically, Unicode defines what are known as "grapheme  
clusters", but there is not a single notion of grapheme cluster; there  
are several such notions, depending on precisely what it is you want.


These issues are covered in detail in Unicode Standard Annex #29,
which gives a number of examples and some algorithms for
determining grapheme cluster boundaries.  Grapheme clusters are  
similar to but not quite identical to composed character sequences.   
For some purposes composed character sequences may be sufficient;  
NSString gives prominence to the notion of composed character  
sequence, because that is the most important concept for arbitrary  
text processing, but if you are really interested in user-perceived  
characters you may wish to use something else.


The most problematic scripts for this sort of determination include:  
handwriting-based scripts such as Arabic, in which (depending on the  
ligatures used in a particular font) character boundaries may not be  
readily perceptible; composed scripts such as Hangul, in which the  
script elements are in turn composed of smaller, individually  
meaningful graphic elements; and scripts involving reordering and  
combining, such as Devanagari and other Indic or Indic-influenced  
scripts.


There is still another similar but not quite identical notion, which  
is used for determining the number and position of insertion points  
during editing.  In Leopard, NSLayoutManager has API support for  
determining insertion point positions within a line of text as it is  
laid out.  Note that insertion point boundaries are not identical to  
glyph boundaries; a ligature glyph in some cases, such as an "fi"  
ligature in Latin script, may require an internal insertion point on a  
user-perceived character boundary.


Douglas Davidson



Re: How to count composed characters in NSString?

2008-09-28 Thread Clark S. Cox III



Sent from my iPhone

On Sep 28, 2008, at 21:52, Michael Gardner <[EMAIL PROTECTED]> wrote:


On Sep 28, 2008, at 1:17 PM, David Niemeijer wrote:


Michael,

On 28 sep 2008, at 14:41, Michael Gardner wrote:
Upon further investigation, I may be wrong. I based my assertion  
upon Apple's NSString documentation ("Returns the number of  
Unicode characters in the receiver"), and upon some quick tests I  
ran. But this reply made me look into the issue in greater depth.


I re-did my tests more thoroughly, and it does appear that -length
returns the number of 16-bit words (code units), not the number of  
Unicode characters (code points), in the string. If this is true,  
I would call it a bug either in the code or in the documentation,  
which David should submit to Apple.


I think the docs are clear. In the discussion section for "length"
it says: "The number returned includes the individual characters of  
composed character sequences, so you cannot use this method to  
determine if a string will be visible when printed or how long it  
will appear."


But composed character sequences aren't the problem; surrogate pairs  
are. Composed character sequences can be taken care of by using  
either -precomposedStringWithCanonicalMapping or
-precomposedStringWithCompatibilityMapping.


Not true. Not all possible combinations of base characters followed by
combining characters even have a mapping to a single precomposed
character.


Essentially, what one wants to do is count all of the characters with
a combining class of zero; however, even this isn't without issues.
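
A minimal sketch of both points, in Python with the standard unicodedata module rather than any Cocoa API. The helper name count_starters and the sample strings are assumptions for illustration. In particular, q followed by a combining tilde is used on the assumption that Unicode defines no precomposed form for it.

```python
import unicodedata


def count_starters(text: str) -> int:
    """Count code points whose canonical combining class is zero."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


e_acute = "e\u0301"               # has a precomposed form (U+00E9)
q_tilde = "q\u0303"               # assumed to have no precomposed form
han_jamo = "\u1112\u1161\u11AB"   # one Hangul syllable written as three conjoining jamo

# Precomposition (NFC) only helps where a precomposed character exists.
print(len(unicodedata.normalize("NFC", e_acute)))   # 1
print(len(unicodedata.normalize("NFC", q_tilde)))   # 2

# Counting combining-class-zero characters handles both of those...
print(count_starters(e_acute), count_starters(q_tilde))   # 1 1

# ...but conjoining jamo all have combining class zero, so one
# user-perceived syllable is counted three times.
print(count_starters(han_jamo))   # 3
```

The last line is one instance of the issues mentioned: the combining class alone does not capture Hangul jamo sequences or spacing vowel signs in Indic scripts.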




In my opinion, -length should take surrogate pairs into account,  
which is what the docs seem to imply.


-Michael


Re: How to count composed characters in NSString?

2008-09-28 Thread Michael Gardner

On Sep 28, 2008, at 1:17 PM, David Niemeijer wrote:


Michael,

On 28 sep 2008, at 14:41, Michael Gardner wrote:
Upon further investigation, I may be wrong. I based my assertion  
upon Apple's NSString documentation ("Returns the number of Unicode  
characters in the receiver"), and upon some quick tests I ran. But  
this reply made me look into the issue in greater depth.


I re-did my tests more thoroughly, and it does appear that -length
returns the number of 16-bit words (code units), not the number of  
Unicode characters (code points), in the string. If this is true, I  
would call it a bug either in the code or in the documentation,  
which David should submit to Apple.


I think the docs are clear. In the discussion section for "length"
it says: "The number returned includes the individual characters of  
composed character sequences, so you cannot use this method to  
determine if a string will be visible when printed or how long it  
will appear."


But composed character sequences aren't the problem; surrogate pairs  
are. Composed character sequences can be taken care of by using either  
-precomposedStringWithCanonicalMapping or
-precomposedStringWithCompatibilityMapping. In my opinion, -length
should take surrogate pairs into account, which is what the docs seem  
to imply.


-Michael


Re: How to count composed characters in NSString?

2008-09-28 Thread Peter Edberg


On Sep 28, 2008, at 3:05 PM, Peter Edberg wrote:

> David,
> Check out CFStringGetRangeOfComposedCharactersAtIndex. It finds the
> kinds of text boundaries that I think you are interested in. You
> would just need to iterate over the string calling this for each
> iteration to find the next boundary.

Apologies, I see now that in your original post you already
mentioned rangeOfComposedCharacterSequenceAtIndex. That would be
preferred :-)

-Peter



Re: How to count composed characters in NSString?

2008-09-28 Thread Peter Edberg


On Sep 28, 2008, at 12:02 PM, [EMAIL PROTECTED] wrote:


--

Message: 1
Date: Sun, 28 Sep 2008 20:17:26 +0200
From: David Niemeijer <[EMAIL PROTECTED]>
Subject: Re: How to count composed characters in NSString?
To: Cocoa-Dev List 
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

Michael,

On 28 sep 2008, at 14:41, Michael Gardner wrote:

Upon further investigation, I may be wrong. I based my assertion
upon Apple's NSString documentation ("Returns the number of Unicode
characters in the receiver"), and upon some quick tests I ran. But
this reply made me look into the issue in greater depth.

I re-did my tests more thoroughly, and it does appear that -length
returns the number of 16-bit words (code units), not the number of
Unicode characters (code points), in the string. If this is true, I
would call it a bug either in the code or in the documentation,
which David should submit to Apple.


I think the docs are clear. In the discussion section for "length" it
says: "The number returned includes the individual characters of
composed character sequences, so you cannot use this method to
determine if a string will be visible when printed or how long it will
appear."

I did file a bug (ID 6253075) as you suggested, because I think there
should be a simple API for this.


I apologize for the apparent misinformation in my previous, hasty
reply.


Well, I made an error too. I suggested that on 10.5 the
CFStringTokenizer could be used, but only now noticed that it only
supports larger units (words and up). Thus there is no easy API to
count the number of characters in a way that surrogate pairs or other
"long" Unicode characters are treated as a single character.



David,
Check out CFStringGetRangeOfComposedCharactersAtIndex. It finds the  
kinds of text boundaries that I think you are interested in. You would  
just need to iterate over the string calling this for each iteration   
to find the next boundary.


-Peter Edberg, Apple




Re: How to count composed characters in NSString?

2008-09-28 Thread Kirk Kerekes
Users don't see characters, they see glyphs. If you want your count to  
maximally agree with user perception, you need to be counting glyphs,  
not characters.


See NSLayoutManager, esp:

- (NSRange)glyphRangeForCharacterRange:(NSRange)charRange

 -- and friends.

If you are showing strings to the user, do so in an NSTextView, and  
then query the NSLayoutManager associated with that view.



On Sep 27, 2008, at 9:37 PM, [EMAIL PROTECTED] wrote:


Message: 15
Date: Sat, 27 Sep 2008 21:23:25 +0200
From: David Niemeijer <[EMAIL PROTECTED]>
Subject: How to count composed characters in NSString?
To: cocoa-dev@lists.apple.com
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

Hi,

I have been trying to find this in the documentation and list archives
but without success so far. What is the best way to count the number
of characters in an NSString taking account of the fact that some
characters may take up multiple 16 bit slots. Using
"- (NSUInteger)length" is thus not the right way. Using a series of calls
to "rangeOfComposedCharacterSequenceAtIndex:" seems like a
possibility, but I am not sure this would be the most efficient way.
Is there a simple and straightforward solution? I would like to be
able to display the number of characters in a string and not report
the wrong results for foreign languages (which I would get if I simply
took the length of the string). I need a solution that does not only
work in Leopard (i.e. CFStringTokenizer is not an option) and that
does not require using the lower level UCFindTextBreak.

Thanks,

david.




Re: How to count composed characters in NSString?

2008-09-28 Thread Kyle Sluder
On Sun, Sep 28, 2008 at 2:17 PM, David Niemeijer <[EMAIL PROTECTED]> wrote:
> I need to be able to display the number of characters to the user in a way
> that makes sense to them. If they see 3 I should report 3. I also need it to
> cut-off certain input to the number of "real" characters and should not
> generate results that only make sense for a language like English where each
> 16 bits equals a single character.

Perhaps more information on why this is a requirement would be
helpful.  Since it's apparent that you're going to be dealing with
languages other than English, there won't be only one set of rules for
you to follow.  For example, in Dutch, IJ is one letter.  In Spanish,
you might treat ll and ch as one letter or not, depending on which
region you're using and whether you're performing collation or just
counting the number of letters.  If you can explain why counting
characters is important to your app, we might be able to help you
better.

--Kyle Sluder


Re: How to count composed characters in NSString?

2008-09-28 Thread David Niemeijer

Michael,

On 28 sep 2008, at 14:41, Michael Gardner wrote:
Upon further investigation, I may be wrong. I based my assertion  
upon Apple's NSString documentation ("Returns the number of Unicode  
characters in the receiver"), and upon some quick tests I ran. But  
this reply made me look into the issue in greater depth.


I re-did my tests more thoroughly, and it does appear that -length
returns the number of 16-bit words (code units), not the number of  
Unicode characters (code points), in the string. If this is true, I  
would call it a bug either in the code or in the documentation,  
which David should submit to Apple.


I think the docs are clear. In the discussion section for "length" it
says: "The number returned includes the individual characters of  
composed character sequences, so you cannot use this method to  
determine if a string will be visible when printed or how long it will  
appear."


I did file a bug (ID 6253075) as you suggested, because I think there  
should be a simple API for this.


I apologize for the apparent misinformation in my previous, hasty  
reply.


Well, I made an error too. I suggested that on 10.5 the
CFStringTokenizer could be used, but only now noticed that it only
supports larger units (words and up). Thus there is no easy API to
count the number of characters in a way that surrogate pairs or other
"long" Unicode characters are treated as a single character.


In the meanwhile, David, perhaps you can find a library that can  
work with UTF-8 strings. What are you using the length values for?


I need to be able to display the number of characters to the user in a  
way that makes sense to them. If they see 3 I should report 3. I also  
need it to cut-off certain input to the number of "real" characters  
and should not generate results that only make sense for a language  
like English where each 16 bits equals a single character.


Using some kind of UTF-8 library may be possible, but that would
require converting all the time between UTF-16 and UTF-8, which is not
efficient for a program that has to do a lot of these kinds of
calculations.


david.


Re: How to count composed characters in NSString?

2008-09-28 Thread Michael Ash
On Sat, Sep 27, 2008 at 3:23 PM, David Niemeijer <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have been trying to find this in the documentation and list archives but
> without success so far. What is the best way to count the number of
> characters in an NSString taking account of the fact that some characters
> may take up multiple 16 bit slots. Using "- (NSUInteger)length" is thus not
> the right way. Using a series of calls to
> "rangeOfComposedCharacterSequenceAtIndex:" seems like a possibility, but I
> am not sure this would be the most efficient way. Is there a simple and
> straightforward solution? I would like to be able to display the number of
> characters in a string and not report the wrong results for foreign
> languages (which I would get if I simply took the length of the string). I
> need a solution that does not only work in Leopard (i.e. CFStringTokenizer
> is not an option) and that does not require using the lower level
> UCFindTextBreak.

First I recommend you simply give up on the concept. You've stumbled
into a tough problem, one which is not all that useful, and it may be
better to skip it. Of course I don't know what you're using it for,
but in general counting the number of characters in a string is not a
useful thing to do.

That said, if you want to continue, I'd suggest that you first figure
out what you mean by a "character". You mention composed character
sequences, but conceptually those are separate characters which happen
to display as a single unit. Your description sounds like you want to
catch UTF-16 surrogate pairs. Do you also want to catch things like é
(accented e), which can be encoded as two separate Unicode code
points? Is space a character? Is a ligature like the "fi" glyph found
in many fonts one character or many? Note that these are not
rhetorical questions, and there is more than one right answer to each
depending on what you want to do.

Mike

Re: How to count composed characters in NSString?

2008-09-28 Thread Michael Gardner

On Sep 28, 2008, at 5:53 AM, Gerriet M. Denkmann wrote:



On Sun, 28 Sep 2008 03:27:48 -0500, Michael Gardner <[EMAIL PROTECTED]> wrote:


On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:


Hi,

I have been trying to find this in the documentation and list
archives but without success so far. What is the best way to count
the number of characters in an NSString taking account of the fact
that some characters may take up multiple 16 bit slots. Using
"- (NSUInteger)length" is thus not the right way.


If I am reading you right, you are saying that -length will give you
the wrong results because some characters in Unicode are represented
by multibyte sequences. This is incorrect: -length will give you the
number of Unicode characters in a string [...].


This surprises me. I always thought that "length" gives you the  
number of shorts in the UTF-16 encoding of the string, which - as I
used to think - is not the same as the number of Unicode code points  
in this string.


But maybe you are right and I am confused.


Upon further investigation, I may be wrong. I based my assertion upon  
Apple's NSString documentation ("Returns the number of Unicode  
characters in the receiver"), and upon some quick tests I ran. But  
this reply made me look into the issue in greater depth.


I re-did my tests more thoroughly, and it does appear that -length
returns the number of 16-bit words (code units), not the number of  
Unicode characters (code points), in the string. If this is true, I  
would call it a bug either in the code or in the documentation, which  
David should submit to Apple.
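
The difference between code units and code points can be sketched in a few lines of Python (used here only as a neutral illustration, not as Cocoa code; the helper names are hypothetical). The assumption is that counting 16-bit units of the UTF-16 encoding mirrors what -length appears to report in the tests described above.

```python
def utf16_code_units(text):
    """Count 16-bit units in the UTF-16 encoding of text."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no byte-order mark.
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Count Unicode code points (Python strings are indexed by code point)."""
    return len(text)


# 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the
# Basic Multilingual Plane and therefore needs a surrogate pair in UTF-16.
sample = "a\U0001D11E"

print(utf16_code_units(sample))  # 3
print(code_points(sample))       # 2
```

For text that stays within the Basic Multilingual Plane the two counts agree, which may be why quick tests did not show the difference.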


I apologize for the apparent misinformation in my previous, hasty reply.

In the meantime, David, perhaps you can find a library that can work
with UTF-8 strings. What are you using the length values for?


-Michael


Re: How to count composed characters in NSString?

2008-09-28 Thread Gerriet M. Denkmann


On Sun, 28 Sep 2008 03:27:48 -0500, Michael Gardner  
<[EMAIL PROTECTED]> wrote:


On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:


Hi,

I have been trying to find this in the documentation and list
archives but without success so far. What is the best way to count
the number of characters in an NSString, taking account of the fact
that some characters may take up multiple 16-bit slots? Using
"-(NSUInteger)length" is thus not the right way.


If I am reading you right, you are saying that -length will give you
the wrong results because some characters in Unicode are represented
by multibyte sequences. This is incorrect: -length will give you the
number of Unicode characters in a string [...].


This surprises me. I always thought that "length" gives you the  
number of shorts in the UTF-16 encoding of the string, which - as I
used to think - is not the same as the number of Unicode code points  
in this string.


But maybe you are right and I am confused.


Kind regards,

Gerriet.



Re: How to count composed characters in NSString?

2008-09-28 Thread Michael Gardner

On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:


Hi,

I have been trying to find this in the documentation and list  
archives but without success so far. What is the best way to count  
the number of characters in an NSString, taking account of the fact
that some characters may take up multiple 16-bit slots? Using
"-(NSUInteger)length" is thus not the right way.


If I am reading you right, you are saying that -length will give you  
the wrong results because some characters in Unicode are represented  
by multibyte sequences. This is incorrect: -length will give you the  
number of Unicode characters in a string, not the number of bytes.


However, there are characters like "combining grave accent" (U+0300)  
that will usually not be displayed as a separate character, so there  
is a potential problem if you want to know how many characters will  
actually be displayed. The solution is to put the string into one of
the composed Normalization Forms with either
-precomposedStringWithCanonicalMapping (NFC) or
-precomposedStringWithCompatibilityMapping (NFKC), depending on your
needs. Then calling -length should give you the result you are looking
for.


For information on Unicode Normalization Forms, see
http://unicode.org/reports/tr15/.
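
How far normalization gets you can be checked with a small Python sketch (Python is used only for illustration; the function name is hypothetical, and counting UTF-16 code units is an assumed stand-in for what -length reports). It suggests that NFC helps only where Unicode defines a precomposed form, and that it does not affect surrogate pairs.

```python
import unicodedata


def utf16_length_after_nfc(text):
    """UTF-16 code units in the NFC form of text (assumed stand-in for -length)."""
    composed = unicodedata.normalize("NFC", text)
    return len(composed.encode("utf-16-le")) // 2


# e + COMBINING ACUTE ACCENT composes to the single code point U+00E9.
print(utf16_length_after_nfc("e\u0301"))     # 1

# q + COMBINING GRAVE ACCENT has no precomposed form, so nothing changes.
print(utf16_length_after_nfc("q\u0300"))     # 2

# U+1F600 is outside the Basic Multilingual Plane: still a surrogate pair.
print(utf16_length_after_nfc("\U0001F600"))  # 2
```

So the normalized length matches the displayed count for some strings but not in general.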


-Michael


How to count composed characters in NSString?

2008-09-27 Thread David Niemeijer

Hi,

I have been trying to find this in the documentation and list archives  
but without success so far. What is the best way to count the number  
of characters in an NSString, taking account of the fact that some
characters may take up multiple 16-bit slots? Using
"-(NSUInteger)length" is thus not the right way. Using a series of calls
to "rangeOfComposedCharacterSequenceAtIndex:" seems like a  
possibility, but I am not sure this would be the most efficient way.  
Is there a simple and straightforward solution? I would like to be  
able to display the number of characters in a string and not report  
the wrong results for foreign languages (which I would get if I simply  
took the length of the string). I need a solution that works on more
than just Leopard (i.e. CFStringTokenizer is not an option) and that
does not require using the lower-level UCFindTextBreak.
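
To make the idea of counting by walking the string concrete, here is a rough Python sketch of the kind of count I am after. It is not the Cocoa algorithm: it only approximates a cluster as a base code point followed by any number of combining marks (general categories Mn, Mc and Me), and the function name is made up for this example. Python strings are indexed by code point, so surrogate pairs need no special handling here, unlike in NSString.

```python
import unicodedata


def count_clusters(text):
    """Approximate count of user-perceived characters.

    A new cluster starts at every code point that is not a combining
    mark. A mark at the very start of the string also counts as one
    cluster, since it has no base to attach to.
    """
    count = 0
    for index, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if index == 0 or not is_mark:
            count += 1
    return count


print(count_clusters("e\u0301"))        # 1: base letter plus accent
print(count_clusters("q\u0300\u0323"))  # 1: base letter plus two marks
print(count_clusters("a\U0001D11E"))    # 2: two separate characters
print(count_clusters(""))               # 0
```

This simple rule ignores cases such as conjoining Hangul jamo or consonant clusters joined by a virama, so it is a sketch of the intent rather than a replacement for rangeOfComposedCharacterSequenceAtIndex:.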


Thanks,

david.