Hi all,
<http://www.openoffice.org/issues/show_bug.cgi?id=76869> requests
functionality to work on an rtl::OUString as a sequence of Unicode
scalar values or code points, rather than a sequence of UTF-16 code units.
What I came up with is the minimalistic rtl_uString_iterateCodePoints in
rtl/ustring.h (see below) and an accompanying public rtl::OUString
member function
    inline sal_uInt32 iterateCodePoints(
        sal_Int32 * indexUtf16, sal_Int32 postIncrementCodePoints = 1);
that is an almost trivial wrapper around it.
Any comments? Especially, I am interested in the following two points:
1. Would there be legitimate use cases for rtl_uString_iterateCodePoints
to adjust an incoming index that points into the middle of a surrogate
pair, or would that only hide broken code?
2. With the current setup, where moving past the beginning or end of the
string is undefined behavior, is there any use for
postIncrementCodePoints outside [-1 .. 1]? Or would there be legitimate
use cases for rtl_uString_iterateCodePoints to stop at the beginning/end
of the string when postIncrementCodePoints is too large, instead of
moving past it?
-Stephan
/** Iterate through a string based on code points instead of UTF-16 code
units.
See Chapter 3 of The Unicode Standard 5.0 (Addison--Wesley, 2006)
for definitions of the various terms used in this description.
The given string is interpreted as a sequence of zero or more UTF-16
code units. For each index into this sequence (from zero to the
length of the sequence, inclusive), a code point represented
starting at the given index is computed as follows:
- If the index points to the end of the sequence, the computed code
point is the special marker SAL_MAX_UINT32.
- Otherwise, if the UTF-16 code unit addressed by the index
constitutes a well-formed UTF-16 code unit sequence, the computed
code point is the scalar value encoded by that UTF-16 code unit
sequence.
- Otherwise, if the index is at least two UTF-16 code units away
from the end of the sequence, and the sequence of two UTF-16 code
units addressed by the index constitutes a well-formed UTF-16 code
unit sequence, the computed code point is the scalar value encoded
by that UTF-16 code unit sequence.
- Otherwise, the computed code point is the UTF-16 code unit
addressed by the index. (This last case catches unmatched
surrogates as well as indices pointing into the middle of surrogate
pairs.)
@param string
pointer to a valid string; must not be null.
@param indexUtf16
pointer to a UTF-16 based index into the given string; must not be
null. On entry, the index must be in the range from zero to the
length of the string (in UTF-16 code units), inclusive. Upon
successful return, the index will be updated to address the UTF-16
code unit that is postIncrementCodePoints code points away from the
initial index.
@param postIncrementCodePoints
the number of code points to move the given indexUtf16; can be
negative. The value must be such that the resulting UTF-16 based
index is in the range from zero to the length of the string (in
UTF-16 code units), inclusive.
@return
the code point (an integer in the range from 0 to 0x10FFFF,
inclusive), or the special marker SAL_MAX_UINT32, that is represented
at the starting index indexUtf16 within the given string.
@since UDK 3.2.7
*/
sal_uInt32 SAL_CALL rtl_uString_iterateCodePoints(
rtl_uString const * string, sal_Int32 * indexUtf16,
sal_Int32 postIncrementCodePoints);
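
To make the rules in the comment above concrete, here is a minimal sketch of how the computation and the index update could look. It is not the actual implementation: it uses std::u16string and the standard fixed-width integer types instead of rtl_uString and the sal types, and the names END_MARKER, isHighSurrogate, isLowSurrogate, and countCodePoints are made up for illustration. How a backward move treats a well-formed surrogate pair (as one step of two code units) is an assumption of the sketch, as the description above does not spell it out. Out-of-range moves are caught by assert here, whereas the proposal leaves them undefined.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for SAL_MAX_UINT32 (the "end of sequence" marker).
constexpr std::uint32_t END_MARKER = 0xFFFFFFFF;

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Hypothetical stand-in for rtl_uString_iterateCodePoints, operating on a
// std::u16string so that the sketch is self-contained.
std::uint32_t iterateCodePoints(
    std::u16string const & s, std::int32_t * indexUtf16,
    std::int32_t postIncrementCodePoints = 1)
{
    std::int32_t const len = static_cast<std::int32_t>(s.size());
    std::int32_t n = *indexUtf16;
    assert(n >= 0 && n <= len);

    // Compute the code point represented at the starting index.
    std::uint32_t cp;
    if (n == len) {
        cp = END_MARKER; // index points to the end of the sequence
    } else if (isHighSurrogate(s[n]) && len - n >= 2
               && isLowSurrogate(s[n + 1]))
    {
        // Well-formed surrogate pair: decode the supplementary scalar value.
        cp = 0x10000
            + ((static_cast<std::uint32_t>(s[n]) - 0xD800) << 10)
            + (static_cast<std::uint32_t>(s[n + 1]) - 0xDC00);
    } else {
        // Single code unit: a BMP scalar value, an unmatched surrogate, or
        // an index pointing into the middle of a surrogate pair.
        cp = s[n];
    }

    // Move forward by whole code points.
    for (; postIncrementCodePoints > 0; --postIncrementCodePoints) {
        assert(n < len);
        bool const pair = isHighSurrogate(s[n]) && len - n >= 2
            && isLowSurrogate(s[n + 1]);
        n += pair ? 2 : 1;
    }
    // Move backward by whole code points (assumed symmetric to forward).
    for (; postIncrementCodePoints < 0; ++postIncrementCodePoints) {
        assert(n > 0);
        --n;
        if (isLowSurrogate(s[n]) && n > 0 && isHighSurrogate(s[n - 1])) {
            --n;
        }
    }

    *indexUtf16 = n;
    return cp;
}

// Usage example: count the code points in a string.
std::int32_t countCodePoints(std::u16string const & s)
{
    std::int32_t count = 0;
    for (std::int32_t i = 0; i < static_cast<std::int32_t>(s.size());) {
        iterateCodePoints(s, &i); // advances i by one code point
        ++count;
    }
    return count;
}
```

With this sketch, a caller that starts in the middle of a surrogate pair simply gets the low surrogate back as the "code point" and advances by one code unit, which is the non-adjusting behavior that point 1 above asks about.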