Hi all,

<http://www.openoffice.org/issues/show_bug.cgi?id=76869> requests functionality to work on an rtl::OUString as a sequence of Unicode scalar values or code points, rather than a sequence of UTF-16 code units.

What I came up with is the minimalistic rtl_uString_iterateCodePoints in rtl/ustring.h (see below) and an accompanying public rtl::OUString member function

  inline sal_uInt32 iterateCodePoints(
    sal_Int32 * indexUtf16, sal_Int32 postIncrementCodePoints = 1);

that is an almost trivial wrapper around it.

Any comments?  Especially, I am interested in the following two points:

1. Would there be legitimate use cases for rtl_uString_iterateCodePoints to adjust an incoming index that points into the middle of a surrogate pair, or would that only hide broken code?

2. With the current setup, where moving past the beginning or end of the string is undefined behavior, is there any use for postIncrementCodePoints outside [-1 .. 1]? Or would there be legitimate use cases for rtl_uString_iterateCodePoints to stop at the beginning/end of the string, instead of moving past it, when the magnitude of postIncrementCodePoints is too large?

-Stephan


/** Iterate through a string based on code points instead of UTF-16 code
    units.

    See Chapter 3 of The Unicode Standard 5.0 (Addison-Wesley, 2006)
    for definitions of the various terms used in this description.

    The given string is interpreted as a sequence of zero or more UTF-16
    code units.  For each index into this sequence (from zero to the
    length of the sequence, inclusive), a code point represented
    starting at the given index is computed as follows:

    - If the index points to the end of the sequence, the computed code
    point is the special marker SAL_MAX_UINT32.

    - Otherwise, if the UTF-16 code unit addressed by the index
    constitutes a well-formed UTF-16 code unit sequence, the computed
    code point is the scalar value encoded by that UTF-16 code unit
    sequence.

    - Otherwise, if the index is at least two UTF-16 code units away
    from the end of the sequence, and the sequence of two UTF-16 code
    units addressed by the index constitutes a well-formed UTF-16 code
    unit sequence, the computed code point is the scalar value encoded
    by that UTF-16 code unit sequence.

    - Otherwise, the computed code point is the UTF-16 code unit
    addressed by the index.  (This last case catches unmatched
    surrogates as well as indices pointing into the middle of surrogate
    pairs.)

    @param string
    pointer to a valid string; must not be null.

    @param indexUtf16
    pointer to a UTF-16 based index into the given string; must not be
    null.  On entry, the index must be in the range from zero to the
    length of the string (in UTF-16 code units), inclusive.  Upon
    successful return, the index will be updated to address the UTF-16
    code unit that is postIncrementCodePoints code points away from
    the initial index.

    @param postIncrementCodePoints
    the number of code points to move the given indexUtf16; can be
    negative.  The value must be such that the resulting UTF-16 based
    index is in the range from zero to the length of the string (in
    UTF-16 code units), inclusive.

    @return
    the code point (an integer in the range from 0 to 0x10FFFF,
    inclusive) that is represented within the given string starting at
    the index given on entry in indexUtf16, or the special marker
    SAL_MAX_UINT32 if that index points to the end of the string.

    @since UDK 3.2.7
*/
sal_uInt32 SAL_CALL rtl_uString_iterateCodePoints(
    rtl_uString const * string, sal_Int32 * indexUtf16,
    sal_Int32 postIncrementCodePoints);
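
To make the rules in the comment above easier to discuss, here is a rough standalone sketch of how the computation might look. It is not the proposed implementation. The names sketch_Unicode, SKETCH_END_MARKER, codePointAt and sketch_iterateCodePoints are made up for illustration, the string is passed as a plain buffer plus length instead of an rtl_uString, and the "undefined behavior" cases are turned into asserts only so the sketch can be tested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch compiles without the sal headers. */
typedef uint16_t sketch_Unicode;      /* plays the role of sal_Unicode */
#define SKETCH_END_MARKER UINT32_MAX  /* plays the role of SAL_MAX_UINT32 */

static int isHigh(sketch_Unicode c) { return c >= 0xD800 && c <= 0xDBFF; }
static int isLow(sketch_Unicode c) { return c >= 0xDC00 && c <= 0xDFFF; }

/* True if a well-formed surrogate pair starts at index i. */
static int pairAt(sketch_Unicode const * s, int32_t len, int32_t i) {
    return len - i >= 2 && isHigh(s[i]) && isLow(s[i + 1]);
}

/* Code point represented starting at index i, with 0 <= i <= len. */
static uint32_t codePointAt(sketch_Unicode const * s, int32_t len, int32_t i) {
    if (i == len) {
        return SKETCH_END_MARKER;           /* end of the sequence */
    }
    if (pairAt(s, len, i)) {                /* well-formed surrogate pair */
        return 0x10000 + ((uint32_t)(s[i] - 0xD800) << 10)
            + (uint32_t)(s[i + 1] - 0xDC00);
    }
    /* Either a single non-surrogate code unit (a scalar value), or the
       fallback case: an unmatched surrogate or the middle of a pair. */
    return s[i];
}

/* Returns the code point at *index, then moves *index by incr code points. */
static uint32_t sketch_iterateCodePoints(
    sketch_Unicode const * s, int32_t len, int32_t * index, int32_t incr)
{
    int32_t i = *index;
    uint32_t cp;
    assert(i >= 0 && i <= len);
    cp = codePointAt(s, len, i);
    for (; incr > 0; --incr) {
        assert(i < len);                    /* must not move past the end */
        i += pairAt(s, len, i) ? 2 : 1;
    }
    for (; incr < 0; ++incr) {
        assert(i > 0);                      /* must not move past the start */
        --i;
        if (i > 0 && isLow(s[i]) && isHigh(s[i - 1])) {
            --i;                            /* step back over a whole pair */
        }
    }
    *index = i;
    return cp;
}
```

In this sketch an index that points into the middle of a surrogate pair is not adjusted: the lone low surrogate is returned as is and the index moves on by one code unit, which is the behavior question 1 above asks about.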
