andrei          Wed Sep 14 14:01:41 2005 EDT

  Modified files:
    /php-src    README.UNICODE-UPGRADES
  Log:

http://cvs.php.net/diff.php/php-src/README.UNICODE-UPGRADES?r1=1.3&r2=1.4&ty=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.3 php-src/README.UNICODE-UPGRADES:1.4
--- php-src/README.UNICODE-UPGRADES:1.3 Tue Sep 13 17:07:46 2005
+++ php-src/README.UNICODE-UPGRADES     Wed Sep 14 14:01:41 2005
@@ -20,14 +20,6 @@
 IS_BINARY. The former one has its own storage in the value union part of
 zval (value.ustr) and the latter re-uses value.str.
 
-IS_UNICODE strings are in the UTF-16 encoding where 1 Unicode character may
-be represented by 1 or 2 UChar's. Each UChar is referred to as a "code
-unit", and a full Unicode character as a "code point". So, number of code
-units and number of code points for the same Unicode string may be
-different. The value.ustr.len is actually the number of code units. To
-obtain the number of code points, one can use u_counChar32() ICU API
-function or Z_USTRCPLEN() macro.
-
 Both types have new macros to set the zval value and to access it.
 
 Z_USTRVAL(), Z_USTRLEN()
@@ -120,6 +112,60 @@
 
     char *constant_name = colon + (UG(unicode)?UBYTES(2):2);
 
+Code Points and Code Units
+--------------------------
+
+Unicode type strings are in the UTF-16 encoding where 1 Unicode character
+may be represented by 1 or 2 UChar's. Each UChar is referred to as a "code
+unit", and a full Unicode character as a "code point". Consequently, the
+number of code units and code points for the same Unicode string may
+differ. This has many implications, the most important of which is that
+you cannot simply index the UChar* string to get the desired codepoint.
+
+The zval's value.ustr.len actually contains the number of code units. To
+obtain the number of code points, one can use the u_countChar32() ICU API
+function or the Z_USTRCPLEN() macro.
+
+ICU provides a number of macros for working with UTF-16 strings on the
+codepoint level [2]. They allow you to do things like obtain a codepoint at
+random code unit offset, move forward and backward over the string, etc.
+There are two versions of iterator macros, *_SAFE and *_UNSAFE. It is
+strongly recommended to use the *_SAFE versions, since they handle unpaired
+surrogates and check for string boundaries. Here is an example of how to
+move through a UChar* string and work on codepoints.
+
+    UChar *str = ...;
+    int32_t str_len = ...;
+    UChar32 codepoint;
+    int32_t offset = 0;
+
+    while (offset < str_len) {
+        U16_NEXT(str, offset, str_len, codepoint);
+        /* now we have the Unicode character in codepoint */
+    }
+
+There is no macro to get a codepoint at a certain code point offset, but
+there is a Zend API function that does it.
+
+    inline UChar32 zend_get_codepoint_at(UChar *str, int32_t length, int32_t n);
+
+To retrieve the 3rd codepoint, you would call:
+
+    zend_get_codepoint_at(str, str_len, 3);
+
+If you have a UChar32 codepoint and need to put it into a UChar* string,
+there is another helper function, zend_codepoint_to_uchar(). It takes
+a single UChar32 and converts it to a UChar sequence (1 or 2 UChar's).
+
+    UChar buf[8];
+    UChar32 codepoint = 0x101a2;
+    int8_t num_uchars;
+    num_uchars = zend_codepoint_to_uchar(codepoint, buf);
+
+The return value is the number of resulting UChar's, or 0, which indicates
+an invalid codepoint.
+
+
 Memory Allocation
 -----------------
 
@@ -221,4 +267,6 @@
 
 [1] http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html#a1
 
+[2] http://icu.sourceforge.net/apiref/icu4c/utf16_8h.html
+
 vim: set et ai tw=76 fo=tron21:
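
The code unit versus code point distinction described in the new README
section can be sketched in plain C without ICU. This is only an
illustration under stated assumptions, not the ICU or Zend implementation:
next_codepoint() and count_codepoints() are hypothetical names, and
uint16_t/uint32_t stand in for UChar/UChar32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not ICU, not Zend API): decode the code point that
 * starts at str[*offset] and advance *offset past it. A lead surrogate that
 * is not followed by a trail surrogate is returned unchanged. */
static uint32_t next_codepoint(const uint16_t *str, int32_t *offset, int32_t len)
{
    uint32_t c;

    assert(*offset >= 0 && *offset < len);
    c = str[(*offset)++];

    /* A lead surrogate may combine with the following trail surrogate. */
    if (c >= 0xD800 && c <= 0xDBFF && *offset < len) {
        uint32_t trail = str[*offset];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            (*offset)++;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}

/* Hypothetical helper: count code points in a string of len code units. */
static int32_t count_codepoints(const uint16_t *str, int32_t len)
{
    int32_t offset = 0;
    int32_t count = 0;

    while (offset < len) {
        (void)next_codepoint(str, &offset, len);
        count++;
    }
    return count;
}
```

For the four code units { 0x0041, 0xD800, 0xDDA2, 0x0042 },
count_codepoints() returns 3, because 0xD800 0xDDA2 is the surrogate pair
for U+101A2. This is why the code unit length alone cannot be used to index
a string by character.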

-- 
PHP CVS Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php