[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

Andrei Zmievski Wed, 10 Jan 2007 15:10:08 -0800

andrei          Wed Jan 10 23:09:29 2007 UTC

  Modified files:              
    /php-src    README.UNICODE-UPGRADES 
  Log:
  Update with info from README.UNICODE.
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.14&r2=1.15&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.14 php-src/README.UNICODE-UPGRADES:1.15
--- php-src/README.UNICODE-UPGRADES:1.14        Wed Dec 20 20:17:45 2006
+++ php-src/README.UNICODE-UPGRADES     Wed Jan 10 23:09:28 2007
@@ -6,6 +6,151 @@
 functionality and concepts without going into technical implementation
 details.
 
+Internal Encoding
+=================
+
+UTF-16 is the internal encoding used for Unicode strings. UTF-16 uses two
+bytes for any Unicode character in the Basic Multilingual Plane, which is
+where most of the world's current languages are represented. While it is
+less memory efficient for basic ASCII text, it simplifies processing and
+makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
+processing as well.
+
+
+Zval Structure Changes
+======================
+
+For IS_UNICODE type, we add another structure to the union:
+
+    union {
+    ....
+        struct {
+            UChar *val;            /* Unicode string value */
+            int len;               /* number of UChar's */
+        } ustr;
+    ....
+    } value;
+
+This cleanly separates the two types of strings and helps preserve backwards
+compatibility.
+
+To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
+another structure:
+
+    union {
+    ....
+        struct {                    /* Universal string type */
+            zstr val;
+            int len;
+        } uni;
+    ....
+    } value;
+
+Where zstr is a union of char*, UChar*, and void*.
+
+
+Parameter Parsing API Modifications
+===================================
+
+There are now six new specifiers: 'u', 't', 'T', 'U', 'S', 'x' and a new '&'
+modifier.
+
+  't' specifier
+  -------------
+  This specifier indicates that the caller requires the incoming parameter
+  to be string data (IS_STRING, IS_UNICODE). The caller has to provide the
+  storage for the string value, length, and type.
+
+    void *str;
+    int len;
+    zend_uchar type;
+
+    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len,
+                             &type) == FAILURE) {
+        return;
+    }
+    if (type == IS_UNICODE) {
+       /* process Unicode string */
+    } else {
+       /* process binary string */
+    }
+
+  For IS_STRING type, the length represents the number of bytes, and for
+  IS_UNICODE the number of UChar's. When converting other types (numbers,
+  booleans, etc.) to strings, the exact behavior depends on the Unicode
+  semantics switch: if on, they are converted to IS_UNICODE, otherwise to
+  IS_STRING.
+
+
+  'u' specifier
+  -------------
+  This specifier indicates that the caller requires the incoming parameter
+  to be a Unicode encoded string. If a non-Unicode string is passed, the
+  engine creates a copy of the string and automatically converts it to
+  Unicode type before passing it to the internal function. No such
+  conversion is necessary for Unicode strings.
+
+    UChar *str;
+    int len;
+
+    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str,
+                             &len) == FAILURE) {
+        return;
+    }
+    /* process Unicode string */
+
+    
+  'T' specifier
+  -------------
+  This specifier is useful when the function takes two or more strings and
+  operates on them. Using the 't' specifier for each one would be
+  problematic if the passed-in strings are of mixed types, since multiple
+  checks would have to be performed before doing anything. All parameters
+  marked by the 'T' specifier are promoted to the same type.
+  
+  If at least one of the 'T' parameters is of Unicode type, then the rest of
+  them are converted to IS_UNICODE. Otherwise all 'T' parameters are
+  converted to IS_STRING type.
+
+
+    void *str1, *str2;
+    int len1, len2;
+    zend_uchar type1, type2;
+
+    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
+                             &type1, &str2, &len2, &type2) == FAILURE) {
+       return;
+    }
+    if (type1 == IS_UNICODE) {
+       /* process as Unicode, str2 is guaranteed to be Unicode as well */
+    } else {
+       /* process as binary string, str2 is guaranteed to be the same */
+    }
+
+
+  'x' specifier
+  -------------
+  This specifier acts as either 'u' or 's', depending on the value of the
+  Unicode semantics switch. If UG(unicode) is on, it behaves as 'u', and as
+  's' otherwise.
+
+The existing 's' specifier has been modified as well. If a Unicode string is
+passed in, it automatically copies and converts the string to the runtime
+encoding, and issues a warning. If a binary string is passed in, no
+conversion is necessary. The '&' modifier can be used after the 's'
+specifier to force a different converter instead.
+
+    char *str;
+    int len;
+
+    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s&", &str, &len,
+                             UG(utf8_conv)) == FAILURE) {
+        return;
+    }
+    /* here str is in UTF-8, if a Unicode string was passed in */
+
+The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
+about the type of the passed-in parameter. If 'U' is specified and a binary
+string is passed in, the engine will issue a warning instead of doing automatic
+conversion. The converse applies to the 'S' specifier.
+
+
 Working in Unicode World
 ========================

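The "universal string" layout described under "Zval Structure Changes" can
be modeled outside the engine. What follows is a minimal standalone sketch,
not the engine's own definition. It assumes that UChar is a 16-bit UTF-16
code unit (as in ICU), uses placeholder values for IS_STRING and IS_UNICODE,
and introduces the hypothetical names uni_str and uni_byte_size purely for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: UChar is a 16-bit UTF-16 code unit, as in ICU. */
typedef uint16_t UChar;

/* Placeholder type tags; the real values come from the engine headers. */
enum { IS_STRING = 1, IS_UNICODE = 2 };

/* zstr as the README describes it: a union of char*, UChar*, and void*. */
typedef union {
    char  *s;
    UChar *u;
    void  *v;
} zstr;

/* Hypothetical stand-in for the "uni" member of the zval value union. */
typedef struct {
    zstr val;
    int  len;   /* bytes for IS_STRING, UChar's for IS_UNICODE */
} uni_str;

/* Hypothetical helper: size in bytes of the string payload. */
static size_t uni_byte_size(const uni_str *str, int type)
{
    size_t unit = (type == IS_UNICODE) ? sizeof(UChar) : sizeof(char);
    return (size_t)str->len * unit;
}
```

The point of the sketch is that one (pointer, length) pair can describe
either kind of string, as long as the caller also tracks the type in order
to interpret the length correctly.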

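The promotion rule of the 'T' specifier (Unicode wins if any 'T' parameter
is Unicode, otherwise everything is treated as binary) can be expressed as
a small standalone function. This is only a sketch of the rule as stated in
the README, not the engine's parameter parsing code; the type tag values
and the name common_string_type are hypothetical.

```c
#include <stddef.h>

/* Placeholder type tags; the real values come from the engine headers. */
enum { IS_STRING = 1, IS_UNICODE = 2 };

/*
 * Hypothetical helper: given the incoming types of all 'T' parameters,
 * return the single type they would all be promoted to.
 */
static int common_string_type(const int *types, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (types[i] == IS_UNICODE) {
            /* One Unicode parameter is enough to promote the rest. */
            return IS_UNICODE;
        }
    }
    /* No Unicode parameter was seen, so all are treated as IS_STRING. */
    return IS_STRING;
}
```

Because every 'T' parameter ends up with the same type, a function only
needs to inspect the type of the first one, as the "TT" example in the
README does.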
-- 
PHP CVS Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
