[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2008-03-10 Thread Gwynne Raskind
gwynne  Mon Mar 10 14:27:07 2008 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  Fix small typo
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.20&r2=1.21&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.20 php-src/README.UNICODE-UPGRADES:1.21
--- php-src/README.UNICODE-UPGRADES:1.20 Fri Feb  8 09:28:15 2008
+++ php-src/README.UNICODE-UPGRADES Mon Mar 10 14:27:07 2008
@@ -46,7 +46,7 @@
 
 } value;
 
-Where zstr ia union of char*, UChar*, and void*.
+Where zstr is a union of char*, UChar*, and void*.
 
 
 Parameter Parsing API Modifications
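
Note: a minimal standalone sketch of the zstr idea described in this hunk, not
the engine's actual definition. It assumes UChar is a 16-bit UTF-16 code unit
(as in ICU). The member names s, u and v follow the accessors used elsewhere in
this README (str.s, str.u, path.v), while the DEMO_* constants and
zstr_byte_len() are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: UChar is a 16-bit UTF-16 code unit, as in ICU. */
typedef uint16_t UChar;

/* Illustrative stand-ins; the engine's IS_STRING / IS_UNICODE values may differ. */
enum { DEMO_IS_STRING = 1, DEMO_IS_UNICODE = 2 };

/* One pointer-sized slot viewed three ways. */
typedef union {
    char  *s;   /* native (binary) string */
    UChar *u;   /* Unicode string, UTF-16 code units */
    void  *v;   /* untyped view, handy for NULL checks */
} zstr;

/* Hypothetical helper: bytes occupied by `len` units of the given type. */
static size_t zstr_byte_len(int type, int len)
{
    assert(len >= 0);
    return (type == DEMO_IS_UNICODE) ? (size_t)len * sizeof(UChar)
                                     : (size_t)len;
}
```

Because all three members share one slot, code can test .v for NULL without
knowing the string type, and pick .s or .u once the type is known.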



-- 
PHP CVS Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2008-02-08 Thread Marcus Boerger
helly   Fri Feb  8 09:28:15 2008 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  - Type
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.19&r2=1.20&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.19 php-src/README.UNICODE-UPGRADES:1.20
--- php-src/README.UNICODE-UPGRADES:1.19 Thu Feb  7 18:40:28 2008
+++ php-src/README.UNICODE-UPGRADES Fri Feb  8 09:28:15 2008
@@ -549,7 +549,7 @@
 
 zend_spprintf(error, 0, "class '%.*Z' not found", clen, callable);
 
-The function allows to output any kind of zaval values, as long as a
+The function allows to output any kind of zval values, as long as a
 string (or unicode) conversion is available. Note that printing non
 string zvals outside of request time is not possible.
 




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2008-02-07 Thread Marcus Boerger
helly   Thu Feb  7 18:33:21 2008 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  - WS
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.17&r2=1.18&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.17 php-src/README.UNICODE-UPGRADES:1.18
--- php-src/README.UNICODE-UPGRADES:1.17 Fri Jan 19 09:31:52 2007
+++ php-src/README.UNICODE-UPGRADES Thu Feb  7 18:33:20 2008
@@ -96,7 +96,7 @@
 }
 /* process Unicode string */
 
-
+
   'T' specifier
   -
   This specifier is useful when the function takes two or more strings and
@@ -104,7 +104,7 @@
   problematic if the passed-in strings are of mixed types, and multiple
   checks need to be performed in order to do anything. All parameters
   marked by the 'T' specifier are promoted to the same type.
-  
+
   If at least one of the 'T' parameters is of Unicode type, then the rest of
   them are converted to IS_UNICODE. Otherwise all 'T' parameters are conveted to
   IS_STRING type.
@@ -293,7 +293,7 @@
 zend_ascii_to_unicode() function can be used to convert an ASCII char*
 string to Unicode. This is useful especially for inline string literals, in
 which case you can simply use USTR_MAKE() macro, e.g.:
-   
+
UChar* ustr;
 
ustr = USTR_MAKE("main");
@@ -393,7 +393,7 @@
 
 UBYTES() macro can be used to obtain the number of bytes necessary to store
 the given number of UChar's. The typical usage is:
-  
+
 char *constant_name = colon + (UG(unicode)?UBYTES(2):2);
 
 
@@ -463,8 +463,8 @@
 eustrndup(s, length)
 eustrdup(s)
 
-peumalloc(size, persistent) 
-peurealloc(ptr, size, persistent) 
+peumalloc(size, persistent)
+peurealloc(ptr, size, persistent)
 
 The size parameter refers to the number of UChar's, not bytes.
 
@@ -542,7 +542,7 @@
 
 Since [v]spprintf() can only output native strings there are also the new
 functions [v]uspprintf() and [v]zspprintf() that create unicode strings and
-return the number of characters printed. That is they return the length rather 
+return the number of characters printed. That is they return the length rather
 than the byte size. The second pair of functions also takes an additional type
 parameter that allows to create a string of arbitrary type. The following
 example illustrates the use. Assume it fetches a unicode/native string into
@@ -556,9 +556,9 @@
 
if (path.v) {
sub_type = path_type;
-   sub_len = zspprintf(path_type, sub_name, 0, "%R%c%s", 
-   path_type, path, 
-   DEFAULT_SLASH, 
+   sub_len = zspprintf(path_type, sub_name, 0, "%R%c%s",
+   path_type, path,
+   DEFAULT_SLASH,
entry.d_name);
}
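
Note: the UBYTES() and eumalloc() hunks above both turn on the difference
between a count of UChar's and a count of bytes. The following is a standalone
sketch of that arithmetic only. It assumes UChar is a 16-bit code unit, and
DEMO_UBYTES, demo_umalloc() and demo_prefix_bytes() are hypothetical stand-ins,
not the engine's macros or allocators.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: UChar is a 16-bit UTF-16 code unit, as in ICU. */
typedef uint16_t UChar;

/* Stand-in for UBYTES(): bytes needed to store n UChar's. */
#define DEMO_UBYTES(n) ((size_t)(n) * sizeof(UChar))

/* Hypothetical analogue of eumalloc(): the size is given in UChar's, not bytes. */
static UChar *demo_umalloc(size_t units)
{
    assert(units > 0);
    return malloc(DEMO_UBYTES(units));
}

/* Byte offset needed to skip `chars` characters in either string mode,
 * mirroring the (UG(unicode) ? UBYTES(2) : 2) pattern quoted above. */
static size_t demo_prefix_bytes(int unicode_mode, size_t chars)
{
    return unicode_mode ? DEMO_UBYTES(chars) : chars;
}
```

Skipping a two-character prefix therefore costs 2 bytes for a native string and
4 bytes for a Unicode one, which is why the pointer arithmetic in the quoted
example branches on the mode.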
 




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2007-01-19 Thread Marcus Boerger
helly   Fri Jan 19 09:30:18 2007 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  - Update
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.15&r2=1.16&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.15 php-src/README.UNICODE-UPGRADES:1.16
--- php-src/README.UNICODE-UPGRADES:1.15 Wed Jan 10 23:09:28 2007
+++ php-src/README.UNICODE-UPGRADES Fri Jan 19 09:30:18 2007
@@ -540,6 +540,27 @@
 zend_error(E_WARNING, "%v::__toString() did not return anything",
 Z_OBJCE_P(object)->name);
 
+Since [v]spprintf() can only output native strings there are also the new
+function [v]uspprintf() and [v]zspprintf() that create unicode strings and
+return the number of characters printed. That is they return the length rather 
+than the byte size. The second pair offunction also takes an additional type
+parameter that allows to create a string of arbitrary type. The following
+example illustrates the use. Assume it fetches a unicode/native string into
+path, path_len, path_type and then creates sub_name, sub_len and sub_type.
+
+   zstr path, sub_name;
+   int path_len, sub_len;
+   zend_uchar path_type, sub_type;
+
+   /* fetch */
+
+   if (path.v) {
+   sub_type = path_type;
+   sub_len = zspprintf(path_type, sub_name, 0, "%R%c%s", 
+   path_type, path, 
+   DEFAULT_SLASH, 
+   entry.d_name);
+   }
 
 
 Upgrading Functions




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2007-01-19 Thread Marcus Boerger
helly   Fri Jan 19 09:31:52 2007 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  - Nicer version
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.16&r2=1.17&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.16 php-src/README.UNICODE-UPGRADES:1.17
--- php-src/README.UNICODE-UPGRADES:1.16 Fri Jan 19 09:30:18 2007
+++ php-src/README.UNICODE-UPGRADES Fri Jan 19 09:31:52 2007
@@ -541,12 +541,12 @@
 Z_OBJCE_P(object)->name);
 
 Since [v]spprintf() can only output native strings there are also the new
-function [v]uspprintf() and [v]zspprintf() that create unicode strings and
+functions [v]uspprintf() and [v]zspprintf() that create unicode strings and
 return the number of characters printed. That is they return the length rather 
-than the byte size. The second pair offunction also takes an additional type
+than the byte size. The second pair of functions also takes an additional type
 parameter that allows to create a string of arbitrary type. The following
 example illustrates the use. Assume it fetches a unicode/native string into
-path, path_len, path_type and then creates sub_name, sub_len and sub_type.
+path, path_len and path_type inorder to create sub_name, sub_len and sub_type.
 
zstr path, sub_name;
int path_len, sub_len;




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2007-01-10 Thread Andrei Zmievski
andrei  Wed Jan 10 23:09:29 2007 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  Update with info from README.UNICODE.
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.14&r2=1.15&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.14 php-src/README.UNICODE-UPGRADES:1.15
--- php-src/README.UNICODE-UPGRADES:1.14 Wed Dec 20 20:17:45 2006
+++ php-src/README.UNICODE-UPGRADES Wed Jan 10 23:09:28 2007
@@ -6,6 +6,151 @@
 functionality and concepts without going into technical implementation
 details.
 
+Internal Encoding
+=================
+
+UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
+two bytes for any Unicode character in the Basic Multilingual Plane, which
+is where most of the current world's languages are represented. While being
+less memory efficient for basic ASCII text it simplifies the processing and
+makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
+processing as well.
+
+
+Zval Structure Changes
+======================
+
+For IS_UNICODE type, we add another structure to the union:
+
+union {
+
+struct {
+UChar *val;/* Unicode string value */
+int len;   /* number of UChar's */
+} ustr;
+
+} value;
+
+This cleanly separates the two types of strings and helps preserve backwards
+compatibility.
+
+To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
+another structure:
+
+union {
+
+struct {/* Universal string type */
+zstr val;
+int len;
+} uni;
+
+} value;
+
+Where zstr ia union of char*, UChar*, and void*.
+
+
+Parameter Parsing API Modifications
+===================================
+
+There are now five new specifiers: 'u', 't', 'T', 'U', 'S', 'x' and a new '&'
+modifier.
+
+  't' specifier
+  -------------
+  This specifier indicates that the caller requires the incoming parameter to be
+  string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
+  string value, length, and type.
+
+void *str;
+int len;
+zend_uchar type;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
+return;
+}
+if (type == IS_UNICODE) {
+   /* process Unicode string */
+} else {
+   /* process binary string */
+}
+
+  For IS_STRING type, the length represents the number of bytes, and for
+  IS_UNICODE the number of UChar's. When converting other types (numbers,
+  booleans, etc) to strings, the exact behavior depends on the Unicode semantics
+  switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.
+
+
+  'u' specifier
+  -------------
+  This specifier indicates that the caller requires the incoming parameter
+  to be a Unicode encoded string. If a non-Unicode string is passed, the engine
+  creates a copy of the string and automatically convert it to Unicode type before
+  passing it to the internal function. No such conversion is necessary for Unicode
+  strings, obviously.
+
+UChar *str;
+int len;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
+return;
+}
+/* process Unicode string */
+
+
+  'T' specifier
+  -------------
+  This specifier is useful when the function takes two or more strings and
+  operates on them. Using 't' specifier for each one would be somewhat
+  problematic if the passed-in strings are of mixed types, and multiple
+  checks need to be performed in order to do anything. All parameters
+  marked by the 'T' specifier are promoted to the same type.
+  
+  If at least one of the 'T' parameters is of Unicode type, then the rest of
+  them are converted to IS_UNICODE. Otherwise all 'T' parameters are conveted to
+  IS_STRING type.
+
+
+void *str1, *str2;
+int len1, len2;
+zend_uchar type1, type2;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
+ &type1, &str2, &len2, &type2) == FAILURE) {
+   return;
+}
+if (type1 == IS_UNICODE) {
+   /* process as Unicode, str2 is guaranteed to be Unicode as well */
+} else {
+   /* process as binary string, str2 is guaranteed to be the same */
+}
+
+
+   'x' specifier
+   -------------
+   This specifier acts as either 'u' or 's', depending on the value of the
+   unicode semantics switch. If UG(unicode) is on, it behaves as 'u', and as
+   's' otherwise.
+
+The existing 's' specifier has been modified as well. If a Unicode string is
+passed in, it automatically copies and converts the string to the runtime
+encoding, and issues a warning. If a binary type is passed-in, no conversion
+is necessary. The '&' modifier can be used after 's' specifier to force
+a different converter instead.
+
+char *str;
+int 

[PHP-CVS] cvs: php-src / README.UNICODE

2007-01-10 Thread Andrei Zmievski
andrei  Wed Jan 10 23:16:40 2007 UTC

  Modified files:  
    /php-src    README.UNICODE
  Log:
  Update with rewrites by me and Evan G.
  
  http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.7&r2=1.8&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.7 php-src/README.UNICODE:1.8
--- php-src/README.UNICODE:1.7  Fri Dec 15 23:33:48 2006
+++ php-src/README.UNICODE  Wed Jan 10 23:16:40 2007
@@ -1,133 +1,111 @@
+Audience
+========
+
+This README describes how PHP 6 provides native support for the Unicode 
+Standard. Readers of this document should be proficient with PHP and have a
+basic understanding of Unicode concepts. For more technical details about
+PHP 6 design principles and for guidelines about writing Unicode-ready PHP 
+extensions, refer to README.UNICODE-UPGRADES.
+
 Introduction
 ============
 
-As successful as PHP has proven to be in the past several years, it is still
-the only remaining member of the P-trinity of scripting languages - Perl and
-Python being the other two - that remains blithely ignorant of the
-multilingual and multinational environment around it. The software
-development community has been moving towards Unicode Standard for some time
-now, and PHP can no longer afford to be outside of this movement. Surely,
-some steps have been taken recently to allow for easier processing of
-multibyte data with the mbstring extension, but it is not enabled in PHP by
-default and is not as intuitive or transparent as it could be.
-
-The basic goal of this document is to describe how PHP 6 will support the
-Unicode Standard natively. Since the full implementation of the Unicode
-Standard is very involved, the idea is to use the already existing,
-well-tested, full-featured, and freely available ICU (International
-Components for Unicode) library. This will allow us to concentrate on the
-details of PHP integration and speed up the implementation.
+As successful as PHP has proven to be over the years, its support for
+multilingual and multinational environments has languished. PHP can no
+longer afford to remain outside the overall movement towards the Unicode
+standard.  Although recent updates involving the mbstring extension have
+enabled easier multibyte data processing, this does not constitute native
+Unicode support.
+
+Since the full implementation of the Unicode Standard is very involved, our
+approach is to speed up implementation by using the well-tested,
+full-featured, and freely available ICU (International Components for
+Unicode) library.
+
 
 General Remarks
 ===============
 
-Backwards Compatibility

-Throughout the design and implementation of Unicode support, backwards
-compatibility must be of paramount concern. PHP is used on an enormous number of
-sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
-that the existing data types and functions must work as they have always
-done. However, the speed of certain operations may be affected, due to
-increased complexity of the code overall.
-
-Unicode Encoding
-
-The initial version will not support Byte Order Mark. Text processing will
-generally perform better if the characters are in Normalization Form C.
-
-
-Implementation Approach
-===
-
-The implementation is done in phases. This allows for more basic and
-low-level implementation issues to be ironed out and tested before
-proceeding to more advanced topics.
-
-Legend:
- - TODO
- + finished
- * in progress
-
-  Phase I
-  ---
-+ Basic Unicode string support, including instantiation, concatenation,
-  indexing
-
-+ Simple output of Unicode strings via 'print' and 'echo' statements
-  with appropriate output encoding conversion
-
-+ Conversion of Unicode strings to/from various encodings via encode() and
-  decode() functions
-
-+ Determining length of Unicode strings via strlen() function, some
-  simple string functions ported (substr).
-
+International Components for Unicode
+
 
-  Phase II
-  
-* HTTP input request decoding
+ICU (International Components for Unicode is a mature, widely used set of
+C/C++ and Java libraries for Unicode support, software internationalization
+and globalization. It provides:
+
+  - Encoding conversions
+  - Collations
+  - Unicode text processing
+  - and much more
+
+When building PHP 6, Unicode support is always enabled. The only
+configuration option during development should be the location of the ICU
+headers and libraries.
 
-+ Fixing remaining string-aware operators (assignment to [] etc)
-
-+ Support for Unicode and binary strings in PHP streams
-
-+ Support for Unicode identifiers
-
-+ Configurable handling of conversion failures
-
-+ \C{} escape sequence in strings
-
-
-  Phase III
-  -
-* Exposing ICU API
+  --with-icu-dir=<dir>
+  
+where <dir> specifies the location of 

[PHP-CVS] cvs: php-src / README.UNICODE

2006-12-15 Thread Andrei Zmievski
andrei  Fri Dec 15 23:33:48 2006 UTC

  Modified files:  
    /php-src    README.UNICODE
  Log:
  Update with INI file info.
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.6&r2=1.7&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.6 php-src/README.UNICODE:1.7
--- php-src/README.UNICODE:1.6  Thu Aug 24 21:56:57 2006
+++ php-src/README.UNICODE  Fri Dec 15 23:33:48 2006
@@ -211,6 +211,15 @@
unicode.script_encoding = utf-8
 
 
+INI Files
+=========
+
+INI files will be presumed to contain UTF-8 encoded keys and values when the
+Unicode semantics mode is On. When the mode is off, the data is taken as-is,
+similar to PHP 5. No validation occurs during parsing. Instead invalid UTF-8
+sequences are caught during access by ini_*() functions.
+
+
 Conversion Semantics
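
Note: the INI Files paragraph added in this commit says that invalid UTF-8
sequences are caught on access rather than at parse time. The following is a
standalone sketch of what a well-formedness check for UTF-8 can look like. It
is not the code used by the ini_*() functions, and demo_is_valid_utf8() is a
hypothetical name. It rejects truncated sequences, stray continuation bytes,
overlong forms, surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if s[0..len) is well-formed UTF-8, 0 otherwise. */
static int demo_is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        size_t need, k;
        uint32_t cp, min;

        if (c < 0x80) {                     /* ASCII: one byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* lead byte of a 2-byte form */
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* lead byte of a 3-byte form */
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* lead byte of a 4-byte form */
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;                       /* stray continuation or bad lead byte */
        }
        if (len - i - 1 < need) {
            return 0;                       /* sequence is truncated */
        }
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return 0;                   /* not a continuation byte */
            }
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return 0;                       /* overlong, out of range, or surrogate */
        }
        i += need + 1;
    }
    return 1;
}
```

Deferring a check of this kind to access time means an INI file with a bad
value still loads, and only the offending key fails when it is read.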
 
 




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2006-10-17 Thread Sara Golemon
pollita Tue Oct 17 20:56:29 2006 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  Update the upgrading doc to the current wisdom.  Pass One.
  This pass simply retruthifies the information already present.
  The next pass will add additional information.
  
  http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.7&r2=1.8&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.7 php-src/README.UNICODE-UPGRADES:1.8
--- php-src/README.UNICODE-UPGRADES:1.7 Wed Jun 28 15:07:14 2006
+++ php-src/README.UNICODE-UPGRADES Tue Oct 17 20:56:28 2006
@@ -16,70 +16,131 @@
 switch. Its value is found in the Unicode globals variable, UG(unicode). It
 is either on or off for the entire request.
 
-The big thing is that there are two new string types: IS_UNICODE and
-IS_BINARY. The former one has its own storage in the value union part of
-zval (value.ustr) and the latter re-uses value.str.
+The big thing is that there is a new string types: IS_UNICODE.
+This has its own storage in the value union part of
+zval (value.ustr) while non-unicode (binary) strings reuse the
+IS_STRING type and the value.str element of the zval.
 
-Both types have new macros to set the zval value and to access it.
+New macros exist (parallel to Z_STRVAL/Z_STRLEN) for accessing unicode strings.
 
 Z_USTRVAL(), Z_USTRLEN()
- - accesses the value and length (in code units) of the Unicode type string
-
-Z_BINVAL(), Z_BINLEN()
- - accesses the value and length of the binary type string
+ - accesses the value (as a UChar*) and length (in code units) of the Unicode type string
+   value.ustr.val   value.ustr.len
 
 Z_UNIVAL(), Z_UNILEN()
- - accesses either Unicode or native string value, depending on the current
- setting of UG(unicode) switch. The Z_UNIVAL() type resolves to char*, so
- you may need to cast it appropriately.
+ - accesses the value (as a zstr) and length (in type-appropriate units)
+   value.uni.valvalue.uni.len
 
 Z_USTRCPLEN()
- - gives the number of codepoints in the Unicode type string
-
-ZVAL_BINARY(), ZVAL_BINARYL()
- - Sets zval to hold a binary string. Takes the same parameters as
-   Z_STRING(), Z_STRINGL().
+ - gives the number of codepoints (not units) in the Unicode type string
+   This macro examines the actual string taking into account Surrogate Pairs
+   and returns the number of UChar32(UTF32) codepoints which may be less than the
+   number of UChar(UTF16) codeunits found in the string buffer.
+   If this value will be used repeatedly, consider storing it in a local variable
+   to avoid having to reexamine the string every time.
+
+
+ZVAL_* macros
+-------------
+
+The 'dup' parameter to the ZVAL_STRING()/RETVAL_STRING()/RETURN_STRING() type
+macros has been extended slightly.  The following defines are now encouraged instead:
+
+#define ZSTR_DUPLICATE (1<<0)
+#define ZSTR_AUTOFREE  (1<<1)
+
+ZSTR_DUPLICATE (which has a resulting value of 1) serves the same purpose as a
+truth value in old-style 'dup' flags.  The value of 1 was specifically chosen
+to match the common practice of passing a 1 for this parameter.
+Warning: If you find extension code which uses a truth value other than one for
+the dup flag, its logic should be modified to explicitly pass ZSTR_DUPLICATE instead.
+
+ZSTR_AUTOFREE is used with macros such as ZVAL_RT_STRING which may populate Unicode
+zvals from non-unicode source strings.  When UG(unicode) is on, the source string
+will be implicitly copied (to make a UChar* version).  If the original string
+needed copying anyway this is fine.  However if the original string was emalloc()'d
+and would have ordinarily been given to the engine (i.e. RETURN_STRING(estrdup("foo"), 0))
+then it will need to be freed in UG(unicode) mode to avoid leaking.
+The ZSTR_AUTOFREE flag ensures that the original string is freed in UG(unicode) mode.
 
-ZVAL_UNICODE, ZVAL_UNICODEL()
+ZVAL_UNICODE(pzv, str, dup), ZVAL_UNICODEL(pzv, str, str_len, dup)
  - Sets zval to hold a Unicode string. Takes the same parameters as
Z_STRING(), Z_STRINGL().
 
-ZVAL_ASCII_STRING(), ZVAL_ASCII_STRINGL()
- - When UG(unicode) is off, it's equivalent to Z_STRING(), ZSTRINGL(). When
-   UG(unicode) is on, it sets zval to hold a Unicode representation of the
-   passed-in ASCII string. It will always create a new string in
-   UG(unicode)=1 case, so the value of the duplicate flag is not taken into
-   account.
-
-ZVAL_RT_STRING()
- - When UG(unicode) is off, it's equivalent to Z_STRING(), Z_STRINGL(). WHen
-   UG(unicode) is on, it takes the input string, converts it to Unicode
-   using the runtime_encoding converter and sets zval to it. Since a new
-   string is always created in this case, the value of the duplicate flag
-   does not matter.
+ZVAL_U_STRING(conv, pzv, str, dup), ZVAL_U_STRINGL(conv, pzv, str, str_len, dup)
+ - When UG(unicode) is off, it's equivalent to Z_STRING(), ZSTRINGL()
+   and the conv parameter is ignored.

[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2006-10-17 Thread Sara Golemon
pollita Tue Oct 17 21:42:29 2006 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  More unicode upgrading notes
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.8&r2=1.9&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.8 php-src/README.UNICODE-UPGRADES:1.9
--- php-src/README.UNICODE-UPGRADES:1.8 Tue Oct 17 20:56:28 2006
+++ php-src/README.UNICODE-UPGRADES Tue Oct 17 21:42:28 2006
@@ -407,8 +407,8 @@
 This functions returns part of a string based on offset and length
 parameters.
 
-void *str;
-int32_t str_len, cp_len;
+zstr str;
+int str_len, cp_len;
 zend_uchar str_type;
 
 if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
@@ -417,11 +417,11 @@
 
 The first thing we notice is that the incoming string specifier is 't',
 which means that we can accept all 3 string types. The 'str' variable is
-declared as void*, because it can point to either UChar* or char*.
+declared as zstr, because it can point to either UChar* or char*.
 The actual type of the incoming string is stored in 'str_type' variable.
 
 if (str_type == IS_UNICODE) {
-cp_len = u_countChar32(str, str_len);
+cp_len = u_countChar32(str.u, str_len);
 } else {
 cp_len = str_len;
 }
@@ -435,10 +435,10 @@
 
 if (str_type == IS_UNICODE) {
 int32_t start = 0, end = 0;
-U16_FWD_N((UChar*)str, end, str_len, f);
+U16_FWD_N(str.u, end, str_len, f);
 start = end;
-U16_FWD_N((UChar*)str, end, str_len, l);
-RETURN_UNICODEL((UChar*)str + start, end-start, 1);
+U16_FWD_N(str.u, end, str_len, l);
+RETURN_UNICODEL(str.u + start, end-start, ZSTR_DUPLICATE);
 
 Since codepoint (character) #n is not necessarily at offset #n in Unicode
 strings, we start at the beginning and iterate forward until we have gone
@@ -448,10 +448,10 @@
 segment as a Unicode string.
 
 } else {
-RETURN_STRINGL((char*)str + f, l, 1);
+RETURN_STRINGL(str.s + f, l, ZSTR_DUPLICATE);
 }
 
-For native and binary types, we can return the segment directly.
+For native strings, we can return the segment directly.
 
 
 strrev()
@@ -486,9 +486,9 @@
 Unicode type, processes it exactly as before, simply swapping bytes around.
 For Unicode case, the magic is like this:
 
-   int32_t i, x1, x2;
-   UChar32 ch;
-   UChar *u_s, *u_n, *u_p;
+int32_t i, x1, x2;
+UChar32 ch;
+UChar *u_s, *u_n, *u_p;
 
 u_n = eumalloc(Z_USTRLEN_PP(str)+1);
 u_p = u_n;
@@ -525,6 +525,98 @@
 characters (UChar32 type) to 1 or 2 UTF-16 code units (UChar type).
 
 
+realpath()
+----------
+
+Filenames use their own converter as it's not uncommon, for example,
+to need to access files on a filesystem with latin1 entries while outputting
+UTF8 runtime content.
+
+The most common approach to parsing filenames can be found in realpath():
+
+zval **ppfilename;
+char *filename;
+int filename_len;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "Z", &ppfilename) == FAILURE ||
+   php_stream_path_param_encode(ppfilename, &filename, &filename_len, REPORT_ERRORS, FG(default_context)) == FAILURE) {
+   return;
+}
+
+Here, the filename is taken first as a generic zval**, then converted (separating if necessary)
+and populated into local char* and int storage.  The filename will be converted according to
+unicode.filesystem_encoding unless the wrapper specified overrides this with its own conversion
+function (The http:// wrapper, for example, enforces utf8 conversion).
+
+
+rmdir()
+-------
+
+If the function accepts a context parameter, then this context should be used in place of FG(default_context)
+
+zval **ppdir, *zcontext = NULL;
+char *dir;
+int dir_len;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "Z|r", &ppdir, &zcontext) == FAILURE) {
+   return;
+}
+
+context = php_stream_context_from_zval(zcontext, 0);
+if (php_stream_path_param_encode(ppdir, &dir, &dir_len, REPORT_ERRORS, context) == FAILURE) {
+   return;
+}
+
+
+sqlite_query()
+--------------
+
+If the function's underlying library expects a particular encoding (i.e. UTF8), then the alternate form of
+the string parameter may be used with zend_parse_parameters().
+
+char *sql;
+int sql_len;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s&", &sql, &sql_len, UG(utf8_conv)) == FAILURE) {
+return;
+}
+
+Converters
+==========
+
+Standard Converters
+-------------------
+
+The following converters (UConverter*) are initialized by Zend and are always available (regardless of UG(unicode) mode):
+  UG(utf8_conv)
+  UG(ascii_conv)
+  UG(fallback_encoding_conv) - UTF8 unless overridden by INI setting unicode.fallback_encoding
+
+Additional converters will be optionally initialized depending on INI settings:
+  UG(runtime_encoding_conv) - unicode.runtime_encoding
+   . Unicode output generated by 

[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2006-10-17 Thread Andrei Zmievski
andrei  Tue Oct 17 21:55:59 2006 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  Typo.
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.9&r2=1.10&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.9 php-src/README.UNICODE-UPGRADES:1.10
--- php-src/README.UNICODE-UPGRADES:1.9 Tue Oct 17 21:42:28 2006
+++ php-src/README.UNICODE-UPGRADES Tue Oct 17 21:55:59 2006
@@ -263,7 +263,7 @@
 you cannot simply index the UChar* string to  get the desired codepoint.
 
 The zval's value.ustr.len contains the number of code units (UChar -- UTF16).
-To obtain the number of code points, one can use u_counChar32() ICU API
+To obtain the number of code points, one can use u_countChar32() ICU API
 function or Z_USTRCPLEN() macro.
 
 ICU provides a number of macros for working with UTF-16 strings on the
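
Note: the hunk above distinguishes code units (what value.ustr.len counts) from
code points (what u_countChar32() and Z_USTRCPLEN() report). The following is a
standalone sketch of that difference without ICU. It assumes UChar is a 16-bit
UTF-16 code unit, and the demo_* functions are hypothetical helpers that only
approximate what u_countChar32() and U16_FWD_N do. In this sketch an unpaired
surrogate is counted as one code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: UChar is a 16-bit UTF-16 code unit, as in ICU. */
typedef uint16_t UChar;

static int demo_is_lead(UChar c)  { return (c & 0xFC00) == 0xD800; }
static int demo_is_trail(UChar c) { return (c & 0xFC00) == 0xDC00; }

/* Number of code units occupied by the code point starting at s[i]. */
static int demo_unit_width(const UChar *s, int i, int len)
{
    return (demo_is_lead(s[i]) && i + 1 < len && demo_is_trail(s[i + 1])) ? 2 : 1;
}

/* Count code points in a buffer of `len` code units. */
static int demo_count_code_points(const UChar *s, int len)
{
    int i = 0, count = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        i += demo_unit_width(s, i, len);
        count++;
    }
    return count;
}

/* Advance n code points from unit offset i and return the new unit offset. */
static int demo_forward_n(const UChar *s, int i, int len, int n)
{
    assert(s != NULL || len == 0);
    while (n > 0 && i < len) {
        i += demo_unit_width(s, i, len);
        n--;
    }
    return i;
}
```

For the buffer { 'a', 0xD834, 0xDD1E, 'b' } the length in code units is 4 but
the count of code points is 3, and the third code point starts at unit offset
3, not 2. This is why a UChar* string cannot simply be indexed by character
number.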




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2006-10-17 Thread Andrei Zmievski
andrei  Tue Oct 17 21:57:22 2006 UTC

  Modified files:  
    /php-src    README.UNICODE-UPGRADES
  Log:
  Don't mention http_input_encoding converter as it won't be used anymore
  soon.
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE-UPGRADES?r1=1.10&r2=1.11&diff_format=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.10 php-src/README.UNICODE-UPGRADES:1.11
--- php-src/README.UNICODE-UPGRADES:1.10 Tue Oct 17 21:55:59 2006
+++ php-src/README.UNICODE-UPGRADES Tue Oct 17 21:57:22 2006
@@ -599,9 +599,6 @@
   UG(script_encoding_conv) - unicode.script_encoding
. Scripts read from disk will be decoded using this converter
 
-  UG(http_input_encoding_conv) - unicode.http_input_encoding
-   . HTTP Request data ($_GET / $_POST) will be decoded using this converter
-
   UG(filesystem_encoding_conv) - unicode.filesystem_encoding
. Filenames and paths will be encoding using this converter
 




[PHP-CVS] cvs: php-src / README.UNICODE

2006-08-24 Thread Andrei Zmievski
andrei  Thu Aug 24 21:39:20 2006 UTC

  Modified files:  
    /php-src    README.UNICODE
  Log:
  Fix typo.
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.4&r2=1.5&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.4 php-src/README.UNICODE:1.5
--- php-src/README.UNICODE:1.4  Tue Jul 11 23:05:33 2006
+++ php-src/README.UNICODE  Thu Aug 24 21:39:20 2006
@@ -327,6 +327,7 @@
 struct {
 UChar *val;/* Unicode string value */
 int len;   /* number of UChar's */
+} ustr;
 
 } value;
 




[PHP-CVS] cvs: php-src / README.UNICODE

2006-08-24 Thread Andrei Zmievski
andrei  Thu Aug 24 21:56:57 2006 UTC

  Modified files:  
    /php-src    README.UNICODE
  Log:
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.5&r2=1.6&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.5 php-src/README.UNICODE:1.6
--- php-src/README.UNICODE:1.5  Thu Aug 24 21:39:20 2006
+++ php-src/README.UNICODE  Thu Aug 24 21:56:57 2006
@@ -32,9 +32,8 @@
 
 Unicode Encoding
 
-The initial version will not support Byte Order Mark. Characters are
-expected to be composed, Normalization Form C. Later versions will support
-BOM, and decomposed and other characters.
+The initial version will not support Byte Order Mark. Text processing will
+generally perform better if the characters are in Normalization Form C.
 
 
 Implementation Approach




[PHP-CVS] cvs: php-src / README.UNICODE

2006-07-12 Thread Andrei Zmievski
andrei  Tue Jul 11 22:59:19 2006 UTC

  Modified files:  
    /php-src    README.UNICODE
  Log:
  Update design doc.
  
  http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.2&r2=1.3&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.2 php-src/README.UNICODE:1.3
--- php-src/README.UNICODE:1.2  Wed Jun 28 15:07:14 2006
+++ php-src/README.UNICODE  Tue Jul 11 22:59:19 2006
@@ -68,15 +68,13 @@
   
 * HTTP input request decoding
 
-+ Fixing remaining string-aware operators (assignment to {}, etc)
++ Fixing remaining string-aware operators (assignment to [] etc)
 
-+ Comparison (collation) of Unicode strings with built-in operators
-
-* Support for Unicode and binary strings in PHP streams
++ Support for Unicode and binary strings in PHP streams
 
 + Support for Unicode identifiers
 
-* Configurable handling of conversion failures
++ Configurable handling of conversion failures
 
 + \C{} escape sequence in strings
 
@@ -85,7 +83,7 @@
   -
 * Exposing ICU API
 
-- Porting all remaining functions to support Unicode and/or binary
+* Porting all remaining functions to support Unicode and/or binary
   strings
 
 
@@ -96,6 +94,24 @@
 list of encodings.
 
 
+Unicode Semantics Switch
+
+
+Obviously, PHP cannot simply impose new Unicode support on everyone. There
+are many applications that do not care about Unicode and do not need it.
+Consequently, there is a switch that enables certain fundamental language
+changes related to Unicode. This switch is available only as a site-wide (per
+virtual server) INI setting.
+
+Note that having the switch turned off does not imply that PHP is unaware of Unicode
+at all and that no Unicode strings can exist. It only affects certain aspects of
+the language, and Unicode strings can always be created programmatically. All
+the functions and operators will still support Unicode strings and work
+appropriately.
+
+unicode.semantics = On
+
+
 Internal Encoding
 =
 
@@ -115,7 +131,7 @@
 encoding. If the fallback_encoding is not specified either, it is set to
 UTF-8.
 
-  fallback_encoding = iso-8859-1
+  unicode.fallback_encoding = iso-8859-1
 
 
 Runtime Encoding
@@ -123,69 +139,77 @@
 
 Currently PHP neither specifies nor cares what the encoding of its strings
 is. However, the Unicode implementation needs to know what this encoding is
-for several reasons, including type coersion and encoding conversion for
-strings generated at runtime via function calls and casting. This setting
-specifies this runtime encoding.
+for several reasons, including explicit (casting) and implicit (concatenation,
+comparison, parameter passing) type coercions. This setting specifies the
+runtime encoding.
 
-  runtime_encoding = iso-8859-1
+  unicode.runtime_encoding = iso-8859-1
 
 
 Output Encoding
 ===
 
 Automatic output encoding conversion is supported on the standard output
-stream.  Therefore, command such as 'print' and 'echo' automatically convert
+stream.  Therefore, commands such as 'print' and 'echo' automatically convert
 their arguments to the specified encoding. No automatic output encoding is
 performed for anything else. Therefore, when writing to files or external
 resources, the developer has to manually encode the data using functions
-provided by the unicode extension or rely on stream encoding filters. The
-unicode extension provides necessary stream filters to make developers'
-lives easier.
+provided by the unicode extension or rely on stream encoding features.
 
 The existing default_charset setting so far has been used only for
 specifying the charset portion of the Content-Type MIME header. For several
 reasons, this setting is deprecated. Now it is only used when the Unicode
 semantics switch is disabled and does not affect the actual transcoding of
 the output stream. The output encoding setting takes precedence in all other
-cases.
+cases. If the output encoding is set, PHP will automatically add the 'charset'
+portion to the Content-Type header.
 
-  output_encoding = utf-8
+  unicode.output_encoding = utf-8
 
 
 HTTP Input Encoding
 ===
 
-To make accessing HTTP input variables easier, PHP automatically decodes
-HTTP GET and POST requests based on the specified encoding. If the HTTP
-request contains the encoding specification in the headers, then it will be
-used instead of this setting. If the HTTP input encoding setting is not
-specified, PHP falls back onto the output encoding setting, because modern
-browsers are supposed to return the data in the same encoding as they
-received it in.
-
-If the actual encoding is passed in the request itself or is found
-elsewhere, then the application can ask PHP to re-decode the raw input
-explicitly.
-
-  http_input_encoding = utf-8
+There will be no explicit input encoding setting. Instead, PHP will rely on a
+couple of heuristics to 

[PHP-CVS] cvs: php-src / README.UNICODE

2006-07-11 Thread Andrei Zmievski
andrei  Tue Jul 11 23:05:34 2006 UTC

  Modified files:  
/php-srcREADME.UNICODE 
  Log:
  Typos
  
  
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.3r2=1.4diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.3 php-src/README.UNICODE:1.4
--- php-src/README.UNICODE:1.3  Tue Jul 11 22:59:19 2006
+++ php-src/README.UNICODE  Tue Jul 11 23:05:33 2006
@@ -476,7 +476,7 @@
 This forces the input parameter to be a string, and its value and length are
 stored in the variables specified by the caller.
 
-There are now three new specifiers: 't', 'u', and 'T'.
+There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.
 
   't' specifier
   -
@@ -517,7 +517,7 @@
 if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
 return;
 }
-/* process UTF-16 data */
+/* process Unicode string */
 
 
   'T' specifier
@@ -544,7 +544,7 @@
 if (type1 == IS_UNICODE) {
/* process as Unicode, str2 is guaranteed to be Unicode as well */
 } else {
-   /* process as native string, str2 is guaranteed to be the same */
+   /* process as binary string, str2 is guaranteed to be the same */
 }
 
 




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2005-09-27 Thread Andrei Zmievski
andrei  Tue Sep 27 15:56:39 2005 EDT

  Modified files:  
/php-srcREADME.UNICODE-UPGRADES 
  Log:
  strrev() walkthrough
  
  
http://cvs.php.net/diff.php/php-src/README.UNICODE-UPGRADES?r1=1.5r2=1.6ty=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.5 php-src/README.UNICODE-UPGRADES:1.6
--- php-src/README.UNICODE-UPGRADES:1.5 Fri Sep 23 17:24:31 2005
+++ php-src/README.UNICODE-UPGRADES Tue Sep 27 15:56:39 2005
@@ -274,24 +274,24 @@
 This function returns part of a string based on offset and length
 parameters.
 
-   void *str;
-   int32_t str_len, cp_len;
-   zend_uchar str_type;
-
-   if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
-   return;
-   }
+void *str;
+int32_t str_len, cp_len;
+zend_uchar str_type;
+
+if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
+return;
+}
 
 The first thing we notice is that the incoming string specifier is 't',
 which means that we can accept all 3 string types. The 'str' variable is
 declared as void*, because it can point to either UChar* or char*.
 The actual type of the incoming string is stored in 'str_type' variable.
 
-   if (str_type == IS_UNICODE) {
-   cp_len = u_countChar32(str, str_len);
-   } else {
-   cp_len = str_len;
-   }
+if (str_type == IS_UNICODE) {
+cp_len = u_countChar32(str, str_len);
+} else {
+cp_len = str_len;
+}
 
 If the string is a Unicode one, we cannot rely on the str_len value to tell
 us the number of characters in it. Instead, we call u_countChar32() to
@@ -300,12 +300,12 @@
 The next several lines normalize start and length parameters to fit within the
 string. Nothing new here. Then we locate the appropriate segment.
 
-   if (str_type == IS_UNICODE) {
-   int32_t start = 0, end = 0;
-   U16_FWD_N((UChar*)str, end, str_len, f);
-   start = end;
-   U16_FWD_N((UChar*)str, end, str_len, l);
-   RETURN_UNICODEL((UChar*)str + start, end-start, 1);
+if (str_type == IS_UNICODE) {
+int32_t start = 0, end = 0;
+U16_FWD_N((UChar*)str, end, str_len, f);
+start = end;
+U16_FWD_N((UChar*)str, end, str_len, l);
+RETURN_UNICODEL((UChar*)str + start, end-start, 1);
 
 Since codepoint (character) #n is not necessarily at offset #n in Unicode
 strings, we start at the beginning and iterate forward until we have gone
@@ -314,13 +314,84 @@
 of codepoints specified by the length. Once that's done, we can return the
 segment as a Unicode string.
 
-   } else {
-   RETURN_STRINGL((char*)str + f, l, 1);
-   }
+} else {
+RETURN_STRINGL((char*)str + f, l, 1);
+}
 
 For native and binary types, we can return the segment directly.
 
 
+strrev()
+
+
+Let's look at strrev(), which requires a somewhat more complicated upgrade.
+While one of the guidelines for upgrades is that combining sequences are not
+really taken into account during processing -- substr() can break them up,
+for example -- in this case, we actually should be concerned, because
+reversing a combining sequence may result in a completely different string. To
+illustrate:
+
+  a    (U+0061 LATIN SMALL LETTER A)
+  o    (U+006f LATIN SMALL LETTER O)
++ '    (U+0301 COMBINING ACUTE ACCENT)
++ _    (U+0320 COMBINING MINUS SIGN BELOW)
+  l    (U+006C LATIN SMALL LETTER L)
+
+Reversing this would result in:
+
+  l    (U+006C LATIN SMALL LETTER L)
++ _    (U+0320 COMBINING MINUS SIGN BELOW)
++ '    (U+0301 COMBINING ACUTE ACCENT)
+  o    (U+006f LATIN SMALL LETTER O)
+  a    (U+0061 LATIN SMALL LETTER A)
+
+All of a sudden the combining marks are being applied to 'l' instead of 'o'.
+To avoid this, we need to treat combining sequences as a unit, by checking
+the combining character class of each character with u_getCombiningClass().
+
+strrev() obtains its single argument, a string, and unless the string is of
+Unicode type, processes it exactly as before, simply swapping bytes around.
+For the Unicode case, the logic looks like this:
+
+   int32_t i, x1, x2;
+   UChar32 ch;
+   UChar *u_s, *u_n, *u_p;
+
+u_n = eumalloc(Z_USTRLEN_PP(str)+1);
+u_p = u_n;
+u_s = Z_USTRVAL_PP(str);
+
+i = Z_USTRLEN_PP(str);
+while (i > 0) {
+    U16_PREV(u_s, 0, i, ch);
+    if (u_getCombiningClass(ch) == 0) {
+        u_p += zend_codepoint_to_uchar(ch, u_p);
+    } else {
+        x2 = i;
+        do {
+            U16_PREV(u_s, 0, i, ch);
+        } while (u_getCombiningClass(ch) != 0);
+        x1 = i;
+        while (x1 <= x2) {
+            U16_NEXT(u_s, x1, Z_USTRLEN_PP(str), ch);
+            u_p += zend_codepoint_to_uchar(ch, u_p);
+        }
+    }
+}
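
The idea behind the loop above can also be tried outside of PHP and ICU. The
following is only a minimal sketch under stated assumptions: it uses plain
uint16_t instead of UChar, and is_combining() is a hypothetical stand-in for
u_getCombiningClass() that only knows the Combining Diacritical Marks block
(U+0300..U+036F). It is not the code that was committed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ICU's u_getCombiningClass(): only the
 * Combining Diacritical Marks block is treated as combining. */
static int is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Offset of the code point that ends just before 'end' (end > 0). */
static size_t cp_start_before(const uint16_t *s, size_t end)
{
    if (end >= 2 && (s[end - 1] & 0xFC00) == 0xDC00
                 && (s[end - 2] & 0xFC00) == 0xD800) {
        return end - 2;               /* surrogate pair: two code units */
    }
    return end - 1;
}

/* Code point starting at offset i (unpaired surrogates are returned as-is). */
static uint32_t cp_at(const uint16_t *s, size_t i, size_t len)
{
    if ((s[i] & 0xFC00) == 0xD800 && i + 1 < len
                                  && (s[i + 1] & 0xFC00) == 0xDC00) {
        return 0x10000 + (((uint32_t)(s[i] & 0x3FF)) << 10) + (s[i + 1] & 0x3FF);
    }
    return s[i];
}

/* Reverse src (len code units) into dst, keeping each base character
 * together with the combining marks that follow it. dst needs room for
 * len code units. */
void utf16_reverse(const uint16_t *src, size_t len, uint16_t *dst)
{
    size_t end = len, out = 0, k;

    assert(src != NULL && dst != NULL);
    while (end > 0) {
        size_t start = cp_start_before(src, end);
        /* walk back over combining marks until we reach their base */
        while (start > 0 && is_combining(cp_at(src, start, len))) {
            start = cp_start_before(src, start);
        }
        /* copy the whole sequence in its original order */
        for (k = start; k < end; k++) {
            dst[out++] = src[k];
        }
        end = start;
    }
}
```

With the example string from above (a, o, U+0301, U+0320, l) this yields
l, o, U+0301, U+0320, a, so the marks stay attached to 'o'.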

[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2005-09-23 Thread Andrei Zmievski
andrei  Fri Sep 23 17:24:31 2005 EDT

  Modified files:  
/php-srcREADME.UNICODE-UPGRADES 
  Log:
  substr() sample case
  
  
http://cvs.php.net/diff.php/php-src/README.UNICODE-UPGRADES?r1=1.4r2=1.5ty=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.4 php-src/README.UNICODE-UPGRADES:1.5
--- php-src/README.UNICODE-UPGRADES:1.4 Wed Sep 14 14:01:41 2005
+++ php-src/README.UNICODE-UPGRADES Fri Sep 23 17:24:31 2005
@@ -262,6 +262,66 @@
 
 
 
+Upgrading Functions
+===
+
+Let's take a look at a couple of functions that have been upgraded to
+support new string types.
+
+substr()
+
+
+This function returns part of a string based on offset and length
+parameters.
+
+   void *str;
+   int32_t str_len, cp_len;
+   zend_uchar str_type;
+
+   if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "tl|l", &str, &str_len, &str_type, &f, &l) == FAILURE) {
+   return;
+   }
+
+The first thing we notice is that the incoming string specifier is 't',
+which means that we can accept all 3 string types. The 'str' variable is
+declared as void*, because it can point to either UChar* or char*.
+The actual type of the incoming string is stored in 'str_type' variable.
+
+   if (str_type == IS_UNICODE) {
+   cp_len = u_countChar32(str, str_len);
+   } else {
+   cp_len = str_len;
+   }
+
+If the string is a Unicode one, we cannot rely on the str_len value to tell
+us the number of characters in it. Instead, we call u_countChar32() to
+obtain it.
+
+The next several lines normalize start and length parameters to fit within the
+string. Nothing new here. Then we locate the appropriate segment.
+
+   if (str_type == IS_UNICODE) {
+   int32_t start = 0, end = 0;
+   U16_FWD_N((UChar*)str, end, str_len, f);
+   start = end;
+   U16_FWD_N((UChar*)str, end, str_len, l);
+   RETURN_UNICODEL((UChar*)str + start, end-start, 1);
+
+Since codepoint (character) #n is not necessarily at offset #n in Unicode
+strings, we start at the beginning and iterate forward until we have gone
+through the required number of codepoints to reach the start of the segment.
+Then we save the location in 'start' and continue iterating through the number
+of codepoints specified by the length. Once that's done, we can return the
+segment as a Unicode string.
+
+   } else {
+   RETURN_STRINGL((char*)str + f, l, 1);
+   }
+
+For native and binary types, we can return the segment directly.
+
+
+
 References
 ==
 
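
The segment lookup in the substr() walkthrough can be sketched without ICU.
This is an illustration only: utf16_fwd_n() and utf16_substr() are
hypothetical names, uint16_t stands in for UChar, and utf16_fwd_n() merely
plays the role that U16_FWD_N has in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Advance offset i over up to n code points, never moving past len.
 * A high surrogate followed by a low surrogate counts as one code point. */
static size_t utf16_fwd_n(const uint16_t *s, size_t i, size_t len, size_t n)
{
    while (n > 0 && i < len) {
        if ((s[i] & 0xFC00) == 0xD800 && i + 1 < len
                                      && (s[i + 1] & 0xFC00) == 0xDC00) {
            i += 2;
        } else {
            i += 1;
        }
        n--;
    }
    return i;
}

/* Find the segment of cp_count code points that begins at code point
 * cp_start. The code unit offset of the segment is stored in *start and
 * its length in code units is returned. */
size_t utf16_substr(const uint16_t *s, size_t len,
                    size_t cp_start, size_t cp_count, size_t *start)
{
    size_t end;

    assert(s != NULL && start != NULL);
    *start = utf16_fwd_n(s, 0, len, cp_start);     /* skip to the segment */
    end = utf16_fwd_n(s, *start, len, cp_count);   /* walk over the segment */
    return end - *start;
}
```

Both offsets are measured in code units, which is what a caller needs in
order to copy the segment out of the buffer.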




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2005-09-14 Thread Andrei Zmievski
andrei  Wed Sep 14 14:01:41 2005 EDT

  Modified files:  
/php-srcREADME.UNICODE-UPGRADES 
  Log:
  
  
http://cvs.php.net/diff.php/php-src/README.UNICODE-UPGRADES?r1=1.3r2=1.4ty=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.3 php-src/README.UNICODE-UPGRADES:1.4
--- php-src/README.UNICODE-UPGRADES:1.3 Tue Sep 13 17:07:46 2005
+++ php-src/README.UNICODE-UPGRADES Wed Sep 14 14:01:41 2005
@@ -20,14 +20,6 @@
 IS_BINARY. The former one has its own storage in the value union part of
 zval (value.ustr) and the latter re-uses value.str.
 
-IS_UNICODE strings are in the UTF-16 encoding where 1 Unicode character may
-be represented by 1 or 2 UChar's. Each UChar is referred to as a code
-unit, and a full Unicode character as a code point. So, number of code
-units and number of code points for the same Unicode string may be
-different. The value.ustr.len is actually the number of code units. To
-obtain the number of code points, one can use u_counChar32() ICU API
-function or Z_USTRCPLEN() macro.
-
 Both types have new macros to set the zval value and to access it.
 
 Z_USTRVAL(), Z_USTRLEN()
@@ -120,6 +112,60 @@
 char *constant_name = colon + (UG(unicode)?UBYTES(2):2);
 
 
+Code Points and Code Units
+--
+
+Unicode type strings are in the UTF-16 encoding where 1 Unicode character
+may be represented by 1 or 2 UChar's. Each UChar is referred to as a code
+unit, and a full Unicode character as a code point. Consequently, the number
+of code units and the number of code points for the same Unicode string may
+be different. This has many implications, the most important of which is that
+you cannot simply index the UChar* string to get the desired codepoint.
+
+The zval's value.ustr.len actually contains the number of code units. To
+obtain the number of code points, one can use the u_countChar32() ICU API
+function or the Z_USTRCPLEN() macro.
+
+ICU provides a number of macros for working with UTF-16 strings on the
+codepoint level [2]. They allow you to do things like obtain a codepoint at
+a random code unit offset, move forward and backward over the string, etc.
+There are two versions of the iterator macros, *_SAFE and *_UNSAFE. It is
+strongly recommended to use the *_SAFE versions, since they handle unpaired
+surrogates and check for string boundaries. Here is an example of how to move
+through a UChar* string and work on codepoints.
+
+UChar *str = ...;
+int32_t str_len = ...;
+UChar32 codepoint;
+int32_t offset = 0;
+
+while (offset < str_len) {
+U16_NEXT(str, offset, str_len, codepoint);
+/* now we have the Unicode character in codepoint */
+}
+
+There is no macro to get a codepoint at a certain code point offset, but
+there is a Zend API function that does it.
+
+inline UChar32 zend_get_codepoint_at(UChar *str, int32_t length, int32_t n);
+
+To retrieve the 3rd codepoint, you would call:
+
+zend_get_codepoint_at(str, str_len, 3);
+
+If you have a UChar32 codepoint and need to put it into a UChar* string,
+there is another helper function, zend_codepoint_to_uchar(). It takes
+a single UChar32 and converts it to a UChar sequence (1 or 2 UChar's).
+
+UChar buf[8];
+UChar32 codepoint = 0x101a2;
+int8_t num_uchars;
+num_uchars = zend_codepoint_to_uchar(codepoint, buf);
+
+The return value is the number of resulting UChar's or 0, which indicates
+an invalid codepoint.
+
+
 Memory Allocation
 -
 
@@ -221,4 +267,6 @@
 
 [1] http://icu.sourceforge.net/apiref/icu4c/ustring_8h.html#a1
 
+[2] http://icu.sourceforge.net/apiref/icu4c/utf16_8h.html
+
 vim: set et ai tw=76 fo=tron21:
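
To make the two helpers described in the "Code Points and Code Units" section
more concrete, here is a rough standalone sketch of what such functions have
to do. The names codepoint_to_utf16() and utf16_code_point_at() are
hypothetical, uint16_t stands in for UChar, and two details are assumptions
rather than documented behaviour of the Zend functions: the index n is taken
to be zero-based, and surrogate values are rejected as invalid codepoints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode cp as UTF-16 into buf (room for 2 code units required).
 * Returns the number of code units written (1 or 2), or 0 if cp is not
 * a valid codepoint (assumption: surrogates count as invalid). */
int codepoint_to_utf16(uint32_t cp, uint16_t *buf)
{
    assert(buf != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {
        buf[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    buf[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    buf[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Fetch the n-th code point (zero-based, an assumption of this sketch)
 * of a UTF-16 string of len code units. Returns 1 and stores the code
 * point in *out on success, 0 if n is out of range. */
int utf16_code_point_at(const uint16_t *s, size_t len, size_t n, uint32_t *out)
{
    size_t i = 0;

    assert(s != NULL && out != NULL);
    while (i < len) {
        uint32_t cp = s[i++];
        if ((cp & 0xFC00) == 0xD800 && i < len && (s[i] & 0xFC00) == 0xDC00) {
            cp = 0x10000 + ((cp & 0x3FF) << 10) + (s[i++] & 0x3FF);
        }
        if (n-- == 0) {
            *out = cp;
            return 1;
        }
    }
    return 0;
}
```

Note that the lookup is linear in the string length, which is why indexing by
code point is more expensive than indexing by code unit.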




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2005-09-13 Thread Andrei Zmievski
andrei  Tue Sep 13 12:21:49 2005 EDT

  Added files: 
/php-srcREADME.UNICODE-UPGRADES 
  Log:
  Commit work in progress.
  
  

http://cvs.php.net/co.php/php-src/README.UNICODE-UPGRADES?r=1.1p=1
Index: php-src/README.UNICODE-UPGRADES
+++ php-src/README.UNICODE-UPGRADES
This document attempts to describe portions of the API related to the new
Unicode functionality and the best practices for upgrading existing
functions to support Unicode.

Your first stop should be README.UNICODE: it covers the general Unicode
functionality and concepts without going into technical implementation
details.

Working in Unicode World


Strings
---

A lot of internal functionality is controlled by the unicode_semantics
switch. Its value is found in the Unicode globals variable, UG(unicode). It
is either on or off for the entire request.

The big thing is that there are two new string types: IS_UNICODE and
IS_BINARY. The former one has its own storage in the value union part of
zval (value.ustr) and the latter re-uses value.str.

IS_UNICODE strings are in the UTF-16 encoding where 1 Unicode character may
be represented by 1 or 2 UChar's. Each UChar is referred to as a code
unit, and a full Unicode character as a code point. So, the number of code
units and the number of code points for the same Unicode string may be
different. The value.ustr.len is actually the number of code units. To
obtain the number of code points, one can use the u_countChar32() ICU API
function or Z_USTRCPLEN() macro.
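
The difference between the two lengths is easy to see in a few lines of C.
The function below is a hypothetical, ICU-free sketch of the kind of count
that u_countChar32() produces; uint16_t stands in for UChar, and an unpaired
surrogate is simply counted as one code point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 string of len code units. */
size_t utf16_count_code_points(const uint16_t *s, size_t len)
{
    size_t i = 0, count = 0;

    assert(s != NULL);
    while (i < len) {
        if ((s[i] & 0xFC00) == 0xD800 && i + 1 < len
                                      && (s[i + 1] & 0xFC00) == 0xDC00) {
            i += 2;   /* surrogate pair: two code units, one code point */
        } else {
            i += 1;
        }
        count++;
    }
    return count;
}
```

For the three code units 0x0061, 0xD800, 0xDDA2 (the letter 'a' followed by
U+101A2) the function returns 2, while the length in code units is 3.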

Both types have new macros to set the zval value and to access it.

Z_USTRVAL(), Z_USTRLEN()
 - accesses the value and length (in code units) of the Unicode type string

Z_BINVAL(), Z_BINLEN()
 - accesses the value and length of the binary type string

Z_UNIVAL(), Z_UNILEN()
 - accesses either Unicode or native string value, depending on the current
 setting of UG(unicode) switch. The Z_UNIVAL() type resolves to char*, so
 you may need to cast it appropriately.

Z_USTRCPLEN()
 - gives the number of codepoints in the Unicode type string

ZVAL_BINARY(), ZVAL_BINARYL()
 - Sets zval to hold a binary string. Takes the same parameters as
   ZVAL_STRING(), ZVAL_STRINGL().

ZVAL_UNICODE(), ZVAL_UNICODEL()
 - Sets zval to hold a Unicode string. Takes the same parameters as
   ZVAL_STRING(), ZVAL_STRINGL().

ZVAL_ASCII_STRING(), ZVAL_ASCII_STRINGL()
 - When UG(unicode) is off, it's equivalent to ZVAL_STRING(), ZVAL_STRINGL().
   When UG(unicode) is on, it sets zval to hold a Unicode representation of
   the passed-in ASCII string. It will always create a new string in
   UG(unicode)=1 case, so the value of the duplicate flag is not taken into
   account.

ZVAL_RT_STRING()
 - When UG(unicode) is off, it's equivalent to ZVAL_STRING(), ZVAL_STRINGL().
   When UG(unicode) is on, it takes the input string, converts it to Unicode
   using the runtime_encoding converter and sets zval to it. Since a new
   string is always created in this case, the value of the duplicate flag
   does not matter.

ZVAL_TEXT()
 - This macro sets the zval to hold either a Unicode or a normal string,
   depending on the value of UG(unicode). No conversion happens, so the
   argument has to be cast to (char*) when using this macro. One example of
   its usage would be to initialize zval to hold the name of a user
   function.
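
As a mental model for the accessor macros listed above, the following is a
heavily simplified, hypothetical picture of a value that keeps binary strings
in value.str and Unicode strings in value.ustr. The sketch_* and SKETCH_*
names are made up for this illustration and are not the engine's definitions;
the real zval has more members and different macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t SketchUChar;        /* stand-in for ICU's UChar */

enum sketch_type { SKETCH_IS_STRING, SKETCH_IS_BINARY, SKETCH_IS_UNICODE };

typedef struct {
    enum sketch_type type;
    union {
        struct { char *val; int len; } str;          /* native and binary strings */
        struct { SketchUChar *val; int len; } ustr;  /* Unicode, len in code units */
    } value;
} sketch_zval;

/* Accessors in the spirit of Z_USTRVAL()/Z_USTRLEN() and Z_BINVAL()/Z_BINLEN() */
#define SKETCH_USTRVAL(z) ((z).value.ustr.val)
#define SKETCH_USTRLEN(z) ((z).value.ustr.len)
#define SKETCH_BINVAL(z)  ((z).value.str.val)
#define SKETCH_BINLEN(z)  ((z).value.str.len)

/* Point the value at an existing Unicode buffer (no copy is made). */
void sketch_set_unicode(sketch_zval *z, SketchUChar *s, int len)
{
    assert(z != NULL);
    z->type = SKETCH_IS_UNICODE;
    z->value.ustr.val = s;
    z->value.ustr.len = len;
}

/* Point the value at an existing binary buffer; it re-uses value.str. */
void sketch_set_binary(sketch_zval *z, char *s, int len)
{
    assert(z != NULL);
    z->type = SKETCH_IS_BINARY;
    z->value.str.val = s;
    z->value.str.len = len;
}
```

Because both members share one union, code has to check the type tag before
deciding which accessor to use.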

There are, of course, related conversion macros.

convert_to_string_with_converter(zval *op, UConverter *conv)
 - converts a zval to native string using the specified converter, if necessary.

convert_to_binary()
 - converts a zval to binary string.

convert_to_unicode()
 - converts a zval to Unicode string.

convert_to_unicode_with_converter(zval *op, UConverter *conv)
 - converts a zval to Unicode string using the specified converter, if
   necessary.

convert_to_text(zval *op)
 - converts a zval to either Unicode or native string, depending on the
   value of UG(unicode) switch

zend_ascii_to_unicode() function can be used to convert an ASCII char*
string to Unicode. This is useful especially for inline string literals, in
which case you can simply use USTR_MAKE() macro, e.g.:
   
   UChar* ustr;

   ustr = USTR_MAKE("main");

If you need to initialize a few such variables, it may be more efficient to
use ICU macros, which avoid the conversion, depending on the platform. See
[1] for more information.
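
Converting an ASCII literal to UTF-16 is a plain widening of each byte. The
sketch below shows that idea with the C standard library only; the name
ascii_to_utf16() is hypothetical, uint16_t stands in for UChar, malloc() is
used instead of the engine's allocators, and the input is assumed to be 7-bit
ASCII.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Widen a NUL-terminated 7-bit ASCII string into a newly allocated,
 * NUL-terminated UTF-16 string. The length in code units is stored in
 * *out_len when out_len is not NULL. Returns NULL if allocation fails.
 * The caller frees the result with free(). */
uint16_t *ascii_to_utf16(const char *ascii, size_t *out_len)
{
    size_t len, i;
    uint16_t *u;

    assert(ascii != NULL);
    len = strlen(ascii);
    u = malloc((len + 1) * sizeof *u);
    if (u == NULL) {
        return NULL;
    }
    for (i = 0; i < len; i++) {
        /* every ASCII character has the same value as its code point */
        u[i] = (uint16_t)(unsigned char)ascii[i];
    }
    u[len] = 0;
    if (out_len != NULL) {
        *out_len = len;
    }
    return u;
}
```

Since every byte is converted at run time, this also illustrates why
compile-time alternatives can be cheaper when many literals are needed.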

USTR_FREE() can be used to free a UChar* string safely, since it checks for
NULL argument. USTR_LEN() takes either a UChar* or a char* argument,
depending on the UG(unicode) value, and returns its length. Cast the
argument to char* before passing it.

The list of functions that add new array values and add object properties
has also been expanded to include the new types. Please see zend_API.h for
full listing (add_*_ascii_string_*, add_*_rt_string_*, add_*_unicode_*,
add_*_binary_*).


Hashes
--

Hashes API has been upgraded to work with Unicode and binary strings. All
hash functions that worked with string keys now have their 

[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2005-09-13 Thread Andrei Zmievski
andrei  Tue Sep 13 16:24:06 2005 EDT

  Modified files:  
/php-srcREADME.UNICODE-UPGRADES 
  Log:
  
  
http://cvs.php.net/diff.php/php-src/README.UNICODE-UPGRADES?r1=1.1r2=1.2ty=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.1 php-src/README.UNICODE-UPGRADES:1.2
--- php-src/README.UNICODE-UPGRADES:1.1 Tue Sep 13 12:21:47 2005
+++ php-src/README.UNICODE-UPGRADES Tue Sep 13 16:24:02 2005
@@ -114,6 +114,29 @@
 full listing (add_*_ascii_string_*, add_*_rt_string_*, add_*_unicode_*,
 add_*_binary_*).
 
+UBYTES() macro can be used to obtain the number of bytes necessary to store
+the given number of UChar's. The typical usage is:
+  
+char *constant_name = colon + (UG(unicode)?UBYTES(2):2);
+
+
+Memory Allocation
+-
+
+For ease of use and to reduce possible bugs, there are memory allocation
+functions specific to Unicode strings. Please use them at all times when
+allocating UChar's.
+
+eumalloc(size)
+eurealloc(ptr, size)
+eustrndup(s, length)
+eustrdup(s)
+
+peumalloc(size, persistent) 
+peurealloc(ptr, size, persistent) 
+
+The size parameter refers to the number of UChar's, not bytes.
+
 
 Hashes
 --
@@ -135,6 +158,22 @@
 version. It returns the key as a char* pointer; you can cast it
 appropriately based on the key type.
 
+Identifiers and Class Entries
+-
+
+In Unicode mode all the identifiers are Unicode strings. This means that
+while various structures such as zend_class_entry, zend_function, etc. store
+the identifier name as a char* pointer, it will actually point to a UChar*
+string. Be careful when accessing the names of classes, functions, and such
+-- always check UG(unicode) before using them.
+
+In addition, zend_class_entry has a u_twin field that points to its Unicode
+counterpart in UG(unicode) mode. Use U_CLASS_ENTRY() macro to access the
+correct class entry, e.g.:
+
+ce = U_CLASS_ENTRY(default_exception_ce);
+
+
 References
 ==
 
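
The point about sizes being counted in UChar's rather than bytes can be shown
with a small standalone sketch. SKETCH_UBYTES(), sketch_umalloc() and
sketch_ustrndup() are hypothetical stand-ins written against malloc(); they
only mirror the idea behind UBYTES(), eumalloc() and eustrndup() and make the
assumption that a duplicated string gets a terminating zero code unit.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Number of bytes needed to store n UTF-16 code units. */
#define SKETCH_UBYTES(n) ((n) * sizeof(uint16_t))

/* Allocate room for 'units' code units (not bytes). */
uint16_t *sketch_umalloc(size_t units)
{
    return malloc(SKETCH_UBYTES(units));
}

/* Duplicate the first 'units' code units of s and append a zero code
 * unit (assumption of this sketch). Returns NULL if allocation fails. */
uint16_t *sketch_ustrndup(const uint16_t *s, size_t units)
{
    uint16_t *copy;

    assert(s != NULL);
    copy = sketch_umalloc(units + 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, s, SKETCH_UBYTES(units));
    copy[units] = 0;
    return copy;
}
```

Keeping the unit-to-byte conversion inside the allocator is what removes the
usual off-by-a-factor-of-two mistakes from calling code.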




[PHP-CVS] cvs: php-src / README.UNICODE-UPGRADES

2005-09-13 Thread Andrei Zmievski
andrei  Tue Sep 13 17:07:47 2005 EDT

  Modified files:  
/php-srcREADME.UNICODE-UPGRADES 
  Log:
  
  
http://cvs.php.net/diff.php/php-src/README.UNICODE-UPGRADES?r1=1.2r2=1.3ty=u
Index: php-src/README.UNICODE-UPGRADES
diff -u php-src/README.UNICODE-UPGRADES:1.2 php-src/README.UNICODE-UPGRADES:1.3
--- php-src/README.UNICODE-UPGRADES:1.2 Tue Sep 13 16:24:02 2005
+++ php-src/README.UNICODE-UPGRADES Tue Sep 13 17:07:46 2005
@@ -158,6 +158,7 @@
 version. It returns the key as a char* pointer; you can cast it
 appropriately based on the key type.
 
+
 Identifiers and Class Entries
 -
 
@@ -174,6 +175,47 @@
 ce = U_CLASS_ENTRY(default_exception_ce);
 
 
+Formatted Output
+
+
+Since UTF-16 strings frequently contain NULL bytes, you cannot simply use
+%s format to print them out. Towards that end, output functions such as
+php_printf(), spprintf(), etc now have three different formats for use with
+Unicode strings:
+
+  %r
+This format treats the corresponding argument as a Unicode string. The
+string is automatically converted to the output encoding. If you wish to
+apply a different converter to the string, use %*r and pass the
+converter before the string argument.
+
+UChar *class_name = USTR_NAME("ReflectionClass");
+zend_printf("%r", class_name);
+
+  %R
+This format requires at least two arguments: the first one specifies the
+type of the string to follow (IS_STRING or IS_UNICODE), and the second
+one - the string itself. If the string is of Unicode type, it is
+automatically converted to the output encoding. If you wish to apply
+a different converter to the string, use %*R and pass the converter
+before the string argument.
+
+zend_throw_exception_ex(U_CLASS_ENTRY(reflection_exception_ptr), 0 TSRMLS_CC,
+"Interface %R does not exist",
+Z_TYPE_P(class_name), Z_UNIVAL_P(class_name));
+
+  %v
+This format takes only one parameter, the string, but the expected
+string type depends on the UG(unicode) value. If the string is of
+Unicode type, it is automatically converted to the output encoding. If
+you wish to apply a different converter to the string, use %*v and pass
+the converter before the string argument.
+
+zend_error(E_WARNING, "%v::__toString() did not return anything",
+Z_OBJCE_P(object)->name);
+
+
+
 References
 ==
 
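
Why %s is not usable for UTF-16 data can be demonstrated with strlen(), which
stops at the first zero byte just like %s does. The helper name below is
hypothetical and uint16_t stands in for UChar; the sketch assumes the string
contains at least one character below U+0100, so that a zero byte is present.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Number of bytes a byte-oriented consumer such as %s would see before
 * it hits the first zero byte in a UTF-16 buffer. */
size_t bytes_visible_to_percent_s(const uint16_t *s)
{
    assert(s != NULL);
    return strlen((const char *)s);
}
```

For the two-character string "AB" stored as the code units 0x0041, 0x0042, 0
the function reports 1 byte on a little-endian machine and 0 bytes on a
big-endian one, although the string occupies 4 bytes of character data.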
