[PHP-CVS] cvs: php-src / README.UNICODE

Andrei Zmievski Wed, 10 Jan 2007 15:17:03 -0800

andrei          Wed Jan 10 23:16:40 2007 UTC

  Modified files:              
    /php-src    README.UNICODE 
  Log:
  Update with rewrites by me and Evan G.

http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.7&r2=1.8&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.7 php-src/README.UNICODE:1.8
--- php-src/README.UNICODE:1.7  Fri Dec 15 23:33:48 2006
+++ php-src/README.UNICODE      Wed Jan 10 23:16:40 2007
@@ -1,133 +1,111 @@
+Audience
+========
+
+This README describes how PHP 6 provides native support for the Unicode 
+Standard. Readers of this document should be proficient with PHP and have a
+basic understanding of Unicode concepts. For more technical details about
+PHP 6 design principles and for guidelines about writing Unicode-ready PHP 
+extensions, refer to README.UNICODE-UPGRADES.
+
 Introduction
 ============
 
-As successful as PHP has proven to be in the past several years, it is still
-the only remaining member of the P-trinity of scripting languages - Perl and
-Python being the other two - that remains blithely ignorant of the
-multilingual and multinational environment around it. The software
-development community has been moving towards Unicode Standard for some time
-now, and PHP can no longer afford to be outside of this movement. Surely,
-some steps have been taken recently to allow for easier processing of
-multibyte data with the mbstring extension, but it is not enabled in PHP by
-default and is not as intuitive or transparent as it could be.
-
-The basic goal of this document is to describe how PHP 6 will support the
-Unicode Standard natively. Since the full implementation of the Unicode
-Standard is very involved, the idea is to use the already existing,
-well-tested, full-featured, and freely available ICU (International
-Components for Unicode) library. This will allow us to concentrate on the
-details of PHP integration and speed up the implementation.
+As successful as PHP has proven to be over the years, its support for
+multilingual and multinational environments has languished. PHP can no
+longer afford to remain outside the overall movement towards the Unicode
+standard.  Although recent updates involving the mbstring extension have
+enabled easier multibyte data processing, this does not constitute native
+Unicode support.
+
+Since the full implementation of the Unicode Standard is very involved, our
+approach is to speed up implementation by using the well-tested,
+full-featured, and freely available ICU (International Components for
+Unicode) library.
+
 
 General Remarks
 ===============
 
-Backwards Compatibility
------------------------
-Throughout the design and implementation of Unicode support, backwards
-compatibility must be of paramount concern. PHP is used on an enormous number of
-sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
-that the existing data types and functions must work as they have always
-done. However, the speed of certain operations may be affected, due to
-increased complexity of the code overall.
-
-Unicode Encoding
-----------------
-The initial version will not support Byte Order Mark. Text processing will
-generally perform better if the characters are in Normalization Form C.
-
-
-Implementation Approach
-=======================
-
-The implementation is done in phases. This allows for more basic and
-low-level implementation issues to be ironed out and tested before
-proceeding to more advanced topics.
-
-Legend:
- - TODO
- + finished
- * in progress
-
-  Phase I
-  -------
-    + Basic Unicode string support, including instantiation, concatenation,
-      indexing
-
-    + Simple output of Unicode strings via 'print' and 'echo' statements
-      with appropriate output encoding conversion
-
-    + Conversion of Unicode strings to/from various encodings via encode() and
-      decode() functions
-
-    + Determining length of Unicode strings via strlen() function, some
-      simple string functions ported (substr).
-
+International Components for Unicode
+------------------------------------
 
-  Phase II
-  --------
-    * HTTP input request decoding
+ICU (International Components for Unicode) is a mature, widely used set of
+C/C++ and Java libraries for Unicode support, software internationalization
+and globalization. It provides:
+
+  - Encoding conversions
+  - Collations
+  - Unicode text processing
+  - and much more
+
+When building PHP 6, Unicode support is always enabled. The only
+configuration option during development should be the location of the ICU
+headers and libraries.
 
-    + Fixing remaining string-aware operators (assignment to [] etc)
-
-    + Support for Unicode and binary strings in PHP streams
-
-    + Support for Unicode identifiers
-
-    + Configurable handling of conversion failures
-
-    + \C{} escape sequence in strings
-
-
-  Phase III
-  ---------
-    * Exposing ICU API
+  --with-icu-dir=<dir>
+  
+where <dir> specifies the location of ICU header and library files. If you do
+not specify this option, PHP attempts to find ICU under /usr and /usr/local.
 
-    * Porting all remaining functions to support Unicode and/or binary
-      strings
+NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
+http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher. 
 
+Backwards Compatibility
+-----------------------
+Our paramount concern for providing Unicode support is backwards compatibility.
+Because PHP is used on so many sites, existing data types and functions must
+work as they always have. However, although PHP's interfaces must remain
+backwards-compatible, the speed of certain operations might be affected due to
+internal implementation changes.
 
 Encoding Names
-==============
-All the encoding settings discussed in this document accept any valid
-encoding name supported by ICU. See ICU online documentation for the full
-list of encodings.
+--------------
+All the encoding settings discussed in this document can accept any valid
+encoding name supported by ICU. For a full list of encodings, refer to the ICU
+online documentation.
 
+NOTE: References to "Unicode" in this document generally mean the UTF-16
+character encoding, unless explicitly stated otherwise.
 
 Unicode Semantics Switch
 ========================
 
-Obviously, PHP cannot simply impose new Unicode support on everyone. There
-are many applications that do not care about Unicode and do not need it.
-Consequently, there is a switch that enables certain fundamental language
-changes related to Unicode. This switch is available only as a site-wide (per
-virtual server) INI setting.
-
-Note that having switch turned off does not imply that PHP is unaware of Unicode
-at all and that no Unicode strings can exist. It only affects certain aspects of
-the language, and Unicode strings can always be created programmatically. All
-the functions and operators will still support Unicode strings and work
-appropriately.
-
-    unicode.semantics = On
+Because many applications do not require Unicode, PHP 6 provides a server-wide
+INI setting to enable Unicode support:
 
+  unicode.semantics = On/Off
 
-Internal Encoding
-=================
-
-UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
-two bytes for any Unicode character in the Basic Multilingual Plane, which
-is where most of the current world's languages are represented. While being
-less memory efficient for basic ASCII text it simplifies the processing and
-makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
-processing as well.
+This switch is off by default. If your applications do not require native
+Unicode support, you may leave this switch off, and continue to use Unicode
+strings only when you need to. 
+
+However, if your application is ready to fully support Unicode, you should 
+turn this switch on. This activates various Unicode support mechanisms, 
+including:
+
+  * All string literals become Unicode
+  * All variables received from HTTP requests become Unicode
+  * PHP identifiers may use Unicode characters
+
+More fundamentally, your PHP environment is now a Unicode environment.  Strings
+inside PHP are Unicode, and the system is responsible for converting non-Unicode
+strings on PHP's periphery (for example, in HTTP input and output, streams, and
+filesystem operations). With unicode.semantics on, you must specify binary
+strings explicitly. PHP makes no assumptions about the content of a binary
+string, so your application must handle all binary strings appropriately.
+
+Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
+String literals do not become Unicode, and files are binary strings for
+backwards compatibility. You can always create Unicode strings programmatically,
+and all functions and operators support Unicode strings transparently.
 
 
 Fallback Encoding
 =================
 
-This setting specifies the "fallback" encoding for all the other ones. So if
-a specific encoding setting is not set, PHP defaults it to the fallback
-encoding. If the fallback_encoding is not specified either, it is set to
+The fallback encoding provides a default value for all other unicode.*_encoding
+INI settings. If you do not set a particular unicode.*_encoding setting, PHP
+uses the fallback encoding. If you do not specify a fallback encoding, PHP uses
 UTF-8.
 
   unicode.fallback_encoding = "iso-8859-1"
@@ -136,114 +114,202 @@
 Runtime Encoding
 ================
 
-Currently PHP neither specifies nor cares what the encoding of its strings
-is. However, the Unicode implementation needs to know what this encoding is
-for several reasons, including explicit (casting) and implicit (concatenation,
-comparison, parameter passing) type coersions. This setting specifies the
-runtime encoding.
+The runtime encoding specifies the encoding PHP uses for converting binary 
+strings within the PHP engine itself. 
 
   unicode.runtime_encoding = "iso-8859-1"
 
+This setting has no effect on I/O-related operations such as writing to 
+standard out, reading from the filesystem, or decoding HTTP input variables.
+
+PHP enables you to explicitly convert strings using casting:
+
+  * (binary) -- casts to binary string type
+  * (unicode) -- casts to Unicode string type
+  * (string) -- casts to Unicode string type if unicode.semantics is on,
+    to binary otherwise
+
+For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a unicode
+string, then
+
+  $str = (binary)$uni
+
+creates a binary string $str in the ISO-8859-1 encoding.
+
+Implicit conversions include concatenation, comparison, and parameter passing.
+For better precision, PHP attempts to convert strings to Unicode before
+performing these sorts of operations. For example, if we concatenate our binary
+string $str with a unicode literal, PHP converts $str to Unicode first, using
+the encoding specified by unicode.runtime_encoding.
 
 Output Encoding
 ===============
 
-Automatic output encoding conversion is supported on the standard output
-stream.  Therefore, commands such as 'print' and 'echo' automatically convert
-their arguments to the specified encoding. No automatic output encoding is
-performed for anything else. Therefore, when writing to files or external
-resources, the developer has to manually encode the data using functions
-provided by the unicode extension or rely on stream encoding features
-
-The existing default_charset setting so far has been used only for
-specifying the charset portion of the Content-Type MIME header. For several
-reasons, this setting is deprecated. Now it is only used when the Unicode
-semantics switch is disabled and does not affect the actual transcoding of
-the output stream. The output encoding setting takes precedence in all other
-cases. If the output encoding is set, PHP will automatically add 'charset'
-portion to the Conten-Type header.
+PHP automatically converts output for commands that write to the standard 
+output stream, such as 'print' and 'echo'.
 
   unicode.output_encoding = "utf-8"
 
+However, PHP does not convert binary strings. When writing to files or external
+resources, you must rely on stream encoding features or manually encode the data
+using functions provided by the unicode extension.
+
+The existing default_charset INI setting is DEPRECATED in favor of
+unicode.output_encoding. Previously, default_charset only specified the charset
+portion of the Content-Type MIME header. Now default_charset only takes effect
+when unicode.semantics is off, and it does not affect the actual transcoding of
+the output stream. Setting unicode.output_encoding causes PHP to add the
+'charset' portion to the Content-Type header, overriding any value set for
+default_charset.
+
 
 HTTP Input Encoding
 ===================
 
-There will be no explicit input encoding setting. Instead, PHP will rely on a
-couple of heuristics to determine what encoding the incoming request might be
-in. Firstly, PHP will attempt to decode the input using the value of the
-unicode.output_encoding setting, because that is the most logical choice if we
-assume that the clients send the data back in the encoding that the page with
-the form was in. If that is unsuccessful, we could fallback on the "_charset_"
-form parameter, if present. This parameter is sent by IE (and possibly Firefox)
-along with the form data and indicates the encoding of the request. Note that
-this parameter will be present only if the form contains a hidden field named
-"_charset_".
-
-The variables that are decoded successfully will be put into the request arrays
-as Unicode strings, those that fail -- as binary strings. PHP will set a
-flag (probably in the $_SERVER array) indicating that there were problems during
-the conversion. The user will have access to the raw input in case of
-failure via the input filter extension and can to access the request parameters
-via input_get_arg() function. The input filter extension always looks in
-the raw input data and not in the request arrays, and input_get_arg() has a
-'charset' parameter that can be specified to tell PHP what charset the incoming
-data is in. This kills two birds with one stone: users have access to request
-arrays data on successful decoding as well as a standard and secure way to get
-at the data in case of failed decoding.
+The HTTP input encoding specifies the encoding of variables received via
+HTTP, such as the contents of the $_GET and $_POST arrays.
+
+This functionality is currently under development. For a discussion of the
+approach that the PHP 6 team is taking, refer to:
+
+http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2
+
+
+Filesystem Encoding
+===================
+
+The filesystem encoding specifies the encoding of file and directory names
+on the filesystem. 
+
+  unicode.filename_encoding = "utf-8"
+
+Filesystem-related functions such as opendir() perform this conversion when 
+accepting and returning file names. You should set the filename encoding to 
+the encoding used by your filesystem. 
 
 
 Script Encoding
 ===============
 
-PHP scripts may be written in any encoding supported by ICU. The encoding
-of the scripts can be specified site-wide via an INI directive, or with a
-'declare' pragma at the beginning of the script.  The reason for pragma is that
-an application written in Shift-JIS, for example, should be executable on a
-system where the INI directive cannot be changed by the application itself. The
-pragma setting is valid only for the script it occurs in, and does not propagate
-to the included files.
+You may write PHP scripts in any encoding supported by ICU. To specify the
+script encoding site-wide, use the INI setting:
 
-  pragma:
-   <?php declare(encoding = 'utf-8'); ?>
-
-  INI setting:
    unicode.script_encoding = utf-8
 
+If you cannot change the encoding system-wide, you can use a pragma to
+override the INI setting in a local script:
+
+   <?php declare(encoding = 'Shift-JIS'); ?>
+
+The pragma setting must be the first statement in the script. It only affects 
+the script in which it occurs, and does not propagate to any included files. 
+
 
 INI Files
 =========
 
-INI files will be presumed to contain UTF-8 encoded keys and values when the
-Unicode semantics mode is On. When the mode is off, the data is taken as-is,
+If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded 
+keys and values. If unicode.semantics is off, the data is taken as-is,
 similar to PHP 5. No validation occurs during parsing. Instead invalid UTF-8
 sequences are caught during access by ini_*() functions.
 
 
-Conversion Semantics
-====================
+Stream I/O
+==========
+
+PHP has a streams-based I/O system for generalized filesystem access, 
+networking, data compression, and other operations. Since the data on the 
+other end of the stream can be in any encoding, you need to think about 
+data conversion. 
+
+Okay, this needs to be clarified. By "default", streams are actually
+opened in binary mode. You have to specify 't' flag or use FILE_TEXT in
+order to open it in text mode, where conversions apply. And for the text
+mode streams, the default stream encoding is UTF-8 indeed.
+
+By default, PHP opens streams in binary mode. To open a file in text mode,
+you must use the 't' flag (or the FILE_TEXT parameter -- see below). The 
+default encoding for streams in text mode is UTF-8. This means that if 
+'file.txt' is a UTF-8 text file, this code snippet:
+
+  $fp = fopen('file.txt', 'rt');
+  $str = fread($fp, 100);
+
+returns 100 Unicode characters, while: 
+
+  $fp = fopen('file.txt', 'wt');
+  fwrite($fp, $uni);
+
+writes to a UTF-8 text file.
+
+If you mainly work with files in an encoding other than UTF-8, you can
+change the default context encoding setting:
+
+  stream_default_encoding('Shift-JIS');
+  $data = file_get_contents('file.txt', FILE_TEXT);
+  // work on $data
+  file_put_contents('file.txt', $data, FILE_TEXT);
+
+The file_get_contents() and file_put_contents() functions now accept an
+additional parameter, FILE_TEXT. If you provide FILE_TEXT for
+file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
+returns a binary string (which would be appropriate for true binary data, such
+as an image file). When writing a Unicode string with file_put_contents(), you
+must supply the FILE_TEXT parameter, or PHP generates a warning. 
+
+If you need to work with multiple encodings, you can create custom contexts
+using stream_context_create() and then pass in the custom context as an
+additional parameter. For example: 
+
+  $ctx = stream_context_create(NULL, array('encoding' => 'big5'));
+  $data = file_get_contents('file.txt', FILE_TEXT, $ctx);
+  // work on $data
+  file_put_contents('file.txt', $data, FILE_TEXT, $ctx);
 
-Not all characters can be converted between Unicode and legacy encodings.
-Normally, when downconverting from Unicode, the default behavior of ICU
-converters is to substitute the missing sequence with the appropriate
-substitution sequence for that codepage, such as 0x1A (Control-Z) in
-ISO-8859-1. When upconverting to Unicode, if an encoding has a character
-which cannot be converted into Unicode, that sequence is replaced by the
-Unicode substitution character (U+FFFD).
 
-The conversion error behavior can be customized:
+Conversion Semantics and Error Handling
+=======================================
+
+PHP can convert strings explicitly (casting) and implicitly (concatenation,
+comparison, and parameter passing). For example, when concatenating a Unicode
+string and a binary string, PHP converts the binary string to Unicode for better
+precision.
+
+However, not all characters can be converted between Unicode and legacy 
+encodings. The first possibility is that a string contains corrupt data or
+an illegal byte sequence. In this case, the converter simply stops with 
+a message that resembles:
+
+  Warning: Could not convert binary string to Unicode string
+  (converter UTF-8 failed on bytes (0xE9) at offset 2)
+
+Conversely, if a similar error occurs when attempting to convert Unicode to
+a legacy string, the converter generates a message that resembles:
+
+  Warning: Could not convert Unicode string to binary string  (converter ISO-8859-1 failed on character {U+DC00} at offset 2)
+
+To customize this behavior, refer to "Creating a Custom Error Handler" below.
+
+The second possibility is that a Unicode character simply cannot be represented
+in the legacy encoding. By default, when downconverting from Unicode, the
+converter substitutes any missing sequences with the appropriate substitution
+sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When
+upconverting to Unicode, the converter replaces any byte sequence that has no
+Unicode equivalent with the Unicode substitution character (U+FFFD). 
+
+You can customize the conversion error behavior to:
 
   - stop the conversion and return an empty string
   - skip any invalid characters
   - substibute invalid characters with a custom substitution character
   - escape the invalid character in various formats
 
-The global conversion error settings can be controlled with these two functions:
+To control the global conversion error settings, use the functions:
 
   unicode_set_error_mode(int direction, int mode)
   unicode_set_subst_char(unicode char)
 
-Where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
+where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
 constants:
 
   U_CONV_ERROR_STOP
@@ -255,31 +321,102 @@
   U_CONV_ERROR_ESCAPE_XML_DEC
   U_CONV_ERROR_ESCAPE_XML_HEX
 
-Substitution character can be set only for FROM_UNICODE direction and has to
-exist in the target character set.
+As an example, with a runtime encoding of ISO-8859-1, the conversion:
+
+  $str = (binary)"< \u30AB >";
+
+results in:
+
+  MODE                    RESULT
+  --------------------------------------
+  stop                    ""
+  skip                    "<   >"
+  substitute              "< ? >"
+  escape (Unicode)        "< {U+30AB} >"
+  escape (ICU)            "< %U30AB >"
+  escape (Java)           "< \u30AB >"
+  escape (XML decimal)    "< &#12459; >"
+  escape (XML hex)        "< &#x30AB; >"
+
+With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:
+
+  $str = (unicode)b"< \xe9\xfe >";
+
+results in:
+
+  MODE                    RESULT
+  --------------------------------------
+  stop                    ""
+  skip                    ""
+  substitute              ""
+  escape (Unicode)        "< %XE9%XFE >"
+  escape (ICU)            "< %XE9%XFE >"
+  escape (Java)           "< \xE9\xFE >"
+  escape (XML decimal)    "< &#233;&#254; >"
+  escape (XML hex)        "< &#xE9;&#xFE; >"
+
+The substitution character can be set only for FROM_UNICODE direction and has to
+exist in the target character set. The default substitution character is (?). 
+
+NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To convert
+using an alternative encoding, use the unicode_encode() and unicode_decode()
+functions. For example,
+
+  $str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);
+
+results in a binary KOI8-R encoded string. 
+
+Creating a Custom Error Handler
+-------------------------------
+If an error occurs during the conversion, PHP outputs a warning describing the
+problem. Instead of this default behavior, PHP can invoke a user-provided error
+handler, similar to how the current user-defined error handler works.  To set
+the custom conversion error handler, call:
+
+  mixed unicode_set_error_handler(callback error_handler)
+
+The function returns the previously defined custom error handler. If no error
+handler was defined, or if an error occurs when returning the handler, this 
+function returns NULL.
+
+When the custom handler is set, the standard error handler is bypassed. It is
+the responsibility of the custom handler to output or log any messages, raise
+exceptions, or die(), if necessary. However, if the custom error handler returns
+FALSE, the standard handler will be invoked afterwards.
+
+The user function specified as the error_handler must accept five parameters:
+
+  mixed error_handler($direction, $encoding, $char_or_byte, $offset, 
+  $message)
+
+where:
+
+  $direction    - the direction of conversion, FROM_UNICODE/TO_UNICODE
+
+  $encoding     - the name of the encoding to/from which the conversion
+                  was attempted
+
+  $char_or_byte - either Unicode character or byte sequence (depending
+                  on direction) which caused the error
+
+  $offset       - the offset of the failed character/byte sequence in
+                  the source string
+
+  $message      - the error message describing the problem
+
+NOTE: If the error mode set by unicode_set_error_mode() is substitute, 
+skip, or escape, the handler won't be called, since these are non-error
+causing operations. To always invoke your handler, set the error mode to
+U_CONV_ERROR_STOP.
 
 
 Unicode String Type
 ===================
 
-Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
-UTF-16 format. It is the main string type in PHP when Unicode semantics
-switch is turned on. Unicode strings can exist when the switch is off, but
-they have to be produced programmatically, via calls to functions that
-return Unicode type.
-
-The operational unit when working with Unicode strings is a code point, not
-code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code
-units, each of which is a 16-bit word. Working on the code point level is
-necessary because doing otherwise would mean offloading the processing of
-surrogate pairs onto PHP users, and that is less than desirable.
-
-The repercussions are that one cannot expect code point N to be at offset N in
-the Unicode string. Instead, one has to iterate from the beginning from the
-string using U16_FWD() macro until the desired codepoint is reached. This will
-be transparent to the end user who will work only with "character" offsets.
-
-The codepoint access is one of the primary areas targeted for optimization.
+The Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
+UTF-16. This is the main string type in PHP when Unicode semantics switch is
+turned on. Unicode strings can exist when the switch is off, but they have to be
+produced programmatically via calls to functions that return Unicode types.
 
 
 Binary String Type
@@ -294,108 +431,48 @@
 Printing binary data to the standard output passes it through as-is, independent
 of the output encoding.
 
-
-Zval Structure Changes
-======================
-
-PHP is a type-agnostic language. Its data values are encapsulated in a zval
-(Zend value) structure that can change as necessary to accomodate various types.
-
-struct _zval_struct {
-    /* Variable information */
-    union {
-        long lval;                  /* long value */
-        double dval;                /* double value */
-        struct {
-            char *val;
-            int len;
-        } str;                      /* string value */
-        HashTable *ht;              /* hash table value */
-        zend_object_value obj;      /* object value */
-    } value;
-    zend_uint refcount;
-    zend_uchar type;                /* active type */
-    zend_uchar is_ref;
-};
-
-The type field determines what is stored in the union, IS_STRING being the only
-data type pertinent to this discussion. In the current version, the strings
-are binary-safe, but, for all intents and purposes, are assumed to be
-comprised of 8-bit characters. It is possible to treat the string value as
-an opaque type containing arbitrary binary data, and in fact that is how
-mbstring extension uses it, in order to store multibyte strings.  However,
-many extensions and the Zend engine itself manipulate the string value
-directly without regard to its internals. Needless to say, this can lead to
-problems.
-
-For IS_UNICODE type, we need to add another structure to the union:
-
-    union {
-    ....
-        struct {
-            UChar *val;            /* Unicode string value */
-            int len;               /* number of UChar's */
-        } ustr;
-    ....
-    } value;
-
-This cleanly separates the two types of strings and helps preserve backwards
-compatibility.
-
-To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
-another structure:
-
-    union {
-    ....
-        struct {                    /* Universal string type */
-            zstr val;
-            int len;
-        } uni;
-    ....
-    } value;
-
-Where zstr ia union of char*, UChar*, and void*.
-
+For examples of specifying binary string literals, refer to the section 
+"Language Modifications".
 
 Language Modifications
 ======================
 
-If a Unicode switch is turned on, PHP string literals - single-quoted,
-double-quoted, and heredocs -  become Unicode strings (IS_UNICODE type).
-They support all the same escape sequences and variable interpolations as
-previously, with the addition of some new escape sequences.
+If the Unicode switch is turned on, PHP string literals -- single-quoted,
+double-quoted, and heredocs -- become Unicode strings (IS_UNICODE type).  String
+literals support all the same escape sequences and variable interpolations as
+before, plus several new escape sequences.
 
-The contents of the strings are interpreted as follows:
+PHP interprets the contents of strings as follows:
 
   - all non-escaped characters are interpreted as a corresponding Unicode
-    codepoint based on the current script encoding, e.g. ASCII 'a' (0x51) =>
-    U+0061, Shift-JIS (0x92 0x69) => U+4E2D
+    codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
+    U+0061, Shift-JIS (0x92 0x86) => U+4E2D
  
   - existing PHP escape sequences are also interpreted as Unicode codepoints,
     including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
 
-  - two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or
+  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4 or
     6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
-    U+10410
-
+    U+10410. (Having two sequences avoids the ambiguity of \u020608 --
+    is that supposed to be U+0206 followed by "08", or U+020608 ?)
+    
   - a new escape sequence allows specifying a character by its full
     Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
 
-The single-quoted string is more restrictive than the other two types: so
-far the only escape sequence allowed inside of it was \', which specifies
-a literal single quote. However, single quoted strings now support the new
-Unicode character escape sequences as well.
+The single-quoted string is more restrictive than the other two types. So far
+the only escape sequence allowed inside of it was \', which specifies a literal
+single quote. However, single quoted strings now support the new Unicode
+character escape sequences as well.
 
 PHP allows variable interpolation inside the double-quoted and heredoc strings.
 However, the parser separates the string into literal and variable chunks during
-compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
-literal chunks can be handled in the normal way for as far as Unicode
-support is concerned.
-
-Since all string literals become Unicode by default, one loses the ability
-to specify byte-oriented or binary strings. In order to create binary string
-literals, a new syntax is necessary: prefixing a string literal with letter
-'b' creates a binary string.
+compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that PHP
+can handle literal chunks in the normal way as far as Unicode support is
+concerned.
+
+Since all string literals become Unicode by default, PHP 6 introduces new syntax
+for creating byte-oriented or binary strings. Prefixing a string literal with
+the letter 'b' creates a binary string:
 
     $var = b'abc\001';
     $var = b"abc\001";
@@ -403,235 +480,136 @@
       abc\001
     EOD;
 
-The binary string literals support the same escape sequences as the current
-PHP strings. If the Unicode switch is turned off, then the binary string
-literals generate normal string (IS_STRING) type internally, without any
-effect on the application.
-
-The string operators have been changed to accomodate the new IS_UNICODE and
-IS_BINARY types. In more detail:
-
-  - The concatenation (.) operator has been changed to automatically coerce
-    IS_STRING type to the more precise IS_UNICODE if its operands are of two
-    different string types.
-
-  - The concatenation assignment operator (.=) has been changed similarly.
-
-  - The string indexing operator [] has been changed to accomodate IS_UNICODE
-    type strings and extract the specified character. Note that the index
-    specifies a code point, not a byte, or a code unit, thus supporting
-    supplementary characters.
-
-  - Both Unicode and binary string types can be used as array keys. If the
-    Unicode switch is on, the binary keys are converted to Unicode.
+The content of a binary string is the literal byte sequence inside the
+delimiters, which depends on the script encoding (unicode.script_encoding).
+Binary string literals support the same escape sequences as PHP 5 strings. If
+the Unicode switch is turned off, then the binary string literals generate the
+normal string (IS_STRING) type internally without any effect on the application.
+
+The string operators now accommodate the new IS_UNICODE and IS_BINARY types:
+
+  - The concatenation operator (.) and concatenation assignment operator (.=)
+    automatically coerce the IS_STRING type to the more precise IS_UNICODE if
+    the operands are of different string types.
+
+  - The string indexing operator [] now accommodates IS_UNICODE type strings 
+    and extracts the specified character. To support supplementary characters,
+    the index specifies a code point, not a byte or a code unit.
 
   - Bitwise operators and increment/decrement operators do not work on
     Unicode strings. They do work on binary strings.
 
   - Two new casting operators are introduced, (unicode) and (binary). The
-    (string) operator will cast to Unicode type if the Unicode semantics switch is
+    (string) operator casts to Unicode type if the Unicode semantics switch is
     on, and to binary type otherwise.
 
-  - The comparison operators when applied to Unicode strings, perform
-    comparison in binary code point order. They also do appropriate coersion
-    if the strings are of differing types.
+  - The comparison operators compare Unicode strings in binary code point 
+    order. They also coerce strings to Unicode if the strings are of different 
+    types.
 
   - The arithmetic operators use the same semantics as today for converting
     strings to numbers. A Unicode string is considered numeric if it
-    represents a long or a double number in en_US_POSIX locale.
+    represents a long or a double number in the en_US_POSIX locale.
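
The difference between bytes, code units, and code points in the indexing rule
above can be illustrated with a small sketch. It does not use the PHP 6 engine
described here. It assumes a PHP 7 or later interpreter with UTF-8 strings, the
"\u{...}" escape, and the core PCRE functions; code_points() is a hypothetical
helper introduced only for this illustration.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of code points,
// using PCRE's /u (UTF-8) mode.
function code_points(string $utf8): array
{
    return preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// "a", "b", and the supplementary character U+10000.
$s   = "ab\u{10000}";
$cps = code_points($s);

// A code point that takes 4 bytes in UTF-8 takes 2 code units in UTF-16;
// every other code point takes 1 code unit.
$utf16Units = 0;
foreach ($cps as $cp) {
    $utf16Units += (strlen($cp) === 4) ? 2 : 1;
}

printf(
    "bytes=%d code_units=%d code_points=%d\n",
    strlen($s),
    $utf16Units,
    count($cps)
);
// prints "bytes=6 code_units=4 code_points=3"
```

Indexing by code point means that position 2 in this string is U+10000 as a
whole, never one half of a surrogate pair or a single UTF-8 byte.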
 
 
-Inline HTML
-===========
-Because inline HTML blocks are intermixed with PHP ones, they are also
-written in the script encoding. PHP transcodes the HTML blocks to the output
-encoding as needed, resulting in direct passthrough if the script encoding
-matches output encoding.
+Unicode Support in Existing Functions
+=====================================
+
+All functions in the PHP default distribution are undergoing analysis to 
+determine which functions need to be upgraded for native Unicode support. 
+You can track progress here:
+
+  http://www.php.net/~scoates/unicode/render_func_data.php
+
+Key extensions that are fully converted include:
+
+  * curl
+  * dom
+  * json
+  * mysql
+  * mysqli
+  * oci8
+  * pcre
+  * reflection
+  * simplexml
+  * soap
+  * sqlite
+  * xml
+  * xmlreader/xmlwriter
+  * xsl
+  * zlib
+
+NOTE: Unsafe functions might still work, since PHP performs Unicode conversions
+at runtime. However, unsafe functions might not work correctly with multibyte
+binary strings or with Unicode characters that cannot be represented in the
+specified unicode.runtime_encoding.
 
 
 Identifiers
 ===========
-Considering that scripts may be written in various encodings, we do not
-restrict identifiers to be ASCII-only. PHP allows any valid identifier based
-on the Unicode Standard Annex #31. The identifiers are case folded when
-necessary (class and function names) and converted to normalization form
-NFKC, so that two identifiers written in two compatible ways refer to the
-same thing.
+
+Since scripts may be written in various encodings, we do not restrict 
+identifiers to be ASCII-only. PHP allows any valid identifier based
+on the Unicode Standard Annex #31. 
 
 
 Numbers
 =======
-Unlike identifiers, we restrict numbers to consist only of ASCII digits and
-do not interpret them as written in a specific locale. The numbers are
-expected to adhere to en_US_POSIX or C locale, i.e. having no thousands
-separator and fractional separator being (.) "full stop". Numeric strings
-are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as
-a number even if the current locale's fractional separator is comma.
-
-
-Parameter Parsing API Modifications
-===================================
-
-Internal PHP functions largely uses zend_parse_parameters() API in order to
-obtain the parameters passed to them by the user. For example:
-
-    char *str;
-    int len;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
-        return;
-    }
-
-This forces the input parameter to be a string, and its value and length are
-stored in the variables specified by the caller.
-
-There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.
-
-  't' specifier
-  -------------
-  This specifier indicates that the caller requires the incoming parameter to be
-  string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
-  string value, length, and type.
-
-    void *str;
-    int len;
-    zend_uchar type;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
-        return;
-    }
-    if (type == IS_UNICODE) {
-       /* process Unicode string */
-    } else {
-       /* process binary string */
-    }
-
-  For IS_STRING type, the length represents the number of bytes, and for
-  IS_UNICODE the number of UChar's. When converting other types (numbers,
-  booleans, etc) to strings, the exact behavior depends on the Unicode semantics
-  switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.
-
-
-  'u' specifier
-  -------------
-  This specifier indicates that the caller requires the incoming parameter
-  to be a Unicode encoded string. If a non-Unicode string is passed, the engine
-  creates a copy of the string and automatically convert it to Unicode type before
-  passing it to the internal function. No such conversion is necessary for Unicode
-  strings, obviously.
-
-    UChar *str;
-    int len;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
-        return;
-    }
-    /* process Unicode string */
 
-    
-  'T' specifier
-  -------------
-  This specifier is useful when the function takes two or more strings and
-  operates on them. Using 't' specifier for each one would be somewhat
-  problematic if the passed-in strings are of mixed types, and multiple
-  checks need to be performed in order to do anything. All parameters
-  marked by the 'T' specifier are promoted to the same type.
-  
-  If at least one of the 'T' parameters is of Unicode type, then the rest of
-  them are converted to IS_UNICODE. Otherwise all 'T' parameters are conveted to
-  IS_STRING type.
-
-
-    void *str1, *str2;
-    int len1, len2;
-    zend_uchar type1, type2;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
-                             &type1, &str2, &len2, &type2) == FAILURE) {
-       return;
-    }
-    if (type1 == IS_UNICODE) {
-       /* process as Unicode, str2 is guaranteed to be Unicode as well */
-    } else {
-       /* process as binary string, str2 is guaranteed to be the same */
-    }
-
-
-The existing 's' specifier has been modified as well. If a Unicode string is
-passed in, it automatically copies and converts the string to the runtime
-encoding, and issues a warning. If a binary type is passed-in, no conversion
-is necessary.
-
-The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
-about the type of the passed-in parameter. If 'U' is specified and the binary
-string is passed in, the engine will issue a warning instead of doing automatic
-conversion. The converse applies to the 'S' specifier.
-
-
-Upgrading Existing Functions
-============================
-
-Upgrading functions to work with new data types will be a deliberate and
-involved process, because one needs to consider not only the mechanisms for
-processing Unicode characters, for example, but also the semantics of
-the function.
-
-The main tenet of the upgrade process should be that when processing Unicode
-strings, the unit of operation is a code point, not a code unit or a byte.
-For example, strlen() returns the number of code points in the string.
-
-  strlen('abc') = 3
-  strlen('ab\U010000') = 3
-  strlen('ab\uD800\uDC00') = 3 /* not 4 */
-
-Function upgrade guidelines are available in a separate document.
-
-
-Document TODO
-==========================================
-- Streams support for Unicode - What stream filters will be provided?
-- User conversion error handler
-- INI files encoding - UTF-8? Do we support BOMs?
-- There are likely to be other issues which are missing from this document
+Unlike identifiers, numbers must consist only of ASCII digits and are
+restricted to the en_US_POSIX or C locale. In other words, numbers have no
+thousands separator, and the fractional separator is (.) "full stop".  Numeric
+strings adhere to the same rules, so "10,3" is not interpreted as a number even
+if the current locale's fractional separator is a comma.
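
The numeric-string rule can be checked with a short sketch. It assumes a
released PHP interpreter (PHP 7 or later), not the PHP 6 engine described here,
and relies on the core is_numeric() function, which does not consult the
current locale. The sample values are chosen only for illustration.

```php
<?php
// Only "." is accepted as the fractional separator, and there is no
// thousands separator, regardless of the current locale.
$samples = ['10.3', '10,3', '1,000', '1e3'];

$results = [];
foreach ($samples as $s) {
    $results[$s] = is_numeric($s);
}

foreach ($results as $s => $ok) {
    printf("%-6s => %s\n", $s, $ok ? 'numeric' : 'not numeric');
}
```

Here "10.3" and "1e3" are reported as numeric, while "10,3" and "1,000" are not.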
+
+TextIterators
+=============
+
+Instead of using the offset operator [] to access characters in a linear
+fashion, use a TextIterator. TextIterator is very fast and enables you
+to iterate over code points, combining sequences, characters, words, lines, and
+sentences, both forward and backward. For example:
+
+  $text = "nai\u0308ve";
+  foreach (new TextIterator($text) as $u) {
+      var_inspect($u);
+  }
+
+lists six code points, including the umlaut (U+0308) as a separate code point.
+Instantiating the TextIterator to iterate over characters,
+
+  $text = "nai\u0308ve";
+  foreach (new TextIterator($text, TextIterator::CHARACTER) as $u) {
+      var_inspect($u);
+  }
 
+lists five characters, including an "i" with an umlaut as a single character.
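
The same code point versus character distinction can be sketched without
TextIterator. The sketch below is an approximation for a PHP 7 or later
interpreter and is not the PHP 6 API: it assumes UTF-8 strings, the "\u{...}"
escape, and PCRE's \X escape, which matches one extended grapheme cluster
(a base character plus its combining marks).

```php
<?php
// "naive" with a combining diaeresis (U+0308) after the "i".
$text = "nai\u{0308}ve";

// Code points: split at every code point boundary in UTF-8 mode.
$codePoints = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

// Characters: \X matches a base character together with its combining marks.
preg_match_all('/\X/u', $text, $matches);
$characters = $matches[0];

printf("code points: %d\n", count($codePoints)); // prints "code points: 6"
printf("characters:  %d\n", count($characters)); // prints "characters:  5"
```

The third character in $characters is the two code point sequence "i" followed
by U+0308, which is what iterating by character is meant to keep together.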
 
-Build System
-============
+Locales
+=======
 
-Unicode support in PHP is always enabled. The only configuration option
-during development should be the location of the ICU headers and libraries.
+Unicode support in PHP relies exclusively on ICU locales, NOT the POSIX locales
+installed on the system. You may access the default ICU locale using:
 
-    --with-icu-dir=<dir>       <dir> parameter specifies the location of ICU
-                               header and library files.
+  locale_set_default()
+  locale_get_default()
 
-After the initial development we have to repackage ICU library for our needs
-and bundle it with PHP.
+ICU locale IDs have a somewhat different format from POSIX locale IDs. The ICU
+syntax is:
 
+  <language>[_<script>]_<country>[_<variant>][@<keywords>]
 
-Document History
-================
-  0.6: Remove notion of native encoding string, only 2 string types are used
-       now. Update conversion error behavior section and parameter parsing.
-       Bring the document up-to-date with reality in general.
-
-  0.5: Updated per latest discussions. Removed tentative language in several
-       places, since we have decided on everything described here already.
-       Clarified details according to Phase II progress.
- 
-  0.4: Updated to include all the latest discussions. Updated development
-       phases.
+For example, [EMAIL PROTECTED] is Serbian (Latin, Yugoslavia,
+Revised Orthography, Currency=US Dollar).
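
To make the locale ID syntax concrete, here is a simplified sketch that splits
an ID into its parts. parse_icu_locale_id() is a hypothetical helper, not an
ICU or PHP function, and its pattern follows only the syntax line shown above.
Real ICU locale IDs are more permissive, so treat this as an illustration
rather than a validator. The sample IDs are en_US_POSIX, which appears earlier
in this document, and de_DE@collation=phonebook, which is assumed here as an
example of a keyword.

```php
<?php
// Hypothetical helper: split a locale ID of the form
//   <language>[_<script>]_<country>[_<variant>][@<keywords>]
// into its components. Returns null if the ID does not fit that form.
function parse_icu_locale_id(string $id): ?array
{
    $pattern = '/^(?<language>[a-z]{2,3})'
             . '(?:_(?<script>[A-Z][a-z]{3}))?'
             . '_(?<country>[A-Z]{2})'
             . '(?:_(?<variant>[A-Z0-9]+))?'
             . '(?:@(?<keywords>.+))?$/';

    if (!preg_match($pattern, $id, $m)) {
        return null;
    }

    // Optional parts that did not match are reported as empty strings.
    return [
        'language' => $m['language'],
        'script'   => $m['script'] ?? '',
        'country'  => $m['country'],
        'variant'  => $m['variant'] ?? '',
        'keywords' => $m['keywords'] ?? '',
    ];
}

$posix = parse_icu_locale_id('en_US_POSIX');
$de    = parse_icu_locale_id('de_DE@collation=phonebook');

var_export($posix);
echo "\n";
var_export($de);
echo "\n";
```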
 
-  0.3: Updated to include all the latest discussions.
+Do not use the deprecated setlocale() function. This function interacts with the
+POSIX locale. If Unicode semantics are on, using setlocale() generates
+a deprecation warning.
 
-  0.2: Updated Phase I design proposal per discussion on [EMAIL PROTECTED]
-       Modified Internal Encoding section to contain only UTF-16 info..
-       Expanded Script Encoding section.
-       Added Binary Data Type section. 
-       Amended Language Modifications section to describe string literals
-       behavior.
-       Amended Build System section.
-
-  0.1: Phase I design proposal
+Document TODO
+==========================================
+- Final review.
+- Fix the HTTP Input Encoding section, which is obsolete now.
 
 
 References
@@ -665,5 +643,6 @@
 Authors
 =======
   Andrei Zmievski <[EMAIL PROTECTED]>
+  Evan Goer <[EMAIL PROTECTED]>
 
-vim: set et :
+vim: set et tw=80 :
