andrei          Wed Jan 10 23:16:40 2007 UTC

  Modified files:
    /php-src    README.UNICODE
  Log:
  Update with rewrites by me and Evan G.
http://cvs.php.net/viewvc.cgi/php-src/README.UNICODE?r1=1.7&r2=1.8&diff_format=u
Index: php-src/README.UNICODE
diff -u php-src/README.UNICODE:1.7 php-src/README.UNICODE:1.8
--- php-src/README.UNICODE:1.7 Fri Dec 15 23:33:48 2006
+++ php-src/README.UNICODE Wed Jan 10 23:16:40 2007
@@ -1,133 +1,111 @@
+Audience
+========
+
+This README describes how PHP 6 provides native support for the Unicode
+Standard. Readers of this document should be proficient with PHP and have a
+basic understanding of Unicode concepts. For more technical details about
+PHP 6 design principles and for guidelines about writing Unicode-ready PHP
+extensions, refer to README.UNICODE-UPGRADES.
+
 Introduction
 ============

-As successful as PHP has proven to be in the past several years, it is still
-the only remaining member of the P-trinity of scripting languages - Perl and
-Python being the other two - that remains blithely ignorant of the
-multilingual and multinational environment around it. The software
-development community has been moving towards Unicode Standard for some time
-now, and PHP can no longer afford to be outside of this movement. Surely,
-some steps have been taken recently to allow for easier processing of
-multibyte data with the mbstring extension, but it is not enabled in PHP by
-default and is not as intuitive or transparent as it could be.
-
-The basic goal of this document is to describe how PHP 6 will support the
-Unicode Standard natively. Since the full implementation of the Unicode
-Standard is very involved, the idea is to use the already existing,
-well-tested, full-featured, and freely available ICU (International
-Components for Unicode) library. This will allow us to concentrate on the
-details of PHP integration and speed up the implementation.
+As successful as PHP has proven to be over the years, its support for
+multilingual and multinational environments has languished.
+PHP can no
+longer afford to remain outside the overall movement towards the Unicode
+standard. Although recent updates involving the mbstring extension have
+enabled easier multibyte data processing, this does not constitute native
+Unicode support.
+
+Since the full implementation of the Unicode Standard is very involved, our
+approach is to speed up implementation by using the well-tested,
+full-featured, and freely available ICU (International Components for
+Unicode) library.
+
 General Remarks
 ===============

-Backwards Compatibility
------------------------
-Throughout the design and implementation of Unicode support, backwards
-compatibility must be of paramount concern. PHP is used on an enormous number of
-sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
-that the existing data types and functions must work as they have always
-done. However, the speed of certain operations may be affected, due to
-increased complexity of the code overall.
-
-Unicode Encoding
-----------------
-The initial version will not support Byte Order Mark. Text processing will
-generally perform better if the characters are in Normalization Form C.
-
-
-Implementation Approach
-=======================
-
-The implementation is done in phases. This allows for more basic and
-low-level implementation issues to be ironed out and tested before
-proceeding to more advanced topics.
-
-Legend:
-  - TODO
-  + finished
-  * in progress
-
-  Phase I
-  -------
-    + Basic Unicode string support, including instantiation, concatenation,
-      indexing
-
-    + Simple output of Unicode strings via 'print' and 'echo' statements
-      with appropriate output encoding conversion
-
-    + Conversion of Unicode strings to/from various encodings via encode() and
-      decode() functions
-
-    + Determining length of Unicode strings via strlen() function, some
-      simple string functions ported (substr).
-
+International Components for Unicode
+------------------------------------

-  Phase II
-  --------
-    * HTTP input request decoding
+ICU (International Components for Unicode) is a mature, widely used set of
+C/C++ and Java libraries for Unicode support, software internationalization
+and globalization. It provides:
+
+  - Encoding conversions
+  - Collations
+  - Unicode text processing
+  - and much more
+
+When building PHP 6, Unicode support is always enabled. The only
+configuration option during development should be the location of the ICU
+headers and libraries.

-    + Fixing remaining string-aware operators (assignment to [] etc)
-
-    + Support for Unicode and binary strings in PHP streams
-
-    + Support for Unicode identifiers
-
-    + Configurable handling of conversion failures
-
-    + \C{} escape sequence in strings
-
-
-  Phase III
-  ---------
-    * Exposing ICU API
+    --with-icu-dir=<dir>
+
+where <dir> specifies the location of ICU header and library files. If you do
+not specify this option, PHP attempts to find ICU under /usr and /usr/local.

-    * Porting all remaining functions to support Unicode and/or binary
-      strings
+NOTE: ICU is not bundled with PHP 6 yet. To download the distribution, visit
+http://icu.sourceforge.net. PHP requires ICU version 3.4 or higher.

+Backwards Compatibility
+-----------------------
+Our paramount concern for providing Unicode support is backwards compatibility.
+Because PHP is used on so many sites, existing data types and functions must
+work as they always have. However, although PHP's interfaces must remain
+backwards-compatible, the speed of certain operations might be affected due to
+internal implementation changes.

 Encoding Names
-==============
-All the encoding settings discussed in this document accept any valid
-encoding name supported by ICU. See ICU online documentation for the full
-list of encodings.
+--------------
+All the encoding settings discussed in this document can accept any valid
+encoding name supported by ICU. For a full list of encodings, refer to the ICU
+online documentation.

+NOTE: References to "Unicode" in this document generally mean the UTF-16
+character encoding, unless explicitly stated otherwise.

 Unicode Semantics Switch
 ========================

-Obviously, PHP cannot simply impose new Unicode support on everyone. There
-are many applications that do not care about Unicode and do not need it.
-Consequently, there is a switch that enables certain fundamental language
-changes related to Unicode. This switch is available only as a site-wide (per
-virtual server) INI setting.
-
-Note that having switch turned off does not imply that PHP is unaware of Unicode
-at all and that no Unicode strings can exist. It only affects certain aspects of
-the language, and Unicode strings can always be created programmatically. All
-the functions and operators will still support Unicode strings and work
-appropriately.
-
-    unicode.semantics = On
+Because many applications do not require Unicode, PHP 6 provides a server-wide
+INI setting to enable Unicode support:

+    unicode.semantics = On/Off

-Internal Encoding
-=================
-
-UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
-two bytes for any Unicode character in the Basic Multilingual Plane, which
-is where most of the current world's languages are represented. While being
-less memory efficient for basic ASCII text it simplifies the processing and
-makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
-processing as well.
+This switch is off by default. If your applications do not require native
+Unicode support, you may leave this switch off, and continue to use Unicode
+strings only when you need to.
+
+However, if your application is ready to fully support Unicode, you should
+turn this switch on.
+This activates various Unicode support mechanisms,
+including:
+
+  * All string literals become Unicode
+  * All variables received from HTTP requests become Unicode
+  * PHP identifiers may use Unicode characters
+
+More fundamentally, your PHP environment is now a Unicode environment. Strings
+inside PHP are Unicode, and the system is responsible for converting non-Unicode
+strings on PHP's periphery (for example, in HTTP input and output, streams, and
+filesystem operations). With unicode.semantics on, you must specify binary
+strings explicitly. PHP makes no assumptions about the content of a binary
+string, so your application must handle all binary strings appropriately.
+
+Conversely, if unicode.semantics is off, PHP behaves as it did in the past.
+String literals do not become Unicode, and files are binary strings for
+backwards compatibility. You can always create Unicode strings programmatically,
+and all functions and operators support Unicode strings transparently.


 Fallback Encoding
 =================

-This setting specifies the "fallback" encoding for all the other ones. So if
-a specific encoding setting is not set, PHP defaults it to the fallback
-encoding. If the fallback_encoding is not specified either, it is set to
+The fallback encoding provides a default value for all other unicode.*_encoding
+INI settings. If you do not set a particular unicode.*_encoding setting, PHP
+uses the fallback encoding. If you do not specify a fallback encoding, PHP uses
 UTF-8.

     unicode.fallback_encoding = "iso-8859-1"
@@ -136,114 +114,202 @@
 Runtime Encoding
 ================

-Currently PHP neither specifies nor cares what the encoding of its strings
-is. However, the Unicode implementation needs to know what this encoding is
-for several reasons, including explicit (casting) and implicit (concatenation,
-comparison, parameter passing) type coersions. This setting specifies the
-runtime encoding.
+The runtime encoding specifies the encoding PHP uses for converting binary
+strings within the PHP engine itself.

     unicode.runtime_encoding = "iso-8859-1"

+This setting has no effect on I/O-related operations such as writing to
+standard out, reading from the filesystem, or decoding HTTP input variables.
+
+PHP enables you to explicitly convert strings using casting:
+
+  * (binary)  -- casts to binary string type
+  * (unicode) -- casts to Unicode string type
+  * (string)  -- casts to Unicode string type if unicode.semantics is on,
+                 to binary otherwise
+
+For example, if unicode.runtime_encoding is iso-8859-1, and $uni is a unicode
+string, then
+
+    $str = (binary)$uni
+
+creates a binary string $str in the ISO-8859-1 encoding.
+
+Implicit conversions include concatenation, comparison, and parameter passing.
+For better precision, PHP attempts to convert strings to Unicode before
+performing these sorts of operations. For example, if we concatenate our binary
+string $str with a unicode literal, PHP converts $str to Unicode first, using
+the encoding specified by unicode.runtime_encoding.

 Output Encoding
 ===============

-Automatic output encoding conversion is supported on the standard output
-stream. Therefore, commands such as 'print' and 'echo' automatically convert
-their arguments to the specified encoding. No automatic output encoding is
-performed for anything else. Therefore, when writing to files or external
-resources, the developer has to manually encode the data using functions
-provided by the unicode extension or rely on stream encoding features
-
-The existing default_charset setting so far has been used only for
-specifying the charset portion of the Content-Type MIME header. For several
-reasons, this setting is deprecated. Now it is only used when the Unicode
-semantics switch is disabled and does not affect the actual transcoding of
-the output stream. The output encoding setting takes precedence in all other
-cases.
-If the output encoding is set, PHP will automatically add 'charset'
-portion to the Conten-Type header.
+PHP automatically converts output for commands that write to the standard
+output stream, such as 'print' and 'echo'.

     unicode.output_encoding = "utf-8"

+However, PHP does not convert binary strings. When writing to files or external
+resources, you must rely on stream encoding features or manually encode the data
+using functions provided by the unicode extension.
+
+The existing default_charset INI setting is DEPRECATED in favor of
+unicode.output_encoding. Previously, default_charset only specified the charset
+portion of the Content-Type MIME header. Now default_charset only takes effect
+when unicode.semantics is off, and it does not affect the actual transcoding of
+the output stream. Setting unicode.output_encoding causes PHP to add the
+'charset' portion to the Content-Type header, overriding any value set for
+default_charset.
+

 HTTP Input Encoding
 ===================

-There will be no explicit input encoding setting. Instead, PHP will rely on a
-couple of heuristics to determine what encoding the incoming request might be
-in. Firstly, PHP will attempt to decode the input using the value of the
-unicode.output_encoding setting, because that is the most logical choice if we
-assume that the clients send the data back in the encoding that the page with
-the form was in. If that is unsuccessful, we could fallback on the "_charset_"
-form parameter, if present. This parameter is sent by IE (and possibly Firefox)
-along with the form data and indicates the encoding of the request. Note that
-this parameter will be present only if the form contains a hidden field named
-"_charset_".
-
-The variables that are decoded successfully will be put into the request arrays
-as Unicode strings, those that fail -- as binary strings. PHP will set a
-flag (probably in the $_SERVER array) indicating that there were problems during
-the conversion.
-The user will have access to the raw input in case of
-failure via the input filter extension and can to access the request parameters
-via input_get_arg() function. The input filter extension always looks in
-the raw input data and not in the request arrays, and input_get_arg() has a
-'charset' parameter that can be specified to tell PHP what charset the incoming
-data is in. This kills two birds with one stone: users have access to request
-arrays data on successful decoding as well as a standard and secure way to get
-at the data in case of failed decoding.
+The HTTP input encoding specifies the encoding of variables received via
+HTTP, such as the contents of the $_GET and $_POST arrays.
+
+This functionality is currently under development. For a discussion of the
+approach that the PHP 6 team is taking, refer to:
+
+http://marc.theaimsgroup.com/?t=116613047300005&r=1&w=2
+
+
+Filesystem Encoding
+===================
+
+The filesystem encoding specifies the encoding of file and directory names
+on the filesystem.
+
+    unicode.filename_encoding = "utf-8"
+
+Filesystem-related functions such as opendir() perform this conversion when
+accepting and returning file names. You should set the filename encoding to
+the encoding used by your filesystem.


 Script Encoding
 ===============

-PHP scripts may be written in any encoding supported by ICU. The encoding
-of the scripts can be specified site-wide via an INI directive, or with a
-'declare' pragma at the beginning of the script. The reason for pragma is that
-an application written in Shift-JIS, for example, should be executable on a
-system where the INI directive cannot be changed by the application itself. The
-pragma setting is valid only for the script it occurs in, and does not propagate
-to the included files.
+You may write PHP scripts in any encoding supported by ICU.
+To specify the
+script encoding site-wide, use the INI setting:

-  pragma:
-    <?php declare(encoding = 'utf-8'); ?>
-
-  INI setting:
     unicode.script_encoding = utf-8

+If you cannot change the encoding system wide, you can use a pragma to
+override the INI setting in a local script:
+
+    <?php declare(encoding = 'Shift-JIS'); ?>
+
+The pragma setting must be the first statement in the script. It only affects
+the script in which it occurs, and does not propagate to any included files.
+

 INI Files
 =========

-INI files will be presumed to contain UTF-8 encoded keys and values when the
-Unicode semantics mode is On. When the mode is off, the data is taken as-is,
+If unicode.semantics is on, INI files are presumed to contain UTF-8 encoded
+keys and values. If unicode.semantics is off, the data is taken as-is,
 similar to PHP 5. No validation occurs during parsing. Instead invalid UTF-8
 sequences are caught during access by ini_*() functions.


-Conversion Semantics
-====================
+Stream I/O
+==========
+
+PHP has a streams-based I/O system for generalized filesystem access,
+networking, data compression, and other operations. Since the data on the
+other end of the stream can be in any encoding, you need to think about
+data conversion.
+
+Okay, this needs to be clarified. By "default", streams are actually
+opened in binary mode. You have to specify 't' flag or use FILE_TEXT in
+order to open it in text mode, where conversions apply. And for the text
+mode streams, the default stream encoding is UTF-8 indeed.
+
+By default, PHP opens streams in binary mode. To open a file in text mode,
+you must use the 't' flag (or the FILE_TEXT parameter -- see below). The
+default encoding for streams in text mode is UTF-8. This means that if
+'file.txt' is a UTF-8 text file, this code snippet:
+
+    $fp = fopen('file.txt', 'rt');
+    $str = fread($fp, 100);
+
+returns 100 Unicode characters, while:
+
+    $fp = fopen('file.txt', 'wt');
+    fwrite($fp, $uni);
+
+writes to a UTF-8 text file.
+
+If you mainly work with files in an encoding other than UTF-8, you can
+change the default context encoding setting:
+
+    stream_default_encoding('Shift-JIS');
+    $data = file_get_contents('file.txt', FILE_TEXT);
+    // work on $data
+    file_put_contents('file.txt', $data, FILE_TEXT);
+
+The file_get_contents() and file_put_contents() functions now accept an
+additional parameter, FILE_TEXT. If you provide FILE_TEXT for
+file_get_contents(), PHP returns a Unicode string. Without FILE_TEXT, PHP
+returns a binary string (which would be appropriate for true binary data, such
+as an image file). When writing a Unicode string with file_put_contents(), you
+must supply the FILE_TEXT parameter, or PHP generates a warning.
+
+If you need to work with multiple encodings, you can create custom contexts
+using stream_context_create() and then pass in the custom context as an
+additional parameter. For example:
+
+    $ctx = stream_context_create(NULL, array('encoding' => 'big5'));
+    $data = file_get_contents('file.txt', FILE_TEXT, $ctx);
+    // work on $data
+    file_put_contents('file.txt', $data, FILE_TEXT, $ctx);

-Not all characters can be converted between Unicode and legacy encodings.
-Normally, when downconverting from Unicode, the default behavior of ICU
-converters is to substitute the missing sequence with the appropriate
-substitution sequence for that codepage, such as 0x1A (Control-Z) in
-ISO-8859-1. When upconverting to Unicode, if an encoding has a character
-which cannot be converted into Unicode, that sequence is replaced by the
-Unicode substitution character (U+FFFD).

-The conversion error behavior can be customized:
+Conversion Semantics and Error Handling
+=======================================
+
+PHP can convert strings explicitly (casting) and implicitly (concatenation,
+comparison, and parameter passing). For example, when concatenating a Unicode
+string and a binary string, PHP converts the binary string to Unicode for better
+precision.
+
+However, not all characters can be converted between Unicode and legacy
+encodings. The first possibility is that a string contains corrupt data or
+an illegal byte sequence. In this case, the converter simply stops with
+a message that resembles:
+
+    Warning: Could not convert binary string to Unicode string
+        (converter UTF-8 failed on bytes (0xE9) at offset 2)
+
+Conversely, if a similar error occurs when attempting to convert Unicode to
+a legacy string, the converter generates a message that resembles:
+
+    Warning: Could not convert Unicode string to binary string (converter ISO-8859-1 failed on character {U+DC00} at offset 2)
+
+To customize this behavior, refer to "Creating a Custom Error Handler" below.
+
+The second possibility is that a Unicode character simply cannot be represented
+in the legacy encoding. By default, when downconverting from Unicode, the
+converter substitutes any missing sequences with the appropriate substitution
+sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When
+upconverting to Unicode, the converter replaces any byte sequence that has no
+Unicode equivalent with the Unicode substitution character (U+FFFD).
+
+You can customize the conversion error behavior to:

   - stop the conversion and return an empty string
   - skip any invalid characters
   - substitute invalid characters with a custom substitution character
   - escape the invalid character in various formats

-The global conversion error settings can be controlled with these two functions:
+To control the global conversion error settings, use the functions:

     unicode_set_error_mode(int direction, int mode)
     unicode_set_subst_char(unicode char)

-Where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
+where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
 constants:

     U_CONV_ERROR_STOP
@@ -255,31 +321,102 @@
     U_CONV_ERROR_ESCAPE_XML_DEC
     U_CONV_ERROR_ESCAPE_XML_HEX

-Substitution character can be set only for FROM_UNICODE direction and has to
-exist in the target character set.
+As an example, with a runtime encoding of ISO-8859-1, the conversion:
+
+    $str = (binary)"< \u30AB >";
+
+results in:
+
+    MODE                    RESULT
+    --------------------------------------
+    stop                    ""
+    skip                    "<  >"
+    substitute              "< ? >"
+    escape (Unicode)        "< {U+30AB} >"
+    escape (ICU)            "< %U30AB >"
+    escape (Java)           "< \u30AB >"
+    escape (XML decimal)    "< &#12459; >"
+    escape (XML hex)        "< &#x30AB; >"
+
+With a runtime encoding of UTF-8, the conversion of the (illegal) sequence:
+
+    $str = (unicode)b"< \xe9\xfe >";
+
+results in:
+
+    MODE                    RESULT
+    --------------------------------------
+    stop                    ""
+    skip                    ""
+    substitute              ""
+    escape (Unicode)        "< %XE9%XFE >"
+    escape (ICU)            "< %XE9%XFE >"
+    escape (Java)           "< \xE9\xFE >"
+    escape (XML decimal)    "< &#233;&#254; >"
+    escape (XML hex)        "< &#xE9;&#xFE; >"
+
+The substitution character can be set only for FROM_UNICODE direction and has to
+exist in the target character set. The default substitution character is (?).
+
+NOTE: Casting is just a shortcut for using unicode.runtime_encoding. To convert
+using an alternative encoding, use the unicode_encode() and unicode_decode()
+functions.
+For example,
+
+    $str = unicode_encode($uni, 'koi8-r', U_CONV_ERROR_SUBST);
+
+results in a binary KOI8-R encoded string.
+
+Creating a Custom Error Handler
+-------------------------------
+If an error occurs during the conversion, PHP outputs a warning describing the
+problem. Instead of this default behavior, PHP can invoke a user-provided error
+handler, similar to how the current user-defined error handler works. To set
+the custom conversion error handler, call:
+
+    mixed unicode_set_error_handler(callback error_handler)
+
+The function returns the previously defined custom error handler. If no error
+handler was defined, or if an error occurs when returning the handler, this
+function returns NULL.
+
+When the custom handler is set, the standard error handler is bypassed. It is
+the responsibility of the custom handler to output or log any messages, raise
+exceptions, or die(), if necessary. However, if the custom error handler returns
+FALSE, the standard handler will be invoked afterwards.
+
+The user function specified as the error_handler must accept five parameters:
+
+    mixed error_handler($direction, $encoding, $char_or_byte, $offset,
+                        $message)
+
+where:
+
+    $direction    - the direction of conversion, FROM_UNICODE/TO_UNICODE
+
+    $encoding     - the name of the encoding to/from which the conversion
+                    was attempted
+
+    $char_or_byte - either Unicode character or byte sequence (depending
+                    on direction) which caused the error
+
+    $offset       - the offset of the failed character/byte sequence in
+                    the source string
+
+    $message      - the error message describing the problem
+
+NOTE: If the error mode set by unicode_set_error_mode() is substitute,
+skip, or escape, the handler won't be called, since these are non-error
+causing operations. To always invoke your handler, set the error mode to
+U_CONV_ERROR_STOP.


 Unicode String Type
 ===================

-Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
-UTF-16 format.
-It is the main string type in PHP when Unicode semantics
-switch is turned on. Unicode strings can exist when the switch is off, but
-they have to be produced programmatically, via calls to functions that
-return Unicode type.
-
-The operational unit when working with Unicode strings is a code point, not
-code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code
-units, each of which is a 16-bit word. Working on the code point level is
-necessary because doing otherwise would mean offloading the processing of
-surrogate pairs onto PHP users, and that is less than desirable.
-
-The repercussions are that one cannot expect code point N to be at offset N in
-the Unicode string. Instead, one has to iterate from the beginning from the
-string using U16_FWD() macro until the desired codepoint is reached. This will
-be transparent to the end user who will work only with "character" offsets.
-
-The codepoint access is one of the primary areas targeted for optimization.
+The Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
+UTF-16. This is the main string type in PHP when Unicode semantics switch is
+turned on. Unicode strings can exist when the switch is off, but they have to be
+produced programmatically via calls to functions that return Unicode types.


 Binary String Type
@@ -294,108 +431,48 @@
 Printing binary data to the standard output passes it through as-is, independent
 of the output encoding.

-
-Zval Structure Changes
-======================
-
-PHP is a type-agnostic language. Its data values are encapsulated in a zval
-(Zend value) structure that can change as necessary to accomodate various types.
-
-struct _zval_struct {
-    /* Variable information */
-    union {
-        long lval;                /* long value */
-        double dval;              /* double value */
-        struct {
-            char *val;
-            int len;
-        } str;                    /* string value */
-        HashTable *ht;            /* hash table value */
-        zend_object_value obj;    /* object value */
-    } value;
-    zend_uint refcount;
-    zend_uchar type;              /* active type */
-    zend_uchar is_ref;
-};
-
-The type field determines what is stored in the union, IS_STRING being the only
-data type pertinent to this discussion. In the current version, the strings
-are binary-safe, but, for all intents and purposes, are assumed to be
-comprised of 8-bit characters. It is possible to treat the string value as
-an opaque type containing arbitrary binary data, and in fact that is how
-mbstring extension uses it, in order to store multibyte strings. However,
-many extensions and the Zend engine itself manipulate the string value
-directly without regard to its internals. Needless to say, this can lead to
-problems.
-
-For IS_UNICODE type, we need to add another structure to the union:
-
-    union {
-    ....
-        struct {
-            UChar *val;            /* Unicode string value */
-            int len;               /* number of UChar's */
-        } ustr;
-    ....
-    } value;
-
-This cleanly separates the two types of strings and helps preserve backwards
-compatibility.
-
-To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
-another structure:
-
-    union {
-    ....
-        struct {                   /* Universal string type */
-            zstr val;
-            int len;
-        } uni;
-    ....
-    } value;
-
-Where zstr ia union of char*, UChar*, and void*.
-
+For examples of specifying binary string literals, refer to the section
+"Language Modifications".

 Language Modifications
 ======================

-If a Unicode switch is turned on, PHP string literals - single-quoted,
-double-quoted, and heredocs - become Unicode strings (IS_UNICODE type).
-They support all the same escape sequences and variable interpolations as
-previously, with the addition of some new escape sequences.
+If a Unicode switch is turned on, PHP string literals -- single-quoted,
+double-quoted, and heredocs -- become Unicode strings (IS_UNICODE type). String
+literals support all the same escape sequences and variable interpolations as
+before, plus several new escape sequences.

-The contents of the strings are interpreted as follows:
+PHP interprets the contents of strings as follows:

   - all non-escaped characters are interpreted as a corresponding Unicode
-    codepoint based on the current script encoding, e.g. ASCII 'a' (0x51) =>
-    U+0061, Shift-JIS (0x92 0x69) => U+4E2D
+    codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
+    U+0061, Shift-JIS (0x92 0x86) => U+4E2D

   - existing PHP escape sequences are also interpreted as Unicode codepoints,
     including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020

-  - two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or
+  - two new escape sequences, \uXXXX and \UXXXXXX, are interpreted as a 4 or
     6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
-    U+10410
-
+    U+10410. (Having two sequences avoids the ambiguity of \u020608 --
+    is that supposed to be U+0206 followed by "08", or U+020608 ?)
+
   - a new escape sequence allows specifying a character by its full
     Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20

-The single-quoted string is more restrictive than the other two types: so
-far the only escape sequence allowed inside of it was \', which specifies
-a literal single quote. However, single quoted strings now support the new
-Unicode character escape sequences as well.
+The single-quoted string is more restrictive than the other two types. So far
+the only escape sequence allowed inside of it was \', which specifies a literal
+single quote. However, single quoted strings now support the new Unicode
+character escape sequences as well.

 PHP allows variable interpolation inside the double-quoted and heredoc strings.
 However, the parser separates the string into literal and variable chunks during
-compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
-literal chunks can be handled in the normal way for as far as Unicode
-support is concerned.
-
-Since all string literals become Unicode by default, one loses the ability
-to specify byte-oriented or binary strings. In order to create binary string
-literals, a new syntax is necessary: prefixing a string literal with letter
-'b' creates a binary string.
+compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that PHP
+can handle literal chunks in the normal way as far as Unicode support is
+concerned.
+
+Since all string literals become Unicode by default, PHP 6 introduces new syntax
+for creating byte-oriented or binary strings. Prefixing a string literal with
+the letter 'b' creates a binary string:

     $var = b'abc\001';
     $var = b"abc\001";
@@ -403,235 +480,136 @@
 abc\001
 EOD;

-The binary string literals support the same escape sequences as the current
-PHP strings. If the Unicode switch is turned off, then the binary string
-literals generate normal string (IS_STRING) type internally, without any
-effect on the application.
-
-The string operators have been changed to accomodate the new IS_UNICODE and
-IS_BINARY types. In more detail:
-
-  - The concatenation (.) operator has been changed to automatically coerce
-    IS_STRING type to the more precise IS_UNICODE if its operands are of two
-    different string types.
-
-  - The concatenation assignment operator (.=) has been changed similarly.
-
-  - The string indexing operator [] has been changed to accomodate IS_UNICODE
-    type strings and extract the specified character. Note that the index
-    specifies a code point, not a byte, or a code unit, thus supporting
-    supplementary characters.
-
-  - Both Unicode and binary string types can be used as array keys. If the
-    Unicode switch is on, the binary keys are converted to Unicode.
+The content of a binary string is the literal byte sequence inside the
+delimiters, which depends on the script encoding (unicode.script_encoding).
+Binary string literals support the same escape sequences as PHP 5 strings. If
+the Unicode switch is turned off, then the binary string literals generate the
+normal string (IS_STRING) type internally without any effect on the application.
+
+The string operators now accommodate the new IS_UNICODE and IS_BINARY types:
+
+  - The concatenation operator (.) and concatenation assignment operator (.=)
+    automatically coerce the IS_STRING type to the more precise IS_UNICODE if
+    the operands are of different string types.
+
+  - The string indexing operator [] now accommodates IS_UNICODE type strings
+    and extracts the specified character. To support supplementary characters,
+    the index specifies a code point, not a byte or a code unit.

   - Bitwise operators and increment/decrement operators do not work on
     Unicode strings. They do work on binary strings.

   - Two new casting operators are introduced, (unicode) and (binary). The
-    (string) operator will cast to Unicode type if the Unicode semantics switch is
+    (string) operator casts to Unicode type if the Unicode semantics switch is
     on, and to binary type otherwise.

-  - The comparison operators when applied to Unicode strings, perform
-    comparison in binary code point order. They also do appropriate coersion
-    if the strings are of differing types.
+  - The comparison operators compare Unicode strings in binary code point
+    order. They also coerce strings to Unicode if the strings are of different
+    types.

   - The arithmetic operators use the same semantics as today for converting
     strings to numbers. A Unicode string is considered numeric if it
-    represents a long or a double number in en_US_POSIX locale.
+    represents a long or a double number in the en_US_POSIX locale.
-Inline HTML
-===========
-Because inline HTML blocks are intermixed with PHP ones, they are also
-written in the script encoding. PHP transcodes the HTML blocks to the output
-encoding as needed, resulting in direct passthrough if the script encoding
-matches output encoding.
+Unicode Support in Existing Functions
+=====================================
+
+All functions in the PHP default distribution are undergoing analysis to
+determine which functions need to be upgraded for native Unicode support.
+You can track progress here:
+
+    http://www.php.net/~scoates/unicode/render_func_data.php
+
+Key extensions that are fully converted include:
+
+    * curl
+    * dom
+    * json
+    * mysql
+    * mysqli
+    * oci8
+    * pcre
+    * reflection
+    * simplexml
+    * soap
+    * sqlite
+    * xml
+    * xmlreader/xmlwriter
+    * xsl
+    * zlib
+
+NOTE: Unsafe functions might still work, since PHP performs Unicode conversions
+at runtime. However, unsafe functions might not work correctly with multibyte
+binary strings, or Unicode characters that are not representable in the
+specified unicode.runtime_encoding.

 Identifiers
 ===========
-Considering that scripts may be written in various encodings, we do not
-restrict identifiers to be ASCII-only. PHP allows any valid identifier based
-on the Unicode Standard Annex #31. The identifiers are case folded when
-necessary (class and function names) and converted to normalization form
-NFKC, so that two identifiers written in two compatible ways refer to the
-same thing.
+
+Since scripts may be written in various encodings, we do not restrict
+identifiers to be ASCII-only. PHP allows any valid identifier based
+on the Unicode Standard Annex #31.

 Numbers
 =======
-Unlike identifiers, we restrict numbers to consist only of ASCII digits and
-do not interpret them as written in a specific locale. The numbers are
-expected to adhere to en_US_POSIX or C locale, i.e. having no thousands
-separator and fractional separator being (.) "full stop". Numeric strings
-are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as
-a number even if the current locale's fractional separator is comma.
-
-
-Parameter Parsing API Modifications
-===================================
-
-Internal PHP functions largely uses zend_parse_parameters() API in order to
-obtain the parameters passed to them by the user. For example:
-
-    char *str;
-    int len;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
-        return;
-    }
-
-This forces the input parameter to be a string, and its value and length are
-stored in the variables specified by the caller.
-
-There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.
-
-  't' specifier
-  -------------
-  This specifier indicates that the caller requires the incoming parameter to be
-  string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
-  string value, length, and type.
-
-    void *str;
-    int len;
-    zend_uchar type;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
-        return;
-    }
-    if (type == IS_UNICODE) {
-       /* process Unicode string */
-    } else {
-       /* process binary string */
-    }
-
-  For IS_STRING type, the length represents the number of bytes, and for
-  IS_UNICODE the number of UChar's. When converting other types (numbers,
-  booleans, etc) to strings, the exact behavior depends on the Unicode semantics
-  switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.
-
-
-  'u' specifier
-  -------------
-  This specifier indicates that the caller requires the incoming parameter
-  to be a Unicode encoded string. If a non-Unicode string is passed, the engine
-  creates a copy of the string and automatically convert it to Unicode type before
-  passing it to the internal function. No such conversion is necessary for Unicode
-  strings, obviously.
-
-    UChar *str;
-    int len;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
-        return;
-    }
-    /* process Unicode string */
-
-  'T' specifier
-  -------------
-  This specifier is useful when the function takes two or more strings and
-  operates on them. Using 't' specifier for each one would be somewhat
-  problematic if the passed-in strings are of mixed types, and multiple
-  checks need to be performed in order to do anything. All parameters
-  marked by the 'T' specifier are promoted to the same type.
-
-  If at least one of the 'T' parameters is of Unicode type, then the rest of
-  them are converted to IS_UNICODE. Otherwise all 'T' parameters are conveted to
-  IS_STRING type.
-
-
-    void *str1, *str2;
-    int len1, len2;
-    zend_uchar type1, type2;
-
-    if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
-                              &type1, &str2, &len2, &type2) == FAILURE) {
-        return;
-    }
-    if (type1 == IS_UNICODE) {
-       /* process as Unicode, str2 is guaranteed to be Unicode as well */
-    } else {
-       /* process as binary string, str2 is guaranteed to be the same */
-    }
-
-
-The existing 's' specifier has been modified as well. If a Unicode string is
-passed in, it automatically copies and converts the string to the runtime
-encoding, and issues a warning. If a binary type is passed-in, no conversion
-is necessary.
-
-The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
-about the type of the passed-in parameter. If 'U' is specified and the binary
-string is passed in, the engine will issue a warning instead of doing automatic
-conversion. The converse applies to the 'S' specifier.
-
-
-Upgrading Existing Functions
-============================
-
-Upgrading functions to work with new data types will be a deliberate and
-involved process, because one needs to consider not only the mechanisms for
-processing Unicode characters, for example, but also the semantics of
-the function.
-
-The main tenet of the upgrade process should be that when processing Unicode
-strings, the unit of operation is a code point, not a code unit or a byte.
-For example, strlen() returns the number of code points in the string.
-
-    strlen('abc') = 3
-    strlen('ab\U010000') = 3
-    strlen('ab\uD800\uDC00') = 3 /* not 4 */
-
-Function upgrade guidelines are available in a separate document.
-
-
-Document TODO
-==========================================
-- Streams support for Unicode - What stream filters will be provided?
-- User conversion error handler
-- INI files encoding - UTF-8? Do we support BOMs?
-- There are likely to be other issues which are missing from this document
+Unlike identifiers, numbers must consist only of ASCII digits, and are
+restricted to the en_US_POSIX or C locale. In other words, numbers have no
+thousands separator, and the fractional separator is (.) "full stop". Numeric
+strings adhere to the same rules, so "10,3" is not interpreted as a number even
+if the current locale's fractional separator is a comma.
+
+TextIterators
+=============
+
+Instead of using the offset operator [] to access characters in a linear
+fashion, use a TextIterator. TextIterator is very fast and enables you
+to iterate over code points, combining sequences, characters, words, lines, and
+sentences, both forward and backward. For example:
+
+    $text = "nai\u0308ve";
+    foreach (new TextIterator($text) as $u) {
+        var_inspect($u);
+    }
+
+lists six code points, including the umlaut (U+0308) as a separate code point.
+Instantiating the TextIterator to iterate over characters,
+
+    $text = "nai\u0308ve";
+    foreach (new TextIterator($text, TextIterator::CHARACTER) as $u) {
+        var_inspect($u);
+    }
+lists five characters, including an "i" with an umlaut as a single character.

-Build System
-============
+Locales
+=======

-Unicode support in PHP is always enabled. The only configuration option
-during development should be the location of the ICU headers and libraries.
+Unicode support in PHP relies exclusively on ICU locales, NOT the POSIX locales
+installed on the system. You may access the default ICU locale using:

-  --with-icu-dir=<dir>        <dir> parameter specifies the location of ICU
-                              header and library files.
+  locale_set_default()
+  locale_get_default()

-After the initial development we have to repackage ICU library for our needs
-and bundle it with PHP.
+ICU locale IDs have a somewhat different format from POSIX locale IDs. The ICU
+syntax is:

+  <language>[_<script>]_<country>[_<variant>][@<keywords>]

-Document History
-================
-  0.6: Remove notion of native encoding string, only 2 string types are used
-       now. Update conversion error behavior section and parameter parsing.
-       Bring the document up-to-date with reality in general.
-
-  0.5: Updated per latest discussions. Removed tentative language in several
-       places, since we have decided on everything described here already.
-       Clarified details according to Phase II progress.
-
-  0.4: Updated to include all the latest discussions. Updated development
-       phases.
+For example, [EMAIL PROTECTED] is Serbian (Latin, Yugoslavia,
+Revised Orthography, Currency=US Dollar).

-  0.3: Updated to include all the latest discussions.
+Do not use the deprecated setlocale() function. This function interacts with the
+POSIX locale. If Unicode semantics are on, using setlocale() generates
+a deprecation warning.

-  0.2: Updated Phase I design proposal per discussion on [EMAIL PROTECTED]
-       Modified Internal Encoding section to contain only UTF-16 info..
-       Expanded Script Encoding section.
-       Added Binary Data Type section.
-       Amended Language Modifications section to describe string literals
-       behavior.
-       Amended Build System section.
-
-  0.1: Phase I design proposal
+Document TODO
+==========================================
+- Final review.
+- Fix the HTTP Input Encoding section, that's obsolete now.

 References
@@ -665,5 +643,6 @@
 Authors
 =======
   Andrei Zmievski <[EMAIL PROTECTED]>
+  Evan Goer <[EMAIL PROTECTED]>

-vim: set et :
+vim: set et tw=80 :
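
The distinction the revised README draws between code points, code units, and
characters can be checked outside PHP. What follows is a minimal sketch in
Python, not part of this commit and not PHP 6 code. The helper names
(code_points, utf16_code_units, simple_characters) are hypothetical, and the
character grouping only attaches combining marks to the preceding base code
point, which is a simplification of the full segmentation rules that ICU
implements.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]


def utf16_code_units(text):
    """Count UTF-16 code units; a supplementary character takes two."""
    return len(text.encode("utf-16-le")) // 2


def simple_characters(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a simplification for illustration. Real character
    (grapheme) segmentation, as performed by ICU, applies more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the previous base
        else:
            clusters.append(ch)
    return clusters


# The README's TextIterator sample: "naive" with a combining U+0308 after "i".
text = "nai\u0308ve"
print(len(code_points(text)))        # 6
print(len(simple_characters(text)))  # 5

# One supplementary character (U+10410) after two ASCII letters.
supplementary = "ab\U00010410"
print(len(supplementary))               # 3 code points
print(utf16_code_units(supplementary))  # 4 UTF-16 code units
```

For the sample string this reports six code points but five characters, and for
the supplementary example three code points against four UTF-16 code units,
which is the gap that indexing by code point is meant to hide from scripts.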