vrana Fri Jun 17 07:40:21 2005 EDT

Modified files:
  /phpdoc/en/reference/pcre pattern.modifiers.xml pattern.syntax.xml
Log:
  PCRE 5.0
http://cvs.php.net/diff.php/phpdoc/en/reference/pcre/pattern.modifiers.xml?r1=1.7&r2=1.8&ty=u
Index: phpdoc/en/reference/pcre/pattern.modifiers.xml
diff -u phpdoc/en/reference/pcre/pattern.modifiers.xml:1.7 phpdoc/en/reference/pcre/pattern.modifiers.xml:1.8
--- phpdoc/en/reference/pcre/pattern.modifiers.xml:1.7 Wed Sep 15 03:22:26 2004
+++ phpdoc/en/reference/pcre/pattern.modifiers.xml Fri Jun 17 07:40:20 2005
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="iso-8859-1"?>
-<!-- $Revision: 1.7 $ -->
+<!-- $Revision: 1.8 $ -->
 <!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
 <refentry id="reference.pcre.pattern.modifiers">
 <refnamediv>
@@ -12,6 +12,7 @@
 <para>
 The current possible PCRE modifiers are listed below. The names in parentheses refer to internal PCRE names for these modifiers.
+ Spaces and newlines are ignored in modifiers, other characters cause error.
 </para>
 <para>
 <blockquote>
@@ -179,6 +180,7 @@
 is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32.
+ UTF-8 validity of the pattern is checked since PHP 4.3.5.
 </simpara>
 </listitem>
 </varlistentry>
http://cvs.php.net/diff.php/phpdoc/en/reference/pcre/pattern.syntax.xml?r1=1.8&r2=1.9&ty=u
Index: phpdoc/en/reference/pcre/pattern.syntax.xml
diff -u phpdoc/en/reference/pcre/pattern.syntax.xml:1.8 phpdoc/en/reference/pcre/pattern.syntax.xml:1.9
--- phpdoc/en/reference/pcre/pattern.syntax.xml:1.8 Mon Jun 13 12:30:42 2005
+++ phpdoc/en/reference/pcre/pattern.syntax.xml Fri Jun 17 07:40:21 2005
@@ -1,5 +1,5 @@
 <?xml version="1.0" encoding="iso-8859-1"?>
-<!-- $Revision: 1.8 $ -->
+<!-- $Revision: 1.9 $ -->
 <!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
 <refentry id="reference.pcre.pattern.syntax">
 <refnamediv>
@@ -274,7 +274,7 @@
 </refsect2>
 <refsect2 id="regexp.reference.backslash">
- <title>backslash</title>
+ <title>Backslash</title>
 <para>
 The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any
@@ -358,6 +358,12 @@
 <para>
 After "<literal>\x</literal>", up to two hexadecimal digits are read (letters can be in upper or lower case).
+ In <emphasis>UTF-8 mode</emphasis>, "<literal>\x{...}</literal>" is
+ allowed, where the contents of the braces is a string of hexadecimal
+ digits. It is interpreted as a UTF-8 character whose code number is the
+ given hexadecimal number. The original hexadecimal escape sequence,
+ <literal>\xhh</literal>, matches a two-byte UTF-8 character if the value
+ is greater than 127.
 </para>
 <para>
 After "<literal>\0</literal>" up to two further octal digits are read.
@@ -545,7 +551,11 @@
 </varlistentry>
 <varlistentry>
 <term><emphasis>\z</emphasis></term>
- <listitem><simpara>end of subject(independent of multiline mode)</simpara></listitem>
+ <listitem><simpara>end of subject (independent of multiline mode)</simpara></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>\G</emphasis></term>
+ <listitem><simpara>first matching position in subject</simpara></listitem>
 </varlistentry>
 </variablelist>
 </para>
@@ -575,6 +585,14 @@
 newline that is the last character of the string as well as at the end of the string, whereas <literal>\z</literal> matches only at the end.
 </para>
+ <para>
+ The <literal>\G</literal> assertion is true only when the current
+ matching position is at the start point of the match, as specified by
+ the <parameter>offset</parameter> argument of
+ <function>preg_match</function>. It differs from <literal>\A</literal>
+ when the value of <parameter>offset</parameter> is non-zero.
+ It is available since PHP 4.3.3.
+ </para>
 <para>
 <literal>\Q</literal> and <literal>\E</literal> can be used to ignore
@@ -586,6 +604,116 @@
 </refsect2>
+ <refsect2 id="regexp.reference.unicode">
+ <title>Unicode character properties</title>
+ <para>
+ Since PHP 4.4.0 and 5.1.0, three
+ additional escape sequences to match generic character types are available
+ when <emphasis>UTF-8 mode</emphasis> is selected. They are:
+ </para>
+ <variablelist>
+ <varlistentry>
+ <term><emphasis>\p{xx}</emphasis></term>
+ <listitem><simpara>a character with the xx property</simpara></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>\P{xx}</emphasis></term>
+ <listitem><simpara>a character without the xx property</simpara></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term><emphasis>\X</emphasis></term>
+ <listitem><simpara>an extended Unicode sequence</simpara></listitem>
+ </varlistentry>
+ </variablelist>
+ <para>
+ The property names represented by <literal>xx</literal> above are limited to the Unicode
+ general category properties. Each character has exactly one such
+ property, specified by a two-letter abbreviation. For compatibility with
+ Perl, negation can be specified by including a circumflex between the
+ opening brace and the property name. For example, <literal>\p{^Lu}</literal> is the same
+ as <literal>\P{Lu}</literal>.
+ </para>
+ <para>
+ If only one letter is specified with <literal>\p</literal> or <literal>\P</literal>, it includes all the
+ properties that start with that letter. In this case, in the absence of
+ negation, the curly brackets in the escape sequence are optional; these
+ two examples have the same effect:
+ </para>
+ <literallayout>
+ \p{L}
+ \pL
+ </literallayout>
+ <table>
+ <title>Supported property codes</title>
+ <tgroup cols="2">
+ <tbody>
+ <row><entry><literal>C</literal></entry><entry>Other</entry></row>
+ <row><entry><literal>Cc</literal></entry><entry>Control</entry></row>
+ <row><entry><literal>Cf</literal></entry><entry>Format</entry></row>
+ <row><entry><literal>Cn</literal></entry><entry>Unassigned</entry></row>
+ <row><entry><literal>Co</literal></entry><entry>Private use</entry></row>
+ <row rowsep="1"><entry><literal>Cs</literal></entry><entry>Surrogate</entry></row>
+ <row><entry><literal>L</literal></entry><entry>Letter</entry></row>
+ <row><entry><literal>Ll</literal></entry><entry>Lower case letter</entry></row>
+ <row><entry><literal>Lm</literal></entry><entry>Modifier letter</entry></row>
+ <row><entry><literal>Lo</literal></entry><entry>Other letter</entry></row>
+ <row><entry><literal>Lt</literal></entry><entry>Title case letter</entry></row>
+ <row rowsep="1"><entry><literal>Lu</literal></entry><entry>Upper case letter</entry></row>
+ <row><entry><literal>M</literal></entry><entry>Mark</entry></row>
+ <row><entry><literal>Mc</literal></entry><entry>Spacing mark</entry></row>
+ <row><entry><literal>Me</literal></entry><entry>Enclosing mark</entry></row>
+ <row rowsep="1"><entry><literal>Mn</literal></entry><entry>Non-spacing mark</entry></row>
+ <row><entry><literal>N</literal></entry><entry>Number</entry></row>
+ <row><entry><literal>Nd</literal></entry><entry>Decimal number</entry></row>
+ <row><entry><literal>Nl</literal></entry><entry>Letter number</entry></row>
+ <row rowsep="1"><entry><literal>No</literal></entry><entry>Other number</entry></row>
+ <row><entry><literal>P</literal></entry><entry>Punctuation</entry></row>
+ <row><entry><literal>Pc</literal></entry><entry>Connector punctuation</entry></row>
+ <row><entry><literal>Pd</literal></entry><entry>Dash punctuation</entry></row>
+ <row><entry><literal>Pe</literal></entry><entry>Close punctuation</entry></row>
+ <row><entry><literal>Pf</literal></entry><entry>Final punctuation</entry></row>
+ <row><entry><literal>Pi</literal></entry><entry>Initial punctuation</entry></row>
+ <row><entry><literal>Po</literal></entry><entry>Other punctuation</entry></row>
+ <row rowsep="1"><entry><literal>Ps</literal></entry><entry>Open punctuation</entry></row>
+ <row><entry><literal>S</literal></entry><entry>Symbol</entry></row>
+ <row><entry><literal>Sc</literal></entry><entry>Currency symbol</entry></row>
+ <row><entry><literal>Sk</literal></entry><entry>Modifier symbol</entry></row>
+ <row><entry><literal>Sm</literal></entry><entry>Mathematical symbol</entry></row>
+ <row rowsep="1"><entry><literal>So</literal></entry><entry>Other symbol</entry></row>
+ <row><entry><literal>Z</literal></entry><entry>Separator</entry></row>
+ <row><entry><literal>Zl</literal></entry><entry>Line separator</entry></row>
+ <row><entry><literal>Zp</literal></entry><entry>Paragraph separator</entry></row>
+ <row><entry><literal>Zs</literal></entry><entry>Space separator</entry></row>
+ </tbody>
+ </tgroup>
+ </table>
+ <para>
+ Extended properties such as "Greek" or "InMusicalSymbols" are not
+ supported by PCRE.
+ </para>
+ <para>
+ Specifying caseless matching does not affect these escape sequences.
+ For example, <literal>\p{Lu}</literal> always matches only upper case letters.
+ </para>
+ <para>
+ The <literal>\X</literal> escape matches any number of Unicode characters that form an
+ extended Unicode sequence. <literal>\X</literal> is equivalent to
+ <literal>(?>\PM\pM*)</literal>.
+ </para>
+ <para>
+ That is, it matches a character without the "mark" property, followed
+ by zero or more characters with the "mark" property, and treats the
+ sequence as an atomic group (see below). Characters with the "mark"
+ property are typically accents that affect the preceding character.
+ </para>
+ <para>
+ Matching characters by Unicode property is not fast, because PCRE has
+ to search a structure that contains data for over fifteen thousand
+ characters. That is why the traditional escape sequences such as <literal>\d</literal> and
+ <literal>\w</literal> do not use Unicode properties in PCRE.
+ </para>
+ </refsect2>
+
 <refsect2 id="regexp.reference.circudollar">
 <title>Circumflex and dollar</title>
 <para>
@@ -646,7 +774,7 @@
 </refsect2>
 <refsect2 id="regexp.reference.dot">
- <title>FULL STOP</title>
+ <title>Full stop</title>
 <para>
 Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing
@@ -658,6 +786,11 @@
 both involve newline characters. Dot has no special meaning in a character class.
 </para>
+ <para>
+ <emphasis>\C</emphasis> can be used to match single byte. It makes sense
+ in <emphasis>UTF-8 mode</emphasis> where full stop matches the whole
+ character which can consist of multiple bytes.
+ </para>
 </refsect2>
 <refsect2 id="regexp.reference.squarebrackets">
@@ -862,7 +995,7 @@
 </refsect2>
 <refsect2 id="regexp.reference.subpatterns">
- <title>subpatterns</title>
+ <title>Subpatterns</title>
 <para>
 Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern
@@ -1119,7 +1252,7 @@
 </refsect2>
 <refsect2 id="regexp.reference.back-references">
- <title>BACK REFERENCES</title>
+ <title>Back references</title>
 <para>
 Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back
@@ -1479,7 +1612,12 @@
 in parentheses.
 </para>
 <para>
- If the condition is not a sequence of digits, it must be an
+ If the condition is the string <literal>(R)</literal>, it is satisfied if
+ a recursive call to the pattern or subpattern has been made. At "top
+ level", the condition is false.
+ </para>
+ <para>
+ If the condition is not a sequence of digits or (R), it must be an
 assertion. This may be a positive or negative lookahead or lookbehind assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on
@@ -1585,6 +1723,19 @@
 for recursive subpatterns too. It is also possible to use named subpatterns: <literal>(?P>foo)</literal>.
 </para>
+ <para>
+ If the syntax for a recursive subpattern reference (either by number or
+ by name) is used outside the parentheses to which it refers, it operates
+ like a subroutine in a programming language. An earlier example
+ pointed out that the pattern
+ <literal>(sens|respons)e and \1ibility</literal>
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If instead the pattern
+ <literal>(sens|respons)e and (?1)ibility</literal>
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Such references must, however, follow the subpattern to
+ which they refer.
+ </para>
 </refsect2>