[perl.git] branch blead, updated. v5.25.2-42-gfea12a3

Karl Williamson Sun, 26 Jun 2016 09:19:16 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/fea12a3ecdb8d0cbe872c0aea1e2828112bcba12?hp=793b60b28c235d675aa6db799d335a684f9ee704>


- Log -----------------------------------------------------------------
commit fea12a3ecdb8d0cbe872c0aea1e2828112bcba12
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Jun 25 22:37:38 2016 -0600

    Update perlunicode
    
    This fixes a couple of nits, but mostly it updates the text to
    correspond with changes in Unicode UTS#18, concerning regular
    expressions, and Perl compatibility with what it says.
    
    Note that though this Unicode document's text is written as if it were
    imposing requirements, it is not technically a part of the Unicode
    standard, so its "requirements" are merely suggestions or guidelines.
    
    It turns out that several of the "requirements" that Perl didn't meet
    have been retracted by Unicode (as effectively unimplementable), so the
    Perl Unicode support is actually better than it appeared, and in fact,
    is almost complete at the first 2 (of 3) levels of support discussed in
    UTS#18.

M       pod/perlunicode.pod

commit d2b457d752c03448e8006f00a3761b5f542000d6
Author: Karl Williamson <k...@cpan.org>
Date:   Sat Jun 25 22:37:21 2016 -0600

    perlunicode: Fix misstatement
    
    v5.24 reinstated the ability to compile any earlier version of the
    Unicode standard into Perl, but this pod did not get updated.

M       pod/perlunicode.pod
-----------------------------------------------------------------------

Summary of changes:
 pod/perlunicode.pod | 174 ++++++++++++++++++++++++++++------------------------
 1 file changed, 95 insertions(+), 79 deletions(-)

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 775a430..e3eebdb 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -75,7 +75,7 @@ utf8>> is needed.>  (See L<utf8>).
 
 =item C<BOM>-marked scripts and L<UTF-16|/Unicode Encodings> scripts autodetected
 
-However, if a Perl script begins with the Unicode C<BOM> (UTF-16LE,
+If a Perl script begins with the Unicode C<BOM> (UTF-16LE,
 UTF16-BE, or UTF-8), or if the script looks like non-C<BOM>-marked
 UTF-16 of either endianness, Perl will correctly read in the script as
 the appropriate Unicode encoding.  (C<BOM>-less UTF-8 cannot be
@@ -1069,38 +1069,40 @@ See L<Encode>.
 =head2 Unicode Regular Expression Support Level
 
 The following list of Unicode supported features for regular expressions describes
-all features currently directly supported by core Perl.  The references to "Level N"
-and the section numbers refer to the Unicode Technical Standard #18,
-"Unicode Regular Expressions", version 13, from August 2008.
-
-=over 4
-
-=item *
-
-Level 1 - Basic Unicode Support
-
- RL1.1   Hex Notation                     - done          [1]
- RL1.2   Properties                       - done          [2][3]
- RL1.2a  Compatibility Properties         - done          [4]
- RL1.3   Subtraction and Intersection     - experimental  [5]
- RL1.4   Simple Word Boundaries           - done          [6]
- RL1.5   Simple Loose Matches             - done          [7]
- RL1.6   Line Boundaries                  - MISSING       [8][9]
- RL1.7   Supplementary Code Points        - done          [10]
+all features currently directly supported by core Perl.  The references
+to "Level I<N>" and the section numbers refer to
+L<UTS#18 "Unicode Regular Expressions"|http://www.unicode.org/reports/tr18>,
+version 13, November 2013.
+
+=head3 Level 1 - Basic Unicode Support
+
+ RL1.1   Hex Notation                     - Done          [1]
+ RL1.2   Properties                       - Done          [2]
+ RL1.2a  Compatibility Properties         - Done          [3]
+ RL1.3   Subtraction and Intersection     - Experimental  [4]
+ RL1.4   Simple Word Boundaries           - Done          [5]
+ RL1.5   Simple Loose Matches             - Done          [6]
+ RL1.6   Line Boundaries                  - Partial       [7]
+ RL1.7   Supplementary Code Points        - Done          [8]
 
 =over 4
 
 =item [1] C<\N{U+...}> and C<\x{...}>
 
-=item [2] C<\p{...}> C<\P{...}>
+=item [2]
+C<\p{...}> C<\P{...}>.  This requirement is for a minimal list of
+properties.  Perl supports these and all other Unicode character
+properties, as R2.7 asks (see L</"Unicode Character Properties"> above).
 
-=item [3] supports not only minimal list, but all Unicode character
-properties (see Unicode Character Properties above)
+=item [3]
+Perl has C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
+C<[:^I<prop>:]>, plus all the properties specified by
+L<http://www.unicode.org/reports/tr18/#Compatibility_Properties>.  These
+are described above in L</Other Properties>
 
-=item [4] C<\d> C<\D> C<\s> C<\S> C<\w> C<\W> C<\X> C<[:I<prop>:]>
-C<[:^I<prop>:]>
+=item [4]
 
-=item [5] The experimental feature starting in v5.18 C<"(?[...])"> accomplishes
+The experimental feature C<"(?[...])"> starting in v5.18 accomplishes
 this.
 
 See L<perlre/(?[ ])>.  If you don't want to use an experimental
@@ -1109,7 +1111,6 @@ feature, you can use one of the following:
 =over 4
 
 =item *
-
 Regular expression lookahead
 
 You can mimic class subtraction using lookahead.
@@ -1143,9 +1144,12 @@ C<"+"> for union, C<"-"> for removal (set-difference), C<"&"> for intersection
 
 =back
 
-=item [6] C<\b> C<\B>
+=item [5]
+C<\b> C<\B> meet most, but not all, the details of this requirement, but
+C<\b{wb}> and C<\B{wb}> do, as well as the stricter R2.3.
+
+=item [6]
 
-=item [7]
 Note that Perl does Full case-folding in matching, not Simple:
 
 For example C<U+1F88> is equivalent to C<U+1F00 U+03B9>, instead of just
@@ -1154,9 +1158,18 @@ letters with certain modifiers: the Full case-folding decomposes the
 letter, while the Simple case-folding would map it to a single
 character.
 
-=item [8]
-Perl treats C<\n> as the start- and end-line delimiter.  Unicode
-specifies more characters that should be so-interpreted.
+=item [7]
+
+The reason this is considered to be only partially implemented is that
+Perl has L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> and
+C<L<Unicode::LineBreak>> that are conformant with
+L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
+The regular expression construct provides default behavior, while the
+heavier-weight module provides customizable line breaking.
+
+But Perl treats C<\n> as the start- and end-line
+delimiter, whereas Unicode specifies more characters that should be
+so-interpreted.
 
 These are:
 
@@ -1176,63 +1189,66 @@ Also, lines should not be split within C<CRLF> (i.e. there is no
 empty line between C<\r> and C<\n>).  For C<CRLF>, try the C<:crlf>
 layer (see L<PerlIO>).
 
-=item [9] But C<qr/\b{lb}/> and C<L<Unicode::LineBreak>> are available.
-
-L<C<qrE<sol>\b{lb}E<sol>>|perlrebackslash/\b{lb}> supplies default line
-breaking conformant with
-L<UAX#14 "Unicode Line Breaking Algorithm"|http://www.unicode.org/reports/tr14>.
-
-And, the module C<L<Unicode::LineBreak>> also conformant with UAX#14,
-provides customizable line breaking.
-
-=item [10]
+=item [8]
 UTF-8/UTF-EBDDIC used in Perl allows not only C<U+10000> to
 C<U+10FFFF> but also beyond C<U+10FFFF>
 
 =back
 
-=item *
+=head3 Level 2 - Extended Unicode Support
 
-Level 2 - Extended Unicode Support
+ RL2.1   Canonical Equivalents           - Retracted     [9]
+                                           by Unicode
+ RL2.2   Extended Grapheme Clusters      - Partial       [10]
+ RL2.3   Default Word Boundaries         - Done          [11]
+ RL2.4   Default Case Conversion         - Done
+ RL2.5   Name Properties                 - Done
+ RL2.6   Wildcard Properties             - Missing
+ RL2.7   Full Properties                 - Done
 
- RL2.1   Canonical Equivalents           - MISSING       [10][11]
- RL2.2   Default Grapheme Clusters       - MISSING       [12]
- RL2.3   Default Word Boundaries         - DONE          [14]
- RL2.4   Default Loose Matches           - MISSING       [15]
- RL2.5   Name Properties                 - DONE
- RL2.6   Wildcard Properties             - MISSING
+=over 4
 
- [10] see UAX#15 "Unicode Normalization Forms"
- [11] have Unicode::Normalize but not integrated to regexes
- [12] have \X and \b{gcb} but we don't have a "Grapheme Cluster
-      Mode"
- [14] see UAX#29, Word Boundaries
- [15] This is covered in Chapter 3.13 (in Unicode 6.0)
+=item [9]
+Unicode has rewritten this portion of UTS#18 to say that getting
+canonical equivalence (see UAX#15
+L<"Unicode Normalization Forms"|http://www.unicode.org/reports/tr15>)
+is basically to be done at the programmer level.  Use NFD to write
+both your regular expressions and text to match them against (you
+can use L<Unicode::Normalize>).
 
-=item *
+=item [10]
+Perl has C<\X> and C<\b{gcb}> but we don't have a "Grapheme Cluster Mode".
+
+=item [11] see
+L<UAX#29 "Unicode Text Segmentation"|http://www.unicode.org/reports/tr29>,
+
+=back
+
+=head3 Level 3 - Tailored Support
+
+ RL3.1   Tailored Punctuation            - Missing
+ RL3.2   Tailored Grapheme Clusters      - Missing       [12]
+ RL3.3   Tailored Word Boundaries        - Missing
+ RL3.4   Tailored Loose Matches          - Retracted by Unicode
+ RL3.5   Tailored Ranges                 - Retracted by Unicode
+ RL3.6   Context Matching                - Missing       [13]
+ RL3.7   Incremental Matches             - Missing
+ RL3.8   Unicode Set Sharing             - Unicode is proposing
+                                           to retract this
+ RL3.9   Possible Match Sets             - Missing
+ RL3.10  Folded Matching                 - Retracted by Unicode
+ RL3.11  Submatchers                     - Missing
+
+=over 4
+
+=item [12]
+Perl has L<Unicode::Collate>, but it isn't integrated with regular
+expressions.  See
+L<UTS#10 "Unicode Collation Algorithms"|http://www.unicode.org/reports/tr10>.
 
-Level 3 - Tailored Support
-
- RL3.1   Tailored Punctuation            - MISSING
- RL3.2   Tailored Grapheme Clusters      - MISSING       [17][18]
- RL3.3   Tailored Word Boundaries        - MISSING
- RL3.4   Tailored Loose Matches          - MISSING
- RL3.5   Tailored Ranges                 - MISSING
- RL3.6   Context Matching                - MISSING       [19]
- RL3.7   Incremental Matches             - MISSING
-      ( RL3.8   Unicode Set Sharing )
- RL3.9   Possible Match Sets             - MISSING
- RL3.10  Folded Matching                 - MISSING       [20]
- RL3.11  Submatchers                     - MISSING
-
- [17] see UAX#10 "Unicode Collation Algorithms"
- [18] have Unicode::Collate but not integrated to regexes
- [19] have (?<=x) and (?=x), but lookaheads or lookbehinds
-      should see outside of the target substring
- [20] need insensitive matching for linguistic features other
-      than case; for example, hiragana to katakana, wide and
-      narrow, simplified Han to traditional Han (see UTR#30
-      "Character Foldings")
+=item [13]
+Perl has C<(?<=x)> and C<(?=x)>, but lookaheads or lookbehinds should
+see outside of the target substring
 
 =back
 
@@ -1827,7 +1843,7 @@ the XS level, and L<perlapi/Unicode Support> for the API details.
 Perl by default comes with the latest supported Unicode version built-in, but
 the goal is to allow you to change to use any earlier one.  In Perls
 v5.20 and v5.22, however, the earliest usable version is Unicode 5.1.
-Perl v5.18 is able to handle all earlier versions.
+Perl v5.18  and v5.24 are able to handle all earlier versions.
 
 Download the files in the desired version of Unicode from the Unicode web
 site L<http://www.unicode.org>).  These should replace the existing files in
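
Note [9] in the patch above says canonical equivalence is left to the
programmer: put both the pattern and the target text into NFD before
matching. What follows is only a minimal sketch of that advice, not code
from the commit. The sample strings and variable names are made up for
illustration; the only library call used is NFD() from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Two canonically equivalent spellings of the same word (illustrative data).
my $precomposed = "caf\N{U+00E9}";     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
my $decomposed  = "cafe\N{U+0301}";    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization, a pattern written with the precomposed character
# does not match the decomposed text.
my $raw_match = $decomposed =~ /\N{U+00E9}/ ? 1 : 0;

# Normalize the pattern source to NFD, then quote it so it is matched literally.
my $needle = NFD("\N{U+00E9}");
my $re     = qr/\Q$needle\E/;

# Normalize each target string to NFD before matching against the NFD pattern.
my @hits = map { NFD($_) =~ $re ? 1 : 0 } ($precomposed, $decomposed);

print "raw match on decomposed text: $raw_match\n";
print "NFD matches: @hits\n";
```

With both sides normalized, the precomposed and decomposed spellings are
treated alike, which is the behavior the retracted RL2.1 used to ask the
regex engine itself to provide.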

--
Perl5 Master Repository
