[perl.git] branch blead updated. v5.29.8-108-g912b808cb4

Karl Williamson Thu, 14 Mar 2019 11:18:29 -0700

In perl.git, the branch blead has been updated

<https://perl5.git.perl.org/perl.git/commitdiff/912b808cb4fcd596e07f77898c626f5567fbe994?hp=bfa9f5ee70ce509f0e66dcff9e9fda131ea8a133>


- Log -----------------------------------------------------------------
commit 912b808cb4fcd596e07f77898c626f5567fbe994
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 14 11:50:10 2019 -0600

    regnodes.h, perldebguts: Shorten some descriptions

commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 14 11:48:11 2019 -0600

    Any Common digit set can match in any script
    
    This fixes a design flaw in script runs that in 5.30 effectively
    prevented digits from the Common script except the ASCII [0-9] from
    being in any meaningful script run.

-----------------------------------------------------------------------

Summary of changes:
 pod/perldebguts.pod | 37 +++++++++++++++++--------------------
 pod/perldelta.pod   | 19 +++++++++++++++++++
 pod/perlre.pod      | 19 ++++++++-----------
 regcomp.sym         | 20 ++++++++++----------
 regexec.c           | 39 ++++++++++++---------------------------
 regnodes.h          | 20 ++++++++++----------
 t/re/script_run.t   | 19 +++++++++++++++++--
 7 files changed, 93 insertions(+), 80 deletions(-)

diff --git a/pod/perldebguts.pod b/pod/perldebguts.pod
index 2aa906e903..ff2eaed89b 100644
--- a/pod/perldebguts.pod
+++ b/pod/perldebguts.pod
@@ -587,7 +587,7 @@ will be lost.
  BOUNDL           no         Like BOUND/BOUNDU, but \w and \W are
                              defined by current locale
  BOUNDU           no         Match "" at any boundary of a given type
-                             using Unicode rules
+                             using /u rules.
  BOUNDA           no         Match "" at any boundary between \w\W or
                              \W\w, where \w is [_a-zA-Z0-9]
  NBOUND           no         Like NBOUNDA for non-utf8, otherwise match
@@ -595,7 +595,7 @@ will be lost.
  NBOUNDL          no         Like NBOUND/NBOUNDU, but \w and \W are
                              defined by current locale
  NBOUNDU          no         Match "" at any non-boundary of a given
-                             type using using Unicode rules
+                             type using using /u rules.
  NBOUNDA          no         Match "" betweeen any \w\w or \W\W, where
                              \w is [_a-zA-Z0-9]
 
@@ -720,28 +720,25 @@ will be lost.
  SRCLOSE          none       Close preceding SROPEN
 
  REF              num 1      Match some already matched string
- REFF             num 1      Match already matched string, folded using
-                             native charset rules for non-utf8
- REFFL            num 1      Match already matched string, folded in
-                             loc.
- REFFU            num 1      Match already matched string, folded using
-                             unicode rules for non-utf8
- REFFA            num 1      Match already matched string, folded using
-                             unicode rules for non-utf8, no mixing
-                             ASCII, non-ASCII
+ REFF             num 1      Match already matched string, using /di
+                             rules.
+ REFFL            num 1      Match already matched string, using /li
+                             rules.
+ REFFU            num 1      Match already matched string, usng /ui.
+ REFFA            num 1      Match already matched string, using /aai
+                             rules.
 
  # Named references.  Code in regcomp.c assumes that these all are after
  # the numbered references
  NREF             no-sv 1    Match some already matched string
- NREFF            no-sv 1    Match already matched string, folded using
-                             native charset rules for non-utf8
- NREFFL           no-sv 1    Match already matched string, folded in
-                             loc.
- NREFFU           num 1      Match already matched string, folded using
-                             unicode rules for non-utf8
- NREFFA           num 1      Match already matched string, folded using
-                             unicode rules for non-utf8, no mixing
-                             ASCII, non-ASCII
+ NREFF            no-sv 1    Match already matched string, using /di
+                             rules.
+ NREFFL           no-sv 1    Match already matched string, using /li
+                             rules.
+ NREFFU           num 1      Match already matched string, using /ui
+                             rules.
+ NREFFA           num 1      Match already matched string, using /aai
+                             rules.
 
  # Support for long RE
  LONGJMP          off 1 1    Jump far away.
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 06ae872679..68f4ba9fac 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -91,6 +91,20 @@ It croaks if it would otherwise return a UTF-8 string that contains
 malformed UTF-8.  This protects agains potential security threats.  This
 is considered a bug fix as well ([perl #131642]).
 
+=head2 Any set of digits in the Common script are legal in a script run
+of another script
+
+There are several sets of digits in the Common script.  C<[0-9]> is the
+most familiar.  But there are also C<[\x{FF10}-\x{FF19}]> (FULLWIDTH
+DIGIT ZERO - FULLWIDTH DIGIT NINE), and several sets for use in
+mathematical notation, such as the MATHEMATICAL DOUBLE-STRUCK DIGITs.
+Any of these sets should be able to appear in script runs of, say,
+Greek.  But the design of 5.30 overlooked all but the ASCII digits
+C<[0-9]>, so the design was flawed.  This has been fixed, so is both a
+bug fix and an incompatibility. [perl #133547]
+
+All digits in a run still have to come from the same set of ten digits.
+
 =head1 Deprecations
 
 XXX Any deprecated features, syntax, modules etc. should be listed here.
@@ -430,6 +444,11 @@ C<pack()> no longer can return malformed UTF-8.  It croaks if it would
 otherwise return a UTF-8 string that contains malformed UTF-8.  This
 protects agains potential security threats.  [perl #131642]
 
+=item *
+
+See L</Any set of digits in the Common script are legal in a script run
+of another script>.
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 209cac7f8d..4898f94d9f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2550,15 +2550,12 @@ Katakana and Hiragana are commonly mixed together in practice, along
 with some Chinese characters, and hence are treated as being in a single
 script run by Perl.
 
-The rules used for matching decimal digits are somewhat different.  Many
+The rules used for matching decimal digits are slightly stricter.  Many
 scripts have their own sets of digits equivalent to the Western C<0>
 through C<9> ones.  A few, such as Arabic, have more than one set.  For
 a string to be considered a script run, all digits in it must come from
-the same set, as determined by the first digit encountered. The ASCII
-C<[0-9]> are accepted as being in any script, even those that have their
-own set.  This is because these are often used in commerce even in such
-scripts.  But any mixing of the ASCII and other digits will cause the
-sequence to not be a script run, failing the match.  As an example,
+the same set of ten, as determined by the first digit encountered.
+As an example,
 
  qr/(*script_run: \d+ \b )/x
 
@@ -2579,11 +2576,11 @@ accent of some type.  These are considered to be in the script of the
 master character, and so never cause a script run to not match.
 
 The other one is "Common".  This consists of mostly punctuation, emoji,
-and characters used in mathematics and music, and the ASCII digits C<0>
-through C<9>.  These characters can appear intermixed in text in many of
-the world's scripts.  These also don't cause a script run to not match,
-except any ASCII digits encountered have to obey the decimal digit rules
-described above.
+and characters used in mathematics and music, the ASCII digits C<0>
+through C<9>, and full-width forms of these digits.  These characters
+can appear intermixed in text in many of the world's scripts.  These
+also don't cause a script run to not match.  But like other scripts, all
+digits in a run must come from the same set of 10.
 
 This construct is non-capturing.  You can add parentheses to I<pattern>
 to capture, if desired.  You will have to do this if you plan to use
diff --git a/regcomp.sym b/regcomp.sym
index 09a21e9cc0..4b9a42c338 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -47,12 +47,12 @@ GPOS        GPOS,       no        ; Matches where last m//g left off.
 # BOUND, POSIX and their complements are affected, as well as EXACTF.
 BOUND       BOUND,      no        ; Like BOUNDA for non-utf8, otherwise match "" between any Unicode \w\W or \W\w
 BOUNDL      BOUND,      no        ; Like BOUND/BOUNDU, but \w and \W are defined by current locale
-BOUNDU      BOUND,      no        ; Match "" at any boundary of a given type using Unicode rules
+BOUNDU      BOUND,      no        ; Match "" at any boundary of a given type using /u rules.
 BOUNDA      BOUND,      no        ; Match "" at any boundary between \w\W or \W\w, where \w is [_a-zA-Z0-9]
 # All NBOUND nodes are required by code in regexec.c to be greater than all BOUND ones
 NBOUND      NBOUND,     no        ; Like NBOUNDA for non-utf8, otherwise match "" between any Unicode \w\w or \W\W
 NBOUNDL     NBOUND,     no        ; Like NBOUND/NBOUNDU, but \w and \W are defined by current locale
-NBOUNDU     NBOUND,     no        ; Match "" at any non-boundary of a given type using using Unicode rules
+NBOUNDU     NBOUND,     no        ; Match "" at any non-boundary of a given type using using /u rules.
 NBOUNDA     NBOUND,     no        ; Match "" betweeen any \w\w or \W\W, where \w is [_a-zA-Z0-9]
 
 #* [Special] alternatives:
@@ -156,21 +156,21 @@ SROPEN      SROPEN,     none      ; Same as OPEN, but for script run
 SRCLOSE     SRCLOSE,    none      ; Close preceding SROPEN
 
 REF         REF,        num 1 V   ; Match some already matched string
-REFF        REF,        num 1 V   ; Match already matched string, folded using native charset rules for non-utf8
-REFFL       REF,        num 1 V   ; Match already matched string, folded in loc.
+REFF        REF,        num 1 V   ; Match already matched string, using /di rules.
+REFFL       REF,        num 1 V   ; Match already matched string, using /li rules.
 # N?REFF[AU] could have been implemented using the FLAGS field of the
 # regnode, but by having a separate node type, we can use the existing switch
 # statement to avoid some tests
-REFFU       REF,        num 1 V   ; Match already matched string, folded using unicode rules for non-utf8
-REFFA       REF,        num 1 V   ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII
+REFFU       REF,        num 1 V   ; Match already matched string, usng /ui.
+REFFA       REF,        num 1 V   ; Match already matched string, using /aai rules.
 
 #*Named references.  Code in regcomp.c assumes that these all are after
 #*the numbered references
 NREF        REF,        no-sv 1 V ; Match some already matched string
-NREFF       REF,        no-sv 1 V ; Match already matched string, folded using native charset rules for non-utf8
-NREFFL      REF,        no-sv 1 V ; Match already matched string, folded in loc.
-NREFFU      REF,        num   1 V ; Match already matched string, folded using unicode rules for non-utf8
-NREFFA      REF,        num   1 V ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII
+NREFF       REF,        no-sv 1 V ; Match already matched string, using /di rules.
+NREFFL      REF,        no-sv 1 V ; Match already matched string, using /li rules.
+NREFFU      REF,        num   1 V ; Match already matched string, using /ui rules.
+NREFFA      REF,        num   1 V ; Match already matched string, using /aai rules.
 
 #*Support for long RE
 LONGJMP     LONGJMP,    off 1 . 1 ; Jump far away.
diff --git a/regexec.c b/regexec.c
index 64a65462b5..dff221a99c 100644
--- a/regexec.c
+++ b/regexec.c
@@ -10252,11 +10252,13 @@ Additionally all decimal digits must come from the same consecutive sequence of
 
 For example, if all the characters in the sequence are Greek, or Common, or
 Inherited, this function will return TRUE, provided any decimal digits in it
-are the ASCII digits "0".."9".  For scripts (unlike Greek) that have their own
-digits defined this will accept either digits from that set or from 0..9, but
-not a combination of the two.  Some scripts, such as Arabic, have more than one
-set of digits.  All digits must come from the same set for this function to
-return TRUE.
+are from the same block of digits in Common.  (These are the ASCII digits
+"0".."9" and additionally a block for full width forms of these, and several
+others used in mathematical notation.)   For scripts (unlike Greek) that have
+their own digits defined this will accept either digits from that set or from
+one of the Common digit sets, but not a combination of the two.  Some scripts,
+such as Arabic, have more than one set of digits.  All digits must come from
+the same set for this function to return TRUE.
 
 C<*ret_script>, if C<ret_script> is not NULL, will on return of TRUE
 contain the script found, using the C<SCX_enum> typedef.  Its value will be
@@ -10359,10 +10361,9 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
         UV cp;
 
         /* The code allows all scripts to use the ASCII digits.  This is
-         * because they are used in commerce even in scripts that have their
-         * own set.  Hence any ASCII ones found are ok, unless and until a
-         * digit from another set has already been encountered.  (The other
-         * digit ranges in Common are not similarly blessed) */
+         * because they are in the Common script.  Hence any ASCII ones found
+         * are ok, unless and until a digit from another set has already been
+         * encountered.  digit ranges in Common are not similarly blessed) */
         if (UNLIKELY(isDIGIT(*s))) {
             if (UNLIKELY(script_of_run == SCX_Unknown)) {
                 retval = FALSE;
@@ -10456,19 +10457,11 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
         /* If the run so far is Common, and the new character isn't, change the
          * run's script to that of this character */
         if (script_of_run == SCX_Common && script_of_char != SCX_Common) {
-
-            /* But Common contains several sets of digits.  Only the '0' set
-             * can be part of another script. */
-            if (zero_of_run && zero_of_run != '0') {
-                retval = FALSE;
-                break;
-            }
-
             script_of_run = script_of_char;
         }
 
-        /* Now we can see if the script of the character is the same as that of
-         * the run */
+        /* Now we can see if the script of the new character is the same as
+         * that of the run */
         if (LIKELY(script_of_char == script_of_run)) {
             /* By far the most common case */
             goto scripts_match;
@@ -10668,14 +10661,6 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
                 break;
             }
         }
-        else if (script_of_char == SCX_Common && script_of_run != SCX_Common) {
-
-            /* Here, the script run isn't Common, but the current digit is in
-             * Common, and isn't '0'-'9' (those were handled earlier).   Only
-             * '0'-'9' are acceptable in non-Common scripts. */
-            retval = FALSE;
-            break;
-        }
         else {  /* Otherwise we now have a zero for this run */
             zero_of_run = zero_of_char;
         }
diff --git a/regnodes.h b/regnodes.h
index 412a630561..3b53c1715f 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -21,11 +21,11 @@
 #define        GPOS                    7       /* 0x07 Matches where last m//g left off. */
 #define        BOUND                   8       /* 0x08 Like BOUNDA for non-utf8, otherwise match "" between any Unicode \w\W or \W\w */
 #define        BOUNDL                  9       /* 0x09 Like BOUND/BOUNDU, but \w and \W are defined by current locale */
-#define        BOUNDU                  10      /* 0x0a Match "" at any boundary of a given type using Unicode rules */
+#define        BOUNDU                  10      /* 0x0a Match "" at any boundary of a given type using /u rules. */
 #define        BOUNDA                  11      /* 0x0b Match "" at any boundary between \w\W or \W\w, where \w is [_a-zA-Z0-9] */
 #define        NBOUND                  12      /* 0x0c Like NBOUNDA for non-utf8, otherwise match "" between any Unicode \w\w or \W\W */
 #define        NBOUNDL                 13      /* 0x0d Like NBOUND/NBOUNDU, but \w and \W are defined by current locale */
-#define        NBOUNDU                 14      /* 0x0e Match "" at any non-boundary of a given type using using Unicode rules */
+#define        NBOUNDU                 14      /* 0x0e Match "" at any non-boundary of a given type using using /u rules. */
 #define        NBOUNDA                 15      /* 0x0f Match "" betweeen any \w\w or \W\W, where \w is [_a-zA-Z0-9] */
 #define        REG_ANY                 16      /* 0x10 Match any one character (except newline). */
 #define        SANY                    17      /* 0x11 Match any one character. */
@@ -72,15 +72,15 @@
 #define        SROPEN                  58      /* 0x3a Same as OPEN, but for script run */
 #define        SRCLOSE                 59      /* 0x3b Close preceding SROPEN */
 #define        REF                     60      /* 0x3c Match some already matched string */
-#define        REFF                    61      /* 0x3d Match already matched string, folded using native charset rules for non-utf8 */
-#define        REFFL                   62      /* 0x3e Match already matched string, folded in loc. */
-#define        REFFU                   63      /* 0x3f Match already matched string, folded using unicode rules for non-utf8 */
-#define        REFFA                   64      /* 0x40 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */
+#define        REFF                    61      /* 0x3d Match already matched string, using /di rules. */
+#define        REFFL                   62      /* 0x3e Match already matched string, using /li rules. */
+#define        REFFU                   63      /* 0x3f Match already matched string, usng /ui. */
+#define        REFFA                   64      /* 0x40 Match already matched string, using /aai rules. */
 #define        NREF                    65      /* 0x41 Match some already matched string */
-#define        NREFF                   66      /* 0x42 Match already matched string, folded using native charset rules for non-utf8 */
-#define        NREFFL                  67      /* 0x43 Match already matched string, folded in loc. */
-#define        NREFFU                  68      /* 0x44 Match already matched string, folded using unicode rules for non-utf8 */
-#define        NREFFA                  69      /* 0x45 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */
+#define        NREFF                   66      /* 0x42 Match already matched string, using /di rules. */
+#define        NREFFL                  67      /* 0x43 Match already matched string, using /li rules. */
+#define        NREFFU                  68      /* 0x44 Match already matched string, using /ui rules. */
+#define        NREFFA                  69      /* 0x45 Match already matched string, using /aai rules. */
 #define        LONGJMP                 70      /* 0x46 Jump far away. */
 #define        BRANCHJ                 71      /* 0x47 BRANCH with long offset. */
 #define        IFMATCH                 72      /* 0x48 Succeeds if the following matches; non-zero flags "f" means lookbehind assertion starting "f" characters before current */
diff --git a/t/re/script_run.t b/t/re/script_run.t
index 035a9104aa..19d4e10e53 100644
--- a/t/re/script_run.t
+++ b/t/re/script_run.t
@@ -51,8 +51,8 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     unlike("\N{HEBREW LETTER ALEF}\N{HEBREW LETTER TAV}\N{MODIFIER LETTER SMALL Y}", $script_run, "Hebrew then Latin isn't a script run");
     like("9876543210\N{DESERET SMALL LETTER WU}", $script_run, "0-9 are the digits for Deseret");
     like("\N{DESERET SMALL LETTER WU}9876543210", $script_run, "Also when they aren't in the initial position");
-    unlike("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits aren't the digits for Deseret");
-    unlike("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
+    like("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits may be digits for Deseret");
+    like("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
 
     like("1234567890\N{ARABIC LETTER ALEF}", $script_run, "[0-9] work for Arabic");
     unlike("1234567890\N{ARABIC LETTER ALEF}\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT FIVE}", $script_run, "... but not in combination with real ARABIC digits");
@@ -104,4 +104,19 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     like("\x{3041}12\x{3041}", qr/^(*sr:.{4})/,
          "Script without own zero works with ASCII digits");
 
+    like("A\x{ff10}\x{ff19}B", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{ff10}BC", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{1d7ce}\x{1d7cf}B", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{1d7ce}BC", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("\x{1d7ce}\x{1d7cf}AB", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("α\x{1d7ce}βγ", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Greek"); # perl #133547
+    like("\x{1d7ce}αβγ", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Greek"); # perl #133547
+
 done_testing();
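
The rule that survives this change ("all digits in a run still have to come from the same set of ten digits") can be illustrated with a small standalone C sketch. This is not the code in regexec.c: the table `digit_zeros` and the helpers `zero_of` and `digits_from_one_set` are hypothetical names invented for this note, the table lists only a handful of digit sets, and the script-matching half of Perl_isSCRIPT_RUN is left out entirely. It only mirrors the idea of remembering the zero of the first digit seen (compare `zero_of_run` in the patch) and rejecting any later digit from a different set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample table: the code point of DIGIT ZERO for a few sets of
 * ten consecutive decimal digits.  The real list is much longer and is
 * derived from the Unicode database. */
static const uint32_t digit_zeros[] = {
    0x0030,  /* DIGIT ZERO (ASCII, Common script) */
    0x0660,  /* ARABIC-INDIC DIGIT ZERO */
    0x06F0,  /* EXTENDED ARABIC-INDIC DIGIT ZERO */
    0xFF10,  /* FULLWIDTH DIGIT ZERO (Common script) */
    0x1D7CE, /* MATHEMATICAL BOLD DIGIT ZERO (Common script) */
};

/* Return the zero of the digit set containing cp, or 0 if cp is not a digit
 * in the sample table. */
static uint32_t zero_of(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof digit_zeros / sizeof digit_zeros[0]; i++) {
        if (cp >= digit_zeros[i] && cp <= digit_zeros[i] + 9)
            return digit_zeros[i];
    }
    return 0;
}

/* Return true if every digit among the n code points comes from the same
 * set of ten.  Non-digits are ignored; the script check is not modelled. */
static bool digits_from_one_set(const uint32_t *cps, size_t n)
{
    uint32_t zero_of_run = 0;   /* 0 means no digit seen yet */
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t zero_of_char = zero_of(cps[i]);
        if (zero_of_char == 0)
            continue;                   /* not a digit */
        if (zero_of_run == 0)
            zero_of_run = zero_of_char; /* first digit fixes the set */
        else if (zero_of_char != zero_of_run)
            return false;               /* digits from two sets were mixed */
    }
    return true;
}
```

Under these assumptions, the code points of "A\x{ff10}\x{ff19}B" from the new tests pass the digit check, while ASCII digits followed by ARABIC LETTER ALEF and ARABIC-INDIC DIGIT FOUR fail it, which matches the expectations in t/re/script_run.t.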

-- 
Perl5 Master Repository
