In perl.git, the branch blead has been updated <https://perl5.git.perl.org/perl.git/commitdiff/912b808cb4fcd596e07f77898c626f5567fbe994?hp=bfa9f5ee70ce509f0e66dcff9e9fda131ea8a133>

- Log -----------------------------------------------------------------
commit 912b808cb4fcd596e07f77898c626f5567fbe994
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 14 11:50:10 2019 -0600

    regnodes.h, perldebguts: Shorten some descriptions

commit f4e61fc03836484ea88518e8bf04cc1b32a6a1a0
Author: Karl Williamson <[email protected]>
Date:   Thu Mar 14 11:48:11 2019 -0600

    Any Common digit set can match in any script

    This fixes a design flaw in script runs that in 5.30 effectively
    prevented digits from the Common script except the ASCII [0-9] from
    being in any meaningful script run.

-----------------------------------------------------------------------

Summary of changes:
 pod/perldebguts.pod | 37 +++++++++++++++++--------------------
 pod/perldelta.pod   | 19 +++++++++++++++++++
 pod/perlre.pod      | 19 ++++++++-----------
 regcomp.sym         | 20 ++++++++++----------
 regexec.c           | 39 ++++++++++++---------------------------
 regnodes.h          | 20 ++++++++++----------
 t/re/script_run.t   | 19 +++++++++++++++++--
 7 files changed, 93 insertions(+), 80 deletions(-)

diff --git a/pod/perldebguts.pod b/pod/perldebguts.pod
index 2aa906e903..ff2eaed89b 100644
--- a/pod/perldebguts.pod
+++ b/pod/perldebguts.pod
@@ -587,7 +587,7 @@ will be lost.
  BOUNDL           no         Like BOUND/BOUNDU, but \w and \W are defined
                              by current locale
  BOUNDU           no         Match "" at any boundary of a given type
-                             using Unicode rules
+                             using /u rules.
  BOUNDA           no         Match "" at any boundary between \w\W or
                              \W\w, where \w is [_a-zA-Z0-9]
  NBOUND           no         Like NBOUNDA for non-utf8, otherwise match
@@ -595,7 +595,7 @@ will be lost.
  NBOUNDL          no         Like NBOUND/NBOUNDU, but \w and \W are defined
                              by current locale
  NBOUNDU          no         Match "" at any non-boundary of a given
-                             type using using Unicode rules
+                             type using using /u rules.
  NBOUNDA          no         Match "" betweeen any \w\w or \W\W, where \w
                              is [_a-zA-Z0-9]
 
@@ -720,28 +720,25 @@ will be lost.
  SRCLOSE          none       Close preceding SROPEN
 
  REF              num 1      Match some already matched string
- REFF             num 1      Match already matched string, folded using
-                             native charset rules for non-utf8
- REFFL            num 1      Match already matched string, folded in
-                             loc.
- REFFU            num 1      Match already matched string, folded using
-                             unicode rules for non-utf8
- REFFA            num 1      Match already matched string, folded using
-                             unicode rules for non-utf8, no mixing
-                             ASCII, non-ASCII
+ REFF             num 1      Match already matched string, using /di
+                             rules.
+ REFFL            num 1      Match already matched string, using /li
+                             rules.
+ REFFU            num 1      Match already matched string, usng /ui.
+ REFFA            num 1      Match already matched string, using /aai
+                             rules.
 
  # Named references.  Code in regcomp.c assumes that these all are after
  # the numbered references
  NREF             no-sv 1    Match some already matched string
- NREFF            no-sv 1    Match already matched string, folded using
-                             native charset rules for non-utf8
- NREFFL           no-sv 1    Match already matched string, folded in
-                             loc.
- NREFFU           num 1      Match already matched string, folded using
-                             unicode rules for non-utf8
- NREFFA           num 1      Match already matched string, folded using
-                             unicode rules for non-utf8, no mixing
-                             ASCII, non-ASCII
+ NREFF            no-sv 1    Match already matched string, using /di
+                             rules.
+ NREFFL           no-sv 1    Match already matched string, using /li
+                             rules.
+ NREFFU           num 1      Match already matched string, using /ui
+                             rules.
+ NREFFA           num 1      Match already matched string, using /aai
+                             rules.
 
  # Support for long RE
  LONGJMP          off 1 1    Jump far away.
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index 06ae872679..68f4ba9fac 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -91,6 +91,20 @@ It croaks if it would otherwise return a UTF-8 string that contains
 malformed UTF-8.  This protects agains potential security threats.  This
 is considered a bug fix as well ([perl #131642]).
 
+=head2 Any set of digits in the Common script are legal in a script run
+of another script
+
+There are several sets of digits in the Common script.  C<[0-9]> is the
+most familiar.  But there are also C<[\x{FF10}-\x{FF19}]> (FULLWIDTH
+DIGIT ZERO - FULLWIDTH DIGIT NINE), and several sets for use in
+mathematical notation, such as the MATHEMATICAL DOUBLE-STRUCK DIGITs.
+Any of these sets should be able to appear in script runs of, say,
+Greek.  But the design of 5.30 overlooked all but the ASCII digits
+C<[0-9]>, so the design was flawed.  This has been fixed, so is both a
+bug fix and an incompatibility.  [perl #133547]
+
+All digits in a run still have to come from the same set of ten digits.
+
 =head1 Deprecations
 
 XXX Any deprecated features, syntax, modules etc. should be listed here.
@@ -430,6 +444,11 @@ C<pack()> no longer can return malformed UTF-8.  It croaks if it would
 otherwise return a UTF-8 string that contains malformed UTF-8.  This
 protects agains potential security threats.  [perl #131642]
 
+=item *
+
+See L</Any set of digits in the Common script are legal in a script run
+of another script>.
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 209cac7f8d..4898f94d9f 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -2550,15 +2550,12 @@ Katakana and Hiragana are commonly mixed together in practice, along
 with some Chinese characters, and hence are treated as being in a single
 script run by Perl.
 
-The rules used for matching decimal digits are somewhat different.  Many
+The rules used for matching decimal digits are slightly stricter.  Many
 scripts have their own sets of digits equivalent to the Western C<0>
 through C<9> ones.  A few, such as Arabic, have more than one set.  For
 a string to be considered a script run, all digits in it must come from
-the same set, as determined by the first digit encountered.  The ASCII
-C<[0-9]> are accepted as being in any script, even those that have their
-own set.  This is because these are often used in commerce even in such
-scripts.  But any mixing of the ASCII and other digits will cause the
-sequence to not be a script run, failing the match.  As an example,
+the same set of ten, as determined by the first digit encountered.
+As an example,
 
  qr/(*script_run: \d+ \b )/x
 
@@ -2579,11 +2576,11 @@ accent of some type.  These are considered to be in the script of the
 master character, and so never cause a script run to not match.
 
 The other one is "Common".  This consists of mostly punctuation, emoji,
-and characters used in mathematics and music, and the ASCII digits C<0>
-through C<9>.  These characters can appear intermixed in text in many of
-the world's scripts.  These also don't cause a script run to not match,
-except any ASCII digits encountered have to obey the decimal digit rules
-described above.
+and characters used in mathematics and music, the ASCII digits C<0>
+through C<9>, and full-width forms of these digits.  These characters
+can appear intermixed in text in many of the world's scripts.  These
+also don't cause a script run to not match.  But like other scripts, all
+digits in a run must come from the same set of 10.
 
 This construct is non-capturing.  You can add parentheses to I<pattern>
 to capture, if desired.  You will have to do this if you plan to use
diff --git a/regcomp.sym b/regcomp.sym
index 09a21e9cc0..4b9a42c338 100644
--- a/regcomp.sym
+++ b/regcomp.sym
@@ -47,12 +47,12 @@ GPOS        GPOS,       no        ; Matches where last m//g left off.
 # BOUND, POSIX and their complements are affected, as well as EXACTF.
 BOUND       BOUND,      no        ; Like BOUNDA for non-utf8, otherwise match "" between any Unicode \w\W or \W\w
 BOUNDL      BOUND,      no        ; Like BOUND/BOUNDU, but \w and \W are defined by current locale
-BOUNDU      BOUND,      no        ; Match "" at any boundary of a given type using Unicode rules
+BOUNDU      BOUND,      no        ; Match "" at any boundary of a given type using /u rules.
 BOUNDA      BOUND,      no        ; Match "" at any boundary between \w\W or \W\w, where \w is [_a-zA-Z0-9]
 # All NBOUND nodes are required by code in regexec.c to be greater than all BOUND ones
 NBOUND      NBOUND,     no        ; Like NBOUNDA for non-utf8, otherwise match "" between any Unicode \w\w or \W\W
 NBOUNDL     NBOUND,     no        ; Like NBOUND/NBOUNDU, but \w and \W are defined by current locale
-NBOUNDU     NBOUND,     no        ; Match "" at any non-boundary of a given type using using Unicode rules
+NBOUNDU     NBOUND,     no        ; Match "" at any non-boundary of a given type using using /u rules.
 NBOUNDA     NBOUND,     no        ; Match "" betweeen any \w\w or \W\W, where \w is [_a-zA-Z0-9]
 
 #* [Special] alternatives:
@@ -156,21 +156,21 @@ SROPEN      SROPEN,     none      ; Same as OPEN, but for script run
 SRCLOSE     SRCLOSE,    none      ; Close preceding SROPEN
 
 REF         REF,        num 1 V   ; Match some already matched string
-REFF        REF,        num 1 V   ; Match already matched string, folded using native charset rules for non-utf8
-REFFL       REF,        num 1 V   ; Match already matched string, folded in loc.
+REFF        REF,        num 1 V   ; Match already matched string, using /di rules.
+REFFL       REF,        num 1 V   ; Match already matched string, using /li rules.
 # N?REFF[AU] could have been implemented using the FLAGS field of the
 # regnode, but by having a separate node type, we can use the existing switch
 # statement to avoid some tests
-REFFU       REF,        num 1 V   ; Match already matched string, folded using unicode rules for non-utf8
-REFFA       REF,        num 1 V   ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII
+REFFU       REF,        num 1 V   ; Match already matched string, usng /ui.
+REFFA       REF,        num 1 V   ; Match already matched string, using /aai rules.
 
 #*Named references.  Code in regcomp.c assumes that these all are after
 #*the numbered references
 NREF        REF,        no-sv 1 V ; Match some already matched string
-NREFF       REF,        no-sv 1 V ; Match already matched string, folded using native charset rules for non-utf8
-NREFFL      REF,        no-sv 1 V ; Match already matched string, folded in loc.
-NREFFU      REF,        num   1 V ; Match already matched string, folded using unicode rules for non-utf8
-NREFFA      REF,        num   1 V ; Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII
+NREFF       REF,        no-sv 1 V ; Match already matched string, using /di rules.
+NREFFL      REF,        no-sv 1 V ; Match already matched string, using /li rules.
+NREFFU      REF,        num   1 V ; Match already matched string, using /ui rules.
+NREFFA      REF,        num   1 V ; Match already matched string, using /aai rules.
 
 #*Support for long RE
 LONGJMP     LONGJMP,    off 1 . 1 ; Jump far away.
diff --git a/regexec.c b/regexec.c
index 64a65462b5..dff221a99c 100644
--- a/regexec.c
+++ b/regexec.c
@@ -10252,11 +10252,13 @@ Additionally all decimal digits must come from the same consecutive sequence of
 
 For example, if all the characters in the sequence are Greek, or Common, or
 Inherited, this function will return TRUE, provided any decimal digits in it
-are the ASCII digits "0".."9".  For scripts (unlike Greek) that have their own
-digits defined this will accept either digits from that set or from 0..9, but
-not a combination of the two.  Some scripts, such as Arabic, have more than one
-set of digits.  All digits must come from the same set for this function to
-return TRUE.
+are from the same block of digits in Common.  (These are the ASCII digits
+"0".."9" and additionally a block for full width forms of these, and several
+others used in mathematical notation.)  For scripts (unlike Greek) that have
+their own digits defined this will accept either digits from that set or from
+one of the Common digit sets, but not a combination of the two.  Some scripts,
+such as Arabic, have more than one set of digits.  All digits must come from
+the same set for this function to return TRUE.
 
 C<*ret_script>, if C<ret_script> is not NULL, will on return of TRUE contain
 the script found, using the C<SCX_enum> typedef.  Its value will be
@@ -10359,10 +10361,9 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
         UV cp;
 
         /* The code allows all scripts to use the ASCII digits.  This is
-         * because they are used in commerce even in scripts that have their
-         * own set.  Hence any ASCII ones found are ok, unless and until a
-         * digit from another set has already been encountered.  (The other
-         * digit ranges in Common are not similarly blessed) */
+         * because they are in the Common script.  Hence any ASCII ones found
+         * are ok, unless and until a digit from another set has already been
+         * encountered. digit ranges in Common are not similarly blessed) */
         if (UNLIKELY(isDIGIT(*s))) {
             if (UNLIKELY(script_of_run == SCX_Unknown)) {
                 retval = FALSE;
@@ -10456,19 +10457,11 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
         /* If the run so far is Common, and the new character isn't, change the
          * run's script to that of this character */
         if (script_of_run == SCX_Common && script_of_char != SCX_Common) {
-
-            /* But Common contains several sets of digits.  Only the '0' set
-             * can be part of another script. */
-            if (zero_of_run && zero_of_run != '0') {
-                retval = FALSE;
-                break;
-            }
-
             script_of_run = script_of_char;
         }
 
-        /* Now we can see if the script of the character is the same as that of
-         * the run */
+        /* Now we can see if the script of the new character is the same as
+         * that of the run */
         if (LIKELY(script_of_char == script_of_run)) {
             /* By far the most common case */
             goto scripts_match;
@@ -10668,14 +10661,6 @@ Perl_isSCRIPT_RUN(pTHX_ const U8 * s, const U8 * send, const bool utf8_target)
                 break;
             }
         }
-        else if (script_of_char == SCX_Common && script_of_run != SCX_Common) {
-
-            /* Here, the script run isn't Common, but the current digit is in
-             * Common, and isn't '0'-'9' (those were handled earlier).  Only
-             * '0'-'9' are acceptable in non-Common scripts. */
-            retval = FALSE;
-            break;
-        }
         else { /* Otherwise we now have a zero for this run */
             zero_of_run = zero_of_char;
         }
diff --git a/regnodes.h b/regnodes.h
index 412a630561..3b53c1715f 100644
--- a/regnodes.h
+++ b/regnodes.h
@@ -21,11 +21,11 @@
 #define GPOS                  7       /* 0x07 Matches where last m//g left off. */
 #define BOUND                 8       /* 0x08 Like BOUNDA for non-utf8, otherwise match "" between any Unicode \w\W or \W\w */
 #define BOUNDL                9       /* 0x09 Like BOUND/BOUNDU, but \w and \W are defined by current locale */
-#define BOUNDU                10      /* 0x0a Match "" at any boundary of a given type using Unicode rules */
+#define BOUNDU                10      /* 0x0a Match "" at any boundary of a given type using /u rules. */
 #define BOUNDA                11      /* 0x0b Match "" at any boundary between \w\W or \W\w, where \w is [_a-zA-Z0-9] */
 #define NBOUND                12      /* 0x0c Like NBOUNDA for non-utf8, otherwise match "" between any Unicode \w\w or \W\W */
 #define NBOUNDL               13      /* 0x0d Like NBOUND/NBOUNDU, but \w and \W are defined by current locale */
-#define NBOUNDU               14      /* 0x0e Match "" at any non-boundary of a given type using using Unicode rules */
+#define NBOUNDU               14      /* 0x0e Match "" at any non-boundary of a given type using using /u rules. */
 #define NBOUNDA               15      /* 0x0f Match "" betweeen any \w\w or \W\W, where \w is [_a-zA-Z0-9] */
 #define REG_ANY               16      /* 0x10 Match any one character (except newline). */
 #define SANY                  17      /* 0x11 Match any one character. */
@@ -72,15 +72,15 @@
 #define SROPEN                58      /* 0x3a Same as OPEN, but for script run */
 #define SRCLOSE               59      /* 0x3b Close preceding SROPEN */
 #define REF                   60      /* 0x3c Match some already matched string */
-#define REFF                  61      /* 0x3d Match already matched string, folded using native charset rules for non-utf8 */
-#define REFFL                 62      /* 0x3e Match already matched string, folded in loc. */
-#define REFFU                 63      /* 0x3f Match already matched string, folded using unicode rules for non-utf8 */
-#define REFFA                 64      /* 0x40 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */
+#define REFF                  61      /* 0x3d Match already matched string, using /di rules. */
+#define REFFL                 62      /* 0x3e Match already matched string, using /li rules. */
+#define REFFU                 63      /* 0x3f Match already matched string, usng /ui. */
+#define REFFA                 64      /* 0x40 Match already matched string, using /aai rules. */
 #define NREF                  65      /* 0x41 Match some already matched string */
-#define NREFF                 66      /* 0x42 Match already matched string, folded using native charset rules for non-utf8 */
-#define NREFFL                67      /* 0x43 Match already matched string, folded in loc. */
-#define NREFFU                68      /* 0x44 Match already matched string, folded using unicode rules for non-utf8 */
-#define NREFFA                69      /* 0x45 Match already matched string, folded using unicode rules for non-utf8, no mixing ASCII, non-ASCII */
+#define NREFF                 66      /* 0x42 Match already matched string, using /di rules. */
+#define NREFFL                67      /* 0x43 Match already matched string, using /li rules. */
+#define NREFFU                68      /* 0x44 Match already matched string, using /ui rules. */
+#define NREFFA                69      /* 0x45 Match already matched string, using /aai rules. */
 #define LONGJMP               70      /* 0x46 Jump far away. */
 #define BRANCHJ               71      /* 0x47 BRANCH with long offset. */
 #define IFMATCH               72      /* 0x48 Succeeds if the following matches; non-zero flags "f" means lookbehind assertion starting "f" characters before current */
diff --git a/t/re/script_run.t b/t/re/script_run.t
index 035a9104aa..19d4e10e53 100644
--- a/t/re/script_run.t
+++ b/t/re/script_run.t
@@ -51,8 +51,8 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     unlike("\N{HEBREW LETTER ALEF}\N{HEBREW LETTER TAV}\N{MODIFIER LETTER SMALL Y}", $script_run, "Hebrew then Latin isn't a script run");
     like("9876543210\N{DESERET SMALL LETTER WU}", $script_run, "0-9 are the digits for Deseret");
     like("\N{DESERET SMALL LETTER WU}9876543210", $script_run, "Also when they aren't in the initial position");
-    unlike("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits aren't the digits for Deseret");
-    unlike("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
+    like("\N{DESERET SMALL LETTER WU}\N{FULLWIDTH DIGIT FIVE}", $script_run, "Fullwidth digits may be digits for Deseret");
+    like("\N{FULLWIDTH DIGIT SIX}\N{DESERET SMALL LETTER LONG I}", $script_run, "... likewise if the digits come first");
     like("1234567890\N{ARABIC LETTER ALEF}", $script_run, "[0-9] work for Arabic");
     unlike("1234567890\N{ARABIC LETTER ALEF}\N{ARABIC-INDIC DIGIT FOUR}\N{ARABIC-INDIC DIGIT FIVE}", $script_run, "... but not in combination with real ARABIC digits");
 
@@ -104,4 +104,19 @@ foreach my $type ('script_run', 'sr', 'atomic_script_run', 'asr') {
     like("\x{3041}12\x{3041}", qr/^(*sr:.{4})/,
          "Script without own zero works with ASCII digits");
 
+    like("A\x{ff10}\x{ff19}B", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{ff10}BC", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{1d7ce}\x{1d7cf}B", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("A\x{1d7ce}BC", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("\x{1d7ce}\x{1d7cf}AB", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Latin"); # perl #133547
+    like("α\x{1d7ce}βγ", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Greek"); # perl #133547
+    like("\x{1d7ce}αβγ", qr/^(*sr:.{4})/,
+         "Non-ASCII Common digits work with Greek"); # perl #133547
+
 done_testing();

-- 
Perl5 Master Repository
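
Illustration (not part of the commit): the digit rule kept by this change, that all digits in a run must come from the same set of ten, as determined by the first digit encountered, can be sketched in C roughly as below. This is a simplified model, not the code in regexec.c. The names digit_zeros, zero_of and digits_from_one_set are hypothetical, the table lists only a handful of digit sets, and the script check that Perl_isSCRIPT_RUN also performs is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, deliberately partial table: the code point of DIGIT ZERO
 * for a few sets of ten consecutive decimal digits. */
static const uint32_t digit_zeros[] = {
    0x0030,  /* ASCII DIGIT ZERO */
    0x0660,  /* ARABIC-INDIC DIGIT ZERO */
    0x06F0,  /* EXTENDED ARABIC-INDIC DIGIT ZERO */
    0xFF10,  /* FULLWIDTH DIGIT ZERO (Common) */
    0x1D7CE  /* MATHEMATICAL BOLD DIGIT ZERO (Common) */
};

/* Return the zero of the digit set containing cp, or 0 if cp is not a
 * digit known to the table above. */
static uint32_t zero_of(uint32_t cp)
{
    size_t i;
    for (i = 0; i < sizeof digit_zeros / sizeof digit_zeros[0]; i++) {
        if (cp >= digit_zeros[i] && cp <= digit_zeros[i] + 9)
            return digit_zeros[i];
    }
    return 0;
}

/* Return true if every digit in cps[0..len-1] belongs to the set of ten
 * fixed by the first digit seen.  Non-digits are ignored here; a real
 * script-run check would also compare their scripts. */
bool digits_from_one_set(const uint32_t *cps, size_t len)
{
    uint32_t zero_of_run = 0;   /* 0 means no digit seen yet */
    size_t i;

    for (i = 0; i < len; i++) {
        uint32_t zero_of_char = zero_of(cps[i]);

        if (zero_of_char == 0)
            continue;                       /* not a digit */
        if (zero_of_run == 0)
            zero_of_run = zero_of_char;     /* first digit fixes the set */
        else if (zero_of_char != zero_of_run)
            return false;                   /* digits from two sets */
    }
    return true;
}
```

Under this model the sequence 'A', U+FF10, U+FF19, 'B' passes, since both digits are fullwidth, while '1' followed by U+0664 (ARABIC-INDIC DIGIT FOUR) fails, mirroring the "not in combination" cases in t/re/script_run.t.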
