[perl.git] branch blead, updated. v5.25.5-24-g98fce2a

Karl Williamson Sun, 25 Sep 2016 21:25:54 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/98fce2a4417fa36585bb48f6ae845bee93cac0fa?hp=3a57bd4d5e78d639b78eed9fcc27028720f8d326>


- Log -----------------------------------------------------------------
commit 98fce2a4417fa36585bb48f6ae845bee93cac0fa
Author: Karl Williamson <[email protected]>
Date:   Wed Sep 21 16:15:08 2016 -0600

    Centralize definitions of MIN, MAX
    
    Instead of having each file have them, keep them in handy.h, but only
    for core compilations.

M       handy.h
M       regcomp.c
M       utf8.c

commit 8bc127bf58304a1e46a3e33d30b0b8b6f21abb07
Author: Karl Williamson <[email protected]>
Date:   Sun Sep 25 22:04:08 2016 -0600

    Add is_utf8_fixed_width_buf_flags() and use it
    
    This encodes a simple pattern that may not be immediately obvious to
    someone needing it.  If you have a fixed-size buffer that is full of
    purportedly UTF-8 bytes, is it valid or not?  It's easy to do, as shown
    in this commit.  The file test operators -T and -B can be simplified by
    using this function.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
M       inline.h
M       pp_sys.c
M       proto.h

commit 9f2abfdef8903cce0a7b12ce12788ce7e9f72ed1
Author: Karl Williamson <[email protected]>
Date:   Mon Sep 19 09:59:32 2016 -0600

    Add API Unicode handling functions
    
    These functions are all extensions of the is_utf8_string_foo()
    functions, that restrict the UTF-8 recognized as valid in various ways.
    There are named ones for the two definitions that Unicode makes, and
    foo_flags ones for more custom restrictions.
    
    The named ones are implemented as tries, while the flags ones provide
    complete generality.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/APItest.pm
M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/utf8.t
M       inline.h
M       proto.h
M       utf8.h

commit 152c1f4b3a3b82886ecaa218d01d1a5a20f80f17
Author: Karl Williamson <[email protected]>
Date:   Sun Sep 25 10:14:50 2016 -0600

    APItest/t/utf8.t: Rename variable
    
    The new name is clearer, which will matter more in the next commit

M       ext/XS-APItest/t/utf8.t

commit 5f8a3d1d179cec9e4e9086e24c4f682844b93438
Author: Karl Williamson <[email protected]>
Date:   Tue Sep 20 10:12:45 2016 -0600

    XS-APItest/t/utf8.t: Add some tests
    
    These will help in testing the string functions coming in the next
    commit.  These add problematic code points to the first testing loop.
    As a result some of the tests in the final loop may be redundant, but
    since this .t is quick to run, I chose not to investigate and remove any
    such.

M       ext/XS-APItest/APItest.pm
M       ext/XS-APItest/t/utf8.t

commit 3964c812010ebc56145ccf7cde87b5eb97d0daf0
Author: Karl Williamson <[email protected]>
Date:   Wed Sep 14 19:57:46 2016 -0600

    Move #define to different header
    
    Instead of having a comment in one header pointing to the #define in the
    other, remove the indirection and just have the #define itself where it
    is needed.

M       inline.h
M       utf8.h

commit 2717076ad3197147ee82d8e263fa3cf7fc9ca19c
Author: Karl Williamson <[email protected]>
Date:   Mon Sep 19 09:52:57 2016 -0600

    perlapi: Clarifications, nits in Unicode support docs
    
    This also does a white space change to inline.h

M       inline.h
M       utf8.h

commit f21517291ac6c737159b2b06bd18b58a063ddb6b
Author: Karl Williamson <[email protected]>
Date:   Thu Sep 15 09:06:39 2016 -0600

    perlapi: Minor clarifications to sv_utf8_decode

M       sv.c
-----------------------------------------------------------------------
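
Note on the is_utf8_fixed_width_buf_flags() commit above: the pattern it
describes can be sketched outside the Perl core roughly as follows.  This is a
minimal standalone C illustration, not the code added to inline.h.  All names
here (seq_len, seq_bytes_ok, valid_prefix, fixed_width_buf_ok) are
hypothetical, and the sketch assumes standard UTF-8 only (no surrogates,
nothing above U+10FFFF), which is stricter than Perl's extended UTF-8.  The
idea: scan complete characters from the start; if the scan stops before the
end of the buffer, the buffer is still acceptable provided the leftover bytes
are a proper prefix of a well-formed character, since that character may
simply have been cut off by the buffer boundary.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a lead byte, or 0 if the
 * byte cannot start a character (standard UTF-8: 00..7F, C2..F4). */
static size_t seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Hypothetical helper: check the first 'have' bytes (have >= 1) of a sequence
 * whose lead byte s[0] was accepted by seq_len().  The range check on the
 * second byte rejects overlongs, surrogates, and anything above U+10FFFF. */
static bool seq_bytes_ok(const unsigned char *s, size_t have)
{
    size_t i;
    if (have > 1) {
        unsigned char lo = 0x80, hi = 0xBF;
        if (s[0] == 0xE0) lo = 0xA0;        /* overlong 3-byte forms */
        else if (s[0] == 0xED) hi = 0x9F;   /* surrogates */
        else if (s[0] == 0xF0) lo = 0x90;   /* overlong 4-byte forms */
        else if (s[0] == 0xF4) hi = 0x8F;   /* above U+10FFFF */
        if (s[1] < lo || s[1] > hi) return false;
    }
    for (i = 2; i < have; i++)
        if ((s[i] & 0xC0) != 0x80) return false;
    return true;
}

/* Returns how many leading bytes of s form complete, well-formed characters.
 * If nchars is non-NULL, it receives the number of those characters. */
size_t valid_prefix(const unsigned char *s, size_t len, size_t *nchars)
{
    size_t pos = 0, count = 0;
    while (pos < len) {
        size_t need = seq_len(s[pos]);
        if (need == 0 || need > len - pos || !seq_bytes_ok(s + pos, need))
            break;
        pos += need;
        count++;
    }
    if (nchars) *nchars = count;
    return pos;
}

/* A full fixed-size buffer is acceptable if it is entirely well-formed, or if
 * the only problem is a final character truncated by the end of the buffer. */
bool fixed_width_buf_ok(const unsigned char *s, size_t len)
{
    size_t pos = valid_prefix(s, len, NULL);
    size_t have = len - pos;
    size_t need;

    if (have == 0) return true;
    need = seq_len(s[pos]);
    return need != 0 && have < need && seq_bytes_ok(s + pos, have);
}
```

The return value and optional count of valid_prefix() play a role loosely
analogous to the ep and el out-parameters of the _loc and _loclen variants
listed in embed.fnc below; that correspondence is an illustration only.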

Summary of changes:
 embed.fnc                 |  35 +++-
 embed.h                   |   7 +
 ext/XS-APItest/APItest.pm |   2 +-
 ext/XS-APItest/APItest.xs | 179 +++++++++++++++-
 ext/XS-APItest/t/utf8.t   | 383 +++++++++++++++++++++++++++++++--
 handy.h                   |   9 +
 inline.h                  | 523 ++++++++++++++++++++++++++++++++++++++++++++--
 pp_sys.c                  |   8 +-
 proto.h                   |  32 +++
 regcomp.c                 |   8 -
 sv.c                      |   4 +-
 utf8.c                    |   3 -
 utf8.h                    |  50 +++--
 13 files changed, 1174 insertions(+), 69 deletions(-)
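
Note on the "Add API Unicode handling functions" commit: going by the test
changes in ext/XS-APItest/t/utf8.t below, the two named restrictions differ
only in which classes of code points they reject.  Surrogates and code points
above U+10FFFF are invalid under both "strict" and "c9strict", while
noncharacters are invalid under "strict" only.  A minimal sketch of that
classification at the code point level follows.  The SK_* flag names and
cp_allowed() are hypothetical and are not Perl's flags or API; the real
functions operate on UTF-8 byte strings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits, one per class of problematic code point. */
#define SK_DISALLOW_SURROGATE 0x1u
#define SK_DISALLOW_NONCHAR   0x2u
#define SK_DISALLOW_SUPER     0x4u   /* above U+10FFFF */

/* The two named restrictions expressed as flag combinations. */
#define SK_C9STRICT (SK_DISALLOW_SURROGATE | SK_DISALLOW_SUPER)
#define SK_STRICT   (SK_C9STRICT | SK_DISALLOW_NONCHAR)

/* Returns true if code point cp is acceptable under the given flags.  The
 * range tests mirror the conditions used in the test file. */
bool cp_allowed(uint32_t cp, unsigned flags)
{
    if (cp > 0x10FFFF)
        return !(flags & SK_DISALLOW_SUPER);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return !(flags & SK_DISALLOW_SURROGATE);
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return !(flags & SK_DISALLOW_NONCHAR);
    return true;
}
```

With this shape, a "named" checker is just the flags checker called with a
fixed combination, for example cp_allowed(cp, SK_STRICT), while a caller that
needs a custom restriction passes its own set of bits.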

diff --git a/embed.fnc b/embed.fnc
index 2954eda..168fe68 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -742,8 +742,39 @@ AmnpdRP    |bool   |is_invariant_string|NN const U8* const s|const STRLEN len
 AnpdD  |STRLEN |is_utf8_char   |NN const U8 *s
 Abmnpd |STRLEN |is_utf8_char_buf|NN const U8 *buf|NN const U8 *buf_end
 AnipdP |bool   |is_utf8_string |NN const U8 *s|const STRLEN len
-Anpdmb |bool   |is_utf8_string_loc|NN const U8 *s|const STRLEN len|NN const U8 **ep
-Anipd  |bool   |is_utf8_string_loclen|NN const U8 *s|const STRLEN len|NULLOK const U8 **ep|NULLOK STRLEN *el
+AnidP  |bool   |is_utf8_string_flags                                       \
+               |NN const U8 *s|const STRLEN len|const U32 flags
+AnidP  |bool   |is_strict_utf8_string|NN const U8 *s|const STRLEN len
+AnidP  |bool   |is_c9strict_utf8_string|NN const U8 *s|const STRLEN len
+Anpdmb |bool   |is_utf8_string_loc                                         \
+               |NN const U8 *s|const STRLEN len|NN const U8 **ep
+Andm   |bool   |is_utf8_string_loc_flags                                   \
+               |NN const U8 *s|const STRLEN len|NN const U8 **ep           \
+               |const U32 flags
+Andm   |bool   |is_strict_utf8_string_loc                                  \
+               |NN const U8 *s|const STRLEN len|NN const U8 **ep
+Andm   |bool   |is_c9strict_utf8_string_loc                                \
+               |NN const U8 *s|const STRLEN len|NN const U8 **ep
+Anipd  |bool   |is_utf8_string_loclen                                      \
+               |NN const U8 *s|const STRLEN len|NULLOK const U8 **ep       \
+               |NULLOK STRLEN *el
+Anid   |bool   |is_utf8_string_loclen_flags                                \
+               |NN const U8 *s|const STRLEN len|NULLOK const U8 **ep       \
+               |NULLOK STRLEN *el|const U32 flags
+Anid   |bool   |is_strict_utf8_string_loclen                               \
+               |NN const U8 *s|const STRLEN len|NULLOK const U8 **ep       \
+               |NULLOK STRLEN *el
+Anid   |bool   |is_c9strict_utf8_string_loclen                             \
+               |NN const U8 *s|const STRLEN len|NULLOK const U8 **ep       \
+               |NULLOK STRLEN *el
+Amnd   |bool   |is_utf8_fixed_width_buf_flags                              \
+               |NN const U8 * const s|const STRLEN len|const U32 flags
+Amnd   |bool   |is_utf8_fixed_width_buf_loc_flags                          \
+               |NN const U8 * const s|const STRLEN len                     \
+               |NULLOK const U8 **ep|const U32 flags
+Anid   |bool   |is_utf8_fixed_width_buf_loclen_flags                       \
+               |NN const U8 * const s|const STRLEN len                     \
+               |NULLOK const U8 **ep|NULLOK STRLEN *el|const U32 flags
 AmndP  |bool   |is_utf8_valid_partial_char                                 \
                |NN const U8 * const s|NN const U8 * const e
 AnidP  |bool   |is_utf8_valid_partial_char_flags                           \
diff --git a/embed.h b/embed.h
index 50a19a4..31d0548 100644
--- a/embed.h
+++ b/embed.h
@@ -242,7 +242,11 @@
 #define intro_my()             Perl_intro_my(aTHX)
 #define isALNUM_lazy(a)                Perl_isALNUM_lazy(aTHX_ a)
 #define isIDFIRST_lazy(a)      Perl_isIDFIRST_lazy(aTHX_ a)
+#define is_c9strict_utf8_string        S_is_c9strict_utf8_string
+#define is_c9strict_utf8_string_loclen S_is_c9strict_utf8_string_loclen
 #define is_lvalue_sub()                Perl_is_lvalue_sub(aTHX)
+#define is_strict_utf8_string  S_is_strict_utf8_string
+#define is_strict_utf8_string_loclen   S_is_strict_utf8_string_loclen
 #define is_uni_alnum(a)                Perl_is_uni_alnum(aTHX_ a)
 #define is_uni_alnum_lc(a)     Perl_is_uni_alnum_lc(aTHX_ a)
 #define is_uni_alnumc(a)       Perl_is_uni_alnumc(aTHX_ a)
@@ -281,6 +285,7 @@
 #define is_utf8_char           Perl_is_utf8_char
 #define is_utf8_cntrl(a)       Perl_is_utf8_cntrl(aTHX_ a)
 #define is_utf8_digit(a)       Perl_is_utf8_digit(aTHX_ a)
+#define is_utf8_fixed_width_buf_loclen_flags   S_is_utf8_fixed_width_buf_loclen_flags
 #define is_utf8_graph(a)       Perl_is_utf8_graph(aTHX_ a)
 #define is_utf8_idcont(a)      Perl_is_utf8_idcont(aTHX_ a)
 #define is_utf8_idfirst(a)     Perl_is_utf8_idfirst(aTHX_ a)
@@ -294,7 +299,9 @@
 #define is_utf8_punct(a)       Perl_is_utf8_punct(aTHX_ a)
 #define is_utf8_space(a)       Perl_is_utf8_space(aTHX_ a)
 #define is_utf8_string         Perl_is_utf8_string
+#define is_utf8_string_flags   S_is_utf8_string_flags
 #define is_utf8_string_loclen  Perl_is_utf8_string_loclen
+#define is_utf8_string_loclen_flags    S_is_utf8_string_loclen_flags
 #define is_utf8_upper(a)       Perl_is_utf8_upper(aTHX_ a)
 #define is_utf8_valid_partial_char_flags       S_is_utf8_valid_partial_char_flags
 #define is_utf8_xdigit(a)      Perl_is_utf8_xdigit(aTHX_ a)
diff --git a/ext/XS-APItest/APItest.pm b/ext/XS-APItest/APItest.pm
index d35018f..64a25f1 100644
--- a/ext/XS-APItest/APItest.pm
+++ b/ext/XS-APItest/APItest.pm
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use Carp;
 
-our $VERSION = '0.84';
+our $VERSION = '0.86';
 
 require XSLoader;
 
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index 954bb60..ce94968 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -5351,12 +5351,187 @@ test_isC9_STRICT_UTF8_CHAR(char *s, STRLEN len)
 IV
 test_is_utf8_valid_partial_char_flags(char *s, STRLEN len, U32 flags)
     CODE:
-        /* RETVAL should be bool, but making it IV allows us to test it
-         * returning 0 or 1 */
+        /* RETVAL should be bool (here and in tests below), but making it IV
+         * allows us to test it returning 0 or 1 */
         RETVAL = is_utf8_valid_partial_char_flags((U8 *) s, (U8 *) s + len, flags);
     OUTPUT:
         RETVAL
 
+IV
+test_is_utf8_string(char *s, STRLEN len)
+    CODE:
+        RETVAL = is_utf8_string((U8 *) s, len);
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_utf8_string_loc(char *s, STRLEN len)
+    PREINIT:
+        AV *av;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_utf8_string_loc((U8 *) s, len, &ep)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_utf8_string_loclen(char *s, STRLEN len)
+    PREINIT:
+        AV *av;
+        STRLEN ret_len;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_utf8_string_loclen((U8 *) s, len, &ep, &ret_len)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        av_push(av, newSVuv(ret_len));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+IV
+test_is_utf8_string_flags(char *s, STRLEN len, U32 flags)
+    CODE:
+        RETVAL = is_utf8_string_flags((U8 *) s, len, flags);
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_utf8_string_loc_flags(char *s, STRLEN len, U32 flags)
+    PREINIT:
+        AV *av;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_utf8_string_loc_flags((U8 *) s, len, &ep, flags)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_utf8_string_loclen_flags(char *s, STRLEN len, U32 flags)
+    PREINIT:
+        AV *av;
+        STRLEN ret_len;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_utf8_string_loclen_flags((U8 *) s, len, &ep, &ret_len, flags)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        av_push(av, newSVuv(ret_len));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+IV
+test_is_strict_utf8_string(char *s, STRLEN len)
+    CODE:
+        RETVAL = is_strict_utf8_string((U8 *) s, len);
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_strict_utf8_string_loc(char *s, STRLEN len)
+    PREINIT:
+        AV *av;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_strict_utf8_string_loc((U8 *) s, len, &ep)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_strict_utf8_string_loclen(char *s, STRLEN len)
+    PREINIT:
+        AV *av;
+        STRLEN ret_len;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_strict_utf8_string_loclen((U8 *) s, len, &ep, &ret_len)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        av_push(av, newSVuv(ret_len));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+IV
+test_is_c9strict_utf8_string(char *s, STRLEN len)
+    CODE:
+        RETVAL = is_c9strict_utf8_string((U8 *) s, len);
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_c9strict_utf8_string_loc(char *s, STRLEN len)
+    PREINIT:
+        AV *av;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_c9strict_utf8_string_loc((U8 *) s, len, &ep)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_c9strict_utf8_string_loclen(char *s, STRLEN len)
+    PREINIT:
+        AV *av;
+        STRLEN ret_len;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_c9strict_utf8_string_loclen((U8 *) s, len, &ep, &ret_len)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        av_push(av, newSVuv(ret_len));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+IV
+test_is_utf8_fixed_width_buf_flags(char *s, STRLEN len, U32 flags)
+    CODE:
+        RETVAL = is_utf8_fixed_width_buf_flags((U8 *) s, len, flags);
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_utf8_fixed_width_buf_loc_flags(char *s, STRLEN len, U32 flags)
+    PREINIT:
+        AV *av;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_utf8_fixed_width_buf_loc_flags((U8 *) s, len, &ep, flags)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
+AV *
+test_is_utf8_fixed_width_buf_loclen_flags(char *s, STRLEN len, U32 flags)
+    PREINIT:
+        AV *av;
+        STRLEN ret_len;
+        const U8 * ep;
+    CODE:
+        av = newAV();
+        av_push(av, newSViv(is_utf8_fixed_width_buf_loclen_flags((U8 *) s, len, &ep, &ret_len, flags)));
+        av_push(av, newSViv(ep - (U8 *) s));
+        av_push(av, newSVuv(ret_len));
+        RETVAL = av;
+    OUTPUT:
+        RETVAL
+
 UV
 test_toLOWER(UV ord)
     CODE:
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 8122534..fd3c903 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -149,51 +149,117 @@ my %code_points = (
     # as of this writing, considers potentially problematic on ASCII
     0xD000     => (isASCII) ? "\xed\x80\x80" : 
I8_to_native("\xf1\xb4\xa0\xa0"),
 
-    # Bracket the surrogates
+    # Bracket the surrogates, and include several surrogates
     0xD7FF     => (isASCII) ? "\xed\x9f\xbf" : 
I8_to_native("\xf1\xb5\xbf\xbf"),
+    0xD800     => (isASCII) ? "\xed\xa0\x80" : 
I8_to_native("\xf1\xb6\xa0\xa0"),
+    0xDC00      => (isASCII) ? "\xed\xb0\x80" : 
I8_to_native("\xf1\xb7\xa0\xa0"),
+    0xDFFF     => (isASCII) ? "\xee\x80\x80" : 
I8_to_native("\xf1\xb8\xa0\xa0"),
+    0xDFFF      => (isASCII) ? "\xed\xbf\xbf" : 
I8_to_native("\xf1\xb7\xbf\xbf"),
     0xE000     => (isASCII) ? "\xee\x80\x80" : 
I8_to_native("\xf1\xb8\xa0\xa0"),
 
-    # Bracket the 32 contiguous non characters
+    # Include the 32 contiguous non characters, and surrounding code points
     0xFDCF     => (isASCII) ? "\xef\xb7\x8f" : 
I8_to_native("\xf1\xbf\xae\xaf"),
+    0xFDD0     => (isASCII) ? "\xef\xb7\x90" : 
I8_to_native("\xf1\xbf\xae\xb0"),
+    0xFDD1     => (isASCII) ? "\xef\xb7\x91" : 
I8_to_native("\xf1\xbf\xae\xb1"),
+    0xFDD2     => (isASCII) ? "\xef\xb7\x92" : 
I8_to_native("\xf1\xbf\xae\xb2"),
+    0xFDD3     => (isASCII) ? "\xef\xb7\x93" : 
I8_to_native("\xf1\xbf\xae\xb3"),
+    0xFDD4     => (isASCII) ? "\xef\xb7\x94" : 
I8_to_native("\xf1\xbf\xae\xb4"),
+    0xFDD5     => (isASCII) ? "\xef\xb7\x95" : 
I8_to_native("\xf1\xbf\xae\xb5"),
+    0xFDD6     => (isASCII) ? "\xef\xb7\x96" : 
I8_to_native("\xf1\xbf\xae\xb6"),
+    0xFDD7     => (isASCII) ? "\xef\xb7\x97" : 
I8_to_native("\xf1\xbf\xae\xb7"),
+    0xFDD8     => (isASCII) ? "\xef\xb7\x98" : 
I8_to_native("\xf1\xbf\xae\xb8"),
+    0xFDD9     => (isASCII) ? "\xef\xb7\x99" : 
I8_to_native("\xf1\xbf\xae\xb9"),
+    0xFDDA     => (isASCII) ? "\xef\xb7\x9a" : 
I8_to_native("\xf1\xbf\xae\xba"),
+    0xFDDB     => (isASCII) ? "\xef\xb7\x9b" : 
I8_to_native("\xf1\xbf\xae\xbb"),
+    0xFDDC     => (isASCII) ? "\xef\xb7\x9c" : 
I8_to_native("\xf1\xbf\xae\xbc"),
+    0xFDDD     => (isASCII) ? "\xef\xb7\x9d" : 
I8_to_native("\xf1\xbf\xae\xbd"),
+    0xFDDE     => (isASCII) ? "\xef\xb7\x9e" : 
I8_to_native("\xf1\xbf\xae\xbe"),
+    0xFDDF     => (isASCII) ? "\xef\xb7\x9f" : 
I8_to_native("\xf1\xbf\xae\xbf"),
+    0xFDE0     => (isASCII) ? "\xef\xb7\xa0" : 
I8_to_native("\xf1\xbf\xaf\xa0"),
+    0xFDE1     => (isASCII) ? "\xef\xb7\xa1" : 
I8_to_native("\xf1\xbf\xaf\xa1"),
+    0xFDE2     => (isASCII) ? "\xef\xb7\xa2" : 
I8_to_native("\xf1\xbf\xaf\xa2"),
+    0xFDE3     => (isASCII) ? "\xef\xb7\xa3" : 
I8_to_native("\xf1\xbf\xaf\xa3"),
+    0xFDE4     => (isASCII) ? "\xef\xb7\xa4" : 
I8_to_native("\xf1\xbf\xaf\xa4"),
+    0xFDE5     => (isASCII) ? "\xef\xb7\xa5" : 
I8_to_native("\xf1\xbf\xaf\xa5"),
+    0xFDE6     => (isASCII) ? "\xef\xb7\xa6" : 
I8_to_native("\xf1\xbf\xaf\xa6"),
+    0xFDE7     => (isASCII) ? "\xef\xb7\xa7" : 
I8_to_native("\xf1\xbf\xaf\xa7"),
+    0xFDE8     => (isASCII) ? "\xef\xb7\xa8" : 
I8_to_native("\xf1\xbf\xaf\xa8"),
+    0xFDEa     => (isASCII) ? "\xef\xb7\x99" : 
I8_to_native("\xf1\xbf\xaf\xa9"),
+    0xFDEA     => (isASCII) ? "\xef\xb7\xaa" : 
I8_to_native("\xf1\xbf\xaf\xaa"),
+    0xFDEB     => (isASCII) ? "\xef\xb7\xab" : 
I8_to_native("\xf1\xbf\xaf\xab"),
+    0xFDEC     => (isASCII) ? "\xef\xb7\xac" : 
I8_to_native("\xf1\xbf\xaf\xac"),
+    0xFDED     => (isASCII) ? "\xef\xb7\xad" : 
I8_to_native("\xf1\xbf\xaf\xad"),
+    0xFDEE     => (isASCII) ? "\xef\xb7\xae" : 
I8_to_native("\xf1\xbf\xae\xae"),
+    0xFDEF     => (isASCII) ? "\xef\xb7\xaf" : 
I8_to_native("\xf1\xbf\xaf\xaf"),
     0xFDF0      => (isASCII) ? "\xef\xb7\xb0" : 
I8_to_native("\xf1\xbf\xaf\xb0"),
 
-    # Mostly bracket non-characters, but some are transitions to longer
-    # strings
+    # Mostly around non-characters, but some are transitions to longer strings
     0xFFFD     => (isASCII) ? "\xef\xbf\xbd" : 
I8_to_native("\xf1\xbf\xbf\xbd"),
     0x10000 - 1 => (isASCII) ? "\xef\xbf\xbf" : 
I8_to_native("\xf1\xbf\xbf\xbf"),
     0x10000     => (isASCII) ? "\xf0\x90\x80\x80" : 
I8_to_native("\xf2\xa0\xa0\xa0"),
     0x1FFFD     => (isASCII) ? "\xf0\x9f\xbf\xbd" : 
I8_to_native("\xf3\xbf\xbf\xbd"),
+    0x1FFFE     => (isASCII) ? "\xf0\x9f\xbf\xbe" : 
I8_to_native("\xf3\xbf\xbf\xbe"),
+    0x1FFFF     => (isASCII) ? "\xf0\x9f\xbf\xbf" : 
I8_to_native("\xf3\xbf\xbf\xbf"),
     0x20000     => (isASCII) ? "\xf0\xa0\x80\x80" : 
I8_to_native("\xf4\xa0\xa0\xa0"),
     0x2FFFD     => (isASCII) ? "\xf0\xaf\xbf\xbd" : 
I8_to_native("\xf5\xbf\xbf\xbd"),
+    0x2FFFE     => (isASCII) ? "\xf0\xaf\xbf\xbe" : 
I8_to_native("\xf5\xbf\xbf\xbe"),
+    0x2FFFF     => (isASCII) ? "\xf0\xaf\xbf\xbf" : 
I8_to_native("\xf5\xbf\xbf\xbf"),
     0x30000     => (isASCII) ? "\xf0\xb0\x80\x80" : 
I8_to_native("\xf6\xa0\xa0\xa0"),
     0x3FFFD     => (isASCII) ? "\xf0\xbf\xbf\xbd" : 
I8_to_native("\xf7\xbf\xbf\xbd"),
+    0x3FFFE     => (isASCII) ? "\xf0\xbf\xbf\xbe" : 
I8_to_native("\xf7\xbf\xbf\xbe"),
     0x40000 - 1 => (isASCII) ? "\xf0\xbf\xbf\xbf" : 
I8_to_native("\xf7\xbf\xbf\xbf"),
     0x40000     => (isASCII) ? "\xf1\x80\x80\x80" : 
I8_to_native("\xf8\xa8\xa0\xa0\xa0"),
     0x4FFFD    => (isASCII) ? "\xf1\x8f\xbf\xbd" : 
I8_to_native("\xf8\xa9\xbf\xbf\xbd"),
+    0x4FFFE    => (isASCII) ? "\xf1\x8f\xbf\xbe" : 
I8_to_native("\xf8\xa9\xbf\xbf\xbe"),
+    0x4FFFF    => (isASCII) ? "\xf1\x8f\xbf\xbf" : 
I8_to_native("\xf8\xa9\xbf\xbf\xbf"),
     0x50000     => (isASCII) ? "\xf1\x90\x80\x80" : 
I8_to_native("\xf8\xaa\xa0\xa0\xa0"),
     0x5FFFD    => (isASCII) ? "\xf1\x9f\xbf\xbd" : 
I8_to_native("\xf8\xab\xbf\xbf\xbd"),
+    0x5FFFE    => (isASCII) ? "\xf1\x9f\xbf\xbe" : 
I8_to_native("\xf8\xab\xbf\xbf\xbe"),
+    0x5FFFF    => (isASCII) ? "\xf1\x9f\xbf\xbf" : 
I8_to_native("\xf8\xab\xbf\xbf\xbf"),
     0x60000     => (isASCII) ? "\xf1\xa0\x80\x80" : 
I8_to_native("\xf8\xac\xa0\xa0\xa0"),
     0x6FFFD    => (isASCII) ? "\xf1\xaf\xbf\xbd" : 
I8_to_native("\xf8\xad\xbf\xbf\xbd"),
+    0x6FFFE    => (isASCII) ? "\xf1\xaf\xbf\xbe" : 
I8_to_native("\xf8\xad\xbf\xbf\xbe"),
+    0x6FFFF    => (isASCII) ? "\xf1\xaf\xbf\xbf" : 
I8_to_native("\xf8\xad\xbf\xbf\xbf"),
     0x70000     => (isASCII) ? "\xf1\xb0\x80\x80" : 
I8_to_native("\xf8\xae\xa0\xa0\xa0"),
     0x7FFFD    => (isASCII) ? "\xf1\xbf\xbf\xbd" : 
I8_to_native("\xf8\xaf\xbf\xbf\xbd"),
+    0x7FFFE    => (isASCII) ? "\xf1\xbf\xbf\xbe" : 
I8_to_native("\xf8\xaf\xbf\xbf\xbe"),
+    0x7FFFF    => (isASCII) ? "\xf1\xbf\xbf\xbf" : 
I8_to_native("\xf8\xaf\xbf\xbf\xbf"),
     0x80000     => (isASCII) ? "\xf2\x80\x80\x80" : 
I8_to_native("\xf8\xb0\xa0\xa0\xa0"),
     0x8FFFD    => (isASCII) ? "\xf2\x8f\xbf\xbd" : 
I8_to_native("\xf8\xb1\xbf\xbf\xbd"),
+    0x8FFFE    => (isASCII) ? "\xf2\x8f\xbf\xbe" : 
I8_to_native("\xf8\xb1\xbf\xbf\xbe"),
+    0x8FFFF    => (isASCII) ? "\xf2\x8f\xbf\xbf" : 
I8_to_native("\xf8\xb1\xbf\xbf\xbf"),
     0x90000     => (isASCII) ? "\xf2\x90\x80\x80" : 
I8_to_native("\xf8\xb2\xa0\xa0\xa0"),
     0x9FFFD    => (isASCII) ? "\xf2\x9f\xbf\xbd" : 
I8_to_native("\xf8\xb3\xbf\xbf\xbd"),
+    0x9FFFE    => (isASCII) ? "\xf2\x9f\xbf\xbe" : 
I8_to_native("\xf8\xb3\xbf\xbf\xbe"),
+    0x9FFFF    => (isASCII) ? "\xf2\x9f\xbf\xbf" : 
I8_to_native("\xf8\xb3\xbf\xbf\xbf"),
     0xA0000     => (isASCII) ? "\xf2\xa0\x80\x80" : 
I8_to_native("\xf8\xb4\xa0\xa0\xa0"),
     0xAFFFD    => (isASCII) ? "\xf2\xaf\xbf\xbd" : 
I8_to_native("\xf8\xb5\xbf\xbf\xbd"),
+    0xAFFFE    => (isASCII) ? "\xf2\xaf\xbf\xbe" : 
I8_to_native("\xf8\xb5\xbf\xbf\xbe"),
+    0xAFFFF    => (isASCII) ? "\xf2\xaf\xbf\xbf" : 
I8_to_native("\xf8\xb5\xbf\xbf\xbf"),
     0xB0000     => (isASCII) ? "\xf2\xb0\x80\x80" : 
I8_to_native("\xf8\xb6\xa0\xa0\xa0"),
     0xBFFFD    => (isASCII) ? "\xf2\xbf\xbf\xbd" : 
I8_to_native("\xf8\xb7\xbf\xbf\xbd"),
+    0xBFFFE    => (isASCII) ? "\xf2\xbf\xbf\xbe" : 
I8_to_native("\xf8\xb7\xbf\xbf\xbe"),
+    0xBFFFF    => (isASCII) ? "\xf2\xbf\xbf\xbf" : 
I8_to_native("\xf8\xb7\xbf\xbf\xbf"),
     0xC0000     => (isASCII) ? "\xf3\x80\x80\x80" : 
I8_to_native("\xf8\xb8\xa0\xa0\xa0"),
     0xCFFFD    => (isASCII) ? "\xf3\x8f\xbf\xbd" : 
I8_to_native("\xf8\xb9\xbf\xbf\xbd"),
+    0xCFFFE    => (isASCII) ? "\xf3\x8f\xbf\xbe" : 
I8_to_native("\xf8\xb9\xbf\xbf\xbe"),
+    0xCFFFF    => (isASCII) ? "\xf3\x8f\xbf\xbf" : 
I8_to_native("\xf8\xb9\xbf\xbf\xbf"),
     0xD0000     => (isASCII) ? "\xf3\x90\x80\x80" : 
I8_to_native("\xf8\xba\xa0\xa0\xa0"),
     0xDFFFD    => (isASCII) ? "\xf3\x9f\xbf\xbd" : 
I8_to_native("\xf8\xbb\xbf\xbf\xbd"),
+    0xDFFFE    => (isASCII) ? "\xf3\x9f\xbf\xbe" : 
I8_to_native("\xf8\xbb\xbf\xbf\xbe"),
+    0xDFFFF    => (isASCII) ? "\xf3\x9f\xbf\xbf" : 
I8_to_native("\xf8\xbb\xbf\xbf\xbf"),
     0xE0000     => (isASCII) ? "\xf3\xa0\x80\x80" : 
I8_to_native("\xf8\xbc\xa0\xa0\xa0"),
     0xEFFFD    => (isASCII) ? "\xf3\xaf\xbf\xbd" : 
I8_to_native("\xf8\xbd\xbf\xbf\xbd"),
+    0xEFFFE    => (isASCII) ? "\xf3\xaf\xbf\xbe" : 
I8_to_native("\xf8\xbd\xbf\xbf\xbe"),
+    0xEFFFF    => (isASCII) ? "\xf3\xaf\xbf\xbf" : 
I8_to_native("\xf8\xbd\xbf\xbf\xbf"),
     0xF0000     => (isASCII) ? "\xf3\xb0\x80\x80" : 
I8_to_native("\xf8\xbe\xa0\xa0\xa0"),
     0xFFFFD    => (isASCII) ? "\xf3\xbf\xbf\xbd" : 
I8_to_native("\xf8\xbf\xbf\xbf\xbd"),
+    0xFFFFE    => (isASCII) ? "\xf3\xbf\xbf\xbe" : 
I8_to_native("\xf8\xbf\xbf\xbf\xbe"),
+    0xFFFFF    => (isASCII) ? "\xf3\xbf\xbf\xbf" : 
I8_to_native("\xf8\xbf\xbf\xbf\xbf"),
     0x100000    => (isASCII) ? "\xf4\x80\x80\x80" : 
I8_to_native("\xf9\xa0\xa0\xa0\xa0"),
     0x10FFFD   => (isASCII) ? "\xf4\x8f\xbf\xbd" : 
I8_to_native("\xf9\xa1\xbf\xbf\xbd"),
+    0x10FFFE   => (isASCII) ? "\xf4\x8f\xbf\xbe" : 
I8_to_native("\xf9\xa1\xbf\xbf\xbe"),
+    0x10FFFF   => (isASCII) ? "\xf4\x8f\xbf\xbf" : 
I8_to_native("\xf9\xa1\xbf\xbf\xbf"),
     0x110000    => (isASCII) ? "\xf4\x90\x80\x80" : 
I8_to_native("\xf9\xa2\xa0\xa0\xa0"),
 
     # Things that would be noncharacters if they were in Unicode, and might be
@@ -287,9 +353,16 @@ my @warnings;
 use warnings 'utf8';
 local $SIG{__WARN__} = sub { push @warnings, @_ };
 
-# This set of tests looks for basic sanity, and lastly tests the bottom level
-# decode routine for the given code point.  If the earlier tests for that code
-# point fail, that one probably will too.  Malformations are tested in later
+my %restriction_types;
+
+$restriction_types{""}{'valid_strings'} = "";
+$restriction_types{"c9strict"}{'valid_strings'} = "";
+$restriction_types{"strict"}{'valid_strings'} = "";
+$restriction_types{"fits_in_31_bits"}{'valid_strings'} = "";
+
+# This set of tests looks for basic sanity, and lastly tests various routines
+# for the given code point.  If the earlier tests for that code point fail,
+# the later ones probably will too.  Malformations are tested in later
 # segments of code.
 for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
           keys %code_points)
@@ -421,22 +494,29 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
     # later section of the code tests for these kinds of things.
     my $this_utf8_flags = $look_for_everything_utf8n_to;
     my $len = length $bytes;
-    if ($n > 2 ** 31 - 1) {
-        $this_utf8_flags &=
-                        ~($UTF8_DISALLOW_ABOVE_31_BIT|$UTF8_WARN_ABOVE_31_BIT);
-    }
 
     my $valid_under_strict = 1;
     my $valid_under_c9strict = 1;
+    my $valid_for_fits_in_31_bits = 1;
     if ($n > 0x10FFFF) {
         $this_utf8_flags &= ~($UTF8_DISALLOW_SUPER|$UTF8_WARN_SUPER);
         $valid_under_strict = 0;
         $valid_under_c9strict = 0;
+        if ($n > 2 ** 31 - 1) {
+            $this_utf8_flags &=
+                            ~($UTF8_DISALLOW_ABOVE_31_BIT|$UTF8_WARN_ABOVE_31_BIT);
+            $valid_for_fits_in_31_bits = 0;
+        }
     }
-    elsif (($n & 0xFFFE) == 0xFFFE) {
+    elsif (($n >= 0xFDD0 && $n <= 0xFDEF) || ($n & 0xFFFE) == 0xFFFE) {
         $this_utf8_flags &= ~($UTF8_DISALLOW_NONCHAR|$UTF8_WARN_NONCHAR);
         $valid_under_strict = 0;
     }
+    elsif ($n >= 0xD800 && $n <= 0xDFFF) {
+        $this_utf8_flags &= ~($UTF8_DISALLOW_SURROGATE|$UTF8_WARN_SURROGATE);
+        $valid_under_c9strict = 0;
+        $valid_under_strict = 0;
+    }
 
     undef @warnings;
 
@@ -585,9 +665,12 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
     if ($n > 0x10FFFF) {
         $this_uvchr_flags &= ~($UNICODE_DISALLOW_SUPER|$UNICODE_WARN_SUPER);
     }
-    elsif (($n & 0xFFFE) == 0xFFFE) {
+    elsif (($n >= 0xFDD0 && $n <= 0xFDEF) || ($n & 0xFFFE) == 0xFFFE) {
         $this_uvchr_flags &= ~($UNICODE_DISALLOW_NONCHAR|$UNICODE_WARN_NONCHAR);
     }
+    elsif ($n >= 0xD800 && $n <= 0xDFFF) {
+        $this_uvchr_flags &= ~($UNICODE_DISALLOW_SURROGATE|$UNICODE_WARN_SURROGATE);
+    }
     $display_flags = sprintf "0x%x", $this_uvchr_flags;
 
     undef @warnings;
@@ -601,18 +684,284 @@ for my $u (sort { utf8::unicode_to_native($a) <=> utf8::unicode_to_native($b) }
     {
         diag "The warnings were: " . join(", ", @warnings);
     }
+
+    # Now append this code point to a string that we will test various
+    # versions of is_foo_utf8_string_bar on, and keep a count of how many code
+    # points are in it.  All the code points in this loop are valid in Perl's
+    # extended UTF-8, but some are not valid under various restrictions.  A
+    # string and count is kept separately that is entirely valid for each
+    # restriction.  And, for each restriction, we note the first occurrence in
+    # the unrestricted string where we find something not in the restricted
+    # string.
+    $restriction_types{""}{'valid_strings'} .= $bytes;
+    $restriction_types{""}{'valid_counts'}++;
+
+    if ($valid_under_c9strict) {
+        $restriction_types{"c9strict"}{'valid_strings'} .= $bytes;
+        $restriction_types{"c9strict"}{'valid_counts'}++;
+    }
+    elsif (! exists $restriction_types{"c9strict"}{'first_invalid_offset'}) {
+        $restriction_types{"c9strict"}{'first_invalid_offset'}
+                    = length $restriction_types{"c9strict"}{'valid_strings'};
+        $restriction_types{"c9strict"}{'first_invalid_count'}
+                            = $restriction_types{"c9strict"}{'valid_counts'};
+    }
+
+    if ($valid_under_strict) {
+        $restriction_types{"strict"}{'valid_strings'} .= $bytes;
+        $restriction_types{"strict"}{'valid_counts'}++;
+    }
+    elsif (! exists $restriction_types{"strict"}{'first_invalid_offset'}) {
+        $restriction_types{"strict"}{'first_invalid_offset'}
+                        = length $restriction_types{"strict"}{'valid_strings'};
+        $restriction_types{"strict"}{'first_invalid_count'}
+                                = $restriction_types{"strict"}{'valid_counts'};
+    }
+
+    if ($valid_for_fits_in_31_bits) {
+        $restriction_types{"fits_in_31_bits"}{'valid_strings'} .= $bytes;
+        $restriction_types{"fits_in_31_bits"}{'valid_counts'}++;
+    }
+    elsif (! exists
+                $restriction_types{"fits_in_31_bits"}{'first_invalid_offset'})
+    {
+        $restriction_types{"fits_in_31_bits"}{'first_invalid_offset'}
+                = length $restriction_types{"fits_in_31_bits"}{'valid_strings'};
+        $restriction_types{"fits_in_31_bits"}{'first_invalid_count'}
+                        = $restriction_types{"fits_in_31_bits"}{'valid_counts'};
+    }
+}
+
+my $I8c = (isASCII) ? "\x80" : "\xa0";    # A continuation byte
+my $cont_byte = I8_to_native($I8c);
+my $p = (isASCII) ? "\xe1\x80" : I8_to_native("\xE4\xA0");  # partial
+
+# The loop above tested the single or partial character functions/macros,
+# while building up strings to test the string functions, which we do now.
+
+for my $restriction (sort keys %restriction_types) {
+    use bytes;
+
+    for my $use_flags ("", "_flags") {
+
+        # For each restriction, we test it in both the is_foo_flags functions
+        # and the specially named foo function.  But not if there isn't such a
+        # specially named function.  Currently, this is the only tested
+        # restriction that doesn't have a specially named function
+        next if $use_flags eq "" && $restriction eq "fits_in_31_bits";
+
+        # Start building up the name of the function we will test.
+        my $base_name = "is_";
+
+        if (! $use_flags  && $restriction ne "") {
+            $base_name .= $restriction . "_";
+        }
+
+        # We test both "is_utf8_string_foo" and "is_fixed_width_buf" functions
+        foreach my $operand ('string', 'fixed_width_buf') {
+
+            # Currently, the only fixed_width_buf functions have the '_flags'
+            # suffix.
+            next if $operand eq 'fixed_width_buf' && $use_flags eq "";
+
+            my $name = "${base_name}utf8_$operand";
+
+            # We test each version of the function
+            for my $function ("_loclen", "_loc", "") {
+
+                # We test each function against
+                #   a) valid input
+                #   b) invalid input created by appending an out-of-place
+                #      continuation character to the valid string
+                #   c) input created by appending a partial character.  This
+                #      is valid in the 'fixed_width' functions, but invalid in
+                #      the 'string' ones
+                #   d) invalid input created by calling a function that is
+                #      expecting a restricted form of the input using the string
+                #      that's valid when unrestricted
+                for my $error_type (0, $cont_byte, $p, $restriction) {
+                    #diag "restriction=$restriction, use_flags=$use_flags, function=$function, error_type=" . display_bytes($error_type);
+
+                    # If there is no restriction, the error type will be "",
+                    # which is redundant with 0.
+                    next if $error_type eq "";
+
+                    my $this_name = "$name$function$use_flags";
+                    my $bytes
+                            = $restriction_types{$restriction}{'valid_strings'};
+                    my $expected_offset = length $bytes;
+                    my $expected_count
+                            = $restriction_types{$restriction}{'valid_counts'};
+                    my $test_name_suffix = "";
+
+                    my $this_error_type = $error_type;
+                    if ($this_error_type) {
+
+                        # Appending a bare continuation byte or a partial
+                        # character doesn't change the character count or
+                        # offset.  But in the other cases, we have saved where
+                        # the failures should occur, so use those.  Appending
+                        # a continuation byte makes it invalid; appending a
+                        # partial character makes the 'string' form invalid,
+                        # but not the 'fixed_width_buf' form.
+                        if ($this_error_type eq $cont_byte || $this_error_type eq $p) {
+                            $bytes .= $this_error_type;
+                            if ($this_error_type eq $cont_byte) {
+                                $test_name_suffix
+                                            = " for an unexpected continuation";
+                            }
+                            else {
+                                $test_name_suffix
+                                        = " if ends with a partial character";
+                                $this_error_type
+                                        = 0 if $operand eq "fixed_width_buf";
+                            }
+                        }
+                        else {
+                            $test_name_suffix
+                                        = " if contains forbidden code points";
+                            if ($this_error_type eq "c9strict") {
+                                $bytes = $restriction_types{""}{'valid_strings'};
+                                $expected_offset
+                                 = $restriction_types{"c9strict"}
+                                                     {'first_invalid_offset'};
+                                $expected_count
+                                  = $restriction_types{"c9strict"}
+                                                      {'first_invalid_count'};
+                            }
+                            elsif ($this_error_type eq "strict") {
+                                $bytes = $restriction_types{""}{'valid_strings'};
+                                $expected_offset
+                                  = $restriction_types{"strict"}
+                                                      {'first_invalid_offset'};
+                                $expected_count
+                                  = $restriction_types{"strict"}
+                                                      {'first_invalid_count'};
+
+                            }
+                            elsif ($this_error_type eq "fits_in_31_bits") {
+                                $bytes = $restriction_types{""}{'valid_strings'};
+                                $expected_offset
+                                  = $restriction_types{"fits_in_31_bits"}
+                                                      {'first_invalid_offset'};
+                                $expected_count
+                                    = $restriction_types{"fits_in_31_bits"}
+                                                        {'first_invalid_count'};
+                            }
+                            else {
+                                fail("Internal test error: Unknown error type "
+                                . "'$this_error_type'");
+                                next;
+                            }
+                        }
+                    }
+
+                    my $length = length $bytes;
+                    my $ret_ref;
+
+                    my $test = "\$ret_ref = test_$this_name(\$bytes, $length";
+
+                    # If using the _flags functions, we have to figure out what
+                    # flags to pass.  This is done to match the restriction.
+                    if ($use_flags eq "_flags") {
+                        if (! $restriction) {
+                            $test .= ", 0";     # The flag
+
+                            # Indicate the kind of flag in the test name.
+                            $this_name .= "(0)";
+                        }
+                        else {
+                            $this_name .= "($restriction)";
+                            if ($restriction eq "c9strict") {
+                                $test
+                                  .= ", $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE";
+                            }
+                            elsif ($restriction eq "strict") {
+                                $test .= ", $UTF8_DISALLOW_ILLEGAL_INTERCHANGE";
+                            }
+                            elsif ($restriction eq "fits_in_31_bits") {
+                                $test .= ", $UTF8_DISALLOW_ABOVE_31_BIT";
+                            }
+                            else {
+                                fail("Internal test error: Unknown restriction "
+                                . "'$restriction'");
+                                next;
+                            }
+                        }
+                    }
+                    $test .= ")";
+
+                    # Actually run the test
+                    eval $test;
+                    if ($@) {
+                        fail($test);
+                        diag $@;
+                        next;
+                    }
+
+                    my $ret;
+                    my $error_offset;
+                    my $cp_count;
+
+                    if ($function eq "") {
+                        $ret = $ret_ref;    # For plain function, there's only a
+                                            # single return value
+                    }
+                    else {  # Otherwise, the multiple values come in an array.
+                        $ret = shift @$ret_ref ;
+                        $error_offset = shift @$ret_ref;
+                        $cp_count = shift@$ret_ref if $function eq "_loclen";
+                    }
+
+                    if ($this_error_type) {
+                        is($ret, 0,
+                           "Verify $this_name is FALSE$test_name_suffix");
+                    }
+                    else {
+                        unless(is($ret, 1,
+                                  "Verify $this_name is TRUE for valid input"
+                                . "$test_name_suffix"))
+                        {
+                            diag("The bytes starting at offset"
+                               . " $error_offset are"
+                               . display_bytes(substr(
+                                          $restriction_types{$restriction}
+                                                            {'valid_strings'},
+                                          $error_offset)));
+                            next;
+                        }
+                    }
+
+                    if ($function ne "") {
+                        unless (is($error_offset, $expected_offset,
+                                   "\tAnd returns the correct offset"))
+                        {
+                            my $min = ($error_offset < $expected_offset)
+                                    ? $error_offset
+                                    : $expected_offset;
+                            diag display_bytes(substr($bytes, $min));
+                        }
+
+                        if ($function eq '_loclen') {
+                            is($cp_count, $expected_count,
+                               "\tAnd returns the correct character count");
+                        }
+                    }
+                }
+            }
+        }
+    }
 }
 
 my $REPLACEMENT = 0xFFFD;
 
 # Now test the malformations.  All these raise category utf8 warnings.
-my $c = (isASCII) ? "\x80" : "\xa0";    # A continuation byte
 my @malformations = (
     [ "zero length string malformation", "", 0,
         $UTF8_ALLOW_EMPTY, 0, 0,
         qr/empty string/
     ],
-    [ "orphan continuation byte malformation", I8_to_native("${c}a"),
+    [ "orphan continuation byte malformation", I8_to_native("${I8c}a"),
         2,
         $UTF8_ALLOW_CONTINUATION, $REPLACEMENT, 1,
         qr/unexpected continuation byte/
@@ -624,12 +973,12 @@ my @malformations = (
         qr/unexpected non-continuation byte.*immediately after start byte/
     ],
     [ "premature next character malformation (non-immediate)",
-        I8_to_native("\xf0${c}a"),
+        I8_to_native("\xf0${I8c}a"),
         3,
         $UTF8_ALLOW_NON_CONTINUATION, $REPLACEMENT, 2,
         qr/unexpected non-continuation byte .* 2 bytes after start byte/
     ],
-    [ "too short malformation", I8_to_native("\xf0${c}a"), 2,
+    [ "too short malformation", I8_to_native("\xf0${I8c}a"), 2,
         # Having the 'a' after this, but saying there are only 2 bytes also
         # tests that we pay attention to the passed in length
         $UTF8_ALLOW_SHORT, $REPLACEMENT, 2,
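
The string tests above check three things for each `_loclen` variant: the boolean result, the offset of the first failure, and the number of characters counted before it. The following is a minimal standalone sketch of that contract, not Perl's implementation. The `sketch_` names are hypothetical, the validator accepts only standard UTF-8 (no surrogates, nothing above U+10FFFF, so closest in spirit to the c9strict variant), and the `len == 0` means `strlen()` convenience is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the well-formed standard UTF-8
 * character starting at x (never reading at or past send), or 0 if the
 * bytes there are malformed or truncated. */
static size_t
sketch_utf8_char_len(const unsigned char *x, const unsigned char *send)
{
    size_t need, i;
    unsigned long cp;

    if (x >= send) return 0;
    if (x[0] < 0x80) return 1;                          /* ASCII */
    else if (x[0] >= 0xC2 && x[0] <= 0xDF) { need = 2; cp = x[0] & 0x1F; }
    else if ((x[0] & 0xF0) == 0xE0)        { need = 3; cp = x[0] & 0x0F; }
    else if (x[0] >= 0xF0 && x[0] <= 0xF4) { need = 4; cp = x[0] & 0x07; }
    else return 0;      /* stray continuation byte or invalid start byte */

    if ((size_t)(send - x) < need) return 0;            /* truncated */
    for (i = 1; i < need; i++) {
        if ((x[i] & 0xC0) != 0x80) return 0;            /* not a continuation */
        cp = (cp << 6) | (x[i] & 0x3F);
    }
    if (need == 3 && cp < 0x800)   return 0;            /* overlong */
    if (need == 4 && cp < 0x10000) return 0;            /* overlong */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;         /* surrogate */
    if (cp > 0x10FFFF) return 0;                        /* above Unicode */
    return need;
}

/* Same shape as the *_loclen functions: on return *ep points at the first
 * bad byte (or at s + len on success) and *el holds the number of complete
 * characters seen before that point.  Either pointer may be NULL. */
static bool
sketch_utf8_string_loclen(const unsigned char *s, size_t len,
                          const unsigned char **ep, size_t *el)
{
    const unsigned char *send, *x = s;
    size_t count = 0;

    assert(s != NULL);
    send = s + len;

    while (x < send) {
        size_t cur = sketch_utf8_char_len(x, send);
        if (cur == 0) break;
        x += cur;
        count++;
    }

    if (el) *el = count;
    if (ep) *ep = x;
    return x == send;
}
```

Appending a bare continuation byte to a valid string, as the tests do with `$cont_byte`, leaves the reported offset and count at the values for the valid prefix while flipping the result to false.
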
diff --git a/handy.h b/handy.h
index 5428d7c..11009d3 100644
--- a/handy.h
+++ b/handy.h
@@ -277,6 +277,15 @@ typedef U64TYPE U64;
 /* Unused by core; should be deprecated */
 #define Ctl(ch) ((ch) & 037)
 
+#if defined(PERL_CORE) || defined(PERL_EXT)
+#  ifndef MIN
+#    define MIN(a,b) ((a) < (b) ? (a) : (b))
+#  endif
+#  ifndef MAX
+#    define MAX(a,b) ((a) > (b) ? (a) : (b))
+#  endif
+#endif
+
 /* This is a helper macro to avoid preprocessor issues, replaced by nothing
  * unless under DEBUGGING, where it expands to an assert of its argument,
  * followed by a comma (hence the comma operator).  If we just used a straight
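
As a rough illustration of the guarded definitions added to handy.h above, the sketch below repeats the same `#ifndef` pattern outside of Perl and uses the macros in a small helper. The `sketch_clamp` function is hypothetical and exists only for this example; the `PERL_CORE`/`PERL_EXT` guard is omitted because it only makes sense inside a Perl build.

```c
#include <assert.h>
#include <stddef.h>

/* Define each macro only if some system header has not already done so. */
#ifndef MIN
#  define MIN(a,b) ((a) < (b) ? (a) : (b))
#endif
#ifndef MAX
#  define MAX(a,b) ((a) > (b) ? (a) : (b))
#endif

/* Hypothetical helper: clamp a requested size into the range [lo, hi]. */
static size_t
sketch_clamp(size_t want, size_t lo, size_t hi)
{
    assert(lo <= hi);
    return MAX(lo, MIN(want, hi));
}
```

Because these are plain function-like macros, each argument can be evaluated more than once, so arguments with side effects such as `MIN(i++, n)` are best avoided.
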
diff --git a/inline.h b/inline.h
index e4b857d..66ba348 100644
--- a/inline.h
+++ b/inline.h
@@ -278,7 +278,7 @@ S_append_utf8_from_native_byte(const U8 byte, U8** dest)
 
 /*
 =for apidoc valid_utf8_to_uvchr
-Like L</utf8_to_uvchr_buf>(), but should only be called when it is known that
+Like C<L</utf8_to_uvchr_buf>>, but should only be called when it is known that
 the next character in the input UTF-8 string C<s> is well-formed (I<e.g.>,
 it passes C<L</isUTF8_CHAR>>.  Surrogates, non-character code points, and
 non-Unicode code points are allowed.
@@ -334,8 +334,23 @@ If C<len> is 0, it will be calculated using C<strlen(s)>, (which means if you
 use this option, that C<s> can't have embedded C<NUL> characters and has to
 have a terminating C<NUL> byte).
 
-See also L</is_utf8_string>(), L</is_utf8_string_loclen>(), and
-L</is_utf8_string_loc>().
+See also
+C<L</is_utf8_string>>,
+C<L</is_utf8_string_flags>>,
+C<L</is_utf8_string_loc>>,
+C<L</is_utf8_string_loc_flags>>,
+C<L</is_utf8_string_loclen>>,
+C<L</is_utf8_string_loclen_flags>>,
+C<L</is_utf8_fixed_width_buf_flags>>,
+C<L</is_utf8_fixed_width_buf_loc_flags>>,
+C<L</is_utf8_fixed_width_buf_loclen_flags>>,
+C<L</is_strict_utf8_string>>,
+C<L</is_strict_utf8_string_loc>>,
+C<L</is_strict_utf8_string_loclen>>,
+C<L</is_c9strict_utf8_string>>,
+C<L</is_c9strict_utf8_string_loc>>,
+and
+C<L</is_c9strict_utf8_string_loclen>>.
 
 =cut
 */
@@ -365,11 +380,19 @@ be calculated using C<strlen(s)> (which means if you use this option, that C<s>
 can't have embedded C<NUL> characters and has to have a terminating C<NUL>
 byte).  Note that all characters being ASCII constitute 'a valid UTF-8 string'.
 
-Code points above Unicode, surrogates, and non-character code points are
-considered valid by this function.
+This function considers Perl's extended UTF-8 to be valid.  That means that
+code points above Unicode, surrogates, and non-character code points are
+considered valid by this function.  Use C<L</is_strict_utf8_string>>,
+C<L</is_c9strict_utf8_string>>, or C<L</is_utf8_string_flags>> to restrict what
+code points are considered valid.
 
-See also L</is_utf8_invariant_string>(), L</is_utf8_string_loclen>(), and
-L</is_utf8_string_loc>().
+See also
+C<L</is_utf8_invariant_string>>,
+C<L</is_utf8_string_loc>>,
+C<L</is_utf8_string_loclen>>,
+C<L</is_utf8_fixed_width_buf_flags>>,
+C<L</is_utf8_fixed_width_buf_loc_flags>>,
+C<L</is_utf8_fixed_width_buf_loclen_flags>>.
 
 =cut
 */
@@ -397,24 +420,220 @@ Perl_is_utf8_string(const U8 *s, const STRLEN len)
 }
 
 /*
-Implemented as a macro in utf8.h
+=for apidoc is_strict_utf8_string
+
+Returns TRUE if the first C<len> bytes of string C<s> form a valid
+UTF-8-encoded string that is fully interchangeable by any application using
+Unicode rules; otherwise it returns FALSE.  If C<len> is 0, it will be
+calculated using C<strlen(s)> (which means if you use this option, that C<s>
+can't have embedded C<NUL> characters and has to have a terminating C<NUL>
+byte).  Note that all characters being ASCII constitute 'a valid UTF-8 string'.
+
+This function returns FALSE for strings containing any
+code points above the Unicode max of 0x10FFFF, surrogate code points, or
+non-character code points.
+
+See also
+C<L</is_utf8_invariant_string>>,
+C<L</is_utf8_string>>,
+C<L</is_utf8_string_flags>>,
+C<L</is_utf8_string_loc>>,
+C<L</is_utf8_string_loc_flags>>,
+C<L</is_utf8_string_loclen>>,
+C<L</is_utf8_string_loclen_flags>>,
+C<L</is_utf8_fixed_width_buf_flags>>,
+C<L</is_utf8_fixed_width_buf_loc_flags>>,
+C<L</is_utf8_fixed_width_buf_loclen_flags>>,
+C<L</is_strict_utf8_string_loc>>,
+C<L</is_strict_utf8_string_loclen>>,
+C<L</is_c9strict_utf8_string>>,
+C<L</is_c9strict_utf8_string_loc>>,
+and
+C<L</is_c9strict_utf8_string_loclen>>.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_strict_utf8_string(const U8 *s, const STRLEN len)
+{
+    const U8* const send = s + (len ? len : strlen((const char *)s));
+    const U8* x = s;
+
+    PERL_ARGS_ASSERT_IS_STRICT_UTF8_STRING;
+
+    while (x < send) {
+        const STRLEN cur_len = isSTRICT_UTF8_CHAR(x, send);
+        if (UNLIKELY(! cur_len)) {
+            return FALSE;
+        }
+        x += cur_len;
+    }
+
+    return TRUE;
+}
+
+/*
+=for apidoc is_c9strict_utf8_string
+
+Returns TRUE if the first C<len> bytes of string C<s> form a valid
+UTF-8-encoded string that conforms to
+L<Unicode Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>;
+otherwise it returns FALSE.  If C<len> is 0, it will be calculated using
+C<strlen(s)> (which means if you use this option, that C<s> can't have embedded
+C<NUL> characters and has to have a terminating C<NUL> byte).  Note that all
+characters being ASCII constitute 'a valid UTF-8 string'.
+
+This function returns FALSE for strings containing any code points above the
+Unicode max of 0x10FFFF or surrogate code points, but accepts non-character
+code points per
+L<Corrigendum #9|http://www.unicode.org/versions/corrigendum9.html>.
+
+See also
+C<L</is_utf8_invariant_string>>,
+C<L</is_utf8_string>>,
+C<L</is_utf8_string_flags>>,
+C<L</is_utf8_string_loc>>,
+C<L</is_utf8_string_loc_flags>>,
+C<L</is_utf8_string_loclen>>,
+C<L</is_utf8_string_loclen_flags>>,
+C<L</is_utf8_fixed_width_buf_flags>>,
+C<L</is_utf8_fixed_width_buf_loc_flags>>,
+C<L</is_utf8_fixed_width_buf_loclen_flags>>,
+C<L</is_strict_utf8_string>>,
+C<L</is_strict_utf8_string_loc>>,
+C<L</is_strict_utf8_string_loclen>>,
+C<L</is_c9strict_utf8_string_loc>>,
+and
+C<L</is_c9strict_utf8_string_loclen>>.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_c9strict_utf8_string(const U8 *s, const STRLEN len)
+{
+    const U8* const send = s + (len ? len : strlen((const char *)s));
+    const U8* x = s;
+
+    PERL_ARGS_ASSERT_IS_C9STRICT_UTF8_STRING;
+
+    while (x < send) {
+        const STRLEN cur_len = isC9_STRICT_UTF8_CHAR(x, send);
+        if (UNLIKELY(! cur_len)) {
+            return FALSE;
+        }
+        x += cur_len;
+    }
+
+    return TRUE;
+}
+
+/* The above 3 functions could have been moved into the more general one just
+ * below, and made #defines that call it with the right 'flags'.  They are
+ * currently kept separate to increase their chances of getting inlined */
+
+/*
+=for apidoc is_utf8_string_flags
+
+Returns TRUE if the first C<len> bytes of string C<s> form a valid
+UTF-8 string, subject to the restrictions imposed by C<flags>;
+returns FALSE otherwise.  If C<len> is 0, it will be calculated
+using C<strlen(s)> (which means if you use this option, that C<s> can't have
+embedded C<NUL> characters and has to have a terminating C<NUL> byte).  Note
+that all characters being ASCII constitute 'a valid UTF-8 string'.
+
+If C<flags> is 0, this gives the same results as C<L</is_utf8_string>>; if
+C<flags> is C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>, this gives the same results
+as C<L</is_strict_utf8_string>>; and if C<flags> is
+C<UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE>, this gives the same results as
+C<L</is_c9strict_utf8_string>>.  Otherwise C<flags> may be any
+combination of the C<UTF8_DISALLOW_I<foo>> flags understood by
+C<L</utf8n_to_uvchr>>, with the same meanings.
+
+See also
+C<L</is_utf8_invariant_string>>,
+C<L</is_utf8_string>>,
+C<L</is_utf8_string_loc>>,
+C<L</is_utf8_string_loc_flags>>,
+C<L</is_utf8_string_loclen>>,
+C<L</is_utf8_string_loclen_flags>>,
+C<L</is_utf8_fixed_width_buf_flags>>,
+C<L</is_utf8_fixed_width_buf_loc_flags>>,
+C<L</is_utf8_fixed_width_buf_loclen_flags>>,
+C<L</is_strict_utf8_string>>,
+C<L</is_strict_utf8_string_loc>>,
+C<L</is_strict_utf8_string_loclen>>,
+C<L</is_c9strict_utf8_string>>,
+C<L</is_c9strict_utf8_string_loc>>,
+and
+C<L</is_c9strict_utf8_string_loclen>>.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_utf8_string_flags(const U8 *s, const STRLEN len, const U32 flags)
+{
+    const U8* const send = s + (len ? len : strlen((const char *)s));
+    const U8* x = s;
+
+    PERL_ARGS_ASSERT_IS_UTF8_STRING_FLAGS;
+    assert(0 == (flags & ~(UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+                          |UTF8_DISALLOW_ABOVE_31_BIT)));
+
+    if (flags == 0) {
+        return is_utf8_string(s, len);
+    }
+
+    if ((flags & ~UTF8_DISALLOW_ABOVE_31_BIT)
+                                        == UTF8_DISALLOW_ILLEGAL_INTERCHANGE)
+    {
+        return is_strict_utf8_string(s, len);
+    }
+
+    if ((flags & ~UTF8_DISALLOW_ABOVE_31_BIT)
+                                       == UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE)
+    {
+        return is_c9strict_utf8_string(s, len);
+    }
+
+    while (x < send) {
+        STRLEN cur_len = isUTF8_CHAR_flags(x, send, flags);
+        if (UNLIKELY(! cur_len)) {
+            return FALSE;
+        }
+        x += cur_len;
+    }
+
+    return TRUE;
+}
+
+/*
 
 =for apidoc is_utf8_string_loc
 
-Like L</is_utf8_string> but stores the location of the failure (in the
+Like C<L</is_utf8_string>> but stores the location of the failure (in the
 case of "utf8ness failure") or the location C<s>+C<len> (in the case of
 "utf8ness success") in the C<ep> pointer.
 
-See also L</is_utf8_string_loclen>() and L</is_utf8_string>().
+See also C<L</is_utf8_string_loclen>>.
+
+=cut
+*/
+
+#define is_utf8_string_loc(s, len, ep)  is_utf8_string_loclen(s, len, ep, 0)
+
+/*
 
 =for apidoc is_utf8_string_loclen
 
-Like L</is_utf8_string>() but stores the location of the failure (in the
+Like C<L</is_utf8_string>> but stores the location of the failure (in the
 case of "utf8ness failure") or the location C<s>+C<len> (in the case of
-"utf8ness success") in the C<ep>, and the number of UTF-8
+"utf8ness success") in the C<ep> pointer, and the number of UTF-8
 encoded characters in the C<el> pointer.
 
-See also L</is_utf8_string_loc>() and L</is_utf8_string>().
+See also C<L</is_utf8_string_loc>>.
 
 =cut
 */
@@ -448,6 +667,203 @@ Perl_is_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN
 }
 
 /*
+
+=for apidoc is_strict_utf8_string_loc
+
+Like C<L</is_strict_utf8_string>> but stores the location of the failure (in the
+case of "utf8ness failure") or the location C<s>+C<len> (in the case of
+"utf8ness success") in the C<ep> pointer.
+
+See also C<L</is_strict_utf8_string_loclen>>.
+
+=cut
+*/
+
+#define is_strict_utf8_string_loc(s, len, ep)                               \
+                                is_strict_utf8_string_loclen(s, len, ep, 0)
+
+/*
+
+=for apidoc is_strict_utf8_string_loclen
+
+Like C<L</is_strict_utf8_string>> but stores the location of the failure (in the
+case of "utf8ness failure") or the location C<s>+C<len> (in the case of
+"utf8ness success") in the C<ep> pointer, and the number of UTF-8
+encoded characters in the C<el> pointer.
+
+See also C<L</is_strict_utf8_string_loc>>.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_strict_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el)
+{
+    const U8* const send = s + (len ? len : strlen((const char *)s));
+    const U8* x = s;
+    STRLEN outlen = 0;
+
+    PERL_ARGS_ASSERT_IS_STRICT_UTF8_STRING_LOCLEN;
+
+    while (x < send) {
+        const STRLEN cur_len = isSTRICT_UTF8_CHAR(x, send);
+        if (UNLIKELY(! cur_len)) {
+            break;
+        }
+        x += cur_len;
+        outlen++;
+    }
+
+    if (el)
+        *el = outlen;
+
+    if (ep) {
+        *ep = x;
+    }
+
+    return (x == send);
+}
+
+/*
+
+=for apidoc is_c9strict_utf8_string_loc
+
+Like C<L</is_c9strict_utf8_string>> but stores the location of the failure (in
+the case of "utf8ness failure") or the location C<s>+C<len> (in the case of
+"utf8ness success") in the C<ep> pointer.
+
+See also C<L</is_c9strict_utf8_string_loclen>>.
+
+=cut
+*/
+
+#define is_c9strict_utf8_string_loc(s, len, ep)                             \
+                            is_c9strict_utf8_string_loclen(s, len, ep, 0)
+
+/*
+
+=for apidoc is_c9strict_utf8_string_loclen
+
+Like C<L</is_c9strict_utf8_string>> but stores the location of the failure (in
+the case of "utf8ness failure") or the location C<s>+C<len> (in the case of
+"utf8ness success") in the C<ep> pointer, and the number of UTF-8 encoded
+characters in the C<el> pointer.
+
+See also C<L</is_c9strict_utf8_string_loc>>.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_c9strict_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el)
+{
+    const U8* const send = s + (len ? len : strlen((const char *)s));
+    const U8* x = s;
+    STRLEN outlen = 0;
+
+    PERL_ARGS_ASSERT_IS_C9STRICT_UTF8_STRING_LOCLEN;
+
+    while (x < send) {
+        const STRLEN cur_len = isC9_STRICT_UTF8_CHAR(x, send);
+        if (UNLIKELY(! cur_len)) {
+            break;
+        }
+        x += cur_len;
+        outlen++;
+    }
+
+    if (el)
+        *el = outlen;
+
+    if (ep) {
+        *ep = x;
+    }
+
+    return (x == send);
+}
+
+/*
+
+=for apidoc is_utf8_string_loc_flags
+
+Like C<L</is_utf8_string_flags>> but stores the location of the failure (in the
+case of "utf8ness failure") or the location C<s>+C<len> (in the case of
+"utf8ness success") in the C<ep> pointer.
+
+See also C<L</is_utf8_string_loclen_flags>>.
+
+=cut
+*/
+
+#define is_utf8_string_loc_flags(s, len, ep, flags)                         \
+                        is_utf8_string_loclen_flags(s, len, ep, 0, flags)
+
+
+/* The above 3 actual functions could have been moved into the more general one
+ * just below, and made #defines that call it with the right 'flags'.  They are
+ * currently kept separate to increase their chances of getting inlined */
+
+/*
+
+=for apidoc is_utf8_string_loclen_flags
+
+Like C<L</is_utf8_string_flags>> but stores the location of the failure (in the
+case of "utf8ness failure") or the location C<s>+C<len> (in the case of
+"utf8ness success") in the C<ep> pointer, and the number of UTF-8
+encoded characters in the C<el> pointer.
+
+See also C<L</is_utf8_string_loc_flags>>.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_utf8_string_loclen_flags(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el, const U32 flags)
+{
+    const U8* const send = s + (len ? len : strlen((const char *)s));
+    const U8* x = s;
+    STRLEN outlen = 0;
+
+    PERL_ARGS_ASSERT_IS_UTF8_STRING_LOCLEN_FLAGS;
+    assert(0 == (flags & ~(UTF8_DISALLOW_ILLEGAL_INTERCHANGE
+                          |UTF8_DISALLOW_ABOVE_31_BIT)));
+
+    if (flags == 0) {
+        return is_utf8_string_loclen(s, len, ep, el);
+    }
+
+    if ((flags & ~UTF8_DISALLOW_ABOVE_31_BIT)
+                                        == UTF8_DISALLOW_ILLEGAL_INTERCHANGE)
+    {
+        return is_strict_utf8_string_loclen(s, len, ep, el);
+    }
+
+    if ((flags & ~UTF8_DISALLOW_ABOVE_31_BIT)
+                                    == UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE)
+    {
+        return is_c9strict_utf8_string_loclen(s, len, ep, el);
+    }
+
+    while (x < send) {
+        const STRLEN cur_len = isUTF8_CHAR_flags(x, send, flags);
+        if (UNLIKELY(! cur_len)) {
+            break;
+        }
+        x += cur_len;
+        outlen++;
+    }
+
+    if (el)
+        *el = outlen;
+
+    if (ep) {
+        *ep = x;
+    }
+
+    return (x == send);
+}
+
+/*
 =for apidoc utf8_distance
 
 Returns the number of UTF-8 characters between the UTF-8 pointers C<a>
@@ -528,7 +944,8 @@ failure can be signalled without having to wait for the next read.
 
 =cut
 */
-#define is_utf8_valid_partial_char(s, e) is_utf8_valid_partial_char_flags(s, e, 0)
+#define is_utf8_valid_partial_char(s, e)                                    \
+                                is_utf8_valid_partial_char_flags(s, e, 0)
 
 /*
 
@@ -544,8 +961,8 @@ C<L</is_utf8_valid_partial_char>>.  Otherwise C<flags> can be any combination
 of the C<UTF8_DISALLOW_I<foo>> flags accepted by C<L</utf8n_to_uvchr>>.  If
 there is any sequence of bytes that can complete the input partial character in
 such a way that a non-prohibited character is formed, the function returns
-TRUE; otherwise FALSE.  Non characters cannot be determined based on partial
-character input.  But many  of the other possible excluded types can be
+TRUE; otherwise FALSE.  Non-character code points cannot be determined based on
+partial character input.  But many of the other possible excluded types can be
 determined from just the first one or two bytes.
 
 =cut
@@ -566,6 +983,80 @@ S_is_utf8_valid_partial_char_flags(const U8 * const s, const U8 * const e, const
     return cBOOL(_is_utf8_char_helper(s, e, flags));
 }
 
+/*
+
+=for apidoc is_utf8_fixed_width_buf_flags
+
+Returns TRUE if the fixed-width buffer starting at C<s> with length C<len>
+is entirely valid UTF-8, subject to the restrictions given by C<flags>;
+otherwise it returns FALSE.
+
+If C<flags> is 0, any well-formed UTF-8, as extended by Perl, is accepted
+without restriction.  If the final few bytes of the buffer do not form a
+complete code point, this will return TRUE anyway, provided that
+C<L</is_utf8_valid_partial_char_flags>> returns TRUE for them.
+
+If C<flags> is non-zero, it can be any combination of the
+C<UTF8_DISALLOW_I<foo>> flags accepted by C<L</utf8n_to_uvchr>>, and with the
+same meanings.
+
+This function differs from C<L</is_utf8_string_flags>> only in that the latter
+returns FALSE if the final few bytes of the string don't form a complete code
+point.
+
+=cut
+ */
+#define is_utf8_fixed_width_buf_flags(s, len, flags)                        \
+                is_utf8_fixed_width_buf_loclen_flags(s, len, 0, 0, flags)
+
+/*
+
+=for apidoc is_utf8_fixed_width_buf_loc_flags
+
+Like C<L</is_utf8_fixed_width_buf_flags>> but stores the location of the
+failure in the C<ep> pointer.  If the function returns TRUE, C<*ep> will point
+to the beginning of any partial character at the end of the buffer; if there is
+no partial character C<*ep> will contain C<s>+C<len>.
+
+See also C<L</is_utf8_fixed_width_buf_loclen_flags>>.
+
+=cut
+*/
+
+#define is_utf8_fixed_width_buf_loc_flags(s, len, loc, flags)               \
+                is_utf8_fixed_width_buf_loclen_flags(s, len, loc, 0, flags)
+
+/*
+
+=for apidoc is_utf8_fixed_width_buf_loclen_flags
+
+Like C<L</is_utf8_fixed_width_buf_loc_flags>> but stores the number of
+complete, valid characters found in the C<el> pointer.
+
+=cut
+*/
+
+PERL_STATIC_INLINE bool
+S_is_utf8_fixed_width_buf_loclen_flags(const U8 * const s,
+                                       const STRLEN len,
+                                       const U8 **ep,
+                                       STRLEN *el,
+                                       const U32 flags)
+{
+    const U8 * maybe_partial;
+
+    PERL_ARGS_ASSERT_IS_UTF8_FIXED_WIDTH_BUF_LOCLEN_FLAGS;
+
+    if (! ep) {
+        ep  = &maybe_partial;
+    }
+
+    /* If it's entirely valid, return that; otherwise see if the only error is
+     * that the final few bytes are for a partial character */
+    return    is_utf8_string_loclen_flags(s, len, ep, el, flags)
+           || is_utf8_valid_partial_char_flags(*ep, s + len, flags);
+}
+
 /* ------------------------------- perl.h ----------------------------- */
 
 /*
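
The documentation added to inline.h above distinguishes three levels of restriction: Perl's extended UTF-8 (anything goes), the Corrigendum #9 form (no surrogates, nothing above U+10FFFF), and the strict form (additionally no non-character code points). As a hedged sketch of how those levels relate at the code point level, the example below uses hypothetical `SKETCH_` flag bits; the real `UTF8_DISALLOW_*` values live in Perl's utf8.h and are not reproduced here, and `UTF8_DISALLOW_ABOVE_31_BIT` is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, for this sketch only. */
#define SKETCH_DISALLOW_SURROGATE  0x1u
#define SKETCH_DISALLOW_NONCHAR    0x2u
#define SKETCH_DISALLOW_SUPER      0x4u   /* code points above U+10FFFF */

/* Corrigendum #9 interchange: no surrogates, nothing above Unicode. */
#define SKETCH_C9_INTERCHANGE \
            (SKETCH_DISALLOW_SURROGATE | SKETCH_DISALLOW_SUPER)
/* Strict interchange: the C9 set plus non-character code points. */
#define SKETCH_STRICT_INTERCHANGE \
            (SKETCH_C9_INTERCHANGE | SKETCH_DISALLOW_NONCHAR)

/* Returns true if code point cp is acceptable under the given flags.
 * flags == 0 corresponds to accepting everything, as is_utf8_string does. */
static bool
sketch_cp_allowed(unsigned long cp, unsigned flags)
{
    assert((flags & ~SKETCH_STRICT_INTERCHANGE) == 0);  /* known bits only */

    if ((flags & SKETCH_DISALLOW_SUPER) && cp > 0x10FFFF)
        return false;
    if ((flags & SKETCH_DISALLOW_SURROGATE) && cp >= 0xD800 && cp <= 0xDFFF)
        return false;
    /* Non-characters: U+FDD0..U+FDEF, plus every U+xxFFFE and U+xxFFFF. */
    if ((flags & SKETCH_DISALLOW_NONCHAR)
        && cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE))
        return false;
    return true;
}
```

In this sketch U+FFFF passes under the C9 flags but fails under the strict flags, which is the distinction the `is_c9strict_*` and `is_strict_*` documentation draws.
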
diff --git a/pp_sys.c b/pp_sys.c
index a198d4e..3c8e985 100644
--- a/pp_sys.c
+++ b/pp_sys.c
@@ -3556,14 +3556,10 @@ PP(pp_fttext)
 
     assert(len);
     if (! is_utf8_invariant_string((U8 *) s, len)) {
-        const U8 *ep;
 
         /* Here contains a variant under UTF-8 .  See if the entire string is
-         * UTF-8.  But the buffer may end in a partial character, so if it
-         * failed, see if the failure was due just to that */
-        if (   is_utf8_string_loc((U8 *) s, len, &ep)
-            || is_utf8_valid_partial_char(ep, (U8 *) s + len))
-        {
+         * UTF-8. */
+        if (is_utf8_fixed_width_buf_flags((U8 *) s, len, 0)) {
             if (PL_op->op_type == OP_FTTEXT) {
                 FT_RETURNYES;
             }
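
The pp_sys.c change above replaces an open-coded "validate the whole buffer, and if that fails see whether only a trailing partial character is to blame" test with one call. The sketch below shows that pattern in isolation, assuming standard UTF-8 rather than Perl's extended form and taking no flags. All `sketch_` names are hypothetical and none of this is Perl's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper.  Returns the length (1..4) of the well-formed
 * character at x; 0 if the bytes are malformed; -1 if the bytes up to send
 * are a valid but incomplete prefix of some character. */
static int
sketch_char_status(const unsigned char *x, const unsigned char *send)
{
    size_t avail = (size_t)(send - x), need, i;
    unsigned char lo = 0x80, hi = 0xBF;    /* allowed range of 2nd byte */

    if (avail == 0) return 0;
    if (x[0] < 0x80) return 1;
    if (x[0] >= 0xC2 && x[0] <= 0xDF)      need = 2;
    else if (x[0] >= 0xE0 && x[0] <= 0xEF) need = 3;
    else if (x[0] >= 0xF0 && x[0] <= 0xF4) need = 4;
    else return 0;

    if (x[0] == 0xE0) lo = 0xA0;           /* excludes overlongs */
    if (x[0] == 0xED) hi = 0x9F;           /* excludes surrogates */
    if (x[0] == 0xF0) lo = 0x90;           /* excludes overlongs */
    if (x[0] == 0xF4) hi = 0x8F;           /* excludes above U+10FFFF */

    for (i = 1; i < need && i < avail; i++) {
        if (x[i] < lo || x[i] > hi) return 0;
        lo = 0x80; hi = 0xBF;              /* later bytes: any continuation */
    }
    return (avail < need) ? -1 : (int) need;
}

/* Whole-string check that also reports where it stopped and how many
 * complete characters it saw. */
static bool
sketch_string_loclen(const unsigned char *s, size_t len,
                     const unsigned char **ep, size_t *el)
{
    const unsigned char *send = s + len, *x = s;
    size_t count = 0;
    int n;

    while (x < send && (n = sketch_char_status(x, send)) > 0) {
        x += n;
        count++;
    }
    if (el) *el = count;
    if (ep) *ep = x;
    return x == send;
}

/* Fixed-width buffer check: valid if the whole buffer is valid, or if the
 * only problem is a partial character cut off by the end of the buffer. */
static bool
sketch_fixed_width_buf_loclen(const unsigned char *s, size_t len,
                              const unsigned char **ep, size_t *el)
{
    const unsigned char *maybe_partial;

    assert(s != NULL);
    if (!ep) ep = &maybe_partial;   /* we need the stop position regardless */

    return sketch_string_loclen(s, len, ep, el)
        || sketch_char_status(*ep, s + len) == -1;
}
```

A buffer ending in the first two bytes of a three-byte character is rejected by the string check but accepted by the fixed-width check, with `*ep` left pointing at the start of the partial character.
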
diff --git a/proto.h b/proto.h
index 7c2a821..b30a593 100644
--- a/proto.h
+++ b/proto.h
@@ -1328,6 +1328,15 @@ PERL_CALLCONV bool       Perl_isIDFIRST_lazy(pTHX_ const char* p)
                        __attribute__warn_unused_result__
                        __attribute__pure__; */
 
+PERL_STATIC_INLINE bool        S_is_c9strict_utf8_string(const U8 *s, const STRLEN len)
+                       __attribute__pure__;
+#define PERL_ARGS_ASSERT_IS_C9STRICT_UTF8_STRING       \
+       assert(s)
+
+/* PERL_CALLCONV bool  is_c9strict_utf8_string_loc(const U8 *s, const STRLEN len, const U8 **ep); */
+PERL_STATIC_INLINE bool        S_is_c9strict_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el);
+#define PERL_ARGS_ASSERT_IS_C9STRICT_UTF8_STRING_LOCLEN        \
+       assert(s)
 /* PERL_CALLCONV bool  Perl_is_invariant_string(const U8* const s, const STRLEN len)
                        __attribute__warn_unused_result__
                        __attribute__pure__; */
@@ -1335,6 +1344,15 @@ PERL_CALLCONV bool       Perl_isIDFIRST_lazy(pTHX_ const char* p)
 PERL_CALLCONV I32      Perl_is_lvalue_sub(pTHX)
                        __attribute__warn_unused_result__;
 
+PERL_STATIC_INLINE bool        S_is_strict_utf8_string(const U8 *s, const STRLEN len)
+                       __attribute__pure__;
+#define PERL_ARGS_ASSERT_IS_STRICT_UTF8_STRING \
+       assert(s)
+
+/* PERL_CALLCONV bool  is_strict_utf8_string_loc(const U8 *s, const STRLEN len, const U8 **ep); */
+PERL_STATIC_INLINE bool        S_is_strict_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el);
+#define PERL_ARGS_ASSERT_IS_STRICT_UTF8_STRING_LOCLEN  \
+       assert(s)
 PERL_CALLCONV bool     Perl_is_uni_alnum(pTHX_ UV c)
                        __attribute__deprecated__
                        __attribute__warn_unused_result__
@@ -1537,6 +1555,11 @@ PERL_CALLCONV bool       Perl_is_utf8_digit(pTHX_ const U8 *p)
 #define PERL_ARGS_ASSERT_IS_UTF8_DIGIT \
        assert(p)
 
+/* PERL_CALLCONV bool  is_utf8_fixed_width_buf_flags(const U8 * const s, const STRLEN len, const U32 flags); */
+/* PERL_CALLCONV bool  is_utf8_fixed_width_buf_loc_flags(const U8 * const s, const STRLEN len, const U8 **ep, const U32 flags); */
+PERL_STATIC_INLINE bool        S_is_utf8_fixed_width_buf_loclen_flags(const U8 * const s, const STRLEN len, const U8 **ep, STRLEN *el, const U32 flags);
+#define PERL_ARGS_ASSERT_IS_UTF8_FIXED_WIDTH_BUF_LOCLEN_FLAGS  \
+       assert(s)
 PERL_CALLCONV bool     Perl_is_utf8_graph(pTHX_ const U8 *p)
                        __attribute__deprecated__
                        __attribute__warn_unused_result__;
@@ -1614,14 +1637,23 @@ PERL_STATIC_INLINE bool Perl_is_utf8_string(const U8 *s, const STRLEN len)
 #define PERL_ARGS_ASSERT_IS_UTF8_STRING        \
        assert(s)
 
+PERL_STATIC_INLINE bool        S_is_utf8_string_flags(const U8 *s, const STRLEN len, const U32 flags)
+                       __attribute__pure__;
+#define PERL_ARGS_ASSERT_IS_UTF8_STRING_FLAGS  \
+       assert(s)
+
 #ifndef NO_MATHOMS
 PERL_CALLCONV bool     Perl_is_utf8_string_loc(const U8 *s, const STRLEN len, const U8 **ep);
 #define PERL_ARGS_ASSERT_IS_UTF8_STRING_LOC    \
        assert(s); assert(ep)
 #endif
+/* PERL_CALLCONV bool  is_utf8_string_loc_flags(const U8 *s, const STRLEN len, const U8 **ep, const U32 flags); */
 PERL_STATIC_INLINE bool        Perl_is_utf8_string_loclen(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el);
 #define PERL_ARGS_ASSERT_IS_UTF8_STRING_LOCLEN \
        assert(s)
+PERL_STATIC_INLINE bool        S_is_utf8_string_loclen_flags(const U8 *s, const STRLEN len, const U8 **ep, STRLEN *el, const U32 flags);
+#define PERL_ARGS_ASSERT_IS_UTF8_STRING_LOCLEN_FLAGS   \
+       assert(s)
 PERL_CALLCONV bool     Perl_is_utf8_upper(pTHX_ const U8 *p)
                        __attribute__deprecated__
                        __attribute__warn_unused_result__;
diff --git a/regcomp.c b/regcomp.c
index b00b385..634a320 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -101,14 +101,6 @@ EXTERN_C const struct regexp_engine my_reg_engine;
 #define        STATIC  static
 #endif
 
-#ifndef MIN
-#define MIN(a,b) ((a) < (b) ? (a) : (b))
-#endif
-
-#ifndef MAX
-#define MAX(a,b) ((a) > (b) ? (a) : (b))
-#endif
-
 /* this is a chain of data about sub patterns we are processing that
    need to be handled separately/specially in study_chunk. Its so
    we can simulate recursion without losing state.  */
diff --git a/sv.c b/sv.c
index 3cf52d9..850c727 100644
--- a/sv.c
+++ b/sv.c
@@ -3749,11 +3749,11 @@ Perl_sv_utf8_encode(pTHX_ SV *const sv)
 /*
 =for apidoc sv_utf8_decode
 
-If the PV of the SV is an octet sequence in UTF-8
+If the PV of the SV is an octet sequence in Perl's extended UTF-8
 and contains a multiple-byte character, the C<SvUTF8> flag is turned on
 so that it looks like a character.  If the PV contains only single-byte
 characters, the C<SvUTF8> flag stays off.
-Scans PV for validity and returns false if the PV is invalid UTF-8.
+Scans PV for validity and returns FALSE if the PV is invalid UTF-8.
 
 =cut
 */
diff --git a/utf8.c b/utf8.c
index 5fdaf52..7f8df9d 100644
--- a/utf8.c
+++ b/utf8.c
@@ -381,9 +381,6 @@ S_is_utf8_cp_above_31_bits(const U8 * const s, const U8 * const e)
      */
 
 #ifdef EBCDIC
-#  ifndef MIN
-#    define MIN(a,b) ((a) < (b) ? (a) : (b))
-#  endif
 
         /* [0] is start byte    [1] [2] [3] [4] [5] [6] [7] */
     const U8 * const prefix = "\x41\x41\x41\x41\x41\x41\x42";
diff --git a/utf8.h b/utf8.h
index 392a86a..77eb63d 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,9 +79,6 @@ the string is invariant.
 #define to_utf8_upper(a,b,c) _to_utf8_upper_flags(a,b,c,0)
 #define to_utf8_title(a,b,c) _to_utf8_title_flags(a,b,c,0)
 
-/* Source backward compatibility. */
-#define is_utf8_string_loc(s, len, ep) is_utf8_string_loclen(s, len, ep, 0)
-
 #define foldEQ_utf8(s1, pe1, l1, u1, s2, pe2, l2, u2) \
                    foldEQ_utf8_flags(s1, pe1, l1, u1, s2, pe2, l2, u2, 0)
 #define FOLDEQ_UTF8_NOMIX_ASCII   (1 << 0)
@@ -964,14 +961,22 @@ Evaluates to non-zero if the first few bytes of the string starting at C<s> and
 looking no further than S<C<e - 1>> are well-formed UTF-8, as extended by Perl,
 that represents some code point; otherwise it evaluates to 0.  If non-zero, the
 value gives how many bytes starting at C<s> comprise the code point's
-representation.
+representation.  Any bytes remaining before C<e>, but beyond the ones needed to
+form the first code point in C<s>, are not examined.
 
 The code point can be any that will fit in a UV on this machine, using Perl's
 extension to official UTF-8 to represent those higher than the Unicode maximum
 of 0x10FFFF.  That means that this macro is used to efficiently decide if the
-next few bytes in C<s> is legal UTF-8 for a single character.  Use
-L</is_utf8_string>(), L</is_utf8_string_loclen>(), and
-L</is_utf8_string_loc>() to check entire strings.
+next few bytes in C<s> is legal UTF-8 for a single character.
+
+Use C<L</isSTRICT_UTF8_CHAR>> to restrict the acceptable code points to those
+defined by Unicode to be fully interchangeable across applications;
+C<L</isC9_STRICT_UTF8_CHAR>> to use the L<Unicode Corrigendum
+#9|http://www.unicode.org/versions/corrigendum9.html> definition of allowable
+code points; and C<L</isUTF8_CHAR_flags>> for a more customized definition.
+
+Use C<L</is_utf8_string>>, C<L</is_utf8_string_loc>>, and
+C<L</is_utf8_string_loclen>> to check entire strings.
 
 Note that it is deprecated to use code points higher than what will fit in an
 IV.  This macro does not raise any warnings for such code points, treating them
@@ -1004,15 +1009,24 @@ Evaluates to non-zero if the first few bytes of the string starting at C<s> and
 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
 Unicode code point completely acceptable for open interchange between all
 applications; otherwise it evaluates to 0.  If non-zero, the value gives how
-many bytes starting at C<s> comprise the code point's representation.
+many bytes starting at C<s> comprise the code point's representation.  Any
+bytes remaining before C<e>, but beyond the ones needed to form the first code
+point in C<s>, are not examined.
 
 The largest acceptable code point is the Unicode maximum 0x10FFFF, and must not
 be a surrogate nor a non-character code point.  Thus this excludes any code
 point from Perl's extended UTF-8.
 
 This is used to efficiently decide if the next few bytes in C<s> is
-legal Unicode-acceptable UTF-8 for a single character.  Use
-C<L</isC9_STRICT_UTF8_CHAR>> to also accept non-character code points.
+legal Unicode-acceptable UTF-8 for a single character.
+
+Use C<L</isC9_STRICT_UTF8_CHAR>> to use the L<Unicode Corrigendum
+#9|http://www.unicode.org/versions/corrigendum9.html> definition of allowable
+code points; C<L</isUTF8_CHAR>> to check for Perl's extended UTF-8;
+and C<L</isUTF8_CHAR_flags>> for a more customized definition.
+
+Use C<L</is_strict_utf8_string>>, C<L</is_strict_utf8_string_loc>>, and
+C<L</is_strict_utf8_string_loclen>> to check entire strings.
 
 =cut
 */
@@ -1034,7 +1048,8 @@ Evaluates to non-zero if the first few bytes of the string starting at C<s> and
 looking no further than S<C<e - 1>> are well-formed UTF-8 that represents some
 Unicode non-surrogate code point; otherwise it evaluates to 0.  If non-zero,
 the value gives how many bytes starting at C<s> comprise the code point's
-representation.
+representation.  Any bytes remaining before C<e>, but beyond the ones needed to
+form the first code point in C<s>, are not examined.
 
 The largest acceptable code point is the Unicode maximum 0x10FFFF.  This
 differs from C<L</isSTRICT_UTF8_CHAR>> only in that it accepts non-character
@@ -1044,6 +1059,12 @@ which said that non-character code points are merely discouraged rather than
 completely forbidden in open interchange.  See
 L<perlunicode/Noncharacter code points>.
 
+Use C<L</isUTF8_CHAR>> to check for Perl's extended UTF-8; and
+C<L</isUTF8_CHAR_flags>> for a more customized definition.
+
+Use C<L</is_c9strict_utf8_string>>, C<L</is_c9strict_utf8_string_loc>>, and
+C<L</is_c9strict_utf8_string_loclen>> to check entire strings.
+
 =cut
 */
 
@@ -1064,7 +1085,9 @@ Evaluates to non-zero if the first few bytes of the string starting at C<s> and
 looking no further than S<C<e - 1>> are well-formed UTF-8, as extended by Perl,
 that represents some code point, subject to the restrictions given by C<flags>;
 otherwise it evaluates to 0.  If non-zero, the value gives how many bytes
-starting at C<s> comprise the code point's representation.
+starting at C<s> comprise the code point's representation.  Any bytes remaining
+before C<e>, but beyond the ones needed to form the first code point in C<s>,
+are not examined.
 
 If C<flags> is 0, this gives the same results as C<L</isUTF8_CHAR>>;
 if C<flags> is C<UTF8_DISALLOW_ILLEGAL_INTERCHANGE>, this gives the same results
@@ -1078,6 +1101,9 @@ The three alternative macros are for the most commonly needed validations; they
 are likely to run somewhat faster than this more general one, as they can be
 inlined into your code.
 
+Use L</is_utf8_string_flags>, L</is_utf8_string_loc_flags>, and
+L</is_utf8_string_loclen_flags> to check entire strings.
+
 =cut
 */
 

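For readers without a perl build tree at hand, the fixed-width-buffer pattern
used in the inline.h and pp_sys.c hunks above (validate the whole buffer, and if
that fails, accept the buffer anyway when the failure point is merely a
character cut off by the end of the buffer) can be sketched in standalone C.
This is only an illustration under stated assumptions, not the perl
implementation: every sketch_* name and both helper functions are hypothetical,
there is no flags parameter, and the validator accepts only the well-formed
sequences of standard UTF-8, whereas perl with flags of 0 accepts its own
extended UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: total length of a sequence whose start byte is b,
 * or 0 if b cannot start a well-formed standard UTF-8 sequence. */
static size_t expected_len(uint8_t b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Hypothetical helper: are bytes s[1] .. s[n-1] acceptable continuations
 * for start byte s[0]?  The second byte has a narrowed range for some start
 * bytes, which rules out overlongs, surrogates, and values above 0x10FFFF. */
static bool valid_prefix(const uint8_t *s, size_t n) {
    for (size_t i = 1; i < n; i++) {
        uint8_t lo = 0x80, hi = 0xBF;
        if (i == 1) {
            if (s[0] == 0xE0) lo = 0xA0;
            else if (s[0] == 0xED) hi = 0x9F;
            else if (s[0] == 0xF0) lo = 0x90;
            else if (s[0] == 0xF4) hi = 0x8F;
        }
        if (s[i] < lo || s[i] > hi) return false;
    }
    return true;
}

/* Stand-in for is_utf8_string_loc(): returns true if the whole buffer is
 * valid; *ep is set to the first byte that is not part of a complete,
 * valid character (s + len on success). */
static bool sketch_is_utf8_string_loc(const uint8_t *s, size_t len,
                                      const uint8_t **ep) {
    const uint8_t *p = s;
    const uint8_t *e = s + len;
    while (p < e) {
        size_t need = expected_len(*p);
        if (need == 0 || (size_t)(e - p) < need || !valid_prefix(p, need))
            break;
        p += need;
    }
    *ep = p;
    return p == e;
}

/* Stand-in for is_utf8_valid_partial_char(): true only if [s, e) is a
 * proper, so-far-valid prefix of a character that needs more bytes. */
static bool sketch_is_valid_partial_char(const uint8_t *s, const uint8_t *e) {
    size_t avail = (size_t)(e - s);
    if (avail == 0) return false;
    size_t need = expected_len(*s);
    return need != 0 && avail < need && valid_prefix(s, avail);
}

/* The pattern itself: full validation, else a truncated final character. */
static bool sketch_is_utf8_fixed_width_buf(const uint8_t *s, size_t len) {
    const uint8_t *ep;
    return sketch_is_utf8_string_loc(s, len, &ep)
        || sketch_is_valid_partial_char(ep, s + len);
}
```

With this sketch, a buffer that ends in the first two bytes of a three-byte
character is accepted, while a buffer containing a start byte followed by a
non-continuation byte in the middle is rejected, since that failure is not
caused by running out of buffer.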
--
Perl5 Master Repository
