[perl.git] branch blead, updated. v5.25.8-54-g99a765e9e3

Karl Williamson Fri, 23 Dec 2016 15:49:59 -0800

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/99a765e9e37afa8c2519ed155d6ce30fe0b6994c?hp=6cdc5cd8f36f88172b0fcefdcadec75f5b6600b2>


- Log -----------------------------------------------------------------
commit 99a765e9e37afa8c2519ed155d6ce30fe0b6994c
Author: Karl Williamson <[email protected]>
Date:   Sun Dec 11 20:53:54 2016 -0700

    utf8.c: Add flag to indicate unsure as to end of string to print
    
    When decoding a UTF-8 encoded string, we may only have guessed its
    length.  This adds a flag that tells the base-level decode routine the
    length is a guess, so that it prints minimal information rather than
    the normal full information, to minimize reading past the end of the
    string.

M       utf8.c
M       utf8.h

commit 34aeb2e92066dd41c16797e63eb0496735b5dfe4
Author: Karl Williamson <[email protected]>
Date:   Thu Dec 15 19:51:26 2016 -0700

    Deprecate isFOO_utf8() macros
    
    These macros are being replaced by a safe version; they now generate a
    deprecation message at each call site upon the first use there in each
    program run.

M       embed.fnc
M       embed.h
M       embedvar.h
M       ext/XS-APItest/t/handy.t
M       handy.h
M       intrpvar.h
M       pod/perldelta.pod
M       proto.h
M       sv.c
M       utf8.c
M       utf8.h
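
The once-per-call-site behavior described in the deprecation commit above
can be sketched in standalone C.  This is only an illustration, not perl's
mechanism: the diff further down shows perl passing a file name and line
number to warn_on_first_deprecated_use() and adding an interpreter variable
PL_seen_deprecated_macro, whereas this sketch simply gives each call site
its own static flag.  Every name below (warn_deprecated, WARN_ONCE_PER_SITE,
demo_two_sites, the message wording) is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter so the behavior can be observed; not part of perl. */
static int warnings_emitted = 0;

/* Hypothetical helper: report one deprecated use, with its location. */
static void warn_deprecated(const char *name, const char *alternative,
                            const char *file, unsigned line)
{
    assert(name && alternative && file);
    warnings_emitted++;
    fprintf(stderr, "%s is deprecated; use %s instead (%s line %u)\n",
            name, alternative, file, line);
}

/* Each expansion of this macro owns a separate static flag, so every
 * call site warns at most once per program run. */
#define WARN_ONCE_PER_SITE(name, alternative)                          \
    do {                                                               \
        static int seen_here = 0;                                      \
        if (!seen_here) {                                              \
            seen_here = 1;                                             \
            warn_deprecated(name, alternative, __FILE__, __LINE__);    \
        }                                                              \
    } while (0)

/* Runs two distinct call sites 'repeats' times each and returns the
 * total number of warnings emitted so far. */
static int demo_two_sites(int repeats)
{
    int i;
    for (i = 0; i < repeats; i++) {
        WARN_ONCE_PER_SITE("old_macro_a", "new_macro_a"); /* call site 1 */
        WARN_ONCE_PER_SITE("old_macro_b", "new_macro_b"); /* call site 2 */
    }
    return warnings_emitted;
}
```

Calling demo_two_sites() repeatedly keeps the count at two, one warning per
call site.  A static flag is not per-interpreter and not thread-safe, which
is presumably one reason a real implementation would track seen locations
differently.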

commit 24e16d7b405f10168aae144d4a2c37d9c6443b9e
Author: Karl Williamson <[email protected]>
Date:   Sun Dec 11 20:35:09 2016 -0700

    regexec.c: Make isFOO_lc() non-static
    
    This is in preparation for it to be called from outside this file.

M       embed.fnc
M       embed.h
M       proto.h
M       regexec.c

commit ddb659335ba5267366f1c691fb334983fd1b2023
Author: Karl Williamson <[email protected]>
Date:   Thu Dec 8 22:01:58 2016 -0700

    utf8.c: White space, comments only
    
    This indents code because a new block was formed around it.  It also
    makes a few other white-space changes to fit in 79 columns, removes
    an unbalanced '{' in a comment so editors that find matching pairs
    aren't fooled, and adds text to another comment.

M       utf8.c

commit d60baaa7781e81851a5ac29fea2abebde6730478
Author: Karl Williamson <[email protected]>
Date:   Sat Dec 10 18:01:39 2016 -0700

    Allow allowing UTF-8 overflow malformation
    
    perl has never allowed the UTF-8 overflow malformation, for some reason.
    But as long as overflows are turned into the REPLACEMENT CHARACTER,
    there is no real reason not to.  And making it allowable allows code
    that wants to carry on in the face of malformed input to do so, without
    risk of contaminating things, as the REPLACEMENT is the Unicode
    prescribed way of handling malformations.

M       ext/XS-APItest/t/utf8.t
M       pod/perldelta.pod
M       utf8.c
M       utf8.h

commit 9495395586e6a655057cb766ed00213037dd06c0
Author: Karl Williamson <[email protected]>
Date:   Sat Dec 10 15:26:24 2016 -0700

    Return REPLACEMENT for UTF-8 overlong malformation
    
    When perl decodes UTF-8 into a code point, it must decide what to do if
    the input is malformed in some way.  When the flags passed to the decode
    function indicate that a given malformation type is not acceptable, the
    function returns 0 to indicate failure; on success it returns the decoded
    code point (unfortunately that may require disambiguation if the
    input is validly a NUL).  As perl evolved, what happened when various
    allowed malformations were encountered got stricter and stricter.  This
    is the final malformation that was not turned into a REPLACEMENT
    CHARACTER when the malformation was allowed, and this commit changes to
    return that.  Unlike most other malformations, the code point value of
    an overlong is well-defined, and that is why it hadn't been changed
    heretofore.  But it is safer to use the Unicode prescribed behavior on
    all malformations, which is to replace them with the REPLACEMENT
    CHARACTER.  Just in case there is code that requires the old behavior,
    it is retained, but you have to search the source for the undocumented
    flag that enables it.

M       ext/XS-APItest/t/utf8.t
M       pod/perldelta.pod
M       utf8.c
M       utf8.h
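
The return convention in the overlong commit above (0 when a malformation
is not allowed, REPLACEMENT CHARACTER when it is) can be illustrated with a
much-simplified standalone decoder.  This is a sketch under stated
assumptions, not perl's decode routine: the flag name ALLOW_OVERLONG, the
functions decode_utf8() and decode_n(), and the restriction to sequences of
one to four bytes are all invented for the example, and it does not check
for surrogates or code points above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHARACTER 0xFFFDu
#define ALLOW_OVERLONG        0x1u    /* hypothetical flag name */

/* Decode one character from [s, e).  Returns the code point, or 0 for a
 * malformation that is not allowed, or U+FFFD for an allowed overlong.
 * As the commit message notes, 0 is ambiguous with a valid NUL. */
static uint32_t decode_utf8(const unsigned char *s, const unsigned char *e,
                            unsigned flags, size_t *len_out)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t cp;

    assert(s < e);                       /* empty input is a caller bug */
    if (s[0] < 0x80)                { *len_out = 1; return s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }
    else                            { *len_out = 1; return 0; }

    if ((size_t)(e - s) < len) {         /* sequence is cut short */
        *len_out = (size_t)(e - s);
        return 0;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {     /* not a continuation byte */
            *len_out = i;
            return 0;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *len_out = len;

    if (cp < min_for_len[len])           /* overlong encoding */
        return (flags & ALLOW_OVERLONG) ? REPLACEMENT_CHARACTER : 0;
    return cp;
}

/* Convenience wrapper (hypothetical) taking a byte count. */
static uint32_t decode_n(const char *bytes, size_t n, unsigned flags)
{
    const unsigned char *s = (const unsigned char *)bytes;
    size_t len;
    return decode_utf8(s, s + n, flags, &len);
}
```

For instance, the two bytes C0 80 are an overlong encoding of NUL:
decode_n("\xC0\x80", 2, 0) gives 0, while passing ALLOW_OVERLONG gives
0xFFFD instead of the well-defined but unsafe value 0.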

commit 5a48568dae7e81342fc2f8d0845423834f5c818f
Author: Karl Williamson <[email protected]>
Date:   Wed Dec 14 11:38:42 2016 -0700

    Return REPLACEMENT for UTF-8 empty malformation
    
    The previous commit no longer allows this so-called malformation under
    DEBUGGING builds, except if code explicitly changes to request it (or
    already explicitly does, but there are no instances of this in CPAN).
    
    If it is explicitly allowed, prior to this commit it returned NUL.  If
    it wasn't allowed, it returned 0.  Most code won't treat these as
    different.  When returning NUL, it basically is making nothing into
    something, which might be exploitable some way by an attacker.  The
    Unicode accepted way of dealing with malformations is to replace them
    with the REPLACEMENT CHARACTER, and so this commit changes things to
    conform to this.

M       ext/XS-APItest/t/utf8.t
M       pod/perldelta.pod
M       utf8.c

commit d1f8d421df731c77beff3db92d27dc6ec28589f2
Author: Karl Williamson <[email protected]>
Date:   Mon Dec 19 13:25:06 2016 -0700

    utf8.c: Forbid zero-length malformation under DEBUGGING

M       ext/XS-APItest/t/utf8.t
M       pod/perldelta.pod
M       utf8.c

commit 2d532c27c843a85ae0a9743642866ef4b70d1323
Author: Karl Williamson <[email protected]>
Date:   Sat Dec 10 12:51:59 2016 -0700

    utf8.h: Don't allow zero length malformation unless requested
    
    The bottom level Perl routine that decodes UTF-8 into a code point has
    long accepted inputs where the length is specified to be 0, returning a
    NUL.  It considers this a malformation, which is accepted in some
    scenarios, but not others.  In consultation with Tony Cook, we decided
    this really isn't a malformation, but is a bug in the calling program.
    Rather than call the decode routine when it has nothing to decode, it
    should just not call it.
    
    This commit removes the acceptance of a zero length string from any of
    the canned flag combinations passed to the decode function.  One can
    convert to specify this flag explicitly, if necessary.  However the next
    commit will cause this to fail under DEBUGGING builds, as a step towards
    removing the capability altogether.

M       utf8.h
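
The point of the zero-length commit above is that a caller with nothing to
decode should not call the decode routine at all.  A minimal standalone
sketch of that calling pattern follows; char_len() and count_chars() are
hypothetical names, and char_len() is a simplified stand-in for a
bottom-level routine, not perl's.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the character starting at s, clamped to the bytes
 * actually available in [s, e). */
static size_t char_len(const unsigned char *s, const unsigned char *e)
{
    size_t len;

    assert(s < e);   /* zero-length input is treated as a caller bug */
    if (s[0] < 0x80)                len = 1;
    else if ((s[0] & 0xE0) == 0xC0) len = 2;
    else if ((s[0] & 0xF0) == 0xE0) len = 3;
    else if ((s[0] & 0xF8) == 0xF0) len = 4;
    else                            len = 1;  /* stray byte: step over it */

    if (len > (size_t)(e - s))
        len = (size_t)(e - s);
    return len;
}

/* The caller guards with 's < e', so char_len() never sees empty input. */
static size_t count_chars(const char *buf, size_t nbytes)
{
    const unsigned char *s = (const unsigned char *)buf;
    const unsigned char *e = s + nbytes;
    size_t n = 0;

    while (s < e) {
        s += char_len(s, e);
        n++;
    }
    return n;
}
```

With an empty buffer the loop body never runs, so count_chars("", 0)
returns 0 without the low-level routine ever being asked to decode nothing.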

commit f180b2926a9378db829862d88921feefe2460d35
Author: Karl Williamson <[email protected]>
Date:   Sat Dec 10 12:27:19 2016 -0700

    utf8.h: Renumber flag bits
    
    This creates a gap that will be filled by future commits

M       ext/XS-APItest/t/utf8.t
M       utf8.h

commit c496ba27a3e2f56b1cc186e5c5254d4004f89ffc
Author: Karl Williamson <[email protected]>
Date:   Mon Dec 12 19:42:23 2016 -0700

    toke.c: Replace infinite loop reading input by bounded
    
    It's safer to have an upper limit on how far you look in your input.

M       toke.c

commit 23359a664a973661774a4730b281fbd03cbd01b1
Author: Karl Williamson <[email protected]>
Date:   Mon Dec 12 19:36:36 2016 -0700

    toke.c: Use fewer branches
    
    This code is true for all ASCII space characters except \n.  Rather
    than enumerating them with a branch each, use a single lookup, and then
    exclude \n.

M       toke.c
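
The "fewer branches" idea above can be shown in standalone C.  This is a
sketch, not the toke.c code: the exact set of characters the original
enumerated is assumed here to be space, \t, \r, \f and \v, and the standard
isspace() (in the default "C" locale) stands in for whatever table lookup
toke.c uses.  The function names are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* One comparison, and so potentially one branch, per character. */
static bool is_space_not_nl_branches(unsigned char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\f' || c == '\v';
}

/* A single classification lookup, then exclude newline.  The range
 * guard keeps the result ASCII-only regardless of isspace()'s answer
 * for bytes 0x80 and above. */
static bool is_space_not_nl_lookup(unsigned char c)
{
    return c < 0x80 && isspace(c) && c != '\n';
}

/* Counts the byte values on which the two forms disagree. */
static int count_disagreements(void)
{
    int c, n = 0;

    for (c = 0; c < 256; c++) {
        if (is_space_not_nl_branches((unsigned char)c)
            != is_space_not_nl_lookup((unsigned char)c))
            n++;
    }
    return n;
}
```

Under the assumptions above the two forms agree on all 256 byte values, so
count_disagreements() returns 0.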

commit e20fd8f46e1fcbfb9b705bb1723360e062a298f0
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 6 10:15:07 2016 -0700

    toke.c: Use macro instead of repeating code
    
    toke.c has a macro that does this task.  Use it.

M       toke.c

commit caae07006a52dd4f7719940be63600cf2d8a0510
Author: Karl Williamson <[email protected]>
Date:   Mon Dec 5 21:50:08 2016 -0700

    toke.c: White-space only

M       toke.c

commit fac0f7a38edc4e50a7250b738699165079b852d8
Author: Karl Williamson <[email protected]>
Date:   Tue Dec 13 18:34:12 2016 -0700

    toke.c: Convert to use isFOO_utf8_safe() macros

M       toke.c

commit 7a2070659f99247def6a6df08dea5709c01b7877
Author: Karl Williamson <[email protected]>
Date:   Wed Nov 30 09:53:17 2016 -0700

    Convert core (except toke.c) to use isFOO_utf8_safe()
    
    The previous commit added this feature; now this commit uses it in core.
    toke.c is deferred to the next commit to aid in possible future
    bisecting, because some of the changes there seem somewhat more likely
    to expose bugs.

M       gv.c
M       op.c
M       pp.c
M       pp_pack.c
M       regcomp.c
M       regexec.c

commit da8c1a98236a9f56df850c47705cb3046d6636aa
Author: Karl Williamson <[email protected]>
Date:   Thu Dec 15 16:30:27 2016 -0700

    Add isFOO_utf8_safe() macros
    
    The original API does not check that we aren't reading beyond the end of
    a buffer, apparently assuming that we could keep malformed UTF-8 out by
    use of gatekeepers, but that is currently impossible.  This commit adds
    "safe" macros for determining if a UTF-8 sequence represents
    an alphabetic, a digit, etc.  Each new macro has an extra parameter
    pointing to the end of the sequence, so that looking beyond the input
    string can be avoided.
    
    The macros aren't currently completely safe, as they don't test that
    there is at least a single valid byte in the input, except by an
    assertion in DEBUGGING builds.  This is because typically they are
    called in code that makes that assumption, and frequently tests the
    current byte for one thing or another.

M       embed.fnc
M       embed.h
M       ext/XS-APItest/APItest.xs
M       ext/XS-APItest/t/handy.t
M       handy.h
M       pod/perldelta.pod
M       proto.h
M       utf8.c
M       utf8.h
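
The extra end-pointer parameter described in the isFOO_utf8_safe() commit
above can be sketched without perl's headers.  Everything here is
illustrative: utf8_skip() is a simplified stand-in for UTF8SKIP,
is_digit_utf8_safe() recognizes only ASCII digits and the Arabic-Indic
digits U+0660..U+0669, and returning false on truncated input is this
sketch's choice, since the excerpt does not show what the real macros do in
that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of bytes the start byte says the character occupies. */
static size_t utf8_skip(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;               /* not a valid start byte */
}

/* 'e' points one past the last readable byte, so the check can refuse
 * to look beyond the buffer. */
static bool is_digit_utf8_safe(const unsigned char *p, const unsigned char *e)
{
    /* Like the real macros, this assumes at least one byte is present
     * and only asserts it. */
    assert(p < e);

    if (utf8_skip(*p) > (size_t)(e - p))
        return false;       /* character runs past the end: do not read */

    if (*p < 0x80)
        return *p >= '0' && *p <= '9';

    /* U+0660..U+0669 are encoded as D9 A0 .. D9 A9. */
    return p[0] == 0xD9 && p[1] >= 0xA0 && p[1] <= 0xA9;
}

/* Convenience wrapper (hypothetical) taking a byte count. */
static bool is_digit_n(const char *bytes, size_t n)
{
    const unsigned char *p = (const unsigned char *)bytes;
    return is_digit_utf8_safe(p, p + n);
}
```

The XS test functions in the diff below use the same idea in reverse: they
compute e = p + UTF8SKIP(p) - type to hand the safe macro a deliberately
shortened string.  In this sketch, is_digit_n("\xD9\xA3", 2) is true, while
is_digit_n("\xD9\xA3", 1) is false because the second byte is out of
bounds and is never read.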

commit 9dfb44ee59033dc1f1f858d46a05a3f3c8ce85d9
Author: Karl Williamson <[email protected]>
Date:   Sat Dec 3 12:14:33 2016 -0700

    toke.c: Avoid a conversion to/from UTF-8
    
    If the source file is encoded as UTF-8, we don't have to find its code
    point equivalent when parsing--we can just copy it unchanged.  This
    wasn't done before because of the fear the input would be malformed, and
    finding the code point had the side effect of checking for
    well-formedness.  The previous commit added well-formedness checking,
    so doing it again here would be redundant.

M       toke.c
-----------------------------------------------------------------------

Summary of changes:
 embed.fnc                 |  34 +++-
 embed.h                   |  13 +-
 embedvar.h                |   1 +
 ext/XS-APItest/APItest.xs | 448 ++++++++++++++++++++++++++++++++++++++--------
 ext/XS-APItest/t/handy.t  |  92 +++++++++-
 ext/XS-APItest/t/utf8.t   |  67 ++++---
 gv.c                      |  21 ++-
 handy.h                   | 434 ++++++++++++++++++++++++++++++--------------
 intrpvar.h                |   1 +
 op.c                      |  11 +-
 pod/perldelta.pod         |  37 ++++
 pp.c                      |   8 +-
 pp_pack.c                 |  11 +-
 proto.h                   |  37 ++--
 regcomp.c                 |   7 +-
 regexec.c                 |  71 +++++---
 sv.c                      |   1 +
 toke.c                    | 230 +++++++++++++++---------
 utf8.c                    | 315 ++++++++++++++++++++++++++------
 utf8.h                    |  80 ++++++---
 20 files changed, 1445 insertions(+), 474 deletions(-)

diff --git a/embed.fnc b/embed.fnc
index 4743524f17..561ad9f564 100644
--- a/embed.fnc
+++ b/embed.fnc
@@ -812,7 +812,13 @@ AmndP      |bool   |is_utf8_valid_partial_char                 \
 AnidR  |bool   |is_utf8_valid_partial_char_flags                           \
                |NN const U8 * const s|NN const U8 * const e|const U32 flags
 AMpR   |bool   |_is_uni_FOO|const U8 classnum|const UV c
-AMpR   |bool   |_is_utf8_FOO|const U8 classnum|NN const U8 *p
+AMpR   |bool   |_is_utf8_FOO|U8 classnum|NN const U8 * const p             \
+               |NN const char * const name                                 \
+               |NN const char * const alternative                          \
+               |const bool use_utf8|const bool use_locale                  \
+               |NN const char * const file|const unsigned line
+AMpR   |bool   |_is_utf8_FOO_with_len|const U8 classnum|NN const U8 *p     \
+               |NN const U8 * const e
 ADMpR  |bool   |is_utf8_alnum  |NN const U8 *p
 ADMpR  |bool   |is_utf8_alnumc |NN const U8 *p
 ADMpR  |bool   |is_utf8_idfirst|NN const U8 *p
@@ -821,8 +827,10 @@ AMpR       |bool   |_is_utf8_idcont|NN const U8 *p
 AMpR   |bool   |_is_utf8_idstart|NN const U8 *p
 AMpR   |bool   |_is_utf8_xidcont|NN const U8 *p
 AMpR   |bool   |_is_utf8_xidstart|NN const U8 *p
-AMpR   |bool   |_is_utf8_perl_idcont|NN const U8 *p
-AMpR   |bool   |_is_utf8_perl_idstart|NN const U8 *p
+AMpR   |bool   |_is_utf8_perl_idcont_with_len|NN const U8 *p               \
+               |NN const U8 * const e
+AMpR   |bool   |_is_utf8_perl_idstart_with_len|NN const U8 *p              \
+               |NN const U8 * const e
 ADMpR  |bool   |is_utf8_idcont |NN const U8 *p
 ADMpR  |bool   |is_utf8_xidcont        |NN const U8 *p
 ADMpR  |bool   |is_utf8_alpha  |NN const U8 *p
@@ -1715,6 +1723,12 @@ sMR      |char * |unexpected_non_continuation_text               \
                |const STRLEN non_cont_byte_pos                         \
                |const STRLEN expect_len
 sM     |char * |_byte_dump_string|NN const U8 * s|const STRLEN len
+s      |void   |warn_on_first_deprecated_use                               \
+                               |NN const char * const name                 \
+                               |NN const char * const alternative          \
+                               |const bool use_locale                      \
+                               |NN const char * const file                 \
+                               |const unsigned line
 s      |UV     |_to_utf8_case  |const UV uv1                                   \
                                 |NN const U8 *p                                 \
                                 |NN U8* ustrp                                   \
@@ -2437,8 +2451,10 @@ Es       |U8     |regtail_study  |NN RExC_state_t *pRExC_state \
 #  endif
 #endif
 
+#if defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
+EXRpM  |bool   |isFOO_lc       |const U8 classnum|const U8 character
+#endif
 #if defined(PERL_IN_REGEXEC_C)
-ERs    |bool   |isFOO_lc       |const U8 classnum|const U8 character
 ERs    |bool   |isFOO_utf8_lc  |const U8 classnum|NN const U8* character
 ERs    |SSize_t|regmatch       |NN regmatch_info *reginfo|NN char *startpos|NN regnode *prog
 WERs   |I32    |regrepeat      |NN regexp *prog|NN char **startposp \
@@ -2726,7 +2742,15 @@ sRM      |UV     |check_locale_boundary_crossing         
                    \
                |const UV result                                            \
                |NN U8* const ustrp                                         \
                |NN STRLEN *lenp
-iR     |bool   |is_utf8_common |NN const U8 *const p|NN SV **swash|NN const char * const swashname|NULLOK SV* const invlist
+iR     |bool   |is_utf8_common |NN const U8 *const p                       \
+                               |NN SV **swash                              \
+                               |NN const char * const swashname            \
+                               |NULLOK SV* const invlist
+iR     |bool   |is_utf8_common_with_len|NN const U8 *const p               \
+                                          |NN const U8 *const e            \
+                                   |NN SV **swash                          \
+                                   |NN const char * const swashname        \
+                                   |NULLOK SV* const invlist
 sR     |SV*    |swatch_get     |NN SV* swash|UV start|UV span
 sRM    |U8*    |swash_scan_list_line|NN U8* l|NN U8* const lend|NN UV* min \
                |NN UV* max|NN UV* val|const bool wants_value               \
diff --git a/embed.h b/embed.h
index 66fe0ccfc9..4687806c08 100644
--- a/embed.h
+++ b/embed.h
@@ -32,12 +32,13 @@
 #define _is_uni_FOO(a,b)       Perl__is_uni_FOO(aTHX_ a,b)
 #define _is_uni_perl_idcont(a) Perl__is_uni_perl_idcont(aTHX_ a)
 #define _is_uni_perl_idstart(a)        Perl__is_uni_perl_idstart(aTHX_ a)
-#define _is_utf8_FOO(a,b)      Perl__is_utf8_FOO(aTHX_ a,b)
+#define _is_utf8_FOO(a,b,c,d,e,f,g,h)  Perl__is_utf8_FOO(aTHX_ a,b,c,d,e,f,g,h)
+#define _is_utf8_FOO_with_len(a,b,c)   Perl__is_utf8_FOO_with_len(aTHX_ a,b,c)
 #define _is_utf8_idcont(a)     Perl__is_utf8_idcont(aTHX_ a)
 #define _is_utf8_idstart(a)    Perl__is_utf8_idstart(aTHX_ a)
 #define _is_utf8_mark(a)       Perl__is_utf8_mark(aTHX_ a)
-#define _is_utf8_perl_idcont(a)        Perl__is_utf8_perl_idcont(aTHX_ a)
-#define _is_utf8_perl_idstart(a)       Perl__is_utf8_perl_idstart(aTHX_ a)
+#define _is_utf8_perl_idcont_with_len(a,b)     Perl__is_utf8_perl_idcont_with_len(aTHX_ a,b)
+#define _is_utf8_perl_idstart_with_len(a,b)    Perl__is_utf8_perl_idstart_with_len(aTHX_ a,b)
 #define _is_utf8_xidcont(a)    Perl__is_utf8_xidcont(aTHX_ a)
 #define _is_utf8_xidstart(a)   Perl__is_utf8_xidstart(aTHX_ a)
 #define _to_uni_fold_flags(a,b,c,d)    Perl__to_uni_fold_flags(aTHX_ a,b,c,d)
@@ -1130,7 +1131,6 @@
 #define backup_one_SB(a,b,c)   S_backup_one_SB(aTHX_ a,b,c)
 #define backup_one_WB(a,b,c,d) S_backup_one_WB(aTHX_ a,b,c,d)
 #define find_byclass(a,b,c,d,e)        S_find_byclass(aTHX_ a,b,c,d,e)
-#define isFOO_lc(a,b)          S_isFOO_lc(aTHX_ a,b)
 #define isFOO_utf8_lc(a,b)     S_isFOO_utf8_lc(aTHX_ a,b)
 #define isGCB(a,b,c,d,e)       S_isGCB(aTHX_ a,b,c,d,e)
 #define isLB(a,b,c,d,e,f)      S_isLB(aTHX_ a,b,c,d,e,f)
@@ -1150,6 +1150,9 @@
 #define to_byte_substr(a)      S_to_byte_substr(aTHX_ a)
 #define to_utf8_substr(a)      S_to_utf8_substr(aTHX_ a)
 #  endif
+#  if defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
+#define isFOO_lc(a,b)          Perl_isFOO_lc(aTHX_ a,b)
+#  endif
 #  if defined(PERL_IN_UTF8_C) || defined(PERL_IN_REGCOMP_C) || defined(PERL_IN_REGEXEC_C)
 #define _to_fold_latin1(a,b,c,d)       Perl__to_fold_latin1(aTHX_ a,b,c,d)
 #  endif
@@ -1835,12 +1838,14 @@
 #define does_utf8_overflow     S_does_utf8_overflow
 #define isFF_OVERLONG          S_isFF_OVERLONG
 #define is_utf8_common(a,b,c,d)        S_is_utf8_common(aTHX_ a,b,c,d)
+#define is_utf8_common_with_len(a,b,c,d,e)     S_is_utf8_common_with_len(aTHX_ a,b,c,d,e)
 #define is_utf8_cp_above_31_bits       S_is_utf8_cp_above_31_bits
 #define is_utf8_overlong_given_start_byte_ok   S_is_utf8_overlong_given_start_byte_ok
 #define swash_scan_list_line(a,b,c,d,e,f,g)    S_swash_scan_list_line(aTHX_ a,b,c,d,e,f,g)
 #define swatch_get(a,b,c)      S_swatch_get(aTHX_ a,b,c)
 #define to_lower_latin1                S_to_lower_latin1
 #define unexpected_non_continuation_text(a,b,c,d)      S_unexpected_non_continuation_text(aTHX_ a,b,c,d)
+#define warn_on_first_deprecated_use(a,b,c,d,e)        S_warn_on_first_deprecated_use(aTHX_ a,b,c,d,e)
 #  endif
 #  if defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C)
 #define _to_upper_title_latin1(a,b,c,d)        Perl__to_upper_title_latin1(aTHX_ a,b,c,d)
diff --git a/embedvar.h b/embedvar.h
index c413932967..f1fa5ba790 100644
--- a/embedvar.h
+++ b/embedvar.h
@@ -279,6 +279,7 @@
 #define PL_scopestack_max      (vTHX->Iscopestack_max)
 #define PL_scopestack_name     (vTHX->Iscopestack_name)
 #define PL_secondgv            (vTHX->Isecondgv)
+#define PL_seen_deprecated_macro       (vTHX->Iseen_deprecated_macro)
 #define PL_sharehook           (vTHX->Isharehook)
 #define PL_sig_pending         (vTHX->Isig_pending)
 #define PL_sighandlerp         (vTHX->Isighandlerp)
diff --git a/ext/XS-APItest/APItest.xs b/ext/XS-APItest/APItest.xs
index 8b4e638484..e9d28c8d49 100644
--- a/ext/XS-APItest/APItest.xs
+++ b/ext/XS-APItest/APItest.xs
@@ -4414,16 +4414,36 @@ test_isBLANK_LC(UV ord)
         RETVAL
 
 bool
-test_isBLANK_utf8(unsigned char * p)
+test_isBLANK_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isBLANK_utf8(p);
+
+        /* In this function and those that follow, the boolean 'type'
+         * indicates if to pass a malformed UTF-8 string to the tested macro
+         * (malformed by making it too short) */
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isBLANK_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isBLANK_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isBLANK_LC_utf8(unsigned char * p)
+test_isBLANK_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isBLANK_LC_utf8(p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isBLANK_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isBLANK_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4442,9 +4462,17 @@ test_isVERTWS_uvchr(UV ord)
         RETVAL
 
 bool
-test_isVERTWS_utf8(unsigned char * p)
+test_isVERTWS_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isVERTWS_utf8(p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isVERTWS_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isVERTWS_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4498,16 +4526,32 @@ test_isUPPER_LC(UV ord)
         RETVAL
 
 bool
-test_isUPPER_utf8(unsigned char * p)
+test_isUPPER_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isUPPER_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isUPPER_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isUPPER_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isUPPER_LC_utf8(unsigned char * p)
+test_isUPPER_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isUPPER_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isUPPER_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isUPPER_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4561,16 +4605,32 @@ test_isLOWER_LC(UV ord)
         RETVAL
 
 bool
-test_isLOWER_utf8(unsigned char * p)
+test_isLOWER_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isLOWER_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isLOWER_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isLOWER_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isLOWER_LC_utf8(unsigned char * p)
+test_isLOWER_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isLOWER_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isLOWER_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isLOWER_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4624,16 +4684,32 @@ test_isALPHA_LC(UV ord)
         RETVAL
 
 bool
-test_isALPHA_utf8(unsigned char * p)
+test_isALPHA_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isALPHA_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isALPHA_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isALPHA_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isALPHA_LC_utf8(unsigned char * p)
+test_isALPHA_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isALPHA_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isALPHA_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isALPHA_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4687,16 +4763,32 @@ test_isWORDCHAR_LC(UV ord)
         RETVAL
 
 bool
-test_isWORDCHAR_utf8(unsigned char * p)
+test_isWORDCHAR_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isWORDCHAR_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isWORDCHAR_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isWORDCHAR_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isWORDCHAR_LC_utf8(unsigned char * p)
+test_isWORDCHAR_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isWORDCHAR_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isWORDCHAR_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isWORDCHAR_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4750,16 +4842,32 @@ test_isALPHANUMERIC_LC(UV ord)
         RETVAL
 
 bool
-test_isALPHANUMERIC_utf8(unsigned char * p)
+test_isALPHANUMERIC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isALPHANUMERIC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isALPHANUMERIC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isALPHANUMERIC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isALPHANUMERIC_LC_utf8(unsigned char * p)
+test_isALPHANUMERIC_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isALPHANUMERIC_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isALPHANUMERIC_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isALPHANUMERIC_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4792,16 +4900,32 @@ test_isALNUM_LC(UV ord)
         RETVAL
 
 bool
-test_isALNUM_utf8(unsigned char * p)
+test_isALNUM_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isALNUM_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isWORDCHAR_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isWORDCHAR_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isALNUM_LC_utf8(unsigned char * p)
+test_isALNUM_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isALNUM_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isWORDCHAR_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isWORDCHAR_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4827,16 +4951,32 @@ test_isDIGIT_LC_uvchr(UV ord)
         RETVAL
 
 bool
-test_isDIGIT_utf8(unsigned char * p)
+test_isDIGIT_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isDIGIT_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isDIGIT_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isDIGIT_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isDIGIT_LC_utf8(unsigned char * p)
+test_isDIGIT_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isDIGIT_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isDIGIT_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isDIGIT_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -4939,16 +5079,32 @@ test_isIDFIRST_LC(UV ord)
         RETVAL
 
 bool
-test_isIDFIRST_utf8(unsigned char * p)
+test_isIDFIRST_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isIDFIRST_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isIDFIRST_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isIDFIRST_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isIDFIRST_LC_utf8(unsigned char * p)
+test_isIDFIRST_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isIDFIRST_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isIDFIRST_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isIDFIRST_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5002,16 +5158,32 @@ test_isIDCONT_LC(UV ord)
         RETVAL
 
 bool
-test_isIDCONT_utf8(unsigned char * p)
+test_isIDCONT_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isIDCONT_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isIDCONT_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isIDCONT_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isIDCONT_LC_utf8(unsigned char * p)
+test_isIDCONT_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isIDCONT_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isIDCONT_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isIDCONT_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5065,16 +5237,32 @@ test_isSPACE_LC(UV ord)
         RETVAL
 
 bool
-test_isSPACE_utf8(unsigned char * p)
+test_isSPACE_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isSPACE_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isSPACE_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isSPACE_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isSPACE_LC_utf8(unsigned char * p)
+test_isSPACE_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isSPACE_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isSPACE_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isSPACE_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5128,16 +5316,32 @@ test_isASCII_LC(UV ord)
         RETVAL
 
 bool
-test_isASCII_utf8(unsigned char * p)
+test_isASCII_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isASCII_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isASCII_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isASCII_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isASCII_LC_utf8(unsigned char * p)
+test_isASCII_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isASCII_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isASCII_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isASCII_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5191,16 +5395,32 @@ test_isCNTRL_LC(UV ord)
         RETVAL
 
 bool
-test_isCNTRL_utf8(unsigned char * p)
+test_isCNTRL_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isCNTRL_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isCNTRL_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isCNTRL_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isCNTRL_LC_utf8(unsigned char * p)
+test_isCNTRL_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isCNTRL_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isCNTRL_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isCNTRL_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5254,16 +5474,32 @@ test_isPRINT_LC(UV ord)
         RETVAL
 
 bool
-test_isPRINT_utf8(unsigned char * p)
+test_isPRINT_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isPRINT_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isPRINT_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isPRINT_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isPRINT_LC_utf8(unsigned char * p)
+test_isPRINT_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isPRINT_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isPRINT_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isPRINT_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5317,16 +5553,32 @@ test_isGRAPH_LC(UV ord)
         RETVAL
 
 bool
-test_isGRAPH_utf8(unsigned char * p)
+test_isGRAPH_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isGRAPH_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isGRAPH_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isGRAPH_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isGRAPH_LC_utf8(unsigned char * p)
+test_isGRAPH_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isGRAPH_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isGRAPH_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isGRAPH_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5380,16 +5632,32 @@ test_isPUNCT_LC(UV ord)
         RETVAL
 
 bool
-test_isPUNCT_utf8(unsigned char * p)
+test_isPUNCT_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isPUNCT_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isPUNCT_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isPUNCT_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isPUNCT_LC_utf8(unsigned char * p)
+test_isPUNCT_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isPUNCT_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isPUNCT_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isPUNCT_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5443,16 +5711,32 @@ test_isXDIGIT_LC(UV ord)
         RETVAL
 
 bool
-test_isXDIGIT_utf8(unsigned char * p)
+test_isXDIGIT_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isXDIGIT_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isXDIGIT_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isXDIGIT_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isXDIGIT_LC_utf8(unsigned char * p)
+test_isXDIGIT_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isXDIGIT_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isXDIGIT_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isXDIGIT_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
@@ -5506,16 +5790,32 @@ test_isPSXSPC_LC(UV ord)
         RETVAL
 
 bool
-test_isPSXSPC_utf8(unsigned char * p)
+test_isPSXSPC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isPSXSPC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isPSXSPC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isPSXSPC_utf8(p);
+        }
     OUTPUT:
         RETVAL
 
 bool
-test_isPSXSPC_LC_utf8(unsigned char * p)
+test_isPSXSPC_LC_utf8(unsigned char * p, int type)
+    PREINIT:
+       const unsigned char * e;
     CODE:
-        RETVAL = isPSXSPC_LC_utf8( p);
+        if (type >= 0) {
+            e = p + UTF8SKIP(p) - type;
+            RETVAL = isPSXSPC_LC_utf8_safe(p, e);
+        }
+        else {
+            RETVAL = isPSXSPC_LC_utf8(p);
+        }
     OUTPUT:
         RETVAL
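
The XS wrappers above all compute `e = p + UTF8SKIP(p) - type`, so `type == 0` passes the true end of the character and `type == 1` passes an end that is one byte short, which is how the tests provoke a "short" malformation. The following is a minimal standalone sketch of that bounds check, written outside of Perl. The helpers `utf8_len_from_start` and `fits_before_end` are hypothetical names introduced for illustration, not Perl API, and the length table assumes standard 1-4 byte UTF-8 rather than Perl's extended form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expected sequence length implied by the start byte.
 * Standard UTF-8 only; returns 0 for a byte that cannot start a character. */
static size_t utf8_len_from_start(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Hypothetical helper: true if the whole character starting at p ends
 * at or before e.  The caller must guarantee p < e. */
static int fits_before_end(const unsigned char *p, const unsigned char *e)
{
    size_t len;

    assert(p < e);
    len = utf8_len_from_start(*p);
    return len != 0 && (size_t)(e - p) >= len;
}
```

Under these assumptions, a two-byte character passes with its true end and fails when the end is pulled in by one byte. A single-byte character cannot be shortened this way without violating `p < e`, which matches the test file skipping the malformed case for single bytes.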
 
diff --git a/ext/XS-APItest/t/handy.t b/ext/XS-APItest/t/handy.t
index b08e8146d3..81e4c7c75b 100644
--- a/ext/XS-APItest/t/handy.t
+++ b/ext/XS-APItest/t/handy.t
@@ -104,6 +104,31 @@ sub get_display_locale_or_skip($$) {
     return (" ($locale)", 1);
 }
 
+sub try_malforming($$$)
+{
+    # Determines if the tests for malformed UTF-8 should be done.  When done,
+    # the .xs code creates malformations by pretending the length is shorter
+    # than it actually is.  Some things can't be malformed, and sometimes this
+    # test knows that the current code doesn't look for a malformation under
+    # various circumstances.
+
+    my ($i, $function, $using_locale) = @_;
+
+    # Single bytes can't be malformed
+    return 0 if $i < ((ord "A" == 65) ? 128 : 160);
+
+    # ASCII doesn't need to ever look beyond the first byte.
+    return 0 if $function eq "ASCII";
+
+    # No controls above 255, so the code doesn't look at those
+    return 0 if $i > 255 && $function eq "CNTRL";
+
+    # No non-ASCII digits below 256, except if using locales.
+    return 0 if $i < 256 && ! $using_locale && $function =~ /X?DIGIT/;
+
+    return 1;
+}
+
 my %properties = (
                    # name => Lookup-property name
                    alnum => 'Word',
@@ -128,9 +153,15 @@ my %properties = (
                    xdigit => 'XDigit',
                 );
 
+my %seen;
 my @warnings;
 local $SIG{__WARN__} = sub { push @warnings, @_ };
 
+my %utf8_param_code = (
+                        "_safe"                 =>  0,
+                        "_safe, malformed"      =>  1,
+                        "deprecated unsafe"     => -1,
+                      );
 
 foreach my $name (sort keys %properties, 'octal') {
     my @invlist;
@@ -282,13 +313,66 @@ foreach my $name (sort keys %properties, 'octal') {
                         $truth = $matches;
                     }
 
-                        my $display_call = "is${function}$suffix("
-                                         . " $display_name )$display_locale";
-                        $ret = truth eval "test_is${function}$suffix('$char')";
-                        if (is ($@, "", "$display_call didn't give error")) {
+                    foreach my $utf8_param("_safe",
+                                           "_safe, malformed",
+                                           "deprecated unsafe"
+                                          )
+                    {
+                        my $utf8_param_code = $utf8_param_code{$utf8_param};
+                        my $expect_error = $utf8_param_code > 0;
+                        next if      $expect_error
+                                && ! try_malforming($i, $function,
+                                                    $suffix =~ /LC/);
+
+                        my $display_call = "is${function}$suffix( $display_name"
+                                         . ", $utf8_param )$display_locale";
+                        $ret = truth eval "test_is${function}$suffix('$char',"
+                                        . " $utf8_param_code)";
+                        if ($expect_error) {
+                            isnt ($@, "",
+                                    "expected and got error in $display_call");
+                            like($@, qr/Malformed UTF-8 character/,
+                                "${tab}And got expected message");
+                            if (is (@warnings, 1,
+                                           "${tab}Got a single warning besides"))
+                            {
+                                like($warnings[0],
+                                     qr/Malformed UTF-8 character.*short/,
+                                     "${tab}Got expected warning");
+                            }
+                            else {
+                                diag("@warnings");
+                            }
+                            undef @warnings;
+                        }
+                        elsif (is ($@, "", "$display_call didn't give error")) {
                             is ($ret, $truth,
                                 "${tab}And correctly returned $truth");
+                            if ($utf8_param_code < 0) {
+                                my $warnings_ok;
+                                my $unique_function = "is" . $function . $suffix;
+                                if (! $seen{$unique_function}++) {
+                                    $warnings_ok = is(@warnings, 1,
+                                        "${tab}This is first call to"
+                                      . " $unique_function; Got a single"
+                                      . " warning");
+                                    if ($warnings_ok) {
+                                        $warnings_ok = like($warnings[0],
+                qr/starting in Perl .* will require an additional parameter/,
+                                            "${tab}The warning was the expected"
+                                          . " deprecation one");
+                                    }
+                                }
+                                else {
+                                    $warnings_ok = is(@warnings, 0,
+                                        "${tab}This subsequent call to"
+                                      . " $unique_function did not warn");
+                                }
+                                $warnings_ok or diag("@warnings");
+                                undef @warnings;
+                            }
                         }
+                    }
                 }
             }
         }
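
The test above expects exactly one deprecation warning on the first call to each deprecated function and none on later calls. The commit message describes the real behavior as a message "at each call site upon the first use there". One way such warn-once-per-call-site behavior can be obtained in plain C is a static flag inside the macro expansion. This is only a sketch of the idea; it is not claimed to be how Perl implements it, and `IS_SPACE_DEPRECATED`, `classify_with_deprecation`, and `warnings_emitted` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>

/* Stands in for the warning channel: counts warnings instead of printing. */
static int warnings_emitted = 0;

/* Hypothetical helper: warns only if this call site has not warned yet,
 * then classifies the byte with the C library's isspace(). */
static int classify_with_deprecation(int *warned_here, unsigned char c)
{
    if (!*warned_here) {
        *warned_here = 1;
        warnings_emitted++;
    }
    return isspace(c) != 0;
}

/* Each expansion of the macro gets its own static flag, so each place in
 * the source that uses it warns at most once per program run. */
#define IS_SPACE_DEPRECATED(c, out)                                 \
    do {                                                            \
        static int warned_here_ = 0;                                \
        *(out) = classify_with_deprecation(&warned_here_, (c));     \
    } while (0)

/* Two distinct call sites. */
static int call_site_a(unsigned char c)
{
    int r;
    IS_SPACE_DEPRECATED(c, &r);
    return r;
}

static int call_site_b(unsigned char c)
{
    int r;
    IS_SPACE_DEPRECATED(c, &r);
    return r;
}
```

Calling `call_site_a` twice raises the counter once, and the first call through `call_site_b` raises it again, which is the same shape as the first-call and subsequent-call checks in the test.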
diff --git a/ext/XS-APItest/t/utf8.t b/ext/XS-APItest/t/utf8.t
index 05693c05a4..c7f2c1d65f 100644
--- a/ext/XS-APItest/t/utf8.t
+++ b/ext/XS-APItest/t/utf8.t
@@ -98,21 +98,23 @@ my $UTF8_GOT_NON_CONTINUATION   = $UTF8_ALLOW_NON_CONTINUATION;
 my $UTF8_ALLOW_SHORT            = 0x0008;
 my $UTF8_GOT_SHORT              = $UTF8_ALLOW_SHORT;
 my $UTF8_ALLOW_LONG             = 0x0010;
+my $UTF8_ALLOW_LONG_AND_ITS_VALUE = $UTF8_ALLOW_LONG|0x0020;
 my $UTF8_GOT_LONG               = $UTF8_ALLOW_LONG;
-my $UTF8_GOT_OVERFLOW           = 0x0020;
-my $UTF8_DISALLOW_SURROGATE     = 0x0040;
+my $UTF8_ALLOW_OVERFLOW         = 0x0080;
+my $UTF8_GOT_OVERFLOW           = $UTF8_ALLOW_OVERFLOW;
+my $UTF8_DISALLOW_SURROGATE     = 0x0100;
 my $UTF8_GOT_SURROGATE          = $UTF8_DISALLOW_SURROGATE;
-my $UTF8_WARN_SURROGATE         = 0x0080;
-my $UTF8_DISALLOW_NONCHAR       = 0x0100;
+my $UTF8_WARN_SURROGATE         = 0x0200;
+my $UTF8_DISALLOW_NONCHAR       = 0x0400;
 my $UTF8_GOT_NONCHAR            = $UTF8_DISALLOW_NONCHAR;
-my $UTF8_WARN_NONCHAR           = 0x0200;
-my $UTF8_DISALLOW_SUPER         = 0x0400;
+my $UTF8_WARN_NONCHAR           = 0x0800;
+my $UTF8_DISALLOW_SUPER         = 0x1000;
 my $UTF8_GOT_SUPER              = $UTF8_DISALLOW_SUPER;
-my $UTF8_WARN_SUPER             = 0x0800;
-my $UTF8_DISALLOW_ABOVE_31_BIT  = 0x1000;
+my $UTF8_WARN_SUPER             = 0x2000;
+my $UTF8_DISALLOW_ABOVE_31_BIT  = 0x4000;
 my $UTF8_GOT_ABOVE_31_BIT       = $UTF8_DISALLOW_ABOVE_31_BIT;
-my $UTF8_WARN_ABOVE_31_BIT      = 0x2000;
-my $UTF8_CHECK_ONLY             = 0x4000;
+my $UTF8_WARN_ABOVE_31_BIT      = 0x8000;
+my $UTF8_CHECK_ONLY             = 0x10000;
 my $UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE
                              = $UTF8_DISALLOW_SUPER|$UTF8_DISALLOW_SURROGATE;
 my $UTF8_DISALLOW_ILLEGAL_INTERCHANGE
@@ -1199,10 +1201,12 @@ my $REPLACEMENT = 0xFFFD;
 my @malformations = (
     # ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
     #  $allowed_uv, $expected_len, $needed_to_discern_len, $message )
-    [ "zero length string malformation", "", 0,
-        $UTF8_ALLOW_EMPTY, $UTF8_GOT_EMPTY, 0, 0, 0,
-        qr/empty string/
-    ],
+
+# Now considered a program bug, and asserted against
+    #[ "zero length string malformation", "", 0,
+    #    $UTF8_ALLOW_EMPTY, $UTF8_GOT_EMPTY, $REPLACEMENT, 0, 0,
+    #    qr/empty string/
+    #],
     [ "orphan continuation byte malformation", I8_to_native("${I8c}a"), 2,
         $UTF8_ALLOW_CONTINUATION, $UTF8_GOT_CONTINUATION, $REPLACEMENT,
         1, 1,
@@ -1344,8 +1348,7 @@ if (isASCII && ! $is64bit) {    # 32-bit ASCII platform
         [ "overflow malformation",
             "\xfe\x84\x80\x80\x80\x80\x80",  # Represents 2**32
             7,
-            0,  # There is no way to allow this malformation
-            $UTF8_GOT_OVERFLOW,
+            $UTF8_ALLOW_OVERFLOW, $UTF8_GOT_OVERFLOW,
             $REPLACEMENT,
             7, 2,
             qr/overflows/
@@ -1353,8 +1356,7 @@ if (isASCII && ! $is64bit) {    # 32-bit ASCII platform
         [ "overflow malformation",
             "\xff\x80\x80\x80\x80\x80\x81\x80\x80\x80\x80\x80\x80",
             $max_bytes,
-            0,  # There is no way to allow this malformation
-            $UTF8_GOT_OVERFLOW,
+            $UTF8_ALLOW_OVERFLOW, $UTF8_GOT_OVERFLOW,
             $REPLACEMENT,
             $max_bytes, 1,
             qr/overflows/
@@ -1396,8 +1398,7 @@ else { # 64-bit ASCII, or EBCDIC of any size.
             I8_to_native(
                     "\xff\xa0\xa0\xa0\xa0\xa0\xa0\xa4\xa0\xa0\xa0\xa0\xa0\xa0"),
             $max_bytes,
-            0,  # There is no way to allow this malformation
-            $UTF8_GOT_OVERFLOW,
+            $UTF8_ALLOW_OVERFLOW, $UTF8_GOT_OVERFLOW,
             $REPLACEMENT,
             $max_bytes, 8,
             qr/overflows/
@@ -1411,8 +1412,7 @@ else { # 64-bit ASCII, or EBCDIC of any size.
                 : I8_to_native(
                     "\xff\xb0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0"),
                 $max_bytes,
-                0,  # There is no way to allow this malformation
-                $UTF8_GOT_OVERFLOW,
+                $UTF8_ALLOW_OVERFLOW, $UTF8_GOT_OVERFLOW,
                 $REPLACEMENT,
                 $max_bytes, (isASCII) ? 3 : 2,
                 qr/overflows/
@@ -1420,6 +1420,29 @@ else { # 64-bit ASCII, or EBCDIC of any size.
     }
 }
 
+# For each overlong malformation in the list, we modify it, so that there are
+# two tests.  The first one returns the replacement character given the input
+# flags, and the second test adds a flag that causes the actual code point the
+# malformation represents to be returned.
+my @added_overlongs;
+foreach my $test (@malformations) {
+    my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
+        $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = @$test;
+    next unless $testname =~ /overlong/;
+
+    $test->[0] .= "; use REPLACEMENT CHAR";
+    $test->[5] = $REPLACEMENT;
+
+    push @added_overlongs,
+        [ $testname . "; use actual value",
+          $bytes, $length,
+          $allow_flags | $UTF8_ALLOW_LONG_AND_ITS_VALUE,
+          $expected_error_flags, $allowed_uv, $expected_len,
+          $needed_to_discern_len, $message
+        ];
+}
+push @malformations, @added_overlongs;
+
 foreach my $test (@malformations) {
     my ($testname, $bytes, $length, $allow_flags, $expected_error_flags,
         $allowed_uv, $expected_len, $needed_to_discern_len, $message ) = @$test;
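
The flag table in this test file defines `$UTF8_ALLOW_LONG_AND_ITS_VALUE` as `$UTF8_ALLOW_LONG|0x0020`, so the new flag implies the old one. A small sketch of what that composition means for flag checks follows, using the numeric values from the test file. The helper names `allows_long` and `wants_actual_value` are hypothetical and are not taken from utf8.c.

```c
#include <assert.h>

/* Values copied from the constants in the test file above. */
#define UTF8_ALLOW_LONG                0x0010u
#define UTF8_ALLOW_LONG_AND_ITS_VALUE  (UTF8_ALLOW_LONG | 0x0020u)
#define UTF8_ALLOW_OVERFLOW            0x0080u

/* Hypothetical helper: is an overlong sequence tolerated at all? */
static int allows_long(unsigned flags)
{
    return (flags & UTF8_ALLOW_LONG) != 0;
}

/* Hypothetical helper: should the code point the overlong represents be
 * returned instead of the REPLACEMENT CHARACTER?  Both bits must be set. */
static int wants_actual_value(unsigned flags)
{
    return (flags & UTF8_ALLOW_LONG_AND_ITS_VALUE)
           == UTF8_ALLOW_LONG_AND_ITS_VALUE;
}
```

Because the composite value contains the `UTF8_ALLOW_LONG` bit, code that only tests for "overlongs allowed" keeps working when a caller passes the stronger flag. That is consistent with the loop above OR-ing it into each overlong test's existing `$allow_flags`.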
diff --git a/gv.c b/gv.c
index 775951b75a..2570cf0657 100644
--- a/gv.c
+++ b/gv.c
@@ -1591,7 +1591,10 @@ S_parse_gv_stash_name(pTHX_ HV **stash, GV **gv, const char **name,
 
     PERL_ARGS_ASSERT_PARSE_GV_STASH_NAME;
     
-    if (full_len > 2 && **name == '*' && isIDFIRST_lazy_if(*name + 1, is_utf8)) {
+    if (   full_len > 2
+        && **name == '*'
+        && isIDFIRST_lazy_if_safe(*name + 1, name_end, is_utf8))
+    {
         /* accidental stringify on a GV? */
         (*name)++;
     }
@@ -1676,7 +1679,7 @@ S_gv_is_in_main(pTHX_ const char *name, STRLEN len, const U32 is_utf8)
     PERL_ARGS_ASSERT_GV_IS_IN_MAIN;
     
     /* If it's an alphanumeric variable */
-    if ( len && isIDFIRST_lazy_if(name, is_utf8) ) {
+    if ( len && isIDFIRST_lazy_if_safe(name, name + len, is_utf8) ) {
         /* Some "normal" variables are always in main::,
          * like INC or STDOUT.
          */
@@ -2400,8 +2403,12 @@ Perl_gv_fetchpvn_flags(pTHX_ const char *nambeg, STRLEN full_len, I32 flags,
                 UTF8fARG(is_utf8, name_end-nambeg, nambeg));
     gv_init_pvn(gv, stash, name, len, (add & GV_ADDMULTI)|is_utf8);
 
-    if ( isIDFIRST_lazy_if(name, is_utf8) && !ckWARN(WARN_ONCE) )
+    if (   full_len != 0
+        && isIDFIRST_lazy_if_safe(name, name + full_len, is_utf8)
+        && !ckWARN(WARN_ONCE) )
+    {
         GvMULTI_on(gv) ;
+    }
 
     /* set up magic where warranted */
     if ( gv_magicalize(gv, stash, name, len, sv_type) ) {
@@ -2492,8 +2499,12 @@ Perl_gv_check(pTHX_ HV *stash)
                 )
                     gv_check(hv);              /* nested package */
            }
-            else if ( *HeKEY(entry) != '_'
-                        && isIDFIRST_lazy_if(HeKEY(entry), HeUTF8(entry)) ) {
+            else if (   HeKLEN(entry) != 0
+                     && *HeKEY(entry) != '_'
+                     && isIDFIRST_lazy_if_safe(HeKEY(entry),
+                                               HeKEY(entry) + HeKLEN(entry),
+                                               HeUTF8(entry)) )
+            {
                 const char *file;
                gv = MUTABLE_GV(HeVAL(entry));
                if (SvTYPE(gv) != SVt_PVGV || GvMULTI(gv))
diff --git a/handy.h b/handy.h
index 848050f333..98ae51dd7b 100644
--- a/handy.h
+++ b/handy.h
@@ -565,10 +565,31 @@ to determine if it is in the character class.  For example,
 C<isWORDCHAR_uvchr(0x100)> returns TRUE, since 0x100 is LATIN CAPITAL LETTER A
 WITH MACRON in Unicode, and is a word character.
 
-Variant C<isFOO_utf8> is like C<isFOO_uvchr>, but the input is a pointer to a
-(known to be well-formed) UTF-8 encoded string (C<U8*> or C<char*>, and
-possibly containing embedded C<NUL> characters).  The classification of just
-the first (possibly multi-byte) character in the string is tested.
+Variant C<isFOO_utf8_safe> is like C<isFOO_uvchr>, but is used for UTF-8
+encoded strings.  Each call classifies one character, even if the string
+contains many.  This variant takes two parameters.  The first, C<p>, is a
+pointer to the first byte of the character to be classified.  (Recall that it
+may take more than one byte to represent a character in UTF-8 strings.)  The
+second parameter, C<e>, points to anywhere in the string beyond the first
+character, up to one byte past the end of the entire string.  The suffix
+C<_safe> in the function's name indicates that it will not attempt to read
+beyond S<C<e - 1>>, provided that the constraint S<C<s E<lt> e>> is true (this
+is asserted for in C<-DDEBUGGING> builds).  If the UTF-8 for the input
+character is malformed in some way, the program may croak, or the function may
+return FALSE, at the discretion of the implementation, and subject to change in
+future releases.
+
+Variant C<isFOO_utf8> is like C<isFOO_utf8_safe>, but takes just a single
+parameter, C<p>, which has the same meaning as the corresponding parameter does
+in C<isFOO_utf8_safe>.  The function therefore can't check if it is reading
+beyond the end of the string.  Starting in Perl v5.30, it will take a second
+parameter, becoming a synonym for C<isFOO_utf8_safe>.  At that time every
+program that uses it will have to be changed to successfully compile.  In the
+meantime, the first runtime call to C<isFOO_utf8> from each call point in the
+program will raise a deprecation warning, enabled by default.  You can convert
+your program now to use C<isFOO_utf8_safe>, and avoid the warnings, and get an
+extra measure of protection, or you can wait until v5.30, when you'll be forced
+to add the C<e> parameter.
 
 Variant C<isFOO_LC> is like the C<isFOO_A> and C<isFOO_L1> variants, but the
 result is based on the current locale, which is what C<LC> in the name stands
@@ -584,18 +605,39 @@ Variant C<isFOO_LC_uvchr> is like C<isFOO_LC>, but is defined on any UV.  It
 returns the same as C<isFOO_LC> for input code points less than 256, and
 returns the hard-coded, not-affected-by-locale, Unicode results for larger ones.
 
-Variant C<isFOO_LC_utf8> is like C<isFOO_LC_uvchr>, but the input is a pointer
-to a (known to be well-formed) UTF-8 encoded string (C<U8*> or C<char*>, and
-possibly containing embedded C<NUL> characters).  The classification of just
-the first (possibly multi-byte) character in the string is tested.
+Variant C<isFOO_LC_utf8_safe> is like C<isFOO_LC_uvchr>, but is used for UTF-8
+encoded strings.  Each call classifies one character, even if the string
+contains many.  This variant takes two parameters.  The first, C<p>, is a
+pointer to the first byte of the character to be classified.  (Recall that it
+may take more than one byte to represent a character in UTF-8 strings.) The
+second parameter, C<e>, points to anywhere in the string beyond the first
+character, up to one byte past the end of the entire string.  The suffix
+C<_safe> in the function's name indicates that it will not attempt to read
+beyond S<C<e - 1>>, provided that the constraint S<C<s E<lt> e>> is true (this
+is asserted for in C<-DDEBUGGING> builds).  If the UTF-8 for the input
+character is malformed in some way, the program may croak, or the function may
+return FALSE, at the discretion of the implementation, and subject to change in
+future releases.
+
+Variant C<isFOO_LC_utf8> is like C<isFOO_LC_utf8_safe>, but takes just a single
+parameter, C<p>, which has the same meaning as the corresponding parameter does
+in C<isFOO_LC_utf8_safe>.  The function therefore can't check if it is reading
+beyond the end of the string.  Starting in Perl v5.30, it will take a second
+parameter, becoming a synonym for C<isFOO_LC_utf8_safe>.  At that time every
+program that uses it will have to be changed to successfully compile.  In the
+meantime, the first runtime call to C<isFOO_LC_utf8> from each call point in
+the program will raise a deprecation warning, enabled by default.  You can
+convert your program now to use C<isFOO_LC_utf8_safe>, and avoid the warnings,
+and get an extra measure of protection, or you can wait until v5.30, when
+you'll be forced to add the C<e> parameter.
 
 =for apidoc Am|bool|isALPHA|char ch
 Returns a boolean indicating whether the specified character is an
 alphabetic character, analogous to C<m/[[:alpha:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isALPHA_A>, C<isALPHA_L1>, C<isALPHA_uvchr>, C<isALPHA_utf8>, C<isALPHA_LC>,
-C<isALPHA_LC_uvchr>, and C<isALPHA_LC_utf8>.
+C<isALPHA_A>, C<isALPHA_L1>, C<isALPHA_uvchr>, C<isALPHA_utf8_safe>,
+C<isALPHA_LC>, C<isALPHA_LC_uvchr>, and C<isALPHA_LC_utf8_safe>.
 
 =for apidoc Am|bool|isALPHANUMERIC|char ch
 Returns a boolean indicating whether the specified character is a either an
@@ -603,8 +645,8 @@ alphabetic character or decimal digit, analogous to C<m/[[:alnum:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
 C<isALPHANUMERIC_A>, C<isALPHANUMERIC_L1>, C<isALPHANUMERIC_uvchr>,
-C<isALPHANUMERIC_utf8>, C<isALPHANUMERIC_LC>, C<isALPHANUMERIC_LC_uvchr>, and
-C<isALPHANUMERIC_LC_utf8>.
+C<isALPHANUMERIC_utf8_safe>, C<isALPHANUMERIC_LC>, C<isALPHANUMERIC_LC_uvchr>,
+and C<isALPHANUMERIC_LC_utf8_safe>.
 
 =for apidoc Am|bool|isASCII|char ch
 Returns a boolean indicating whether the specified character is one of the 128
@@ -614,36 +656,36 @@ character corresponds to an ASCII character.  Variants C<isASCII_A()> and
 C<isASCII_L1()> are identical to C<isASCII()>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isASCII_uvchr>, C<isASCII_utf8>, C<isASCII_LC>, C<isASCII_LC_uvchr>, and
-C<isASCII_LC_utf8>.  Note, however, that some platforms do not have the C
+C<isASCII_uvchr>, C<isASCII_utf8_safe>, C<isASCII_LC>, C<isASCII_LC_uvchr>, and
+C<isASCII_LC_utf8_safe>.  Note, however, that some platforms do not have the C
 library routine C<isascii()>.  In these cases, the variants whose names contain
 C<LC> are the same as the corresponding ones without.
 
 Also note, that because all ASCII characters are UTF-8 invariant (meaning they
 have the exact same representation (always a single byte) whether encoded in
 UTF-8 or not), C<isASCII> will give the correct results when called with any
-byte in any string encoded or not in UTF-8.  And similarly C<isASCII_utf8> will
-work properly on any string encoded or not in UTF-8.
+byte in any string encoded or not in UTF-8.  And similarly C<isASCII_utf8_safe>
+will work properly on any string encoded or not in UTF-8.
 
 =for apidoc Am|bool|isBLANK|char ch
 Returns a boolean indicating whether the specified character is a
 character considered to be a blank, analogous to C<m/[[:blank:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isBLANK_A>, C<isBLANK_L1>, C<isBLANK_uvchr>, C<isBLANK_utf8>, C<isBLANK_LC>,
-C<isBLANK_LC_uvchr>, and C<isBLANK_LC_utf8>.  Note, however, that some
-platforms do not have the C library routine C<isblank()>.  In these cases, the
-variants whose names contain C<LC> are the same as the corresponding ones
-without.
+C<isBLANK_A>, C<isBLANK_L1>, C<isBLANK_uvchr>, C<isBLANK_utf8_safe>,
+C<isBLANK_LC>, C<isBLANK_LC_uvchr>, and C<isBLANK_LC_utf8_safe>.  Note,
+however, that some platforms do not have the C library routine
+C<isblank()>.  In these cases, the variants whose names contain C<LC> are
+the same as the corresponding ones without.
 
 =for apidoc Am|bool|isCNTRL|char ch
 Returns a boolean indicating whether the specified character is a
 control character, analogous to C<m/[[:cntrl:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isCNTRL_A>, C<isCNTRL_L1>, C<isCNTRL_uvchr>, C<isCNTRL_utf8>, C<isCNTRL_LC>,
-C<isCNTRL_LC_uvchr>, and C<isCNTRL_LC_utf8>
-On EBCDIC platforms, you almost always want to use the C<isCNTRL_L1> variant.
+C<isCNTRL_A>, C<isCNTRL_L1>, C<isCNTRL_uvchr>, C<isCNTRL_utf8_safe>,
+C<isCNTRL_LC>, C<isCNTRL_LC_uvchr>, and C<isCNTRL_LC_utf8_safe> On EBCDIC
+platforms, you almost always want to use the C<isCNTRL_L1> variant.
 
 =for apidoc Am|bool|isDIGIT|char ch
 Returns a boolean indicating whether the specified character is a
@@ -651,24 +693,23 @@ digit, analogous to C<m/[[:digit:]]/>.
 Variants C<isDIGIT_A> and C<isDIGIT_L1> are identical to C<isDIGIT>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isDIGIT_uvchr>, C<isDIGIT_utf8>, C<isDIGIT_LC>, C<isDIGIT_LC_uvchr>, and
-C<isDIGIT_LC_utf8>.
+C<isDIGIT_uvchr>, C<isDIGIT_utf8_safe>, C<isDIGIT_LC>, C<isDIGIT_LC_uvchr>, and
+C<isDIGIT_LC_utf8_safe>.
 
 =for apidoc Am|bool|isGRAPH|char ch
 Returns a boolean indicating whether the specified character is a
 graphic character, analogous to C<m/[[:graph:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
-variants
-C<isGRAPH_A>, C<isGRAPH_L1>, C<isGRAPH_uvchr>, C<isGRAPH_utf8>, C<isGRAPH_LC>,
-C<isGRAPH_LC_uvchr>, and C<isGRAPH_LC_utf8>.
+variants C<isGRAPH_A>, C<isGRAPH_L1>, C<isGRAPH_uvchr>, C<isGRAPH_utf8_safe>,
+C<isGRAPH_LC>, C<isGRAPH_LC_uvchr>, and C<isGRAPH_LC_utf8_safe>.
 
 =for apidoc Am|bool|isLOWER|char ch
 Returns a boolean indicating whether the specified character is a
 lowercase character, analogous to C<m/[[:lower:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isLOWER_A>, C<isLOWER_L1>, C<isLOWER_uvchr>, C<isLOWER_utf8>, C<isLOWER_LC>,
-C<isLOWER_LC_uvchr>, and C<isLOWER_LC_utf8>.
+C<isLOWER_A>, C<isLOWER_L1>, C<isLOWER_uvchr>, C<isLOWER_utf8_safe>,
+C<isLOWER_LC>, C<isLOWER_LC_uvchr>, and C<isLOWER_LC_utf8_safe>.
 
 =for apidoc Am|bool|isOCTAL|char ch
 Returns a boolean indicating whether the specified character is an
@@ -683,9 +724,8 @@ Note that the definition of what is punctuation isn't as
 straightforward as one might desire.  See L<perlrecharclass/POSIX Character
 Classes> for details.
 See the L<top of this section|/Character classification> for an explanation of
-variants
-C<isPUNCT_A>, C<isPUNCT_L1>, C<isPUNCT_uvchr>, C<isPUNCT_utf8>, C<isPUNCT_LC>,
-C<isPUNCT_LC_uvchr>, and C<isPUNCT_LC_utf8>.
+variants C<isPUNCT_A>, C<isPUNCT_L1>, C<isPUNCT_uvchr>, C<isPUNCT_utf8_safe>,
+C<isPUNCT_LC>, C<isPUNCT_LC_uvchr>, and C<isPUNCT_LC_utf8_safe>.
 
 =for apidoc Am|bool|isSPACE|char ch
 Returns a boolean indicating whether the specified character is a
@@ -698,8 +738,8 @@ in the non-locale variants, was that C<isSPACE()> did not match a vertical tab.
 (See L</isPSXSPC> for a macro that matches a vertical tab in all releases.)
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isSPACE_A>, C<isSPACE_L1>, C<isSPACE_uvchr>, C<isSPACE_utf8>, C<isSPACE_LC>,
-C<isSPACE_LC_uvchr>, and C<isSPACE_LC_utf8>.
+C<isSPACE_A>, C<isSPACE_L1>, C<isSPACE_uvchr>, C<isSPACE_utf8_safe>,
+C<isSPACE_LC>, C<isSPACE_LC_uvchr>, and C<isSPACE_LC_utf8_safe>.
 
 =for apidoc Am|bool|isPSXSPC|char ch
 (short for Posix Space)
@@ -712,24 +752,23 @@ C<isSPACE()> forms don't match a Vertical Tab, and the C<isPSXSPC()> forms do.
 Otherwise they are identical.  Thus this macro is analogous to what
 C<m/[[:space:]]/> matches in a regular expression.
 See the L<top of this section|/Character classification> for an explanation of
-variants C<isPSXSPC_A>, C<isPSXSPC_L1>, C<isPSXSPC_uvchr>, C<isPSXSPC_utf8>,
-C<isPSXSPC_LC>, C<isPSXSPC_LC_uvchr>, and C<isPSXSPC_LC_utf8>.
+variants C<isPSXSPC_A>, C<isPSXSPC_L1>, C<isPSXSPC_uvchr>, C<isPSXSPC_utf8_safe>,
+C<isPSXSPC_LC>, C<isPSXSPC_LC_uvchr>, and C<isPSXSPC_LC_utf8_safe>.
 
 =for apidoc Am|bool|isUPPER|char ch
 Returns a boolean indicating whether the specified character is an
 uppercase character, analogous to C<m/[[:upper:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
-variants
-C<isUPPER_A>, C<isUPPER_L1>, C<isUPPER_uvchr>, C<isUPPER_utf8>, C<isUPPER_LC>,
-C<isUPPER_LC_uvchr>, and C<isUPPER_LC_utf8>.
+variants C<isUPPER_A>, C<isUPPER_L1>, C<isUPPER_uvchr>, C<isUPPER_utf8_safe>,
+C<isUPPER_LC>, C<isUPPER_LC_uvchr>, and C<isUPPER_LC_utf8_safe>.
 
 =for apidoc Am|bool|isPRINT|char ch
 Returns a boolean indicating whether the specified character is a
 printable character, analogous to C<m/[[:print:]]/>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isPRINT_A>, C<isPRINT_L1>, C<isPRINT_uvchr>, C<isPRINT_utf8>, C<isPRINT_LC>,
-C<isPRINT_LC_uvchr>, and C<isPRINT_LC_utf8>.
+C<isPRINT_A>, C<isPRINT_L1>, C<isPRINT_uvchr>, C<isPRINT_utf8_safe>,
+C<isPRINT_LC>, C<isPRINT_LC_uvchr>, and C<isPRINT_LC_utf8_safe>.
 
 =for apidoc Am|bool|isWORDCHAR|char ch
 Returns a boolean indicating whether the specified character is a character
@@ -741,10 +780,10 @@ C<isALNUM()> is a synonym provided for backward compatibility, even though a
 word character includes more than the standard C language meaning of
 alphanumeric.
 See the L<top of this section|/Character classification> for an explanation of
-variants
-C<isWORDCHAR_A>, C<isWORDCHAR_L1>, C<isWORDCHAR_uvchr>, and C<isWORDCHAR_utf8>.
-C<isWORDCHAR_LC>, C<isWORDCHAR_LC_uvchr>, and C<isWORDCHAR_LC_utf8> are also as
-described there, but additionally include the platform's native underscore.
+variants C<isWORDCHAR_A>, C<isWORDCHAR_L1>, C<isWORDCHAR_uvchr>, and
+C<isWORDCHAR_utf8_safe>.  C<isWORDCHAR_LC>, C<isWORDCHAR_LC_uvchr>, and
+C<isWORDCHAR_LC_utf8_safe> are also as described there, but additionally
+include the platform's native underscore.
 
 =for apidoc Am|bool|isXDIGIT|char ch
 Returns a boolean indicating whether the specified character is a hexadecimal
@@ -752,8 +791,8 @@ digit.  In the ASCII range these are C<[0-9A-Fa-f]>.  Variants C<isXDIGIT_A()>
 and C<isXDIGIT_L1()> are identical to C<isXDIGIT()>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isXDIGIT_uvchr>, C<isXDIGIT_utf8>, C<isXDIGIT_LC>, C<isXDIGIT_LC_uvchr>, and
-C<isXDIGIT_LC_utf8>.
+C<isXDIGIT_uvchr>, C<isXDIGIT_utf8_safe>, C<isXDIGIT_LC>, C<isXDIGIT_LC_uvchr>,
+and C<isXDIGIT_LC_utf8_safe>.
 
 =for apidoc Am|bool|isIDFIRST|char ch
 Returns a boolean indicating whether the specified character can be the first
@@ -762,8 +801,8 @@ the official Unicode property C<XID_Start>.  The difference is that this
 returns true only if the input character also matches L</isWORDCHAR>.
 See the L<top of this section|/Character classification> for an explanation of
 variants
-C<isIDFIRST_A>, C<isIDFIRST_L1>, C<isIDFIRST_uvchr>, C<isIDFIRST_utf8>,
-C<isIDFIRST_LC>, C<isIDFIRST_LC_uvchr>, and C<isIDFIRST_LC_utf8>.
+C<isIDFIRST_A>, C<isIDFIRST_L1>, C<isIDFIRST_uvchr>, C<isIDFIRST_utf8_safe>,
+C<isIDFIRST_LC>, C<isIDFIRST_LC_uvchr>, and C<isIDFIRST_LC_utf8_safe>.
 
 =for apidoc Am|bool|isIDCONT|char ch
 Returns a boolean indicating whether the specified character can be the
@@ -773,8 +812,8 @@ difference is that this returns true only if the input character also matches
 L</isWORDCHAR>.  See the L<top of this section|/Character classification> for
 an
 explanation of variants C<isIDCONT_A>, C<isIDCONT_L1>, C<isIDCONT_uvchr>,
-C<isIDCONT_utf8>, C<isIDCONT_LC>, C<isIDCONT_LC_uvchr>, and
-C<isIDCONT_LC_utf8>.
+C<isIDCONT_utf8_safe>, C<isIDCONT_LC>, C<isIDCONT_LC_uvchr>, and
+C<isIDCONT_LC_utf8_safe>.
 
 =head1 Miscellaneous Functions
 
@@ -1018,6 +1057,9 @@ patched there.  The file as of this writing is cpan/Devel-PPPort/parts/inc/misc
  * above ASCII in the latter case) */
 
 #  define _CC_SPACE             10      /* \s, [:space:] */
+#  define _CC_PSXSPC            _CC_SPACE   /* XXX Temporary, can be removed
+                                               when the deprecated isFOO_utf8()
+                                               functions are removed */
 #  define _CC_BLANK             11      /* [:blank:] */
 #  define _CC_XDIGIT            12      /* [:xdigit:] */
 #  define _CC_CNTRL             13      /* [:cntrl:] */
@@ -1037,6 +1079,9 @@ patched there.  The file as of this writing is cpan/Devel-PPPort/parts/inc/misc
 #  define _CC_IS_IN_SOME_FOLD          22
 #  define _CC_MNEMONIC_CNTRL           23
 
+#  define _CC_IDCONT 24 /* XXX Temporary, can be removed when the deprecated
+                           isFOO_utf8() functions are removed */
+
 /* This next group is only used on EBCDIC platforms, so theoretically could be
  * shared with something entirely different that's only on ASCII platforms */
 #  define _CC_UTF8_START_BYTE_IS_FOR_AT_LEAST_SURROGATE 28
@@ -1676,33 +1721,75 @@ END_EXTERN_C
  * 'utf8' parameter.  This relies on the fact that ASCII characters have the
  * same representation whether utf8 or not.  Note that it assumes that the utf8
  * has been validated, and ignores 'use bytes' */
-#define _generic_utf8(classnum, p, utf8) (UTF8_IS_INVARIANT(*(p))              \
-                                         ? _generic_isCC(*(p), classnum)       \
-                                         : (UTF8_IS_DOWNGRADEABLE_START(*(p))) \
-                                           ? _generic_isCC(                    \
-                                                EIGHT_BIT_UTF8_TO_NATIVE(*(p), \
-                                                                   *((p)+1 )), \
-                                                classnum)                      \
-                                           : utf8)
+#define _base_generic_utf8(enum_name, name, p, use_locale )                 \
+    _is_utf8_FOO(CAT2(_CC_, enum_name),                                     \
+                 (const U8 *) p,                                            \
+                 "is" STRINGIFY(name) "_utf8",                              \
+                 "is" STRINGIFY(name) "_utf8_safe",                         \
+                 1, use_locale, __FILE__,__LINE__)
+
+#define _generic_utf8(name, p) _base_generic_utf8(name, name, p, 0)
+
+/* The "_safe" macros make sure that we don't attempt to read beyond 'e', but
+ * they don't otherwise go out of their way to look for malformed UTF-8.  If
+ * they can return accurate results without knowing if the input is otherwise
+ * malformed, they do so.  For example isASCII is accurate in spite of any
+ * non-length malformations because it looks only at a single byte. Likewise
+ * isDIGIT looks just at the first byte for code points 0-255, as all UTF-8
+ * variant ones return FALSE.  But, if the input has to be well-formed in order
+ * for the results to be accurate, the macros will test and if malformed will
+ * call a routine to die.
+ *
+ * Except for toke.c, the macros do assume that e > p, asserting that on
+ * DEBUGGING builds.  Much code that calls these depends on this being true,
+ * for other reasons.  toke.c is treated specially as using the regular
+ * assertion breaks it in many ways.  All strings that these operate on there
+ * are supposed to have an extra NUL character at the end,  so that *e = \0. A
+ * bunch of code in toke.c assumes that this is true, so the assertion allows
+ * for that */
+#ifdef PERL_IN_TOKE_C
+#  define _utf8_safe_assert(p,e) ((e) > (p) || ((e) == (p) && *(p) == '\0'))
+#else
+#  define _utf8_safe_assert(p,e) ((e) > (p))
+#endif
+
+#define _generic_utf8_safe(classnum, p, e, above_latin1)                    \
+         (__ASSERT_(_utf8_safe_assert(p, e))                                \
+         (UTF8_IS_INVARIANT(*(p)))                                          \
+          ? _generic_isCC(*(p), classnum)                                   \
+          : (UTF8_IS_DOWNGRADEABLE_START(*(p))                              \
+             ? ((LIKELY((e) - (p) > 1 && UTF8_IS_CONTINUATION(*((p)+1))))   \
+                ? _generic_isCC(EIGHT_BIT_UTF8_TO_NATIVE(*(p), *((p)+1 )),  \
+                                classnum)                                   \
+                : (_force_out_malformed_utf8_message(                       \
+                                        (U8 *) (p), (U8 *) (e), 0, 1), 0))  \
+             : above_latin1))
 /* Like the above, but calls 'above_latin1(p)' to get the utf8 value.
  * 'above_latin1' can be a macro */
-#define _generic_func_utf8(classnum, above_latin1, p)  \
-                                    _generic_utf8(classnum, p, above_latin1(p))
+#define _generic_func_utf8_safe(classnum, above_latin1, p, e)               \
+                    _generic_utf8_safe(classnum, p, e, above_latin1(p, e))
+#define _generic_non_swash_utf8_safe(classnum, above_latin1, p, e)          \
+          _generic_utf8_safe(classnum, p, e,                                \
+                             (UNLIKELY((e) - (p) < UTF8SKIP(p))             \
+                              ? (_force_out_malformed_utf8_message(         \
+                                      (U8 *) (p), (U8 *) (e), 0, 1), 0)     \
+                              : above_latin1(p)))
 /* Like the above, but passes classnum to _isFOO_utf8(), instead of having an
  * 'above_latin1' parameter */
-#define _generic_swash_utf8(classnum, p)  \
-                      _generic_utf8(classnum, p, _is_utf8_FOO(classnum, p))
+#define _generic_swash_utf8_safe(classnum, p, e)                            \
+_generic_utf8_safe(classnum, p, e, _is_utf8_FOO_with_len(classnum, p, e))
 
 /* Like the above, but should be used only when it is known that there are no
  * characters in the upper-Latin1 range (128-255 on ASCII platforms) which the
  * class is TRUE for.  Hence it can skip the tests for this range.
  * 'above_latin1' should include its arguments */
-#define _generic_utf8_no_upper_latin1(classnum, p, above_latin1)               \
-                                         (UTF8_IS_INVARIANT(*(p))              \
-                                         ? _generic_isCC(*(p), classnum)       \
-                                         : (UTF8_IS_ABOVE_LATIN1(*(p)))        \
-                                           ? above_latin1                      \
-                                           : 0)
+#define _generic_utf8_safe_no_upper_latin1(classnum, p, e, above_latin1)    \
+         (__ASSERT_(_utf8_safe_assert(p, e))                                \
+         (UTF8_IS_INVARIANT(*(p)))                                          \
+          ? _generic_isCC(*(p), classnum)                                   \
+          : (UTF8_IS_DOWNGRADEABLE_START(*(p)))                             \
+             ? 0 /* Note that this doesn't check validity for latin1 */     \
+             : above_latin1)
 
 /* NOTE that some of these macros have very similar ones in regcharclass.h.
  * For example, there is (at the time of this writing) an 'is_SPACE_utf8()'
@@ -1712,26 +1799,50 @@ END_EXTERN_C
  * points; the regcharclass.h ones are implemented as a series of
  * "if-else-if-else ..." */
 
-#define isALPHA_utf8(p)        _generic_swash_utf8(_CC_ALPHA, p)
-#define isALPHANUMERIC_utf8(p) _generic_swash_utf8(_CC_ALPHANUMERIC, p)
-#define isASCII_utf8(p)        isASCII(*p) /* Because ASCII is invariant under
-                                               utf8, the non-utf8 macro works
-                                             */
-#define isBLANK_utf8(p)        _generic_func_utf8(_CC_BLANK, is_HORIZWS_high, p)
+#define isALPHA_utf8(p)         _generic_utf8(ALPHA, p)
+#define isALPHANUMERIC_utf8(p)  _generic_utf8(ALPHANUMERIC, p)
+#define isASCII_utf8(p)         _generic_utf8(ASCII, p)
+#define isBLANK_utf8(p)         _generic_utf8(BLANK, p)
+#define isCNTRL_utf8(p)         _generic_utf8(CNTRL, p)
+#define isDIGIT_utf8(p)         _generic_utf8(DIGIT, p)
+#define isGRAPH_utf8(p)         _generic_utf8(GRAPH, p)
+#define isIDCONT_utf8(p)        _generic_utf8(IDCONT, p)
+#define isIDFIRST_utf8(p)       _generic_utf8(IDFIRST, p)
+#define isLOWER_utf8(p)         _generic_utf8(LOWER, p)
+#define isPRINT_utf8(p)         _generic_utf8(PRINT, p)
+#define isPSXSPC_utf8(p)        _generic_utf8(PSXSPC, p)
+#define isPUNCT_utf8(p)         _generic_utf8(PUNCT, p)
+#define isSPACE_utf8(p)         _generic_utf8(SPACE, p)
+#define isUPPER_utf8(p)         _generic_utf8(UPPER, p)
+#define isVERTWS_utf8(p)        _generic_utf8(VERTSPACE, p)
+#define isWORDCHAR_utf8(p)      _generic_utf8(WORDCHAR, p)
+#define isXDIGIT_utf8(p)        _generic_utf8(XDIGIT, p)
+
+#define isALPHA_utf8_safe(p, e)  _generic_swash_utf8_safe(_CC_ALPHA, p, e)
+#define isALPHANUMERIC_utf8_safe(p, e)                                      \
+                        _generic_swash_utf8_safe(_CC_ALPHANUMERIC, p, e)
+#define isASCII_utf8_safe(p, e)                                             \
+    /* Because ASCII is invariant under utf8, the non-utf8 macro            \
+    * works */                                                              \
+    (__ASSERT_(_utf8_safe_assert(p, e)) isASCII(*(p)))
+#define isBLANK_utf8_safe(p, e)                                             \
+        _generic_non_swash_utf8_safe(_CC_BLANK, is_HORIZWS_high, p, e)
 
 #ifdef EBCDIC
     /* Because all controls are UTF-8 invariants in EBCDIC, we can use this
      * more efficient macro instead of the more general one */
-#   define isCNTRL_utf8(p)      isCNTRL_L1(*(p))
+#   define isCNTRL_utf8_safe(p, e)                                          \
+                    (__ASSERT_(_utf8_safe_assert(p, e)) isCNTRL_L1(*(p)))
 #else
-#   define isCNTRL_utf8(p)      _generic_utf8(_CC_CNTRL, p, 0)
+#   define isCNTRL_utf8_safe(p, e)  _generic_utf8_safe(_CC_CNTRL, p, e, 0)
 #endif
 
-#define isDIGIT_utf8(p)         _generic_utf8_no_upper_latin1(_CC_DIGIT, p,   \
-                                                  _is_utf8_FOO(_CC_DIGIT, p))
-#define isGRAPH_utf8(p)         _generic_swash_utf8(_CC_GRAPH, p)
-#define isIDCONT_utf8(p)        _generic_func_utf8(_CC_WORDCHAR,              \
-                                                  _is_utf8_perl_idcont, p)
+#define isDIGIT_utf8_safe(p, e)                                             \
+            _generic_utf8_safe_no_upper_latin1(_CC_DIGIT, p, e,             \
+                                    _is_utf8_FOO_with_len(_CC_DIGIT, p, e))
+#define isGRAPH_utf8_safe(p, e)    _generic_swash_utf8_safe(_CC_GRAPH, p, e)
+#define isIDCONT_utf8_safe(p, e)   _generic_func_utf8_safe(_CC_WORDCHAR,    \
+                                     _is_utf8_perl_idcont_with_len, p, e)
 
 /* To prevent S_scan_word in toke.c from hanging, we have to make sure that
  * IDFIRST is an alnum.  See
@@ -1739,19 +1850,27 @@ END_EXTERN_C
  * ever wanted to know about.  (In the ASCII range, there isn't a difference.)
  * This used to be not the XID version, but we decided to go with the more
  * modern Unicode definition */
-#define isIDFIRST_utf8(p)   _generic_func_utf8(_CC_IDFIRST,                  \
-                                                _is_utf8_perl_idstart, p)
-
-#define isLOWER_utf8(p)     _generic_swash_utf8(_CC_LOWER, p)
-#define isPRINT_utf8(p)     _generic_swash_utf8(_CC_PRINT, p)
-#define isPSXSPC_utf8(p)    isSPACE_utf8(p)
-#define isPUNCT_utf8(p)     _generic_swash_utf8(_CC_PUNCT, p)
-#define isSPACE_utf8(p)     _generic_func_utf8(_CC_SPACE, is_XPERLSPACE_high, p)
-#define isUPPER_utf8(p)     _generic_swash_utf8(_CC_UPPER, p)
-#define isVERTWS_utf8(p)    _generic_func_utf8(_CC_VERTSPACE, is_VERTWS_high, p)
-#define isWORDCHAR_utf8(p)  _generic_swash_utf8(_CC_WORDCHAR, p)
-#define isXDIGIT_utf8(p)    _generic_utf8_no_upper_latin1(_CC_XDIGIT, p,     \
-                                                          is_XDIGIT_high(p))
+#define isIDFIRST_utf8_safe(p, e)                                           \
+    _generic_func_utf8_safe(_CC_IDFIRST,                                    \
+                    _is_utf8_perl_idstart_with_len, (U8 *) (p), (U8 *) (e))
+
+#define isLOWER_utf8_safe(p, e)     _generic_swash_utf8_safe(_CC_LOWER, p, e)
+#define isPRINT_utf8_safe(p, e)     _generic_swash_utf8_safe(_CC_PRINT, p, e)
+#define isPSXSPC_utf8_safe(p, e)     isSPACE_utf8_safe(p, e)
+#define isPUNCT_utf8_safe(p, e)     _generic_swash_utf8_safe(_CC_PUNCT, p, e)
+#define isSPACE_utf8_safe(p, e)                                             \
+    _generic_non_swash_utf8_safe(_CC_SPACE, is_XPERLSPACE_high, p, e)
+#define isUPPER_utf8_safe(p, e)  _generic_swash_utf8_safe(_CC_UPPER, p, e)
+#define isVERTWS_utf8_safe(p, e)                                            \
+        _generic_non_swash_utf8_safe(_CC_VERTSPACE, is_VERTWS_high, p, e)
+#define isWORDCHAR_utf8_safe(p, e)                                          \
+                             _generic_swash_utf8_safe(_CC_WORDCHAR, p, e)
+#define isXDIGIT_utf8_safe(p, e)                                            \
+                   _generic_utf8_safe_no_upper_latin1(_CC_XDIGIT, p, e,     \
+                             (UNLIKELY((e) - (p) < UTF8SKIP(p))             \
+                              ? (_force_out_malformed_utf8_message(         \
+                                      (U8 *) (p), (U8 *) (e), 0, 1), 0)     \
+                              : is_XDIGIT_high(p)))
 
 #define toFOLD_utf8(p,s,l)     to_utf8_fold(p,s,l)
 #define toLOWER_utf8(p,s,l)    to_utf8_lower(p,s,l)
@@ -1762,42 +1881,91 @@ END_EXTERN_C
  * isALPHA_LC_utf8.  These are like _generic_utf8, but if the first code point
  * in 'p' is within the 0-255 range, it uses locale rules from the passed-in
  * 'macro' parameter */
-#define _generic_LC_utf8(macro, p, utf8)                                    \
-                         (UTF8_IS_INVARIANT(*(p))                           \
-                         ? macro(*(p))                                      \
-                         : (UTF8_IS_DOWNGRADEABLE_START(*(p)))              \
-                           ? macro(EIGHT_BIT_UTF8_TO_NATIVE(*(p), *((p)+1)))\
-                           : utf8)
-
-#define _generic_LC_swash_utf8(macro, classnum, p)                         \
-                    _generic_LC_utf8(macro, p, _is_utf8_FOO(classnum, p))
-#define _generic_LC_func_utf8(macro, above_latin1, p)                         \
-                              _generic_LC_utf8(macro, p, above_latin1(p))
-
-#define isALPHANUMERIC_LC_utf8(p) _generic_LC_swash_utf8(isALPHANUMERIC_LC,   \
-                                                      _CC_ALPHANUMERIC, p)
-#define isALPHA_LC_utf8(p)    _generic_LC_swash_utf8(isALPHA_LC, _CC_ALPHA, p)
-#define isASCII_LC_utf8(p)     isASCII_LC(*p)
-#define isBLANK_LC_utf8(p)    _generic_LC_func_utf8(isBLANK_LC,               \
-                                                         is_HORIZWS_high, p)
-#define isCNTRL_LC_utf8(p)    _generic_LC_utf8(isCNTRL_LC, p, 0)
-#define isDIGIT_LC_utf8(p)    _generic_LC_swash_utf8(isDIGIT_LC, _CC_DIGIT, p)
-#define isGRAPH_LC_utf8(p)    _generic_LC_swash_utf8(isGRAPH_LC, _CC_GRAPH, p)
-#define isIDCONT_LC_utf8(p)   _generic_LC_func_utf8(isIDCONT_LC,              \
-                                                    _is_utf8_perl_idcont, p)
-#define isIDFIRST_LC_utf8(p)  _generic_LC_func_utf8(isIDFIRST_LC,             \
-                                                    _is_utf8_perl_idstart, p)
-#define isLOWER_LC_utf8(p)    _generic_LC_swash_utf8(isLOWER_LC, _CC_LOWER, p)
-#define isPRINT_LC_utf8(p)    _generic_LC_swash_utf8(isPRINT_LC, _CC_PRINT, p)
-#define isPSXSPC_LC_utf8(p)    isSPACE_LC_utf8(p)
-#define isPUNCT_LC_utf8(p)    _generic_LC_swash_utf8(isPUNCT_LC, _CC_PUNCT, p)
-#define isSPACE_LC_utf8(p)    _generic_LC_func_utf8(isSPACE_LC,               \
-                                                        is_XPERLSPACE_high, p)
-#define isUPPER_LC_utf8(p)    _generic_LC_swash_utf8(isUPPER_LC, _CC_UPPER, p)
-#define isWORDCHAR_LC_utf8(p) _generic_LC_swash_utf8(isWORDCHAR_LC,           \
-                                                            _CC_WORDCHAR, p)
-#define isXDIGIT_LC_utf8(p)   _generic_LC_func_utf8(isXDIGIT_LC,              \
-                                                            is_XDIGIT_high, p)
+#define _generic_LC_utf8(name, p) _base_generic_utf8(name, name, p, 1)
+
+#define isALPHA_LC_utf8(p)         _generic_LC_utf8(ALPHA, p)
+#define isALPHANUMERIC_LC_utf8(p)  _generic_LC_utf8(ALPHANUMERIC, p)
+#define isASCII_LC_utf8(p)         _generic_LC_utf8(ASCII, p)
+#define isBLANK_LC_utf8(p)         _generic_LC_utf8(BLANK, p)
+#define isCNTRL_LC_utf8(p)         _generic_LC_utf8(CNTRL, p)
+#define isDIGIT_LC_utf8(p)         _generic_LC_utf8(DIGIT, p)
+#define isGRAPH_LC_utf8(p)         _generic_LC_utf8(GRAPH, p)
+#define isIDCONT_LC_utf8(p)        _generic_LC_utf8(IDCONT, p)
+#define isIDFIRST_LC_utf8(p)       _generic_LC_utf8(IDFIRST, p)
+#define isLOWER_LC_utf8(p)         _generic_LC_utf8(LOWER, p)
+#define isPRINT_LC_utf8(p)         _generic_LC_utf8(PRINT, p)
+#define isPSXSPC_LC_utf8(p)        _generic_LC_utf8(PSXSPC, p)
+#define isPUNCT_LC_utf8(p)         _generic_LC_utf8(PUNCT, p)
+#define isSPACE_LC_utf8(p)         _generic_LC_utf8(SPACE, p)
+#define isUPPER_LC_utf8(p)         _generic_LC_utf8(UPPER, p)
+#define isWORDCHAR_LC_utf8(p)      _generic_LC_utf8(WORDCHAR, p)
+#define isXDIGIT_LC_utf8(p)        _generic_LC_utf8(XDIGIT, p)
+
+/* For internal core Perl use only: the base macros for defining macros like
+ * isALPHA_LC_utf8_safe.  These are like _generic_utf8, but if the first code
+ * point in 'p' is within the 0-255 range, it uses locale rules from the
+ * passed-in 'macro' parameter */
+#define _generic_LC_utf8_safe(macro, p, e, above_latin1)                    \
+         (__ASSERT_(_utf8_safe_assert(p, e))                                \
+         (UTF8_IS_INVARIANT(*(p)))                                          \
+          ? macro(*(p))                                                     \
+          : (UTF8_IS_DOWNGRADEABLE_START(*(p))                              \
+             ? ((LIKELY((e) - (p) > 1 && UTF8_IS_CONTINUATION(*((p)+1))))   \
+                ? macro(EIGHT_BIT_UTF8_TO_NATIVE(*(p), *((p)+1)))           \
+                : (_force_out_malformed_utf8_message(                       \
+                                        (U8 *) (p), (U8 *) (e), 0, 1), 0))  \
+              : above_latin1))
+
+#define _generic_LC_swash_utf8_safe(macro, classnum, p, e)                  \
+            _generic_LC_utf8_safe(macro, p, e,                              \
+                               _is_utf8_FOO_with_len(classnum, p, e))
+
+#define _generic_LC_func_utf8_safe(macro, above_latin1, p, e)               \
+            _generic_LC_utf8_safe(macro, p, e, above_latin1(p, e))
+
+#define _generic_LC_non_swash_utf8_safe(classnum, above_latin1, p, e)       \
+          _generic_LC_utf8_safe(classnum, p, e,                             \
+                             (UNLIKELY((e) - (p) < UTF8SKIP(p))             \
+                              ? (_force_out_malformed_utf8_message(         \
+                                      (U8 *) (p), (U8 *) (e), 0, 1), 0)     \
+                              : above_latin1(p)))
+
+#define isALPHANUMERIC_LC_utf8_safe(p, e)                                   \
+            _generic_LC_swash_utf8_safe(isALPHANUMERIC_LC,                  \
+                                        _CC_ALPHANUMERIC, p, e)
+#define isALPHA_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isALPHA_LC, _CC_ALPHA, p, e)
+#define isASCII_LC_utf8_safe(p, e)                                          \
+                    (__ASSERT_(_utf8_safe_assert(p, e)) isASCII_LC(*(p)))
+#define isBLANK_LC_utf8_safe(p, e)                                          \
+        _generic_LC_non_swash_utf8_safe(isBLANK_LC, is_HORIZWS_high, p, e)
+#define isCNTRL_LC_utf8_safe(p, e)                                          \
+            _generic_LC_utf8_safe(isCNTRL_LC, p, e, 0)
+#define isDIGIT_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isDIGIT_LC, _CC_DIGIT, p, e)
+#define isGRAPH_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isGRAPH_LC, _CC_GRAPH, p, e)
+#define isIDCONT_LC_utf8_safe(p, e)                                         \
+            _generic_LC_func_utf8_safe(isIDCONT_LC,                         \
+                                _is_utf8_perl_idcont_with_len, p, e)
+#define isIDFIRST_LC_utf8_safe(p, e)                                        \
+            _generic_LC_func_utf8_safe(isIDFIRST_LC,                        \
+                                _is_utf8_perl_idstart_with_len, p, e)
+#define isLOWER_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isLOWER_LC, _CC_LOWER, p, e)
+#define isPRINT_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isPRINT_LC, _CC_PRINT, p, e)
+#define isPSXSPC_LC_utf8_safe(p, e)    isSPACE_LC_utf8_safe(p, e)
+#define isPUNCT_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isPUNCT_LC, _CC_PUNCT, p, e)
+#define isSPACE_LC_utf8_safe(p, e)                                          \
+    _generic_LC_non_swash_utf8_safe(isSPACE_LC, is_XPERLSPACE_high, p, e)
+#define isUPPER_LC_utf8_safe(p, e)                                          \
+            _generic_LC_swash_utf8_safe(isUPPER_LC, _CC_UPPER, p, e)
+#define isWORDCHAR_LC_utf8_safe(p, e)                                       \
+            _generic_LC_swash_utf8_safe(isWORDCHAR_LC, _CC_WORDCHAR, p, e)
+#define isXDIGIT_LC_utf8_safe(p, e)                                         \
+        _generic_LC_non_swash_utf8_safe(isXDIGIT_LC, is_XDIGIT_high, p, e)
 
 /* Macros for backwards compatibility and for completeness when the ASCII and
  * Latin1 values are identical */
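
The length check that the "_safe" macros above add for two-byte sequences
can be illustrated outside of Perl.  What follows is only a sketch under
stated assumptions, not the handy.h implementation: the function name is
hypothetical, it assumes an ASCII platform, it recognizes only U+0020 and
U+00A0 as space, and it returns -1 for a truncated or malformed sequence
where the real macros call a routine to die.

```c
#include <assert.h>

/* Hypothetical helper, not part of the Perl API.  Reports whether the
 * character starting at p is U+0020 SPACE or U+00A0 NO-BREAK SPACE without
 * ever reading at or beyond e.  Returns 1 for a match, 0 for no match, and
 * -1 when the two-byte sequence is cut off by e or its second byte is not a
 * continuation byte. */
static int sketch_is_space_bounded(const unsigned char *p,
                                   const unsigned char *e)
{
    assert(e > p);                      /* same precondition as the macros */

    if (p[0] < 0x80)                    /* UTF-8 invariant: a single byte */
        return p[0] == ' ';

    if (p[0] == 0xC2 || p[0] == 0xC3) { /* start bytes for U+0080..U+00FF */
        if (e - p > 1 && (p[1] & 0xC0) == 0x80) {
            unsigned cp = ((p[0] & 0x1Fu) << 6) | (p[1] & 0x3Fu);
            return cp == 0xA0;
        }
        return -1;                      /* truncated or malformed */
    }

    return 0;   /* above Latin-1: out of scope for this sketch */
}
```

The point of the sketch is the ordering: the second byte is examined only
after (e - p > 1) has been established, which is what keeps the read inside
the buffer.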
diff --git a/intrpvar.h b/intrpvar.h
index 1aa94f7f31..a078be4a1f 100644
--- a/intrpvar.h
+++ b/intrpvar.h
@@ -628,6 +628,7 @@ PERLVAR(I, GCB_invlist, SV *)
 PERLVAR(I, LB_invlist, SV *)
 PERLVAR(I, SB_invlist, SV *)
 PERLVAR(I, WB_invlist, SV *)
+PERLVAR(I, seen_deprecated_macro, HV *)
 
 PERLVAR(I, last_swash_hv, HV *)
 PERLVAR(I, last_swash_tmps, U8 *)
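
The deprecated macros pass __FILE__ and __LINE__ down to the base routine,
and the interpreter gains a seen_deprecated_macro hash.  One plausible way
to warn only on the first use at each call site is sketched below.  This is
an assumption-laden illustration, not the utf8.c code: the names, the
fixed-size table (standing in for the hash), and the message wording are
all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_SITES 64

/* Call sites that have already produced a warning in this run. */
static struct { const char *file; int line; } sketch_seen[SKETCH_MAX_SITES];
static int sketch_seen_count = 0;

/* Emits a warning the first time a given (file, line) pair is seen.
 * Returns 1 if a warning was printed, 0 if this site was already recorded
 * or the table is full. */
static int sketch_warn_once(const char *name, const char *alternative,
                            const char *file, int line)
{
    int i;

    for (i = 0; i < sketch_seen_count; i++) {
        if (sketch_seen[i].line == line
            && strcmp(sketch_seen[i].file, file) == 0)
            return 0;                   /* already warned for this site */
    }
    if (sketch_seen_count == SKETCH_MAX_SITES)
        return 0;                       /* table full: stay silent */

    sketch_seen[sketch_seen_count].file = file;
    sketch_seen[sketch_seen_count].line = line;
    sketch_seen_count++;

    fprintf(stderr, "%s is deprecated at %s line %d; consider %s\n",
            name, file, line, alternative);
    return 1;
}

/* A deprecated macro would expand to something carrying its own location. */
#define SKETCH_DEPRECATED(name, alt) \
    sketch_warn_once(name, alt, __FILE__, __LINE__)
```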
diff --git a/op.c b/op.c
index 722ee358c6..394efef5df 100644
--- a/op.c
+++ b/op.c
@@ -652,11 +652,12 @@ Perl_allocmy(pTHX_ const char *const name, const STRLEN len, const U32 flags)
                   (UV)flags);
 
     /* complain about "my $<special_var>" etc etc */
-    if (len &&
-       !(is_our ||
-         isALPHA(name[1]) ||
-         ((flags & SVf_UTF8) && isIDFIRST_utf8((U8 *)name+1)) ||
-         (name[1] == '_' && len > 2)))
+    if (   len
+        && !(  is_our
+            || isALPHA(name[1])
+            || (   (flags & SVf_UTF8)
+                && isIDFIRST_utf8_safe((U8 *)name+1, name + len))
+            || (name[1] == '_' && len > 2)))
     {
        if (!(flags & SVf_UTF8 && UTF8_IS_START(name[1]))
         && isASCII(name[1])
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index b6feb46b3e..472d45bbda 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -327,9 +327,46 @@ well.
 
 =item *
 
+New versions of macros like C<isALPHA_utf8> have been added, each with the
+suffix C<_safe>, like C<isSPACE_utf8_safe>.  These take an extra
+parameter, giving an upper limit of how far into the string it is safe
+to read.  Using the old versions could cause attempts to read beyond the
+end of the input buffer if the UTF-8 is not well-formed, and their use
+now raises a deprecation warning.  Details are at
+L<perlapi/Character classification>.
+
+=item *
+
 Calling macros like C<isALPHA_utf8> on malformed UTF-8 has issued a
 deprecation warning since Perl v5.18.  They now die.
 
+=item *
+
+Calling the function C<utf8n_to_uvchr> or any of its derivatives with a
+string length of 0 now fails an assertion in DEBUGGING builds, and
+otherwise returns the Unicode REPLACEMENT CHARACTER.  If you have
+nothing to decode, you shouldn't call the decode function.
+
+=item *
+
+The function C<utf8n_to_uvchr> and its derivatives now return the
+Unicode REPLACEMENT CHARACTER if called with UTF-8 that has the overlong
+malformation, and that malformation is allowed by the input parameters.
+This malformation is where the UTF-8 looks valid syntactically, but
+there is a shorter sequence that yields the same code point.  This has
+been forbidden since Unicode version 3.1.
+
+=item *
+
+The function C<utf8n_to_uvchr> and its derivatives now accept an input
+flag to allow the overflow malformation.  This malformation is when the
+UTF-8 may be syntactically valid, but the code point it represents is
+not capable of being represented in the word length on the platform.
+What "allowed" means in this case is that the function doesn't return an
+error, and advances the parse pointer to beyond the UTF-8 in question,
+but it returns the Unicode REPLACEMENT CHARACTER as the value of the
+code point (since the real value is not representable).
+
 =back
 
 =head1 Selected Bug Fixes
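
The overlong malformation described in the perldelta entries above can be
made concrete with a small decoder.  This is a sketch only, and it is not
utf8n_to_uvchr: the function name and signature are hypothetical, it
handles sequences of at most three bytes, it ignores surrogates, and it
returns U+FFFD for every error rather than consulting any flags.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_REPLACEMENT 0xFFFDul

/* Hypothetical decoder.  Decodes one character from s (len bytes are
 * available), stores the number of bytes consumed in *consumed, and returns
 * the code point, or U+FFFD when the input is truncated, malformed, or
 * overlong. */
static unsigned long sketch_decode(const unsigned char *s, size_t len,
                                   size_t *consumed)
{
    unsigned long cp, min;
    size_t need, i;

    assert(len > 0);            /* nothing to decode: don't call the decoder */

    if (s[0] < 0x80) {
        *consumed = 1;
        return s[0];
    }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; min = 0x80;  cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; min = 0x800; cp = s[0] & 0x0F; }
    else {
        *consumed = 1;          /* stray continuation or unsupported start */
        return SKETCH_REPLACEMENT;
    }

    for (i = 1; i < need; i++) {
        if (i >= len || (s[i] & 0xC0) != 0x80) {
            *consumed = i;      /* truncated or bad continuation byte */
            return SKETCH_REPLACEMENT;
        }
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    *consumed = need;
    /* Syntactically valid, but a shorter sequence encodes the same value */
    return (cp < min) ? SKETCH_REPLACEMENT : cp;
}
```

Note that in the overlong case the sketch still advances past the whole
sequence, so a caller can continue parsing at the next character.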
diff --git a/pp.c b/pp.c
index b198b47a8a..6fb20f684e 100644
--- a/pp.c
+++ b/pp.c
@@ -5794,7 +5794,7 @@ PP(pp_split)
     orig = s;
     if (RX_EXTFLAGS(rx) & RXf_SKIPWHITE) {
        if (do_utf8) {
-           while (isSPACE_utf8(s))
+           while (isSPACE_utf8_safe(s, strend))
                s += UTF8SKIP(s);
        }
        else if (get_regex_charset(RX_EXTFLAGS(rx)) == REGEX_LOCALE_CHARSET) {
@@ -5819,9 +5819,9 @@ PP(pp_split)
            m = s;
            /* this one uses 'm' and is a negative test */
            if (do_utf8) {
-               while (m < strend && ! isSPACE_utf8(m) ) {
+               while (m < strend && ! isSPACE_utf8_safe(m, strend) ) {
                    const int t = UTF8SKIP(m);
-                   /* isSPACE_utf8 returns FALSE for malform utf8 */
+                   /* isSPACE_utf8_safe returns FALSE for malformed utf8 */
                    if (strend - m < t)
                        m = strend;
                    else
@@ -5859,7 +5859,7 @@ PP(pp_split)
 
            /* this one uses 's' and is a positive test */
            if (do_utf8) {
-               while (s < strend && isSPACE_utf8(s) )
+               while (s < strend && isSPACE_utf8_safe(s, strend) )
                    s +=  UTF8SKIP(s);
            }
            else if (get_regex_charset(RX_EXTFLAGS(rx)) == REGEX_LOCALE_CHARSET)
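
The pp_split loops above advance by the length implied by the start byte
and clamp to strend when that length would run past the end of the buffer.
A minimal standalone sketch of that clamping step follows.  The helper
names are hypothetical, and sketch_seq_len() is a simplified stand-in for
UTF8SKIP() that covers standard UTF-8 on ASCII platforms only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for UTF8SKIP(): sequence length implied by the
 * first byte.  A stray continuation byte counts as length 1. */
static size_t sketch_seq_len(unsigned char first)
{
    if (first < 0xC0) return 1;   /* ASCII or stray continuation byte */
    if (first < 0xE0) return 2;
    if (first < 0xF0) return 3;
    return 4;
}

/* Step past the character at m, clamping to strend when the start byte
 * promises more bytes than remain in the buffer. */
static const unsigned char *sketch_next_char(const unsigned char *m,
                                             const unsigned char *strend)
{
    size_t t;

    assert(m < strend);
    t = sketch_seq_len(*m);
    return ((size_t)(strend - m) < t) ? strend : m + t;
}
```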
diff --git a/pp_pack.c b/pp_pack.c
index a75229acca..ee4c69e0ae 100644
--- a/pp_pack.c
+++ b/pp_pack.c
@@ -1073,9 +1073,14 @@ S_unpack_rec(pTHX_ tempsym_t* symptr, const char *s, const char *strbeg, const c
                /* 'A' strips both nulls and spaces */
                const char *ptr;
                if (utf8 && (symptr->flags & FLAG_WAS_UTF8)) {
-                   for (ptr = s+len-1; ptr >= s; ptr--)
-                       if (*ptr != 0 && !UTF8_IS_CONTINUATION(*ptr) &&
-                           !isSPACE_utf8(ptr)) break;
+                    for (ptr = s+len-1; ptr >= s; ptr--) {
+                        if (   *ptr != 0
+                            && !UTF8_IS_CONTINUATION(*ptr)
+                            && !isSPACE_utf8_safe(ptr, strend))
+                        {
+                            break;
+                        }
+                    }
                    if (ptr >= s) ptr += UTF8SKIP(ptr);
                    else ptr++;
                    if (ptr > s+len)
diff --git a/proto.h b/proto.h
index c7065cd680..939e821b95 100644
--- a/proto.h
+++ b/proto.h
@@ -54,10 +54,15 @@ PERL_CALLCONV bool  Perl__is_uni_perl_idcont(pTHX_ UV c)
 PERL_CALLCONV bool     Perl__is_uni_perl_idstart(pTHX_ UV c)
                        __attribute__warn_unused_result__;
 
-PERL_CALLCONV bool     Perl__is_utf8_FOO(pTHX_ const U8 classnum, const U8 *p)
+PERL_CALLCONV bool     Perl__is_utf8_FOO(pTHX_ U8 classnum, const U8 * const p, const char * const name, const char * const alternative, const bool use_utf8, const bool use_locale, const char * const fil ... [23 chars truncated]
                        __attribute__warn_unused_result__;
 #define PERL_ARGS_ASSERT__IS_UTF8_FOO  \
-       assert(p)
+       assert(p); assert(name); assert(alternative); assert(file)
+
+PERL_CALLCONV bool     Perl__is_utf8_FOO_with_len(pTHX_ const U8 classnum, const U8 *p, const U8 * const e)
+                       __attribute__warn_unused_result__;
+#define PERL_ARGS_ASSERT__IS_UTF8_FOO_WITH_LEN \
+       assert(p); assert(e)
 
 PERL_CALLCONV bool     Perl__is_utf8_idcont(pTHX_ const U8 *p)
                        __attribute__warn_unused_result__;
@@ -74,15 +79,15 @@ PERL_CALLCONV bool  Perl__is_utf8_mark(pTHX_ const U8 *p)
 #define PERL_ARGS_ASSERT__IS_UTF8_MARK \
        assert(p)
 
-PERL_CALLCONV bool     Perl__is_utf8_perl_idcont(pTHX_ const U8 *p)
+PERL_CALLCONV bool     Perl__is_utf8_perl_idcont_with_len(pTHX_ const U8 *p, const U8 * const e)
                        __attribute__warn_unused_result__;
-#define PERL_ARGS_ASSERT__IS_UTF8_PERL_IDCONT  \
-       assert(p)
+#define PERL_ARGS_ASSERT__IS_UTF8_PERL_IDCONT_WITH_LEN \
+       assert(p); assert(e)
 
-PERL_CALLCONV bool     Perl__is_utf8_perl_idstart(pTHX_ const U8 *p)
+PERL_CALLCONV bool     Perl__is_utf8_perl_idstart_with_len(pTHX_ const U8 *p, const U8 * const e)
                        __attribute__warn_unused_result__;
-#define PERL_ARGS_ASSERT__IS_UTF8_PERL_IDSTART \
-       assert(p)
+#define PERL_ARGS_ASSERT__IS_UTF8_PERL_IDSTART_WITH_LEN        \
+       assert(p); assert(e)
 
 PERL_CALLCONV bool     Perl__is_utf8_xidcont(pTHX_ const U8 *p)
                        __attribute__warn_unused_result__;
@@ -5261,9 +5266,6 @@ STATIC char*      S_find_byclass(pTHX_ regexp * prog, const regnode *c, char *s, cons
 #define PERL_ARGS_ASSERT_FIND_BYCLASS  \
        assert(prog); assert(c); assert(s); assert(strend)
 
-STATIC bool    S_isFOO_lc(pTHX_ const U8 classnum, const U8 character)
-                       __attribute__warn_unused_result__;
-
 STATIC bool    S_isFOO_utf8_lc(pTHX_ const U8 classnum, const U8* character)
                        __attribute__warn_unused_result__;
 #define PERL_ARGS_ASSERT_ISFOO_UTF8_LC \
@@ -5345,6 +5347,11 @@ STATIC void      S_to_utf8_substr(pTHX_ regexp * prog);
 #define PERL_ARGS_ASSERT_TO_UTF8_SUBSTR        \
        assert(prog)
 #endif
+#if defined(PERL_IN_REGEXEC_C) || defined(PERL_IN_UTF8_C)
+PERL_CALLCONV bool     Perl_isFOO_lc(pTHX_ const U8 classnum, const U8 character)
+                       __attribute__warn_unused_result__;
+
+#endif
 #if defined(PERL_IN_SCOPE_C)
 STATIC void    S_save_pushptri32ptr(pTHX_ void *const ptr1, const I32 i, void *const ptr2, const int type);
 STATIC SV*     S_save_scalar_at(pTHX_ SV **sptr, const U32 flags);
@@ -5624,6 +5631,11 @@ PERL_STATIC_INLINE bool  S_is_utf8_common(pTHX_ const U8 *const p, SV **swash, co
 #define PERL_ARGS_ASSERT_IS_UTF8_COMMON        \
        assert(p); assert(swash); assert(swashname)
 
+PERL_STATIC_INLINE bool        S_is_utf8_common_with_len(pTHX_ const U8 *const p, const U8 *const e, SV **swash, const char * const swashname, SV* const invlist)
+                       __attribute__warn_unused_result__;
+#define PERL_ARGS_ASSERT_IS_UTF8_COMMON_WITH_LEN       \
+       assert(p); assert(e); assert(swash); assert(swashname)
+
 PERL_STATIC_INLINE bool        S_is_utf8_cp_above_31_bits(const U8 * const s, const U8 * const e)
                        __attribute__warn_unused_result__;
 #define PERL_ARGS_ASSERT_IS_UTF8_CP_ABOVE_31_BITS      \
@@ -5652,6 +5664,9 @@ STATIC char *     S_unexpected_non_continuation_text(pTHX_ const U8 * const s, STRLE
 #define PERL_ARGS_ASSERT_UNEXPECTED_NON_CONTINUATION_TEXT      \
        assert(s)
 
+STATIC void    S_warn_on_first_deprecated_use(pTHX_ const char * const name, const char * const alternative, const bool use_locale, const char * const file, const unsigned line);
+#define PERL_ARGS_ASSERT_WARN_ON_FIRST_DEPRECATED_USE  \
+       assert(name); assert(alternative); assert(file)
 #endif
 #if defined(PERL_IN_UTF8_C) || defined(PERL_IN_PP_C)
 PERL_CALLCONV UV       Perl__to_upper_title_latin1(pTHX_ const U8 c, U8 *p, STRLEN *lenp, const char S_or_s);
diff --git a/regcomp.c b/regcomp.c
index 095b13f3ea..7578a25dd0 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -8271,17 +8271,18 @@ S_reg_scan_name(pTHX_ RExC_state_t *pRExC_state, U32 flags)
 
     assert (RExC_parse <= RExC_end);
     if (RExC_parse == RExC_end) NOOP;
-    else if (isIDFIRST_lazy_if(RExC_parse, UTF)) {
**** PATCH TRUNCATED AT 2000 LINES -- 1510 NOT SHOWN ****

--
Perl5 Master Repository