[perl.git] branch blead, updated. v5.17.3-261-g1703c1f

Dave Mitchell Sat, 08 Sep 2012 07:52:44 -0700

In perl.git, the branch blead has been updated

<http://perl5.git.perl.org/perl.git/commitdiff/1703c1fc996f9d5943ebada0759bc5212289ae8e?hp=2d2826733b14efb7509c9c0c28d27bca6f31d681>


- Log -----------------------------------------------------------------
commit 1703c1fc996f9d5943ebada0759bc5212289ae8e
Merge: 2d28267 3de645a
Author: David Mitchell <da...@iabyn.com>
Date:   Sat Sep 8 15:42:56 2012 +0100

    [MERGE] only copy bits of regex match string
    
    When making a copy of the string being matched against (so that $1, $&
    et al continue to show the correct value even if the original string is
    subsequently modified), only copy that substring of the original string
    needed for the capture variables, rather than copying the whole string.
    
    This is a big win for code like
    
        $&;
        $_ = 'x' x 1_000_000;
        1 while /(.)/;
    
    Also, when pessimizing if the code contains $`, $& or $', record
    the presence of each variable separately, so that the determination of the
    substring range is based on each variable separately. So performance-wise,
    
       $&; /x/
    
    is now roughly equivalent to
    
       /(x)/
    
    whereas previously it was like
    
       /^(.*)(x)(.*)$/
    
    and
    
       $&; $'; /x/
    
    is now roughly equivalent to
    
       /(x)(.*)$/
    
    etc.
    
    Finally, this code (when not in the presence of $& etc)
    
        $_ = 'x' x 1_000_000;
        1 while /(.)/;
    
    used to skip the buffer copy for performance reasons, but suffered from $1
    etc changing if the original string changed. That's now been fixed too.

commit 3de645a82921698b4886d748e3a5a5ed98752f42
Author: David Mitchell <da...@iabyn.com>
Date:   Fri Sep 7 13:32:11 2012 +0100

    fix a bug in handling $+[0] and unicode
    
    The code to decide what substring of a pattern target to copy for the
    sake of $1, $& etc, would, in the absence of $&, only copy the minimum
    range needed to cover $1,$2,...., which might be a shorter range than
    what $& covers. This is fine most of the time, but, when calculating
    $+[0] on a unicode string, it needs a copy of the whole part of the string
    covered by $&, since it needs to convert the byte offest into a char
    offset.
    So to fix this, always copy as a minimum, the $& range.
    I suppose we could be more clever about this: detect the presence
    of @+ in the code, only do it for UTF8 etc; but this is simple
    and non-fragile.

M       regexec.c
M       t/re/re_tests

commit 6f1854a1fe246f8633ccd4d455563cb4210ceb39
Author: David Mitchell <da...@iabyn.com>
Date:   Sat Sep 1 11:43:53 2012 +0100

    m// and s///; don't copy TEMP/AMAGIC strings
    
    Currently pp_match and pp_subst make a copy of the match string if it's
    SvTEMP(), and in the case of pp_match, also if it's SvAMAGIC().
    
    This is no longer necessary, as the code will always copy the string
    anyway if its actually needed after the match, i.e. if it detects the
    presence of $1, $& or //p etc. Until a few commits ago, this wasn't the
    case for pp_match: it would sometimes skip copying even in the presence of
    $1 et al for efficiency reasons. Now that that's fixed, we can remove the
    SvTEMP() and SvAMAGIC() tests.
    
    As to why pp_subst did the SvTEMP test, I don't know: but removing it
    didn't make any tests fail!

M       pp_hot.c

commit fbfb1899dd79253696b441cc1c4968a1057c2574
Author: David Mitchell <da...@iabyn.com>
Date:   Sat Sep 1 11:23:58 2012 +0100

    tidy up patten match copying code
    
    (no functional changes).
    
    1. Remove some dead code from pp_split; it's protected by an assert
    that it could never be called.
    
    2. Simplify the flags settings for the call to CALLREGEXEC() in
    pp_substcont: on subsequent matches we always set REXEC_NOT_FIRST,
    which forces the regex engine not to copy anyway, so passing the
    REXEC_COPY_STR is pointless, as is the conditional code to set it.
    
    3. (whitespace change): split a conditional expression over 2 lines
    for easier reading.

M       pp.c
M       pp_ctl.c
M       pp_hot.c

commit a41aa44d9dc4a3ba586d871754bd11137bdc37a2
Author: David Mitchell <da...@iabyn.com>
Date:   Fri Aug 24 16:17:47 2012 +0100

    stop $foo =~ /(bar)/g skipping copy
    
    Normally in the presence of captures, a successful regex execution
    makes a copy of the matched string, so that $1 et al give the right
    value even if the original string is changed; i.e.
    
        $foo =~ /(123)/g;
        $foo = "bar";
        is("$1", "123");
    
    Until now that test would fail, because perl used to skip the copy for
    the scalar /(...)/g case (but not the C<$&; //g> case). This was to
    avoid a huge slowdown in code like the following:
    
        $x = 'x' x 1_000_000;
        1 while $x =~ /(.)/g;
    
    which would otherwise end up copying a 1Mb string a million times.
    
    Now that (with the last commit but one) we copy only the required
    substring of the original string (a 1-byte substring in the above
    example), we can remove this fast-but-incorrect hack.

M       pp_hot.c
M       t/re/pat_advanced.t
M       t/re/pat_psycho.t

commit 9414be0160a1f343d4ae75ec161fec610da39c84
Author: David Mitchell <da...@iabyn.com>
Date:   Fri Aug 24 15:49:21 2012 +0100

    rationalise t/re/pat_psycho.t
    
    Do some cleanup of this file, without changing its functionality.
    
    Once upon a time, the psycho tests were scattered throughout a single
    pat.t file, before being moved into their own file. Now that they're all
    in a single file, make the $PERL_SKIP_PSYCHO_TEST test a single "skip_all"
    test at the beginning of the file, rather than testing it separately in
    each code block.
    
    Also, make some of the test descriptions more useful, and add a bit of
    debugging output.

M       t/re/pat_psycho.t

commit 6502e08109cd003b2cdf39bc94ef35e52203240b
Author: David Mitchell <da...@iabyn.com>
Date:   Thu Jul 26 16:04:09 2012 +0100

    Don't copy all of the match string buffer
    
    When a pattern matches, and that pattern contains captures (or $`, $&, $'
    or /p are present), a copy is made of the whole original string, so
    that $1 et al continue to hold the correct value even if the original
    string is subsequently modified. This can have severe performance
    penalties; for example, this code causes a 1Mb buffer to be allocated,
    copied and freed a million times:
    
        $&;
        $x = 'x' x 1_000_000;
        1 while $x =~ /(.)/g;
    
    This commit changes this so that, where possible, only the needed
    substring of the original string is copied: in the above case, only a
    1-byte buffer is copied each time. Also, it now reuses or reallocs the
    buffer, rather than freeing and mallocing each time.
    
    Now that PL_sawampersand is a 3-bit flag indicating separately whether
    $`, $& and $' have been seen, they each contribute only their own
    individual penalty; which ones have been seen will limit the extent to
    which we can avoid copying the whole buffer.
    
    Note that the above code *without* the $& is not currently slow, but only
    because the copying is artificially disabled to avoid the performance hit.
    The next but one commit will remove that hack, meaning that it will still
    be fast, but will now be correct in the presence of a modified original
    string.
    
    We achieve this by by adding suboffset and subcoffset fields to the
    existing subbeg and sublen fields of a regex, to indicate how many bytes
    and characters have been skipped from the logical start of the string till
    the physical start of the buffer. To avoid copying stuff at the end, we
    just reduce sublen. For example, in this:
    
        "abcdefgh" =~ /(c)d/
    
    subbeg points to a malloced buffer containing "c\0"; sublen == 1,
    and suboffset == 2 (as does subcoffset).
    
    while if $& has been seen,
    
    subbeg points to a malloced buffer containing "cd\0"; sublen == 2,
    and suboffset == 2.
    
    If in addition $' has been seen, then
    
    subbeg points to a malloced buffer containing "cdefgh\0"; sublen == 6,
    and suboffset == 2.
    
    The regex engine won't do this by default; there are two new flag bits,
    REXEC_COPY_SKIP_PRE and REXEC_COPY_SKIP_POST, which in conjunction with
    REXEC_COPY_STR, request that the engine skip the start or end of the
    buffer (it will still copy in the presence of the relevant $`, $&, $',
    /p).
    
    Only pp_match has been enhanced to use these extra flags; substitution
    can't easily benefit, since the usual action of s///g is to copy the
    whole string first time round, then perform subsequent matching iterations
    against the copy, without further copying. So you still need to copy most
    of the buffer.

M       dump.c
M       ext/Devel-Peek/t/Peek.t
M       mg.c
M       pod/perlreapi.pod
M       pp.c
M       pp_ctl.c
M       pp_hot.c
M       regcomp.c
M       regexec.c
M       regexp.h
M       t/porting/known_pod_issues.dat
M       t/re/re_tests

commit 2c7b5d7698f52b86acffe19a7ec15e85c99337fe
Author: David Mitchell <da...@iabyn.com>
Date:   Thu Jul 26 15:35:39 2012 +0100

    Separate handling of ${^PREMATCH} from $` etc
    
    Currently the handling of getting the value, length etc of ${^PREMATCH}
    etc is identical to that of $` etc.
    
    Handle them separately, by adding RX_BUFF_IDX_CARET_PREMATCH etc
    constants to the existing RX_BUFF_IDX_PREMATCH set.
    
    This allows, when retrieving them, to always return undef if the current
    match didn't use //p. Previously the result depended on stuff such
    as whether the (non-//p) pattern included captures or not.
    
    The documentation for ${^PREMATCH} etc states that it's only guaranteed to
    return a defined value when the last pattern was //p.
    
    As well as making things more consistent, this is a necessary
    prerequisite for the following commit, which may not always copy the
    whole string during a non-//p match.

M       mg.c
M       regcomp.c
M       regexp.h
M       t/re/reg_pmod.t

commit ac0ba89f3ee4e5469c43dc0a34b548a9aa415f98
Author: David Mitchell <da...@iabyn.com>
Date:   Fri Jun 22 16:26:08 2012 +0100

    regexec_flags(): simplify length calculation
    
    The code to calculate the length of the string to copy was
    
        PL_regeol - startpos + (stringarg - strbeg);
    
    This is a hangover from the original (perl 3) regexp implementation
    that under //i, copied and folded the original buffer: so startpos might
    not equal stringarg. These days it always is (except under a match failure
    with (*COMMIT), and the code we're interested is only executed on success).
    
    So simplify to just PL_regeol - strbeg.

M       regexec.c

commit d3b97530399d61590a1500b52bdba553d657bda5
Author: David Mitchell <da...@iabyn.com>
Date:   Fri Jun 22 12:36:03 2012 +0100

    PL_sawampersand: use 3 bit flags rather than bool
    
    Set a separate flag for each of $`, $& and $'.
    It still works fine in boolean context.
    
    This will allow us to have more refined control over what parts
    of a match string to copy (we currently copy the whole string).

M       gv.c
M       intrpvar.h
M       perl.c
M       perl.h

commit 8fd1a95029bf0ff87a3064dec7d6645f40359f2c
Author: David Mitchell <da...@iabyn.com>
Date:   Wed Jun 20 14:17:05 2012 +0100

    document args to regexec_flags and API
    
    Document in the API, and clarify in the source code, what the arguments
    to Perl_regexec_flags are.
    
    NB: this info is based on code inspection, not any real knowledge on my
    part.

M       pod/perlreapi.pod
M       regexec.c
-----------------------------------------------------------------------

Summary of changes:
 dump.c                         |    4 +
 ext/Devel-Peek/t/Peek.t        |    2 +
 gv.c                           |   31 +++++++--
 intrpvar.h                     |    2 +-
 mg.c                           |   74 +++++++++++-----------
 perl.c                         |    7 +-
 perl.h                         |    6 ++
 pod/perlreapi.pod              |   66 ++++++++++++++++++-
 pp.c                           |   10 +--
 pp_ctl.c                       |   17 +++---
 pp_hot.c                       |   39 ++++++++----
 regcomp.c                      |   83 +++++++++++++++++-------
 regexec.c                      |  137 ++++++++++++++++++++++++++++++++++-----
 regexp.h                       |   34 +++++++++-
 t/porting/known_pod_issues.dat |    2 +-
 t/re/pat_advanced.t            |    1 -
 t/re/pat_psycho.t              |   65 ++++++++++++++++---
 t/re/re_tests                  |    6 ++
 t/re/reg_pmod.t                |   58 +++++++++++-------
 19 files changed, 485 insertions(+), 159 deletions(-)

diff --git a/dump.c b/dump.c
index ada6ae9..46893d6 100644
--- a/dump.c
+++ b/dump.c
@@ -2056,6 +2056,10 @@ Perl_do_sv_dump(pTHX_ I32 level, PerlIO *file, SV *sv, 
I32 nest, I32 maxnest, bo
                                (UV)(r->pre_prefix));
            Perl_dump_indent(aTHX_ level, file, "  SUBLEN = %"IVdf"\n",
                                (IV)(r->sublen));
+           Perl_dump_indent(aTHX_ level, file, "  SUBOFFSET = %"IVdf"\n",
+                               (IV)(r->suboffset));
+           Perl_dump_indent(aTHX_ level, file, "  SUBCOFFSET = %"IVdf"\n",
+                               (IV)(r->subcoffset));
            if (r->subbeg)
                Perl_dump_indent(aTHX_ level, file, "  SUBBEG = 0x%"UVxf" %s\n",
                            PTR2UV(r->subbeg),
diff --git a/ext/Devel-Peek/t/Peek.t b/ext/Devel-Peek/t/Peek.t
index 6913d59..164e2ff 100644
--- a/ext/Devel-Peek/t/Peek.t
+++ b/ext/Devel-Peek/t/Peek.t
@@ -350,6 +350,8 @@ do_test('reference to regexp',
     GOFS = 0
     PRE_PREFIX = 4
     SUBLEN = 0
+    SUBOFFSET = 0
+    SUBCOFFSET = 0
     SUBBEG = 0x0
     ENGINE = $ADDR
     MOTHER_RE = $ADDR
diff --git a/gv.c b/gv.c
index c6e474e..e29f2fd 100644
--- a/gv.c
+++ b/gv.c
@@ -1655,12 +1655,23 @@ Perl_gv_fetchpvn_flags(pTHX_ const char *nambeg, STRLEN 
full_len, I32 flags,
                    require_tie_mod(gv, name, 
newSVpvs("Tie::Hash::NamedCapture"), "TIEHASH", 0);
              }
              if (sv_type==SVt_PV || sv_type==SVt_PVGV) {
-              if (*name == '[')
-               require_tie_mod(gv,name,newSVpvs("arybase"),"FETCH",0);
-              else if (*name == '&' || *name == '`' || *name == '\'') {
-               PL_sawampersand = TRUE;
-               (void)GvSVn(gv);
-              }
+                switch (*name) {
+               case '[':
+                   require_tie_mod(gv,name,newSVpvs("arybase"),"FETCH",0);
+                    break;
+               case '`':
+                   PL_sawampersand |= SAWAMPERSAND_LEFT;
+                    (void)GvSVn(gv);
+                    break;
+               case '&':
+                   PL_sawampersand |= SAWAMPERSAND_MIDDLE;
+                    (void)GvSVn(gv);
+                    break;
+               case '\'':
+                   PL_sawampersand |= SAWAMPERSAND_RIGHT;
+                    (void)GvSVn(gv);
+                    break;
+                }
              }
            }
            else if (len == 3 && sv_type == SVt_PVAV
@@ -1866,7 +1877,13 @@ Perl_gv_fetchpvn_flags(pTHX_ const char *nambeg, STRLEN 
full_len, I32 flags,
                sv_type == SVt_PVCV ||
                sv_type == SVt_PVFM ||
                sv_type == SVt_PVIO
-               )) { PL_sawampersand = TRUE; }
+               )) { PL_sawampersand |=
+                        (*name == '`')
+                            ? SAWAMPERSAND_LEFT
+                            : (*name == '&')
+                                ? SAWAMPERSAND_MIDDLE
+                                : SAWAMPERSAND_RIGHT;
+                }
            goto magicalize;
 
        case ':':               /* $: */
diff --git a/intrpvar.h b/intrpvar.h
index f57fa7d..94b7425 100644
--- a/intrpvar.h
+++ b/intrpvar.h
@@ -292,7 +292,7 @@ The C variable which corresponds to Perl's $^W warning 
variable.
 */
 
 PERLVAR(I, dowarn,     U8)
-PERLVAR(I, sawampersand, bool)         /* must save all match strings */
+PERLVAR(I, sawampersand, U8)           /* must save all match strings */
 PERLVAR(I, unsafe,     bool)
 PERLVAR(I, exit_flags, U8)             /* was exit() unexpected, etc. */
 
diff --git a/mg.c b/mg.c
index 1f6d062..26cabbe 100644
--- a/mg.c
+++ b/mg.c
@@ -637,6 +637,8 @@ Perl_magic_regdata_cnt(pTHX_ SV *sv, MAGIC *mg)
     return (U32)-1;
 }
 
+/* @-, @+ */
+
 int
 Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
 {
@@ -665,7 +667,9 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
                    if (i > 0 && RX_MATCH_UTF8(rx)) {
                        const char * const b = RX_SUBBEG(rx);
                        if (b)
-                           i = utf8_length((U8*)b, (U8*)(b+i));
+                           i = RX_SUBCOFFSET(rx) +
+                                    utf8_length((U8*)b,
+                                        (U8*)(b-RX_SUBOFFSET(rx)+i));
                    }
 
                    sv_setiv(sv, i);
@@ -675,6 +679,8 @@ Perl_magic_regdatum_get(pTHX_ SV *sv, MAGIC *mg)
     return 0;
 }
 
+/* @-, @+ */
+
 int
 Perl_magic_regdatum_set(pTHX_ SV *sv, MAGIC *mg)
 {
@@ -913,9 +919,12 @@ Perl_magic_get(pTHX_ SV *sv, MAGIC *mg)
        if (nextchar == '\0') {       /* ^P */
            sv_setiv(sv, (IV)PL_perldb);
        } else if (strEQ(remaining, "REMATCH")) { /* $^PREMATCH */
-           goto do_prematch_fetch;
+
+            paren = RX_BUFF_IDX_CARET_PREMATCH;
+           goto do_numbuf_fetch;
        } else if (strEQ(remaining, "OSTMATCH")) { /* $^POSTMATCH */
-           goto do_postmatch_fetch;
+            paren = RX_BUFF_IDX_CARET_POSTMATCH;
+           goto do_numbuf_fetch;
        }
        break;
     case '\023':               /* ^S */
@@ -978,55 +987,46 @@ Perl_magic_get(pTHX_ SV *sv, MAGIC *mg)
        break;
     case '\015': /* $^MATCH */
        if (strEQ(remaining, "ATCH")) {
+            paren = RX_BUFF_IDX_CARET_FULLMATCH;
+           goto do_numbuf_fetch;
+        }
+
     case '1': case '2': case '3': case '4':
     case '5': case '6': case '7': case '8': case '9': case '&':
-           if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
-               /*
-                * Pre-threads, this was paren = atoi(GvENAME((const GV 
*)mg->mg_obj));
-                * XXX Does the new way break anything?
-                */
-               paren = atoi(mg->mg_ptr); /* $& is in [0] */
-               CALLREG_NUMBUF_FETCH(rx,paren,sv);
-               break;
-           }
-           sv_setsv(sv,&PL_sv_undef);
-       }
+        /*
+         * Pre-threads, this was paren = atoi(GvENAME((const GV *)mg->mg_obj));
+         * XXX Does the new way break anything?
+         */
+        paren = atoi(mg->mg_ptr); /* $& is in [0] */
+      do_numbuf_fetch:
+        if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
+            CALLREG_NUMBUF_FETCH(rx,paren,sv);
+            break;
+        }
+        sv_setsv(sv,&PL_sv_undef);
        break;
     case '+':
        if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
-           if (RX_LASTPAREN(rx)) {
-               CALLREG_NUMBUF_FETCH(rx,RX_LASTPAREN(rx),sv);
-               break;
-           }
+           paren = RX_LASTPAREN(rx);
+           if (paren)
+                goto do_numbuf_fetch;
        }
        sv_setsv(sv,&PL_sv_undef);
        break;
     case '\016':               /* ^N */
        if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
-           if (RX_LASTCLOSEPAREN(rx)) {
-               CALLREG_NUMBUF_FETCH(rx,RX_LASTCLOSEPAREN(rx),sv);
-               break;
-           }
-
+           paren = RX_LASTCLOSEPAREN(rx);
+           if (paren)
+                goto do_numbuf_fetch;
        }
        sv_setsv(sv,&PL_sv_undef);
        break;
     case '`':
-      do_prematch_fetch:
-       if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
-           CALLREG_NUMBUF_FETCH(rx,-2,sv);
-           break;
-       }
-       sv_setsv(sv,&PL_sv_undef);
-       break;
+        paren = RX_BUFF_IDX_PREMATCH;
+        goto do_numbuf_fetch;
     case '\'':
-      do_postmatch_fetch:
-       if (PL_curpm && (rx = PM_GETRE(PL_curpm))) {
-           CALLREG_NUMBUF_FETCH(rx,-1,sv);
-           break;
-       }
-       sv_setsv(sv,&PL_sv_undef);
-       break;
+        paren = RX_BUFF_IDX_POSTMATCH;
+        goto do_numbuf_fetch;
     case '.':
        if (GvIO(PL_last_in_gv)) {
            sv_setiv(sv, (IV)IoLINES(GvIOp(PL_last_in_gv)));
diff --git a/perl.c b/perl.c
index 8444218..7d65719 100644
--- a/perl.c
+++ b/perl.c
@@ -860,7 +860,7 @@ perl_destruct(pTHXx)
     PL_minus_F      = FALSE;
     PL_doswitches   = FALSE;
     PL_dowarn       = G_WARN_OFF;
-    PL_sawampersand = FALSE;   /* must save all match strings */
+    PL_sawampersand = 0;       /* must save all match strings */
     PL_unsafe       = FALSE;
 
     Safefree(PL_inplace);
@@ -2343,8 +2343,9 @@ STATIC void
 S_run_body(pTHX_ I32 oldscope)
 {
     dVAR;
-    DEBUG_r(PerlIO_printf(Perl_debug_log, "%s $` $& $' support.\n",
-                    PL_sawampersand ? "Enabling" : "Omitting"));
+    DEBUG_r(PerlIO_printf(Perl_debug_log, "%s $` $& $' support (0x%x).\n",
+                    PL_sawampersand ? "Enabling" : "Omitting",
+                    (unsigned int)(PL_sawampersand)));
 
     if (!PL_restartop) {
 #ifdef PERL_MAD
diff --git a/perl.h b/perl.h
index 2cc4e91..b299432 100644
--- a/perl.h
+++ b/perl.h
@@ -4854,6 +4854,12 @@ typedef enum {
 #define HINT_SORT_MERGESORT    0x00000002
 #define HINT_SORT_STABLE       0x00000100 /* sort styles (currently one) */
 
+/* flags for PL_sawampersand */
+
+#define SAWAMPERSAND_LEFT       1   /* saw $` */
+#define SAWAMPERSAND_MIDDLE     2   /* saw $& */
+#define SAWAMPERSAND_RIGHT      4   /* saw $' */
+
 /* Various states of the input record separator SV (rs) */
 #define RsSNARF(sv)   (! SvOK(sv))
 #define RsSIMPLE(sv)  (SvOK(sv) && (! SvPOK(sv) || SvCUR(sv)))
diff --git a/pod/perlreapi.pod b/pod/perlreapi.pod
index 35b6b74..1ccc6d8 100644
--- a/pod/perlreapi.pod
+++ b/pod/perlreapi.pod
@@ -209,7 +209,49 @@ faster than C<unpack>.
              I32 minend, SV* screamer,
              void* data, U32 flags);
 
-Execute a regexp.
+Execute a regexp. The arguments are
+
+=over 4
+
+=item rx
+
+The regular expression to execute.
+
+=item screamer
+
+This strangely-named arg is the SV to be matched against. Note that the
+actual char array to be matched against is supplied by the arguments
+described below; the SV is just used to determine UTF8ness, C<pos()> etc.
+
+=item strbeg
+
+Pointer to the physical start of the string.
+
+=item strend
+
+Pointer to the character following the physical end of the string (i.e.
+the \0).
+
+=item stringarg
+
+Pointer to the position in the string where matching should start; it might
+not be equal to C<strbeg> (for example in a later iteration of C</.../g>).
+
+=item minend
+
+Minimum length of string (measured in bytes from C<stringarg>) that must
+match; if the engine reaches the end of the match but hasn't reached this
+position in the string, it should fail.
+
+=item data
+
+Optimisation data; subject to change.
+
+=item flags
+
+Optimisation flags; subject to change.
+
+=back
 
 =head2 intuit
 
@@ -513,6 +555,8 @@ values.
         char *subbeg;  /* saved or original string so \digit works forever. */
         SV_SAVED_COPY  /* If non-NULL, SV which is COW from original */
         I32 sublen;    /* Length of string pointed by subbeg */
+       I32 suboffset;  /* byte offset of subbeg from logical start of str */
+       I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */
 
         /* Information about the match that isn't often used */
         I32 prelen;           /* length of precomp */
@@ -653,9 +697,23 @@ occur at a floating offset from the start of the pattern. 
Used to do
 Fast-Boyer-Moore searches on the string to find out if its worth using
 the regex engine at all, and if so where in the string to search.
 
-=head2 C<subbeg> C<sublen> C<saved_copy>
-
-Used during execution phase for managing search and replace patterns.
+=head2 C<subbeg> C<sublen> C<saved_copy> C<suboffset> C<subcoffset>
+
+Used during the execution phase for managing search and replace patterns,
+and for providing the text for C<$&>, C<$1> etc. C<subbeg> points to a
+buffer (either the original string, or a copy in the case of
+C<RX_MATCH_COPIED(rx)>), and C<sublen> is the length of the buffer.  The
+C<RX_OFFS> start and end indices index into this buffer.
+
+In the presence of the C<REXEC_COPY_STR> flag, but with the addition of
+the C<REXEC_COPY_SKIP_PRE> or C<REXEC_COPY_SKIP_POST> flags, an engine
+can choose not to copy the full buffer (although it must still do so in
+the presence of C<RXf_PMf_KEEPCOPY> or the relevant bits being set in
+C<PL_sawampersand>). In this case, it may set C<suboffset> to indicate the
+number of bytes from the logical start of the buffer to the physical start
+(i.e. C<subbeg>). It should also set C<subcoffset>, the number of
+characters in the offset. The latter is needed to support C<@-> and C<@+>
+which work in characters, not bytes.
 
 =head2 C<wrapped> C<wraplen>
 
diff --git a/pp.c b/pp.c
index 29db8ed..e1a6c78 100644
--- a/pp.c
+++ b/pp.c
@@ -5549,13 +5549,9 @@ PP(pp_split)
            if (rex_return == 0)
                break;
            TAINT_IF(RX_MATCH_TAINTED(rx));
-           if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) {
-               m = s;
-               s = orig;
-               orig = RX_SUBBEG(rx);
-               s = orig + (m - s);
-               strend = s + (strend - m);
-           }
+            /* we never pass the REXEC_COPY_STR flag, so it should
+             * never get copied */
+            assert(!RX_MATCH_COPIED(rx));
            m = RX_OFFS(rx)[0].start + orig;
 
            if (gimme_scalar) {
diff --git a/pp_ctl.c b/pp_ctl.c
index 1477373..af0d558 100644
--- a/pp_ctl.c
+++ b/pp_ctl.c
@@ -216,9 +216,7 @@ PP(pp_substcont)
        if (CxONCE(cx) || s < orig ||
                !CALLREGEXEC(rx, s, cx->sb_strend, orig,
                             (s == m) + RX_GOFS(rx), cx->sb_targ, NULL,
-                            ((cx->sb_rflags & REXEC_COPY_STR)
-                             ? (REXEC_IGNOREPOS|REXEC_NOT_FIRST)
-                             : 
(REXEC_COPY_STR|REXEC_IGNOREPOS|REXEC_NOT_FIRST))))
+                                (REXEC_IGNOREPOS|REXEC_NOT_FIRST)))
        {
            SV *targ = cx->sb_targ;
 
@@ -289,6 +287,7 @@ PP(pp_substcont)
     if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) {
        m = s;
        s = orig;
+        assert(!RX_SUBOFFSET(rx));
        cx->sb_orig = orig = RX_SUBBEG(rx);
        s = orig + (m - s);
        cx->sb_strend = s + (cx->sb_strend - m);
@@ -353,9 +352,9 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
 
     if (!p || p[1] < RX_NPARENS(rx)) {
 #ifdef PERL_OLD_COPY_ON_WRITE
-       i = 7 + RX_NPARENS(rx) * 2;
+       i = 7 + (RX_NPARENS(rx)+1) * 2;
 #else
-       i = 6 + RX_NPARENS(rx) * 2;
+       i = 6 + (RX_NPARENS(rx)+1) * 2;
 #endif
        if (!p)
            Newx(p, i, UV);
@@ -364,7 +363,7 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
        *rsp = (void*)p;
     }
 
-    *p++ = PTR2UV(RX_MATCH_COPIED(rx) ? RX_SUBBEG(rx) : NULL);
+    *p++ = RX_MATCH_COPIED(rx) ? 1 : 0;
     RX_MATCH_COPIED_off(rx);
 
 #ifdef PERL_OLD_COPY_ON_WRITE
@@ -373,9 +372,10 @@ Perl_rxres_save(pTHX_ void **rsp, REGEXP *rx)
 #endif
 
     *p++ = RX_NPARENS(rx);
-
     *p++ = PTR2UV(RX_SUBBEG(rx));
     *p++ = (UV)RX_SUBLEN(rx);
+    *p++ = (UV)RX_SUBOFFSET(rx);
+    *p++ = (UV)RX_SUBCOFFSET(rx);
     for (i = 0; i <= RX_NPARENS(rx); ++i) {
        *p++ = (UV)RX_OFFS(rx)[i].start;
        *p++ = (UV)RX_OFFS(rx)[i].end;
@@ -403,9 +403,10 @@ S_rxres_restore(pTHX_ void **rsp, REGEXP *rx)
 #endif
 
     RX_NPARENS(rx) = *p++;
-
     RX_SUBBEG(rx) = INT2PTR(char*,*p++);
     RX_SUBLEN(rx) = (I32)(*p++);
+    RX_SUBOFFSET(rx) = (I32)*p++;
+    RX_SUBCOFFSET(rx) = (I32)*p++;
     for (i = 0; i <= RX_NPARENS(rx); ++i) {
        RX_OFFS(rx)[i].start = (I32)(*p++);
        RX_OFFS(rx)[i].end = (I32)(*p++);
diff --git a/pp_hot.c b/pp_hot.c
index 6c3f4f6..0d70dfc 100644
--- a/pp_hot.c
+++ b/pp_hot.c
@@ -1319,15 +1319,18 @@ PP(pp_match)
            }
        }
     }
-    /* XXX: comment out !global get safe $1 vars after a
-       match, BUT be aware that this leads to dramatic slowdowns on
-       /g matches against large strings.  So far a solution to this problem
-       appears to be quite tricky.
-       Test for the unsafe vars are TODO for now. */
-    if (       (!global && RX_NPARENS(rx))
-           || SvTEMP(TARG) || SvAMAGIC(TARG) || PL_sawampersand
-           || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)))
-       r_flags |= REXEC_COPY_STR;
+    if (       RX_NPARENS(rx)
+            || PL_sawampersand
+            || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY))
+    ) {
+       r_flags |= (REXEC_COPY_STR|REXEC_COPY_SKIP_PRE);
+        /* in @a =~ /(.)/g, we iterate multiple times, but copy the buffer
+         * only on the first iteration. Therefore we need to copy $' as well
+         * as $&, to make the rest of the string available for captures in
+         * subsequent iterations */
+        if (! (global && gimme == G_ARRAY))
+            r_flags |= REXEC_COPY_SKIP_POST;
+    };
 
   play_it_again:
     if (global && RX_OFFS(rx)[0].start != -1) {
@@ -1472,6 +1475,8 @@ yup:                                      /* Confirmed by 
INTUIT */
     if (global) {
        /* FIXME - should rx->subbeg be const char *?  */
        RX_SUBBEG(rx) = (char *) truebase;
+       RX_SUBOFFSET(rx) = 0;
+       RX_SUBCOFFSET(rx) = 0;
        RX_OFFS(rx)[0].start = s - truebase;
        if (RX_MATCH_UTF8(rx)) {
            char * const t = (char*)utf8_hop((U8*)s, RX_MINLENRET(rx));
@@ -1507,6 +1512,8 @@ yup:                                      /* Confirmed by 
INTUIT */
 #endif
        }
        RX_SUBLEN(rx) = strend - t;
+       RX_SUBOFFSET(rx) = 0;
+       RX_SUBCOFFSET(rx) = 0;
        RX_MATCH_COPIED_on(rx);
        off = RX_OFFS(rx)[0].start = s - t;
        RX_OFFS(rx)[0].end = off + RX_MINLENRET(rx);
@@ -2127,9 +2134,13 @@ PP(pp_subst)
        pm = PL_curpm;
        rx = PM_GETRE(pm);
     }
-    r_flags = (RX_NPARENS(rx) || SvTEMP(TARG) || PL_sawampersand
-           || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY)) )
-              ? REXEC_COPY_STR : 0;
+
+    r_flags = (    RX_NPARENS(rx)
+                || PL_sawampersand
+                || (RX_EXTFLAGS(rx) & (RXf_EVAL_SEEN|RXf_PMf_KEEPCOPY))
+              )
+          ? REXEC_COPY_STR
+          : 0;
 
     orig = m = s;
     if (RX_EXTFLAGS(rx) & RXf_USE_INTUIT) {
@@ -2203,7 +2214,8 @@ PP(pp_subst)
 #ifdef PERL_OLD_COPY_ON_WRITE
        && !is_cow
 #endif
-       && (I32)clen <= RX_MINLENRET(rx) && (once || !(r_flags & 
REXEC_COPY_STR))
+        && (I32)clen <= RX_MINLENRET(rx)
+        && (once || !(r_flags & REXEC_COPY_STR))
        && !(RX_EXTFLAGS(rx) & RXf_LOOKBEHIND_SEEN)
        && (!doutf8 || SvUTF8(TARG))
        && !(rpm->op_pmflags & PMf_NONDESTRUCT))
@@ -2331,6 +2343,7 @@ PP(pp_subst)
            if (RX_MATCH_COPIED(rx) && RX_SUBBEG(rx) != orig) {
                m = s;
                s = orig;
+                assert(RX_SUBOFFSET(rx) == 0);
                orig = RX_SUBBEG(rx);
                s = orig + (m - s);
                strend = s + (strend - m);
diff --git a/regcomp.c b/regcomp.c
index 921c0e9..a9e92e1 100644
--- a/regcomp.c
+++ b/regcomp.c
@@ -6691,37 +6691,53 @@ Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, 
const I32 paren,
     char *s = NULL;
     I32 i = 0;
     I32 s1, t1;
+    I32 n = paren;
 
     PERL_ARGS_ASSERT_REG_NUMBERED_BUFF_FETCH;
         
-    if (!rx->subbeg) {
-        sv_setsv(sv,&PL_sv_undef);
-        return;
-    } 
-    else               
-    if (paren == RX_BUFF_IDX_PREMATCH && rx->offs[0].start != -1) {
-        /* $` */
+    if ( (    n == RX_BUFF_IDX_CARET_PREMATCH
+           || n == RX_BUFF_IDX_CARET_FULLMATCH
+           || n == RX_BUFF_IDX_CARET_POSTMATCH
+         )
+         && !(rx->extflags & RXf_PMf_KEEPCOPY)
+    )
+        goto ret_undef;
+
+    if (!rx->subbeg)
+        goto ret_undef;
+
+    if (n == RX_BUFF_IDX_CARET_FULLMATCH)
+        /* no need to distinguish between them any more */
+        n = RX_BUFF_IDX_FULLMATCH;
+
+    if ((n == RX_BUFF_IDX_PREMATCH || n == RX_BUFF_IDX_CARET_PREMATCH)
+        && rx->offs[0].start != -1)
+    {
+        /* $`, ${^PREMATCH} */
        i = rx->offs[0].start;
        s = rx->subbeg;
     }
     else 
-    if (paren == RX_BUFF_IDX_POSTMATCH && rx->offs[0].end != -1) {
-        /* $' */
-       s = rx->subbeg + rx->offs[0].end;
-       i = rx->sublen - rx->offs[0].end;
+    if ((n == RX_BUFF_IDX_POSTMATCH || n == RX_BUFF_IDX_CARET_POSTMATCH)
+        && rx->offs[0].end != -1)
+    {
+        /* $', ${^POSTMATCH} */
+       s = rx->subbeg - rx->suboffset + rx->offs[0].end;
+       i = rx->sublen + rx->suboffset - rx->offs[0].end;
     } 
     else
-    if ( 0 <= paren && paren <= (I32)rx->nparens &&
-        (s1 = rx->offs[paren].start) != -1 &&
-        (t1 = rx->offs[paren].end) != -1)
+    if ( 0 <= n && n <= (I32)rx->nparens &&
+        (s1 = rx->offs[n].start) != -1 &&
+        (t1 = rx->offs[n].end) != -1)
     {
-        /* $& $1 ... */
+        /* $&, ${^MATCH},  $1 ... */
         i = t1 - s1;
-        s = rx->subbeg + s1;
+        s = rx->subbeg + s1 - rx->suboffset;
     } else {
-        sv_setsv(sv,&PL_sv_undef);
-        return;
+        goto ret_undef;
     }          
+
+    assert(s >= rx->subbeg);
     assert(rx->sublen >= (s - rx->subbeg) + i );
     if (i >= 0) {
         const int oldtainted = PL_tainted;
@@ -6757,6 +6773,7 @@ Perl_reg_numbered_buff_fetch(pTHX_ REGEXP * const r, 
const I32 paren,
                 SvTAINTED_off(sv);
         }
     } else {
+      ret_undef:
         sv_setsv(sv,&PL_sv_undef);
         return;
     }
@@ -6783,13 +6800,18 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, 
const SV * const sv,
     struct regexp *const rx = (struct regexp *)SvANY(r);
     I32 i;
     I32 s1, t1;
+    I32 n = paren;
 
     PERL_ARGS_ASSERT_REG_NUMBERED_BUFF_LENGTH;
 
     /* Some of this code was originally in C<Perl_magic_len> in F<mg.c> */
-       switch (paren) {
-      /* $` / ${^PREMATCH} */
-      case RX_BUFF_IDX_PREMATCH:
+    switch (paren) {
+      case RX_BUFF_IDX_CARET_PREMATCH: /* ${^PREMATCH} */
+         if (!(rx->extflags & RXf_PMf_KEEPCOPY))
+            goto warn_undef;
+        /*FALLTHROUGH*/
+
+      case RX_BUFF_IDX_PREMATCH:       /* $` */
         if (rx->offs[0].start != -1) {
                        i = rx->offs[0].start;
                        if (i > 0) {
@@ -6799,8 +6821,11 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, 
const SV * const sv,
                        }
            }
         return 0;
-      /* $' / ${^POSTMATCH} */
-      case RX_BUFF_IDX_POSTMATCH:
+
+      case RX_BUFF_IDX_CARET_POSTMATCH: /* ${^POSTMATCH} */
+         if (!(rx->extflags & RXf_PMf_KEEPCOPY))
+            goto warn_undef;
+      case RX_BUFF_IDX_POSTMATCH:       /* $' */
            if (rx->offs[0].end != -1) {
                        i = rx->sublen - rx->offs[0].end;
                        if (i > 0) {
@@ -6810,6 +6835,13 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, 
const SV * const sv,
                        }
            }
         return 0;
+
+      case RX_BUFF_IDX_CARET_FULLMATCH: /* ${^MATCH} */
+         if (!(rx->extflags & RXf_PMf_KEEPCOPY))
+            goto warn_undef;
+        n = RX_BUFF_IDX_FULLMATCH;
+        /*FALLTHROUGH*/
+
       /* $& / ${^MATCH}, $1, $2, ... */
       default:
            if (paren <= (I32)rx->nparens &&
@@ -6819,6 +6851,7 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, 
const SV * const sv,
             i = t1 - s1;
             goto getlen;
         } else {
+          warn_undef:
             if (ckWARN(WARN_UNINITIALIZED))
                 report_uninit((const SV *)sv);
             return 0;
@@ -6826,7 +6859,7 @@ Perl_reg_numbered_buff_length(pTHX_ REGEXP * const r, 
const SV * const sv,
     }
   getlen:
     if (i > 0 && RXp_MATCH_UTF8(rx)) {
-        const char * const s = rx->subbeg + s1;
+        const char * const s = rx->subbeg - rx->suboffset + s1;
         const U8 *ep;
         STRLEN el;
 
@@ -14429,6 +14462,8 @@ Perl_save_re_context(pTHX)
 
     PL_reg_oldsaved = NULL;
     PL_reg_oldsavedlen = 0;
+    PL_reg_oldsavedoffset = 0;
+    PL_reg_oldsavedcoffset = 0;
     PL_reg_maxiter = 0;
     PL_reg_leftiter = 0;
     PL_reg_poscache = NULL;
diff --git a/regexec.c b/regexec.c
index 4c9a456..2dc2314 100644
--- a/regexec.c
+++ b/regexec.c
@@ -502,10 +502,13 @@ S_regcp_restore(pTHX_ regexp *rex, I32 ix)
 I32
 Perl_pregexec(pTHX_ REGEXP * const prog, char* stringarg, register char 
*strend,
         char *strbeg, I32 minend, SV *screamer, U32 nosave)
-/* strend: pointer to null at end of string */
-/* strbeg: real beginning of string */
-/* minend: end of match must be >=minend after stringarg. */
-/* nosave: For optimizations. */
+/* stringarg: the point in the string at which to begin matching */
+/* strend:    pointer to null at end of string */
+/* strbeg:    real beginning of string */
+/* minend:    end of match must be >= minend bytes after stringarg. */
+/* screamer:  SV being matched: only used for utf8 flag, pos() etc; string
+ *            itself is accessed via the pointers above */
+/* nosave:    For optimizations. */
 {
     PERL_ARGS_ASSERT_PREGEXEC;
 
@@ -2051,13 +2054,17 @@ S_find_byclass(pTHX_ regexp * prog, const regnode *c, 
char *s,
 I32
 Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, register char 
*strend,
              char *strbeg, I32 minend, SV *sv, void *data, U32 flags)
-/* strend: pointer to null at end of string */
-/* strbeg: real beginning of string */
-/* minend: end of match must be >=minend after stringarg. */
-/* data: May be used for some additional optimizations. 
-         Currently its only used, with a U32 cast, for transmitting 
-         the ganch offset when doing a /g match. This will change */
-/* nosave: For optimizations. */
+/* stringarg: the point in the string at which to begin matching */
+/* strend:    pointer to null at end of string */
+/* strbeg:    real beginning of string */
+/* minend:    end of match must be >= minend bytes after stringarg. */
+/* sv:        SV being matched: only used for utf8 flag, pos() etc; string
+ *            itself is accessed via the pointers above */
+/* data:      May be used for some additional optimizations.
+              Currently its only used, with a U32 cast, for transmitting
+              the ganch offset when doing a /g match. This will change */
+/* nosave:    For optimizations. */
+
 {
     dVAR;
     struct regexp *const prog = (struct regexp *)SvANY(rx);
@@ -2559,9 +2566,7 @@ got_it:
 
     /* make sure $`, $&, $', and $digit will work later */
     if ( !(flags & REXEC_NOT_FIRST) ) {
-       RX_MATCH_COPY_FREE(rx);
        if (flags & REXEC_COPY_STR) {
-           const I32 i = PL_regeol - startpos + (stringarg - strbeg);
 #ifdef PERL_OLD_COPY_ON_WRITE
            if ((SvIsCOW(sv)
                 || (SvFLAGS(sv) & CAN_COW_MASK) == CAN_COW_FLAGS)) {
@@ -2573,17 +2578,105 @@ got_it:
                prog->saved_copy = sv_setsv_cow(prog->saved_copy, sv);
                prog->subbeg = (char *)SvPVX_const(prog->saved_copy);
                assert (SvPOKp(prog->saved_copy));
+                prog->sublen  = PL_regeol - strbeg;
+                prog->suboffset = 0;
+                prog->subcoffset = 0;
            } else
 #endif
            {
-               RX_MATCH_COPIED_on(rx);
-               s = savepvn(strbeg, i);
-               prog->subbeg = s;
-           }
-           prog->sublen = i;
+                I32 min = 0;
+                I32 max = PL_regeol - strbeg;
+                I32 sublen;
+
+                if (    (flags & REXEC_COPY_SKIP_POST)
+                    && !(RX_EXTFLAGS(rx) & RXf_PMf_KEEPCOPY) /* //p */
+                    && !(PL_sawampersand & SAWAMPERSAND_RIGHT)
+                ) { /* don't copy $' part of string */
+                    U32 n = 0;
+                    max = -1;
+                    /* calculate the right-most part of the string covered
+                     * by a capture. Due to look-ahead, this may be to
+                     * the right of $&, so we have to scan all captures */
+                    while (n <= prog->lastparen) {
+                        if (prog->offs[n].end > max)
+                            max = prog->offs[n].end;
+                        n++;
+                    }
+                    if (max == -1)
+                        max = (PL_sawampersand & SAWAMPERSAND_LEFT)
+                                ? prog->offs[0].start
+                                : 0;
+                    assert(max >= 0 && max <= PL_regeol - strbeg);
+                }
+
+                if (    (flags & REXEC_COPY_SKIP_PRE)
+                    && !(RX_EXTFLAGS(rx) & RXf_PMf_KEEPCOPY) /* //p */
+                    && !(PL_sawampersand & SAWAMPERSAND_LEFT)
+                ) { /* don't copy $` part of string */
+                    U32 n = 0;
+                    min = max;
+                    /* calculate the left-most part of the string covered
+                     * by a capture. Due to look-behind, this may be to
+                     * the left of $&, so we have to scan all captures */
+                    while (min && n <= prog->lastparen) {
+                        if (   prog->offs[n].start != -1
+                            && prog->offs[n].start < min)
+                        {
+                            min = prog->offs[n].start;
+                        }
+                        n++;
+                    }
+                    if ((PL_sawampersand & SAWAMPERSAND_RIGHT)
+                        && min >  prog->offs[0].end
+                    )
+                        min = prog->offs[0].end;
+
+                }
+
+                assert(min >= 0 && min <= max && min <= PL_regeol - strbeg);
+                sublen = max - min;
+
+                if (RX_MATCH_COPIED(rx)) {
+                    if (sublen > prog->sublen)
+                        prog->subbeg =
+                                (char*)saferealloc(prog->subbeg, sublen+1);
+                }
+                else
+                    prog->subbeg = (char*)safemalloc(sublen+1);
+                Copy(strbeg + min, prog->subbeg, sublen, char);
+                prog->subbeg[sublen] = '\0';
+                prog->suboffset = min;
+                prog->sublen = sublen;
+           }
+            RX_MATCH_COPIED_on(rx);
+            prog->subcoffset = prog->suboffset;
+            if (prog->suboffset && utf8_target) {
+                /* Convert byte offset to chars.
+                 * XXX ideally should only compute this if @-/@+
+                 * has been seen, a la PL_sawampersand ??? */
+
+                /* If there's a direct correspondence between the
+                 * string which we're matching and the original SV,
+                 * then we can use the utf8 len cache associated with
+                 * the SV. In particular, it means that under //g,
+                 * sv_pos_b2u() will use the previously cached
+                 * position to speed up working out the new length of
+                 * subcoffset, rather than counting from the start of
+                 * the string each time. This stops
+                 *   $x = "\x{100}" x 1E6; 1 while $x =~ /(.)/g;
+                 * from going quadratic */
+                if (SvPOKp(sv) && SvPVX(sv) == strbeg)
+                    sv_pos_b2u(sv, &(prog->subcoffset));
+                else
+                    prog->subcoffset = utf8_length((U8*)strbeg,
+                                        (U8*)(strbeg+prog->suboffset));
+            }
        }
        else {
+            RX_MATCH_COPY_FREE(rx);
            prog->subbeg = strbeg;
+           prog->suboffset = 0;
+           prog->subcoffset = 0;
            prog->sublen = PL_regeol - strbeg;  /* strend may have been 
modified */
        }
     }
@@ -2688,6 +2781,8 @@ S_regtry(pTHX_ regmatch_info *reginfo, char **startpos)
                $` inside (?{}) could fail... */
            PL_reg_oldsaved = prog->subbeg;
            PL_reg_oldsavedlen = prog->sublen;
+           PL_reg_oldsavedoffset = prog->suboffset;
+           PL_reg_oldsavedcoffset = prog->suboffset;
 #ifdef PERL_OLD_COPY_ON_WRITE
            PL_nrs = prog->saved_copy;
 #endif
@@ -2696,6 +2791,8 @@ S_regtry(pTHX_ regmatch_info *reginfo, char **startpos)
        else
            PL_reg_oldsaved = NULL;
        prog->subbeg = PL_bostr;
+       prog->suboffset = 0;
+       prog->subcoffset = 0;
        prog->sublen = PL_regeol - PL_bostr; /* strend may have been modified */
     }
 #ifdef DEBUGGING
@@ -4528,6 +4625,8 @@ S_regmatch(pTHX_ regmatch_info *reginfo, regnode *prog)
                 RXp_MATCH_COPIED_off(re);
                 re->subbeg = rex->subbeg;
                 re->sublen = rex->sublen;
+                re->suboffset = rex->suboffset;
+                re->subcoffset = rex->subcoffset;
                rei = RXi_GET(re);
                 DEBUG_EXECUTE_r(
                     debug_start_match(re_sv, utf8_target, locinput, PL_regeol,
@@ -7160,6 +7259,8 @@ restore_pos(pTHX_ void *arg)
        if (PL_reg_oldsaved) {
            rex->subbeg = PL_reg_oldsaved;
            rex->sublen = PL_reg_oldsavedlen;
+           rex->suboffset = PL_reg_oldsavedoffset;
+           rex->subcoffset = PL_reg_oldsavedcoffset;
 #ifdef PERL_OLD_COPY_ON_WRITE
            rex->saved_copy = PL_nrs;
 #endif
diff --git a/regexp.h b/regexp.h
index db36edd..3e245d0 100644
--- a/regexp.h
+++ b/regexp.h
@@ -124,6 +124,8 @@ struct reg_code_block {
        char *subbeg;                                                   \
        SV_SAVED_COPY   /* If non-NULL, SV which is COW from original */\
        I32 sublen;     /* Length of string pointed by subbeg */        \
+       I32 suboffset;  /* byte offset of subbeg from logical start of str */ \
+       I32 subcoffset; /* suboffset equiv, but in chars (for @-/@+) */ \
        /* Information about the match that isn't often used */         \
        /* offset from wrapped to the start of precomp */               \
        PERL_BITFIELD32 pre_prefix:4;                                   \
@@ -181,9 +183,12 @@ typedef struct regexp_engine {
   paren name. >= 1 is reserved for actual numbered captures, i.e. $1,
   $2 etc.
 */
-#define RX_BUFF_IDX_PREMATCH  -2 /* $` / ${^PREMATCH}  */
-#define RX_BUFF_IDX_POSTMATCH -1 /* $' / ${^POSTMATCH} */
-#define RX_BUFF_IDX_FULLMATCH      0 /* $& / ${^MATCH}     */
+#define RX_BUFF_IDX_CARET_PREMATCH  -5 /* ${^PREMATCH}  */
+#define RX_BUFF_IDX_CARET_POSTMATCH -4 /* ${^POSTMATCH} */
+#define RX_BUFF_IDX_CARET_FULLMATCH -3 /* ${^MATCH}     */
+#define RX_BUFF_IDX_PREMATCH        -2 /* $` */
+#define RX_BUFF_IDX_POSTMATCH       -1 /* $' */
+#define RX_BUFF_IDX_FULLMATCH        0 /* $& */
 
 /*
   Flags that are passed to the named_buff and named_buff_iter
@@ -474,6 +479,18 @@ get_regex_charset_name(const U32 flags, STRLEN* const lenp)
        assert(SvTYPE(_rx_subbeg) == SVt_REGEXP);                       \
        &SvANY(_rx_subbeg)->subbeg;                                     \
     }))
+#  define RX_SUBOFFSET(prog)                                           \
+    (*({                                                               \
+       const REGEXP *const _rx_suboffset = (prog);                     \
+       assert(SvTYPE(_rx_suboffset) == SVt_REGEXP);                    \
+       &SvANY(_rx_suboffset)->suboffset;                               \
+    }))
+#  define RX_SUBCOFFSET(prog)                                          \
+    (*({                                                               \
+       const REGEXP *const _rx_subcoffset = (prog);                    \
+       assert(SvTYPE(_rx_subcoffset) == SVt_REGEXP);                   \
+       &SvANY(_rx_subcoffset)->subcoffset;                             \
+    }))
 #  define RX_OFFS(prog)                                                        
\
     (*({                                                               \
        const REGEXP *const _rx_offs = (prog);                          \
@@ -490,6 +507,8 @@ get_regex_charset_name(const U32 flags, STRLEN* const lenp)
 #  define RX_EXTFLAGS(prog)    RXp_EXTFLAGS((struct regexp *)SvANY(prog))
 #  define RX_ENGINE(prog)      (((struct regexp *)SvANY(prog))->engine)
 #  define RX_SUBBEG(prog)      (((struct regexp *)SvANY(prog))->subbeg)
+#  define RX_SUBOFFSET(prog)   (((struct regexp *)SvANY(prog))->suboffset)
+#  define RX_SUBCOFFSET(prog)  (((struct regexp *)SvANY(prog))->subcoffset)
 #  define RX_OFFS(prog)                (((struct regexp *)SvANY(prog))->offs)
 #  define RX_NPARENS(prog)     (((struct regexp *)SvANY(prog))->nparens)
 #endif
@@ -538,6 +557,11 @@ get_regex_charset_name(const U32 flags, STRLEN* const lenp)
 #define REXEC_SCREAM   0x04            /* use scream table. */
 #define REXEC_IGNOREPOS        0x08            /* \G matches at start. */
 #define REXEC_NOT_FIRST        0x10            /* This is another iteration of 
//g. */
+                                    /* under REXEC_COPY_STR, it's ok for the
+                                     * engine (modulo PL_sawamperand etc)
+                                     * to skip copying ... */
+#define REXEC_COPY_SKIP_PRE  0x20   /* ...the $` part of the string, or */
+#define REXEC_COPY_SKIP_POST 0x40   /* ...the $' part of the string */
 
 #if defined(__GNUC__) && !defined(PERL_GCC_BRACE_GROUPS_FORBIDDEN)
 #  define ReREFCNT_inc(re)                                             \
@@ -760,6 +784,8 @@ typedef struct regmatch_slab {
 #define PL_reg_curpm           PL_reg_state.re_state_reg_curpm
 #define PL_reg_oldsaved                PL_reg_state.re_state_reg_oldsaved
 #define PL_reg_oldsavedlen     PL_reg_state.re_state_reg_oldsavedlen
+#define PL_reg_oldsavedoffset  PL_reg_state.re_state_reg_oldsavedoffset
+#define PL_reg_oldsavedcoffset PL_reg_state.re_state_reg_oldsavedcoffset
 #define PL_reg_maxiter         PL_reg_state.re_state_reg_maxiter
 #define PL_reg_leftiter                PL_reg_state.re_state_reg_leftiter
 #define PL_reg_poscache                PL_reg_state.re_state_reg_poscache
@@ -781,6 +807,8 @@ struct re_save_state {
     PMOP *re_state_reg_curpm;          /* from regexec.c */
     char *re_state_reg_oldsaved;       /* old saved substr during match */
     STRLEN re_state_reg_oldsavedlen;   /* old length of saved substr during 
match */
+    STRLEN re_state_reg_oldsavedoffset;        /* old offset of saved substr 
during match */
+    STRLEN re_state_reg_oldsavedcoffset;/* old coffset of saved substr during 
match */
     STRLEN re_state_reg_poscache_size; /* size of pos cache of WHILEM */
     I32 re_state_reg_oldpos;           /* from regexec.c */
     I32 re_state_reg_maxiter;          /* max wait until caching pos */
diff --git a/t/porting/known_pod_issues.dat b/t/porting/known_pod_issues.dat
index f316fa7..ba4ccf6 100644
--- a/t/porting/known_pod_issues.dat
+++ b/t/porting/known_pod_issues.dat
@@ -266,7 +266,7 @@ pod/perlpacktut.pod Verbatim line length including indents 
exceeds 79 by    6
 pod/perlperf.pod       Verbatim line length including indents exceeds 79 by    
154
 pod/perlpodspec.pod    Verbatim line length including indents exceeds 79 by    
9
 pod/perlpodstyle.pod   Verbatim line length including indents exceeds 79 by    
1
-pod/perlreapi.pod      Verbatim line length including indents exceeds 79 by    
17
+pod/perlreapi.pod      Verbatim line length including indents exceeds 79 by    
18
 pod/perlrebackslash.pod        Verbatim line length including indents exceeds 
79 by    1
 pod/perlref.pod        Verbatim line length including indents exceeds 79 by    
1
 pod/perlreguts.pod     Verbatim line length including indents exceeds 79 by    
17
diff --git a/t/re/pat_advanced.t b/t/re/pat_advanced.t
index 6692e1c..05cc191 100644
--- a/t/re/pat_advanced.t
+++ b/t/re/pat_advanced.t
@@ -1660,7 +1660,6 @@ $x='123';
 print ">$1<\n";
 EOP
 
-        local $::TODO = 'RT #86042';
         fresh_perl_is(<<'EOP', ">abc<\n", {}, 'no mention of $&');
 my $x; 
 ($x='abc')=~/(abc)/g; 
diff --git a/t/re/pat_psycho.t b/t/re/pat_psycho.t
index c5073a5..0433760 100644
--- a/t/re/pat_psycho.t
+++ b/t/re/pat_psycho.t
@@ -3,6 +3,9 @@
 # This is a home for regular expression tests that don't fit into
 # the format supported by re/regexp.t.  If you want to add a test
 # that does fit that format, add it to re/re_tests, not here.
+#
+# this file includes test that my burn a lot of CPU or otherwise be heavy
+# on resources. Set env var $PERL_SKIP_PSYCHO_TEST to skip this file
 
 use strict;
 use warnings;
@@ -21,7 +24,8 @@ BEGIN {
 }
 
 
-plan tests => 11;  # Update this when adding/deleting tests.
+skip_all('$PERL_SKIP_PSYCHO_TEST set') if $ENV{PERL_SKIP_PSYCHO_TEST};
+plan tests => 15;  # Update this when adding/deleting tests.
 
 run_tests() unless caller;
 
@@ -29,16 +33,17 @@ run_tests() unless caller;
 # Tests start here.
 #
 sub run_tests {
+    print "# Set PERL_SKIP_PSYCHO_TEST to skip these tests\n";
 
-  SKIP:
     {
-        print "# Set PERL_SKIP_PSYCHO_TEST to skip this test\n";
-        my @normal = qw [the are some normal words];
 
-        skip "Skipped Psycho", 2 * @normal if $ENV {PERL_SKIP_PSYCHO_TEST};
+       # stress test tries
+
+        my @normal = qw [the are some normal words];
 
         local $" = "|";
 
+       note "setting up trie psycho vars ...";
         my @psycho = (@normal, map chr $_, 255 .. 20000);
         my $psycho1 = "@psycho";
         for (my $i = @psycho; -- $i;) {
@@ -48,13 +53,12 @@ sub run_tests {
         my $psycho2 = "@psycho";
 
         foreach my $word (@normal) {
-            ok $word =~ /($psycho1)/ && $1 eq $word, 'Psycho';
-            ok $word =~ /($psycho2)/ && $1 eq $word, 'Psycho';
+            ok $word =~ /($psycho1)/ && $1 eq $word, qq{"$word" =~ 
/\$psycho1/};
+            ok $word =~ /($psycho2)/ && $1 eq $word, qq{"$word" =~ 
/\$psycho1/};
         }
     }
 
 
-  SKIP:
     {
         # stress test CURLYX/WHILEM.
         #
@@ -63,8 +67,6 @@ sub run_tests {
         # CURLYX and WHILEM blocks, except those related to LONGJMP, the
         # super-linear cache and warnings. It executes about 0.5M regexes
 
-        skip "No psycho tests" if $ENV {PERL_SKIP_PSYCHO_TEST};
-        print "# Set PERL_SKIP_PSYCHO_TEST to skip this test\n";
         my $r = qr/^
                     (?:
                         ( (?:a|z+)+ )
@@ -158,6 +160,49 @@ sub run_tests {
         }
         ok($ok, $msg);
     }
+
+
+    {
+       # these bits of test code used to run quadratically. If we break
+       # anything, they'll start to take minutes to run, rather than
+       # seconds. We don't actually measure times or set alarms, since
+       # that tends to be very fragile and prone to false positives.
+       # Instead, just hope that if someone is messing with
+       # performance-related code, they'll re-run the test suite and
+       # notice it suddenly takes a lot longer.
+
+       my $x;
+
+       $x = 'x' x 1_000_000;
+       1 while $x =~ /(.)/g;
+       pass "ascii =~ /(.)/";
+
+       {
+           local ${^UTF8CACHE} = 1; # defeat debugging
+           $x = "\x{100}" x 1_000_000;
+           1 while $x =~ /(.)/g;
+           pass "utf8 =~ /(.)/";
+       }
+
+       # run these in separate processes, since they set $&
+
+        fresh_perl_is(<<'EOF', "ok\n", {}, 'ascii =~ /(.)/, mention $&');
+$&;
+$x = 'x' x 1_000_000;
+1 while $x =~ /(.)/g;
+print "ok\n";
+EOF
+
+        fresh_perl_is(<<'EOF', "ok\n", {}, 'utf8 =~ /(.)/, mention $&');
+$&;
+local ${^UTF8CACHE} = 1; # defeat debugging
+$x = "\x{100}" x 1_000_000;
+1 while $x =~ /(.)/g;
+print "ok\n";
+EOF
+
+
+    }
 } # End of sub run_tests
 
 1;
diff --git a/t/re/re_tests b/t/re/re_tests
index f44bdc1..1aebbe6 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -163,6 +163,7 @@ ab|cd       abcd    y       $&      ab
 ()ef   def     y       $&-$1   ef-
 ()ef   def     y       $-[0]   1
 ()ef   def     y       $+[0]   3
+()\x{100}\x{1000}      d\x{100}\x{1000}        y       $+[0]   3
 ()ef   def     y       $-[1]   1
 ()ef   def     y       $+[1]   1
 *a     -       c       -       Quantifier follows nothing
@@ -1702,5 +1703,10 @@ ab[c\\\](??{"x"})]{3}d   ab\\](d y       -       -
 \W     \x{200D}        n       -       -
 
 /^(?d:\xdf|_)*_/i      \x{17f}\x{17f}_ y       $&      \x{17f}\x{17f}_
+#
+# check that @-, @+ count chars, not bytes; especially if beginning of
+# string is not copied
+
+(\x{100})      \x{2000}\x{2000}\x{2000}\x{100} y       $-[0]:$-[1]:$+[0]:$+[1] 
3:3:4:4
 
 # vim: softtabstop=0 noexpandtab
diff --git a/t/re/reg_pmod.t b/t/re/reg_pmod.t
index 301aeef..3190e03 100644
--- a/t/re/reg_pmod.t
+++ b/t/re/reg_pmod.t
@@ -11,38 +11,52 @@ use warnings;
 
 our @tests = (
     # /p      Pattern   PRE     MATCH   POST
-    [ '/p',   "456",    "123-", "456",  "-789"],
-    [ '(?p)', "456",    "123-", "456",  "-789"],
-    [ '',     "(456)",  "123-", "456",  "-789"],
-    [ '',     "456",    undef,  undef,  undef ],
+    [ '/p',   "345",    "12-", "345",  "-6789"],
+    [ '(?p)', "345",    "12-", "345",  "-6789"],
+    [ '',     "(345)",  undef,  undef,  undef ],
+    [ '',     "345",    undef,  undef,  undef ],
 );
 
-plan tests => 4 * @tests + 2;
+plan tests => 14 * @tests + 2;
 my $W = "";
 
 $SIG{__WARN__} = sub { $W.=join("",@_); };
 sub _u($$) { "$_[0] is ".(defined $_[1] ? "'$_[1]'" : "undef") }
 
-$_ = '123-456-789';
 foreach my $test (@tests) {
     my ($p, $pat,$l,$m,$r) = @$test;
-    my $test_name = $p eq '/p'   ? "/$pat/p"
-                  : $p eq '(?p)' ? "/(?p)$pat/"
-                  :                "/$pat/";
+    for my $sub (0,1) {
+       my $test_name = $p eq '/p'   ? "/$pat/p"
+                     : $p eq '(?p)' ? "/(?p)$pat/"
+                     :                "/$pat/";
+       $test_name = "s$test_name" if $sub;
 
-    #
-    # Cannot use if/else due to the scope invalidating ${^MATCH} and friends.
-    #
-    my $ok = ok $p eq '/p'   ? /$pat/p
-              : $p eq '(?p)' ? /(?p)$pat/
-              :                /$pat/
-              => $test_name;
-    SKIP: {
-        skip "/$pat/$p failed to match", 3
-            unless $ok;
-        is(${^PREMATCH},  $l,_u "$test_name: ^PREMATCH",$l);
-        is(${^MATCH},     $m,_u "$test_name: ^MATCH",$m );
-        is(${^POSTMATCH}, $r,_u "$test_name: ^POSTMATCH",$r );
+       #
+       # Cannot use if/else due to the scope invalidating ${^MATCH} and 
friends.
+       #
+       $_ = '12-345-6789';
+       my $ok =
+               $sub ?
+                       (   $p eq '/p'   ? s/$pat/abc/p
+                         : $p eq '(?p)' ? s/(?p)$pat/abc/
+                         :                s/$pat/abc/
+                       )
+                    :
+                       (   $p eq '/p'   ? /$pat/p
+                         : $p eq '(?p)' ? /(?p)$pat/
+                         :                /$pat/
+                       );
+       ok $ok, $test_name;
+       SKIP: {
+           skip "/$pat/$p failed to match", 6
+               unless $ok;
+           is(${^PREMATCH},  $l,_u "$test_name: ^PREMATCH",$l);
+           is(${^MATCH},     $m,_u "$test_name: ^MATCH",$m );
+           is(${^POSTMATCH}, $r,_u "$test_name: ^POSTMATCH",$r );
+           is(length ${^PREMATCH}, length $l, "$test_name: ^PREMATCH length");
+           is(length ${^MATCH},    length $m, "$test_name: ^MATCH length");
+           is(length ${^POSTMATCH},length $r, "$test_name: ^POSTMATCH length");
+       }
     }
 }
 is($W,"","No warnings should be produced");

--
Perl5 Master Repository

[perl.git] branch blead, updated. v5.17.3-261-g1703c1f

Reply via email to