demerphq <[EMAIL PROTECTED]> wrote:
:+ make_trie(startbranch,first,last,tail,flags)
[...]
:+ *args*
:+ startbranch: the first branch in the whole branch sequence
:+ first      : start branch of sequence of branch-exact nodes.
:+              May be the same as startbranch
:+ last       : Thing following the last branch.
:+              May be the same as tail.
:+ tail       : item following the branch sequence
:+ flags      : Currently unused.
:+
:+ IE: the regexp /(ac|ad|ab)+/ will produce the folowing output:
:+
:+   1: CURLYM[1] {1,32767}(18)
:+   5:   BRANCH(8)
:+   6:     EXACT <ac>(16)
:+   8:   BRANCH(11)
:+   9:     EXACT <ad>(16)
:+  11:   BRANCH(14)
:+  12:     EXACT <ab>(16)
:+  16:   SUCCEED(0)
:+  17:   NOTHING(18)
:+  18: END(0)
:+
:+ This would be optimizable with startbranch=5, first=5, last=11, tail=16

'last=11' in the example args seems to contradict "may be the same as tail"
in the description above: is 'last' the last BRANCH/EXACT pair to consume,
or the node following that pair? (As far as I can see from the code it's
the latter, and the example should say "last=16".)

:+#ifdef DEBUGGING
:+    AV *revcharmap=newAV();
:+    AV *words=NULL;
:+#endif
:+    DEBUG_r(
:+        words=newAV();
:+    );

As far as I can see the revcharmap initialisation can also be moved inside
the DEBUG_r(), as long as the later SvREFCNT_dec((SV*)revcharmap) is made
similarly conditional.

:+                       uvc = utf8n_to_uvuni(scan, UTF8_MAXLEN, &len, 
uniflags);

Am I right to think that this may die() if uniflags says to? If so, the
malloced spaces (charmap, revcharmap, words) will be leaked in the same
way as discussed for the other part of the patch. Even if dying isn't
intended, I believe utf8n_to_uvuni() may have to go run a load of perl
code, which might die for some reason.

:+    Newz(8482,trans,(charcount+1)*uniquecharcount+1,reg_trie_trans);
:+    Newz(8482,states,charcount+2,reg_trie_state);
[and similar]

Oh, I just noticed this: the first parameter is intended to be a unique
index, to help tracking down memory leaks and similar problems.

Hmm, ((charcount + 1) * uniquecharcount +1) * sizeof(reg_trie_trans)?
Isn't that in danger of ballooning out of all proportion on pathological
data? As in:
  my $count = shift;
  my $str = join '', map chr($_), 0 .. $count;
  /x|$str/;

sizeof(reg_trie_trans) is 8, so that's going to be at least 8MB on 1k unique
characters, or 32GB on 64k unique characters - quadratic memory growth is
scary, and there should probably be a cutoff here that falls back to not
bothering to construct the trie.

(I don't understand the underlying mechanisms here, but that count seems
strange to me - surely we don't need more than around (charcount + 1)
transitions?)

Hmm, when I tried to test this out I noticed that you don't pick up an
EXACT string long enough to split:
  ./perl -Dr -wle '$s="x" x 256; /x|$s/'
nor on patterns large enough to invoke LONGJMP:
  ./perl -Dr -wle '$s=join"|",map chr,256..256+shift;/$s/' 25000 2>&1 | less
.. but slightly smaller patterns manage a panic if I'm lucky:
  zen% ./perl -wle '$s=join"|",map chr,256..256+shift;/$s/' 20000
  panic: malloc at -e line 1.
  zen% 

:+            if (uvc<256) {
:+                charid=charmap[uvc];
:+            } else if(widecharmap) {
:+                SV** svpp=NULL;
:+                svpp=hv_fetch(widecharmap, (char*)&uvc, sizeof(UV), 0);
:+                if (!svpp) {
:+                    charid=0;
:+                } else {
:+                    charid=(U16)SvIV(*svpp);
:+                }
:+            }
:+            if (charid) {
[...]
:+            } else {
:+                Perl_croak(aTHX_ "Internal error in trie construction, no 
char mapping for word?");
:+            }

Since you bother to check and croak, I'd suggest either coping with the
(uvc >= 256 && !widecharmap) case by setting charid to 0, or not
bothering with the 'if(widecharmap)' test: at this point a SEGV is likely
to be as helpful as the croak(), if not more so. Oh yes, I said earlier:

:regcomp.c:910: warning: `charid' might be used uninitialized in this function
[...]
:it took quite an effort to determine that for all but the 'charid' case

And of course (uvc >= 256 && !widecharmap) is precisely when charid *would*
be used uninitialised, silly me.

(Oh, and setting svpp to NULL just before assigning to it isn't needed. :)

:+            /* Its a dupe. So ignore it.
:+
:+            Warn here or not? [...]
:+           */

I think the docs might be a better place for this commentary, it seems
out of place here.

:-S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, I32 *deltap, 
regnode *last, scan_data_t *data, U32 flags)
:+S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, I32 *deltap, 
regnode *last, scan_data_t *data, U32 flags, U32 depth)

Hmm, that extra argument is used only if you've compiled perl with TRIE_DEBUG
(for which you have to patch the source), and are running with -Dr - that
seems like an unreasonable price for everyone else to pay.

:+                   Hypthetically when we know the regex isnt anchored we can
:+                   turn a case 1 into a DFA and let it rip...

I'm guessing this is true only if the (complete) branchset we're replacing
is also the complete pattern.

:+                            
PerlIO_printf(Perl_debug_log,"0x%08X,0x%08X,0x%08X)\n",
:+                               first,last,cur);

Here, and elsewhere, something like "0x%p,0x%p,0x%p" would be likely to cope
better with (for example) 64-bit pointers.

:+                        if ( (optype ? OP(noper) == optype
:+                                     : PL_regkind[(U8)OP(noper)] == EXACT)
:+                              && noper_next==tail )
:+                        {
:+                            if (!first) {
:+                                first=cur;
:+                                optype=OP(noper);

Hmm, the test of 'optype' seems to be trying to do the same as the test
of 'first' is: I'd suggest testing 'first' in both cases, which also
avoids the problem that you don't clear optype after calling make_trie().

:-       PerlIO_printf(Perl_debug_log, "%sCompiling REx%s `%s%*s%s'\n",
:+       PerlIO_printf(Perl_debug_log, "%sCompiling REx%s \"%s%*s%s\"\n",
[...]
:-                     "%sFreeing REx:%s `%s%*.*s%s%s'\n",
:+                     "%sFreeing REx:%s %s%*.*s%s%s\n",

Hmm, I hope you checked no-one is trying to parse that output ...

:+            case 't':
:+                    {
:+                        reg_trie_data *trie=(reg_trie_data*)r->data->data[n];
:+                        OP_REFCNT_LOCK;
:+                        trie->refcount--;
:+                        OP_REFCNT_UNLOCK;
:+                        if (!trie->refcount) {
:+                            Safefree(trie->charmap);
:+                            if (trie->widecharmap) 
SvREFCNT_dec((SV*)trie->widecharmap);
:+                            Safefree(trie->states);
:+                            Safefree(trie->trans);
:+#ifdef DEBUGGING
:+                            if (trie->words) SvREFCNT_dec((SV*)trie->words);
:+#endif
:+                            Safefree(r->data->data[n]); /* do this last!!!! */
:+                        }
:+                        break;
:+                    }

(All but the 'case' should be outdented one stop.)

Hmm, locking a 'refcount--' doesn't make much sense to me: if two threads
both do this at the same time, they'll still then both try to free the
structure. But I see that op_free() does essentially the same thing,
which confuses me. Can anyone explain how that makes sense?

Gosh, am I done? Have fun ...

Hugo

Reply via email to