Re: (blead patch) New TRIE regex patch

demerphq Wed, 02 Mar 2005 01:23:10 -0800

On Wed, 02 Mar 2005 04:08:38 +0000, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> demerphq <[EMAIL PROTECTED]> wrote:
> :+ make_trie(startbranch,first,last,tail,flags)
<snip>
> :+ This would be optimizable with startbranch=5, first=5, last=11, tail=16
> 
> 'last=11' in the example args seems to contradict "may be the same as tail"
> in the description above: is 'last' the last BRANCH/EXACT pair to consume,
> or the node following that pair? (As far as I can see from the code it's
> the latter, and the example should say "last=16".)


Ive allowed this documentation snippet to get out of synch. I
apologise. In the case above last=16 and tail equals 16. However there
definately are cases where tail != last. A good example is
/(?foo|bar)baz/

   1: BRANCH(4)
   2:   EXACT <foo>(8)
   4: BRANCH(7)
   5:   EXACT <bar>(8)
   7: TAIL(8)
   8: EXACT <baz>(10)
  10: END(0)

last would be node 7 tail would be node 8.

<snip> 
> As far as I can see the revcharmap initialisation can also be moved inside
> the DEBUG_r(), as long as the later SvREFCNT_dec((SV*)revcharmap) is made
> similarly conditional.

Yeah, probably true. I think id have to rework a few things later on
in make_trie() but it wouldnt be much. Ill take care of it.

> 
> :+                       uvc = utf8n_to_uvuni(scan, UTF8_MAXLEN, &len, 
> uniflags);
> 
> Am I right to think that this may die() if uniflags says to? If so, the
> malloced spaces (charmap, revcharmap, words) will be leaked in the same
> way as discussed for the other part of the patch. Even if dying isn't
> intended, I believe utf8n_to_uvuni() may have to go run a load of perl
> code, which might die for some reason.

Hmm. Yes I think you are right. Ok. If i rework it so the trie is
allocated and inserted into the regexes data structure at the get-go
then pregfree would clean it up. The alternative is to not warn and
deal with it that way. I mostly used the logic in EXACTFL and some
time staring at utf8.[ch] as the basis for this so it possible that
someone more familiar with unicode should advise here.

> 
> :+    Newz(8482,trans,(charcount+1)*uniquecharcount+1,reg_trie_trans);
> :+    Newz(8482,states,charcount+2,reg_trie_state);
> [and similar]
> 
> Oh, I just noticed this: the first parameter is intended to be a unique
> index, to help tracking down memory leaks and similar problems.

Yeah, all of my usage of it in this patch is 8482. The docs I saw said
this is a mostly unused feature and that lots of code just uses '0' so
i just picked the ascii digits of the letters TR for all of my allocs.
Should i use distinct ids for each call?

> 
> Hmm, ((charcount + 1) * uniquecharcount +1) * sizeof(reg_trie_trans)?
> Isn't that in danger of ballooning out of all proportion on pathological
> data? As in:
>  my $count = shift;
>  my $str = join '', map chr($_), 0 .. $count;
>  /x|$str/;
> 
> sizeof(reg_trie_trans) is 8, so that's going to be at least 8MB on 1k unique
> characters, or 32GB on 64k unique characters - quadratic memory growth is
> scary, and there should probably be a cutoff here that falls back to not
> bothering to construct the trie.

Coming up with limits on pathological cases is something I had meant
to mention already but I forgot. You are quite correct there should be
some hardlimits, but im not sure what they should be, nor how they
should be handled/configured. Should they be defined in regcomp.h? Id
appreciate guidance here.
 
> (I don't understand the underlying mechanisms here, but that count seems
> strange to me - surely we don't need more than around (charcount + 1)
> transitions?)

To build the initial structure we need need to initially construct a
two dimensional array, with the number of unique chars in the cols,
and the length of the strings summed together as the number of rows.

The number of rows is worst case and will be smaller when there are
common prefixes in the trie. The number of columns only matters during
construction as the compression logic removes or eliminates most of
the wasted space. Eg, a string of all of the extended ASCII chars
would take 256 * 256 nodes to construct, but would compress to only
256 nodes.

I suppose its possible to construct a compressed trie directly, but it
would be a lot more complex (youd have to insert all of the first
digits keep track of the transition points for each, then insert all
of the second digits, etc, making the process a lot slower). IMO its
not worth it if intelligent hardlimits are imposed. If people want to
run pathological regexes then they can run them slowly. :-)

> 
> Hmm, when I tried to test this out I noticed that you don't pick up an
> EXACT string long enough to split:
>  ./perl -Dr -wle '$s="x" x 256; /x|$s/'

Huh. I didnt know about that. The string is getting split into
multiple EXACT nodes which will defuse the optimisation. Ill have to
look into changing that.

> nor on patterns large enough to invoke LONGJMP:
>  ./perl -Dr -wle '$s=join"|",map chr,256..256+shift;/$s/' 25000 2>&1 | less

Well this is deliberate at the current time. I wanted to get the base
code bedded down before I tackled things like re's with LONGJMP's in
them. This patch was ambitious enough for me. :-)

> .. but slightly smaller patterns manage a panic if I'm lucky:
>  zen% ./perl -wle '$s=join"|",map chr,256..256+shift;/$s/' 20000
>  panic: malloc at -e line 1.

Yep, some hardblocks are definately required. What they should be
however is a bit hard to say.

<snip>
> :+            if (charid) {
> [...]
> :+            } else {
> :+                Perl_croak(aTHX_ "Internal error in trie construction, no 
> char mapping for word?");
> :+            }
> 
> Since you bother to check and croak, I'd suggest either coping with the
> (uvc >= 256 && !widecharmap) case by setting charid to 0, or not
> bothering with the 'if(widecharmap)' test: at this point a SEGV is likely
> to be as helpful as the croak(), if not more so. Oh yes, I said earlier:
> 
> :regcomp.c:910: warning: `charid' might be used uninitialized in this function
> [...]
> :it took quite an effort to determine that for all but the 'charid' case
> 
> And of course (uvc >= 256 && !widecharmap) is precisely when charid *would*
> be used uninitialised, silly me.
> 
> (Oh, and setting svpp to NULL just before assigning to it isn't needed. :)

Nice analysis. :-) Ill rework it so it just seg-faults. It was a
sanity check that i guess didnt really work. Sigh.
 
> :+            /* Its a dupe. So ignore it.
> :+
> :+            Warn here or not? [...]
> :+           */
> 
> I think the docs might be a better place for this commentary, it seems
> out of place here.

I mostly put it there so a reviewer could read it and decide if they
agree or not. Since it appears you do ill put the comment in the sub
documentation and then replace that comment with somethign smaller.

> :-S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, I32 
> *deltap, regnode *last, scan_data_t *data, U32 flags)
> :+S_study_chunk(pTHX_ RExC_state_t *pRExC_state, regnode **scanp, I32 
> *deltap, regnode *last, scan_data_t *data, U32 flags, U32 depth)
> 
> Hmm, that extra argument is used only if you've compiled perl with TRIE_DEBUG
> (for which you have to patch the source), and are running with -Dr - that
> seems like an unreasonable price for everyone else to pay.

Actually this becomes more important with phase two of my plans. Phase
one is the TRIE code alone and basic infrastructure and phase two is
the TRIEDFA code which relies on the previous stuff. Once the TRIEDFA
stuff comes into play the depth information becomes more than
diagnostic.

Also I feel that there further optimisations where the depth
information also could be used. Considering the potential benefits it
would be nice if it was left in for the time being. Id be happy to
patch it out at a later point if my plans for it came to nothing.


> 
> :+                   Hypthetically when we know the regex isnt anchored we can
> :+                   turn a case 1 into a DFA and let it rip...
> 
> I'm guessing this is true only if the (complete) branchset we're replacing
> is also the complete pattern.

Actually no, its true when the first part of the regex is entirely
convertable to a trie. Ie /(foo|bar)baz/ or derivatives
/(((foo|bar)b)a)z/. Ive got proof of concept of this stage of the
optimisation going, but i need to get the trie code up to scratch
first as the DFA code is just an extension of that.

<snip>
> Here, and elsewhere, something like "0x%p,0x%p,0x%p" would be likely to cope
> better with (for example) 64-bit pointers.

Will do. I guess its easy to forget about other architectures.

> 
> :+                        if ( (optype ? OP(noper) == optype
> :+                                     : PL_regkind[(U8)OP(noper)] == EXACT)
> :+                              && noper_next==tail )
> :+                        {
> :+                            if (!first) {
> :+                                first=cur;
> :+                                optype=OP(noper);
> 
> Hmm, the test of 'optype' seems to be trying to do the same as the test
> of 'first' is: I'd suggest testing 'first' in both cases, which also
> avoids the problem that you don't clear optype after calling make_trie().

Ok. Ill look at reworking that. I think i had a misapprehension as to
how /x|(?i:y)|z/ was handled, thanks for pointing it out.

> 
> :-       PerlIO_printf(Perl_debug_log, "%sCompiling REx%s `%s%*s%s'\n",
> :+       PerlIO_printf(Perl_debug_log, "%sCompiling REx%s \"%s%*s%s\"\n",
> [...]
> :-                     "%sFreeing REx:%s `%s%*.*s%s%s'\n",
> :+                     "%sFreeing REx:%s %s%*.*s%s%s\n",
> 
> Hmm, I hope you checked no-one is trying to parse that output ...

Well, im not sure how I can do that. Certainly nothing is in the test
suite. The result is more regular that the original so anybody new
wanting to parse it will probably thank me. :-)

Compiling REx `x|(?i:y)|z'
....
Freeing REx: `"x|(?i:y)|z"'

versus:
Compiling REx "x|(?i:y)|z"
..
Freeing REx: "x|(?i:y)|z"

I guess this could be omitted, it just annoyed me.

> 
> :+            case 't':
> :+                    {
> :+                        reg_trie_data 
> *trie=(reg_trie_data*)r->data->data[n];
> :+                        OP_REFCNT_LOCK;
> :+                        trie->refcount--;
> :+                        OP_REFCNT_UNLOCK;
> :+                        if (!trie->refcount) {
> :+                            Safefree(trie->charmap);
> :+                            if (trie->widecharmap) 
> SvREFCNT_dec((SV*)trie->widecharmap);
> :+                            Safefree(trie->states);
> :+                            Safefree(trie->trans);
> :+#ifdef DEBUGGING
> :+                            if (trie->words) SvREFCNT_dec((SV*)trie->words);
> :+#endif
> :+                            Safefree(r->data->data[n]); /* do this last!!!! 
> */
> :+                        }
> :+                        break;
> :+                    }
> 
> (All but the 'case' should be outdented one stop.)
> 
> Hmm, locking a 'refcount--' doesn't make much sense to me: if two threads
> both do this at the same time, they'll still then both try to free the
> structure. But I see that op_free() does essentially the same thing,
> which confuses me. Can anyone explain how that makes sense?

Ok, it may be that this is totally wrong and im glad you brought it
up. I basically stole it from the optree code in the same sub and had
a look at what the macros did. My idea was that OP_REFCOUNT_LOCK would
lock out other threads while modifying the refcount so that you didnt
get accidental frees (like when reg_dup is happening at the same time
as pregfree or the like.) It could be that this is totally unnecessary
garbage. Im open to suggestions.

> 
> Gosh, am I done? Have fun ...
> 

Heh, ive been busy addressing the concerns raised in your first
response. I think i have them all covered and I was going to send out
an update today but i figure I might as well deal with these ones
first so itll be a few more days.

I'm sorry this is so much work for you but your help has been really
useful. I'm really glad that you havent found anything that would
appear to critically undermine the idea and i will address all of your
concerns soon.

I was able to address the memory leak issue of EVAL nodes by tracking
when they are present and then using SAVEFREEPV() instead of
Safefree() when an EVAL node is present and there are more than one
accept states. It seems to work fine, insofar that I see no leakage
from
 
  perl -e "eval { ('a' x 100)=~/(?:a|aa|aaa|aaaa)(??{die})/ } while 1"

Can you see a flaw with this approach? 

Cheers and many thanks,
Yves


-- 
First they ignore you, then they laugh at you, then they fight you,
then you win.
  +Gandhi

Re: (blead patch) New TRIE regex patch

Reply via email to