On Wed, 02 Mar 2005 14:53:21 +0000, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> demerphq <[EMAIL PROTECTED]> wrote:
> :Hmm. Yes, I think you are right. OK. If I rework it so the trie is
> :allocated and inserted into the regex's data structure at the get-go,
> :then pregfree would clean it up.
>
> Make sure in that case that you can cope with reentrancy.
Can you expand on that a bit, please? I'm not entirely certain what you
mean there.
> :Should i use distinct ids for each call?
>
> I think it would be desirable.
Ok, fine. :-)
>
> :Coming up with limits on pathological cases is something I had meant
> :to mention already, but I forgot. You are quite correct that there
> :should be some hard limits, but I'm not sure what they should be, nor
> :how they should be handled/configured. Should they be defined in
> :regcomp.h? I'd appreciate guidance here.
> [...]
> :To build the initial structure we first need to construct a
> :two-dimensional array, with the number of unique chars as the columns,
> :and the summed length of the strings as the number of rows.
> :
> :The number of rows is the worst case and will be smaller when there are
> :common prefixes in the trie. The number of columns only matters during
> :construction, as the compression logic eliminates most of the wasted
> :space. E.g., a string of all of the extended ASCII chars would take
> :256 * 256 nodes to construct, but would compress to only 256 nodes.
> :
> :I suppose it's possible to construct a compressed trie directly, but it
> :would be a lot more complex (you'd have to insert all of the first
> :digits, keep track of the transition points for each, then insert all
> :of the second digits, etc., making the process a lot slower). IMO it's
> :not worth it if intelligent hard limits are imposed. If people want to
> :run pathological regexes then they can run them slowly. :-)
>
> That would be fine if no pathological case would arise in normal usage.
> But even where there are a mere 256 unique characters (ie, a situation
> that does not even require Unicode) you'll need 2k per character, which
> already seems excessive.
Well, this is true, but it's only for the short duration of the initial
construction; most of the space would be returned to the process shortly
afterwards, once compression has run.
Personally I think of cases such as you describe as pretty much
pathological. I had a look through a large set of regexes and none of
them had more than a small handful of unique characters.
(Do a review of your own code. How many times have you made a regex
with /a|list|of|words/ that had more than a small number of letters
(i.e. well under 256)?)
Basically I don't want to add an overall slowdown to every trie just to
support what I consider to be extreme cases. Changing the construction
mechanism to build the trie compressed is a pretty extreme road that
I'm not very willing to go down, as it would make constructing every
regex more expensive, and quite a lot more at that.
>
> Maybe it would make sense to use a poor man's hash: have trans[] be
> a C array of (charcount + 1) vectors, each of which is either a linked
> list or a resizable array of (charid, reg_trie_trans*) pairs, and walk
> the list to search for any existing entry each time - if I understand
> correctly, seeing an 'a' => 'b' transition would then take processing
> time proportional to the number of distinct 'a' => 'x' transitions
> seen previously, but even the worst case would be less bad than trying
> to allocate a 32GB data structure.
Possibly such a strategy could be taken when (charcount * length) was
high enough, although my gut feeling is that it probably wouldn't be
worth it. I think I'd probably be more inclined to build the compressed
trie directly.
Anyway, I suppose we can discuss this more later. I think for now hard
limits are a more reasonable approach.
>
> A hardcoded limit is a possible fallback, but my guess is that such
> pathological cases could have most to gain from the trie so I think it'd
> be preferable to try and cater for them.
I'm not so certain that pathological cases really do benefit that much
from the trie. Where we really see a benefit from the trie is when we
have a lot of words, not when we have words with a lot of unique chars
in them.
To be honest, at this point I'd rather focus my efforts on optimising
most regexes than on trying to cater for pathological cases. I could
spend an awful lot of time reworking things for just that one case and
never get to more useful things like the TRIEDFA stuff. (Compared
to WHILEM, the TRIEDFA code is a _lot_ faster.)
I think it's more beneficial to everyone involved if we make the common
cases as screaming fast as possible and the pathological cases no
slower than they already are. Once the common cases enjoy as much
optimisation as possible, I can focus my time and efforts on the more
pathological cases.
> :> :- PerlIO_printf(Perl_debug_log, "%sCompiling REx%s `%s%*s%s'\n",
> :> :+ PerlIO_printf(Perl_debug_log, "%sCompiling REx%s \"%s%*s%s\"\n",
> :> [...]
> :> :- "%sFreeing REx:%s `%s%*.*s%s%s'\n",
> :> :+ "%sFreeing REx:%s %s%*.*s%s%s\n",
> :>
> :> Hmm, I hope you checked no-one is trying to parse that output ...
> :
> :Well, I'm not sure how I can do that.
>
> You can't, which was the point. :)
Heh. :-)
>
> :I guess this could be omitted, it just annoyed me.
>
> It has annoyed me too in the past. Just wanted to be sure you were aware
> of the danger - I'm sure we've never guaranteed the output (and it has
> changed often enough), so it's probably ok. Might want to check with mjd
> and japhy though, since I think neither usually follow p5p.
OK, I'll follow up with both of them off-list.
> :I was able to address the memory leak issue of EVAL nodes by tracking
> :when they are present and then using SAVEFREEPV() instead of
> :Safefree() when an EVAL node is present and there are more than one
> :accept states. It seems to work fine, insofar that I see no leakage
> :from
> :
> : perl -e "eval { ('a' x 100)=~/(?:a|aa|aaa|aaaa)(??{die})/ } while 1"
> :
> :Can you see a flaw with this approach?
>
> If I understand correctly, this will free them at the next LEAVE, which
> will normally occur when we leave the scope. I don't know offhand when
> that's occurring in this case, or whether it's possible to construct
> an example that defers it without bound on the number of these structs.
>
> If the SAVEFREEPV approach works, I suggest using it always - EVAL is
> not the only thing you may need to cater for.
Hmm, well, I don't need to use it when accepted <= 1, so I guess using
it the rest of the time can't hurt that much. I could also wrap the
SAVEFREEPV inside an ENTER/LEAVE to force the free to come earlier.
BTW, I wasn't able to find out to my satisfaction what the difference
in impact is between ENTER vs SAVETMPS and FREETMPS vs LEAVE. The
documentation in perlguts, perlapi, perlcall and perlcallback all
defer proper explanation of these constructs to each other, leaving
them pretty much unexplained beyond the advice that you can/should use
both sets in XS routines for mortalizing variables and deferring memory
deallocation to a later time.
What is the difference between the two sets? In my case, should I use
ENTER/LEAVE, SAVETMPS/FREETMPS, or both sets? Anybody who understands
these details and felt inclined to update the pod on this would find
me very grateful.
Cheers,
Yves
--
First they ignore you, then they laugh at you, then they fight you,
then you win.
+Gandhi