demerphq <[EMAIL PROTECTED]> wrote:
:On Wed, 02 Mar 2005 14:53:21 +0000, [EMAIL PROTECTED] <[EMAIL PROTECTED]> 
wrote:
:> demerphq <[EMAIL PROTECTED]> wrote:
:> :Hmm. Yes I think you are right. Ok. If i rework it so the trie is
:> :allocated and inserted into the regexes data structure at the get-go
:> :then pregfree would clean it up.
:> 
:> Make sure in that case that you can cope with reentrancy.
:
:Can you expand on that a bit please? Im not entirely certain what you
:mean there.

If you're attaching stuff to the regexp structure, you need to cope
with reentrancy due to eval blocks, by making sure that stuff is saved
and restored at the right time. There's stuff that does all that, I just
don't remember where right now - I'm sure a trawl would turn it up.

:Personally I think of cases such as you describe as being pretty well
:pathological. I had a look through a large set of regexes and none of
:them had more than a small handful of unique characters.

The sort of situation I'm thinking about is a virus scanner, which would
have a large number of short characteristic sequences to search for;
I can also imagine bioinformatics applications that would do something
similar.

:(Do a review of your own code. How many times have your made a regex
:with /a|list|of|words/ that had more than a small number of letters
:(ie well less than 256)?)

In my current work application (about 48 KLOC) I found only two regexps
that would get a trie, one matching '(before | after)', and the other
'(add | remove | replace)'; however I don't think I'm a typical user -
in general, the people who will get the biggest benefit are those who
wrote their code *without* knowing that this is one case that isn't
currently well optimised.

:> Maybe it would make sense to use a poor man's hash: [...]
:
:Possibly such a strategy could be taken when ( charcount * length )
:was high enough although my gut feeling is that it probably wouldnt be
:worth it.

In general we don't want multiple strategies if we can possibly avoid it.
It's down to finding a best-fit strategy that minimises the resource cost
of the average case while avoiding pathological behaviour in all cases.

Hugo

Reply via email to